Abstract
Dogs and laboratory mice are commonly trained to perform complex tasks by guiding them through a curriculum of simpler tasks (‘shaping’). What are the principles behind effective shaping strategies? Here, we propose a machine learning framework for shaping animal behavior, where an autonomous teacher agent decides its student’s task based on the student’s transcript of successes and failures on previously assigned tasks. Using autonomous teachers that plan a curriculum in a common sequence learning task, we show that near-optimal shaping algorithms adaptively alternate between simpler and harder tasks to carefully balance reinforcement and extinction. Based on this intuition, we derive an adaptive shaping heuristic with minimal parameters, which we show is near-optimal on the sequence learning task and robustly trains deep reinforcement learning agents on navigation tasks that involve sparse, delayed rewards. Extensions to continuous curricula are explored. Our work provides a starting point towards a general computational framework for shaping animal behavior.
I. INTRODUCTION
Animal trainers “shape” an animal’s behavior towards a specific sequence of actions [1–4], for example, training a dog to roll, fetch and sit. An untrained animal is unlikely to execute this sequence in the right order, even if it can perform each action separately. One intuitive teaching strategy is to first reinforce the animal for rolling. Once the animal rolls consistently, rolling is no longer reinforced (or becomes variable) and the animal is instead reinforced for successfully fetching after a roll. This iterative process is repeated until the animal learns the right sequence. In some cases, the trainer further breaks down the task or “lures” the animal to carry out the desired action.
Here, a shaping process is essential as the animal will rarely execute the right sequence during innate behavior. This simple intuition highlights a fundamental constraint: learning a particular behavioral sequence through random, unguided exploration is inefficient when the dimensionality of behavior is large, regardless of the learning rule the animal employs. Shaping tackles this issue by iteratively approximating longer bits of the sequence, limiting the search space at every stage of training.
Laboratory animals solving a perceptual discrimination task spend a significant fraction of their training time learning the rules of the task. For example, a freely moving two-action forced-choice paradigm often involves an animal triggering a stimulus through a nose poke at a particular location, which leads to reward delivery at one of two possible distal locations in the arena. Before learning the perceptual features that distinguish the stimulus sets, the animal has to learn several non-trivial rules of the environment: the spatiotemporal relationship between the nose poke and the reward, the fact that attending to the stimulus matters for obtaining reward, and the temporal cost of a wrong choice. Significant attention has been paid to active learning [5–8], which addresses the latter problem of choosing a perceptual stimulus set that efficiently teaches the stimulus-outcome relationship. Behavioral shaping, on the other hand, is used to teach the rules of the task and closely reflects the curriculum design process used in education.
A shaping protocol typically involves hand-designing a series of simpler tasks leading to the full task during training. The animal is rewarded for successfully completing an assigned sub-task, and the curriculum progresses once the animal is sufficiently good at completing this sub-task [9–13]. However, it is unclear whether such heuristics are close to optimal even in simple scenarios, or when these strategies might fail. Understanding the principles that drive effective shaping, coupled with closed-loop training strategies, could considerably reduce the training time for both laboratory animals and artificial agents, while providing insight into factors that contribute to slow or fast learning [14, 15]. Our goal is to develop a general computational framework for shaping animal behavior, paying particular attention to the constraints that trainers face.
In machine learning, the importance of shaping-inspired approaches for training agents was recognized early on [16–22]. More recently, numerous automatic curriculum learning (ACL) techniques have been developed for training deep reinforcement learning (RL) agents (reviewed in [23]). Within the ACL framework, an autonomous teacher agent determines the distribution of the student’s tasks based on the student’s past behavior. However, these approaches rely on arbitrary control over the agent’s states [24–26], exploration [27–34] or the reward structure [35–37]. For example, a well-known strategy known as potential-based reward shaping [35] modifies the reward function to expedite learning while preserving the optimal policy. Such a procedure is infeasible in experimental situations where the animal has to interrupt its behavior in order to acquire reward. In other cases, these methods assume that the agent’s performance can be measured on a range of arbitrary test tasks [38] or require access to expert demonstrations [26, 39, 40].
Although these assumptions are reasonable for training artificial RL agents and have demonstrated success in numerous tasks, they are not suitable for training animals. When training animals, we typically have 1) limited flexibility in controlling rewards and exploration statistics, 2) partial observability, as animals can often be evaluated based only on whether they have succeeded or failed on the task (their true “state” remains unknown), and 3) no delineation between training and test trials. In addition, animals often have an innate repertoire of responses and behaviors they may resort to by default, and a training procedure which recognizes and takes advantage of this feature may be more successful.
II. FRAMEWORK
To address these issues, we propose an ACL framework, which we term outcome-based curriculum learning (OCL). In OCL, a teacher agent decides the student’s next task based solely on the student’s outcomes, i.e., its history of successes or failures on past tasks, with the long-term goal of minimizing the time to reach a desired level of performance on the final task. By observing and delivering rewards based on binary outcomes, teacher algorithms are task-agnostic and can be applied for training both artificial agents and animals. Closest to our framework is the teacher-student curriculum learning framework [41, 42], which relies on observed scores. Inspired by the concept of learning progress [43] in developmental psychology, Matiisen et al [41] propose heuristic strategies where the teacher selects the task on which the student shows the greatest improvement on scores. However, we find below that these heuristics perform poorly compared to our simpler alternatives.
To gain intuition, it is helpful to visualize teaching with OCL as ‘navigation’ through an (unknown) difficulty landscape that is shaped by the student’s innate biases towards performing behaviors pertinent to the task. Such a difficulty landscape is illustrated in Figure 1a for a task whose difficulty increases along two independent axes. We define difficulty as the negative log probability of success on a task (here parameterized by the two skill axes) given the student’s current policy. The difficulty landscape thus depends on the task as well as the student’s innate biases and learned behavior. The goal of OCL is to progressively flatten regions of the landscape to solve the full task as quickly as possible.
FIG. 1:
(a) Teaching using our OCL framework can be visualized using a difficulty landscape (here, parameterized by two skill axes), which quantifies the student’s success probability for each difficulty level. A student assigned an extremely difficult task will not learn, since they are unlikely to succeed and thus do not receive significant reinforcement. The teacher’s purpose is to adaptively assign tasks (shown in red) while simultaneously inferring the difficulty landscape to flatten it as quickly as possible. (b) Tasks from a pre-defined set are ordered based on their difficulty, as measured by the success probability of a naive agent. An autonomous teacher decides the student’s task based on the student’s transcript of successes and failures (represented here as 0s and 1s respectively) on previously assigned tasks. (c) We apply our OCL framework to three biologically relevant goal-oriented tasks involving delayed rewards: a generic sequence learning task, an odor-guided trail tracking task and a plume-tracking task involving localization to a target based on sparse cues.
In this manuscript, we consider tasks that can be decomposed into a single difficulty scale. Such tasks lend themselves naturally to a curriculum. A student begins with the simplest version of the task and progresses through difficulty levels (as set by the curriculum) until they succeed at the entire task. In the discrete version of OCL, the experimenter designs tasks and rates them based on their difficulty in discrete levels from 1 to N (Figure 1b). A desired threshold level of performance is specified for the Nth task (the full task). Given this input, the teacher algorithms that we consider below choose the appropriate difficulty level for the student based on their past transcripts. At the start of every interaction, the teacher receives as input a transcript and proposes the difficulty level k. The student attempts the task for T (fixed) rounds, adding to the performance transcript. This two-way interaction continues until the student attains a satisfactory level of success on the final task.
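To make this protocol concrete, the sketch below shows one way the discrete OCL interaction loop could be organized in code. The `Teacher` and `Student` interfaces, the block length T, and the threshold value are illustrative placeholders, not the exact implementation used in this work.

```python
# Minimal sketch of the discrete OCL interaction loop. The teacher and student
# objects are hypothetical stand-ins: teacher.propose_level() maps a transcript
# to a difficulty level in 1..N, and student.attempt(k) runs one trial at
# level k (learning from its outcome) and returns True on success.
def run_curriculum(teacher, student, N, T=50, tau=0.9, max_interactions=10_000):
    transcript = []                                   # history of (level, outcome)
    for _ in range(max_interactions):
        k = teacher.propose_level(transcript)         # choose the next difficulty
        outcomes = [student.attempt(k) for _ in range(T)]
        transcript.extend((k, o) for o in outcomes)
        if k == N and sum(outcomes) / T >= tau:       # success criterion on full task
            break
    return transcript
```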
We first investigate in detail a sequence learning task, where an RL-based student is required to learn the correct sequence of N actions (Figure 1c). The sequence learning task encompasses a large variety of behavioral tasks, including tricks such as the roll → fetch → sit sequence described above, numerous Skinnerian tasks, as well as common laboratory behavioral experiments which have a self-initiated trial structure. The difficulty landscape of such tasks is determined by the complexity of the sequence (N) and the innate probability that the student will execute the correct action at each step of the sequence. Since the probability of success decreases exponentially with N, the agent is unlikely to learn the full task without shaping when N is sufficiently large.
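Concretely, before any learning (qi = 0) and with a uniform bias εi = ε, the probability that a naive student executes the full sequence is P(success) = σ(ε)^N, so the corresponding difficulty −log P(success) = −N log σ(ε) grows linearly with N while the success probability decays exponentially.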
The simplicity of the task structure allows us to examine normative teacher strategies using modified Monte Carlo planning algorithms for decision-making under uncertainty. Using insights from these normative strategies, we use differential evolution to design near-optimal heuristics that are agnostic to the task, learning rule, and learning parameters. Next, we apply our method on two novel, naturalistic, sequential decision-making tasks that involve delayed rewards: odor-guided trail tracking and plume-based odor localization (Figure 1c). We show that deep reinforcement learning agents can be trained using our adaptive teacher algorithms to solve these tasks using only a single reward delivered at the end of the task. Finally, we extend this framework to continuous parameterizations of the task, where the teacher has the option of breaking down the task into simpler components.
III. RESULTS
A. Sequence learning
In the sequence learning task, a student RL agent begins each trial at a fixed start state and receives a reward r if they perform the correct sequence of N actions (Figure 2a, see Appendix A for full details). If the student fails to take the correct action at any step in the sequence, the student receives no reward and the episode terminates. The probability that the student takes the correct action at step i is given by σ(qi + εi), where εi is the (fixed) innate bias that determines the probability the student will take the correct action before any learning occurs. σ is the logistic function. For example, if the agent prior to learning takes K possible actions at step i with equal probability and only one of them is correct, we have εi = −log(K − 1). The action value qi is initially set to zero and updated using a standard temporal-difference (TD) learning rule with learning rate α.
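As a concrete illustration, a minimal tabular student consistent with this description might look as follows. The exact TD update used in the paper is specified in Appendix A; here we assume a simple one-step TD(0) rule that bootstraps on the next step's q value and treats a failed step as a zero-valued outcome.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SequenceStudent:
    """Toy tabular TD student for the sequence task (sketch; assumed update rule)."""
    def __init__(self, N, eps=-1.5, alpha=0.1, reward=1.0):
        self.q = np.zeros(N)           # action values for the correct action at each step
        self.eps = np.full(N, eps)     # innate biases
        self.alpha, self.reward = alpha, reward

    def attempt(self, n):
        """Attempt the level-n task (reward after n correct steps); returns success."""
        for i in range(n):
            if np.random.rand() >= sigmoid(self.q[i] + self.eps[i]):
                # wrong action: episode ends with no reward, extinguishing step i
                self.q[i] += self.alpha * (0.0 - self.q[i])
                return False
            # correct action: bootstrap on the next step's value (reward at the last step)
            target = self.reward if i == n - 1 else self.q[i + 1]
            self.q[i] += self.alpha * (target - self.q[i])
        return True
```

Because each step bootstraps on the next, reinforcement at the rewarded step propagates backwards over episodes, and a drop in value at a late step similarly propagates backwards as an extinction wave.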
FIG. 2:
(a) The sequence learning setup. In the full task, the student is required to take a sequence of N correct actions to get reward. In intermediate levels of the task, the reward is delivered if the student takes n ≤ N correct actions. εi is the innate bias of the student to take the correct action at the ith step, prior to training. We assume εi = ε for all i unless otherwise specified. (b) The incremental teacher (INC) fails once ε falls below a threshold value. (c) The q values (in grayscale) for the correct action at each step, shown for ε = −1.5 (top) and ε = −1.8 (bottom). The red line shows the assigned task level. Note the striped dynamics in the top row caused by alternating reinforcement and extinction. In the bottom row, ε is too small and learning stalls. (d) Time series of q values for actions at the first (solid black) and third (dashed gray) steps for the two examples shown in panel (c).
The sequence learning task naturally splits into discrete difficulty levels: the teacher modulates difficulty by increasing, decreasing or maintaining the step k at which the student is rewarded. The innate biases εi’s play a key role in the dynamics since they determine the probability of success (and thus the rate of reinforcement) when the difficulty level is increased. We assume for simplicity that all εi’s are equal to ε; the general case is considered later. We seek OCL algorithms that minimize the time the student takes to succeed at a rate greater than a threshold τ on the full task without prior knowledge of the student’s innate biases and learning parameters.
B. An incremental teacher strategy is not robust
An intuitive baseline strategy when designing a curriculum is an incremental (INC) approach: the teacher increments the difficulty by one when the student’s estimated success rate exceeds τ at the current level. Note that since the success rate changes due to learning, a reasonable estimator should weight recent transcripts while using enough of them to minimize sampling noise. We consider different procedures for estimating the success rate and find that an exponential moving average estimator is computationally inexpensive and achieves performance comparable to that of more sophisticated methods (Appendix E, Figure S2).
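A sketch of such an estimator is shown below; the decay factor used here is an illustrative assumption (the value used in the paper is specified in Appendix E).

```python
class EMASuccessRate:
    """Exponential moving average over binary outcomes; gamma close to 1 weights
    a longer history, trading responsiveness for lower sampling noise."""
    def __init__(self, gamma=0.9, init=0.0):
        self.gamma = gamma
        self.rate = init

    def update(self, outcome):                 # outcome is 0 (failure) or 1 (success)
        self.rate = self.gamma * self.rate + (1.0 - self.gamma) * float(outcome)
        return self.rate
```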
INC is stable for large ε (Figure 2b). However, INC abruptly and consistently fails when ε falls below a threshold (Figure 2b). Examining the dynamics of the q values provides insight into why this catastrophic failure occurs.
Let us first examine q value dynamics when the student is required to directly solve the case k = 5, where ε is chosen such that the student is capable of learning without a curriculum. The dynamics of q values exhibit a ‘reinforcement wave’, where actions are sequentially reinforced backwards from the final state to the start [44]. This backward propagation is a generic feature of RL, since the goal acts as the sole source of reward and reinforcement propagates through RL rules that act locally. Now, suppose the difficulty is incremented by one (k = 6). Immediately after this change, the student executes the correct sequence of actions until the fifth step, but will likely fail to receive reward as the final step has not been reinforced. This (possibly brief) series of failures produces a long-lasting extinction wave that propagates backwards to earlier steps with dynamics that parallel those of the reinforcement wave. In short, transient failures after every difficulty increment have long-term effects on learning dynamics and success rate.
When visualized over the course of a curriculum, q values assume characteristic “striped” dynamics that emerge due to alternating waves of extinction and reinforcement (top panel in Figure 2c). These striped dynamics reflect the transient failures and eventual successes that follow an increment to higher difficulty when ε is larger than the failure threshold. Extinction dominates reinforcement when ε is below a critical value, leading to catastrophic unlearning of previous actions and a subsequent lack of learning progress. Since extinction is unavoidable after significant increases in difficulty, optimal strategies that are robust in this regime will have to ameliorate this effect while completing the curriculum as quickly as possible. That is, effective curriculum design strategies should achieve an optimal balance between extinction and reinforcement.
C. Near-optimal teacher algorithms alternate between difficulty levels
To gain insight into near-optimal strategies, we formulate the teacher’s task for the sequence learning task as optimal decision-making under uncertainty using the framework of Partially Observable Markov Decision Processes (POMDPs) [45–47]. Specifically, the teacher decides whether to increase, decrease or keep the same difficulty level based on the student’s past history, and receives a unit reward when the student crosses the threshold success rate on the full task. A discount factor incentivizes the teacher to minimize the time to reach this goal. As when training animals, one challenge is that the student’s true learning state (encoded by the q values) is hidden, as the teacher receives only a finite transcript of successes and failures on previously assigned tasks. Another challenge is that the teacher is not a priori aware of the student’s innate biases and learning rate. Moreover, the long horizon and sparse reward make planning computationally prohibitive.
To solve this task, we employ an online POMDP solver (called POMCP [48]) that relies on Monte Carlo planning and inference (Figure 3a). This solver plans based on the inferred joint distribution of q’s, ε and α, which is represented as a collection of particles with different parameter values. A planning algorithm based on Monte Carlo Tree Search (MCTS) [49] balances exploration and exploitation to decide the next action. The student’s transcript on the following round is then used to update the joint distribution using Bayes’ rule implemented as a particle filter, after which this cycle is repeated. With sufficient sampling of particles and planning paths, the solver provides a near-optimal adaptive teacher algorithm for the sequence learning task. Due to the large size of our POMDP, the implementation of POMCP is nontrivial; details are given in Appendices E and F.
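To illustrate the inference half of this loop, the sketch below shows a basic importance-reweighting and resampling step over belief particles. It is a deliberate simplification (for instance, it ignores the student's learning within the block of trials and omits the MCTS planner); the full implementation is described in Appendices E and F.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def belief_update(particles, weights, level, outcomes):
    """One particle-filter step of the teacher's belief (sketch only).
    Each particle is assumed to be a dict {'q': array, 'eps': array, 'alpha': float};
    we reweight by the likelihood of the observed outcomes and resample."""
    weights = np.asarray(weights, dtype=float).copy()
    for j, p in enumerate(particles):
        # success probability of this particle on the level-`level` task
        p_success = np.prod(sigmoid(p['q'][:level] + p['eps'][:level]))
        lik = np.prod([p_success if o else 1.0 - p_success for o in outcomes])
        weights[j] *= lik
    weights += 1e-12                       # guard against degenerate all-zero weights
    weights /= weights.sum()
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return [particles[i] for i in idx], np.full(len(particles), 1.0 / len(particles))
```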
FIG. 3:
(a) An overview of the POMCP teacher, which cycles between inferring the student’s q values, innate bias and learning rate based on the transcript and planning using a Monte Carlo tree search. (b) The adaptive heuristic (ADP), which employs a simple decision rule to stay at, increment or decrement the current difficulty based on the estimated success rate (computed using an exponential moving average over past transcripts). (c) POMCP and ADP are comparable and significantly outperform other algorithms [41] when the task is non-trivial (low ε), including the regime where INC fails. Here N = 10. Note that planning using POMCP is intractable when ε = −3. (d,e) POMCP and ADP adaptively alternate between difficulty levels, thereby preventing catastrophic extinction. Note the drop in difficulty levels after significant extinction in both cases. Here ε = −2.
The POMCP teacher exhibits a non-monotonic curriculum, repeatedly reverting to easier tasks before ramping up the difficulty. The q values for earlier steps in the sequence are relatively stable and lack the alternating reinforcement and extinction dynamics that we observe for the INC teacher (Figure 3d). This robustness extends to ε values lower than the critical value at which INC fails (ε = −2 in Figure 3c). Indeed, as shown in the example in Figure 3d, the POMCP teacher recognizes and compensates for significant extinction by rapidly decreasing the difficulty, increasing it again only after sufficient relearning occurs.
D. A heuristic adaptive algorithm achieves near-optimal curriculum design
The POMCP teacher’s strategy suggests simple principles to overcome extinction while making learning progress. Specifically, a robust teacher algorithm has to 1) increase difficulty when the estimated success rate is sufficiently large (similar to INC), 2) continue at the same difficulty level when the success rate is below this threshold value as long as the student continues to learn (i.e., the success rate is not decreasing), and 3) decrease difficulty if the student begins to show signs of significant extinction (the success rate drops by more than a margin µ). These three principles motivate our choice of a decision-tree-based teacher algorithm that uses the estimated success rate and its recent change as features. The precise splits and leaves of the trees can be optimized using various search procedures. More complex trees can be constructed by taking into account second- or higher-order differences of the success rate. For the sequence learning task, we find that these two features are adequate to produce a successful teacher, which we term Adaptive (ADP). We optimize the decision tree using differential evolution (Figure 3b, see Appendix B 3 for details). Note that this optimized ADP is used for all benchmarks below with no additional tuning.
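A minimal sketch of such a decision rule is shown below. The thresholds are illustrative placeholders; the actual splits are those found by differential evolution (Figure 3b, Appendix B 3).

```python
def adp_decide(rate, delta, level, N, tau=0.9, mu=0.05):
    """One ADP-style decision from the estimated success rate (`rate`) and its
    recent change (`delta`). tau and mu are illustrative, not the optimized values."""
    if rate >= tau:
        return min(level + 1, N)   # confident at this level: increase difficulty
    if delta < -mu:
        return max(level - 1, 1)   # significant extinction: back off one level
    return level                   # below threshold but still learning: stay
```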
The ADP teacher shows dynamics similar to POMCP, mitigating extinction waves by alternating between difficulty levels (Figure 3e). We benchmark ADP against INC, POMCP and four algorithms proposed by Matiisen et al [41] (Figure 3c). These latter four algorithms are based on the principle of maximizing learning progress [43]: a student should attempt the difficulty level at which they make the fastest progress (as measured by the slope of the learning curve on a particular task). The algorithms differ in how progress is measured and how tasks are sampled based on their relative progress.
ADP is competitive with POMCP (over the range of parameter values for which POMCP can be feasibly evaluated) and significantly outperforms the other algorithms for small values of ε, which is the regime where curriculum design is non-trivial and baseline algorithms such as INC fail. Moreover, ADP is robust when the innate biases εi are not equal across steps (Figure S1). Since our OCL framework is task-agnostic and model-agnostic, ADP can be directly applied to other tasks and artificial agents provided that sub-tasks are arranged on a discrete, monotonic difficulty scale.
E. Performance of ADP on deep RL tasks with delayed rewards
To examine whether ADP can design curricula for complex behaviorally relevant tasks and learning models, we train deep RL agents to solve two navigation tasks with delayed rewards: odor-guided trail tracking and plume-source localization.
Dogs are routinely trained to track odor trails, and various heuristics have been developed by trainers to teach dogs efficiently [50]. In a successful trail tracking episode, the student begins with a random orientation at one end of the trail and receives a reward only upon reaching the other end. Trails are long, meandering and broken, so the agent is highly unlikely to reach the end through random exploration and must instead learn a non-trivial strategy to actively follow the trail and obtain reward.
The trail tracking paradigm (Figure 4a–d) provides a natural split of tasks onto a difficulty scale. We design a parametric generative model for trails where the parameters control the length, average curvature and brokenness of the trails (Appendix C 5). Samples of trails from tasks of increasing difficulty are shown in Figure 4b. We develop a deep RL framework for trail tracking, where the tracking student uses its sensorimotor history of sensed odor and self-motion to modulate its orientation at the next step (see Appendix C 5 for full details). Sensorimotor history is encoded using a visuospatial, egocentric representation (Figure 4a), so that the student has a memory determined by the size of the visuospatial observation window. The student uses a convolutional neural network architecture which is trained using Proximal Policy Optimization (PPO) [51].
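As an illustration of how such a difficulty scale could be parameterized, the toy generator below produces trails as correlated random walks with tunable length, curvature and brokenness. It is only a stand-in: the actual generative model is specified in Appendix C 5, and the parameter names and functional form here are assumptions.

```python
import numpy as np

def sample_trail(n_steps=300, step=1.0, curvature=0.2, p_break=0.05, seed=None):
    """Toy trail generator: the heading diffuses with std `curvature` per step,
    and points are dropped with probability `p_break` to create broken segments.
    Harder levels would correspond to longer, curvier and more broken trails."""
    rng = np.random.default_rng(seed)
    heading, pos, points = 0.0, np.zeros(2), []
    for _ in range(n_steps):
        heading += curvature * rng.standard_normal()
        pos = pos + step * np.array([np.cos(heading), np.sin(heading)])
        if rng.random() > p_break:
            points.append(pos.copy())
    return np.array(points)
```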
FIG. 4:
Deep reinforcement learning agents trained using a curriculum solve navigation tasks with delayed rewards. (a) The trail tracking paradigm. A sample trajectory of a trained agent navigating a randomly sampled odor trail. The colors show odor concentration. The inset shows the egocentric visuospatial input received by the network. (b) Sample trails from the six difficulty levels. (c) ADP outperforms INC and RAND (each teacher-student interaction is a step). Note that the agent does not learn the task without a curriculum. (d) The success rate of the agent in finding the target over training (black dashed line) for INC and ADP. The curriculum is shown in red. Note the significant forgetting shown by the student trained using the INC approach compared to ADP. (e–g) As in panels a–d, but for a localization task. The agent is required to navigate towards a source which emits Poisson-distributed cues whose detection probability decreases with distance from the source (colored in green on a log scale).
The student does not learn without a curriculum. ADP outperforms both INC and a curriculum (RAND) that randomly chooses from the task set (Figure 4c). The resulting curricula show that ADP alternates between difficulty levels as in the sequence learning task, presumably mitigating the extinction effects associated with transitions to more difficult tasks. INC is comparable to ADP but experiences a greater degree of forgetting, as seen in the longer time it spends at the highest difficulty level (Figure 4d). The path of an agent tracking the trail is shown in Figure 4a. The agent exhibits a preference for localizing at the edge of the gradient. When it encounters a break, the student performs repeated loops of increasing radius until it re-establishes contact with the trail. A detailed analysis of the student’s behavior during trail tracking is left to future work.
Next, we extended this framework to a localization task (Figure 4e–g) inspired by naturalistic plume tracking [52, 53] and sound localization tasks. In each episode, the student begins at a random location a certain distance from a target whose (fixed) location is unknown. A unit reward is delivered when the student localizes at the target. The student receives sparse, Poisson-distributed cues from the target with a probability that depends on its location relative to the target. These cues provide information about the location of the target, which the student can use to solve the task (see Appendix C 6 for full details). The delayed reward and sparse cues make it challenging to train agents without a shaping protocol. We consider a curriculum where the difficulty scale is determined by the rate of detecting a cue at the student’s initial position, as well as the student’s distance from the target (Figure 4f). As in the trail tracking setting, the results recapitulate the better performance of ADP compared to INC and RAND (Figure 4g,h).
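A minimal sketch of the kind of cue model we have in mind is shown below. The exponential decay of the detection rate with distance and the parameter names are illustrative assumptions; the exact cue statistics are given in Appendix C 6.

```python
import numpy as np

def sample_cue(agent_pos, source_pos, base_rate=1.0, length_scale=10.0, rng=np.random):
    """Return True if a cue is detected on this time step. Cues arrive as a
    Poisson process whose rate decays with distance to the source, so the
    per-step detection probability is 1 - exp(-rate)."""
    d = np.linalg.norm(np.asarray(agent_pos) - np.asarray(source_pos))
    rate = base_rate * np.exp(-d / length_scale)   # assumed distance dependence
    return rng.random() < 1.0 - np.exp(-rate)
```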
F. Continuous curricula
Our analysis up to this point assumes discrete curricula. A consequence of discrete curricula is that an unexpectedly large jump in difficulty from one level to the next can stall learning. In such situations, an animal trainer has the option of decomposing the task further and proceeding with an INC approach. However, if the jump from one level to the next is too small, the student progresses in unnecessarily small steps while the teacher incurs a temporal cost on superfluous evaluations. On a continuous curriculum, an optimal teacher has to adjust difficulty increments such that they reflect the student’s innate biases. Here, we explore preliminary ideas for designing continuous curricula using a continuous extension of the sequence learning task and a concomitant modification of the student’s learning algorithm (see Appendix D for more details).
We consider a continuous ADP teacher modified to accommodate the particulars of a continuous curriculum. At the start of an interaction, the experimenter proposes an initial “rough guess” for the difficulty increment used by the teacher. As the ADP teacher progresses, it tweaks the size of this increment based on the student’s performance. In addition to the three actions (increase, decrease and maintain difficulty), we introduce a second set of three actions: increase, decrease and retain the increment interval (the teacher selects from nine actions at each step). As in the discrete case, we use differential evolution to find the best decision tree (Figure 5a). Figure 5c,d shows trajectories for the continuous ADP teacher, which compares favorably with INC in benchmarks (Figure 5b).
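The sketch below illustrates one simple coupling of the two decision sets (difficulty level and increment size). The optimized tree in Figure 5a is richer than this, and the thresholds and growth factor are placeholders rather than the optimized values.

```python
def continuous_adp_decide(rate, delta, level, step, tau=0.9, mu=0.05, grow=1.5):
    """Return the next difficulty level and increment size. When the student is
    clearly succeeding, advance and grow the increment; under significant
    extinction, back off and shrink it; otherwise stay at the current level."""
    if rate >= tau:
        return level + step, step * grow
    if delta < -mu:
        return max(level - step, 0.0), step / grow
    return level, step
```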
FIG. 5:
Algorithms for designing continuous curricula. (a) Decision tree showing the continuous version of ADP, which includes actions that “grow” and “shrink” the increments between continuously parameterized difficulty levels. See the text for more details of the task in the continuous setting. (b) ADP significantly outperforms INC when the task is difficult (low ε). (c,d) The q values plotted as in Figure 3d,e. Similar to the discrete setting, INC shows catastrophic extinction and never learns the task for sufficiently small ε. Continuous ADP first decreases the increment size and then smoothly increases the difficulty level while balancing reinforcement and extinction.
IV. DISCUSSION
From Skinner’s missile-guidance pigeons [54] to laboratory rodent experiments to state-of-the-art artificial RL agents, curriculum design plays a foundational role in training agents to solve complex tasks. Here, inspired by behavioral shaping, we propose an outcome-based curriculum learning (OCL) framework and develop adaptive algorithms aimed primarily at training laboratory animals. In a sequence learning task, dual waves of reinforcement and extinction modulate the student’s performance, necessitating a shaping strategy that carefully balances the two. A naïve teacher, INC, fails to prevent extinction when students encounter large jumps in difficulty. A near-optimal teacher strategy (POMCP), discovered by formulating teaching as optimal planning under uncertainty, relies on frequent alternations between the current and previous task difficulty levels, which ameliorates extinction. Inspired by this observation, we use differential evolution to design a decision-tree-based heuristic algorithm, ADP. ADP is much more efficient than POMCP while achieving comparable performance, significantly outperforms other algorithms on the sequence learning task and requires no fine-tuning for the task or student. ADP also outperforms other curriculum strategies when applied to train deep RL agents on complex, naturalistic navigational tasks.
We focus primarily on cases where the curriculum can be decomposed into rigid, discrete difficulty levels. Real-world tasks can often be broken down further when students encounter a bottleneck. We explore one continuous generalization of ADP that relies on finite approximations to continuous intervals, coupled with a K-step TD learning rule. The continuous setting poses a distinct challenge: since the teacher does not know a priori whether the student can solve an incrementally harder version of the task, estimating this from a transcript takes additional samples and thus incurs a temporal cost. Infinitesimal increases in difficulty are therefore not optimal. On the other hand, large jumps in difficulty will stall learning. We expect competitive algorithms to appropriately balance these two factors; a more exhaustive exploration of continuous OCL algorithms will be considered in future work.
The curricula we explore here have all involved a single axis of difficulty. For many real-world tasks, there are multiple axes that must all be optimized simultaneously. For example, a tennis player has to learn and compose multiple elements – footwork, various racquet motions, tactics – in order to improve general playing skill. In the trail tracking setting, we have simplified all such factors (length, average curvature, brokenness of the trails) into a single difficulty scale, when ideally, the teacher should choose how to modulate the difficulty along each factor. One avenue for future work is to generalize our teacher algorithms to settings where there are multiple independent skills that need to be learned to solve the full task.
Finally, shaping is a crucial aspect of training animals. Concepts like task difficulty levels, the innate bias ε, and behavioral extinction have natural analogs in biological agents. The teacher algorithms developed in this work can be readily deployed on real animals and their efficacy measured. Many laboratory tasks in model systems such as mice involve extensive training lasting for weeks or more [13]. It is often unclear whether this lengthy training reflects the innate difficulty animals face in learning the tasks at hand, or inefficient curriculum design [15]. Developing better teacher algorithms for animal training may result in significant savings in the time and cost of producing well-trained subjects. In addition to practical benefits for laboratory research, any demonstration of more rapid training of animals will also shed light on their learning capabilities and limits. The gradual shaping we discuss here may also be related to the gradual introduction of more intuitive coincidences that exploit an animal’s priors to allow more rapid learning [15]. Such hand-crafted shaping is common in laboratory experiments [9–13], but more precise quantitative descriptions of behavioral learning algorithms, such as ours, open the possibility of designing near-optimal teaching strategies in more general scenarios, similar to the POMCP formulation that we have developed here for an RL-based student. Optimistically, such formulations might even impact curriculum design for human students.
ACKNOWLEDGMENTS
We thank Jacob Zavatone-Veth and members of the Murthy lab for helpful discussions. This work was supported by a joint research agreement between NTT Research Inc. and Harvard University, including grant A47994. VNM is partially supported by NIH RF1NS128865 and R01DC017311.
Contributor Information
William L. Tong, School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA.
Anisha Iyer, University of California Berkeley, CA, USA.
Venkatesh N. Murthy, Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA and Center for Brain Science, Harvard University, Cambridge, MA, USA.
Gautam Reddy, Physics & Informatics Laboratories, NTT Research, Inc., Sunnyvale, CA, USA and Center for Brain Science, Harvard University, Cambridge, MA, USA.
References
- [1] Cooper J. O., Heron T. E., and Heward W. L., Applied behavior analysis (Pearson UK, 2020).
- [2] Lindsay S. R., Handbook of applied dog behavior and training, adaptation and learning, Vol. 1 (John Wiley & Sons, 2013).
- [3] Skinner B., The behavior of organisms: An experimental analysis (BF Skinner Foundation, 2019).
- [4] Pryor K., Don’t shoot the dog: The art of teaching and training (Simon & Schuster, 2019).
- [5] MacKay D. J., Information-based objective functions for active data selection, Neural computation 4, 590 (1992).
- [6] Bak J. H., Choi J. Y., Akrami A., Witten I., and Pillow J. W., Adaptive optimal training of animal behavior, Advances in neural information processing systems 29 (2016).
- [7] Cohn D., Ghahramani Z., and Jordan M., Active learning with statistical models, Advances in neural information processing systems 7 (1994).
- [8] Ren P., Xiao Y., Chang X., Huang P.-Y., Li Z., Gupta B. B., Chen X., and Wang X., A survey of deep active learning, ACM computing surveys (CSUR) 54, 1 (2021).
- [9] Bancroft S. L., Weiss J. S., Libby M. E., and Ahearn W. H., A comparison of procedural variations in teaching behavior chains: Manual guidance, trainer completion, and no completion of untrained steps, Journal of applied behavior analysis 44, 559 (2011).
- [10] Pinto L., Koay S. A., Engelhard B., Yoon A. M., Deverett B., Thiberge S. Y., Witten I. B., Tank D. W., and Brody C. D., An accumulation-of-evidence task using visual pulses for mice navigating in virtual reality, Frontiers in behavioral neuroscience 12, 36 (2018).
- [11] Guo Z. V., Hires S. A., Li N., O’Connor D. H., Komiyama T., Ophir E., Huber D., Bonardi C., Morandell K., Gutnisky D., et al., Procedures for behavioral experiments in head-fixed mice, PloS one 9, e88678 (2014).
- [12] Rokni D., Hemmelder V., Kapoor V., and Murthy V. N., An olfactory cocktail party: figure-ground segregation of odorants in rodents, Nature neuroscience 17, 1225 (2014).
- [13] International Brain Laboratory, Aguillon-Rodriguez V., Angelaki D., Bayer H., Bonacchi N., Carandini M., Cazettes F., Chapuis G., Churchland A. K., Dan Y., et al., Standardized and reproducible measurement of decision-making in mice, Elife 10, e63711 (2021).
- [14] Kepple D. R., Engelken R., and Rajan K., Curriculum learning as a tool to uncover learning principles in the brain, in International Conference on Learning Representations (2022).
- [15] Meister M., Learning, fast and slow, Current opinion in neurobiology 75, 102555 (2022).
- [16] Selfridge O. G., Sutton R. S., and Barto A. G., Training and tracking in robotics, in Ijcai (1985) pp. 670–672.
- [17] Gullapalli V. and Barto A. G., Shaping as a method for accelerating reinforcement learning, in Proceedings of the 1992 IEEE international symposium on intelligent control (IEEE, 1992) pp. 554–559.
- [18] Elman J. L., Learning and development in neural networks: The importance of starting small, Cognition 48, 71 (1993).
- [19] Randløv J. and Alstrøm P., Learning to drive a bicycle using reinforcement learning and shaping, in ICML, Vol. 98 (1998) pp. 463–471.
- [20] Krueger K. A. and Dayan P., Flexible shaping: How learning in small steps helps, Cognition 110, 380 (2009).
- [21] Dorigo M. and Colombetti M., Robot shaping: an experiment in behavior engineering (MIT press, 1998).
- [22] Bengio Y., Louradour J., Collobert R., and Weston J., Curriculum learning, in Proceedings of the 26th annual international conference on machine learning (2009) pp. 41–48.
- [23] Portelas R., Colas C., Weng L., Hofmann K., and Oudeyer P.-Y., Automatic curriculum learning for deep rl: A short survey, arXiv preprint arXiv:2003.04664 (2020).
- [24] Florensa C., Held D., Wulfmeier M., Zhang M., and Abbeel P., Reverse curriculum generation for reinforcement learning, in Conference on robot learning (PMLR, 2017) pp. 482–495.
- [25] Ivanovic B., Harrison J., Sharma A., Chen M., and Pavone M., Barc: Backward reachability curriculum for robotic reinforcement learning, in 2019 International Conference on Robotics and Automation (ICRA) (IEEE, 2019) pp. 15–21.
- [26] Salimans T. and Chen R., Learning montezuma’s revenge from a single demonstration, arXiv preprint arXiv:1812.03381 (2018).
- [27] Chentanez N., Barto A., and Singh S., Intrinsically motivated reinforcement learning, Advances in neural information processing systems 17 (2004).
- [28] Forestier S., Portelas R., Mollard Y., and Oudeyer P.-Y., Intrinsically motivated goal exploration processes with automatic curriculum learning, The Journal of Machine Learning Research 23, 6818 (2022).
- [29] Bellemare M., Srinivasan S., Ostrovski G., Schaul T., Saxton D., and Munos R., Unifying count-based exploration and intrinsic motivation, Advances in neural information processing systems 29 (2016).
- [30] Pathak D., Gandhi D., and Gupta A., Self-supervised exploration via disagreement, in International conference on machine learning (PMLR, 2019) pp. 5062–5071.
- [31] Shyam P., Jaśkowski W., and Gomez F., Model-based active exploration, in International conference on machine learning (PMLR, 2019) pp. 5779–5788.
- [32] Eysenbach B., Gupta A., Ibarz J., and Levine S., Diversity is all you need: Learning skills without a reward function, arXiv preprint arXiv:1802.06070 (2018).
- [33] Yang T., Tang H., Bai C., Liu J., Hao J., Meng Z., Liu P., and Wang Z., Exploration in deep reinforcement learning: a comprehensive survey, arXiv preprint arXiv:2109.06668 (2021).
- [34] Ladosz P., Weng L., Kim M., and Oh H., Exploration in deep reinforcement learning: A survey, Information Fusion (2022).
- [35] Ng A. Y., Harada D., and Russell S., Policy invariance under reward transformations: Theory and application to reward shaping, in Icml, Vol. 99 (Citeseer, 1999) pp. 278–287.
- [36] Hu Y., Wang W., Jia H., Wang Y., Chen Y., Hao J., Wu F., and Fan C., Learning to utilize shaping rewards: A new approach of reward shaping, Advances in Neural Information Processing Systems 33, 15931 (2020).
- [37] Laud A. D., Theory and application of reward shaping in reinforcement learning (University of Illinois at Urbana-Champaign, 2004).
- [38] Fournier P., Sigaud O., Chetouani M., and Oudeyer P.-Y., Accuracy-based curriculum learning in deep reinforcement learning, arXiv preprint arXiv:1806.09614 (2018).
- [39] Nair A., McGrew B., Andrychowicz M., Zaremba W., and Abbeel P., Overcoming exploration in reinforcement learning with demonstrations, in 2018 IEEE international conference on robotics and automation (ICRA) (IEEE, 2018) pp. 6292–6299.
- [40] Bajaj V., Sharon G., and Stone P., Task phasing: Automated curriculum learning from demonstrations, arXiv preprint arXiv:2210.10999 (2022).
- [41] Matiisen T., Oliver A., Cohen T., and Schulman J., Teacher-student curriculum learning, IEEE transactions on neural networks and learning systems 31, 3732 (2019).
- [42] Portelas R., Colas C., Hofmann K., and Oudeyer P.-Y., Teacher algorithms for curriculum learning of deep rl in continuously parameterized environments, in Conference on Robot Learning (PMLR, 2020) pp. 835–853.
- [43] Oudeyer P.-Y., Kaplan F., and Hafner V. V., Intrinsic motivation systems for autonomous mental development, IEEE transactions on evolutionary computation 11, 265 (2007).
- [44] Reddy G., A reinforcement-based mechanism for discontinuous learning, Proceedings of the National Academy of Sciences 119, e2215352119 (2022).
- [45] Astrom K. J., Optimal control of markov processes with incomplete state information, Journal of mathematical analysis and applications 10, 174 (1965).
- [46] Sondik E. J., The optimal control of partially observable markov processes over the infinite horizon: Discounted costs, Operations research 26, 282 (1978).
- [47] Kaelbling L. P., Littman M. L., and Cassandra A. R., Planning and acting in partially observable stochastic domains, Artificial intelligence 101, 99 (1998).
- [48] Silver D. and Veness J., Monte-carlo planning in large pomdps, Advances in Neural Information Processing Systems 23 (2010).
- [49] Browne C. B., Powley E., Whitehouse D., Lucas S. M., Cowling P. I., Rohlfshagen P., Tavener S., Perez D., Samothrakis S., and Colton S., A survey of monte carlo tree search methods, IEEE Transactions on Computational Intelligence and AI in games 4, 1 (2012).
- [50] Gerritsen R. and Haak R., K9 Scent Training: A Manual for Training Your Identification, Tracking and Detection Dog (Dog Training Press, 2015).
- [51] Schulman J., Wolski F., Dhariwal P., Radford A., and Klimov O., Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).
- [52] Vergassola M., Villermaux E., and Shraiman B. I., ‘infotaxis’ as a strategy for searching without gradients, Nature 445, 406 (2007).
- [53] Reddy G., Murthy V. N., and Vergassola M., Olfactory sensing and navigation in turbulent environments, Annual Review of Condensed Matter Physics 13, 191 (2022).
- [54] Skinner B. F., Pigeons in a pelican, American Psychologist 15, 28 (1960).
- [55] Sutton R. S. and Barto A. G., Reinforcement Learning: an Introduction (MIT press, 2018).
- [56] Van Seijen H., Van Hasselt H., Whiteson S., and Wiering M., A theoretical and empirical analysis of expected sarsa, in 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (IEEE, 2009) pp. 177–184.
- [57] Astrom K. J., Optimal Control of Markov Processes with Incomplete State Information I, Journal of Mathematical Analysis and Applications, Vol. 10 (Elsevier, 1965) pp. 174–205.
- [58] Storn R. and Price K., Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces, Journal of Global Optimization 11, 341 (1997).
- [59] Hansen N., The cma evolution strategy: A tutorial, arXiv preprint arXiv:1604.00772 (2016).
- [60] Barros R. C., Basgalupp M. P., De Carvalho A. C., and Freitas A. A., A survey of evolutionary algorithms for decision-tree induction, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, 291 (2011).
- [61] Bergstra J. and Bengio Y., Random search for hyperparameter optimization, Journal of machine learning research 13 (2012).
- [62] Draft R. W., McGill M. R., Kapoor V., and Murthy V. N., Carpenter ants use diverse antennae sampling strategies to track odor trails, Journal of Experimental Biology 221, 10.1242/jeb.185124 (2018).
- [63] Hepper P. G. and Wells D. L., How many footsteps do dogs need to determine the direction of an odour trail?, Chemical Senses 30, 291 (2005).
- [64] Wallace D. G., Gorny B., and Whishaw I. Q., Rats can track odors, other rats, and themselves: implications for the study of spatial behavior, Behavioural brain research 131, 185 (2002).
- [65] Mnih V., Kavukcuoglu K., Silver D., Graves A., Antonoglou I., Wierstra D., and Riedmiller M., Playing atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602 (2013).
- [66] Ng A. Y., Harada D., and Russell S., Policy invariance under reward transformations: Theory and application to reward shaping, in ICML, Vol. 99 (1999) pp. 278–287.
- [67] Hu Y., Wang W., Jia H., Wang Y., Chen Y., Hao J., Wu F., and Fan C., Learning to utilize shaping rewards: A new approach of reward shaping, Advances in Neural Information Processing Systems 33, 15931 (2020).
- [68] Wiewiora E., Potential-based shaping and q-value initialization are equivalent, Journal of Artificial Intelligence Research 19, 205 (2003).
- [69] Schulman J., Wolski F., Dhariwal P., Radford A., and Klimov O., Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).
- [70] Raffin A., Hill A., Gleave A., Kanervisto A., Ernestus M., and Dormann N., Stable-baselines3: Reliable reinforcement learning implementations, Journal of Machine Learning Research 22, 1 (2021).
- [71] Reddy G., Shraiman B. I., and Vergassola M., Sector search strategies for odor trail tracking, Proceedings of the National Academy of Sciences 119, e2107431118 (2022).
- [72] Vergassola M., Villermaux E., and Shraiman B. I., ‘infotaxis’ as a strategy for searching without gradients, Nature 445, 406 (2007).
- [73] Mnih V., Kavukcuoglu K., Silver D., Rusu A. A., Veness J., Bellemare M. G., Graves A., Riedmiller M., Fidjeland A. K., Ostrovski G., et al., Human-level control through deep reinforcement learning, Nature 518, 529 (2015).