How the Mind Creates Structure: Hierarchical Learning of Action Sequences

Maria K Eckstein; Anne GE Collins

Author manuscript; available in PMC: 2021 Dec 27.

Published in final edited form as: Cogsci. 2021 Jul;43:618–624.

PMCID: PMC8711273 NIHMSID: NIHMS1764449 PMID: 34964045

Abstract

Humans have the astonishing capacity to quickly adapt to varying environmental demands and reach complex goals in the absence of extrinsic rewards. Part of what underlies this capacity is the ability to flexibly reuse and recombine previous experiences, and to plan future courses of action in a psychological space that is shaped by these experiences. Decades of research have suggested that humans use hierarchical representations for efficient planning and flexibility, but the origin of these representations has remained elusive. This study investigates how 73 participants learned hierarchical representations through experience, in a task in which they had to perform complex action sequences to obtain rewards. Complex action sequences were composed of simpler action sequences, which were not rewarded, but whose completion was signaled to participants. We investigated the process by which participants learned to perform simpler action sequences and combined them into complex action sequences. After learning action sequences, participants completed a transfer phase in which either simple sequences or complex sequences were manipulated without notice. Relearning progressed more slowly when simple sequences were changed than when complex sequences were changed, in accordance with a hierarchical representation in which lower levels are quickly consolidated, potentially stabilizing exploration, while higher levels remain malleable, with benefits for flexible recombination.

Keywords: Hierarchical cognition, reinforcement learning, action sequence learning, hierarchical reinforcement learning

Introduction

The necessity of hierarchy for complex problem solving.

Humans need appropriate abstract representations to solve everyday problems because our world is intractably complex. One issue is the world’s high dimensionality: planning anything on human time scales would be impossible if plans were made in the space of raw features, because this would quickly lead to a combinatorial explosion of possibilities. Thus, real-world problems can only be solved by relying on abstract, lower-dimensional representations. Furthermore, real-world problems are sparse in rewards. Without explicit rewards to guide our way step-by-step toward desired long-term goals, we need to rely on intrinsic motivation to provide this scaffolding in the form of sub-goals. Learning which environmental states make useful sub-goals is part of forming an abstract representation.

Previous research on abstraction.

Previous research across disciplines has shown the central role of hierarchy in complex problem solving. Artificial Intelligence (AI) has developed algorithms that abstract over time (Sutton et al., 1999), states (Finn et al., 2017; Vezhnevets et al., 2017), and learning itself (Wang et al., 2016) to solve increasingly difficult problems. In Psychology, decades of research have shown that mental representations are hierarchical, most notably in the domains of cognitive control, expertise, and sequential action (Broadbent, 1977; Cohen, 2000; Newell, 1994). Recent research has increasingly focused on formalizing models of hierarchical cognition, using hierarchical Bayesian models (e.g., Griffiths et al., 2019; Kemp and Tenenbaum, 2008; Solway et al., 2014) and hierarchical Reinforcement Learning (RL) models (e.g., Botvinick and Weinstein, 2014; Eckstein and Collins, 2020; Frank and Badre, 2012). Neuroscience research has shown that the brain itself is organized hierarchically, both in the sense of “processing hierarchies”, in which superordinate levels operate over longer time scales and asymmetrically modulate subordinate processing, and of “representational hierarchies”, in which superordinate representations form abstractions over subordinate representations, favoring generality over detail, and allowing information to be inherited asymmetrically from higher to lower levels (for reviews, see Balleine et al., 2015; Miller and Cohen, 2001).

The difficulty of discovering hierarchy.

Despite this near-universal conviction that hierarchical representations are necessary to solve problems of real-world complexity, the question remains how to create appropriate hierarchical representations. In AI, this has been called the “option discovery problem” because abstract, multi-step policies are often called “options” (Sutton et al., 1999). Hierarchical representations are only beneficial when they condense the important aspects of a task, and can be disadvantageous otherwise (Sutton et al., 1999). Humans have been shown to discover the Bayes-optimal task decomposition when solving complex problems (Solway et al., 2014), but it is unclear how they discover these decompositions, given that they lack access to the full state space and have limited computational resources. Research in AI has investigated several promising avenues for discovering appropriate hierarchical representations. Some equip agents with intrinsic motivation, a form of motivation that is independent of extrinsic rewards and often tries to mimic novelty seeking and curiosity observed in humans and animals (Gershman and Niv, 2015; Lieshout et al., 2018). Others analyze the abstract problem structure and try to locate bottlenecks of the state space or locations of advantageous graph-theoretical measures such as maximum betweenness (e.g., Pathak et al., 2017; for review, see Konidaris, 2019). The goal of both approaches is to identify states that would make appropriate targets for multi-step actions, and thereby create a hierarchical representation.

Our hypothesis.

Our study tests whether humans use such approaches to create hierarchical representations. We assess whether participants create hierarchical representations piece-by-piece, continuously learning new, ever more complex actions. We propose that, starting from a set of basic actions, humans first explore the world by executing actions randomly. Some combinations of basic actions lead to unexpected events, and thereby trigger curiosity (an interest in unexpected, non-rewarded events). Curiosity motivates further exploration until the target event can be reproduced reliably, using a sequence of basic actions. Once the target event can be achieved reliably, the required sequence of basic actions has been consolidated as a new skill, laying the foundation of hierarchical structure. Employing learned skills instead of basic actions enables more targeted exploration, and can speed up the acquisition of even more abstract skills, which are combinations of less abstract skills, following the same curiosity-guided process. This approach is appropriate for environments with sparse rewards because environmental signals other than rewards are used as targets to create skills. Combinatorial explosion is reduced because consolidated skills reduce the number of multi-step actions compared to a random combination of basic skills.
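The reduction in combinatorial explosion can be illustrated with a small count. The sketch below borrows the dimensions of the task described in the Methods (four available keys, four key presses per trial, four valid 2-key sequences). It additionally assumes that a skill may be used in either slot and may be repeated, which is an illustrative simplification rather than a claim about participants' actual search strategy.

```python
from itertools import product

N_KEYS = 4          # basic actions available on one machine
PRESSES = 4         # key presses per trial
N_VALID_PAIRS = 4   # valid 2-key sequences, treated here as consolidated skills

# Flat search: every ordered sequence of four basic key presses.
flat_space = list(product(range(N_KEYS), repeat=PRESSES))

# Hierarchical search: every ordered pair of consolidated 2-key skills
# (assumption: skills can be repeated and placed in either slot).
skill_space = list(product(range(N_VALID_PAIRS), repeat=PRESSES // 2))

print(len(flat_space))   # 256
print(len(skill_space))  # 16
```

Under these assumptions, exploring with consolidated skills shrinks the candidate set for a 4-key sequence from 256 to 16.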

Methods

Task overview.

To test our hypothesis, we created a task in which participants learned to execute complex action sequences, which were composed of simpler action sequences, which were in turn composed of basic actions (Fig. 1). Participants’ goal in the task was to create a specific star on each trial, using a star-making machine. The machine accepted 4 key presses per trial, and created a star when a correct 4-key sequence was typed in. Four different stars, learned in successive blocks, required four different 4-key sequences. Crucially, each star’s 4-key sequence was composed of two “valid” 2-key sequences. A 2-key sequence is defined as valid if executing it leads to an item appearing on the machine. This task is fundamentally hierarchical: basic actions (individual key presses) form the lowest level; valid 2-key sequences, which lead to an unexpected event (item appearing) but are not rewarded, have to be learned through intrinsic motivation and form the intermediate level; valid 4-key sequences, which lead to reward (stars) and are composed of valid 2-key sequences, form the most abstract level. The goal of this task was to elicit intrinsically-motivated learning of hierarchical structure.
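The three levels of the task can be sketched as two lookup tables. In the sketch below, only item a (keys 0, 1), item b (keys 2, 3), and star S0 (items a and b) follow the Figure 1 caption; every other rule, the order of items within a star, and the names `LOW_RULES`, `HIGH_RULES`, and `run_trial` are hypothetical and chosen for illustration only.

```python
# Hypothetical rule tables. Only "a" = keys (0, 1), "b" = keys (2, 3), and
# "S0" = items a + b come from the Figure 1 caption; the remaining entries
# and the item order within each star are made up for illustration.
LOW_RULES = {(0, 1): "a", (2, 3): "b", (1, 2): "c", (3, 0): "d"}
HIGH_RULES = {("a", "b"): "S0", ("c", "d"): "S1", ("b", "c"): "S2", ("d", "a"): "S3"}


def run_trial(keys, goal_star):
    """Return the items, the star (if any), and the reward for one 4-key trial."""
    if len(keys) != 4:
        raise ValueError("a trial consists of exactly four key presses")
    first_item = LOW_RULES.get(tuple(keys[:2]))   # slot 1: subtrials 1 and 2
    second_item = LOW_RULES.get(tuple(keys[2:]))  # slot 2: subtrials 3 and 4
    # A star appears only if both slots hold items that form a valid combination.
    star = HIGH_RULES.get((first_item, second_item))
    # Only the goal star is rewarded; items alone never earn a point.
    reward = 1 if star is not None and star == goal_star else 0
    return {"items": (first_item, second_item), "star": star, "reward": reward}
```

For example, `run_trial([0, 1, 2, 3], "S0")` produces both items, the star, and one point, whereas `run_trial([0, 1, 0, 1], "S0")` produces two items but no star and no point.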

Figure 1: Task design. (A) On each trial, participants sequentially enter four key presses in a self-paced manner. Each key press is acknowledged by the appearance of a colored circle in the response board. Each valid 2-key sequence (see rules in part C) is acknowledged by the appearance of a unique item. In the top row, keys 0 (subtrial 3; orange) and 1 (subtrial 4; teal) led to item a (gear). In the second row, the combination of 2-key sequences a (keys 0, 1) and b (keys 2, 3) led to the appearance of star S0. (B) Rules for valid sequences. Table “Learning phase rules” shows key sequences for the learning phase. The column “Low rules” shows valid 2-key sequences, with “Actions” referring to the identity and order of actions that need to be executed, and “Item” referring to the resulting item. Keyboard keys were randomly assigned to actions, and images were randomly assigned to items. The column “High rules” shows how valid 4-key sequences are composed of 2-key sequences (“Items”) and to which star they lead. Table “Transfer phase rules”: In the transfer phase, either low-level rules or high-level rules were manipulated. For low-level (high-level) manipulation, the low-level (high-level) rules in the Learning table were replaced by the low-level (high-level) rules in this table. Differences between learning and transfer rules are highlighted in red. (C) Overview of stars presented during learning and transfer phases. All four stars were learned, but only two were selected for the transfer phase. (E) Learning curves for learning (left) and transfer phase (right). Accuracy indicates whether the goal star (shown below each block) was achieved.

After a learning phase, an unsignaled transfer phase investigated whether participants represented actions at different levels of abstraction (low or high) differently, by changing some of the actions that were required to make stars. In the low-level manipulation, some 2-key sequences were modified by replacing individual keys; in the high-level manipulation, some 4-key sequences were modified by replacing entire 2-key sequences (Fig. 1B, right). Even though both manipulations affected similar numbers of individual keys in the tested stars (Fig. 1C), they should lead to performance differences if participants use a hierarchical representation. Specifically, low-level 2-key sequences, being more consolidated, should be difficult to re-learn, whereas high-level combinations of 2-key sequences, being more flexible and malleable, should be less affected by transfer.

Experimental details.

All participants first provided online informed consent, in accordance with the Institutional Review Board of the University of California, Berkeley, and completed a standard demographics form. They then worked on the task, which consisted of a tutorial, a learning and transfer phase with one machine using one hand, and another learning and transfer phase with a different machine using the other hand. After the task, participants completed a questionnaire about their strategies employed during the task.

On each trial, participants pressed four keys with the goal of finding the current trial’s goal star, shown at the top of the screen, to receive a point (Fig. 1A). A point counter kept track of participants’ cumulative points. We call each key press within a trial a “subtrial”. Four keys were available on each trial, depending on the machine: Q, W, E, and R (left hand); or U, I, O, and P (right hand). Participants were allowed a maximum of 2.5 seconds for each trial; when the four key presses took longer, participants were told to respond faster next time and the trial was counted as missed. Each trial was followed by a 0.5-second inter-trial interval, after which the next trial started. Each key press was immediately visualized as a colored circle in a response box underneath the star machine, with a one-to-one match between key and color. When participants executed a valid 2-key sequence within the first (last) two slots, an item immediately appeared on the left (right) side of the machine’s window. Each of the four valid 2-key sequences was represented by a unique item. When participants executed a valid 4-key sequence, a star immediately appeared. When the star coincided with the goal star, the point counter incremented by 1 point. When a trial did not form a valid 4-key sequence, no star appeared. Incorrect trials were not signaled otherwise.

Valid 2-key and 4-key sequences were constructed to maximize similarity between high-level and low-level transfer for experimental control. The same abstract rules were used for all participants (Fig. 1B–C). Systematic biases were avoided by randomizing the assignment of actions to keys, 2-key sequences to items, and 4-key sequences to stars.

For each machine, participants completed 12 blocks of 25 trials (with 4 key presses each) during the training phase, and 8 blocks of 25 trials during the transfer phase. Within each block, the goal star remained the same. All four stars were shown in the training phase, and two in the transfer phase (Fig. 1D). Block order was pseudo-randomized, such that the same goal star did not occur in two subsequent blocks, and each goal star was presented once in each mega-block of 4 blocks (training phase only). The transition between learning and transfer phase was not signaled.
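One way to generate a training block order that satisfies these constraints is to shuffle each mega-block separately and reshuffle whenever it would start with the star that ended the previous mega-block. This is a minimal sketch under that assumption; the text does not specify how the pseudo-randomization was implemented, and the function name `training_block_order` and the star labels are hypothetical.

```python
import random


def training_block_order(stars, n_megablocks=3, rng=None):
    """Goal-star order for the training phase (one entry per block).

    Each mega-block contains every star exactly once, and no star is the
    goal in two consecutive blocks. Requires at least two distinct stars,
    otherwise the reshuffling loop cannot terminate.
    """
    rng = rng or random.Random()
    order = []
    for _ in range(n_megablocks):
        megablock = list(stars)
        rng.shuffle(megablock)
        # Reshuffle until this mega-block does not repeat the previous block's star.
        while order and megablock[0] == order[-1]:
            rng.shuffle(megablock)
        order.extend(megablock)
    return order


# 3 mega-blocks of 4 blocks give the 12 training blocks described above.
order = training_block_order(["S0", "S1", "S2", "S3"], rng=random.Random(0))
```

Within a mega-block all stars differ by construction, so only the boundary between mega-blocks needs to be checked.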

After completing the first machine (learning and transfer), participants took a 1-minute break. After the break, they were presented with a new machine, and were instructed to use the opposite hand on a different set of keys. Order of hands and order of machines (low vs high transfer) were jointly randomized between participants. The new machine followed the same abstract rules as the old machine, but keys were randomly re-assigned to the new set of keys to avoid biases and minimize transfer effects. A novel set of items indicated valid 2-key sequences, and a novel set of stars indicated valid 4-key sequences. The task was written in jsPsych (de Leeuw, 2015), a JavaScript library that facilitates online data collection.

Participants.

73 undergraduate participants completed the task online for course credit (58 females, 13 males, 2 declined to answer). Two participants were excluded because they reported present or past psychological illness; 2 more were excluded because they had experienced head trauma or loss of consciousness. Six were excluded because they missed more than 50 trials (mean missed trials after excluding: 11.6, sd: 9.2, min: 1, max: 35). Two were excluded because they took more than 60 minutes for the task (mean duration after excluding: 36 minutes, min: 26, max: 46, sd: 5.3). Eleven were excluded because they used pen and paper or other external devices to help with the task, which potentially obscured the cognitive processes we aimed to investigate. (Because the study was conducted online, we could not monitor use of pen and paper directly, and relied on an explicit question in the post-experiment questionnaire for exclusion.) In total, 17 participants were excluded, leading to a final sample of 56 participants (45 females, 10 males, 1 declined to answer; mean age: 20.6, min: 18.1, max: 31.8, sd: 1.96).

Data analysis.

We used Python for data analysis and visualization. Regression models were fitted using the statsmodels package (Seabold and Perktold, 2010). Unless otherwise specified, we used mixed-effects models and defined each participant as a group.

Results

Creating Hierarchy by Learning Action Sequences

We first investigated how participants learned new action sequences, focusing on just the learning phase.

Slowing after unexpected event.

We had hypothesized that unexpected items would trigger participants’ curiosity and facilitate learning 2-key sequences. To test this, we examined whether participants slowed down after discovering a new item for the first time. Slowing commonly arises after errors (Danielmeier and Ullsperger, 2011), rewards (Raio et al., 2020), or surprising events (Parmentier et al., 2019), and is usually interpreted as an orienting response, potentially related to learning and the processing of prediction errors.

We assessed response times for the third key press in a trial (subtrial 3) when an item was discovered for the first time in a block on subtrial 2, comparing trials in which an item was discovered to the preceding and subsequent trials (Fig. 2A, red line). Repeated-measures t-tests, Bonferroni-corrected for multiple comparisons, revealed that participants were significantly slower on the trial of item discovery compared to both the preceding (t(54) = 4.1, p = 0.0003) and subsequent trial (t(54) = 6.9, p < 0.001). This slowing was a specific post-item effect rather than general slowing: it occurred on subtrial 3 when an item had appeared on subtrial 2, but not when the item appeared only afterward, on subtrial 4 (Fig. 2A, blue line).
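The comparison above can be sketched as a paired t statistic followed by a Bonferroni adjustment. The helper names `paired_t` and `bonferroni` and the response times below are made up for illustration; the sketch stops at the t statistic and degrees of freedom, and a p-value would then be read from a t distribution with those degrees of freedom.

```python
import math
import statistics


def paired_t(x, y):
    """Repeated-measures t statistic and degrees of freedom for x - y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error, len(diffs) - 1


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Made-up per-participant mean response times (seconds) on subtrial 3.
rt_discovery_trial = [0.62, 0.71, 0.58, 0.66, 0.75]
rt_preceding_trial = [0.51, 0.60, 0.55, 0.52, 0.61]
t_value, df = paired_t(rt_discovery_trial, rt_preceding_trial)
```

A positive `t_value` here means slower responses on the discovery trial than on the preceding trial.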

Figure 2: (A) Post-item slowing. Response times were elevated on subtrial 3 when a 2-key sequence was discovered before (on subtrials 1 and 2; red), but not after (subtrials 3 and 4; blue), revealing specific post-item slowing. Dots represent means; error bars show between-participant 95% confidence intervals. (B) Repetition of valid and invalid sequences after first discovery. Trial 0 shows the first execution of a 2-key sequence in a block. Subsequent trials show the proportion of trials on which the same sequence was executed, separately for valid (blue) and invalid (red) 2-key sequences. The inset shows the within-participant difference between valid and invalid sequences. (C) Number of 2-key sequences executed per trial (block 1 only). The blue line shows the average of the four valid 2-key sequences (signaled by item appearance), and the red line shows the average of four matched invalid 2-key sequences (not signaled by items). The maximum number of 2-key sequences per trial is two because each trial allows for four key presses. The red and blue lines do not add up to two because only matched invalid sequences were included in the analysis. (D) Response time for each key press within a trial. Colors indicate block number (dark to light). Stars show results of repeated-measures t-tests described in the main text.

Repetition of valid sequences.

If participants were indeed learning valid 2-key sequences based on item feedback, the frequency of valid 2-key sequences should increase over time. To test this hypothesis, we compared the frequency of executing the four valid 2-key sequences, aligned to their first discovery in a block, to four randomly selected invalid, but structurally similar 2-key sequences (Fig. 2B). We then calculated the difference between the proportion of valid versus invalid sequences for each participant and each trial (Fig. 2B, inset), and used mixed-effects regression to predict this difference from the trial since sequence discovery. This analysis revealed a significant difference from zero (Intercept β = 0.13, z = 14.7, p < 0.001) with a negative effect of trial (β = −0.01, z = −6.6, p < 0.001), confirming that participants repeated valid 2-key sequences more often than invalid ones, and that this difference shrank with time since first sequence execution. This analysis was restricted to trials in which participants did not reach the goal star (incorrect trials) because correct trials naturally have a higher proportion of valid compared to invalid 2-key sequences (all valid 4-key sequences are composed of valid 2-key sequences), and would therefore bias the result. In sum, participants more often repeated valid than invalid 2-key sequences, suggesting that the appearance of items indeed triggered intrinsically motivated learning.

Increased use of valid 2-key sequences.

We next assessed whether the overall proportion of valid compared to invalid sequences increased over the course of the first learning block, in accordance with the hypothesized expansion of the action repertoire (Fig. 2C). We used mixed-effects regression to predict the number of valid and invalid 2-key sequences on each trial (0, 1, or 2) from sequence validity (valid vs invalid) and trial number (1–25), as well as their interaction. The significant interaction between sequence validity and trial (β = 0.036, z = 7.1, p < 0.001) confirmed that the trajectories of valid and invalid sequences differed. Follow-up models revealed a positive slope for valid sequences (β = 0.011, p = 0.04), indicating an increase in use, and a negative slope for invalid sequences (β = −0.025, p < 0.001), indicating a decrease in use. For the same reason as above, the analysis was limited to incorrect trials only. This is in accordance with participants expanding their action repertoire, which initially contained only individual keys (basic actions), by adding temporally-extended actions. Rather than exploring the environment based on basic actions, participants seemed to shift toward using 2-key sequences for exploration.

Patterned response times.

We next assessed the temporal structure of participants’ key presses, hypothesizing that if participants treated valid 2-key sequences like stand-alone actions, the two keys of the sequence would be executed in quick succession, compared to slower execution at sequence boundaries. Indeed, participants responded faster at sequence completion than at sequence initiation, with faster response times on subtrial 2 compared to 1, and subtrial 4 compared to 3, for both correct and incorrect trials, as revealed by repeated-measures t-tests with 8-way Bonferroni correction (all t(55)s > 2.9, all ps < 0.04; Fig. 2D). Interestingly, participants also responded faster on subtrial 3 compared to 1, and 4 compared to 2, suggesting frontloading of processing, such that the second 2-key sequence was already partly prepared before or during the first sequence. This pronounced slow-fast-slow-fast response pattern suggests that participants executed two distinct 2-key actions rather than four individual actions, supporting our hypothesis that participants chunked pairs of key presses into a single unit, consolidating 2-key sequences into distinctive, temporally-extended actions.
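The slow-fast-slow-fast pattern can be made concrete with a small check on the four response times of a trial. The function names and the `chunking_index` summary below are hypothetical illustrations, not measures used in the reported analysis, which relied on the t-tests described above.

```python
def is_slow_fast_slow_fast(rts):
    """True if each 2-key sequence is completed faster than it is initiated."""
    rt1, rt2, rt3, rt4 = rts
    return rt2 < rt1 and rt4 < rt3


def chunking_index(rts):
    """Mean initiation RT (subtrials 1, 3) minus mean completion RT (subtrials 2, 4).

    Positive values indicate slower responses at sequence boundaries,
    as expected if pairs of key presses are executed as one unit.
    """
    rt1, rt2, rt3, rt4 = rts
    return (rt1 + rt3) / 2 - (rt2 + rt4) / 2


# Made-up response times (seconds) for subtrials 1 to 4 of one trial.
example_rts = [0.60, 0.30, 0.50, 0.25]
```

With these example values, subtrial 3 is also faster than subtrial 1 and subtrial 4 faster than subtrial 2, mirroring the frontloading pattern described above.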

Using Hierarchy for Exploration and Planning

We next investigated whether and how participants used their learned hierarchical action space for exploration and planning in the transfer phase.

Using 2-key sequences for exploration.

To this end, we tested whether participants actively moved 2-key sequences between slots (first slot: subtrials 1 and 2; second slot: subtrials 3 and 4). This would indicate that they did not just learn 2-key sequences as distinct, stand-alone actions, but also actively explored how to reach the star with them. Specifically, we assessed the number of trials that passed between the first discovery of a new 2-key sequence and its first use in the opposite slot, and compared it between valid and invalid sequences in each block (Fig. 3A). On average, participants took 5.6 trials to transfer valid sequences but 6.2 for invalid ones. Mixed-effects regression on the difference revealed a significant intercept (β = −0.87, se = 0.33, z = −2.63, p = 0.008), with no effect of block (β = −0.004, p = 0.94), confirming that participants transferred valid 2-key sequences faster than expected based on baseline (invalid sequences). In sum, participants quickly transferred valid action sequences from the position in which they were originally discovered to the opposite position, suggesting flexible reuse and exploration.
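The trials-to-transfer measure can be sketched as follows, assuming that each trial is available as a tuple of four key identities in the order executed within a block. The function name `trials_to_transfer` and the treatment of edge cases (returning None when no transfer occurs, and 0 when the sequence fills both slots on its discovery trial) are assumptions of this sketch, not details given in the text.

```python
def trials_to_transfer(trials, pair):
    """Trials between the first use of `pair` in one slot and its first use
    in the opposite slot.

    `trials` is a list of 4-key tuples in the order executed within a block.
    Returns None if the pair is never executed or never moves to the other slot.
    """
    pair = tuple(pair)
    discovery_trial = None
    discovery_slot = None
    for i, keys in enumerate(trials):
        slots = (tuple(keys[:2]), tuple(keys[2:]))  # slot 0: subtrials 1-2, slot 1: subtrials 3-4
        if discovery_trial is None and pair in slots:
            discovery_trial = i
            discovery_slot = slots.index(pair)
        # Once discovered, look for the pair in the opposite slot.
        if discovery_trial is not None and slots[1 - discovery_slot] == pair:
            return i - discovery_trial
    return None


# Made-up block: (0, 1) is discovered in the first slot on the second trial
# and reused in the second slot two trials later.
block = [(0, 0, 2, 3), (0, 1, 2, 2), (3, 3, 3, 3), (2, 3, 0, 1)]
```

Applying the same function to valid and to matched invalid sequences would yield the two quantities compared in the analysis above.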

Figure 3: (A) Number of trials to use a 2-key sequence that was first discovered in one position in the opposite position of a trial. (B) Performance in the transfer phase. Accuracy (left) and response times (right) over trials, averaged over blocks, for both high (red) and low (blue) transfer phases.

Differences between high and low transfer.

Finally, we assessed the transfer phase of the experiment, comparing the impact of modifying 2-key sequences (low-level manipulation) versus sequences of 2-key sequences (high-level manipulation), while controlling for overall change in 4-key sequences (Fig. 1C). We predicted that the low-level manipulation would impair performance more because 2-key sequences are more consolidated and less accessible to change once they are part of participants’ abstract action space. The high-level manipulation should affect performance less because the combination of 2-key sequences is less consolidated than the 2-key sequences themselves, and new associations can be re-learned more easily. We tested this prediction by probing differences between accuracy during high-level and low-level manipulation (Fig. 3B), using mixed-effects regression to predict accuracy from transfer type (high versus low), trial number (1–25), and their interaction. The model showed main effects of transfer type (β = 0.10, z = 6.4, p < 0.001) and trial (β = 0.1, z = 15.7, p < 0.001) and no interaction (β = 0.002, z = 1.6, p = 0.11), revealing that performance was indeed more impaired during the low-level manipulation, and that this effect did not diminish over time. In sum, performance suffered more due to low-level than high-level manipulation, in accordance with a role of 2-key sequences as building blocks for planning and complex action.

Discussion

This study provides a detailed analysis of human hierarchical action spaces, focusing on their step-by-step creation and subsequent use.

Creating hierarchical action spaces.

We first assessed how humans learned new action sequences, creating temporal hierarchy. We found that participants slowed down when their actions unexpectedly produced an unknown event (item), in accordance with previous accounts of feedback processing (Parmentier et al., 2019) and curiosity-driven learning (Gershman and Niv, 2015). Having discovered a new item, participants subsequently increased the use of the 2-key sequence that led to it, revealing intrinsically motivated behavior. Overall, participants used increasingly more valid and fewer invalid 2-key sequences, suggesting that they progressively explored the task based on 2-key sequences rather than individual keys. Achieving similar behaviors has been the target of hierarchically-structured AI algorithms (e.g., Singh et al., 2005; Wang et al., 2016), and our research sheds light on how this process unfolds in humans. Response time patterns within trials confirmed that individual keys were clustered into 2-key sequences, linking our findings to rodent research that offers insight into the neural implementation of hierarchical action sequences (Geddes et al., 2018).

Using hierarchical action spaces.

Having learned new 2-key sequences, participants successfully reused them in different trial slots, showing active, hypothesis-driven exploration. In the transfer phase, manipulating the low-level rules of 2-key sequences affected performance more than manipulating the high-level rules of the composition of 4-key sequences. This confirms that adapting 2-key sequences was more difficult than adapting 4-key sequences, in accordance with a fundamental integration of 2-key sequences into the action repertoire. These findings are in accordance with previous literature on cognitive hierarchies, which showed that humans entertain representations with multiple levels of abstraction (Kemp and Tenenbaum, 2008), whereby levels differ in precision and learning speed (Goodman et al., 2011).

Limitations and future research.

One limitation of our study is the inherent difficulty of directly comparing low-level and high-level manipulation. By definition, they affect key sequences differently, and we found that controlling one aspect of experimental design (e.g., positions of manipulated keys) often led to disturbances in others (e.g., number of manipulated keys). We aimed to address this issue by choosing imbalances that worked against our hypothesis (e.g., favoring performance in low-level manipulation), thereby ensuring that an existing effect (e.g., better performance in high-level manipulation) was due to hierarchical structure, rather than design imbalances.

Future research will include computational modeling of this task. We have previously presented an algorithm that formalizes our hypotheses on the creation of hierarchical action spaces, and makes predictions about task variants with more (or fewer) levels and more (or fewer) basic actions (Eckstein and Collins, 2017). We will next apply this algorithm to this dataset.

Acknowledgements

Zisu Dong was involved in an earlier version of this project, Aram Moghadassi helped with task design, and Amy Zou implemented the task and collected data.

References

Balleine BW, Dezfouli A, Ito M, & Doya K (2015). Hierarchical control of goal-directed action in the cortical–basal ganglia network. Current Opinion in Behavioral Sciences, 5, 1–7. 10.1016/j.cobeha.2015.06.001 [DOI] [Google Scholar]
Botvinick M, & Weinstein A (2014). Model-based hierarchical reinforcement learning and human action control. Phil. Trans. R. Soc. B, 369(1655), 20130480. 10.1098/rstb.2013.0480 [DOI] [PMC free article] [PubMed] [Google Scholar]
Broadbent DE (1977). Levels, hierarchies, and the locus of control [Place: United Kingdom Publisher: Taylor & Francis]. The Quarterly Journal of Experimental Psychology, 29(2), 181–201. 10.1080/14640747708400596 [DOI] [Google Scholar]
Cohen G (2000). Hierarchical models in cognition: Do they have psychological reality? European Journal of Cognitive Psychology, 12(1), 1–36. 10.1080/095414400382181 [DOI] [Google Scholar]
Danielmeier C, & Ullsperger M (2011). Post-Error Adjustments. Frontiers in Psychology, 2. 10.3389/fpsyg.2011.00233 [DOI] [PMC free article] [PubMed] [Google Scholar]
de Leeuw JR (2015). jsPsych: A JavaScript library for creating behavioral experiments in a Web browser. Behavior Research Methods, 47(1), 1–12. 10.3758/s13428-014-0458-y [DOI] [PubMed] [Google Scholar]
Eckstein MK, & Collins AGE (2017). CHRL: Combining intrinsic motivation and hierarchical reinforcement learning. Advances in Neural Information Processing Systems, workshop. [Google Scholar]
Eckstein MK, & Collins AGE (2020). Computational evidence for hierarchically structured reinforcement learning in humans. Proceedings of the National Academy of Sciences, 117(47), 29381–29389. 10.1073/pnas.1912330117 [DOI] [PMC free article] [PubMed] [Google Scholar]
Finn C, Abbeel P, & Levine S (2017). Model-agnostic meta-learning for fast adaptation of deep networks. Proceedings of the 34th International Conference on Machine Learning-Volume 70, 1126–1135. [Google Scholar]
Frank MJ, & Badre D (2012). Mechanisms of Hierarchical Reinforcement Learning in Cortico-Striatal Circuits 1: Computational Analysis. Cerebral Cortex, 22(3), 509–526. 10.1093/cercor/bhr114 [DOI] [PMC free article] [PubMed] [Google Scholar]
Geddes CE, Li H, & Jin X (2018). Optogenetic Editing Reveals the Hierarchical Organization of Learned Action Sequences. Cell, 174(1), 32–43.e15. 10.1016/j.cell.2018.06.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
Gershman SJ, & Niv Y (2015). Novelty and Inductive Generalization in Human Reinforcement Learning. Topics in Cognitive Science, 7(3), 391–415. 10.1111/tops.12138 [DOI] [PMC free article] [PubMed] [Google Scholar]
Goodman ND, Ullman TD, & Tenenbaum JB (2011). Learning a theory of causality. Psychological Review, 118(1), 110–119. 10.1037/a0021336 [DOI] [PubMed] [Google Scholar]
Griffiths TL, Callaway F, Chang MB, Grant E, Krueger PM, & Lieder F (2019). Doing more with less: Meta-reasoning and meta-learning in humans and machines. Current Opinion in Behavioral Sciences, 29, 24–30. 10.1016/j.cobeha.2019.01.005 [DOI] [Google Scholar]
Kemp C, & Tenenbaum JB (2008). The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31), 10687–10692. [DOI] [PMC free article] [PubMed] [Google Scholar]
Konidaris G (2019). On the necessity of abstraction. Current Opinion in Behavioral Sciences, 29, 1–7. 10.1016/j.cobeha.2018.11.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
van Lieshout LLF, Vandenbroucke ARE, Müller NCJ, Cools R, & de Lange FP (2018). Induction and relief of curiosity elicit parietal and frontal activity. Journal of Neuroscience, 2816–17. 10.1523/JNEUROSCI.2816-17.2018 [DOI] [PMC free article] [PubMed] [Google Scholar]
Miller EK, & Cohen JD (2001). An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24, 167–202. 10.1146/annurev.neuro.24.1.167 [DOI] [PubMed] [Google Scholar]
Newell A (1994). Unified Theories of Cognition. Harvard University Press. [Google Scholar]
Parmentier FBR, Vasilev MR, & Andrés P (2019). Surprise as an explanation to auditory novelty distraction and post-error slowing. Journal of Experimental Psychology: General, 148(1), 192–200. 10.1037/xge0000497 [DOI] [PubMed] [Google Scholar]
Pathak D, Agrawal P, Efros AA, & Darrell T (2017). Curiosity-driven exploration by self-supervised prediction. arXiv preprint arXiv:1705.05363. [Google Scholar]
Raio CM, Konova AB, & Otto AR (2020). Trait impulsivity and acute stress interact to influence choice and decision speed during multi-stage decision-making. Scientific Reports, 10(1), 7754. 10.1038/s41598-020-64540-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
Seabold S, & Perktold J (2010). Statsmodels: Econometric and Statistical Modeling with Python, 5. [Google Scholar]
Singh S, Barto AG, & Chentanez N (2005). Intrinsically Motivated Reinforcement Learning (tech. rep.). Defense Technical Information Center. Fort Belvoir, VA. 10.21236/ADA440280 [DOI] [Google Scholar]
Solway A, Diuk C, Córdova N, Yee D, Barto AG, Niv Y, & Botvinick M (2014). Optimal behavioral hierarchy. PLoS Computational Biology, 10(8), e1003779. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sutton RS, Precup D, & Singh S (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1), 181–211. 10.1016/S0004-3702(99)00052-1 [DOI] [Google Scholar]
Vezhnevets AS, Osindero S, Schaul T, Heess N, Jaderberg M, Silver D, & Kavukcuoglu K (2017). FeUdal Networks for Hierarchical Reinforcement Learning. arXiv:1703.01161. [Google Scholar]
Wang JX, Kurth-Nelson Z, Tirumala D, Soyer H, Leibo JZ, Munos R, Blundell C, Kumaran D, & Botvinick M (2016). Learning to reinforcement learn. arXiv preprint arXiv:1611.05763. [Google Scholar]
