Abstract
The skills required for the learning and use of language are the focus of extensive research, and their evolutionary origins are widely debated. Using agent-based simulations in a range of virtual environments, we demonstrate that challenges of foraging for food can select for cognitive mechanisms supporting complex, hierarchical, sequential learning, the need for which arises in language acquisition. Building on previous work, where we explored the conditions under which reinforcement learning is outcompeted by seldom-reinforced continuous learning that constructs a network model of the environment, we now show that realistic features of the foraging environment can select for two critical advances: (i) chunking of meaningful sequences found in the data, leading to representations composed of units that better fit the prevalent statistical patterns in the environment; and (ii) generalization across units based on their contextual similarity. Importantly, these learning processes, which in our framework evolved for making better foraging decisions, have previously been shown to reproduce a range of findings in language learning in humans. Thus, our results suggest a possible evolutionary trajectory that may have led from basic learning mechanisms to complex hierarchical sequential learning that can support advanced cognitive abilities of the kind needed for language acquisition.
Keywords: evolution, learning, cognition, foraging, language
1. Introduction
The abilities that underlie the learning and use of language are the focus of extensive research, and their evolutionary origins are widely debated [1–6]. Despite the attention that this matter has received, a detailed account of the selective pressures that may have given rise to relevant cognitive abilities is lacking. One such potentially relevant ability is statistical learning, which supports the discovery of statistical regularities in the environment [7]. In many cases, statistical learning can be accounted for by reinforcement learning mechanisms that act locally, in the sense that the learner associates cues and actions with contingent (positive or negative) rewards [8–12]. Yet we have recently shown [13] that such mechanisms can be outcompeted by variants that are less dependent on direct reinforcement and that learn regularities over elements in the environment continuously (see also [14,15]). By implementing different learners and comparing their foraging success in a range of virtual environments, we found that a learner that keeps track of the transition probabilities (TPs) between elements in its environment, including elements that provide no direct reinforcement, can, in the long run, use this information adaptively to choose among alternative paths in the context of foraging for food. This type of learning requires an internal representation of the environment, such as a Markov model over the elements that comprise it, which serves as a kind of grammar, distinguishing likely sequences of elements from less likely ones. Notably, sensitivity to TPs has been demonstrated in infants and is instrumental in explaining unconditioned learning of regularities by a range of organisms [16–18]. However, representing TPs may not be enough: any learning-based theory of complex sequential behaviours (including language as an extreme case) must account for the evolution of sequence segmentation and of similarity-based categorization of elements and element sequences.
As in language, where there are no silences between words [19,20], the input streams of sensory data from which animals must learn are continuous; in order to learn their underlying structure, they must be parsed correctly (i.e. broken down into their natural constituents, over which useful regularities can then be inferred). In the general case, these units are not known a priori; in some cases, their segmentation from the data stream is straightforward, based on salient perceptual cues, but it may also be complex and require some form of statistical inference regarding what should be viewed as an independent unit [21–26]. Such segmentation (also termed ‘chunking’1) has been proposed to explain social learning of motor behavioural sequences by great apes [27] and the composition of songs that combine chunks learned from different tutors in songbirds [16,28].
What type of natural structure can one expect to find in the environment? Physical environments are typically structured hierarchically [29], with some meaningful units consisting in turn of yet smaller elements. This is true for very different environments and at many scales of magnitude; for example, a leaf is composed of a petiole, veins and skin; a tree is composed of trunk, branches, leaves and fruit; and a grove is composed of different bushes and trees. To learn to forage efficiently in such an environment, for example, an organism must first segment it correctly into meaningful units, and then learn the statistical regularities that govern the relations among these units.
In addition to hierarchical composition (structural hierarchy), many environments also exhibit functional hierarchies, such as hierarchical categorical similarities among elements, a characteristic that makes it possible to generalize across elements. For example, for a bird, both a grasshopper and a cicada may be associated with being found on vegetation, being able to move their legs and to walk, and with being palatable when caught. Based on this functional similarity, after repeatedly observing that a grasshopper can fly away when approached, the bird may also expect that a cicada can fly away when approached. The possibility of similarity-based generalization [30] may alleviate the challenge of statistical inference about regularities that characterize some units, particularly if they are rare, as regularities that had been learned with regard to one unit may be assumed—with caution (cf. [31])—to be shared by other units that are similar to it. This mode of inference may carry adaptive value in foraging for food.
In the light of these considerations, we propose that selection for finding food in structured environments may have facilitated the evolution of the critical protolinguistic abilities of chunking and similarity-based generalization. In what follows, we describe a number of simple learning mechanisms, which differ from one another by small increments in an increasing order of complexity, along with a number of virtual environments that we constructed for testing these mechanisms. Based on a comparison of how different mechanisms perform in learning and foraging in these environments, we identify conditions that may select for one mechanism over another. Finally, we discuss these results in relation to a previous study in which we implemented the same learning processes to reproduce a range of findings in language learning in humans.
2. The model
Our aim is to explore the conditions under which two critical improvements over simple continuous learning may evolve, yielding a learning mechanism with protolinguistic abilities. These are (i) chunking of meaningful sequences in the data, which helps identify stable units and hierarchies in the environment, and (ii) generalization across these units based on contextual similarity. The two were chosen because, alongside their importance in learning of animal sequential behaviour [16,27,28], they are typically viewed as fundamental aspects of language and language learning [32–36]. Our model thus implements three learning mechanisms—simple learning, learning with chunking, and learning with chunking and generalization—and compares their foraging success in a range of virtual environments. In what follows, we describe the learning mechanisms, the virtual environments, and the training and test procedures. Foraging environments were constructed using Matlab (2012) scripts, simulations were programmed in Java (using JDK v. 6.0), and statistical analysis was carried out in JMP v. 10.0. Full methodological details for this section are provided in electronic supplementary materials S1 and S3.
(a). Learning mechanisms
The first mechanism, element-based learning (EBL), constructs a world model in the form of a directed graph in which it represents the statistical associations between the basic elements in the environment: how frequently every two elements follow one another. This mechanism has been previously shown to be useful in learning environmental regularities in the context of foraging, and the conditions in which it may have evolved are described in [13], where it is referred to as ‘continuous learning’. Here, we call it EBL to emphasize the fact that it can only learn the basic elements of the environment without creating chunks or making generalizations (which can be done by the second and third mechanisms that are also based on continuous learning). Figure 1a illustrates a representation learned by an EBL learner.
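As a concrete illustration, the sketch below (in Java, the language in which the simulations were implemented) builds such a directed graph of transition counts from an unsegmented stream of base elements and reads off estimated TPs. The class and method names and the toy input are illustrative assumptions, not the authors' code.

```java
// A minimal sketch of element-based learning (EBL) over single-character base elements.
import java.util.HashMap;
import java.util.Map;

public class EblSketch {

    // counts.get(a).get(b) = number of times element b immediately followed element a
    private final Map<Character, Map<Character, Integer>> counts = new HashMap<>();

    // Observe the unsegmented input stream, one pair of adjacent elements at a time.
    public void observe(String stream) {
        for (int i = 0; i + 1 < stream.length(); i++) {
            counts.computeIfAbsent(stream.charAt(i), k -> new HashMap<>())
                  .merge(stream.charAt(i + 1), 1, Integer::sum);
        }
    }

    // Estimated transition probability P(next | current).
    public double tp(char current, char next) {
        Map<Character, Integer> row = counts.get(current);
        if (row == null) return 0.0;
        int total = row.values().stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0.0 : (double) row.getOrDefault(next, 0) / total;
    }

    public static void main(String[] args) {
        EblSketch learner = new EblSketch();
        learner.observe("jkglghghFjkglg");        // a toy unsegmented stream
        System.out.println(learner.tp('l', 'g')); // high because 'g' always follows 'l' here
    }
}
```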
Figure 1.
Representations and a learning set. (a) The representation constructed by an EBL learner following training on the training set in (c). Weights were omitted for visual clarity. (b) A part of the representation constructed by a CBL learner following training on the training set in (c). Weights, some vertices and some links have been omitted for clarity. Note that in addition to units over which the environment's regularities were defined, learners may also learn parts of such units and other recurring sequences (not illustrated). (c) A list of units that comprise an environment, and a part of the training set from that environment. The learners are exposed to the environment as a non-separated sequence of characters; breakpoints among units were added in this illustration for clarity.
The second mechanism, chunk-based learning (CBL) or ‘chunking’, can, in addition to representing basic elements, represent in its directed graph short sequences of basic elements, chunks, as independent units in its world model, and learn their statistical associations to other units. The chunking process takes place via two complementary mechanisms: one (CBL-1, ‘bottom-up’) that may concatenate into a chunk previously known units that have high TPs between them, and another (CBL-2, ‘top-down’) that recognizes sequences that recur in the data within a short time window and are therefore likely to represent meaningful units [22,24,37,38]. This is done by retaining the most recent section of the input sequence in a short-term working memory, and continuously aligning this sequence to shifted versions of itself to find recurring subsequences. See further explanations below and electronic supplementary material S1 for full details. In reality, it is likely that the two mechanisms are not distinct; we separate them here in order to gain insights about each and to tease apart their effects. Figure 1b illustrates the representation learned by a CBL learner in a very simple environment.
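The two chunking routes can be sketched as follows; the TP threshold, the treatment of the working-memory window and all names are illustrative assumptions, and the actual mechanisms (including their statistical criteria) are specified in electronic supplementary material S1.

```java
// A minimal sketch of the two chunking routes over a repertoire of string-valued units.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ChunkingSketch {

    // CBL-1 ("bottom-up"): concatenate two adjacent known units into a candidate chunk
    // when the estimated transition probability between them exceeds a threshold.
    static Set<String> bottomUpChunks(List<String> parsedStream, double tpThreshold) {
        Set<String> newChunks = new HashSet<>();
        for (int i = 0; i + 1 < parsedStream.size(); i++) {
            String a = parsedStream.get(i), b = parsedStream.get(i + 1);
            if (transitionProbability(parsedStream, a, b) > tpThreshold) {
                newChunks.add(a + b);
            }
        }
        return newChunks;
    }

    // CBL-2 ("top-down"): align the recent working-memory window with shifted copies of
    // itself; any subsequence that recurs within the window is a candidate chunk.
    static Set<String> topDownChunks(String window, int minLen) {
        Set<String> newChunks = new HashSet<>();
        for (int len = minLen; len <= window.length() / 2; len++) {
            for (int i = 0; i + len <= window.length(); i++) {
                String candidate = window.substring(i, i + len);
                if (window.indexOf(candidate, i + 1) >= 0) {
                    newChunks.add(candidate);
                }
            }
        }
        return newChunks;
    }

    // Empirical P(b follows a) in a stream that has already been parsed into known units.
    static double transitionProbability(List<String> stream, String a, String b) {
        int aCount = 0, abCount = 0;
        for (int i = 0; i + 1 < stream.size(); i++) {
            if (stream.get(i).equals(a)) {
                aCount++;
                if (stream.get(i + 1).equals(b)) abCount++;
            }
        }
        return aCount == 0 ? 0.0 : (double) abCount / aCount;
    }

    public static void main(String[] args) {
        System.out.println(bottomUpChunks(List.of("j", "k", "g", "j", "k", "g", "l", "g"), 0.9));
        System.out.println(topDownChunks("jkglgjkghghjkg", 3)); // finds the recurring "jkg"
    }
}
```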
The third mechanism, chunk-based learning with similarity (CLS), is akin to CBL, but additionally uses the data acquired about each unit in the world representation, in the form of spatial relations to other units, to estimate the similarity among units: the similarity between two units is the inner product of the vectors that describe their TPs to other units. This means that two units that tend to occur in similar contexts will be viewed as highly similar to one another. Such similarity can then be used for learning from one unit about others, and more broadly, for generalization.
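A minimal sketch of this similarity computation follows, assuming each unit is summarized by the vector of its TPs to other units; whether and how these vectors are normalized in the actual model is described in electronic supplementary material S1.

```java
// A minimal sketch of contextual similarity: the inner product of two units' TP vectors.
import java.util.HashMap;
import java.util.Map;

public class SimilaritySketch {

    // tpA and tpB map each following-unit to its transition probability from A or B.
    static double similarity(Map<String, Double> tpA, Map<String, Double> tpB) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : tpA.entrySet()) {
            dot += e.getValue() * tpB.getOrDefault(e.getKey(), 0.0);
        }
        return dot;
    }

    public static void main(String[] args) {
        Map<String, Double> grasshopper = new HashMap<>();
        grasshopper.put("leaf", 0.7); grasshopper.put("branch", 0.3);
        Map<String, Double> cicada = new HashMap<>();
        cicada.put("leaf", 0.6); cicada.put("branch", 0.4);
        // Units that occur in similar contexts receive a high score: 0.7*0.6 + 0.3*0.4 = 0.54.
        System.out.println(similarity(grasshopper, cicada));
    }
}
```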
We have shown in previous work how the latter two mechanisms, despite their simplicity, may successfully account for a range of findings in language learning [26]. The details of the learning mechanisms, their parameters, implementation and the reasoning behind them, are described in electronic supplementary material S1.
(b). Foraging environments
Each environment in our simulations can be represented as a graph, whose vertices represent stable high-order units in the environment, such as trees or flowers that are composed of smaller basic elements. The basic elements may represent the most basic building blocks that form the input stream (set by the perceptual system) or they may themselves be composed of smaller basic ‘perceptual’ elements, but they are regarded here as the basic elements for simplicity. Except where noted otherwise, each unit in our virtual environments is composed of two to four basic elements (which represent its constituents, such as petals or branches), each represented by a letter, henceforth referred to as a ‘base element’ (figure 2). The edges in the graph denote the immediate spatial proximity between the units. The structure of the foraging environments (and hence of the graphs that represent them) is governed by a set of stochastic rules that define the probability of adjacency of each unit to every other unit in the environment. For simplicity, we generated the environments as linear sequences, which correspond to the learners’ experience of encountering the stimuli while moving forward, and not to the (possibly more complex) structure of the world. Each environment can thus be described as a transition probability (TP) matrix M, of size N × N, where N is the number of unit types that occur in the environment, and each entry mij corresponds to the probability that unit i will be followed by unit j. All environments include a special single-element unit, which represents food, is denoted by F and is typically rare (figure 2).
Figure 2.
An illustration of a simulated foraging environment composed of a string of basic elements denoted by letters, and a possible visual representation of the elements in the environment that they represent. The environment is constructed of high-order units, such as those represented by jkg, lg and hgh, and its regularities are defined over these; thus, for example, food in this environment has the highest probability of occurring adjacent to the unit represented as lg.
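The sketch below illustrates how a training stream of base elements can be generated from such a TP matrix over high-order units; the unit inventory and the matrix values are toy assumptions rather than those used in the simulations.

```java
// A minimal sketch of generating an unsegmented training stream from a TP matrix M
// defined over high-order units (F marks the single-element food unit).
import java.util.Random;

public class EnvironmentSketch {

    public static void main(String[] args) {
        String[] units = {"jkg", "lg", "hgh", "F"};     // toy unit inventory
        double[][] M = {                                 // M[i][j] = P(unit j follows unit i)
            {0.5, 0.3, 0.15, 0.05},
            {0.2, 0.3, 0.3, 0.2},                        // food most likely after "lg"
            {0.4, 0.3, 0.25, 0.05},
            {0.4, 0.3, 0.3, 0.0}
        };
        Random rng = new Random(42);
        StringBuilder stream = new StringBuilder();
        int current = 0;
        while (stream.length() < 60) {                   // 14 000 base elements in the real runs
            stream.append(units[current]);
            current = sampleNext(M[current], rng);
        }
        System.out.println(stream);                      // the unsegmented stream a learner sees
    }

    // Draw the index of the next unit from one row of the TP matrix.
    static int sampleNext(double[] row, Random rng) {
        double r = rng.nextDouble(), cum = 0.0;
        for (int j = 0; j < row.length; j++) {
            cum += row[j];
            if (r < cum) return j;
        }
        return row.length - 1;
    }
}
```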
We simulated three types of environments (further details are found in electronic supplementary material S1):
(1) Web environment: an environment in which any unit may be found adjacent to any other unit at some probability. These probabilities are usually not equal as they are drawn from a gamma distribution and then normalized to sum to 1. Note that the name ‘web environment’ highlights the possibility of finding any two units next to each other with some probability; however, this environment, like all environments in our framework, was realized in the form of a linear sequence of elements that the learner is exposed to.
(2) Patchy environment: a type of web environment that contains a number of groups of units that tend to co-occur (i.e. instead of all TPs between every two units being randomly drawn, only the TPs between units within each group are drawn randomly and normalized, whereas the TP between each unit and units that belong to other groups is set to be low and constant). This leads to high TPs among units within each group and low TPs among units from different groups. The result is an environment composed of ‘patches’: sequences of units that belong to the same group. A minimal construction sketch of such a matrix is given after this list.
(3) Web environment with similar units: the environment is similar to the web environment described above, but additionally every unit has a ‘similar other’: another unit which has the same TPs with other units (i.e. that occurs in nature in the same contexts), which may allow CLS learners to generalize from one unit to its similar one (see above and electronic supplementary material S1). To create a realistic scenario in which generalization may be advantageous, one of the units in each pair of similar units was never associated with food during training, but was associated with food as frequently as its ‘similar other’ during the test phase (see below). This procedure mimics a natural situation in which, for example, one of two very similar types of tree rarely gives fruit early in the season while the other does, but both provide fruit later in the season. Clearly, there is no need to generalize from one unit to another if the association with food can be learned equally well, with regard to each of them separately, in parallel. To further explore the conditions for adaptive generalization, additional types of ‘web environments with similar units’ were also tested (see §3; electronic supplementary material S2).
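As a concrete illustration of the patchy case (item 2 above), the sketch below constructs a block-structured TP matrix in which within-group transitions are drawn at random and normalized, while transitions to units in other groups are a small constant; the group sizes, the uniform draw standing in for the gamma draw, and the constant are illustrative assumptions.

```java
// A minimal sketch of building a patchy environment's TP matrix with block structure.
import java.util.Random;

public class PatchyMatrixSketch {

    // n units split into consecutive groups of groupSize; returns a row-normalized TP matrix.
    static double[][] patchyMatrix(int n, int groupSize, double crossGroupTp, Random rng) {
        double[][] M = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                boolean sameGroup = (i / groupSize) == (j / groupSize);
                // a crude stand-in for the gamma draw used in the actual simulations
                M[i][j] = sameGroup ? rng.nextDouble() : crossGroupTp;
            }
            double rowSum = 0.0;
            for (double v : M[i]) rowSum += v;
            for (int j = 0; j < n; j++) M[i][j] /= rowSum;   // normalize the row to sum to 1
        }
        return M;
    }

    public static void main(String[] args) {
        double[][] M = patchyMatrix(10, 5, 0.01, new Random(1));
        System.out.printf("within-group TP %.3f vs cross-group TP %.3f%n", M[0][1], M[0][7]);
    }
}
```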
(c). Training and test procedure
In each simulation, each learner was trained on a particular environment, and then tested for its foraging success in that environment. For each simulation run, an independent TP matrix that defines the environment was constructed (see detailed explanations in electronic supplementary material S1), and was used to construct a 14 000-base-element training set (see example in figure 1c), and a test set composed of 10 000 sequences of 12 base elements each. Except where noted otherwise, 500 simulations were conducted for each type of environment.
In every simulation, each learner was provided first with the training set as input (figure 1c). After training, the learner was presented with the 10 000 test sequences (snippets of 12 base elements). The learner could only ‘see’ the first six elements of each snippet, whereas the last six elements were left unexposed. The learner had to choose the most promising 5000 among the 10 000 snippets: because the sequence of exposed elements may be predictive of food, owing to the underlying TPs between units, successful learning of the environmental regularities may allow a learner to assign a higher food-finding score (see below) to the sequences that are more likely to contain a food element in their unexposed section. The success rate of a learner in each simulation was defined as the total number of unexposed food elements that occurred in the sequences it had chosen, divided by the total number of food elements that occurred in the test set (excluding snippets that contained food in their exposed section). Because the learners choose half of the overall number of snippets, the expected random success rate is 0.5: although food elements are rare, approximately half of the total food elements in the test set are expected to occur in the randomly chosen snippets. We note that as a result of the exclusion of snippets that contain food in their exposed section, the mean success rate of a random chooser, which constitutes the baseline, may be slightly less than 0.5 (figure 3; see also §b in electronic supplementary material S3).
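To make the scoring-and-selection procedure concrete, the sketch below scores each snippet on its exposed half, selects the better-scoring half of the snippets, and computes the success rate over the hidden halves. The scorer passed in is a trivial stand-in for a learner's food-finding score, and the snippet strings are toy assumptions.

```java
// A minimal sketch of the test procedure: score exposed halves, choose half the snippets,
// and count the fraction of hidden food elements ('F') captured by the chosen snippets.
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class TestProcedureSketch {

    static double successRate(List<String> snippets, ToDoubleFunction<String> foodScore) {
        // keep only snippets with no food in the exposed first six elements
        String[] eligible = snippets.stream()
                .filter(s -> !s.substring(0, 6).contains("F"))
                .toArray(String[]::new);
        // sort by the learner's score on the exposed half, best first
        Arrays.sort(eligible, Comparator.comparingDouble(
                (String s) -> foodScore.applyAsDouble(s.substring(0, 6))).reversed());
        int chosen = eligible.length / 2;
        long foundFood = Arrays.stream(eligible, 0, chosen)
                .filter(s -> s.substring(6).contains("F")).count();
        long totalFood = Arrays.stream(eligible)
                .filter(s -> s.substring(6).contains("F")).count();
        return totalFood == 0 ? 0.0 : (double) foundFood / totalFood;
    }

    public static void main(String[] args) {
        List<String> snippets = List.of(
                "hgjklgFhgjkg",   // exposed "hgjklg", hidden section contains food
                "jkghlgjkgFhg",   // exposed "jkghlg", hidden section contains food
                "jkghghjkghgh",   // no food
                "hghjkgjkghgh");  // no food
        // a trivial stand-in scorer: prefer snippets whose exposed part ends in "lg"
        System.out.println(successRate(snippets, s -> s.endsWith("lg") ? 1.0 : 0.0)); // 1.0
    }
}
```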
Figure 3.
Mean success rate of the four learners (EBL, CBL-1, CBL-2 and CBL) and of a random choice simulation. (a) Environments whose regularities are defined over 10 single base elements and contain no high-order units. (b) Web environments whose regularities are defined over high-order units. (c) Patchy environments. Error bars depict 95% CI.
The training corpus and test sets in each simulation were unique, and thus the statistical analyses we conducted were two-tailed paired t-tests between every two learners (which were trained and tested with the same corpus and the same test set, but which differed in their learning mechanism). Unless noted otherwise, all ‘statistically significant differences' refer to paired t-tests with N = 500, d.f. = 498 and p < 0.0001, which remained highly significant also when controlling for multiple testing. Terms such as ‘similar success’ or ‘no differences' and the like refer to statistically non-significant differences (N = 500, d.f. = 498 and p > 0.05).
(d). Reaching foraging decisions
After being exposed to a sequence of input and applying to it a learning mechanism that constructs a network representing the environment (i.e. a directed graph), a learner can use its network to make foraging decisions. All learners in our framework (of all types) use the same basic decision-making process (although CBL and CLS learners can also take advantage of high-order units, and CLS learners can use similarity links; see electronic supplementary material S1 for details). Adaptive foraging is expressed in our framework as a learner's ability to choose the paths that have a high likelihood of containing food elements. To do so, the learner assigns a food-finding score to a sequence that it encounters at the beginning of each possible path (see §2c above).
The learners we implemented use a two-phase heuristic for this assessment. In the first phase, the sequence is interpreted in terms of the learned representation of the environment (i.e. it is analysed to uncover familiar units in it, which are held in a temporary memory cache). In the second phase, a score is assigned to the sequence, based on a heuristic method inspired by the spread of excitation in neural networks: treating the learned graph as a neural network, activation is injected into it at a chosen unit and propagates through the graph, and the sum of activations that reach the node representing food is taken to be the food-finding score of that unit. In the CLS learner, this activation propagates also along similarity links, thus incorporating data gleaned with regard to similar units into the assessment of a unit's food-finding score (see more details in electronic supplementary material S1).
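A minimal sketch of the spreading-activation scoring heuristic is given below; the propagation depth, the decay factor and all names are illustrative assumptions, and the treatment of chunks and similarity links is described in electronic supplementary material S1.

```java
// A minimal sketch of spreading activation over the learned graph: activation injected at a
// unit propagates along weighted links, and the activation arriving at the food node is
// accumulated as that unit's food-finding score.
import java.util.HashMap;
import java.util.Map;

public class ActivationSketch {

    // graph.get(u).get(v) = link weight (transition probability) from unit u to unit v
    static double foodFindingScore(Map<String, Map<String, Double>> graph,
                                   String startUnit, String foodUnit,
                                   int maxDepth, double decay) {
        Map<String, Double> activation = new HashMap<>();
        activation.put(startUnit, 1.0);
        double scoreAtFood = 0.0;
        for (int step = 0; step < maxDepth; step++) {
            Map<String, Double> next = new HashMap<>();
            for (Map.Entry<String, Double> node : activation.entrySet()) {
                Map<String, Double> out = graph.getOrDefault(node.getKey(), Map.of());
                for (Map.Entry<String, Double> link : out.entrySet()) {
                    next.merge(link.getKey(), node.getValue() * link.getValue() * decay, Double::sum);
                }
            }
            scoreAtFood += next.getOrDefault(foodUnit, 0.0);  // activation reaching the food node
            activation = next;
        }
        return scoreAtFood;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> graph = new HashMap<>();
        graph.put("lg", Map.of("F", 0.4, "jkg", 0.6));
        graph.put("jkg", Map.of("lg", 0.5, "hgh", 0.5));
        graph.put("hgh", Map.of("jkg", 1.0));
        // "lg" links to food directly, so it earns a higher score than "hgh".
        System.out.println(foodFindingScore(graph, "lg", "F", 3, 0.5));
        System.out.println(foodFindingScore(graph, "hgh", "F", 3, 0.5));
    }
}
```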
3. Results
(a). Web environment
(i). A null test of a web environment without hierarchies
In web environments whose regularities are defined as TPs between base elements only, and are therefore not constructed from higher-order chunks, all learners had higher foraging success than random, but the EBL mechanism had significantly higher foraging success than the chunk-based learning mechanisms (CBL-1, CBL-2 and CBL), all of which had similar success (figure 3a). This result serves as an important sanity check: in this ‘chunk-less' environment, the EBL learner is able to represent the world's structure perfectly, and so our finding is the expected result. The EBL learner's advantage over the CBL learners stems from the fact that CBL learners incorporate in their internal representation many artificial high-order chunks that are not necessary and lead to errors in assessing sequences’ probability of containing food (additionally, excessive construction of chunks may also incur computational and memory costs; see §a in electronic supplementary material S1 and §a1 in electronic supplementary material S2 for more details and further discussion).
(ii). Hierarchical environments, in which base elements take part in multiple chunks
In web environments composed of hierarchies (50 high-order units of two to four base elements, where all units are different combinations of the same six base elements), all learners had significantly higher foraging success rates than random, and chunking learners were significantly more successful than the EBL learner (figure 3b). The higher success of CBL-2 than of CBL-1 is specific to the current parameter setting, and changes with different parameter values (see electronic supplementary material S4). On the other hand, the significant advantage of CBL over CBL-1 and CBL-2 was robust in all simulations, suggesting that the two mechanisms have complementary effects on learning success (as CBL learners use both mechanisms).
To understand the complementary effect of the CBL-1 and CBL-2 mechanisms, recall that they add novel chunks as units to the repertoire in two different ways. CBL-1 acts bottom-up, concatenating known units based on gradually accumulated statistical evidence. CBL-2, on the other hand, acts top-down, adding to the repertoire sequences that recur within a short time window. The different characteristics of these two processes make them particularly adaptive in different environments or learning situations: CBL-1 is gradual and relatively reliable; it rarely introduced into the repertoire units that did not represent meaningful regularities in the environment (O.K. 2015, unpublished data). CBL-2, on the other hand, is ‘quick and dirty’; it may quickly uncover long meaningful units, but it might also introduce into the repertoire many spurious sequences that happened to recur in a short time window. Another difference between the two is that CBL-2 is not very likely to uncover rare units, as the probability of these occurring within a short time window is small, yet it can take advantage of the patchy structure of many environments: elements of a certain type tend to be clustered in many cases, like flowers on a bush or fallen leaves under a tree. The two mechanisms can thus complement one another. Interestingly, when both CBL-1 and CBL-2 are at work we find an additional complementary effect: CBL-2 incorporates into the repertoire short sequences that CBL-1 failed to construct (because the statistics of each individual element in them failed to cross the significance threshold; see §2), but then CBL-1 uses these units as building blocks of longer units, which in many cases the CBL-2 would have missed, because they do not recur frequently enough within a short time window.
Beyond the main finding that hierarchical web environments are expected to favour the evolution of chunking mechanisms (figure 3b), further examination of the effect of recognizing chunks on foraging success is provided in §c of electronic supplementary material S3, and a detailed analysis of how the chunking mechanisms and their parameters may be tuned by natural selection and individual experience to various foraging environments and to the costs of memory and computation is provided in electronic supplementary material S4.
Notably, the learning process typically results in a very large repertoire of candidate units, many of which are, in fact, subparts of units that compose the environment or concatenations of such units. For further discussion see electronic supplementary materials S1 and S4.
(iii). Hierarchical environments, in which each unit has unique base elements
Finally, we note that there is a special variant of the hierarchical web environment for which chunking mechanisms are not necessary. This is an environment in which each unit is composed of base elements that are not shared with other units. In such an environment, a unit can easily be identified by any of its idiosyncratic base elements, so that learning of the chunks does not increase the probability of finding food. Indeed, in this environment, we find that the EBL has the highest foraging success, on par with the success of CBL-2 and slightly higher than that of CBL-1 and CBL (see electronic supplementary material figure S2.1 and explanation therein).
(b). Patchy environments
We explored learning in a patchy environment with five patch types. Each patch consisted of a sequence of units drawn from a set that was specific to its patch type. The units in each patch type's set are unique to that type, but the base elements from which these units are composed are shared across patch types (see details in §2 and in electronic supplementary material S1). In this environment, no individual base element can be associated with a single patch type, and therefore learning the units through chunking is expected to be adaptive (see §b1 in electronic supplementary material S2 for the case in which base elements are not shared across patches). Indeed, in this patchy environment, as in the more general web environment, we found that all of the chunking learners had significantly greater foraging success than the EBL learners (figure 3c). Further analysis shows that the benefit of chunking in this case derives mainly from using chunks to recognize high-quality patches rather than from tracking the statistical dependencies among units within a patch (see §b1 in electronic supplementary material S2).
As noted above, the CBL-2 mechanism may be particularly adapted to exploit the patchy structure of many environments, in which units may frequently recur within short time windows. This aspect is demonstrated by a comparison of the number of correct units identified by CBL-1 and CBL-2 learners in a number of environments composed of the same number of units but differing in the extent of their patchiness (see §b2 in electronic supplementary material S2).
(c). Web environment with similar units: the benefit of generalization
In the web environment with similar units (where every unit has a ‘similar other’, a situation that may allow CLS learners to generalize from one unit to its similar counterpart), CLS learners had significantly higher foraging success than CBL learners, but the difference between the two was small (figure 4a). This was somewhat surprising: generalization from a unit A to its similar unit A’ should clearly help to infer that if A leads to food then A’ should also lead to food, even though this had never occurred during training. Despite this, even CBL learners that are unable to generalize managed to do quite well in this environment. The reason for the small difference turned out to be related to the details of our test procedure. The test requires learners to predict the presence of food in the hidden section of the test snippets (i.e. along a sequence of six base elements). CBL learners did reasonably well in this task, because whenever food was not the first element in the test snippet, they could predict its presence down the road using ‘transitive closure’ knowledge of the type: if A’ leads to B, and B leads to food, then food is eventually likely to be found following A’. Such a regularity can be learned equally well by CBL and CLS learners in this environment. When we changed the test procedure to include only one base element in the hidden section of the test snippets, so that learners had to predict the immediate presence of food (not just its overall likelihood to be found down the road), the advantage of the CLS learners increased (figure 4b). Predicting the immediate presence of food may be important in nature when searching down the road is costly or when interference competition favours those who reach the food as quickly as possible. For further discussion, see §2c in electronic supplementary material S2.
Figure 4.
Mean success rate of the two advanced learners (CBL and CLS) and of a random choice simulation in a web environment with pairs of similar units. (a) Original test procedure with six base elements in the hidden section of the test snippets. (b) Test procedure with only one base element in the hidden section of the test snippets (see text for more details). Different numbers of asterisks above bars mark statistically significant differences at p < 0.005 (corrected for multiple comparisons). Error bars depict 95% CI.
Another situation that increases the benefit of generalization and can be demonstrated without modifying our original test procedure is when food elements are directly associated with only a limited number of specific units that may be viewed as direct cues, such as the smell of a prey or the presence of its footprints. In electronic supplementary material S2, we simulate such an environment and show that, because each food element is surrounded by such direct cues, predicting the immediate presence of these cues becomes important even when food must be located within a hidden section of six base elements, giving a clear advantage to CLS over CBL learners (see electronic supplementary material, figure S2.5).
4. Discussion
In this work, we implemented three learning mechanisms, corresponding to two small incremental improvements over basic continuous learning, and compared their foraging success in a range of artificial environments. We find that realistic features of the environment can favour the transition from the relatively simple continuous learning of the structure of the environment [13] to more advanced mechanisms that can learn high-order units in the environment and generalize across units based on the similarity of their link structure. Importantly, these more advanced mechanisms, considered here in the context of animal foraging, were the same ones we used successfully to account for a range of findings in the field of language learning [26]. Accordingly, our results suggest that the basic abilities required for the learning and use of language may have evolved in the context of learning challenges that are common to many organisms, such as those that arise in animal foraging. Notably, other cognitive tasks may be served by the same mechanisms that we have described, which thus may have been selected for by pressures arising from needs other than foraging, such as learning to avoid predators, finding mates or navigating efficiently in space. We focus on a foraging framework for the sake of concreteness, and because foraging is a widespread behaviour whose effects on fitness are direct and may be readily interpreted.
It may be useful to ask when chunking, a necessary skill for learning language but also for learning other behavioural sequences (as done by various primates, for example [17,27,39]), becomes adaptive in the context of foraging. Both mechanisms that we explored (CBL-1 and CBL-2), which were inspired by empirical findings in animals and humans (reviewed in [22,40]), implicitly suggest that it is useful to encode as a chunk any sequence that recurs in the input frequently enough. The results of our simulations suggest that this is not necessarily true. Chunking is adaptive in our framework only when the chunks have different predictive value from that of their separate constituents: representing <d a d> as a unit is adaptive only if it leads to a different prediction regarding the following element than the prediction suggested by <d> alone (i.e. it depends on whether the regularities that govern the environment act on high-order hierarchies, so that the ‘d’ in <d a d> warrants a different prediction from that of the d in <m u d>).
Future studies should incorporate, on the one hand, the additional costs of computation and memory required for such chunking and the ensuing complex world representation, and on the other hand the additional potential benefits of chunking that were not considered in our framework, in the form of reducing processing time and potentially simplifying decision making. The realistic costs and benefits of these variables are unclear, and therefore they were not incorporated in our models. Our findings thus set minimal requirements for the evolution of chunking mechanisms (see note 2 in electronic supplementary material S1a and electronic supplementary material S4 for further discussion).
We have noted above (and more extensively in electronic supplementary material S4) that the values of the parameters of the learning mechanisms are likely to evolve or to be modified through experience in accordance with the typical challenges that a certain species faces in its natural ecology. Another way in which an individual may improve its learning success is by changing not the processing of the data but the stream of acquired input. This may be achieved through the evolutionary fine-tuning of data acquisition or input mechanisms [24,37,41], as well as by relatively simple behavioural modifications. For example, an organism can increase the frequency of unit recurrence in its input stream by choosing a foraging route with multiple sections of backtracking, particularly in high-value areas such as near food. Such is the type of foraging route that is empirically found in many organisms, and although this behaviour can be explained in other ways as well [42–45], it is quite possible that requirements of efficient learning have also played a role in shaping this behaviour. This is especially likely in the case of play and exploratory behaviour in young individuals, where re-inspecting objects or new locations is highly typical. An analogous behaviour is found in human child-directed speech, which typically contains frequent repetition of units within short time windows in similar, but not identical, contexts [46,47]. This phenomenon, known as ‘variation sets', is especially helpful for the CBL-2, top-down segmentation mechanism [22,38].
Our present findings, along with those of our previous explorations regarding the potential evolutionary roots of the mechanisms considered here, portray the evolution of what is typically considered ‘advanced cognition’ as a gradual process that could have taken place in many realistic settings [1,48], contrary to what is sometimes assumed with regard to language [49] (cf. [50]). This view is in line with recent advances in the fields of artificial intelligence and machine learning, where basic forms of reinforcement learning are being augmented by hierarchical processing and hierarchical reinforcement learning [51–56]. As in our framework, these paradigms involve learning of meaningful high-order units (sometimes called ‘options’ in the hierarchical reinforcement learning literature), which correspond to regularities in the environment and facilitate complex goal-oriented behaviour. Our view is also in agreement with multiple lines of theoretical and empirical exploration in cognitive science, which suggest that ‘advanced’ abilities such as goal-directed behaviour and cognitive search may share a common function and possess a common origin with foraging in physical space [57–59].
Finally, we note that approaching advanced cognition from within a foraging framework may inspire a re-evaluation of the manner in which success in language learning is assessed and may suggest certain insights with regard to the possible roots of semantics (cf. [60,61]). Much of the work in the psychology of language focuses on syntactic acceptability of sequences, and tends to sidestep contextual or functional acceptability (as in whether or not an utterance is ‘good for something’). An evolutionary approach suggests a natural measure of acceptability, or success, in the form of fitness, which in our simulations was related to the success in foraging. Drawing a direct analogy between the two—that is, viewing the directed-graph representation of the environment, formed by our learners, as a cognitive data structure whose units may be words that refer to objects or events and whose links represent temporal relations among words—one realizes that both ‘semantics' and ‘syntax’ come to be expressed in the same architecture (cf. [62]). While related ideas have been explored in the past (e.g. in the context of translation, where meaning reduces to a mapping between syntactic structures rather than between such structures and the external environment [63]; see also [64]), the development of a viable evolutionary approach to semantics awaits future work.
Acknowledgements
We thank Thomas Hills and an anonymous reviewer for their helpful comments.
Endnote
There are also other uses of the term chunking. In this work we use the term chunk to refer to a meaningful unit in the data that is viewed as such by the learner. It can be composed of multiple shorter units, as in a word that is composed of multiple syllables, and can itself also act as a part of other, longer chunks. An insightful review of chunking in human learning is found in [40].
Data accessibility
A detailed methods description, a pseudo-code of the simulation software, and extended results are available in the online electronic supplementary material.
Authors' contributions
O.K., S.E. and A.L. constructed the model and designed the research. O.K. implemented the model and executed the simulations. O.K., S.E. and A.L. analysed the simulation results, wrote the article and approved its final version.
Competing interests
The authors have no competing interests.
Funding
O.K. was partially supported by a Dean's scholarship from the Faculty of Life Sciences at Tel-Aviv University and by a Wolf Foundation award. A.L. and O.K. were partially supported by the Israel Science Foundation grant no. 1312/11.
References
- 1. Tallerman M. 2014. No syntax saltation in language evolution. Lang. Sci. 46, 207–219. (doi:10.1016/j.langsci.2014.08.002)
- 2. Chersi F, Ferro M, Pezzulo G, Pirrelli V. 2014. Topological self-organization and prediction learning support both action and lexical chains in the brain. Top. Cogn. Sci. 6, 476–491. (doi:10.1111/tops.12094)
- 3. Goddard C, Wierzbicka A, Fabrega HJ. 2014. Evolutionary semantics: using NSM to model stages in human cognitive evolution. Lang. Sci. 42, 60–79. (doi:10.1016/j.langsci.2013.11.003)
- 4. Gong T, Shuai L, Comrie B. 2014. Evolutionary linguistics: theory of language in an interdisciplinary space. Lang. Sci. 41, 243–253. (doi:10.1016/j.langsci.2013.05.001)
- 5. Pulvermuller F. 2014. The syntax of action. Trends Cogn. Sci. 18, 219–220. (doi:10.1016/j.tics.2014.01.001)
- 6. Hauser M, Chomsky N, Fitch T. 2002. The faculty of language: what is it, who has it, and how did it evolve? Science 298, 1569–1579. (doi:10.1126/science.298.5598.1569)
- 7. Aslin RN, Saffran J, Newport EL. 1999. Statistical learning in linguistic and nonlinguistic domains. In The emergence of language (ed. MacWhinney B.), pp. 359–380. London, UK: Taylor & Francis.
- 8. Trimmer PC, McNamara JM, Houston AI, Marshall JAR. 2012. Does natural selection favour the Rescorla–Wagner rule? J. Theor. Biol. 302, 39–52. (doi:10.1016/j.jtbi.2012.02.014)
- 9. Gross R, Houston AI, Collins EJ, McNamara JM, Dechaume-Moncharmont FX, Franks NR. 2008. Simple learning rules to cope with changing environments. J. R. Soc. Interface 5, 1193–1202. (doi:10.1098/rsif.2007.1348)
- 10. Niv Y, Joel D, Meilijson I, Ruppin E. 2002. Evolution of reinforcement learning in uncertain environments: a simple explanation for complex foraging behaviors. Adapt. Behav. 10, 5–24. (doi:10.1177/10597123020101001)
- 11. Lange A, Dukas R. 2009. Bayesian approximations and extensions: optimal decisions for small brains and possibly big ones too. J. Theor. Biol. 259, 503–516. (doi:10.1016/j.jtbi.2009.03.020)
- 12. Arbilly M, Motro U, Feldman MW, Lotem A. 2010. Co-evolution of learning complexity and social foraging strategies. J. Theor. Biol. 267, 573–581. (doi:10.1016/j.jtbi.2010.09.026)
- 13. Kolodny O, Edelman S, Lotem A. 2014. The evolution of continuous learning of the structure of the environment. J. R. Soc. Interface 11. (doi:10.1098/rsif.2013.1091)
- 14. Singh S, Lewis RL, Barto AG, Sorg J. 2010. Intrinsically motivated reinforcement learning: an evolutionary perspective. IEEE T. Auton. Ment. Des. 2, 70–82. (doi:10.1109/TAMD.2010.2051031)
- 15. Barto A. 2013. Intrinsic motivation and reinforcement learning. In Intrinsically motivated learning in natural and artificial systems (eds Baldassarre G, Mirolli M.), pp. 17–47. Berlin, Germany: Springer.
- 16. Berwick RC, Okanoya K, Beckers GJL, Bolhuis JJ. 2011. Songs to syntax: the linguistics of birdsong. Trends Cogn. Sci. 15, 113–121. (doi:10.1016/j.tics.2011.01.002)
- 17. Byrne RW. 1999. Imitation without intentionality. Using string parsing to copy the organization of behaviour. Anim. Cogn. 2, 63–72. (doi:10.1007/s100710050025)
- 18. Saffran JR, Aslin RN, Newport EL. 1996. Statistical learning by 8-month-old infants. Science 274, 1926–1928. (doi:10.1126/science.274.5294.1926)
- 19. Cutler A, Carter DM. 1987. The predominance of strong initial syllables in the English vocabulary. Comput. Speech Lang. 2, 133–142. (doi:10.1016/0885-2308(87)90004-0)
- 20. Thiessen ED, Saffran JR. 2003. When cues collide: use of stress and statistical cues to word boundaries by 7- to 9-month-old infants. Dev. Psychol. 39, 706–716. (doi:10.1037/0012-1649.39.4.706)
- 21. Finch S, Chater N. 1991. A hybrid approach to the automatic learning of linguistic categories. Artif. Intell. Simul. Behav. Q. 78, 16–24.
- 22. Goldstein MH, et al. 2010. General cognitive principles for learning structure in time and space. Trends Cogn. Sci. 14, 249–258. (doi:10.1016/j.tics.2010.02.004)
- 23. Harris ZS. 1954. Distributional structure. Word 10, 140–162.
- 24. Lotem A, Halpern JY. 2012. Coevolution of learning and data-acquisition mechanisms: a model for cognitive evolution. Phil. Trans. R. Soc. B 367, 2686–2694. (doi:10.1098/rstb.2012.0213)
- 25. Solan Z, Horn D, Ruppin E, Edelman S. 2005. Unsupervised learning of natural languages. Proc. Natl Acad. Sci. USA 102, 11 629–11 634. (doi:10.1073/pnas.0409746102)
- 26. Kolodny O, Lotem A, Edelman S. 2015. Learning a generative probabilistic grammar of experience: a process-level model of language acquisition. Cogn. Sci. 39, 227–267. (doi:10.1111/cogs.12140)
- 27. Hobaiter C, Byrne R. 2013. Gestural communication in great apes: intentionality, syntax, and semantics. Folia Primatol. 84, 287.
- 28. Ten Cate C. 2014. On the phonetic and syntactic processing abilities of birds: from song to speech and artificial grammars. Curr. Opin. Neurobiol. 28, 157–164. (doi:10.1016/j.conb.2014.07.019)
- 29. Simon HA. 1973. The organization of complex systems. In Models of discovery (ed. Simon HA.), pp. 245–261. Dordrecht, The Netherlands: Springer.
- 30. Shepard RN. 1987. Toward a universal law of generalization for psychological science. Science 237, 1317–1323. (doi:10.1126/science.3629243)
- 31. Modelling Animal Decisions Group, et al. 2014. The evolution of decision rules in complex environments. Trends Cogn. Sci. 18, 153–161. (doi:10.1016/j.tics.2013.12.012)
- 32. Brighton H, Smith K, Kirby S. 2005. Language as an evolutionary system. Phys. Life Rev. 2, 177–226. (doi:10.1016/j.plrev.2005.06.001)
- 33. Chater N, Vitányi PMB. 2003. The generalized universal law of generalization. J. Math. Psychol. 47, 346–369. (doi:10.1016/S0022-2496(03)00013-0)
- 34. Goldberg AE. 2005. Constructions at work: the nature of generalization in language, ch. 3. Oxford, UK: Oxford University Press.
- 35. Langacker RW. 1987. Foundations of cognitive grammar: theoretical prerequisites. Stanford, CA: Stanford University Press.
- 36. Smith K, Kirby S. 2008. Cultural evolution: implications for understanding the human language faculty and its evolution. Phil. Trans. R. Soc. B 363, 3591–3603. (doi:10.1098/rstb.2008.0145)
- 37. Lotem A, Halpern J. 2008. A data-acquisition model for learning and cognitive development and its implications for autism. Computing and Information Science Technical Reports. Ithaca, NY: Cornell University.
- 38. Onnis L, Waterfall HR, Edelman S. 2008. Learn locally, act globally: learning language from variation set cues. Cognition 109, 423–430. (doi:10.1016/j.cognition.2008.10.004)
- 39. Genty E, Breuer T, Hobaiter C, Byrne RW. 2009. Gestural communication of the gorilla (Gorilla gorilla): repertoire, intentionality and possible origins. Anim. Cogn. 12, 527–546. (doi:10.1007/s10071-009-0213-4)
- 40. Gobet F, et al. 2001. Chunking mechanisms in human learning. Trends Cogn. Sci. 5, 236–243. (doi:10.1016/S1364-6613(00)01662-4)
- 41. Heyes C. 2012. What's social about social learning? J. Comp. Psychol. 126, 193–202. (doi:10.1037/a0025180)
- 42. Bartumeus F, Da Luz MGE, Viswanathan GM, Catalan J. 2005. Animal search strategies: a quantitative random-walk analysis. Ecology 86, 3078–3087. (doi:10.1890/04-1806)
- 43. Benhamou S. 2007. How many animals really do the Levy walk? Ecology 88, 1962–1969. (doi:10.1890/06-1769.1)
- 44. Hills TT, Kalff C, Wiener JM. 2013. Adaptive levy processes and area-restricted search in human foraging. PLoS ONE 8, e60488. (doi:10.1371/journal.pone.0060488)
- 45. Reynolds AM, Rhodes CJ. 2009. The Levy flight paradigm: random search patterns and mechanisms. Ecology 90, 877–887. (doi:10.1890/08-0153.1)
- 46. Brodsky P, Waterfall HR, Edelman S. 2007. Characterizing motherese: on the computational structure of child-directed language. In Proc. 29th Cognitive Science Society Conf., pp. 833–838. Austin, TX: Cognitive Science Society.
- 47. Onnis L, Waterfall HR, Edelman S. 2008. Variation sets facilitate artificial language learning. In Proc. 30th Cognitive Science Society Conf., Washington, DC. See http://kybele.psych.cornell.edu/∼edelman/OnnisWaterfallEdelman-variation-sets.CogSci08.pdf.
- 48. Kirby S, Griffiths T, Smith K. 2014. Iterated learning and the evolution of language. Curr. Opin. Neurobiol. 28, 108–114. (doi:10.1016/j.conb.2014.07.014)
- 49. Chomsky N. 1972. Language and mind, p. 194. New York, NY: Harcourt Brace Jovanovich.
- 50. Pinker S, Bloom P. 1990. Natural language and natural selection. Behav. Brain Sci. 13, 707–784. (doi:10.1017/S0140525X00081061)
- 51. Barto AG, Mahadevan S. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dyn. Syst. 13, 343–379.
- 52. Botvinick MM, Niv Y, Barto AC. 2009. Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition 113, 262–280. (doi:10.1016/j.cognition.2008.08.011)
- 53. Choi MJ, Lim JJ, Torralba A, Willsky AS. 2010. Exploiting hierarchical context on a large database of object categories. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), San Francisco, CA. See http://people.csail.mit.edu/lim/paper/cltw_cvpr10.pdf.
- 54. Cooper RP, Shallice T. 2006. Hierarchical schemas and goals in the control of sequential behavior. Psychol. Rev. 113, 887–916. (doi:10.1037/0033-295X.113.4.887)
- 55. Paine RW, Tani J. 2005. How hierarchical control self-organizes in artificial adaptive systems. Adapt. Behav. 13, 211–225. (doi:10.1177/105971230501300303)
- 56. Ribas-Fernandes JJF, et al. 2011. A neural signature of hierarchical reinforcement learning. Neuron 71, 370–379. (doi:10.1016/j.neuron.2011.05.042)
- 57. Hills TT. 2006. Animal foraging and the evolution of goal-directed cognition. Cogn. Sci. 30, 3–41. (doi:10.1207/s15516709cog0000_50)
- 58. Hills TT, Todd PM, Goldstone RL. 2010. The central executive as a search process: priming exploration and exploitation across domains. J. Exp. Psychol. Gen. 139, 590–609. (doi:10.1037/a0020666)
- 59. Hills TT, Todd PM, Lazer D, Redish AD, Couzin ID. 2014. Exploration versus exploitation in space, mind, and society. Trends Cogn. Sci. 19, 46–54.
- 60. Bartlett M, Kazakov D. 2005. The origins of syntax: from navigation to language. Connect. Sci. 17, 271–288. (doi:10.1080/09540090500282479)
- 61. Kazakov D, Bartlett M (eds). 2005. Could navigation be the key to language? In Proc. 2nd Symp. on the Emergence and Evolution of Linguistic Communication (EELC'05). See http://www.cs.york.ac.uk/aig/papers/Kazakov_2005.pdf.
- 62. Hurford JR. 2002. Expression/induction models of language evolution: dimensions and issues. In Linguistic evolution through language acquisition: formal and computational models (ed. Briscoe T.), pp. 301–344. Cambridge, UK: Cambridge University Press.
- 63. Edelman S, Solan Z. 2009. Machine translation using automatically inferred construction-based correspondence and language models. In Proc. 23rd Pacific Asia Conf. on Language, Information, and Computation (PACLIC). See http://kybele.psych.cornell.edu/∼edelman/Edelman-Solan-PACLIC09.pdf.
- 64. Evans V. 2006. The evolution of semantics. In Encyclopaedia of language and linguistics, pp. 345–353. Oxford, UK: Elsevier.