eLife. 2024 May 7;13:e85694. doi: 10.7554/eLife.85694

Collaborative hunting in artificial agents with deep reinforcement learning

Kazushi Tsutsui 1,2, Ryoya Tanaka 2,3, Kazuya Takeda 1,4, Keisuke Fujii 1,2,5,6
Editors: Malcolm A MacIver7, Michael J Frank8
PMCID: PMC11076040  PMID: 38711355

Abstract

Collaborative hunting, in which predators play different and complementary roles to capture prey, has been traditionally believed to be an advanced hunting strategy requiring large brains that involve high-level cognition. However, recent findings that collaborative hunting has also been documented in smaller-brained vertebrates have placed this previous belief under strain. Here, using computational multi-agent simulations based on deep reinforcement learning, we demonstrate that decisions underlying collaborative hunts do not necessarily rely on sophisticated cognitive processes. We found that apparently elaborate coordination can be achieved through a relatively simple decision process of mapping between states and actions related to distance-dependent internal representations formed by prior experience. Furthermore, we confirmed that this decision rule of predators is robust against unknown prey controlled by humans. Our computational ecological results emphasize that collaborative hunting can emerge in various intra- and inter-specific interactions in nature, and provide insights into the evolution of sociality.

Research organism: Human

eLife digest

From wolves to ants, many animals are known to be able to hunt as a team. This strategy may yield several advantages: going after bigger prey together, for example, can often result in individuals spending less energy and accessing larger food portions than when hunting alone. However, it remains unclear whether this behavior relies on complex cognitive processes, such as the ability for an animal to represent and anticipate the actions of its teammates. It is often thought that ‘collaborative hunting’ may require such skills, as this form of group hunting involves animals taking on distinct, tightly coordinated roles – as opposed to simply engaging in the same actions simultaneously.

To better understand whether high-level cognitive skills are required for collaborative hunting, Tsutsui et al. used a type of artificial intelligence known as deep reinforcement learning. This allowed them to develop a computational model in which a small number of ‘agents’ had the opportunity to ‘learn’ whether and how to work together to catch a ‘prey’ under various conditions. To do so, the agents were only equipped with the ability to link distinct stimuli together, such as an event and a reward; this is similar to associative learning, a cognitive process which is widespread amongst animal species.

The model showed that the challenge of capturing the prey when hunting alone, and the reward of sharing food after a successful hunt drove the agents to learn how to work together, with previous experiences shaping decisions made during subsequent hunts. Importantly, the predators started to exhibit the ability to take on distinct, complementary roles reminiscent of those observed during collaborative hunting, such as one agent chasing the prey while another ambushes it.

Overall, the work by Tsutsui et al. challenges the traditional view that only organisms equipped with high-level cognitive processes can show refined collaborative approaches to hunting, opening the possibility that these behaviors may be more widespread than originally thought – including between animals of different species.

Introduction

Cooperation among animals often provides fitness benefits to individuals in a competitive natural environment (Smith, 1982; Axelrod and Hamilton, 1981). Cooperative hunting, in which two or more individuals engage in a hunt to successfully capture prey, has been regarded as one of the most widely distributed forms of cooperation in animals (Packer and Ruttan, 1988), and has received considerable attention because of the close links between cooperative behavior, its apparent cognitive demand, and even sociality (Macdonald, 1983; Creel and Creel, 1995; Brosnan et al., 2010; Lang and Farine, 2017). Cooperative hunts have been documented in a wide variety of species (Lang and Farine, 2017; Bailey et al., 2013), yet ‘collaboration’ (or ‘collaborative hunting’), in which predators play different and complementary roles, has been reported in only a handful of vertebrate species (Stander, 1992; Boesch and Boesch, 1989; Gazda et al., 2005). For instance, previous studies have shown that mammals such as lions and chimpanzees are capable of dividing roles among individuals, such as when chasing prey or blocking the prey’s escape path, to facilitate capture by the group (Stander, 1992; Boesch and Boesch, 1989). Collaborative hunts appear to be achieved through elaborate coordination with other hunters, and are often believed to be an advanced hunting strategy requiring large brains that involve high-level cognition such as aspects of theory of mind (Boesch and Boesch-Achermann, 2000; Boesch, 2002).

However, recent findings have placed this previous belief under strain. In particular, cases of intra- and inter-specific collaborative hunting have also been demonstrated in smaller-brained vertebrates such as birds (Bednarz, 1988), reptiles (Dinets, 2015), and fish (Bshary et al., 2006; Steinegger et al., 2018). It seems possible that apparently elaborate hunting behavior can emerge in a relatively simple decision process in response to ecological needs (Steinegger et al., 2018). However, the decision process underlying collaborative hunting remains poorly understood because most previous studies thus far have relied exclusively on behavioral observations. Observational studies are essential for documenting such natural behavior, yet it is often difficult to identify the specific decision process that results in coordinated behavior. This limitation arises because seemingly simple behavior can result from complex processes (Evans et al., 2019) and vice versa (Couzin et al., 2002).

We, therefore, sought to further our understanding of the processes underlying collaborative hunting by adopting a different approach, namely, computational multi-agent simulation based on deep reinforcement learning. Deep reinforcement learning mechanisms were originally inspired by animal associative learning (Sutton and Barto, 1981), and are thought to be closely related to neural mechanisms for reward-based learning centering on dopamine (Schultz et al., 1997; Samejima et al., 2005; Doya, 2008). Given that associative learning is likely to be the most widely adopted learning mechanism in animals (Mackintosh, 1974; Wynne, 2001), collaborative hunting could arise through associative learning, where simple decision rules are developed based on behavioral cues [i.e. contingencies of reinforcement (Skinner, 2014)].

Specifically, we first explored whether predator agents based on deep reinforcement learning learn decision rules resulting in collaborative hunting and, if so, under what conditions through predator-prey interactions in a computational ecological environment. We then examined what internal representations are associated with the decision rules. Furthermore, we confirmed the generality of the acquired predators’ decision rules using joint plays between agents (predators) and humans (prey). Notably, our predator agents successfully learned to collaborate in capturing their prey solely through a reinforcement learning algorithm, without employing explicit mechanisms comparable to aspects of theory of mind (Yoshida et al., 2008; Foerster, 2019; Hu and Foerster, 2020). Moreover, our results showed that the acquisition of decision rules resulting in collaborative hunting is facilitated by a combination of two factors: the difficulty of capturing prey during solitary hunting, and food (i.e. reward) sharing following capture. We also found that decisions underlying collaborative hunts were related to distance-dependent internal representations formed by prior experience. Furthermore, the decision rules worked robustly against unknown prey controlled by humans. These findings provide insight that collaborative hunts do not necessarily require sophisticated cognitive mechanisms and that simple decision rules based on mappings between states and actions can be practically useful in nature. Our results support the recent suggestions that the underlying processes facilitating collaborative hunting can be relatively simple (Lang and Farine, 2017).

Results

We set out to model the decision process of predators and prey in an interactive environment. In this study, we focused on a chase and escape scenario in a two-dimensional open environment. Chase and escape is a potentially complex phenomenon in which two or more agents interact in environments that change from moment to moment. Nevertheless, many studies have shown that the rules of chase/escape behavior (e.g. which direction to move at each time in a given situation) can be described by relatively simple mathematical models consisting of the current state (e.g. positions and velocities) (Brighton et al., 2017; Tsutsui et al., 2020; Howland, 1974). We, therefore, considered modeling the agent’s decision process in a standard reinforcement learning framework for a finite Markov decision process in which each sequence is a distinct state. In this framework, the agent interacts with the environment through a sequence of states, actions, and rewards, and aims to select actions in a manner that maximizes cumulative future reward (Sutton and Barto, 2018).

Exploring the conditions under which collaborative hunting emerges

We first performed computational simulations with three experimental conditions to investigate the conditions under which collaborative hunting emerges (Figure 1b; Videos 1–3). As experimental conditions, we selected the number of predators, relative mobility, and prey (reward) sharing based on ecological findings (Bailey et al., 2013; Lang and Farine, 2017). For the number of predators, three conditions were set: 1 (one), 2 (two), and 3 (three). In all these conditions, the number of prey was set to 1. For the relative mobility, three conditions were set: 120% (fast), 100% (equal), and 80% (slow), which represented the acceleration of the predator, based on that of the prey. For the prey sharing, two conditions were set: with sharing (shared), in which all predators were rewarded when a predator catches the prey, and without sharing (individual), in which a predator was rewarded only when it catches the prey by itself. In total, there were 15 conditions.

Figure 1. Agent architecture and examples of movement trajectories.

(a) An agent’s policy is represented by a deep neural network (see Methods). A state of the environment is given as input to the network. An action is sampled from the network’s output, and the agent receives a reward and a subsequent state. The agent learns to select actions that maximize cumulative future rewards. In this study, each agent learned its policy network independently, that is, each agent treats the other agents as part of the environment. This illustration shows a case with three predators. (b) The movement trajectories are examples of interactions between predator(s) (dark blue, blue, and light blue) and prey (red) that overlay 10 episodes in each experimental condition. The experimental conditions were set as the number of predators (one, two, or three), relative mobility (fast, equal, or slow), and reward sharing (individual or shared), based on ecological findings.


Figure 1—figure supplement 1. Network architecture.


The neural network is composed of four layers. The input to the neural network was the state and the output was each possible action, namely, a total of 13 actions: ‘acceleration’ in 12 directions every 30 degrees in the relative coordinate system, plus ‘do nothing.’ After the first two hidden layers of the MLP with 64 units, the network branches off into two streams. Each branch has one MLP layer with 32 hidden units. Rectified linear unit (ReLU) was used as the activation function for each layer. In the visualization of the agents’ internal representations, the 32-dimensional hidden vector (parts filled in gray) was embedded in two dimensions, using t-distributed stochastic neighbor embedding (t-SNE).
Figure 1—figure supplement 2. Diagram of model input.


We used position and velocity information as the state (model input) for each agent. Assuming subjective observation, each variable, except absolute position, was converted to a relative coordinate system centered on the opponent (prey for predators, nearest predator for prey) and input to the model. Moreover, for the prey input in the three-predator condition, the predator indices were sorted according to the distance between the prey and each predator; specifically, predator 1, predator 2, and predator 3 were labeled in descending order of distance. Similarly, in the predator input in the three-predator condition, each predator set itself as predator 1, the closer of the other predators as predator 2, and the farther as predator 3. In the figure, abs., rel., bold p, and bold v denote absolute, relative, position, and velocity, respectively.

Video 1. Example videos in the one-predator conditions.


Video 2. Example videos in the two-predator conditions.


Video 3. Example videos in the three-predator conditions.


As the example trajectories show, under the fast and equal conditions, the predators often caught their prey shortly after the episode began, whereas under the slow condition, the predators somewhat struggled to catch their prey (Figure 1b). To evaluate their behavior, we calculated the proportion of predations that were successful and the mean episode duration. For the fast and equal conditions, predations were successful in almost all episodes, regardless of the number of predators and the presence or absence of reward sharing (e.g. 0.99 ± 0.00 for the one × fast and one × equal conditions; Figure 2—figure supplement 1). This indicates that in situations where predators were faster than or equal in speed to their prey, they almost always succeeded in capturing the prey, even when they were the sole predator. Although the mean episode duration decreased with an increasing number of predators in both fast and equal conditions, the difference was small. As a whole, these results indicate that there is little benefit of cooperation among multiple predators in the fast and equal conditions. As it is unlikely that cooperation among predators will emerge under such conditions in nature from an evolutionary perspective (Smith, 1982; Axelrod and Hamilton, 1981), the analysis below is limited to the slow condition. For the slow condition, a solitary predator was rarely successful, and the proportion of predations that were successful increased with the number of predators (Figure 2a). Moreover, the mean duration decreased with an increasing number of predators (Figure 2a bottom). These results indicate that, under the slow condition, the benefits of cooperation among multiple predators are significant. In addition, except for the two × individual condition, the increase in the proportion of success with an increasing number of predators was much greater than the theoretical prediction (Packer and Ruttan, 1988), calculated based on the proportion of solitary hunting, assuming that each predator’s performance is independent of the others’ (see Methods).
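The theoretical prediction referenced here is detailed in the Methods; as a minimal illustration, if each of $n$ predators hunted independently with the solitary success probability $p_1$, the group success rate expected under that independence assumption would be

$$P_n = 1 - (1 - p_1)^{n},$$

i.e. the probability that at least one of the $n$ predators succeeds. Group success rates above this baseline are what the text describes as exceeding the theoretical prediction.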

Figure 2. Emergence of collaborations among predators.

(a) Proportion of predations that were successful (top) and mean episode duration (bottom). For both panels, quantitative data denote the mean of 100 episodes ± SEM across 10 random seeds. The error bars are barely visible because the variation is negligible. The theoretical prediction values were calculated based on the proportion of solitary hunts (see Methods). The proportion of predations that were successful increased as the number of predators increased (F_number(2,18) = 1346.67, p<0.001; η² = 0.87; one vs. two: t(9) = 20.38, p<0.001; two vs. three: t(9) = 38.27, p<0.001). The mean duration decreased with increasing number of predators (F_number(2,18) = 1564.01, p<0.001; η² = 0.94; one vs. two: t(9) = 15.98, p<0.001; two vs. three: t(9) = 40.65, p<0.001). (b) Typical example of different predator routes between the individual (left) and shared (right) conditions, in the two-predator condition. The numbers (1–3) show a series of state transitions (every second) starting from the same initial position. Each panel shows the agent positions and the trajectories leading up to that state. In these instances, the predators ultimately failed to capture the prey within the time limit (30 s) under the individual condition, whereas the predators successfully captured the prey in only 3 s under the shared condition. (c) Comparison of heat maps between individual (left) and shared (right) reward conditions. The heat maps of each agent were constructed based on the frequency of stay in each position, which was cumulative for 1000 episodes (100 episodes × 10 random seeds). In the individual condition, there were relatively high correlations between the heat maps of the prey and each predator, regardless of the number of predators (one: r = 0.95, p<0.001; two: r = 0.83, p<0.001 in predator 1, r = 0.78, p<0.001 in predator 2; three: r = 0.41, p<0.001 in predator 1, r = 0.56, p<0.001 in predator 2, r = 0.45, p<0.001 in predator 3). In contrast, in the shared condition, only one predator had a relatively high correlation, whereas the others had low correlations (two: r = 0.65, p<0.001 in predator 1, r = 0.01, p=0.80 in predator 2; three: r = 0.17, p<0.001 in predator 1, r = 0.54, p<0.001 in predator 2, r = 0.03, p=0.23 in predator 3).


Figure 2—figure supplement 1. Proportion of predations that were successful, mean episode duration, and heat maps for each condition.


For both panels, quantitative data denote the mean of 100 episodes ± SEM across 10 random seeds. The theoretical prediction values were calculated based on the proportion of solitary hunts (see Methods). The heat map of each agent was constructed based on the frequency of stay in each position, which was cumulative for 1,000 episodes (100 episodes × 10 random seeds).
Figure 2—figure supplement 2. Circular histogram, concordance rate, and circular correlation.


To visualize the association between each predator in the two- and three-predator conditions and the baseline, which is the predator in the one-predator condition, we produced overlays of the frequency of selection for each action. We calculated concordance rates and circular correlations to quantitatively evaluate these associations of action selection. The concordance rate has the advantage of being able to compare all 13 actions, whereas the circular correlation has the advantage of being able to consider the proximity among each action, although it can only evaluate 12 actions, excluding ‘do nothing.’ As shown in this figure, the two indices showed similar trends. The predators whose heat maps were similar to that of their prey tended to have higher values on these indices. For all panels, quantitative data denote the mean of 100 episodes ± SEM across 10 random seeds.
Figure 2—figure supplement 3. Scaled distance among predators and proportion of prey capture.


Scaled distance is a measure of how far a predator moves to capture its prey compared with the other predators during a hunt. Although simplified, this distribution reflects the role each predator played in a hunt. Specifically, if there is a large difference in the scaled distance among predators, individuals with a larger scaled distance (greater than 1) could play the role of ‘chaser’ (or ‘driver’), while individuals with a smaller scaled distance (less than 1) could play the role of ‘blocker’ (or ‘ambusher’). Moreover, if these distances do not differ among predators (concentrated near 1), it is likely that each predator pursued prey in the same manner, suggesting that there was no role division during the hunt. Furthermore, these distributions can be used to capture the flexibility of role division among predators. That is, if the distributions are separate and do not overlap, there is a division of roles among predators, and these roles are fixed in any hunt. On the other hand, if the distributions are separate but some of them overlap, the roles may have switched across hunts. Our results show that the distribution in the individual condition was concentrated around 1, whereas in the shared condition it was divided among individuals. This means that there was rarely role division among predators in the individual condition, while there was role division in the shared condition. These characteristics were more pronounced in the two-predator condition than in the three-predator condition. Perhaps this depends on the episode duration; the duration tends to be longer in the two-predator condition, and the difference in distance is likely to be clearer. Note that even under the two-predator condition, there was some overlap in the distribution in the shared condition. This indicates that the basic roles were fixed among individuals, but interchanged according to the situation (or episode) in the condition. Additionally, because these role divisions are often discussed in the context of cooperation and cheating, we calculated the proportion of prey capture for each predator. The results did not indicate which role was more likely to catch the prey. That is, in the shared × two conditions, the chaser tended to catch more prey, but, on the other hand, in the shared × three conditions, the blocker tended to catch more prey. For all panels, quantitative data denote the mean of 100 episodes ± SEM across 10 random seeds.
Figure 2—figure supplement 4. Typical example of coordinated hunting behavior in the three × individual condition.


The numbers (1 to 3) show a series of state transitions (every 0.6 s). Each panel shows the agent positions and the trajectories leading up to that state.

Then, we examined agent behavioral patterns and found that there were differences in the movement paths that predators take to catch their prey among the conditions (Figure 2b). As shown in the typical example, under the individual condition, both predators moved in a similar manner toward their prey (Figure 2b left) and, in contrast, under the shared condition, one predator moved toward their prey while the other predator moved along a different route (Figure 2b right). To ascertain their behavioral patterns, we created heat maps showing the frequency of agent presence at each location in the area (Figure 2c). We found that there was a noticeable difference between the individual and shared reward conditions. In the individual condition, the heat maps of prey and respective predators were quite similar (Figure 2c), whereas this was not always the case in the shared condition (Figure 2c). In particular, the heat maps of predator 2 in the two-predator condition and predator 3 in the three-predator condition showed localized concentrations (Figure 2c far right, respectively). To assess these differences among predators in more detail, we compared the predators’ decisions (i.e. action selections) in these conditions with those in the one-predator condition (i.e. solitary hunts) using two indices, concordance rate and circular correlation (Berens, 2009; Figure 2—figure supplement 2). Following previous studies (Scheel and Packer, 1991), we also calculated the ratios of distance moved during hunting among predators (Figure 2—figure supplement 3). Overall, these findings support the idea that predators with heat maps similar to their prey acted as ‘chasers’ (or ‘drivers’), while predators with different heat maps behaved as ‘blockers’ (or ‘ambushers’). That is, our results show that, although most predators acted as chasers, some predators acted as blockers rather than chasers in the shared condition, indicating the emergence of collaborative hunting characterized by role divisions among predators under the condition.
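As a rough sketch of how such occupancy heat maps and their pairwise correlations could be computed from logged position trajectories (the grid resolution, random placeholder data, and function names below are illustrative assumptions, not the authors' analysis code):

```python
import numpy as np
from scipy.stats import pearsonr

def occupancy_heatmap(positions, bins=50, extent=(-1.0, 1.0)):
    """Count how often an agent occupies each cell of the play area.

    positions: array of shape (n_timesteps, 2) with x, y in [-1, 1].
    """
    hist, _, _ = np.histogram2d(
        positions[:, 0], positions[:, 1],
        bins=bins, range=[list(extent), list(extent)]
    )
    return hist

def heatmap_correlation(map_a, map_b):
    """Pearson correlation between two flattened occupancy maps."""
    return pearsonr(map_a.ravel(), map_b.ravel())

# Illustrative usage with random trajectories standing in for logged episodes.
rng = np.random.default_rng(0)
prey_xy = rng.uniform(-1, 1, size=(3000, 2))
pred_xy = rng.uniform(-1, 1, size=(3000, 2))
r, p = heatmap_correlation(occupancy_heatmap(prey_xy), occupancy_heatmap(pred_xy))
```

A high correlation between a predator's map and the prey's map is read here as "chaser-like" behavior; a low correlation as "blocker-like" behavior.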

Mechanistic interpretability of collaboration

We next examined the predators’ internal representations to better understand how such collaborative hunting is accomplished. Using a two-dimensional t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008), we visualized the last hidden layers of the state and action streams in the policy network as internal representations of agents (Figure 3, Figure 3—figure supplements 1–3). To understand how each agent represents its environment and what aspects of the state are well represented, we examined the relationship between the scenes of a typical scenario and their corresponding points on the embedding (Figure 3a and b). As expected, when the predator is likely to catch its prey (e.g. scene 4), the predator estimated a higher state value, whereas, when the predator is not (e.g. scene 5), the predator estimated a lower state value (Figure 3a top). Related to this, the variance of action values tends to be larger for both predator and prey when they are close (Figure 3a bottom), indicating that the difference in the value of choosing each action is greater when the choice of action is directly related to the reward (see also Figure 3—figure supplement 4). These results suggest that the agents were able to learn networks that output estimations of state and action values consistent with our intuition.

Figure 3. Embedding of internal representations underlying collaborative hunting.

(a) Two-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of the representations in the last hidden layers of the state-value stream (top) and action-value stream (bottom) in the shared reward condition. The representation is assigned by the policy network of each agent to states experienced during predator-prey interactions. The points are colored according to the state values and standard deviation of the action values, respectively, predicted by the policy network (ranging from dark red (high) to dark blue (low)). (b) Corresponding states for each number in each embedding. The number (1–5) in each embedding corresponds to a selected series of state transitions. The series of agent positions in the state transitions (every second) and, for ease of visibility, the trajectories leading up to that state are shown. (c) Embedding colored according to the distances between predators and prey in the individual (left) and shared (right) reward conditions. Distances 1 and 2 denote the distances between predator 1 and prey and predator 2 and prey, respectively. If both distances are short, the point is colored blue; if both are long, it is colored white.


Figure 3—figure supplement 1. Two-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of the representations in the last hidden layers of the state-value stream (top) and action-value stream (bottom) in the individual reward condition, in the slow × two conditions.


The representation is assigned by the policy network of each agent to states experienced during predator-prey interactions. The points are colored according to the state values and standard deviation of the action values, respectively, predicted by the policy network (ranging from dark red (high) to dark blue (low)).
Figure 3—figure supplement 2. Two-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding colored according to the absolute coordinates of itself in the individual (left) and shared (right) reward conditions, in the slow × two conditions.


The absolute coordinates (i.e. x and y positions) are directly associated with the reward, as are the distances between prey and predators, because each agent receives a negative reward (-1) for leaving the play area. We, therefore, colored the internal representation of the agent according to its position. The upper left corner of the play area corresponds to cyan, the lower left to green, the upper right to white, and the lower right to yellow. The embedding of state representations in the prey seems to be roughly clustered according to absolute position, compared to those of the predators. These indicate that prey might estimate the state and action values and make decisions associated with absolute position-dependent representations compared to predators.
Figure 3—figure supplement 3. Two-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of the representations in the last hidden layers of the state-value stream and action-value stream, in the slow × three conditions.


The points are colored according to the state values and standard deviation of the action values predicted by the policy network (top), the distances between prey and predators (middle), and absolute coordinates of itself, respectively.
Figure 3—figure supplement 4. Corresponding state-action values (Q-values) for each state.


We show the value for each action of each agent in a selected series of state transitions (scenes 1 to 5). The panels on the left side of the figure are the same as the plot in Figure 3. For both predator and prey, each action is defined in terms of a relative coordinate system to the opponent. In other words, action 1 denotes movement toward the opponent (for the prey, the nearest predator) and action 7 denotes movement in the opposite direction of the opponent. Thus, in the estimated action values of the prey, actions 5 to 9 tend to show relatively high values, and in those of predators, actions 1, 2, 3, 11, and 12 tend to show relatively high values. These results suggest that proximity to rewards plays a major role in estimating state-action values.
Figure 3—figure supplement 5. Rule-based predator agent architectures.


For consistency with the deep reinforcement learning agents, the input to the rule-based agents used to make decisions is limited to the current information (e.g. position and velocity), and the output is provided in a relative coordinate system to the prey. The predator first determines whether it, or another predator, is closer to the prey, and then, if the other predator is closer, it determines whether distance 2 is less than the specified distance threshold. The decision rule for each predator is selected by this branching, with predator 1 adopting the three rules ‘chase,’ ‘shortcut,’ and ‘approach,’ and predator 2 adopting the two rules ‘chase’ and ‘ambush’ (see Methods for details). In the chase, the predator first determines whether it is near the outer edge of the play area and, if so, selects actions that will prevent it from leaving the play area. If the predator is not on the outside of the play area, then it determines whether the prey is on the inside of the play area, and, if so, selects actions that will drive them to the outside. In other situations, it selects actions so that the direction of movement is aligned with that of their prey. In the shortcut, the predator determines whether it is near the outer edge of the play area, and if so, selects the actions described above, otherwise selects actions that will produce shorter paths to the prey. In the approach, the predator determines whether it is near the outer edge of the play area and, if so, selects the actions described above, otherwise it selects actions that move it toward the prey. In the ambush, the predator selects actions that move toward the top center or bottom center of the play area and remain there until the situation changes.
Figure 3—figure supplement 6. Movement trajectories (left) and heat maps (right) of the rule-based predator agents.


The movement trajectories are examples of predator(s) (dark blue and blue) and prey (red) interactions that overlay 10 episodes. The proportion of successful predation and mean episode duration were 74.4 ± 0.54 and 12.9 ± 0.12 (mean of 100 episodes ± SEM across 10 random seeds), respectively. The heat map of each agent was constructed based on the frequency of stay in each position, which is cumulative for 1,000 episodes (100 episodes × 10 random seeds). One predator had a relatively high correlation between the heat maps, whereas the other had a low correlation (r = 0.60, p<0.001 in predator 1, r = 0.20, p<0.001 in predator 2). This trend in results is similar to that in the deep reinforcement learning predator agents (compare Figure 2c). Note that, in the rule-based agent simulation, the prey’s decision was made by the policy network from the two × shared condition (i.e. predator: rule-based agent vs. prey: deep reinforcement learning agent).
Figure 3—figure supplement 7. Two-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of the representations in the last hidden layers of the linear network (top) and the nonlinear network (bottom) in behavioral cloning.


The embedding is colored according to the distances between predators and prey. Distances 1 and 2 denote the distance between predator 1 and prey and predator 2 and prey, respectively. If both distances are short, the point is colored blue; if both are long, it is colored white. Note that the accuracy in behavioral cloning from the top 1 to the top 5 was, in ascending order, 0.47, 0.61, 0.71, 0.78, 0.80 for predator 1 and 0.44, 0.55, 0.60, 0.66, and 0.70 for predator 2 for the linear network, and 0.65, 0.77, 0.82, 0.88, and 0.95 for predator 1 and 0.68, 0.78, 0.82, 0.85, and 0.90 for predator 2 for the nonlinear network.
Figure 3—figure supplement 8. Histogram of the state value (V-value) in the individual (left) and shared (right) conditions.


In coloring the embedding, the lower and upper limits of coloring were set to the fifth percentile and 95th percentile (gray lines), respectively, to prevent visibility from being compromised by extreme values that rarely occur.
Figure 3—figure supplement 9. Histogram of the standard deviation of state-action values (Q-values) in individual (left) and shared (right) conditions.


In coloring the embedding, the lower and upper limits of coloring were set to the fifth percentile and 95th percentile (gray lines), respectively, to prevent visibility from being compromised by extreme values that rarely occur.
Figure 3—figure supplement 10. Histogram of the distance between the prey and each predator in individual (left) and shared (right) conditions.


In coloring the embedding, the lower and upper limits of coloring were set to the fifth percentile and 95th percentile (gray lines), respectively, to prevent visibility from being compromised by extreme values that rarely occur.
Figure 3—figure supplement 11. Histogram of the distance between the prey and each predator in the simulations, using rule-based predator agents.


In coloring the embedding in behavioral cloning, the lower and upper limits of coloring were set to the fifth percentile and 95th percentile (gray lines), respectively, to prevent visibility from being compromised by extreme values that rarely occur.

Furthermore, we found a distinct feature in the embedding of the predators’ representations. Specifically, in certain state transitions, the position of the points on the embedding changed little, even though the agents were moving (e.g. scenes 1–2 on the embedding of predator 2). From this, we deduced that the predators’ representations may be more focused on encoding the distance between themselves and others, rather than the specific locations of both parties. To test our reasoning, we colored the representations according to the distance between predators and prey; distance 1 denotes the distance between predator 1 and the prey, and distance 2 denotes that between predator 2 and the prey. As a result, the representations of predators in the shared condition could be clearly separated by the distance-dependent coloration (Figure 3c right), in contrast to those in the individual condition (Figure 3c left). These results indicate that the predators in the shared condition estimated state and action values and made decisions associated with distance-dependent representations (see Figure 3—figure supplement 2 for the prey’s decision).
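For illustration, the embedding-and-coloring step could look roughly like the sketch below, assuming the 32-dimensional activations of a stream's last hidden layer have already been collected for a set of logged states; the placeholder arrays and t-SNE hyperparameters are assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data standing in for logged quantities: `hidden` would be the
# (n_states, 32) activations of a stream's last hidden layer, and
# `dist_to_prey` the predator-prey distance recorded for each state.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2000, 32))
dist_to_prey = rng.uniform(0.0, 2.0, size=2000)

# Embed the 32-dimensional representations into two dimensions with t-SNE.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(hidden)

# Coloring the embedding by distance shows whether nearby points in
# representation space correspond to similar predator-prey distances.
plt.scatter(embedding[:, 0], embedding[:, 1], c=dist_to_prey, s=2, cmap="viridis")
plt.colorbar(label="distance to prey")
plt.show()
```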

Evaluating the playing strength of predator agents using joint play with humans

Finally, to verify the generality of predators’ decisions against unknown prey, we conducted an experiment of joint play between agents and humans. In the joint play, human participants controlled the prey on a screen using a joystick. The objective, as in the computational simulation described above, was to evade capture until the end of the episode (30 s) while remaining within the area. The outcomes of the joint play showed trends similar to those of the computer simulation (Figure 4a): the proportion of predations that were successful increased and the mean episode duration decreased as the number of predators increased. These results indicate that the predator agents’ decision rules worked well against prey controlled by humans. To visualize the associations of states experienced by predator agents versus agents and versus humans, we show colored two-dimensional t-SNE embeddings of the representations in the last hidden layers of the state and action streams (Figure 4b, Figure 4—figure supplement 1). These showed that, in contrast to a previous study (Mnih et al., 2015), the states were quite distinct, suggesting that the predator agents experienced unfamiliar states when playing against prey controlled by humans. This unfamiliarity may make it difficult for predators to make proper decisions. Indeed, in the one-predator condition, the predator agent occasionally exhibited odd behavior (e.g. staying in one place; see Figure 4—figure supplement 2). On the other hand, in the two- and three-predator conditions, predator agents rarely exhibited such behavior and showed superior performance. This indicates that decision rules of cooperative hunting acquired in certain environments can be applied in other, somewhat different environments.

Figure 4. Superior performance of predator agents for prey controlled by humans and comparison of internal representations.

(a) Proportion of predations that were successful (top) and mean episode duration (bottom). For both panels, the thin line denotes the performance of each participant, and the thick line denotes the mean. The theoretical prediction values were calculated based on the mean proportion of solitary hunts. The proportion of predations that were successful increased as the number of predators increased (F_number(1.28, 11.48) = 276.20, p<0.001; η² = 0.90; one vs. two: t(9) = 13.80, p<0.001; two vs. three: t(9) = 5.94, p<0.001). The mean duration decreased with an increasing number of predators (F_number(2,18) = 23.77, p<0.001; η² = 0.49; one vs. two: t(9) = 2.60, p=0.029; two vs. three: t(9) = 5.44, p<0.001). (b) Comparison of two-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of the representations in the last hidden layers of the state-value stream between self-play (predator agents vs. prey agent) and joint play (predator agents vs. prey human).


Figure 4—figure supplement 1. Comparison of two-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of the internal representations.


To visualize the associations of states experienced by predator agents versus agents (self-play) and versus humans (joint play), we show colored two-dimensional t-SNE embedding of the representations in the last hidden layer of the action-value stream. Similar to those of the state stream (Figure 4b), the experienced states were quite distinct, especially in the one- and two-predator conditions.
Figure 4—figure supplement 2. Comparison of heat maps between individual (left) and shared (right) reward conditions in joint play.


The heat map of each agent was made based on the frequency of stay in each position, which is cumulative for 500 episodes (50 episodes × 10 participants). Our results showed a similar trend between self-play (predator agents vs. prey agent) and joint play (predator agents vs. prey human) in terms of role division among predators. For example, in the individual condition, the heat maps between/among predators were similar, indicating that there was no clear division of roles. On the other hand, in the shared condition, the heat maps between/among predators differed, indicating that the roles were divided among them. In addition, one of the differences from self-play was the instability of the predator agent’s behavior in the one-predator condition. As shown in the figure, under certain conditions, the predator agent stopped moving from its location (the upper right corner of the area).

Discussion

Collaborative hunting has been traditionally thought of as an advanced hunting strategy that involves high-level cognition such as aspects of theory of mind (Boesch and Boesch-Achermann, 2000; Boesch, 2002). Here, we have shown that ‘collaboration’ (Boesch and Boesch, 1989) can emerge in group hunts of artificial agents based on deep reinforcement learning. Notably, our predator agents successfully learned to collaborate in capturing their prey solely through a reinforcement learning algorithm, without employing explicit mechanisms comparable to aspects of theory of mind (Yoshida et al., 2008; Foerster, 2019; Hu and Foerster, 2020). This means that, in contrast to the traditional view, apparently elaborate coordination can be accomplished by relatively simple decision rules, that is, mappings between states and actions. This result advances our understanding of cooperative hunting behavior and its decision process, and may offer a novel perspective on the evolution of sociality.

Our results on agent behavior are broadly consistent with previous studies concerning observations of animal behavior in nature. First, as the number of predators increased, success rates increased and hunting duration decreased (Creel and Creel, 1995). Second, whether collaborative hunts emerge depended on two factors: the success rate of hunting alone (Busse, 1978; Boesch, 2002) and the presence or absence of reward sharing following prey capture (Boesch, 1994; Stanford, 1996). Third, while each predator generally maintained a consistent role during repeated collaborative hunts, there was flexibility for these roles to be swapped as needed (Stander, 1992; Boesch, 2002). Finally, predator agents in this study acquired different strategies depending on the conditions despite having exactly the same initial values (i.e. network weights), resonating with the findings that lions and chimpanzees living in different regions exhibit different hunting strategies (Stander, 1992; Boesch‐Achermann and Boesch, 1994). These results suggest the validity of our computational simulations and highlight the close link between predators’ behavioral strategies and their living environments, such as the presence of other predators and sharing of prey.

Collaborative hunts showed performance that surpassed the theoretical predictions based on solitary hunting outcomes. This result is in line with the notion that role division among predators in nature could provide fitness benefits (Lang and Farine, 2017; Boesch and Boesch-Achermann, 2000). Meanwhile, when three predators were involved, performance was comparable whether prey was shared or not. One possible factor underlying this is spatial constraint. We found that predators occasionally blocked the prey’s escape path, exploiting the boundaries of the play area and the chasing movements of other predators even in the individual reward condition (Figure 2—figure supplement 4). These results suggest that, under certain scenarios, coordinated hunting behaviors that enhance the success rate of predators may emerge regardless of whether food is shared, potentially relating to the benefits of social predation, including interspecific hunting (Bshary et al., 2006; Thiebault et al., 2016; Sampaio et al., 2021).

We found that the mappings resulting in collaborative hunting were related to distance-dependent internal representations. Additionally, we showed that the distance-dependent rule-based predators successfully reproduced behaviors similar to those of the deep reinforcement learning predators, supporting the association between decisions and distances (Methods; Figure 3—figure supplement 5, Figure 3—figure supplement 6 and Figure 3—figure supplement 7). Deep reinforcement learning has held the promise for providing a comprehensive framework for studying the interplay among learning, representation, and decision making (Botvinick et al., 2020; Mobbs et al., 2021), but such efforts for natural behavior have been limited (Banino et al., 2018; Jaderberg et al., 2019). Our result that the distance-dependent representations relate to collaborative hunting is reminiscent of a recent idea about the decision rules obtained by observation in fish (Steinegger et al., 2018). Notably, the input variables of predator agents do not include variables corresponding to the distance(s) between the other predator(s) and prey, and this means that the predators in the shared conditions acquired the internal representation relating to distance to prey, which would be a geometrically reasonable indicator, by optimization through interaction with their environment. Our results suggest that deep reinforcement learning methods can extract systems of rules that allow for the emergence of complex behaviors.

The predator agents’ decision rules (i.e. policy networks) acquired through interactions with other agents (i.e. self-play) were also useful for unknown prey controlled by humans, despite the dissociation of the experienced states. This suggests that decision rules formed by associative learning can successfully address natural problems, such as catching prey with somewhat different movement patterns than one’s usual prey. Note that the learning mechanism of associative learning (or reinforcement learning) is relatively simple, but it allows for flexible behavior in response to situations, in contrast to innate and simple stimulus-response. Indeed, our prey agents achieved a higher rate of successful evasions than those operated by humans. Our view that decisions for successful hunting are made through representations formed by prior experience is a counterpart to the recent idea that computational relevance for successful escape may be cached and ready to use, instead of being computed from scratch on the spot (Evans et al., 2019). If animals’ decision processes in predator-prey dynamics are structured in this way, it could be a product of natural selection, enabling rapid, robust, and flexible action in interactions with severe time constraints.

In conclusion, we demonstrated that the decisions underlying collaborative hunting among artificial agents can be achieved through mappings between states and actions. This means that collaborative hunting can emerge in the absence of explicit mechanisms comparable to aspects of theory of mind, supporting the recent idea that collaborative hunting does not necessarily rely on complex cognitive processes in brains (Lang and Farine, 2017). Our computational ecology is an abstraction of a real predator-prey environment. Given that chase and escape often involve various factors, such as energy cost (Hubel et al., 2016), partial observability (Mugan and MacIver, 2020; Hunt et al., 2021), signal communication (Vail et al., 2013), and local surroundings (Evans et al., 2019), these results are only a first step on the path to understanding real decisions in predator-prey dynamics. Furthermore, exploring how mechanisms comparable to aspects of theory of mind (Yoshida et al., 2008; Foerster, 2019; Hu and Foerster, 2020) or the shared value functions (Lowe, 2017; Foerster et al., 2018; Rashid, 2020), which are increasingly common in multi-agent reinforcement learning, play a role in these interactions could be an intriguing direction for future research. We believe that our results provide a useful advance toward understanding natural value-based decisions and forge a critical link between ecology, ethology, psychology, neuroscience, and computer science.

Methods

Environment

The predator and prey interacted in a two-dimensional world with continuous space and discrete time. This environment was constructed by modifying an environment known as ‘predator-prey’ within a multi-agent particle environment (Lowe, 2017). Specifically, the position of each agent was calculated by integrating the acceleration (i.e. the selected action) twice with the Euler method, and viscous resistance proportional to velocity was considered. The modifications were that the play area was constrained to the range of –1 to 1 on the x and y axes, all agent (predator/prey) disk diameters were set to 0.1, landmarks (obstacles) were eliminated, and predator-to-predator contact was ignored for simplicity (Tsutsui et al., 2022). The predator(s) was rewarded for capturing the prey (+1), namely contacting the disks, and punished for moving out of the area (–1), and the prey was penalized for being captured by the predator or for moving out of the area (–1). The predators and prey were represented as blue and red disks, respectively, and the play area was represented as a black square enclosing them. The time step was 0.1 s and the time limit in each episode was set to 30 s. The initial positions of the predators and prey in each episode were randomly selected from a range of –0.5 to 0.5 on the x and y axes.
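A minimal sketch of these dynamics (Euler integration of the selected acceleration with linear viscous drag, capture by disk contact, and penalties for leaving the area) is given below for a single predator; the drag coefficient and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

DT = 0.1        # simulation time step (s), as stated above
RADIUS = 0.05   # disk radius (diameter 0.1)
DRAG = 0.25     # illustrative viscous-drag coefficient (assumption)

def step_agent(pos, vel, accel):
    """One Euler update: acceleration minus viscous resistance proportional to velocity."""
    vel = vel + (accel - DRAG * vel) * DT
    pos = pos + vel * DT
    return pos, vel

def rewards(pred_pos, prey_pos):
    """Capture and out-of-area rewards as described above (single-predator sketch)."""
    captured = np.linalg.norm(pred_pos - prey_pos) < 2 * RADIUS  # disks in contact
    pred_r = 1.0 if captured else 0.0
    prey_r = -1.0 if captured else 0.0
    if np.any(np.abs(pred_pos) > 1.0):  # predator left the play area
        pred_r -= 1.0
    if np.any(np.abs(prey_pos) > 1.0):  # prey left the play area
        prey_r -= 1.0
    return pred_r, prey_r, captured

# Illustrative usage for one step.
pred_pos, pred_vel = np.array([0.2, 0.0]), np.zeros(2)
pred_pos, pred_vel = step_agent(pred_pos, pred_vel, np.array([1.0, 0.0]))
```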

Experimental conditions

We selected the number of predators, relative mobility, and prey (reward) sharing as experimental conditions, based on ecological findings (Bailey et al., 2013; Lang and Farine, 2017). For the number of predators, three conditions were set: 1 (one), 2 (two), and 3 (three). In all these conditions, the number of prey was set to 1. For the relative mobility, three conditions were set: 120% (fast), 100% (equal), and 80% (slow) for the acceleration exerted by the predator, based on that exerted by the prey. For the prey sharing, two conditions were set: with sharing (shared), in which all predators were rewarded when a predator catches the prey, and without sharing (individual), in which a predator was rewarded only when it catches prey by itself. In total, there were 15 conditions.
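To make the count of 15 concrete, the condition grid can be enumerated as in the sketch below; that reward sharing is only distinguished when more than one predator is present is the presumed reason the 3 × 3 × 2 = 18 combinations reduce to 15.

```python
from itertools import product

n_predators = [1, 2, 3]
mobility = ["fast", "equal", "slow"]   # predator acceleration: 120%, 100%, 80% of the prey's
sharing = ["individual", "shared"]

conditions = [
    (n, m, s)
    for n, m, s in product(n_predators, mobility, sharing)
    # with a single predator there is no one to share with, so only one
    # reward setting is counted (presumed reason for 15 rather than 18)
    if not (n == 1 and s == "shared")
]
print(len(conditions))  # 15
```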

Agent architecture

We considered a sequential decision-making setting in which a single agent interacts with an environment $\mathcal{E}$ in a sequence of observations, actions, and rewards. At each time step $t$, the agent observes a state $s_t \in \mathcal{S}$ and selects an action $a_t$ from a discrete set of actions $\mathcal{A} = \{1, 2, \ldots, |\mathcal{A}|\}$. One time step later, in part as a consequence of its action, the agent receives a reward $r_{t+1} \in \mathbb{R}$ and moves to a new state $s_{t+1}$. In the MDP, the agent thereby gives rise to a sequence that begins as follows: $s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \ldots$, and learns a behavioral rule (policy) that depends upon these sequences.

The goal of the agent is to maximize the expected discounted return over time through its choice of actions (Sutton and Barto, 2018). The discounted return $R_t$ was defined as $\sum_{k=0}^{T} \gamma^{k} r_{t+k+1}$, where $\gamma \in [0, 1]$ is a parameter called the discount rate that determines the present value of future rewards, and $T$ is the time step at which the task terminates. The state-value function, action-value function, and advantage function are defined as $V^{\pi}(s) = \mathbb{E}_{\pi}[R_t \mid s_t = s]$, $Q^{\pi}(s,a) = \mathbb{E}_{\pi}[R_t \mid s_t = s, a_t = a]$, and $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$, respectively, where $\pi$ is a policy mapping states to actions. The optimal action-value function $Q^{*}(s,a)$ is then defined as the maximum expected discounted return achievable by following any strategy, after observing some state $s$ and then taking some action $a$: $Q^{*}(s,a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$. The optimal action-value function can be computed by finding a fixed point of the Bellman equation:

$$Q^{*}(s,a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[\, r + \gamma \max_{a'} Q^{*}(s',a') \mid s, a \,\right], \qquad (1)$$

where $s'$ and $a'$ are the state and action at the next time step, respectively. This is based on the following intuition: if the optimal value $Q^{*}(s',a')$ of the state $s'$ were known for all possible actions $a'$, the optimal strategy would be to select the action $a'$ maximizing the expected value of $r + \gamma \max_{a'} Q^{*}(s',a')$. The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update: $Q_{i+1}(s,a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q_{i}(s',a') \mid s, a \,\right]$. Such value iteration algorithms converge to the optimal action-value function in situations where all states can be sufficiently sampled, $Q_i \to Q^{*}$ as $i \to \infty$. In practice, however, it is often difficult to apply this basic approach, which estimates the action-value function separately for each state, to real-world problems. Instead, it is common to use a function approximator to estimate the action-value function, $Q(s,a;\theta) \approx Q^{*}(s,a)$.

There are several possible methods for function approximation, yet we here use a neural network function approximator referred to as deep Q-network (DQN) (Mnih et al., 2015) and some of its extensions to overcome the limitations of the DQN, namely Double DQN (Van Hasselt et al., 2016), Prioritized Experience Replay (Schaul et al., 2015), and Dueling Networks (Wang, 2016). Naively, a Q-network with weights θ can be trained by minimizing a loss function L(θ) that changes at each iteration i,

$$L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}\left[ \tfrac{1}{2}\left( y_i - Q(s,a;\theta_i) \right)^{2} \right], \qquad (2)$$

where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[\, r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \mid s, a \,\right]$ is the target value for iteration $i$, and $\rho(s,a)$ is a probability distribution over states $s$ and actions $a$. The parameters from the previous iteration, $\theta_{i-1}$, are kept constant when optimizing the loss function $L_i(\theta_i)$. By differentiating the loss function with respect to the weights, we arrive at the following gradient:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\left[ \left( r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i) \right) \nabla_{\theta_i} Q(s,a;\theta_i) \right]. \qquad (3)$$

We could attempt to use the simplest Q-learning to learn the weights of the network $Q(s,a;\theta)$ online; however, this estimator performs poorly in practice. In this simplest form, incoming data are discarded immediately, after a single update. This results in two issues: (i) strongly correlated updates that break the i.i.d. assumption of many popular stochastic gradient-based algorithms and (ii) the rapid forgetting of possibly rare experiences that would be useful later. To address both of these issues, a technique called experience replay is often adopted (Lin, 1992), in which the agent’s experiences at each time step, $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$, are stored in a dataset (also referred to as replay memory) $\mathcal{D} = \{e_1, e_2, \ldots, e_N\}$, where $N$ is the dataset size, for some time period. When training the Q-network, instead of only using the current experience as prescribed by standard Q-learning, mini-batches of experiences are sampled from $\mathcal{D}$ uniformly at random to train the network. This breaks the temporal correlations by mixing more and less recent experiences in the updates, and rare experiences are used for more than just a single update. Another technique, called the target network, is also often used to stabilize the updates. To achieve this, the target value $y_i$ is replaced by $r + \gamma \max_{a'} Q(s',a';\theta_i^{-})$, where $\theta_i^{-}$ are weights that are frozen for a fixed number of iterations. The full algorithm combining these ingredients, namely experience replay and the target network, is often called a deep Q-network (DQN), and its loss function takes the form:

$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(\mathcal{D})}\left[ \left( y_i^{\mathrm{DQN}} - Q(s,a;\theta_i) \right)^{2} \right], \qquad (4)$$

where

$$y_i^{\mathrm{DQN}} = r + \gamma \max_{a'} Q(s',a';\theta_i^{-}), \qquad (5)$$

and $U(\cdot)$ denotes uniform sampling.
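As a minimal PyTorch sketch (with illustrative names) of the two ingredients just described, uniform experience replay and a periodically frozen target network used to form the DQN target of Equation 5:

```python
import random
from collections import deque
import torch

class ReplayMemory:
    """Uniformly sampled replay memory D = {e_1, ..., e_N} (sketch)."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def __len__(self):
        return len(self.buffer)

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)
        return (torch.as_tensor(s, dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.int64),
                torch.as_tensor(r, dtype=torch.float32),
                torch.as_tensor(s_next, dtype=torch.float32),
                torch.as_tensor(done, dtype=torch.float32))

def dqn_target(reward, s_next, done, target_net, gamma=0.9):
    """y = r + gamma * max_a' Q(s', a'; theta^-), with bootstrapping masked at episode end."""
    with torch.no_grad():
        max_q_next = target_net(s_next).max(dim=1).values
    return reward + gamma * (1.0 - done) * max_q_next
```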

It has become known that Q-learning algorithms perform poorly in some stochastic environments. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. As a method to alleviate the performance degradation due to this overestimation, Double Q-learning, which decomposes the maximum operation into action selection and action evaluation by introducing a double estimator, was proposed (Hasselt, 2010). Double DQN (DDQN) is an algorithm that applies the Double Q-learning method to DQN (Van Hasselt et al., 2016). In DDQN, in contrast to the original Double Q-learning and another proposed method (Fujimoto et al., 2018), the target network in the DQN architecture, although not fully decoupled, is used as the second value function, and the target value in the loss function (i.e. Equation 4) for iteration $i$ is replaced as follows:

y_i^{\mathrm{DDQN}} = r + \gamma\, Q\left(s',\, \arg\max_{a'} Q(s', a'; \theta_i);\, \theta^{-}\right). (6)
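A short sketch of how this target differs from the DQN target, under the same illustrative PyTorch setup as above: the online network selects the action, and the frozen target network evaluates it.

import torch

def ddqn_target(q_net, target_net, r, s_next, gamma=0.9):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)                   # arg max_a' Q(s', a'; theta_i)
        return r + gamma * target_net(s_next).gather(1, a_star).squeeze(1)   # Q(s', a*; theta^-), Eq. 6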

Prioritized Experience Replay is a method that aims to make learning more efficient and effective than when all transitions are replayed uniformly (Schaul et al., 2015). For prioritized replay, the probability of sampling transition i from the dataset is defined as

P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, (7)

where p_i > 0 is the priority of transition i and the exponent α determines how much prioritization is used, with α = 0 corresponding to uniform sampling. The priority is determined by p_i = |δ_i| + ϵ, where δ_i is a temporal-difference (TD) error (e.g., δ_i = r + γ max_{a'} Q(s', a'; θ_i^−) − Q(s, a; θ_i) in DQN) and ϵ is a small positive constant that prevents transitions from no longer being revisited once their error is zero. Prioritized replay introduces a sampling bias and therefore changes the solution to which the estimates will converge. This bias can be corrected by importance-sampling (IS) weights w_i = (1/(N · P(i)))^β, which fully compensate for the non-uniform probabilities P(i) if β = 1.
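A small NumPy sketch of this sampling scheme (Eq. 7) with the importance-sampling correction; normalizing the weights by their maximum is a common practical choice and is an assumption here, not something stated in the text.

import numpy as np

def sample_prioritized(td_errors, batch_size=32, alpha=0.6, beta=1.0, eps=1e-6):
    p = np.abs(td_errors) + eps                      # p_i = |delta_i| + epsilon
    probs = p**alpha / np.sum(p**alpha)              # P(i), Eq. 7
    idx = np.random.choice(len(p), size=batch_size, p=probs)
    weights = (1.0 / (len(p) * probs[idx]))**beta    # IS weights w_i
    weights /= weights.max()                         # normalize for stability (common practice)
    return idx, weights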

Dueling Network is a neural network architecture designed for value-based algorithms such as DQN (Wang, 2016). It features two streams of computation, a value stream and an advantage stream, which share a common encoder and are merged by an aggregation module that produces an estimate of the state-action value function. Intuitively, the dueling network can learn which states are (or are not) valuable without having to learn the effect of each action in each state. For stability of the optimization, the last module of the network is implemented as follows:

Q(s,a;\theta,\eta,\xi) = V(s;\theta,\xi) + \left(A(s,a;\theta,\eta) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a';\theta,\eta)\right), (8)

where θ denotes the parameters of the shared layers, and ξ and η denote the parameters of the value and advantage streams, respectively.
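A minimal PyTorch sketch of this aggregation (Eq. 8). The layer sizes follow the description in the Training details below (two shared 64-unit layers, one 32-unit layer per stream, and 13 actions, i.e., 12 directions plus 'do nothing'), but the exact module layout is an assumption.

import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim, n_actions=13):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                    nn.Linear(64, 64), nn.ReLU())      # theta: common encoder
        self.value = nn.Sequential(nn.Linear(64, 32), nn.ReLU(),
                                   nn.Linear(32, 1))                   # xi: V(s)
        self.advantage = nn.Sequential(nn.Linear(64, 32), nn.ReLU(),
                                       nn.Linear(32, n_actions))       # eta: A(s, a)

    def forward(self, s):
        h = self.shared(s)
        v, a = self.value(h), self.advantage(h)
        return v + (a - a.mean(dim=1, keepdim=True))                   # aggregation of Eq. 8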

We here modeled each agent (predator/prey) with independent learning, one of the simplest approaches to multi-agent reinforcement learning (Tan, 1993). In this approach, each agent independently learns its own policy and treats the other agents as part of the environment; that is, each agent learns a policy conditioned only on its local observation history and does not account for the non-stationarity of the multi-agent environment. Thus, in contrast to previous studies on multi-agent reinforcement learning (Tesauro, 2003; Foerster et al., 2016; Silver et al., 2017; Lowe, 2017; Foerster et al., 2018; Sunehag, 2017; Rashid, 2020; Son et al., 2019; Baker, 2019; Christianos et al., 2020; Mugan and MacIver, 2020; Hamrick, 2021; Yu, 2022), our agents did not share network parameters or value functions, and did not access models of the environment for planning. For each agent n, the policy π_n is represented by a neural network and optimized within the DQN framework, including DDQN, Prioritized Experience Replay, and the Dueling architecture. The loss function of each agent takes the form:

L_i(\theta_i,\eta_i,\xi_i) = \mathbb{E}_{(s,a,r,s')\sim P(D)}\left[\left(y_i - Q(s,a;\theta_i,\eta_i,\xi_i)\right)^2\right], (9)

where

y_i = r + \gamma\, Q\left(s',\, \arg\max_{a'} Q(s', a'; \theta_i,\eta_i,\xi_i);\, \theta_i^{-},\eta_i^{-},\xi_i^{-}\right), (10)

and P(·) denotes prioritized sampling. For simplicity, we omitted the agent index n in these equations.
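A schematic sketch of the independent-learning setup, reusing the DuelingQNet class from the sketch above; the number of agents and the state dimension are illustrative.

from collections import deque

agents = []
for n in range(3):                                  # e.g., one prey and two predators
    agents.append({
        "q_net": DuelingQNet(state_dim=8),          # each agent has its own policy network
        "target_net": DuelingQNet(state_dim=8),     # ... its own target network
        "memory": deque(maxlen=10_000),             # ... and its own replay memory
    })
# During training, each agent samples minibatches only from its own memory and updates
# only its own parameters; no value functions or network weights are shared across agents.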

Training details

The neural network was composed of four layers (Figure 1—figure supplement 1). There was a separate output unit for each possible action, and only the state representation was an input to the neural network. The inputs to the neural network were the position of a specific agent in the absolute coordinate system (x- and y-positions) and the positions and velocities of the agent and the others in the relative coordinate system (u- and v-positions and u- and v-velocities) (Figure 1—figure supplement 2), which were determined based on findings in neuroscience (O’Keefe and Dostrovsky, 1971) and ethology (Brighton et al., 2017; Tsutsui et al., 2020), respectively. We assumed that delays in sensory processing were compensated for by estimation of the motion of self (Wolpert et al., 1998; Kawato, 1999) and others (Tsutsui et al., 2021), and the current information at each time step was used as input as is. The outputs were accelerations in 12 directions, spaced every 30 degrees in the relative coordinate system, which were determined with reference to an ecological study (Wilson et al., 2018). After the first two hidden layers of the MLP with 64 units each, the network branched into two streams, each with one MLP layer of 32 hidden units. ReLU was used as the activation function for each layer (Glorot et al., 2011). The network parameters θ_n, η_n, and ξ_n were iteratively optimized via stochastic gradient descent with the Adam optimizer (Kingma and Ba, 2014). In the computation of the loss, we used the Huber loss to prevent extreme gradient updates (Huber, 1992). The model was trained for 10^6 episodes, and the network parameters were copied to the target network every 2000 episodes. The replay memory size was 10^4, the minibatch size during training was 32, and the learning rate was 10^-6. The discount factor γ was set to 0.9, and α was set to 0.6. We used an ε-greedy policy as the behavior policy π_n, which chooses a random action with probability ε or an action according to the optimal Q function, arg max_{a∈A} Q(s, a), with probability 1 − ε. In this study, ε was annealed linearly from 1 to 0.1 over the first 10^4 episodes and fixed at 0.1 thereafter.
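A sketch of the ε-greedy behavior policy with the linear annealing schedule described above; the function signatures and the state format are illustrative.

import random
import torch

def epsilon(episode, eps_start=1.0, eps_end=0.1, anneal_episodes=10_000):
    frac = min(episode / anneal_episodes, 1.0)           # linear annealing over the first 10^4 episodes
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_net, state, episode, n_actions=13):
    if random.random() < epsilon(episode):
        return random.randrange(n_actions)                # random action with probability epsilon
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax())    # greedy action: arg max_a Q(s, a)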

Evaluation

The model performance was evaluated using the trained model. The initial position of each agent and the termination criteria in each episode were the same as in training. During the evaluation, ε was set to 0, and each agent took greedy actions. If the predator captured the prey within the time limit, the predator was deemed successful; otherwise, the prey was considered successful. Additionally, if one side (predators/prey) moved out of the area, the other side (prey/predators) was deemed successful. We first conducted a computational experiment (self-play: predator agent vs. prey agent) and then conducted a human behavioral experiment (joint play: predator agent vs. prey human). In the computational experiment, we simulated 100 episodes for each of the 10 random seeds (i.e., different initial positions), for a total of 1000 episodes in each condition. In the joint play, human participants controlled the prey on a screen using a joystick and interacted with the predator agents for 50 episodes in each condition.

Participants

Ten males participated in the experiment (aged 22–25 years, mean = 23.5, s.d. = 1.2). All participants but one were right-handed, all had normal or corrected-to-normal vision, and all were naïve to the purpose of the study. This study was approved by the Ethics Committee of Nagoya University Graduate School of Informatics (No. 2021–27). Informed consent was obtained from each participant before the experiment. Participants received 1000 yen per hour as a reward.

Apparatus

Participants were seated in a chair and operated the joystick of an Xbox One controller, which could tilt freely in any direction, to control a disk on the screen. The stimuli were presented on a 26.5-inch monitor (EIZO EV2730Q) at a refresh rate of 60 Hz. A gray square surrounding the disks was defined as the play area. The diameter of each disk on the screen was 2.0 cm, and the width and height of the area were 40.0 cm. The acceleration of each disk on the screen was determined by the inclination of the joystick. Specifically, acceleration was added when the degree of joystick tilt exceeded half of the maximum tilt, and the direction of the acceleration was selected from 12 directions, discretized every 30 degrees in an absolute coordinate system corresponding to the direction of joystick tilt. The direction of acceleration was set with respect to the absolute coordinate system, rather than the relative coordinate system, in the human behavioral experiment to allow participants to control the disk more intuitively. The position and velocity of each disk on the screen were updated at 10 Hz (corresponding to the computational simulation), and the position during the episodes was recorded at 10 Hz on a computer (MacBook Pro) with PsychoPy version 3.0. The viewing distance of the participants was about 60 cm.

Design

Participants controlled a red disk representing the prey on the screen. They were asked to evade the predator for 30 s without leaving the play area. The agent’s initial position and the outcome of the episode were determined as described above. The experimental block consisted of five sets of 10 episodes, with a warm-up of 10 episodes so that participants could become accustomed to the task. In this experiment, we focused on the slow condition and there were thus five experimental conditions (one, two × individual, two × shared, three × individual, and three × shared). Each participant played one block (i.e. 50 episodes) of each experimental condition. The order of the experimental conditions was pseudo-randomized across participants.

Rule-based agent

We constructed rule-based predator agents to test whether they could reproduce behavior similar to that of the predator agents based on deep reinforcement learning in the two × shared condition. For consistency with the deep reinforcement learning agents, the input to the rule-based agent used to make decisions was limited to the current information (e.g., position and velocity), and the output was provided in a coordinate system relative to the prey; that is, action 1 denotes movement toward the prey and action 7 denotes movement in the opposite direction of the prey. The predator agent first determines whether it, or another predator, is closer to the prey, and then, if the other predator is closer, it determines whether the distance 2 is less than a certain distance threshold (set to 0.4 in our simulation). The decision rule for each predator is selected by this branching, with predator 1 adopting the three rules ‘chase,’ ‘shortcut,’ and ‘approach,’ and predator 2 adopting the two rules ‘chase’ and ‘ambush.’ For the chase, the predator first determines whether it is near the outer edge of the play area and, if so, selects actions that will prevent it from leaving the play area. Specifically, if the predator’s position is such that |x| > 0.9 and |y| > 0.9, action 3 for clockwise (CW) and action 11 for counterclockwise (CCW) were selected, respectively, and if 0.8 < |x| ≤ 0.9 and 0.8 < |y| ≤ 0.9, action 2 for CW and action 12 for CCW were selected. CW and CCW were determined by the absolute position of the prey and the relative position vector between the closer predator and the prey; the play area was divided into four parts based on the signs of the x and y coordinates, and CW and CCW were determined by the correspondence between each area and the sign of the component (x or y) of the relative position vector with the larger absolute value. For instance, if the closer predator is at (0.2, 0.3) and the prey is at (0.5, 0.2), the situation is determined to be CW. If the predator is not near the outer edge of the play area, it then determines whether the prey is inside the play area and, if so, selects actions that will drive the prey outside; if the prey’s position is such that |x| ≤ 0.5 and |y| ≤ 0.5, action 11 for CW and action 3 for CCW were selected, and if 0.5 < |x| ≤ 0.6 and 0.5 < |y| ≤ 0.6, action 12 for CW and action 2 for CCW were selected. In other situations, the predator selects actions so that its direction of movement is aligned with that of the prey; if the angle ψ between the velocity vectors of the predator and prey satisfied ψ ≤ –50 degrees, action 3 was selected; if –50 < ψ ≤ –15 degrees, action 2; if –15 < ψ ≤ 15 degrees, action 1; if 15 < ψ ≤ 50 degrees, action 2; and if 50 degrees < ψ, action 3. For the shortcut, the predator determines whether it is near the outer edge of the play area and, if so, selects the actions described above; otherwise, it selects actions that produce shorter paths to the prey, namely action 2 for CW and action 12 for CCW. For the approach, the predator determines whether it is near the outer edge of the play area and, if so, selects the actions described above; otherwise, it selects actions that move it toward the prey, namely action 1. For the ambush, the predator selects actions that move it toward the top center or bottom center of the play area and remains at that location until the situation changes. If the predator’s position is such that |y| ≤ 0, the predator moved with respect to the bottom center point (–0.1, 0.5), and if |y| > 0, it moved toward the top center point (0, 0.6).
The coordinates of the top center and bottom center points were based on the results of the deep reinforcement learning agents. Specifically, we first divided the play area into four parts based on the signs of the x and y coordinates with respect to the reference (i.e., bottom center or top center) point, and in each area, the predator selected action 3, 8, or 12 (every 120 degrees) that would move it toward the reference point, depending on the direction of the prey from the predator’s perspective. For instance, if the predator is at (–0.2, 0.8) and the prey is at (−0.2, –0.8), action 12 is selected.
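The following is a schematic sketch of the top-level branching described above; the assignment of the named rules to the branches is our reading of the text and should be treated as an assumption, and the low-level action selection (boundary handling, CW/CCW choice) is omitted.

def select_rule(predator_id, i_am_closer, dist2, threshold=0.4):
    # Returns the name of the movement rule; how each rule maps to actions 1-12
    # is described in the text above.
    if i_am_closer:
        return "chase"
    if predator_id == 1:
        # predator 1: 'shortcut' vs. 'approach', switched by the distance threshold (assumption)
        return "shortcut" if dist2 < threshold else "approach"
    # predator 2: 'ambush' vs. 'chase', switched by the same threshold (assumption)
    return "ambush" if dist2 < threshold else "chase"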

Behavioral cloning

We constructed neural networks to clone the predatory behavior of the rule-based agents. Each neural network is composed of two weight layers; that is, it takes the state of the environment as input, as in the deep reinforcement learning agents, processes it through a hidden layer, and then outputs probabilities for each of the 13 potential actions using the softmax function. To ensure a fair comparison with the embedding of the deep reinforcement learning agents, we set the number of units in the hidden layer to 32. All layers were fully connected. For each agent (i.e., predator 1 and predator 2), we implemented two types of networks: a linear network without any nonlinear transformation and a nonlinear network with ReLU activations. Specifically, in the linear network, the hidden layer is a fully connected layer without nonlinearity,

h = W_{xh}\,x + b_h,

where x, h, W_{xh}, and b_h denote the input to the hidden layer (state), the output of the hidden layer, the input-to-hidden weight, and the bias, respectively. In the nonlinear network, the hidden layer is composed of the fully connected layer with nonlinearity,

h = \varphi(W_{xh}\,x + b_h),

where φ(x)=max(0, x) is the rectified linear unit (ReLU) for nonlinearity. The neural network models were trained to minimize the cross entropy error,

E = -\sum_k t_k \log y_k,

where t_k and y_k denote the target given by the actual action taken by the rule-based agent and the predicted probability for class k, respectively. Network parameters were optimized iteratively using stochastic gradient descent with the Adam optimizer. The learning rate, batch size, and number of epochs were set to 0.0001, 32, and 2000, respectively, for all agents and networks. The networks were trained, validated, and tested using simulation data from 1000 episodes (123,597 time steps), 100 episodes (16,805 time steps), and 100 episodes (12,076 time steps), respectively. The network weights were saved according to the best performance observed during the validation phase.
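A PyTorch sketch of the two cloning networks and a training step; the softmax is folded into the cross-entropy loss, and the state dimension is illustrative.

import torch
import torch.nn as nn

def make_cloning_net(state_dim, nonlinear, hidden=32, n_actions=13):
    layers = [nn.Linear(state_dim, hidden)]         # h = W_xh x + b_h
    if nonlinear:
        layers.append(nn.ReLU())                    # h = phi(W_xh x + b_h)
    layers.append(nn.Linear(hidden, n_actions))     # logits; softmax is applied inside the loss
    return nn.Sequential(*layers)

net = make_cloning_net(state_dim=8, nonlinear=True)
optimizer = torch.optim.Adam(net.parameters(), lr=0.0001)
loss_fn = nn.CrossEntropyLoss()                     # E = -sum_k t_k log y_k

def cloning_step(states, actions):                  # actions: integer class labels (0-12)
    loss = loss_fn(net(states), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()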

Data analysis

All data analysis was performed in Python 3.7. Successful predation was defined as the sum of the number of predators catching prey and the number of prey leaving the play area. The theoretical prediction assumes that each predator’s performance is independent of the others’ performance, and was defined as follows:

H_n = 1 - (1 - H_1)^n (11)

where H_n and H_1 denote the proportion of successful predation when the number of predators is n and 1, respectively. The duration was defined as the time from the beginning to the end of the episode, with a maximum duration of 30 s. The heat maps were constructed based on the frequency of stay in each position, with the play area divided into 1600 cells (40 × 40). The concordance rate was calculated by comparing the action actually selected by each agent in the two- or three-predator conditions with the action that would have been chosen by the agent in the one-predator condition if it were placed in the same situation. The circular correlation coefficient was calculated by converting the selected actions (1–12) into angles (0–330 degrees) (Berens, 2009); action 13 (do nothing) was excluded from this analysis. The two-dimensional embedding was made by transforming the vectors in the last hidden layers of the state-value and action-value streams of the policy network using t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008). To reduce the influence of extremely large or small values, the color ranges of the V value, SD Q value, and distance were limited to the range from the 5th percentile to the 95th percentile of the values experienced by each agent (see Figure 3—figure supplements 8–11).
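For illustration, the theoretical prediction of Eq. 11 can be computed directly; the 30% solitary success rate used here is an arbitrary example, not a result from the paper.

def theoretical_success(h1, n):
    return 1.0 - (1.0 - h1) ** n      # H_n = 1 - (1 - H_1)^n

print(theoretical_success(0.3, 2), theoretical_success(0.3, 3))   # 0.51, 0.657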

Statistics

All quantitative data are reported as mean ± SEM across random seeds in the computational experiment and across participants in the human experiment. In the human experiment, sample sizes were not predetermined statistically, but rather were chosen based on field standards. The data were analyzed using one- or two-way repeated-measures analysis of variance (ANOVA) as appropriate. For these tests, Mauchly’s test was used to test sphericity; if the sphericity assumption was violated, degrees of freedom were adjusted by the Greenhouse–Geisser correction. To adjust the p values for multiple comparisons, the Holm–Bonferroni method was used. The data distribution was assumed to be normal for multiple comparisons, but this was not formally tested. Two-tailed statistical tests were used for all applicable analyses. The significance level was set at an alpha value of 0.05. The theoretical prediction was excluded from statistical analyses (Figures 2a and 4a) because, from the equation, it is obvious that the proportion of successful predation increases as the number of predators increases. Specific test statistics, p values, and effect sizes for the analyses are detailed in the corresponding figure captions. All statistical analyses were performed using R version 4.0.2 (The R Foundation for Statistical Computing).

Code availability

The code for computational simulation and figures is available at https://github.com/TsutsuiKazushi/collaborative-hunting; (copy archived at Kazushi, 2023).

Acknowledgements

This work was supported by JSPS KAKENHI (Grant Numbers 20H04075, 21H04892, 21H05300, and 22K17673), JST PRESTO (JPMJPR20CA), and the Program for Promoting the Enhancement of Research Universities.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Kazushi Tsutsui, Email: k.tsutsui6@gmail.com.

Malcolm A MacIver, Northwestern University, United States.

Michael J Frank, Brown University, United States.

Funding Information

This paper was supported by the following grants:

  • Japan Society for the Promotion of Science 20H04075 to Keisuke Fujii.

  • Japan Society for the Promotion of Science 21H04892 to Kazuya Takeda.

  • Japan Society for the Promotion of Science 21H05300 to Keisuke Fujii.

  • Japan Society for the Promotion of Science 22K17673 to Kazushi Tsutsui.

  • Japan Science and Technology Agency JPMJPR20CA to Keisuke Fujii.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Resources, Software, Formal analysis, Funding acquisition, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing.

Validation, Investigation, Methodology, Writing - original draft, Writing - review and editing.

Supervision, Funding acquisition, Project administration.

Conceptualization, Supervision, Funding acquisition, Validation, Methodology, Writing - original draft, Project administration, Writing - review and editing.

Ethics

Human subjects: This study was approved by the Ethics Committee of Nagoya University Graduate School of Informatics. Informed consent was obtained from each participant before the experiment.

Additional files

MDAR checklist

Data availability

The data and models used in this study are available at https://doi.org/10.6084/m9.figshare.21184069.v3. The code for computational simulation and figures is available at https://github.com/TsutsuiKazushi/collaborative-hunting (copy archived at Kazushi, 2023).

The following dataset was generated:

Tsutsui K, Tanaka R, Takeda K, Fujii K. 2023. Dataset. figshare.

References

  1. Axelrod R, Hamilton WD. The evolution of cooperation. Science. 1981;211:1390–1396. doi: 10.1126/science.7466396. [DOI] [PubMed] [Google Scholar]
  2. Bailey I, Myatt JP, Wilson AM. Group hunting within the Carnivora: physiological, cognitive and environmental influences on strategy and cooperation. Behavioral Ecology and Sociobiology. 2013;67:1–17. doi: 10.1007/s00265-012-1423-3. [DOI] [Google Scholar]
  3. Baker B. Emergent Tool Use from Multi-Agent Autocurricula. arXiv. 2019 https://arxiv.org/abs/1909.07528
  4. Banino A, Barry C, Uria B, Blundell C, Lillicrap T, Mirowski P, Pritzel A, Chadwick MJ, Degris T, Modayil J, Wayne G, Soyer H, Viola F, Zhang B, Goroshin R, Rabinowitz N, Pascanu R, Beattie C, Petersen S, Sadik A, Gaffney S, King H, Kavukcuoglu K, Hassabis D, Hadsell R, Kumaran D. Vector-based navigation using grid-like representations in artificial agents. Nature. 2018;557:429–433. doi: 10.1038/s41586-018-0102-6. [DOI] [PubMed] [Google Scholar]
  5. Bednarz JC. Cooperative hunting in Harris’ hawks (Parabuteo unicinctus). Science. 1988;239:1525–1527. doi: 10.1126/science.239.4847.1525. [DOI] [PubMed] [Google Scholar]
  6. Berens P. Circstat: a matlab toolbox for circular statistics. Journal of Statistical Software. 2009;31:1–21. doi: 10.18637/jss.v031.i10. [DOI] [Google Scholar]
  7. Boesch C, Boesch H. Hunting behavior of wild chimpanzees in the Taï National Park. American Journal of Physical Anthropology. 1989;78:547–573. doi: 10.1002/ajpa.1330780410. [DOI] [PubMed] [Google Scholar]
  8. Boesch C. Cooperative hunting in wild chimpanzees. Animal Behaviour. 1994;48:653–667. doi: 10.1006/anbe.1994.1285. [DOI] [Google Scholar]
  9. Boesch C, Boesch-Achermann H. The Chimpanzees of the Taï Forest: Behavioural Ecology and Evolution. Oxford University Press; 2000. [DOI] [Google Scholar]
  10. Boesch C. Cooperative hunting roles among taï chimpanzees. Human Nature. 2002;13:27–46. doi: 10.1007/s12110-002-1013-6. [DOI] [PubMed] [Google Scholar]
  11. Boesch‐Achermann H, Boesch C. Hominization in the rainforest: The chimpanzee’s piece of the puzzle. Evolutionary Anthropology. 1994;3:9–16. doi: 10.1002/evan.1360030106. [DOI] [Google Scholar]
  12. Botvinick M, Wang JX, Dabney W, Miller KJ, Kurth-Nelson Z. Deep reinforcement learning and its neuroscientific implications. Neuron. 2020;107:603–616. doi: 10.1016/j.neuron.2020.06.014. [DOI] [PubMed] [Google Scholar]
  13. Brighton CH, Thomas ALR, Taylor GK. Terminal attack trajectories of peregrine falcons are described by the proportional navigation guidance law of missiles. PNAS. 2017;114:13495–13500. doi: 10.1073/pnas.1714532114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Brosnan SF, Salwiczek L, Bshary R. The interplay of cognition and cooperation. Philosophical Transactions of the Royal Society B. 2010;365:2699–2710. doi: 10.1098/rstb.2010.0154. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Bshary R, Hohner A, Ait-el-Djoudi K, Fricke H. Interspecific communicative and coordinated hunting between groupers and giant moray eels in the Red Sea. PLOS Biology. 2006;4:e431. doi: 10.1371/journal.pbio.0040431. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Busse CD. Do chimpanzees hunt cooperatively? The American Naturalist. 1978;112:767–770. doi: 10.1086/283318. [DOI] [Google Scholar]
  17. Christianos F, Schäfer L, Albrecht S. Shared experience actor-critic for multi-agent reinforcement learning. Advances in Neural Information Processing Systems; 2020. pp. 10707–10717. [Google Scholar]
  18. Couzin ID, Krause J, James R, Ruxton GD, Franks NR. Collective memory and spatial sorting in animal groups. Journal of Theoretical Biology. 2002;218:1–11. doi: 10.1006/jtbi.2002.3065. [DOI] [PubMed] [Google Scholar]
  19. Creel S, Creel NM. Communal hunting and pack size in African wild dogs, Lycaon pictus. Animal Behaviour. 1995;50:1325–1339. doi: 10.1016/0003-3472(95)80048-4. [DOI] [Google Scholar]
  20. Dinets V. Apparent coordination and collaboration in cooperatively hunting crocodilians. Ethology Ecology & Evolution. 2015;27:244–250. doi: 10.1080/03949370.2014.915432. [DOI] [Google Scholar]
  21. Doya K. Modulators of decision making. Nature Neuroscience. 2008;11:410–416. doi: 10.1038/nn2077. [DOI] [PubMed] [Google Scholar]
  22. Evans DA, Stempel AV, Vale R, Branco T. Cognitive control of escape behaviour. Trends in Cognitive Sciences. 2019;23:334–348. doi: 10.1016/j.tics.2019.01.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Foerster J, Assael IA, De N, Whiteson S. Learning to communicate with deep multi-agent reinforcement learning. Proceedings of the 30th International Conference on Neural Information Processing Systems; Barcelona, Spain. 2016. pp. 2145–2153. [Google Scholar]
  24. Foerster J, Farquhar G, Afouras T, Nardelli N, Whiteson S. Counterfactual Multi-Agent Policy Gradients. Proceedings of the AAAI Conference on Artificial Intelligence; 2018. [DOI] [Google Scholar]
  25. Foerster J. Bayesian action decoder for deep multi-agent reinforcement learning. International Conference on Machine Learning (PMLR); 2019. pp. 1942–1951. [Google Scholar]
  26. Fujimoto S, Hoof H, Meger D. Addressing function approximation error in actor-critic methods. In International conference on machine learning (PMLR); 2018. pp. 1587–1596. [Google Scholar]
  27. Gazda SK, Connor RC, Edgar RK, Cox F. A division of labour with role specialization in group–hunting bottlenose dolphins (Tursiops truncatus) off Cedar Key, Florida. Proceedings of the Royal Society B. 2005;272:135–140. doi: 10.1098/rspb.2004.2937. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics (JMLR Workshop and Conference Proceedings); 2011. pp. 315–323. [Google Scholar]
  29. Hamrick JB. On the role of planning in model-based deep reinforcement learning. International Conference on Learning Representations.2021. [Google Scholar]
  30. Hasselt H. Double q-learning. Advances in neural information processing systems; 2010. pp. 2613–2621. [Google Scholar]
  31. Howland HC. Optimal strategies for predator avoidance: the relative importance of speed and manoeuvrability. Journal of Theoretical Biology. 1974;47:333–350. doi: 10.1016/0022-5193(74)90202-1. [DOI] [PubMed] [Google Scholar]
  32. Hu H, Foerster JN. Simplified action decoder for deep multi-agent reinforcement learning. International Conference on Learning Representations.2020. [Google Scholar]
  33. Hubel TY, Myatt JP, Jordan NR, Dewhirst OP, McNutt JW, Wilson AM. Energy cost and return for hunting in African wild dogs and cheetahs. Nature Communications. 2016;7:1–13. doi: 10.1038/ncomms11034. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Huber PJ. In Breakthroughs in Statistics. Springer; 1992. Robust estimation of a location parameter; pp. 492–518. [Google Scholar]
  35. Hunt LT, Daw ND, Kaanders P, MacIver MA, Mugan U, Procyk E, Redish AD, Russo E, Scholl J, Stachenfeld K, Wilson CRE, Kolling N. Formalizing planning and information search in naturalistic decision-making. Nature Neuroscience. 2021;24:1051–1064. doi: 10.1038/s41593-021-00866-w. [DOI] [PubMed] [Google Scholar]
  36. Jaderberg M, Czarnecki WM, Dunning I, Marris L, Lever G, Castañeda AG, Beattie C, Rabinowitz NC, Morcos AS, Ruderman A, Sonnerat N, Green T, Deason L, Leibo JZ, Silver D, Hassabis D, Kavukcuoglu K, Graepel T. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science. 2019;364:859–865. doi: 10.1126/science.aau6249. [DOI] [PubMed] [Google Scholar]
  37. Kawato M. Internal models for motor control and trajectory planning. Current Opinion in Neurobiology. 1999;9:718–727. doi: 10.1016/s0959-4388(99)00028-8. [DOI] [PubMed] [Google Scholar]
  38. Kazushi T. Collaborative-hunting. swh:1:rev:b22af27999a97c564cae2ff8142d54a413e29199. Software Heritage. 2023 https://archive.softwareheritage.org/swh:1:dir:fd9557fca4f245d5ee9aeb8282d5ca516b40ca81;origin=https://github.com/TsutsuiKazushi/collaborative-hunting;visit=swh:1:snp:45fd535119fd5409b499c82dff20d0dc869f9423;anchor=swh:1:rev:b22af27999a97c564cae2ff8142d54a413e29199
  39. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. arXiv. 2014 https://arxiv.org/abs/1412.6980
  40. Lang SDJ, Farine DR. A multidimensional framework for studying social predation strategies. Nature Ecology & Evolution. 2017;1:1230–1239. doi: 10.1038/s41559-017-0245-0. [DOI] [PubMed] [Google Scholar]
  41. Lin LJ. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning. 1992;8:293–321. doi: 10.1007/BF00992699. [DOI] [Google Scholar]
  42. Lowe R. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems; 2017. pp. 6382–6393. [Google Scholar]
  43. Macdonald DW. The ecology of carnivore social behaviour. Nature. 1983;301:379–384. doi: 10.1038/301379a0. [DOI] [Google Scholar]
  44. Mackintosh NJ. The Psychology of Animal Learning. Academic Press; 1974. [Google Scholar]
  45. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, Hassabis D. Human-level control through deep reinforcement learning. Nature. 2015;518:529–533. doi: 10.1038/nature14236. [DOI] [PubMed] [Google Scholar]
  46. Mobbs D, Wise T, Suthana N, Guzmán N, Kriegeskorte N, Leibo JZ. Promises and challenges of human computational ethology. Neuron. 2021;109:2224–2238. doi: 10.1016/j.neuron.2021.05.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Mugan U, MacIver MA. Spatial planning with long visual range benefits escape from visual predators in complex naturalistic environments. Nature Communications. 2020;11:3057. doi: 10.1038/s41467-020-16102-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. O’Keefe J, Dostrovsky J. The hippocampus as a spatial map. Preliminary evidence from unit activity in the freely-moving rat. Brain Research. 1971;34:171–175. doi: 10.1016/0006-8993(71)90358-1. [DOI] [PubMed] [Google Scholar]
  49. Packer C, Ruttan L. The evolution of cooperative hunting. The American Naturalist. 1988;132:159–198. doi: 10.1086/284844. [DOI] [Google Scholar]
  50. Rashid T. Monotonic value function factorisation for deep multi-agent reinforcement learning. The Journal of Machine Learning Research. 2020;21:7234–7284. [Google Scholar]
  51. Samejima K, Ueda Y, Doya K, Kimura M. Representation of action-specific reward values in the striatum. Science. 2005;310:1337–1340. doi: 10.1126/science.1115270. [DOI] [PubMed] [Google Scholar]
  52. Sampaio E, Seco MC, Rosa R, Gingins S. Octopuses punch fishes during collaborative interspecific hunting events. Ecology. 2021;102:e03266. doi: 10.1002/ecy.3266. [DOI] [PubMed] [Google Scholar]
  53. Schaul T, Quan J, Antonoglou I, Silver D. Prioritized Experience Replay. arXiv. 2015 https://arxiv.org/abs/1511.05952
  54. Scheel D, Packer C. Group hunting behaviour of lions: a search for cooperation. Animal Behaviour. 1991;41:697–709. doi: 10.1016/S0003-3472(05)80907-8. [DOI] [Google Scholar]
  55. Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science. 1997;275:1593–1599. doi: 10.1126/science.275.5306.1593. [DOI] [PubMed] [Google Scholar]
  56. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, van den Driessche G, Graepel T, Hassabis D. Mastering the game of Go without human knowledge. Nature. 2017;550:354–359. doi: 10.1038/nature24270. [DOI] [PubMed] [Google Scholar]
  57. Skinner BF. Contingencies of Reinforcement: A Theoretical Analysis. BF Skinner Foundation; 2014. [Google Scholar]
  58. Smith JM. Evolution and the Theory of Games. Cambridge university press; 1982. [DOI] [Google Scholar]
  59. Son K, Kim D, Kang WJ, Hostallero DE, Yi Y. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. International conference on machine learning; 2019. pp. 5887–5896. [Google Scholar]
  60. Stander PE. Cooperative hunting in lions: the role of the individual. Behavioral Ecology and Sociobiology. 1992;29:445–454. doi: 10.1007/BF00170175. [DOI] [Google Scholar]
  61. Stanford CB. The hunting ecology of wild chimpanzees: Implications for the evolutionary ecology of pliocene hominids. American Anthropologist. 1996;98:96–113. doi: 10.1525/aa.1996.98.1.02a00090. [DOI] [Google Scholar]
  62. Steinegger M, Roche DG, Bshary R. Simple decision rules underlie collaborative hunting in yellow saddle goatfish. Proceedings of the Royal Society B. 2018;285:20172488. doi: 10.1098/rspb.2017.2488. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Sunehag P. Value-Decomposition Networks for Cooperative Multi-Agent Learning. arXiv. 2017 https://arxiv.org/abs/1706.05296
  64. Sutton RS, Barto AG. Toward a modern theory of adaptive networks: expectation and prediction. Psychological Review. 1981;88:135–170. [PubMed] [Google Scholar]
  65. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. MIT press; 2018. [Google Scholar]
  66. Tan M. Multi-agent reinforcement learning: Independent vs. cooperative agents. Proceedings of the tenth international conference on machine learning; 1993. pp. 330–337. [Google Scholar]
  67. Tesauro G. Extending q-learning to general adaptive multi-agent systems. Advances in Neural Information Processing Systems.2003. [Google Scholar]
  68. Thiebault A, Semeria M, Lett C, Tremblay Y. How to capture fish in a school? Effect of successive predator attacks on seabird feeding success. The Journal of Animal Ecology. 2016;85:157–167. doi: 10.1111/1365-2656.12455. [DOI] [PubMed] [Google Scholar]
  69. Tsutsui K, Shinya M, Kudo K. Human navigational strategy for intercepting an erratically moving target in chase and escape interactions. Journal of Motor Behavior. 2020;52:750–760. doi: 10.1080/00222895.2019.1692331. [DOI] [PubMed] [Google Scholar]
  70. Tsutsui K, Fujii K, Kudo K, Takeda K. Flexible prediction of opponent motion with internal representation in interception behavior. Biological Cybernetics. 2021;115:473–485. doi: 10.1007/s00422-021-00891-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Tsutsui K, Takeda K, Fujii K. Emergence of collaborative hunting via multi-agent deep reinforcement learning. International Conference on Pattern Recognition; 2022. pp. 210–224. [Google Scholar]
  72. Vail AL, Manica A, Bshary R. Referential gestures in fish collaborative hunting. Nature Communications. 2013;4:1–7. doi: 10.1038/ncomms2781. [DOI] [PubMed] [Google Scholar]
  73. van der Maaten L, Hinton G. Visualizing data using t-sne. Journal of Machine Learning Research. 2008;9:2579–2605. [Google Scholar]
  74. Van Hasselt H, Guez A, Silver D. Deep Reinforcement Learning with Double Q-Learning. Proceedings of the AAAI Conference on Artificial Intelligence; 2016. pp. 2094–2100. [DOI] [Google Scholar]
  75. Wang Z. Dueling network architectures for deep reinforcement learning. International conference on machine learning.2016. [Google Scholar]
  76. Wilson AM, Hubel TY, Wilshin SD, Lowe JC, Lorenc M, Dewhirst OP, Bartlam-Brooks HLA, Diack R, Bennitt E, Golabek KA, Woledge RC, McNutt JW, Curtin NA, West TG. Biomechanics of predator–prey arms race in lion, zebra, cheetah and impala. Nature. 2018;554:183–188. doi: 10.1038/nature25479. [DOI] [PubMed] [Google Scholar]
  77. Wolpert DM, Miall RC, Kawato M. Internal models in the cerebellum. Trends in Cognitive Sciences. 1998;2:338–347. doi: 10.1016/S1364-6613(98)01221-2. [DOI] [PubMed] [Google Scholar]
  78. Wynne CD. Animal Cognition: The Mental Lives of Animals. Palgrave MacMillan; 2001. [Google Scholar]
  79. Yoshida W, Dolan RJ, Friston KJ. Game theory of mind. PLOS Computational Biology. 2008;4:e1000254. doi: 10.1371/journal.pcbi.1000254. [DOI] [PMC free article] [PubMed] [Google Scholar]
  80. Yu C. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems; 2022. pp. 24611–24624. [Google Scholar]

Editor's evaluation

Malcolm A MacIver 1

Cooperative hunting is typically attributed to certain mammals (and select birds) which express highly complex behaviors. This paper makes the valuable finding that in a highly idealized open environment, cooperative hunting can emerge through simple rules. This has implications for a reassessment, and perhaps a widening, of what groups of animals are believed to manifest cooperative hunting.

Decision letter

Editor: Malcolm A MacIver1
Reviewed by: Malcolm A MacIver2

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Collaborative hunting in artificial agents with deep reinforcement learning" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Michael Frank as the Senior Editor.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions (for the authors):

1) There are extensive remarks from the reviewers on the validity of some of the more theoretical claims. These need to be carefully addressed. Examples include whether it is correct to infer that more complex cognition is not needed for cooperative hunting because DQN can solve the problem; The absence of an explicit model does not mean that there isn't an implicit model of the other agents behaviors that is encoded in the neural network weights.

2) The lack of causal analysis either needs to be addressed or the "mediated by"-type claims need to be tempered.

3) There has been a translation of the DQN into simple rules, but at present the discussion is too incomplete for readers to understand why this was done and what can be concluded.

4) A discussion of the limitations of this work in terms of the absence of things like shared value functions increasingly common in multi-agent RL and absence of partially observable environments common in predator-prey dynamics, would be helpful.

5) Please address R3's comment that the paper's analysis is possible without RL, such as via behavioral cloning.

6) Some of the reviews suggest better coverage of the relevant literature. Given the size of this literature, this has to be selective, but it appears some effort could be expended to further improve the scholarship of the work.

Reviewer #1 (Recommendations for the authors):

Figure 2 – The gap between individual and shared only occurs at two predators, for both % successful predation and duration. It would be helpful to have a discussion of this point. Wild dog packs, for example, are typically larger: perhaps this because the space they work over is much larger (relative to disk size), the environment complexity, or something else, but in any case it would be interesting to know whether shared vs individual is the ruling condition.

It also suggests that sharing is not needed in the 3 predator situation to obtain the same results. Does that mean that the work is suggesting cooperation even occurs without sharing? This seems to be a significant problem, since it's hard to imagine how the term "collaboration" or "cooperation" can be applied in the absence of shared reward. If it is strictly a matter of reduction of duration and increase in rate of success, it may equate to a more limited form of cooperation? Are their biological analogs of group hunts without sharing?

186 – We found that the mappings resulting in collaborative hunting were mediated by distance-dependent internal representations.

'Mediated' here seems to play the role of a "filler term" as used in neuroscience (see Krakauer et al. 2017 Neuroscience needs behavior).

Only correlations have been shown, but this is a causal claim. To support the causal claim, it would be necessary to intervene in the network and show that the interventions in the internal representations have the predicted causal role.

194 – The organization of this paragraph might be better reversed. One could argue that Figure S8 (which could be referenced here) providing similar results and DQN helps support the hypothesis that the representation of distance in the network plays a causal role in the outcome.

234 – Initial position of each episode is unclear as previously noted.

236 – The text above says -1 for moving out of arena to the prey, so if the prey moves out, is it just that, or does the predator also get +1 since the predator is now deemed "successful"?

Reviewer #2 (Recommendations for the authors):

– I am not clear that there is sufficient evidence for lines 172-174. The absence of an explicit model does not mean that there isn't an implicit model of the other agents behaviors that is encoded in the neural network weights. I'm not clear what sort of experiment would allow you to distinguish this though there might be a way to run a linear probe to confirm that this information is not in the network weights?

– The notation in lines 246-251 is confusing because it alternates between POMDP notation (i.e. that the agent gets an observation that is a transformation of the true state) and MDP notation. Is the setup an MDP or a POMDP?

– Perhaps line 259 should be "by finding a fixed point of the Bellman equations?"

– Line 271 should be "Dueling Networks" not "Dueling Network" – The sentence starting on line 271 and ending on 273 could or should be cut entirely as it doesn't provide much value and I think it's debatable whether DQN was the first algorithm to solve a high dimensional input problem; it very much depends on how you define high dimensional.

– To get equation 3 from equation 2, there needs to be a factor of 1/2 somewhere.

– In line 321 I don't know what identifiability means in the context of Q-learning? Is this a technical term used in some subfield that works on Q-learning? Why does subtracting the mean help with "identifiability"?

– A discount factor of 0.9 is a wildly low discount factor, basically leading agents to only care about the next 10 steps. I don't think this necessarily affects the outcome of your project or necessarily requires any changes as I don't think agents need to do long horizon reasoning here, but it's worth keeping in mind!

– I don't fully understand the claim that this expands the range of things that are understood to be possible to learn via associative learning. There's no theory precluding a model-free algorithm from learning this type of behavior so the claim in the discussion strikes me as odd. In practice, this type of result where model-free RL agents successfully hunt together have been around since the release of the multi-particle envs (see https://proceedings.neurips.cc/paper/2017/hash/68a9750337a418a86fe06c1991a1d64c-Abstract.html)

– I think the rule-based model is neat but I don't understand what question it answers. Did I perhaps miss something?

– I don't find the evidence for the distance-dependent features compelling; is all of the evidence for it the t-SNE embeddings?

– Lines 194-196 are confusing to me. Why does there being a rule-based model imply your DQN agent is also learning a similar rule-based model?

Reviewer #3 (Recommendations for the authors):

My largest suggestion is to fit a linear model to rule-based behaviors and compare the t-SNE embeddings of the behavioral cloning policy with the embeddings of the RL policy. Is the use of RL truly important for this paper?

Around line 362, the idea of rule-based agents and human-controlled agents is also introduced. I would like to see linear models that take as observations the rule-based agents' observations and output the rule-based agents' actions. Would the t-SNE embeddings look similar for these linear models and for the RL-trained models? If the embeddings look similar, what does that say about the emergence of these capabilities as a result of RL? Does training via RL even matter? Do we care if it doesn't matter?

There is a large amount of work on multi-agent learning that this paper seemingly ignores, or fails to evaluate against. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments has thousands of citations. However, I am willing to accept that there are limitations to what a single paper can cover.

More specific comments:

Line 46-47. I do not know what "simple descriptions with the distances can reproduce similar behavior" is trying to convey.

Lines 50-51: "Our approach of computational ecology bridges the gap between ecology, ethology, and neuroscience and may provide a comprehensive account of them." This is probably too strong of a claim.

Figure 1: The architecture diagram is a little difficult to understand. The key has "layer with ReLU" but then I do not see any clear white boxes? I also do not see any clear units? I think that maybe this is happening inside of the "prey," "predator 1," etc boxes. However, this is all much too small. I think you should decide if you want this figure to be about the neural network architecture, or about the fact the environment is broken into 1 prey and N predators that share an observation.

I think the actions are also not clear. There are probably too many lines in the figure.

For Figure 1 (b), why not just plot the actual density? Actually, I see this is included in Figure 2. I think this is the more helpful Figure!

In Figure 2, what form of Hypothesis testing was used? Was this a KS test? You can't assume the distributions are Gaussian? The presence of a chi-squared statistic seems to indicate Gaussian assumptions. But the distribution is strongly non-Gaussian in this case. A little more clarity would be helpful.

Line 132 mentions that the variance is higher over the action distribution when the predator is about to catch the prey? This is actually the exact opposite of my intuition. I think that the actions hardly matter when the prey is far away, so there is no obvious optimal action, and the choice would be closer to uniform. I'm not sure that this matters very much, but it's interesting.

Line 325 – Usually IL is reserved for Imitation Learning. I have never seen it used for Independent Learning.

Line 324 – I think biological organisms usually model the behavior of other organisms and account for it while planning.

Line 212 – Q-values do implicitly model the competencies of other agents.

Line 196 – What does it mean to switch the decision rules with the distances?

Overall, I think the problems considered by this paper are interesting. And I am happy you took the time to write it. This work made me think a lot about my own research. I appreciate your efforts here. Thank you.

eLife. 2024 May 7;13:e85694. doi: 10.7554/eLife.85694.sa2

Author response


Essential revisions (for the authors):

1) There are extensive remarks from the reviewers on the validity of some of the more theoretical claims. These need to be carefully addressed. Examples include whether it is correct to infer that more complex cognition is not needed for cooperative hunting because DQN can solve the problem; The absence of an explicit model does not mean that there isn't an implicit model of the other agents behaviors that is encoded in the neural network weights.

Thank you for your constructive criticism regarding the validity of our theoretical claims. We have carefully considered the extensive remarks from both you and the reviewers, and have engaged in in-depth discussions with field experts, including a leading scientist on chimpanzee cognition, to address them. As you and the reviewers have pointed out, it is challenging to verify whether an implicit model of other agents’ behaviors is encoded within the neural network weights. Therefore, we have revised our manuscript to refine our claims to specifically address the absence of aspects of “theory of mind”, as it is certain that the agents in our study do not model or infer the “mental states” of others. Specifically, we have revised the description “high-level cognition such as sharing intentions among predators or modeling competencies of other agents” pointed out by the reviewers to “high-level cognition such as aspects of theory of mind” throughout the manuscript. Although this revision may narrow or moderate our argument, we believe it significantly enhances the precision and accuracy of our discussion. Furthermore, by focusing on aspects of theory of mind, we can create a clear distinction from previous studies that incorporated comparable explicit mechanisms, thus aiding the reader's comprehension of our claims and the future directions of our study. We believe these revisions more accurately address the concerns you and the reviewers have raised and ensure our theoretical claims. We are grateful for the guidance that has helped us in improving the manuscript.

2) The lack of causal analysis either needs to be addressed or the "mediated by"-type claims need to be tempered.

Thank you for your insightful comment regarding the lack of causal analysis. We have carefully considered your critique and agree that a more cautious approach is warranted in the absence of a direct causal analysis. In response, we have tempered our claims throughout the manuscript to reflect this. We have modified any statements that may have implied a stronger causal relationship than is currently supported by the data, ensuring that our descriptions accurately represent the correlative nature of our findings, such as being “related to” or “associated with” specific outcomes.

3) There has been a translation of the DQN into simple rules, but at present the discussion is too incomplete for readers to understand why this was done and what can be concluded.

We appreciate your feedback on the translation of the DQN into simple rules and the need for a more comprehensive discussion to aid reader comprehension. As mentioned above, we have revised our manuscript to focus our claims on the absence of aspects of theory of mind. Additionally, in line with the reviewer’s comment, we have revised the sentence in the Introduction section as follows (lines 36 to 39 in the revised manuscript):

“Given that associative learning is likely to be the most widely adopted learning mechanism in animals, collaborative hunting could arise through associative learning, where simple decision rules are developed based on behavioral cues (i.e., contingencies of reinforcement).”

Furthermore, we have added the following sentence to the Introduction section (lines 44 to 46 in the revised manuscript):

“Notably, our predator agents successfully learned to collaborate in capturing their prey solely through a reinforcement learning algorithm, without employing explicit mechanisms comparable to aspects of theory of mind.”

These revisions and additions aim to provide a clearer exposition of why the translation was undertaken and to discuss more explicitly what conclusions can be drawn from it. We hope that these changes will make the ideas more accessible and the underlying reasoning more transparent to our readers.

4) A discussion of the limitations of this work in terms of the absence of things like shared value functions increasingly common in multi-agent RL and absence of partially observable environments common in predator-prey dynamics, would be helpful.

We are grateful for your suggestion to discuss the limitations of our work. In response to your feedback, we have incorporated a discussion of these limitations within the existing conclusion paragraph of our manuscript. In this revision, we have included the “partial observability” alongside other elements of predator-prey dynamics. We have also added the sentence regarding the “shared value functions” as potential directions for future research to suggest areas that could benefit from further exploration. The revised conclusion paragraph is as follows (lines 223 to 234 in the revised manuscript):

“In conclusion, we demonstrated that the decisions underlying collaborative hunting among artificial agents can be achieved through mappings between states and actions. This means that collaborative hunting can emerge in the absence of explicit mechanisms comparable to aspects of theory of mind, supporting the recent idea that collaborative hunting does not necessarily rely on complex cognitive processes in brains. Our computational ecology is an abstraction of a real predator-prey environment. Given that chase and escape often involve various factors, such as energy cost, partial observability, signal communication, and local surroundings, these results are only a first step on the path to understanding real decisions in predator-prey dynamics. Furthermore, exploring how mechanisms comparable to aspects of theory of mind or the shared value functions, which are increasingly common in multi-agent reinforcement learning, play a role in these interactions could be an intriguing direction for future research. We believe that our results provide a useful advance toward understanding natural value-based decisions and forge a critical link between ecology, ethology, psychology, neuroscience, and computer science.”

We believe that these revisions will greatly assist our readers in understanding the scope and implications of our work.

5) Please address R3's comment that the paper's analysis is possible without RL, such as via behavioral cloning.

Thank you for your comment about the possibility of conducting our analysis without the use of RL. We have considered this perspective and have realized that indeed certain analyses could be substituted with behavioral cloning in some cases. Nevertheless, we would like to emphasize that the use of RL brings clarity to several aspects of our study. We describe the reasons for this below from the perspectives of both data and analysis.

Data: Collaborative hunting data is generally scarce, and to our knowledge, no extensive dataset exists with complete locational data on all individuals during hunts. Furthermore, in many cases, obtaining completely controlled data from field observations is challenging. For example, data collected in the wild tend to exhibit some biases (Lang and Farine, 2017). Consequently, we believe that the controlled comparisons presented in our Figure 2 would be difficult to achieve without RL, which makes them notable results.

Analysis: Additionally, we believe that even with a large and controlled dataset, limitations exist in inferring the decision-making process from behavioral cloning results and the visualization of internal representations. One such limitation is the prediction accuracy of behavioral cloning, which would not be 100%. Our additional analysis, conducted in response to Reviewer 3's suggestion, demonstrated that prediction accuracy was, at best, 60-70% (Figure 3 supplement 7). In such a case, it would be difficult to rule out the possibility that complex cognitive processes are involved in the remaining percentage. Therefore, although we need to be careful in the interpretation of what is being learned by the deep Q-network, as you and the reviewers repeatedly pointed out, the explicit architecture of RL agents strengthens our argument. Thus, even if sufficient data were available, RL would still be meaningful for our analysis.

6) Some of the reviews suggest better coverage of the relevant literature. Given the size of this literature, this has to be selective, but it appears some effort could be expended to further improve the scholarship of the work.

Thank you for pointing that out. Upon reflection, we agree with you and the reviewers that the original manuscript could benefit from more comprehensive coverage of the relevant literature. We have reviewed additional papers focusing mainly on multi-agent reinforcement learning and predator-prey dynamics and have incorporated these into the references in our manuscript.

Reviewer #1 (Recommendations for the authors):

Figure 2 – The gap between individual and shared only occurs at two predators, for both % successful predation and duration. It would be helpful to have a discussion of this point. Wild dog packs, for example, are typically larger: perhaps this is because the space they work over is much larger (relative to disk size), the environment complexity, or something else, but in any case it would be interesting to know whether shared vs individual is the ruling condition.

We appreciate your comment on the gap between the individual and shared conditions in the proportion of successful predation and duration. We have added a discussion in our manuscript that explores the reasons behind the comparable performance in scenarios involving three predators, whether the prey was shared or not (lines 192 to 199 in the revised manuscript). As you suggested, we considered spatial constraints as a contributing factor to this outcome. Our analysis revealed that predators occasionally exploit the play area's boundaries and the movement of other predators to block the prey's escape path. To illustrate these dynamics, we have added a supplementary figure (Figure 2 supplement 4). We believe this addition will provide a clearer understanding of our results and hope you find this enhancement informative.

It also suggests that sharing is not needed in the 3 predator situation to obtain the same results. Does that mean that the work is suggesting cooperation even occurs without sharing? This seems to be a significant problem, since it's hard to imagine how the term "collaboration" or "cooperation" can be applied in the absence of shared reward. If it is strictly a matter of reduction of duration and increase in rate of success, it may equate to a more limited form of cooperation? Are there biological analogs of group hunts without sharing?

Thank you for your point regarding “cooperation” in the absence of reward sharing. We agree with your comment that the significant improvement in success rates and reduction in hunting duration, compared with the theoretical predictions based on solitary hunting results, suggests a more limited form of cooperation. This concept has an analog in interspecific group hunting. For example, giant moray eels and groupers have been reported to hunt together even though they do not share the prey (Bshary et al., 2006); their repeated interactions may eventually lead to a distribution of prey between both predators. This could be a mutually beneficial relationship that emerges over time without direct reward sharing. Our results also showed a consequent distribution of prey (see Figure 2 supplement 3), suggesting the potential emergence of this form of cooperation.

186 – We found that the mappings resulting in collaborative hunting were mediated by distance-dependent internal representations.

'Mediated' here seems to play the role of a "filler term" as used in neuroscience (see Krakauer et al. 2017 Neuroscience needs behavior).

Only correlations have been shown, but this is a causal claim. To support the causal claim, it would be necessary to intervene in the network and show that the interventions in the internal representations have the predicted causal role.

Thank you for your constructive comment concerning the use of the term “mediated”. After reviewing the paper by Krakauer et al. 2017 that you referenced, we have understood that our use of “mediated” inappropriately suggested a causal claim in the absence of direct causal analysis. We have therefore revised our manuscript to more accurately reflect the correlational nature of our findings. Specifically, as the editor suggested, we have tempered our statements throughout the manuscript to ensure that it does not imply causality.

194 – The organization of this paragraph might be better reversed. One could argue that Figure S8 (which could be referenced here) providing similar results and DQN helps support the hypothesis that the representation of distance in the network plays a causal role in the outcome.

Thank you for your suggestion regarding the organization of the paragraph. We have revised the text and its sequence to better illustrate the role of the additional analysis with the rule-based model. The revised paragraph reads as follows (lines 200 to 211 in the revised manuscript):

“We found that the mappings resulting in collaborative hunting were related to distance-dependent internal representations. Additionally, we showed that the distance-dependent rule-based predators successfully reproduced behaviors similar to those of the deep reinforcement learning predators, supporting the association between decisions and distances (Methods; Figure 3 supplements 5, 6, and 7). Deep reinforcement learning has held the promise for providing a comprehensive framework for studying the interplay among learning, representation, and decision making, but such efforts for natural behavior have been limited. Our result that the distance-dependent representations relate to collaborative hunting is reminiscent of a recent idea about the decision rules obtained by observation in fish. Notably, the input variables of predator agents do not include variables corresponding to the distance(s) between the other predator(s) and prey, and this means that the predators in the shared conditions acquired the internal representation relating to distance to prey, which would be a geometrically reasonable indicator, by optimization through interaction with their environment. Our results suggest that deep reinforcement learning methods can extract systems of rules that allow for the emergence of complex behaviors.”

234 – Initial position of each episode is unclear as previously noted.

We thank you for your comment on the initial positions of the agents in each episode. We have revised the manuscript to provide a more precise description of the agents’ initial positioning. These details are described in the second paragraph of the Results section and in the Environment subsection of the Methods section (lines 77 to 79, and 247 to 248, in the revised manuscript, respectively).

236 – The text above says -1 for moving out of arena to the prey, so if the prey moves out, is it just that, or does the predator also get +1 since the predator is now deemed "successful"?

We apologize for any confusion caused by the description of the reward and successful predation. During the training phase, if the prey moves out of the arena, the predator does not receive a positive reward. We determined that it was not appropriate for the predator to be rewarded in such instances, especially during the early stages of learning, as the movement of the prey outside the arena is often not directly related to the predator's actions. On the other hand, in the evaluation phase, we regard such instances as “successful predation”. This is because, even with trained prey, there are instances where they may exit the arena in an attempt to evade predators, particularly when multiple predators are involved. In such scenarios, it seems reasonable to regard the prey's moving out as indicative of successful predation. To facilitate the reader's understanding, we have added the following clarification to the second paragraph of the Results section (lines 82 to 85 in the revised manuscript):

“During the evaluation phase, if the predator captured the prey within the time limit, the predator was deemed successful; otherwise, the prey was considered successful. Additionally, if one side (predators/prey) moved out of the area, the other side (prey/predators) was deemed successful.”

Furthermore, to avoid any confusion among our readers, we have moved the original description you pointed out from the Environment subsection in the Methods section to the Evaluation subsection (lines 360 to 369 in the revised manuscript). We believe these changes make the paper clearer and more reader-friendly and hope this explanation resolves your concerns. Thank you for bringing this to our attention.
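For readers who prefer a concrete description of this logic, the following is a purely illustrative sketch of how the training-time reward can differ from the evaluation-time outcome when an agent leaves the arena; it is not our actual implementation, and the function and variable names are hypothetical.

```python
# Illustrative sketch only; names and structure are hypothetical, not the paper's code.

def predator_reward_training(caught_prey, predator_out_of_arena):
    """Training-time reward: +1 only for an actual capture, -1 for leaving the arena.
    The prey leaving the arena yields no positive reward during training."""
    if caught_prey:
        return 1.0
    if predator_out_of_arena:
        return -1.0
    return 0.0

def episode_outcome_evaluation(caught_prey, prey_out_of_arena,
                               predator_out_of_arena, time_up):
    """Evaluation-time outcome: either side moving out hands success to the other side,
    and running out the clock counts as success for the prey."""
    if caught_prey or prey_out_of_arena:
        return "predators_successful"
    if predator_out_of_arena or time_up:
        return "prey_successful"
    return "ongoing"
```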

Reviewer #2 (Recommendations for the authors):

– I am not clear that there is sufficient evidence for lines 172-174. The absence of an explicit model does not mean that there isn't an implicit model of the other agents behaviors that is encoded in the neural network weights. I'm not clear what sort of experiment would allow you to distinguish this though there might be a way to run a linear probe to confirm that this information is not in the network weights?

Thank you for your constructive criticism regarding the validity of our theoretical claims. We have carefully considered the feedback from you and the other reviewers, and have engaged in in-depth discussions with field experts, including a leading scientist on chimpanzee cognition, to address these concerns. As you suggested, it is challenging to verify whether an implicit model of other agents’ behaviors is encoded within the neural network weights. Therefore, we have revised our manuscript to refine our claims to specifically address the absence of aspects of “theory of mind”, as it is certain that the agents in our study do not model or infer the “mental states” of others. Although this revision may narrow or moderate our argument, we believe it significantly enhances the precision and accuracy of our discussion and more faithfully addresses the concerns you and the reviewers have raised. We are grateful for the guidance that has helped us to improve the manuscript.
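Regarding the linear probe you mention, one possible form such an analysis could take is sketched below. We did not run this analysis; the file and variable names are hypothetical placeholders, and the sketch is included only to make the idea concrete.

```python
# Hypothetical linear-probe sketch: can a quantity about the other agent (e.g., its
# distance to the prey) be decoded linearly from a predator's hidden activations?
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

hidden = np.load("predator1_hidden_states.npy")    # (n_steps, n_units), hypothetical file
target = np.load("predator2_prey_distance.npy")    # (n_steps,), hypothetical file

probe = Ridge(alpha=1.0)
scores = cross_val_score(probe, hidden, target, cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean())
# A near-zero R^2 would suggest the information is not linearly decodable,
# although it could still be present in a nonlinear form.
```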

– The notation in lines 246-251 is confusing because it alternates between POMDP notation (i.e. that the agent gets an observation that is a transformation of the true state) and MDP notation. Is the setup an MDP or a POMDP?

We apologize for any confusion caused by the inconsistent notation. The setup we used is an MDP, not a POMDP. We have corrected the relevant descriptions in the Methods section and the notation in Figure 1a to consistently reflect an MDP framework (lines 257 to 263 in the revised manuscript). Thank you for bringing this to our attention, and we appreciate your patience as we rectify this error.
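For reference, the fully observable formulation we adopted corresponds to the standard MDP tuple; the definition below is textbook material, added here only for clarity.

```latex
% Markov decision process: at each step the agent observes the true state s_t,
% selects a_t, and receives r_{t+1}; no separate observation function is needed.
\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle, \qquad
s_{t+1} \sim P(\cdot \mid s_t, a_t), \quad r_{t+1} = R(s_t, a_t, s_{t+1}).
```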

– Perhaps line 259 should be "by finding a fixed point of the Bellman equations?"

Thank you for your suggestion concerning the phrasing. We have amended the manuscript accordingly (line 271 in the revised manuscript). We appreciate your attention to detail and your assistance in enhancing the technical accuracy of our paper.
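For completeness, the fixed point referred to in the amended phrasing is that of the Bellman optimality equation:

```latex
% Bellman optimality equation; Q-learning converges toward its fixed point Q^*.
Q^{*}(s, a) = \mathbb{E}\!\left[\, r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \;\middle|\; s_t = s,\; a_t = a \,\right].
```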

– Line 271 should be "Dueling Networks" not "Dueling Network"

Thank you for pointing out the correct terminology. We have made the correction to “Dueling Networks” as you suggested (line 283 in the revised manuscript). Additionally, we have capitalized the initial letters of RL methods throughout the manuscript.

– The sentence starting on line 271 and ending on 273 could or should be cut entirely as it doesn't provide much value and I think it's debatable whether DQN was the first algorithm to solve a high dimensional input problem; it very much depends on how you define high dimensional.

We appreciate your critical feedback on the sentence. Upon reflection, we agree with your suggestion and have therefore removed it from the manuscript. Thank you for guiding us towards a more concise and accurate presentation of our work.

– To get equation 3 from equation 2, there needs to be a factor of 1/2 somewhere.

Thank you for pointing out the discrepancy between equations 2 and 3. We have included the factor of 1/2 to ensure the correctness of the equations. Again, we appreciate your attention to detail and your assistance with our work.
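For readers following the derivation, the generic pattern is shown below; this is only a sketch of where such a factor typically arises in a squared temporal-difference loss, not a verbatim reproduction of our equations 2 and 3.

```latex
% Generic example: the 1/2 in the squared-error loss cancels when differentiating.
L(\theta) = \tfrac{1}{2}\,\big( y_t - Q(s_t, a_t; \theta) \big)^{2}, \qquad
\nabla_{\theta} L(\theta) = -\big( y_t - Q(s_t, a_t; \theta) \big)\, \nabla_{\theta} Q(s_t, a_t; \theta).
```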

– In line 321 I don't know what identifiability means in the context of Q-learning? Is this a technical term used in some subfield that works on Q-learning? Why does subtracting the mean help with "identifiability?""

Thank you for your careful review and for bringing to our attention the term “identifiability”. We referred to the term as it was introduced in the paper by Wang et al. (2016) on Dueling Networks. However, after re-evaluating its usage based on your suggestion, we agree that subtracting the mean does not necessarily aid identifiability. Consequently, we have removed the related sentences from the Methods section of our manuscript and appreciate your guidance on this matter.
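For reference, the aggregation step of Dueling Networks (Wang et al., 2016), which we retain even after removing the wording about identifiability, is:

```latex
% Dueling aggregation: subtracting the mean advantage pins down V and A, since adding a
% constant to V and subtracting it from A would otherwise leave Q unchanged.
Q(s, a) = V(s) + \Big( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \Big).
```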

– A discount factor of 0.9 is a wildly low discount factor, basically leading agents to only care about the next 10 steps. I don't think this necessarily affects the outcome of your project or necessarily requires any changes as I don't think agents need to do long horizon reasoning here, but it's worth keeping in mind!

Thank you for your advice on the choice of the discount factor. We will certainly take this into consideration and pay close attention to the impact of different discount factors on agent behavior in future research!
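The reviewer's "next 10 steps" intuition corresponds to the usual effective-horizon heuristic:

```latex
% With discount gamma, a reward k steps ahead is weighted by gamma^k; the effective
% horizon is commonly taken as 1/(1 - gamma), i.e. 1/(1 - 0.9) = 10 steps, and
% gamma^k falls below 0.1 after roughly 22 steps (0.9^{22} \approx 0.098).
\sum_{k=0}^{\infty} \gamma^{k} = \frac{1}{1-\gamma} = 10 \quad \text{for } \gamma = 0.9 .
```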

– I don't fully understand the claim that this expands the range of things that are understood to be possible to learn via associative learning. There's no theory precluding a model-free algorithm from learning this type of behavior so the claim in the discussion strikes me as odd. In practice, this type of result where model-free RL agents successfully hunt together have been around since the release of the multi-particle envs (see https://proceedings.neurips.cc/paper/2017/hash/68a9750337a418a86fe06c1991a1d64c-Abstract.html)

Thank you for your input on our discussion about associative learning. After considering your perspective, we agree with your comment and have removed the related statements from the discussion.

– I think the rule-based model is neat but I don't understand what question it answers. Did I perhaps miss something?

The rule-based model was developed to support our claim that predator agents' decisions are related to distance-dependent internal representations. We examined the state vectors of the last hidden layers in each agent's network, which lead to action values after a single linear transformation and aggregation. With this in mind, we posited that if the neural networks encode the distances between predators and prey, a concise rule-based model based on these distances should replicate similar behaviors. This additional analysis, prompted by a reviewer's comments, sought to substantiate our claim. While successfully replicating predator behavior using a distance-dependent rule-based model does not completely prove that the RL agents' decisions are associated with the distances, it provides support for our assertion. Additionally, in response to Reviewer 1's suggestion, we have relocated the description of these results to a dedicated paragraph in the Discussion section that explores the relationship between agent decisions and distance-dependent representations (lines 201 to 203 in the revised manuscript), thereby clarifying the aim of this additional analysis for the reader.
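To give a concrete flavor of what we mean by a distance-dependent rule, a purely illustrative sketch is shown below; the actual rules are specified in the Methods, and the threshold, extrapolation factor, and helper names here are hypothetical stand-ins.

```python
# Purely illustrative sketch of a distance-dependent decision rule; the threshold and
# the chase/ambush-like behaviors are hypothetical, not the rules used in the paper.
import numpy as np

def rule_based_action(predator_pos, prey_pos, prey_vel, near_threshold=0.2):
    """Choose an acceleration direction from the predator-prey distance alone."""
    offset = prey_pos - predator_pos
    distance = np.linalg.norm(offset)
    if distance < near_threshold:
        # Close in: head straight at the prey's current position.
        direction = offset
    else:
        # Far away: aim at a short extrapolation of the prey's motion (an ambush-like move).
        direction = (prey_pos + 0.5 * prey_vel) - predator_pos
    return direction / (np.linalg.norm(direction) + 1e-8)
```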

– I don't find the evidence for the distance-dependent features compelling; is all of the evidence for it the t-SNE embeddings?

As mentioned in our previous response, our assertion that predator agents’ decisions are related to the distances is supported by analyses of rule-based modeling as well as t-SNE embeddings. These approaches aim to provide a comprehensive understanding of the role of the distances in the agents' decision processes. Additionally, as Reviewer 1 highlighted, the distances between predators and prey agents are directly related to their rewards, making it plausible that these distances factor into the computation of action values during decision-making. We believe that this evidence addresses your concerns.
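For readers who wish to see the form of the embedding analysis, a minimal sketch is given below; it is illustrative only, and the array files are hypothetical placeholders.

```python
# Illustrative t-SNE sketch: embed last-hidden-layer activations and color by distance.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

hidden = np.load("last_hidden_states.npy")        # (n_steps, n_units), hypothetical file
distance = np.load("predator_prey_distance.npy")  # (n_steps,), hypothetical file

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(hidden)
plt.scatter(embedding[:, 0], embedding[:, 1], c=distance, s=2, cmap="viridis")
plt.colorbar(label="predator-prey distance")
plt.show()
```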

– Lines 194-196 are confusing to me. Why does there being a rule-based model imply your DQN agent is also learning a similar rule-based model?

Thank you for your continued engagement with our work. While partially reiterating what was mentioned in our previous response, we would like to clarify the rationale behind employing a rule-based model. It is to demonstrate that if predator agents encode the distances between predators and prey within their neural networks, these distances could potentially be used to construct a simple rule-based model that replicates the agents’ behavior. We have recognized that the initial presentation of the rule-based model's description could have been abrupt and confusing. Consequently, we have moved this discussion to the fourth paragraph of the Discussion section (lines 200 to 211 in the revised manuscript) and have provided a detailed explanation of its purpose and implementation in the Methods section (lines 396 to 432 in the revised manuscript). This rearrangement aims to make the intent and methodology of the additional analysis clearer to the reader.

Reviewer #3 (Recommendations for the authors):

My largest suggestion is to fit a linear model to rule-based behaviors and compare the t-SNE embeddings of the behavioral cloning policy with the embeddings of the RL policy? Is the use of RL truly important for this paper?

Thank you for your substantial suggestion to compare the t-SNE embeddings of both the behavioral cloning policy and the RL policy. As advised, we implemented two types of networks: a linear network without any nonlinear transformation and a nonlinear network with ReLU activations, and have conducted the comparison as shown in Figure 3—figure supplement 7 (top: linear network, bottom: nonlinear network).

We found the results intriguing because they are somewhat similar to the RL embeddings, which we consider promising for potential application to other biological data we possess. We are grateful for this insightful recommendation, and details of these procedures and results have been added to the Methods section and Supplementary Figures (lines 433 to 452, and Figure 3 supplement 7, in the revised manuscript, respectively). However, as mentioned in our response to the editor, we firmly believe that the use of RL was essential for the outcomes presented in this paper. While some analyses could indeed be substituted with behavioral cloning, as this additional analysis has shown, we believe that the use of RL remains important for this paper. The reasons are described below from the perspectives of data and analysis.
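For concreteness, a minimal sketch of the two cloning networks compared in Figure 3 supplement 7 is given below; the layer sizes are placeholders rather than the values used in the supplement.

```python
# Illustrative behavioral-cloning models; dimensions are hypothetical placeholders.
import torch.nn as nn

STATE_DIM, N_ACTIONS, HIDDEN = 8, 13, 64  # placeholder sizes

linear_policy = nn.Linear(STATE_DIM, N_ACTIONS)   # no nonlinear transformation

nonlinear_policy = nn.Sequential(                 # ReLU network, closer to the RL network
    nn.Linear(STATE_DIM, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, N_ACTIONS),
)
# Both would be trained with a cross-entropy loss to predict the RL agent's chosen
# action from the same state input given to the RL network.
```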

Data: Collaborative hunting data is generally scarce, and to our knowledge, no extensive dataset exists with complete locational data on all individuals during hunts. Furthermore, in many cases, obtaining completely controlled data from field observations is challenging. For example, data collected in the wild tend to exhibit some biases (Lang and Farine, 2017). Consequently, we believe that the controlled comparisons presented in our Figure 2 would be difficult to achieve without RL, which makes them notable results.

Analysis: Additionally, we believe that even with a large and controlled dataset, there are limits to inferring the decision-making process from behavioral cloning results and visualizations of internal representations. One such limitation is the prediction accuracy of behavioral cloning, which would not reach 100%. Our additional analysis, conducted in response to your suggestion, demonstrated that prediction accuracy was, at best, 60-70% (Figure 3 supplement 7). In such a case, it would be difficult to rule out the possibility that complex cognitive processes account for the remaining percentage. Therefore, although we must be careful when interpreting what is learned by the deep Q-network, as you and the other reviewers repeatedly pointed out, the explicit architecture of the RL agents strengthens our argument. Thus, even if sufficient data were available, RL would still be meaningful for our analysis.

Around line 362, the idea of Rule based agents and human controlled agents are also introduced. I would like to see linear models that take as observations the rule-based agents observations and output the rule based agents actions. Would the t-SNE embeddings look similar for these linear models and for the RL-trained models? If the embeddings look similar, what does that say about the emergence of these capabilities as a result of RL? Does training via RL even matter? Do we care if it doesn't matter?

As mentioned in our response to your previous comment, we tested both the linear network you suggested and a nonlinear network that is more similar to the RL network. For both networks, we aligned the inputs with those given to the RL network. As demonstrated above, we found that the embeddings do separate to some extent based on the distances. Interestingly, despite the differences in prediction accuracy between the networks, the embeddings were quite similar. These findings suggest the usefulness of analyzing decision-making processes through behavioral cloning, as you have suggested, and show potential for application to biological data, as already noted. However, as previously mentioned, models with an explicit structure like RL agents bring clarity to our study and are essential in substantiating our arguments. The explicitness of the RL model architecture helps us to dissect and articulate the mechanisms more precisely.

There is a large amount of work on multi-agent learning that this paper seemingly ignores, or fails to evaluate against. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments has thousands of citations. However, I am willing to accept that there are limitations to what a single paper can cover.

We apologize for the oversight and appreciate your pointing out the omission of significant multi-agent learning literature. We understand the importance of situating our work within the broader research context and have now included relevant citations in our manuscript. We regret any impression of neglecting existing contributions. If there are any specific references you believe we should include, we would appreciate being informed.

More specific comments:

Line 46-47. I do not know what "simple descriptions with the distances can reproduce similar behavior" is trying to convey.

Thank you for your comment. We have removed the phrase you pointed out “simple descriptions with the distances can reproduce similar behavior” because we deemed it unnecessary. As mentioned in our response to Reviewer 2, the additional analysis was intended to support the insights gained from the t-SNE embedding analyses. To clarify the purpose of this additional analysis, we have relocated the relevant sentences to the paragraph in the Discussion section that deals with the relationship between agent decisions and internal representations (lines 200 to 211 in the revised manuscript). In addition, to improve clarity for the reader, further details have been consolidated in the Methods section and Supplementary Figures (lines 396 to 432, and Figure 3 supplements 5 to 7, in the revised manuscript, respectively).

Lines 50-51: "Our approach of computational ecology bridges the gap between ecology, ethology, and neuroscience and may provide a comprehensive account of them." This is probably too strong of a claim.

Thank you for your critique regarding the claim made in our manuscript. In accordance with your feedback, we have tempered the statement. Instead of suggesting a bridging of gaps across disciplines, we just assert that (lines 52 to 53 in the revised manuscript):

“Our results support the recent suggestions that the underlying processes facilitating collaborative hunting can be relatively simple.”

This revised statement focuses on the contribution of our work to the existing literature by providing evidence that supports current hypotheses about the simplicity of mechanisms underlying collaborative behavior.

Figure 1: The architecture diagram is a little difficult to understand. The key has "layer with ReLU" but then I do not see any clear white boxes? I also do not see any clear units? I think that maybe this is happening inside of the "prey," "predator 1," etc boxes. However, this is all much too small. I think you should decide if you want this figure to be about the neural network architecture, or about the fact the environment is broken into 1 prey and N predators that share an observation.

Thank you for your feedback on Figure 1. In line with your suggestions, we have revised the figure to better highlight the environmental setup involving one prey and N predators, thereby aiming to enhance readability and comprehension for readers. Specifically, we have removed the legends “unit”, “layer with ReLU”, “forward connection”, and “aggregating module”. Furthermore, for readers interested in a visualization of the network, we have referenced the Supplementary Figure (Figure 1 supplement 1) that illustrates the network architecture in the “Training details” subsection of the Methods section (line 342 in the revised manuscript).

I think the actions are also not clear. There are probably too many lines in the figure.

Thank you for your comment regarding the indication of actions in Figure 1a. The agents in our study can perform a total of 13 actions: acceleration in 12 directions plus an option to do nothing. The illustration in the “Action” part of Figure 1a accurately depicts these with 12 arrows. Your observation about the excess of lines might also relate to the four lines each for “State”, “Reward”, and “Action”, reflecting the independent learning framework employed in our study. Consolidating these lines into a single one could potentially obscure the individual learning processes of the agents. Therefore, while acknowledging that the figure may appear somewhat cluttered, we have opted to keep the distinct lines as they are to maintain clarity and avoid misunderstanding. We hope this clarification addresses your concerns.
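For concreteness, the 13-action set can be described as 12 unit acceleration directions spaced at 30° intervals plus a "do nothing" option, as in the following sketch; it is illustrative only, and the acceleration magnitude is omitted.

```python
# Sketch of the discrete action set: 12 acceleration directions plus "do nothing".
import numpy as np

angles = np.deg2rad(np.arange(12) * 30.0)                      # 0°, 30°, ..., 330°
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
actions = np.vstack([directions, np.zeros((1, 2))])            # 13th action: no acceleration
assert actions.shape == (13, 2)
```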

For Figure 1 (b), why not just plot the actual density? Actually, I see this is included in Figure 2. I think this is the more helpful Figure!

Thank you for your suggestion regarding Figure 1b. Indeed, we initially created heat maps to represent the data. However, we found that for conditions where episodes ended quickly, that is, the fast and equal conditions, the heat maps were predominantly influenced by the initial positions, resulting in a concentration of distribution in the center of the play area. Therefore, we decided to first present trajectories for each condition to capture the general behavior of the agents and then focused on providing a heat map for the slow condition, where the episode duration was longer and less influenced by the initial positions. Following your valuable feedback, we have added heat maps for the fast and equal conditions as Supplementary Figures to accommodate readers interested in visualizing the density across all conditions (Figure 2 supplement 1 in the revised manuscript). We hope this addition will be helpful.

In Figure 2, what form of Hypothesis testing was used? Was this a KS test? You can't assume the distributions are Gaussian? The presence of a chi-squared statistic seems to indicate Gaussian assumptions. But the distribution is strongly non-Gaussian in this case. A little more clarity would be helpful.

We appreciate your detailed attention to the statistical analysis in our manuscript. Based on the context of variability you have described, we believe that your comments refer to Figure 4 rather than Figure 2. Regarding the sample size of 10 per condition, we acknowledge that the central limit theorem may not provide a strong justification for the assumption of normality. However, we would like to emphasize the robustness of ANOVA when dealing with small sample sizes and its ability to yield reliable results even when data distributions deviate from normality. The lack of formal testing for normality is indeed a limitation, as noted in Statistics subsection in the Methods section of our manuscript (lines 469 to 481 in the revised manuscript). Yet, the ANOVA test has been widely recognized for its robustness, especially in the context of balanced designs, which is the case in our experimental setup. Moreover, the Holm-Bonferroni method has been applied to adjust for multiple comparisons, reducing the risk of Type I errors. We believe that these considerations, along with the conservative nature of our statistical correction methods, provide a reasonable basis to uphold the validity of our findings. Our approach aligns with common practices in the field, where the practical constraints of sample collection and experimental design often necessitate a balance between statistical ideals and real-world applicability.
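For readers unfamiliar with the correction, Holm's step-down procedure can be applied as in the following sketch; the p-values shown are illustrative, not those reported in the manuscript.

```python
# Illustrative Holm-Bonferroni correction of p-values from pairwise comparisons.
from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.030, 0.004, 0.210]  # hypothetical uncorrected p-values
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(reject, p_adjusted)
```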

Line 132 mentions that the variance is higher over the action distribution when the predator is about to catch the prey? This is actually the exact opposite of my intuition. I think that the actions hardly matter when the prey is far away, so there is no obvious optimal action, and the choice would be closer to uniform. I'm not sure that this matters very much, but it's interesting.

Thank you for your comment regarding the variance in action values. We appreciate this opportunity to clarify the interpretation of the variance of action values in our study. A larger variance in action values indicates a situation where there is a significant distinction in the value of possible actions, with some actions being highly valued and others much less so. Conversely, a smaller variance suggests that there is little difference in the value of actions, with the distribution of action values being closer to uniform. This can be observed in Figure 3 supplement 4 in the revised manuscript, where the action values of predator 2 indeed approach a uniform distribution when the prey is distant. Therefore, it seems our findings are consistent with your intuition that when the prey is far away, the actions matter less. If there is any misunderstanding, we would be grateful for the opportunity to ensure our interpretations align with the observed behavior.
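To make this interpretation concrete (illustrative numbers only, not values from our agents):

```python
# Variance of action values: large when one action clearly stands out, small when the
# values across actions are close to uniform.
import numpy as np

q_near_prey = np.array([0.9, 0.1, 0.1, 0.1])      # one action stands out -> large variance
q_far_prey  = np.array([0.31, 0.30, 0.29, 0.30])  # nearly uniform -> small variance
print(np.var(q_near_prey), np.var(q_far_prey))    # ~0.12 vs ~5e-5
```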

Line 325 – Usually IL is reserved for Imitation Learning. I have never seen it used for Independent Learning.

Thank you for bringing this to our attention. We have removed the abbreviation “IL” for Independent Learning to avoid any confusion.

Line 324 – I think biological organisms usually model the behavior of other organisms and account for it while planning.

Thank you for your insightful comment regarding the modeling of biological behaviors. Our initial intention was to illustrate that each policy network in our computational model operates independently, similar to individual neural processes in biological brains, without the shared network weights that are often used in multi-agent reinforcement learning environments. Nonetheless, we agree with your observation that the lack of explicit mechanisms for modeling and planning may not entirely reflect the intricacies of biological organisms. Consequently, we have revised the manuscript to remove the term “biologically plausible” from the Results and Methods sections to prevent any overstatement of our computational model's capabilities. The revised description is as follows (lines 331 to 332 in the revised manuscript):

“We here modeled an agent (predator/prey) with independent learning, one of the simplest approaches to multi-agent reinforcement learning.”

We believe this modification more accurately conveys our methodology and the scope of our study.

Line 212 – Q-values do implicitly model the competencies of other agents.

Thank you for pointing this out. As you and the other reviewers noted, we have recognized that deep Q-networks implicitly model the competencies of other agents. Therefore, we have revised our manuscript to refine our claims to specifically address the absence of aspects of “theory of mind”, as it is certain that the agents in our study do not model or infer the “mental states” of others. Although this revision may narrow or moderate our argument, we believe it significantly enhances the precision and accuracy of our discussion. We are grateful for the guidance that has helped us to improve the manuscript.

Line 196 – What does it mean to switch the decision rules with the distances?

Thank you for your comment. We have removed the phrase you pointed out “switch the decision rules with the distances” because we deemed it ambiguous as you suggested. As mentioned in our response to your previous comment, we have rearranged the relevant paragraph in the Discussion section to clarify the purpose of this additional analysis (lines 200 to 211 in the revised manuscript). We are grateful for the guidance that has helped us to improve the manuscript.

Overall, I think the problems considered by this paper are interesting. And I am happy you took the time to write it. This work made me think a lot about my own research. I appreciate your efforts here. Thank you.

Thank you for your positive feedback. We are delighted to hear that our paper has sparked further thought about your research. It is encouraging to know that the problems we have addressed are considered interesting within the research community. Your kind words are greatly appreciated.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Citations

    1. Tsutsui K, Tanaka R, Takeda K, Fujii K. 2023. Dataset. figshare. [DOI]

    Supplementary Materials

    MDAR checklist

    Data Availability Statement

    The data and models used in this study are available at https://doi.org/10.6084/m9.figshare.21184069.v3. The code for computational simulation and figures is available at https://github.com/TsutsuiKazushi/collaborative-hunting (copy archived at Kazushi, 2023).

    The following dataset was generated:

    Tsutsui K, Tanaka R, Takeda K, Fujii K. 2023. Dataset. figshare.

