Summary
During dynamic social interaction, inferring and predicting others’ behaviors through theory of mind (ToM) is crucial for obtaining benefits in cooperative and competitive tasks. Current multi-agent reinforcement learning (MARL) methods primarily rely on agent observations to select behaviors, but they lack inspiration from ToM, which limits performance. In this article, we propose a multi-agent ToM decision-making (MAToM-DM) model, which consists of a MAToM spiking neural network (MAToM-SNN) module and a decision-making module. We design two brain-inspired ToM modules (Self-MAToM and Other-MAToM) to predict others’ behaviors based on self-experience and observations of others, respectively. Each agent can adjust its behavior according to the predicted actions of others. The effectiveness of the proposed model has been demonstrated through experiments conducted in cooperative and competitive tasks. The results indicate that integrating the ToM mechanism can enhance cooperation and competition efficiency and lead to higher rewards compared with traditional MARL models.
Highlights
- ToM helps agents infer others’ actions by self-modeling or modeling others
- Agents with ToM optimize self-policy after considering others’ future actions in MARL
- We build a brain-inspired SNN model to simulate the function and mechanism of ToM
- The model exhibits superior performance on cooperative and competitive tasks
The bigger picture
Theory of mind (ToM), a kind of high-level social cognitive ability, enables individuals to infer others’ mental states and thus explain and predict others’ behavior. ToM plays a crucial role in human interaction. Currently, in multi-agent systems, agents without ToM make decisions based on observations of the environment, often ignoring the impact of other agents’ future behavior on current decisions. Inspired by the function and mechanism of ToM in the human brain, this article designs a spiking neural network (SNN) for ToM and validates it on multi-agent decision-making tasks. The method effectively facilitates competition and cooperation among multiple agents. This work reveals the critical role of ToM in multi-agent interactions and lays the foundation for the future development of social intelligence for brain-inspired artificial intelligence.
In human social decision-making, theory of mind (ToM) optimizes decisions by inferring others’ mental states, based on self-experience or historical observations of others. We were inspired by the mechanism of ToM to construct a multi-agent ToM spiking neural network (MAToM-SNN). Agents with MAToM-SNN can predict other agents’ actions and make decisions while considering others’ future actions. These agents achieve better performance on various cooperative and competitive multi-agent reinforcement learning tasks than those without MAToM-SNN.
Introduction
Higher animals, such as humans, gradually develop social behaviors such as cooperation and competition, which are inseparable from the ability to infer others’ mental states, known as theory of mind (ToM). ToM often draws on information such as desires and beliefs to construct representations of others and thereby predict their behavior in a given situation.1,2,3 If we can learn to predict others’ behavior, we can avoid predictable trouble or take advantage of predicted opportunities. The critical role of ToM in social cognition inspired us to explore its role in multi-agent decision-making. An agent with ToM should be able to predict the actions of others from self-experience and observations of others and then adjust its policy accordingly to obtain more rewards overall.
A simple example of using ToM to predict the behavior of others is shown in the scenario in Figure 1A, where there are three agents: B (Bob), G (Green), and O (Oli). B needs to predict the behavior of G and O. B’s self-experience is that he likes apples. B then finds that O likes oranges based on his observation of O (step 1 in Figure 1A). Since G’s behavior has never been seen, B infers that G likes apples based on his own experience. From B’s observation of O, B predicts that O likes oranges. Imitation theory suggests that when an agent sees other agents exhibiting behavior similar to behavior it has executed before, this similarity evokes empathy in the observing agent. Thus, the observing agent infers that the observed agents have the same intention4 (e.g., B’s inference about G in Figure 1A). The above process can be simulated by a hypothetical neural circuit as shown in Figure 1B. In the beginning, B observes the environment and can infer the observations of others. The inferior parietal lobule (IPL) and the posterior superior temporal sulcus (pSTS) store self-relevant and other-relevant observations, respectively.5,6 The anterior cingulate cortex (ACC) is stimulated when B observes that others’ behavior differs from his expectations.7,8 The ventral medial prefrontal cortex (vmPFC) stores information related to the self, and the dorsal medial PFC (dmPFC) stores information related to others.9 Based on the stored information, the dorsolateral PFC (dlPFC) can simulate the decisions of others.10
Figure 1.
Introduction to ToM and related brain areas
(A) A case of ToM.
(B) A schematic hypothetical neural circuit of ToM.
Some existing studies focus on ToM mechanism-inspired decision-making models. A recursive ToM algorithm modeled by probabilistic methods can complete the rock-paper-scissors game.11 Osten et al. introduced the idea of ToM in modeling multiple opponents.12 Nguyen et al. used ToM to learn cooperative behavior with others quickly.13 Baker et al. modeled others’ beliefs through a Bayesian model, which enables simple inference.14,15 In our past work, we drew on the neural circuits and learning mechanisms of ToM and proposed brain-inspired ToM models that incorporated multi-scale neural plasticity mechanisms and coordinated mechanisms in multiple brain areas.16,17 These algorithms were separately applied to pass the false belief task and the artificial intelligence (AI) safety risks experiment; more complex multi-agent decision-making tasks would significantly challenge them. Rabinowitz et al. predicted the trajectories of others based on their behavioral data and constructed models of others’ goals.18 This work is limited to modeling others and does not use the predictions to help agents collaborate better. Wang et al. predicted the goals and observations of others, and the predicted results jointly helped each agent choose its own behavior, which led to better collaboration and communication on cooperative multi-agent tasks.19 In addition, the vast majority of multi-agent reinforcement learning (MARL) models focus on improvements in deep reinforcement learning (RL) methods, such as centralized learning and distributed decision-making, while lacking references to the mechanism of ToM.20,21,22 In general, there is still a need to draw deeply on the mechanisms of ToM to improve the efficiency and collaborative effectiveness of multi-agent decision-making.
Considering the various limitations of the existing methods mentioned above, this article aims to build a brain-inspired multi-agent ToM spiking neural network (MAToM-SNN) model and explore its role in MARL. SNNs23,24,25,26 have the advantages of simulating the brain’s structure and function, extracting spatiotemporal properties, and so on. More importantly, SNNs are more biologically plausible, energy efficient, and naturally more suitable for simulating various cognitive functions of the brain.23,24,25,26 Information in SNNs is transmitted through non-continuous binary spikes, so it is difficult to optimize SNNs with traditional backpropagation methods. Spike-timing-dependent plasticity (STDP)-based approaches17,27,28 that draw on the characteristics of synaptic plasticity can solve simple decision-making or control tasks but are not competent for complex decision-making tasks. Some works17,29,30,31,32,33 adopted reward-modulated STDP to optimize SNNs to solve RL problems. However, these methods cannot be applied to the end-to-end optimization of deep SNN-based RL or MARL. The conversion of artificial neural networks (ANNs) into SNNs34,35 and surrogate gradients for directly training deep SNNs36 are feasible methods to improve the performance of deep SNNs. A hybrid learning framework in deep RL uses SNNs to build actor networks and ANNs to build critic networks and performs well in some complex tasks.37 In multi-agent decision-making, our past work used reward-modulated SNNs for self-organized swarm unmanned aerial vehicle (UAV) obstacle avoidance,28 but it does not extend to MARL environments and lacks references to ToM. Little research on MARL has been implemented using SNNs: Saravanan et al.38 use SNNs to reduce time and data consumption when completing MARL tasks, and Ye et al.39 combine mean-field MARL with SNNs to model device-to-device users and use RL to optimize the convergence rate. The former explores the feasibility of SNNs in the field of MARL, while the latter solves a practical problem in a specific domain. In contrast, our work explores the significance of ToM in multi-agent cooperation and competition tasks with a combination of energy-efficient SNNs and MARL.
Inspired by the mechanism of ToM in social cognition, this paper proposes a MAToM-SNN model. The core mechanism of ToM is to use the self-experience or observation of others to infer others’ mental states and then optimize self-policy to gain more rewards. Motivated by this, the main contributions of this article can be summarized as follows.
(1) MAToM-DM integrates MAToM-SNN and the decision-making module. The output of MAToM-SNN, the predicted behavior of others, is integrated into the decision-making module to help the individual and collective achieve more rewards and more efficient collaboration.
(2) MAToM-SNN uses SNNs to model brain-inspired ToM networks, which integrate spiking neurons and surrogate gradients. MAToM-SNN incorporates Self-MAToM and Other-MAToM to infer others’ behavior, with one based on self-experience and the other based on observations of others’ behaviors.
(3) MAToM-DM is validated on multiple cooperative (stag hunt game) and mixed cooperative-competitive (multi-agent particle environment) tasks. Experimental results demonstrate that the introduction of MAToM-SNN on MARL algorithms can improve the efficiency of group collaboration and help achieve higher rewards (compared with IQL and value-decomposition network [VDN] with recurrent neural networks [RNNs] and SNNs in cooperative environments and MADDPG with RNNs and SNNs in competitive environments).
Results
In MAToM-DM, each agent predicts the actions of others based on self-experience or observations of others. This predicted information helps the agent optimize its behavior and improve its own or the team’s reward. The environment contains multiple agents, and each agent contains a ToM module (MAToM-SNN) and a decision-making module, as shown in Figure 2. The inputs to MAToM-SNN are the agent’s observations of others. The multi-layer SNN characterizes the attribution of mental states to others, and the output of MAToM-SNN is a prediction of others’ behavior. MAToM-DM concatenates the predicted actions of others obtained from MAToM-SNN with the observations of the environment as inputs to the decision-making module to learn the optimal policy. We model MAToM-SNN in two ways, Self-MAToM and Other-MAToM, which have the same structure, as shown on the right side of Figure 2. The Self-MAToM module is trained on self-experience, whereas the Other-MAToM module is trained on historical observations of others. This article compares the effects of self-modeling (Self-MAToM module) and modeling of others (Other-MAToM module) on ToM. Through interaction and collaboration, multiple agents with ToM can accomplish complex decision-making tasks involving cooperation and competition. The proposed MAToM-DM method is verified on cooperative stag hunt tasks40 with discrete space and time and a cooperative-competitive multi-agent particle environment22,41 with continuous space and discrete time. We specifically apply the neuron model and the surrogate gradient in the brain-inspired cognitive intelligence engine (BrainCog)26 in the process of implementing the modules for this work.
Figure 2.
Overview of MAToM-DM and architecture implementation for MAToM-SNN
MAToM-SNN contains Self-MAToM and Other-MAToM, which are trained with the help of the self-trajectory and other-trajectory saved in the buffer, respectively. In addition, Other-MAToM can learn by using Self-MAToM’s parameters.
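To make this data flow concrete, the following minimal PyTorch sketch shows how one decision step of such an agent could look. It is not the released implementation: `tom_net`, `policy_net`, the greedy action selection, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# A minimal sketch (not the authors' code) of one decision step of a MAToM-DM agent.
# `tom_net` and `policy_net` are hypothetical stand-ins for MAToM-SNN and the
# decision-making module.
def matom_dm_step(tom_net, policy_net, obs, obs_of_others):
    """obs: this agent's observation; obs_of_others: its observations of the other agents."""
    with torch.no_grad():
        # MAToM-SNN predicts a score for each other agent's possible next action.
        predicted_logits = tom_net(obs_of_others)            # [n_others, n_actions]
        predicted_actions = predicted_logits.argmax(dim=-1)  # greedy prediction per agent
        one_hot = F.one_hot(predicted_actions, num_classes=predicted_logits.shape[-1])
    # The predictions are concatenated with the agent's own observation and fed to the
    # decision-making module, which selects the agent's action (here, greedily over Q values).
    policy_input = torch.cat([obs, one_hot.flatten().float()])
    q_values = policy_net(policy_input.unsqueeze(0))         # [1, n_actions]
    return int(q_values.argmax(dim=-1).item())
```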
Stag hunt games
We evaluate MAToM-SNN on cooperative tasks with discrete space and time in which two agents can act cooperatively to obtain more rewards or act independently to obtain fewer rewards. Because each agent learns the same policy, the experiences of the self and others are the same; in this environment, we therefore do not distinguish between Self-MAToM and Other-MAToM. The environment of these tasks is a 5 × 5 grid. All agents can take physical actions, including left, up, down, right, and stay. In each experiment, agents’ observations are one-dimensional vectors containing the agent’s position, others’ positions, the stag’s position, and plants’ positions. The detailed stag hunt task environments are depicted in Figure 3 and are introduced as follows.
Figure 3.
Illustrations of stag hunt tasks (left column) and results (right column)
The resulting figure shows the mean episode reward across the different random seeds.
Harvest
The environment contains two cooperative agents and randomly maturing plants. When an agent is on top of a plant, it harvests that plant. If an agent harvests a young plant, it receives a reward of 1, and the other agent receives nothing. If an agent harvests a mature plant, both agents receive a reward of 2.
Escalation
This scenario consists of two cooperative agents and a stag. At first, the stag does not move. It starts moving once the two agents walk to it together. Only when the two agents keep chasing the stag together do both receive a reward of 1; otherwise, the game ends.
Hunt
This scenario contains two agents, two plants, and a randomly moving stag. Each agent can harvest a plant and obtain a reward of 1. The two agents can collaborate to hunt the stag and obtain a reward of 5. If a single agent approaches the stag alone, it is injured and receives a penalty of −5.
In all three tasks, both SNNs and RNNs are used to build the decision-making module, which is trained with the VDN21 method. We integrated MAToM-SNN into the decision-making module. During training, the batch size is 250, the length of an episode is 50 steps, and the learning rate of the decision-making module is 0.99. In the harvest task, MAToM-DM learns how to harvest mature plants. In the last two tasks, the two agents learn to chase the stag together.
Comparative experiments
To validate the effectiveness of the MAToM-SNN model, we compare our methods with baselines, including DQN,42 IQL,20 RNN-based VDN (RVDN),21 and SNN-based VDN (SVDN).
The comparative results for the different baselines are shown in Figure 3 and Table 1. MAToM-RVDN and MAToM-SVDN denote MAToM-SNN combined with RVDN and SVDN, respectively. The experimental results show that RVDN and SVDN outperform the IQL and DQN models. The rewards of RVDN and SVDN in Table 1 show that VDNs with different structures (such as RVDN and SVDN) perform differently on different tasks. Nevertheless, their total mean rewards improve after combining with MAToM-SNN. As shown in Figure 3 (MAToM-RVDN line and MAToM-SVDN line), MAToM-SNN can improve learning efficiency in most cases; especially when MAToM-SNN is added to RVDN (MAToM-RVDN), the learning speed is dramatically improved. In addition, MAToM-SNN guarantees almost no loss of performance and can even improve performance. Similarly, MAToM-SNN also improves the decision-making ability of the agent with SVDN, especially in the hunt task, where MAToM-SVDN achieves remarkably good performance. In the harvest task, the performance of SVDN is slightly better than that of MAToM-SVDN in the second half of the curve. A possible reason is that the two agents do not need to cooperate in hunting in this scenario, but both must work to collect mature plants. When both agents see a mature plant, both infer that the other will harvest the plant, and so neither does; the result therefore converges to a local optimum. Because RVDN performs better on this task, it helps the model escape this local optimum and allows MAToM to play its role.
Table 1.
Comparison results of stag hunt tasks
| Method | Harvest | Escalation | Hunt |
|---|---|---|---|
| DQN42 | | | |
| IQL20 | | | |
| RVDN21 | | | |
| MAToM-RVDN | 100.06 ± 3.43 | 83.47 ± 4.19 | 106.07 ± 14.23 |
| SVDN | | | |
| MAToM-SVDN | 74.04 ± 2.96 | 78.41 ± 2.85 | 120.39 ± 12.63 |
The table records the mean ± standard deviation of raw rewards.
In summary, MAToM-SNN can improve the cooperation efficiency and total reward on MARL tasks. The advantage mainly stems from MAToM-SNN’s prediction of others’ future behavior, which expands the information available to each agent and thus enables more rational decisions.
Multi-agent particle environment
We choose the multi-agent particle environment (MPE) for more complex settings where cooperation and competition coexist. The environment contains two competing teams (team A and team B), and each team has at least one agent. If a team has more than one agent, all agents within that team share the same reward function. The agents can take the actions up, down, left, right, and stay. Notably, the observations of the opposing agents are not identical. We describe all scenarios (shown in Figure 4) in detail below.
Figure 4.
Illustrations of the multi-agent particle environment (left column) and results (right column)
Each curve converges at the end of training. The shaded region is a confidence interval across the different random seeds.
Physical deception
In this scenario, team A has one agent, and team B has two agents. Team B receives a positive reward based on how close its nearest agent is to the target landmark: the smaller the distance, the greater the positive reward. Team B also receives a negative reward based on how close team A is to the target landmark: the smaller the distance, the larger the penalty. Team A is rewarded based on its distance to the target but does not know which landmark is the target. Therefore, the agent in team A needs to infer team B’s behavior to predict the target landmark.
Predator-prey
Team A has one prey, and team B has three predators. When team B collides with team A, team B is rewarded, and team A is penalized. Two landmarks impede the agents’ way.
World communication
The environment contains team A (two agents), team B (four agents, who catch team A), two pieces of food that agents in team A can eat, two forests that can hide any agents inside them, and a landmark impeding the way. Only the leader of team B knows the coordinates of agents hidden in a forest and can choose to share them with its team members by communication.
In the MPE environment, the design of the decision-making module in MAToM-DM is based on MADDPG. We construct the SNN and RNN versions of MADDPG as the decision-making modules and use the output of MAToM-SNN as the input of the actor network of the decision-making module. During training of the MAToM-SNN, the batch size is 1,024. The length of an episode is 25 steps. The learning rate of the decision-making module is 0.99.
Comparative experiments
We compare our methods with DDPG,42 MADDPG,22 RMADDPG, and SMADDPG. Since there are different kinds of agents in mixed cooperative-competitive environments, MAToM-SNN can use both self-experience (the Self-MAToM module) and historical observations of others (the Other-MAToM module) to learn. Therefore, we also compare the experimental results of the Self-MAToM module and the Other-MAToM module.
To explain the results clearly, we record part of the learning curves in Figure 4 and the raw rewards in Table 2. As seen from the table, RMADDPG and SMADDPG outperform the DDPG and MADDPG models. Self-MAToM and Other-MAToM significantly outperform the baseline DDPG. Moreover, Other-MAToM even significantly outperforms the baseline MADDPG model, and in the predator-prey and physical deception scenarios, Self-MAToM also performs better than MADDPG. In summary, although MAToM-DM with RNN and SNN structures performs differently on different experimental tasks, the mean rewards improve after combining with Other-MAToM. We also analyze why Self-MAToM is slightly inferior to Other-MAToM. In a competitive task (the world communication scenario) where the opponents’ strategies differ significantly from one’s own, using self-experience (Self-MAToM) to infer the opponent’s behavior is less accurate and can even degrade performance. Therefore, MAToM-DM with Other-MAToM performs better than with Self-MAToM on mixed cooperative-competitive tasks.
Table 2.
Comparison results of multi-agent particle environments
| Method | Physical deception | Predator-prey | World communication |
|---|---|---|---|
| DDPG42 | | | |
| MADDPG22 | | | |
| RMADDPG | | | |
| Self-MAToM-RMADDPG | | | |
| Other-MAToM-RMADDPG | 0.14 ± 1.11 | 20.36 ± 4.24 | 32.16 ± 11.69 |
| SMADDPG | | | |
| Self-MAToM-SMADDPG | | | |
| Other-MAToM-SMADDPG | 2.43 ± 1.06 | 20.70 ± 4.63 | 19.19 ± 4.68 |
The table records the mean ± standard deviation of raw rewards.
Discussion
Taking inspiration from the ToM mechanism in social cognition, this article proposes a brain-inspired MAToM-SNN model that brings ToM into multi-agent decision-making. The proposed model uses more biologically plausible SNNs to construct MAToM-SNN and improves the efficiency and performance of multi-agent collaboration. Extensive experiments demonstrated that the MAToM-SNN model achieves outstanding performance in cooperative and competitive tasks. Compared with multiple baseline models, the proposed model is more stable, robust, and efficient and performs well in different stochastic scenarios.
At present, MARL mainly draws on the classical algorithms of RL to improve the objective function of the actor network and the loss function of the critic. Alternatively, from the perspective of multi-agent systems,21,43,44 scholars have used the Q value of each agent to fit the Q value of the system. Most of these methods use the idea of centralized training, decentralized execution (CTDE), which means that the agents are independent in the interaction process. The problem is that the local information observed by the agents can lead to biased decisions. Communication45,46 methods allow agents to receive information directly from others and thus enrich the observation information to improve decision-making accuracy. However, these methods directly access the information of all other agents and are unsuitable for competitive tasks.
Unlike previous MARL models, MAToM-DM considers ToM in multi-agent interactions, which plays a crucial role in social cognition and is naturally suitable for solving multi-agent decision-making tasks. In addition, our main highlight is that we adopt full SNNs to implement two kinds of brain-inspired ToM models: Self-MAToM infers others’ behavior based on self-experience, and Other-MAToM infers it based on observations of others. The predicted behaviors of others are then used to adjust the agents’ strategies to interact better with others. Therefore, our MAToM-SNN model is more biologically interpretable and plausible, and its effectiveness is illustrated by the superior results on different cooperative and competitive tasks.
Analysis of MAToM-SNN
To further analyze the role of MAToM-SNN, we conduct ablation study experiments on multiple competitive scenarios using Other-MAToM. For the opposing team A and team B, we grant the ability of ToM to team B (no ToM to team A at this point), as well as to both team A and team B. Table 3 depicts the mean episode reward for each team on three competitive tasks. The experimental results reveal that the team with the ToM ability (B-ToM) achieved a greater reward than the team without ToM (B). In addition, when only one of the opposing teams has ToM capability (B-ToM), the inferred team’s reward (A) is reduced. This suggests that ToM boosts one team’s rewards and suppresses the opposite team’s rewards. When both teams have ToM, the team with more agents (team B) can achieve more rewards and suppress the team with fewer agents from gaining rewards. This also validates that our MAToM-SNN model will help teams with larger numbers improve performance in competitive tasks.
Table 3.
Ablation study experiment results of multi-agent particle environments
| Method | Team | Physical deception | Predator-prey | World communication |
|---|---|---|---|---|
| MADDPG22 | A | | | |
| | B | | | |
| MAToM-SMADDPG | A-ToM | | | |
| | B-ToM | | | |
| MAToM-SMADDPG | A | | | |
| | B-ToM | | | |
The table records the mean and standard deviation of raw rewards.
Effects of Self-MAToM
To further analyze the role of Self-MAToM, we conduct ablation experiments on multiple competitive scenarios. Intuitively, when we encounter an unfamiliar person for the first time, we infer their behavior by putting ourselves in their place. As we get to know them, we adjust our expectations based on their behavior. Inspired by this, we designed Self-Other-MAToM as a combination of Self-MAToM and Other-MAToM. Self-Other-MAToM uses Self-MAToM’s parameters to initialize the network and then learns from observations of others. We can share parameters between Self-MAToM and Other-MAToM because they adopt the same network structure; as shown in Table 4, Self-MAToM and Other-MAToM have the same number of parameters in each experiment.
Table 4.
The number of parameters
| | Physical deception | Predator-prey | World communication |
|---|---|---|---|
| Params | 5.539K | 5.803K | 6.199K |
Figure 5 shows that, compared with Self-MAToM and Other-MAToM, Self-Other-MAToM has a higher starting reward because it uses self-experience to predict others in the early stage of training and can therefore model others more quickly. The starting reward of team A with Self-Other-MAToM is lower than that with Self-MAToM or Other-MAToM in the predator-prey scenario. One possible reason is that team B with Self-Other-MAToM is too strong, so team A is completely suppressed; the score decline is thus not caused by a defect in Self-Other-MAToM but by team A lagging behind in the competition. Based on the above analysis, the results show that initializing with Self-MAToM parameters helps agents participate more quickly in cooperative and competitive tasks at the beginning of training.
Figure 5.
Learning curves of Self-MAToM, Other-MAToM, and Self-Other-MAToM
The left column shows the mean episode rewards of the system. The right column shows the mean episode rewards of each team.
In this article, we find that modeling and predicting others’ behavior based on observations of others (Other-MAToM) is superior to using self-experience to predict others’ behavior (Self-MAToM). A possible reason is that self-experience and others’ experiences differ, especially in competitive tasks, so the Other-MAToM model predicts others better. In addition, Self-MAToM can help Other-MAToM learn quickly. The self is a prerequisite for inferring others; thus, it is essential to infer others’ behavior from self-experience when information about them is incomplete (e.g., B’s inference about G in Figure 1A). This inspires us to explore better integration of self and others’ experiences into the ToM model in future studies and to delve further into more biologically plausible mechanisms of ToM.
With the development of AI, multi-agent systems are quickly entering human society, in ways such as autonomous driving and logistics delivery. ToM is an advanced cognitive function evolved in humans during social decision-making processes that allows individuals to predict the behavior of others by understanding their intentions. This provides valuable insights for building social intelligence, as ToM enables agents to act interpretably. In future research, we can explore how ToM can help multiple agents generate safe, trustworthy, and ethical behavior during interactions, as well as how high-order ToM performs in complex decision-making tasks. Like humans, the development of ToM in agents requires time and exposure to a variety of samples. In other words, when an agent with ToM encounters different types of agents, it needs to spend more time learning how to distinguish between them and what mental states each agent may produce. Therefore, it is worthwhile to explore how to improve the efficiency of the ToM model.
Experimental procedures
Resource availability
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Dr. Yi Zeng (yi.zeng@ia.ac.cn).
Materials availability
This study did not generate new unique materials.
Data and code availability
All original code has been deposited at GitHub under https://github.com/BrainCog-X/Brain-Cog/tree/main/examples/Social_Cognition/MAToM-SNN and at Zenodo under https://doi.org/10.5281/zenodo.7768874 and is publicly available as of the date of publication. The data used in this article are available from the code, and readers can also contact the lead contact for requests.
SNNs provide a better understanding of biological systems by simulating how information is transmitted through spikes in the brain. ToM, as a higher cognitive function of the human brain, plays an important role in the human decision-making process. In this article, we propose an MAToM-DM model consisting of a ToM module (MAToM-SNN) and a decision-making module. We designed the ToM module based on SNNs to infer the behavior of others. The decision-making module accepts MAToM-SNN’s predictions of others’ actions as input to guide self-decision-making and improve performance during cooperation or competition with others.
SNN
This subsection introduces the neuron model of the SNN and the training method. The leaky integrate-and-fire (LIF) model47,48 is used to describe the dynamic activity of neurons. The differential Equation 1 describes the dynamic process of voltage change caused by the postsynaptic potential (PSP), where $\tau$ represents the time constant, $V_{reset}$ is the reset voltage, $V_t$ is the updated voltage, and $PSP_t$ is the PSP. When $V_t$ exceeds the threshold $V_{th}$, the neuron fires as shown in Equation 2, where $\Theta(\cdot)$ is a spike function.

$\tau \dfrac{dV_t}{dt} = -\left(V_t - V_{reset}\right) + PSP_t$  (Equation 1)

$S_t = \Theta\left(V_t - V_{th}\right), \quad \Theta(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}$  (Equation 2)
To make the model competent for both continuous and discrete control tasks, the network output is a continuous value. Therefore, we distinguish the last-layer neurons in the SNN from those of the other layers, as shown in Figure 6. Information is passed between SNN layers by spikes, and the output of the network is the voltage. We apply this setting to all modules in MAToM-SNN. The PSP of layer $l+1$ is the spike-weighted accumulation over layer $l$, which serves as the input to layer $l+1$, as shown in Equation 3.

$PSP_{i,t}^{l+1} = \sum_{j} w_{ij}^{l}\, S_{j,t}^{l}$  (Equation 3)
Figure 6.
Illustrations of LIF neurons
Because the spike sequence is not differentiable, standard gradient backpropagation cannot directly optimize SNNs; we therefore train SNNs with the help of a surrogate gradient, as shown in Equation 4.49 In the experiments, $a$ is set to 2.

$\dfrac{\partial S_t}{\partial V_t} \approx \dfrac{1}{a}\,\mathrm{sign}\!\left(\left|V_t - V_{th}\right| < \dfrac{a}{2}\right)$  (Equation 4)
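The following PyTorch sketch illustrates Equations 1, 2, and 4 with a discrete-time LIF layer and a rectangular surrogate gradient. It is a simplified stand-in for the BrainCog components used in this work; the Euler discretization, the hard reset, and the default parameter values are assumptions.

```python
import torch

class RectSurrogate(torch.autograd.Function):
    """Heaviside spike in the forward pass; rectangular surrogate gradient
    (Equation 4, width a = 2) in the backward pass."""
    a = 2.0

    @staticmethod
    def forward(ctx, v_minus_thresh):
        ctx.save_for_backward(v_minus_thresh)
        return (v_minus_thresh >= 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v_minus_thresh,) = ctx.saved_tensors
        grad = (v_minus_thresh.abs() < RectSurrogate.a / 2).float() / RectSurrogate.a
        return grad_output * grad


class LIFLayer(torch.nn.Module):
    """Discrete-time LIF dynamics (Equation 1) with threshold firing (Equation 2)."""
    def __init__(self, tau=2.0, v_threshold=1.0, v_reset=0.0):
        super().__init__()
        self.tau, self.v_threshold, self.v_reset = tau, v_threshold, v_reset
        self.v = None  # membrane potential, created lazily with the input shape

    def forward(self, psp):  # psp: weighted spike input (Equation 3)
        if self.v is None:
            self.v = torch.full_like(psp, self.v_reset)
        # Euler update of the membrane potential driven by the PSP.
        self.v = self.v + (psp - (self.v - self.v_reset)) / self.tau
        spike = RectSurrogate.apply(self.v - self.v_threshold)
        # Hard reset of neurons that fired.
        self.v = torch.where(spike.bool(), torch.full_like(self.v, self.v_reset), self.v)
        return spike
```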
MAToM-SNN
In this subsection, we describe the MAToM-SNN modeling process in detail. We applied a four-layer fully connected SNN to construct a ToM module. The inputs to the ToM module are observations of others (including environmental context and observations of others’ states and behaviors). The output is the prediction of others’ actions. The hidden layer encodes others’ mental states.
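Building on the LIF sketch above, the snippet below shows how a four-layer fully connected spiking ToM network of this kind could be assembled. The hidden width, the simulation length `T`, the constant-current input coding, and the voltage (non-spiking) readout of the last layer are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch

class MAToMSNN(torch.nn.Module):
    """Sketch of a four-layer fully connected spiking ToM network.
    Input: an observation of another agent; output: logits over its next action.
    Assumes the LIFLayer defined in the previous sketch."""
    def __init__(self, obs_dim, n_actions, hidden=64, T=8):
        super().__init__()
        self.T = T
        self.fc = torch.nn.ModuleList([
            torch.nn.Linear(obs_dim, hidden),
            torch.nn.Linear(hidden, hidden),
            torch.nn.Linear(hidden, hidden),
            torch.nn.Linear(hidden, n_actions),
        ])
        self.lif = torch.nn.ModuleList([LIFLayer() for _ in range(3)])

    def forward(self, obs_of_other):
        for lif in self.lif:           # reset membrane state between inputs
            lif.v = None
        out_sum = 0.0
        for _ in range(self.T):        # repeat the input as a constant current for T steps
            x = obs_of_other
            for linear, lif in zip(self.fc[:-1], self.lif):
                x = lif(linear(x))     # spikes propagate between hidden layers
            out_sum = out_sum + self.fc[-1](x)  # last layer accumulates voltage, no spike
        return out_sum / self.T        # mean readout = predicted action preference
```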
The ToM module learns its parameters in a supervised manner. We use the Kullback-Leibler (KL) divergence (Equation 5) to define the loss function (Equation 6).

$D_{KL}\!\left(\pi \,\|\, \hat{\pi}\right) = \sum_{a} \pi(a \mid o) \log \dfrac{\pi(a \mid o)}{\hat{\pi}(a \mid o)}$  (Equation 5)

$\mathcal{L} = \mathbb{E}_{o \sim \mathcal{D}}\!\left[ D_{KL}\!\left(\pi(\cdot \mid o) \,\|\, \hat{\pi}(\cdot \mid o)\right) \right]$  (Equation 6)
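A minimal sketch of the resulting supervised update is shown below, assuming the observed actions of others are stored as integer indices and used as one-hot targets (with one-hot targets, the KL loss reduces to cross-entropy up to a constant).

```python
import torch
import torch.nn.functional as F

def tom_loss(predicted_logits, observed_actions, n_actions):
    """KL divergence between the observed (target) action distribution and the
    ToM prediction (Equations 5 and 6). Batching scheme and one-hot targets are
    assumptions for illustration."""
    target = F.one_hot(observed_actions.long(), n_actions).float()  # empirical target policy
    log_pred = F.log_softmax(predicted_logits, dim=-1)              # predicted policy (log)
    # KL(target || prediction), averaged over the batch.
    return F.kl_div(log_pred, target, reduction="batchmean")

# Illustrative usage with a hypothetical optimizer and replay batch:
# loss = tom_loss(tom_net(batch_obs_of_others), batch_observed_actions, n_actions)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```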
In this article, we consider two sources to infer others’ mental states and predict others’ actions. One is self-experience, and the other is observation of others. The Self-MAToM module and Other-MAToM module have the same structure, as shown in Figure 2.
Self-MAToM module
This module can be regarded as encoding the agent’s own memories. Self-experience serves as the training sample and is the same as the history trajectory saved in the RL buffer. The Self-MAToM module infers others’ behavior from self-experience; therefore, the training samples should exclude quantities that are unobservable to others, such as others’ target landmarks. Continuous optimization of the loss function (Equation 6) indicates continuous improvement of the Self-MAToM module.
Other-MAToM module
The training samples are observations of the historical trajectories of others. These history trajectories are stored in the same way as in RL; the difference is that the ToM module’s buffer also contains behavioral observations of others. In the loss function (Equation 6), $\pi$ is the target policy, and $\hat{\pi}$ is the predicted policy.
MAToM-DM with different decision-making modules
We construct the decision-making module based on existing MARL algorithms. The previous subsection described the ToM module, which is designed to be integrated in front of the decision-making module. We integrate the ToM module into the value-based VDN and the actor-critic-based MADDPG algorithms as follows.
Markov games
We describe multi-agent tasks as a multi-agent extension of Markov decision processes (MDPs). A Markov game for $N$ agents contains the states, $S$; a set of actions, $A_1, \dots, A_N$; and a set of observations, $O_1, \dots, O_N$. Observations are related to states and actions. Each agent selects its actions according to its policy and generates the next state depending on the state transition function. At the same time, each agent obtains a reward, $r_i$. After interacting with the environment, each agent forms a trajectory of observations, actions, and rewards, which is stored in buffer $\mathcal{D}$.
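A minimal sketch of such a buffer is given below; the field names and the flat transition layout are assumptions, the point being that the same storage can serve both the RL update and the training of Other-MAToM.

```python
import random
from collections import deque

class ToMReplayBuffer:
    """Sketch of a shared buffer. Besides the usual RL transition, each entry keeps
    the agent's observation of the other agents and their actually executed actions,
    so Other-MAToM can be trained on them. Field names are illustrative assumptions."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, obs_of_others, actions_of_others):
        self.storage.append((obs, action, reward, next_obs,
                             obs_of_others, actions_of_others))

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)
        # Transpose the list of transitions into per-field tuples.
        return tuple(zip(*batch))
```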
MAToM-DM with VDN
The cooperative tasks
We add ToM to the VDN method. VDN extends DQN to multi-agent tasks and optimizes the team reward by learning to decompose the team value function into agentwise value functions.21 Each agent’s input is composed of the agent’s history $h_i$ and its current action $a_i$. We denote $\hat{\mathbf{a}}^{-i}$ as the tuple of predicted actions of the other agents, where $\hat{a}_{j}^{i}$ is agent $i$’s prediction of agent $j$’s action. We concatenate the predicted actions of others obtained by the ToM module with the original observations as the inputs. Equation 7 describes the value-decomposition process that integrates the ToM module: the system’s total Q function is approximately equal to the sum of all agents’ Q functions.

$Q_{tot}\left(\mathbf{h}, \mathbf{a}, \hat{\mathbf{a}}\right) \approx \sum_{i=1}^{N} Q_i\left(h_i, a_i, \hat{\mathbf{a}}^{-i}\right)$  (Equation 7)
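The sketch below illustrates Equation 7; the network interfaces (per-agent utility networks that take the history concatenated with the ToM predictions) are illustrative assumptions.

```python
import torch

def vdn_tom_q_tot(agent_q_nets, histories, actions, predicted_others):
    """Sketch of Equation 7: the team Q value is the sum of per-agent Q values,
    where each agent's utility network also receives the ToM-predicted actions
    of the others. Network interfaces are illustrative assumptions."""
    q_tot = 0.0
    for q_net, h_i, a_i, a_hat_i in zip(agent_q_nets, histories, actions, predicted_others):
        q_values = q_net(torch.cat([h_i, a_hat_i], dim=-1))            # [batch, n_actions]
        q_i = q_values.gather(-1, a_i.long().unsqueeze(-1)).squeeze(-1)  # chosen-action Q
        q_tot = q_tot + q_i
    return q_tot  # trained end to end against the team TD target, as in VDN
```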
MAToM-DM with MADDPG
The competitive tasks
We add the ToM module to the MADDPG method. The MADDPG algorithm endows each agent with an actor and a critic network. The actor network takes the observations and generates actions, and the critic network evaluates the observations and the other agents’ actions; the actions obtained from the actor network form part of the critic network’s inputs. The actor network is optimized through gradient-based backpropagation. The objective function is shown in Equation 8, where $\mu_k$ is the policy of agent $k$ and $Q_k^{\mu}$ is the critic function of agent $k$. The critic network is updated with the temporal difference (TD) error. We concatenate the predicted actions of others, $\hat{\mathbf{a}}^{-k}$, obtained by ToM with the original observations as part of the actor network inputs.

$\nabla_{\theta_k} J\left(\mu_k\right) = \mathbb{E}\!\left[\nabla_{\theta_k} \mu_k\!\left(a_k \mid o_k, \hat{\mathbf{a}}^{-k}\right)\, \nabla_{a_k} Q_k^{\mu}\!\left(\mathbf{o}, a_1, \dots, a_N\right)\Big|_{a_k = \mu_k\left(o_k, \hat{\mathbf{a}}^{-k}\right)}\right]$  (Equation 8)
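The following sketch illustrates the ToM-augmented actor update of Equation 8 for a single agent $k$; the interfaces of `actor_k` and `critic_k` and the way the joint action is assembled are illustrative assumptions.

```python
import torch

def matom_maddpg_actor_loss(actor_k, critic_k, obs_k, a_hat_others, joint_obs, joint_actions, k):
    """Sketch of Equation 8 for agent k: the actor receives the agent's own observation
    concatenated with the ToM-predicted actions of others, and is updated to maximize
    the centralized critic's value. Interfaces are illustrative assumptions."""
    # Actor output with ToM-augmented input.
    action_k = actor_k(torch.cat([obs_k, a_hat_others], dim=-1))
    # Replace agent k's action in the joint action with the actor's fresh output,
    # keeping the other agents' actions fixed (as in MADDPG).
    joint_actions = list(joint_actions)
    joint_actions[k] = action_k
    q_value = critic_k(torch.cat([joint_obs] + joint_actions, dim=-1))
    return -q_value.mean()  # gradient ascent on Q == descent on -Q

# Typical usage with a hypothetical optimizer:
# loss = matom_maddpg_actor_loss(...); opt.zero_grad(); loss.backward(); opt.step()
```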
Acknowledgments
This study was supported by the National Key Research and Development Program (grant no. 2020AAA0104305), the Strategic Priority Research Program of the Chinese Academy of Sciences (grant no. XDB32070100), and the National Natural Science Foundation of China (grant no. 62106261).
Author contributions
Conceptualization, Z.Z. and Y. Zeng; methodology, Z.Z.; investigation, Z.Z.; resources, F.Z., Y. Zeng, and Y.S.; writing – original draft, Z.Z., F.Z., Y. Zhao, Y. Zeng, and Y.S.; writing – review & editing, Z.Z., F.Z., Y. Zhao, and Y. Zeng; supervision, Y. Zeng; funding acquisition, Y. Zeng and F.Z.
Declaration of interests
The authors declare no competing interests.
Published: June 23, 2023
References
1. Sebastian C.L., Fontaine N.M.G., Bird G., Blakemore S.-J., Brito S.A.D., McCrory E.J.P., Viding E. Neural processing associated with cognitive and affective Theory of Mind in adolescents and adults. Soc. Cognit. Affect Neurosci. 2012;7:53–63. doi: 10.1093/scan/nsr023.
2. Koster-Hale J., Saxe R. Theory of mind: a neural prediction problem. Neuron. 2013;79:836–848. doi: 10.1016/j.neuron.2013.08.020.
3. Dennis M., Simic N., Bigler E.D., Abildskov T., Agostino A., Taylor H.G., Rubin K., Vannatta K., Gerhardt C.A., Stancin T., Yeates K.O. Cognitive, affective, and conative theory of mind (ToM) in children with traumatic brain injury. Dev. Cogn. Neurosci. 2013;5:25–39. doi: 10.1016/j.dcn.2012.11.006.
4. Gallese V., Goldman A. Mirror neurons and the simulation theory of mind-reading. Trends Cognit. Sci. 1998;2:493–501. doi: 10.1016/S1364-6613(98)01262-5.
5. Uddin L.Q., Molnar-Szakacs I., Zaidel E., Iacoboni M. rTMS to the right inferior parietal lobule disrupts self–other discrimination. Soc. Cognit. Affect Neurosci. 2006;1:65–71. doi: 10.1093/scan/nsl003.
6. Patel G.H., Sestieri C., Corbetta M. The evolution of the temporoparietal junction and posterior superior temporal sulcus. Cortex. 2019;118:38–50. doi: 10.1016/j.cortex.2019.01.026.
7. Shenhav A., Botvinick M.M., Cohen J.D. The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron. 2013;79:217–240. doi: 10.1016/j.neuron.2013.07.007.
8. Wang F., Peng K., Bai Y., Li R., Zhu Y., Sun P., Guo H., Yuan C., Rotshtein P., Sui J. The dorsal anterior cingulate cortex modulates dialectical self-thinking. Front. Psychol. 2016;7:152. doi: 10.3389/fpsyg.2016.00152.
9. Abu-Akel A., Shamay-Tsoory S. Neuroanatomical and neurochemical bases of theory of mind. Neuropsychologia. 2011;49:2971–2984. doi: 10.1016/j.neuropsychologia.2011.07.012.
10. Suzuki S., Harasawa N., Ueno K., Gardner J.L., Ichinohe N., Haruno M., Cheng K., Nakahara H. Learning to simulate others’ decisions. Neuron. 2012;74:1125–1137. doi: 10.1016/j.neuron.2012.04.030.
11. De Weerd H., Verbrugge R., Verheij B. How much does it help to know what she knows you know? An agent-based simulation study. Artif. Intell. 2013;199-200:67–92. doi: 10.1016/j.artint.2013.05.004.
12. Von Der Osten F.B., Kirley M., Miller T. The minds of many: opponent modeling in a stochastic game. IJCAI. 2017:3845–3851.
13. Nguyen D., Venkatesh S., Nguyen P., Tran T. Theory of mind with guilt aversion facilitates cooperative reinforcement learning. Asian Conference on Machine Learning. PMLR; 2020:33–48.
14. Baker C., Saxe R., Tenenbaum J. Bayesian theory of mind: modeling joint belief-desire attribution. Proceedings of the Annual Meeting of the Cognitive Science Society. 2011;33:2469–2474.
15. Baker C.L., Jara-Ettinger J., Saxe R., Tenenbaum J.B. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nat. Human Behav. 2017;1:0064. doi: 10.1038/s41562-017-0064.
16. Zeng Y., Zhao Y., Zhang T., Zhao D., Zhao F., Lu E. A brain-inspired model of theory of mind. Front. Neurorob. 2020;14:60. doi: 10.3389/fnbot.2020.00060.
17. Zhao Z., Lu E., Zhao F., Zeng Y., Zhao Y. A brain-inspired theory of mind spiking neural network for reducing safety risks of other agents. Front. Neurosci. 2022;16:753900. doi: 10.3389/fnins.2022.753900.
18. Rabinowitz N., Perbet F., Song F., Zhang C., Eslami S.A., Botvinick M. Machine theory of mind. International Conference on Machine Learning. PMLR; 2018:4218–4227.
19. Wang Y., Zhong F., Xu J., Wang Y. Tom2c: target-oriented multi-agent communication and cooperation with theory of mind. arXiv. 2021. Preprint. doi: 10.48550/arXiv.2111.09189.
20. Tampuu A., Matiisen T., Kodelja D., Kuzovkin I., Korjus K., Aru J., Aru J., Vicente R. Multiagent cooperation and competition with deep reinforcement learning. PLoS One. 2017;12. doi: 10.1371/journal.pone.0172395.
21. Sunehag P., Lever G., Gruslys A., Czarnecki W.M., Zambaldi V., Jaderberg M., Lanctot M., Sonnerat N., Leibo J.Z., Tuyls K., Graepel T. Value-decomposition networks for cooperative multi-agent learning based on team reward. Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS ’18). International Foundation for Autonomous Agents and Multiagent Systems; 2018:2085–2087.
22. Lowe R., Wu Y., Tamar A., Harb J., Pieter Abbeel O., Mordatch I. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc.; 2017:6379–6390.
23. Maass W. Networks of spiking neurons: the third generation of neural network models. Neural Network. 1997;10:1659–1671.
24. Ghosh-Dastidar S., Adeli H. Spiking neural networks. Int. J. Neural Syst. 2009;19:295–308. doi: 10.1142/S0129065709002002.
25. Khalil R., Moftah M.Z., Moustafa A.A. The effects of dynamical synapses on firing rate activity: a spiking neural network model. Eur. J. Neurosci. 2017;46:2445–2470. doi: 10.1111/ejn.13712.
26. Zeng Y., Zhao D., Zhao F., Shen G., Dong Y., Lu E., Zhang Q., Sun Y., Liang Q., Zhao Y., et al. Braincog: a spiking neural network based brain-inspired cognitive intelligence engine for brain-inspired AI and brain simulation. arXiv. 2022. Preprint. doi: 10.48550/arXiv.2207.08533.
27. Vasquez Tieck J.C., Becker P., Kaiser J., Peric I., Akl M., Reichard D., Roennau A., Dillmann R. Learning target reaching motions with a robotic arm using brain-inspired dopamine modulated STDP. 2019 IEEE 18th International Conference on Cognitive Informatics & Cognitive Computing (ICCI∗CC). 2019:54–61.
28. Zhao F., Zeng Y., Han B., Fang H., Zhao Z. Nature-inspired self-organizing collision avoidance for drone swarm based on reward-modulated spiking neural network. Patterns. 2022;3. doi: 10.1016/j.patter.2022.100611.
29. Izhikevich E.M. Solving the distal reward problem through linkage of STDP and dopamine signaling. Cerebr. Cortex. 2007;17:2443–2452. doi: 10.1093/cercor/bhl152.
30. Frémaux N., Gerstner W. Neuromodulated spike-timing-dependent plasticity, and theory of three-factor learning rules. Front. Neural Circ. 2016;9:85. doi: 10.3389/fncir.2015.00085.
31. Sanda P., Skorheim S., Bazhenov M. Multi-layer network utilizing rewarded spike time dependent plasticity to learn a foraging task. PLoS Comput. Biol. 2017;13. doi: 10.1371/journal.pcbi.1005705.
32. Zhao F., Zeng Y., Xu B. A brain-inspired decision-making spiking neural network and its application in unmanned aerial vehicle. Front. Neurorob. 2018;12:56. doi: 10.3389/fnbot.2018.00056.
33. Zhao F., Zeng Y., Guo A., Su H., Xu B. A neural algorithm for Drosophila linear and nonlinear decision-making. Sci. Rep. 2020;10. doi: 10.1038/s41598-020-75628-y.
34. Patel D., Hazan H., Saunders D.J., Siegelmann H.T., Kozma R. Improved robustness of reinforcement learning policies upon conversion to spiking neuronal network platforms applied to Atari Breakout game. Neural Network. 2019;120:108–115. doi: 10.1016/j.neunet.2019.08.009.
35. Tan W., Patel D., Kozma R. Strategy and benchmark for converting deep Q-networks to event-driven spiking neural networks. Proceedings of the AAAI Conference on Artificial Intelligence. 2021;35:9816–9824.
36. Sun Y., Zeng Y., Li Y. Solving the spike feature information vanishing problem in spiking deep Q network with potential based normalization. arXiv. 2022. Preprint. doi: 10.48550/arXiv.2206.03654.
37. Tang G., Kumar N., Yoo R., Michmizos K. Deep reinforcement learning with population-coded spiking neural network for continuous control. Proceedings of the 2020 Conference on Robot Learning, vol. 155 of Proceedings of Machine Learning Research. PMLR; 2021:2016–2029.
38. Saravanan M., Kumar P.S., Dey K., Gaddamidi S., Kumar A.R. Exploring spiking neural networks in single and multi-agent RL methods. 2021 International Conference on Rebooting Computing (ICRC). IEEE; 2021:88–98.
39. Ye P.-G., Liang W., Lu Q., Xiao R.-F., Guo Z.-Y., Sun K.-X. Spiking mean field multi-agent reinforcement learning for dynamic resources allocation in D2D networks. 2021 20th International Conference on Ubiquitous Computing and Communications (IUCC/CIT/DSCI/SmartCNS). IEEE; 2021:60–67.
40. Nesterov-Rappoport D.L. The evolution of trust: understanding prosocial behavior in multi-agent reinforcement learning systems. Tech. Rep. Drew University, Madison, NJ; 2022.
41. Mordatch I., Abbeel P. Emergence of grounded compositional language in multi-agent populations. Proceedings of the AAAI Conference on Artificial Intelligence. 2018;32.
42. Mnih V., Kavukcuoglu K., Silver D., Rusu A.A., Veness J., Bellemare M.G., Graves A., Riedmiller M., Fidjeland A.K., Ostrovski G., et al. Human-level control through deep reinforcement learning. Nature. 2015;518:529–533. doi: 10.1038/nature14236.
43. Rashid T., Samvelyan M., Schroeder C., Farquhar G., Foerster J., Whiteson S. Qmix: monotonic value function factorisation for deep multi-agent reinforcement learning. International Conference on Machine Learning. PMLR; 2018:4295–4304.
44. Son K., Kim D., Kang W.J., Hostallero D.E., Yi Y. Learning to factorize with transformation for cooperative multi-agent reinforcement learning. International Conference on Machine Learning. PMLR; 2019:5887–5896.
45. Sukhbaatar S., Szlam A., Fergus R. Learning multiagent communication with backpropagation. Advances in Neural Information Processing Systems, Vol. 29. Curran Associates, Inc.; 2016:2244–2252.
46. Sheng J., Wang X., Jin B., Yan J., Li W., Chang T.-H., Wang J., Zha H. Learning structured communication for multi-agent reinforcement learning. Auton. Agent. Multi. Agent. Syst. 2022;36:50. doi: 10.1007/s10458-022-09580-8.
47. Tal D., Schwartz E.L. Computing with the leaky integrate-and-fire neuron: logarithmic computation and multiplication. Neural Comput. 1997;9:305–318. doi: 10.1162/neco.1997.9.2.305.
48. Gerstner W., Kistler W.M. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press; 2002.
49. Wu Y., Deng L., Li G., Zhu J., Shi L. Spatio-temporal backpropagation for training high-performance spiking neural networks. Front. Neurosci. 2018;12:331. doi: 10.3389/fnins.2018.00331.