Abstract
In the last decade, the free energy principle (FEP) and active inference (AIF) have achieved many successes connecting conceptual models of learning and cognition to mathematical models of perception and action. This effort is driven by a multidisciplinary interest in understanding aspects of self-organizing complex adaptive systems, including elements of agency. Various reinforcement learning (RL) models performing active inference have been proposed and trained on standard RL tasks using deep neural networks. Recent work has focused on improving such agents’ performance in complex environments by incorporating the latest machine learning techniques. In this paper, we build upon these techniques. Within the constraints imposed by the FEP and AIF, we attempt to model agents in an interpretable way without deep neural networks by introducing Free Energy Projective Simulation (FEPS). Using internal rewards only, FEPS agents build a representation of their partially observable environments with which they interact. Following AIF, the policy to achieve a given task is derived from this world model by minimizing the expected free energy. Leveraging the interpretability of the model, techniques are introduced to deal with long-term goals and reduce prediction errors caused by erroneous hidden state estimation. We test the FEPS model on two RL environments inspired by behavioral biology: a timed response task and a navigation task in a partially observable grid. Our results show that FEPS agents fully resolve the ambiguity of both environments by appropriately contextualizing their observations based on prediction accuracy only. In addition, they infer optimal policies flexibly for any target observation in the environment.
Introduction
A key challenge in both cognitive science and artificial intelligence is understanding cognitive processes in biological systems and applying this knowledge to the development of artificial agents. In this work, we develop an interpretable artificial agent which integrates key aspects of both reinforcement learning (RL) [1] and active inference [2–6]. In RL, an external reward signal is typically used to guide an agent’s behavior while in active inference no such reward signal exists. Instead, agents follow an intrinsic motivation which is rooted in the free energy principle (FEP) [7, 8].
Central to the FEP is the idea that adaptive systems can be modeled as performing an approximate form of Bayesian inference. The outcome of this process minimizes a quantity, called variational free energy (VFE), which lends its name to the FEP. The FEP encompasses various paradigms that align with the goal of Bayesian inference, such as predictive processing [9] and the Bayesian brain hypothesis [10].
One way to implement the FEP is through active inference which involves a planning method for action selection based on an internal model of the environment, referred to as the world model, along with a preference distribution, which encodes the agent’s desired states (e.g., maintaining a certain body temperature). According to the FEP, it is—in principle—always possible to explain the observed behavior of living systems through active inference [11]. The Free Energy Principle has been applied to a wide range of domains, including neuroscience [11–13], psychology [14, 15], behavioral biology [4, 16, 17], and machine learning [7, 18–21].
Active inference has recently gained traction in theoretical discussions [5, 22–25] leading to several successful algorithmic implementations [6, 7, 14, 18–21, 26, 27]. Existing implementations are based on methods developed in the context of model-based reinforcement learning and rely mostly on neural networks to represent the world model [6, 7, 14, 18, 21, 26, 28, 29]. As an alternative to neural networks, this work proposes implementing active inference using Projective Simulation (PS) [30].
In PS, memory is organized as a directed graph, which—unlike neural networks—is designed to be interpretable such that individual vertices carry semantic information [31, 32]. The agent’s cognitive processes are modelled as (1) a random walk through the graph (deliberation), and (2) the updating of transition probabilities along the graph’s edges (learning). The primary motivation for using PS over neural networks is its inherent interpretability.
While so far PS has mostly been employed in the model-free RL framework, this work extends PS to the active inference framework by proposing and testing an agent model called Free Energy Projective Simulation (FEPS). FEPS is an active inference model which uses PS for both the world model and the action policy, combining existing and novel methods for training these components. Key features of FEPS include an internal reward signal derived from the prediction accuracy, and the use of two distinct preference distributions—one for learning the world model and another for achieving a specific goal. For the latter purpose, we propose new heuristics to estimate the long-term value of belief states in sight of the goal.
In the following sections, we first introduce active inference (Sect 2) and PS (Sect 3), before presenting the architecture (Sect 4) and algorithmic features (Sect 5) of the FEPS model. We then test FEPS through numerical simulations in biologically inspired environments, focusing on timed responses and navigation (Sect 6). Finally, we conclude with a discussion of future research directions (Sect 7).
1 Related work
FEPS is a learning model that performs an approximate form of Bayesian inference, guided by internal rewards generated by the model itself. It thus falls within the broad category of reinforcement learning models: agents that interact with their environment to acquire data and select actions aimed at maximizing a utility function. More specifically, FEPS agents build upon model-based reinforcement learning (MBRL) since they are equipped with a world model that they use to plan actions. Using an active inference approach to interact with and learn about the environment, FEPS is also intended to model some cognitive processes involving cognitive maps to support adaptive behavior, drawing on existing techniques in this field. In the remainder of this section, we elaborate on how our model relates to the aforementioned methods and what it contributes beyond them.
Model-based reinforcement learning: Agents use a model of the environment to make predictions about the outcomes of actions and the resulting expected utility. This not only helps agents to learn from sparse rewards [33], but also makes the training more sample efficient by using the world model as a generative model [34–38]. In particular, some successful approaches such as Dreamer encode sensory inputs into a latent space, while a recurrent neural network captures the dynamics in that latent space [37, 39, 40]. In spite of their almost human-like performance in some complex environments [39, 40], the world models they learn remain inaccessible to users. Interpretability is therefore a key challenge for using MBRL models to reason about the cognitive processes that underlie and exploit cognitive maps. While FEPS also encodes observations into a latent space and simulates dynamics on the latter, it remains interpretable in three ways: 1) the world model, encoded on a graph-structured memory, is readable by users, 2) the agent’s decision-making process, modeled as a random walk on a graph, is traceable, and 3) the learning process amounts to strengthening associations between nodes in the memory.
RL with internal rewards: Finding efficient exploration strategies constitutes an essential problem in RL [41]. Curiosity mechanisms have been proposed [42], which often rely on a world model. In particular, the policy can be chosen to maximize both internal reward signals emitted by the agent itself, and external goal-oriented rewards distributed by the environment [43, 44]. Popular formulations of internal rewards include information-theoretical quantities, such as prediction errors [45], epistemic uncertainty [38], empowerment [46] or equivalently [47], mutual information [48, 49]. FEPS agents require no external rewards from the environment: thanks to active inference, their actions are internally motivated by information gain, ensuring exploration, and rewards are generated internally from the accuracy of predictions to reinforce relevant transitions in the world model.
Learning with active inference: Leveraging the free energy as a learning signal, such as a loss function [50] or a reward [51], for RL agents with active inference is advantageous in that it establishes a trade-off between exploration and exploitation, and allows for two types of exploration: goal-directed and epistemic [7, 52]. Internal motivation in active inference is rooted in a preference prior that describes the desirability of states, akin to a utility function in RL, and that can be either given or learned from experts [51], by predicting rewards [53], or calculated iteratively from experience [18, 20, 54]. Most RL applications of active inference model the agents’ distributions with deep neural networks [7, 19, 20, 50, 55–61], thereby inheriting their lack of interpretability. The contribution of FEPS is threefold. 1) We provide an iterative heuristic to calculate the preference prior from the world model, allowing the agent to adapt rapidly to target changes. 2) FEPS is interpretable. 3) Free energy is not used as a learning signal; instead, it integrates the world model and the preferences to plan actions, such that the agent’s representation of the environment directly influences its behavior.
Cognitive maps: Cognitive maps [62] are mental representations of spatial and abstract spaces [63, 64], learned from experience and supported by episodic memory [65–67]. Clone-structured cognitive graphs (CSCG) [68–72] have recently emerged as biologically plausible candidates to model cognitive maps: clones of sensory signals, organized on a graph, are allocated to specific paths or contexts in the partially observable environment, thanks to the sequences of observations collected. FEPS makes use of CSCG to structure the memory of the agent and performs Bayesian inference to formulate beliefs about hidden states, as in [72, 73] for example. Relying on CSCG, FEPS uses active inference to add a model of behavior based on the cognitive map of the agent. Instead of using the expectation-maximization algorithm, the cognitive map of the agent is updated by reinforcing associations between successive states.
Note that there are also parallels between energy-based techniques, such as Boltzmann machines and simulated annealing, and aspects of FEPS. Both use physics-inspired loss functions like the variational free energy to approximate complex probability distributions on graphs. However, the Free Energy Principle has a broader aim: to describe complex adaptive systems. In this framework, the minimization of the VFE is a tool to measure adaptation in terms of uncertainty reduction, and it remains agnostic of any method to implement it. In FEPS, the FEP further constrains the agent’s architecture and its interactions with external degrees of freedom. A final conceptual distinction is that, unlike energy-based graph techniques, the vertices of the graph are treated as fragments of the agent’s experiences in FEPS. The graph thus represents the agent’s memory, which is used to guide decisions.
2 Active inference
The following presentation of active inference is divided into two parts. First, we outline how the world model is constructed and updated in response to sensations (perceptual inference). Next, we explain how planning and action selection are modeled (active inference).
During perceptual inference, through interactions with the environment, the agent develops the world model [37, 38], sometimes referred to as a generative model. The world model [6, 11] is described as a partially observable Markov decision process (POMDP) $(\mathcal{B}, \mathcal{S}, \mathcal{A}, T, L)$, with belief states $b \in \mathcal{B}$, observations $s \in \mathcal{S}$, and actions $a \in \mathcal{A}$. The transition function T assigns probabilities to transitions between belief states given some action, while L, sometimes referred to as the emission function, defines the probability that some belief state emits a sensory signal $s \in \mathcal{S}$. While sensory states are observed and shared between the world model and the environment, belief states are part of the world model only and support the encoding of generally different unobservable hidden states in the environment.
In what follows, the time-indexed random variables $B_t$, $S_t$, and $A_t$ represent beliefs, observations, and actions at a given time t, with their possible values denoted by the corresponding lower-case letters. The transition function T between times t−1 and t corresponds to the conditional distribution $p(b_t \mid b_{t-1}, a_{t-1})$, while the emission function L at time t is the likelihood $p(s_t \mid b_t)$. Finally, rewards R are used to learn the transition and emission functions. In this work, rewards are internal (generated by the agent itself) [43] but in general they could also be external (given by the environment). It is further assumed that action $A_t$ is selected based on the current belief state $B_t$, as represented by the conditional distribution $\pi(a_t \mid b_t)$, which defines the policy. Combining these elements, the joint distribution of the world model up to time t can be decomposed as follows:
$$p(s_{0:t}, b_{0:t}, a_{0:t}) = p(b_0, s_0)\,\pi(a_0 \mid b_0) \prod_{k=1}^{t} p(s_k \mid b_k)\,\pi(a_k \mid b_k)\,p(b_k \mid b_{k-1}, a_{k-1}) \qquad (1)$$
where the index notation $x_{0:t} = (x_0, \ldots, x_t)$ is used for sequences of random variables, and $p(b_0, s_0)$ denotes the initial distribution over beliefs and observations. Eq 1 defines the relations between random variables: while the distribution over an observation and an action is specified by the current belief state, both the previous belief state and the previous action determine the current belief state.
The transition function plays a crucial role in active inference because it has to be learned by the agent. Following the idea of active inference, learning involves updating $p(b_t \mid b_{t-1}, a_{t-1})$ through variational inference, an approximate version of Bayesian inference. Variational inference simplifies Bayesian inference by restricting the set of possible posterior transition functions to a family $q_\phi(b_t \mid b_{t-1}, a_{t-1})$, parametrized by some variable ϕ, thus reducing computational complexity. When a new observation is received from the environment, the approximate posterior is obtained as the solution to an optimization problem which minimizes the VFE, here conditioned on the specific values $b_{t-1}$ and $a_{t-1}$ for the random variables $B_{t-1}$ and $A_{t-1}$ in the previous step:
$$F_t(\phi) = D_{\mathrm{KL}}\!\left[\,q_\phi(b_t \mid b_{t-1}, a_{t-1}) \,\big\|\, p(b_t \mid b_{t-1}, a_{t-1})\,\right] - \mathbb{E}_{q_\phi(b_t \mid b_{t-1}, a_{t-1})}\!\left[\log p(s_t \mid b_t)\right] \qquad (2)$$
where $\mathbb{E}_{q_\phi(b_t \mid b_{t-1}, a_{t-1})}\!\left[f(b_t)\right]$ represents the expectation value, over belief states distributed according to the posterior distribution, of some function f of $b_t$. Minimizing the VFE balances two effects: while the first term penalizes drastic changes in the distribution, the second promotes accuracy in the model to predict observations coming from the environment. The Kullback-Leibler divergence represents the dissimilarity between the posterior and prior distributions for the transition function, $q_\phi(b_t \mid b_{t-1}, a_{t-1})$ and $p(b_t \mid b_{t-1}, a_{t-1})$ respectively. The second term is the surprise of the current observation. In order to calculate this surprise, an expectation value is calculated from the posterior distribution over belief states. Since the first term is greater than or equal to zero, the VFE is an upper bound on the surprise raised by the current input from the environment. By updating its model and minimizing the variational free energy, the agent therefore minimizes its surprise. Typically, the prior is replaced by the posterior after collecting a number of observations and implementing the corresponding updates on the posterior.
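For concreteness, the complexity-accuracy decomposition of Eq 2 can be evaluated directly when belief states and observations are finite. The following Python sketch only illustrates the formula and is not the implementation used in this work; all names (vfe, q_post, p_prior, likelihood) are ours.

```python
import numpy as np

def vfe(q_post, p_prior, likelihood, s_t, eps=1e-12):
    """Variational free energy of Eq 2 for categorical distributions (illustrative).

    q_post     : posterior over belief states given (b_{t-1}, a_{t-1}), shape (N_B,)
    p_prior    : prior over belief states for the same conditioning,    shape (N_B,)
    likelihood : emission probabilities p(s | b),                       shape (N_B, N_S)
    s_t        : index of the observation actually received
    """
    # complexity: KL divergence between posterior and prior transition beliefs
    complexity = np.sum(q_post * (np.log(q_post + eps) - np.log(p_prior + eps)))
    # accuracy: expected log-probability of the observation under the posterior
    accuracy = np.sum(q_post * np.log(likelihood[:, s_t] + eps))
    return complexity - accuracy
```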
So far, the model acquisition has been described in terms of perceptual inference: the system has no control over its actions yet, and constructs the internal model based solely on its observations. The reduction of the free energy describes the adaptation of the system to its environment. The distribution over actions, or policy $\pi(a_t \mid b_t)$, can take any form and is not yet optimized over.
A FEPS agent performs active inference [2] by exploiting the world model that has been learned through perceptual inference in order to plan sequences of future actions. Using the current world model, the agent estimates the future free energy for each of its actions based on the current transition function and implements the one action that minimizes it. This estimate of the free energy for future states does not have a unique form [74] and different formulations can lead to different trade-offs between exploration and exploitation. Here, we use the most common one, denoted in the literature as the Expected Free Energy (EFE):
$$G(a_{t-1}, b_{t-1}) = \mathbb{E}_{p(s_t, b_t \mid b_{t-1}, a_{t-1})}\!\left[\log p(b_t \mid b_{t-1}, a_{t-1}) - \log \tilde{p}(s_t, b_t \mid b_{t-1}, a_{t-1})\right] \qquad (3)$$

$$= -H\!\left(B_t \mid b_{t-1}, a_{t-1}\right) - \mathbb{E}_{p(s_t, b_t \mid b_{t-1}, a_{t-1})}\!\left[\log \tilde{p}(s_t, b_t \mid b_{t-1}, a_{t-1})\right] \qquad (4)$$
where $p(s_t, b_t \mid b_{t-1}, a_{t-1})$ refers to the world model over a single time step, and the surprise of getting outcomes $s_t$ and $b_t$ is denoted $-\log \tilde{p}(s_t, b_t \mid b_{t-1}, a_{t-1})$. $H(Y \mid X = x)$ stands for the conditional entropy of the random variable Y conditioned on a specific value X = x. It corresponds to the expected surprise over belief states and is therefore always positive. $\tilde{p}(s_t, b_t \mid b_{t-1}, a_{t-1})$ is a distribution over belief states and observations.
The EFE relies on two measures to determine actions: how much uncertainty is expected, and how useful the action is for fulfilling a goal. In an active inference setting, an action’s utility refers to its expected capacity to meet some preferences. The first term in Eq 4 is the negative entropy over belief states, related to the expected information gain about the transition function: the larger the entropy, the larger the gain. Maximizing this entropy minimizes the expected free energy and leads to explorative behaviors [74]. Finite entropy can have two causes: (1) the transition to the next state is still uncertain in the world model, or (2) the transition to the next state in the environment is stochastic. In the second case, if entropy were used alone, agents could fall into the so-called curiosity trap [45] by always choosing actions that lead to random outcomes in the environment and inherently result in high entropy values. The second term reflects the utility gained from taking the action under consideration. When it is minimized, the transitions are expected to match the preferences to a greater extent. Therefore, minimizing the EFE maximizes the utility an agent expects from its action. For this purpose, a new biased distribution, the preference distribution $\tilde{p}(s_t, b_t \mid b_{t-1}, a_{t-1})$, encodes the desirability of some states and observations. The larger the preference for a state, the larger the probability in the preference distribution, and the smaller the associated surprise. This second term encourages exploitation of the world model in order to fulfill the preferences. In a biological agent, these preferences could be of genetic origin (a preference for homeostatic states, for example), socially learned (a preference for some type of song for birds of a certain area, which would differ for the same species in a different place), acquired, or externally given. As a result, they can encode a set of states and observations that are favorable to the survival of the agent. As a modeling choice, preferences cover both belief states and observations, and are conditioned on the previous state and action.
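To make the two terms of Eq 4 concrete, the sketch below evaluates them for categorical distributions over finite sets of belief states and observations. The function and argument names are illustrative and not taken from the FEPS code.

```python
import numpy as np

def expected_free_energy(p_next, likelihood, pref, eps=1e-12):
    """EFE of Eq 4 for one (b_{t-1}, a_{t-1}) pair (illustrative sketch).

    p_next     : p(b_t | b_{t-1}, a_{t-1}),                   shape (N_B,)
    likelihood : p(s_t | b_t),                                 shape (N_B, N_S)
    pref       : preferences p~(s_t, b_t | b_{t-1}, a_{t-1}),  shape (N_B, N_S)
    """
    # negative entropy of the predicted belief state (epistemic term)
    neg_entropy = np.sum(p_next * np.log(p_next + eps))
    # joint predictive distribution over (b_t, s_t) under the world model
    joint = p_next[:, None] * likelihood
    # expected surprise under the preference distribution (utility term)
    utility = -np.sum(joint * np.log(pref + eps))
    return neg_entropy + utility
```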
For the purpose of learning with an artificial agent, the preferences can naturally encode a task to fulfill as a preferred target state or observation. Choosing the action that minimizes the EFE, the agent selects the transition in its world model that will bring it as close as possible to its preferred states according to its world model. Furthermore, by using the full world model over long periods of time, say from 0 to t, as in Eq 1, the long-term EFE can be calculated to project t steps ahead in the world model and plan for sequences of t actions. Finding the optimal sequence of actions can rely on tree search with pruning [75], or on calculating a long-term, discounted expected free energy quantity [28, 54, 76], for example.
To summarize, according to the free energy principle, an agent that adapts to its surroundings can be modeled as learning a world model of its environment. In this process, the variational free energy is minimized, reflecting the reduction in the surprise the agent experiences about new sensory events. Active inference prescribes a method to plan actions in order to reduce this surprise. In other words, actions are chosen to consolidate the world model. An action is chosen if it minimizes the expected free energy, or equivalently, if it is expected to lead to transitions associated with high uncertainty and high utility. As noticed already in [74], this can be counter-intuitive at first glance, since minimizing the VFE and the EFE respectively minimizes and maximizes uncertainty. This paradox can be resolved by realizing that in active inference, actions target areas in the world model whose outcomes are least predictable in order to learn about them and to receive less surprising outcomes in the future, thereby minimizing the VFE in the long run.
3 Projective simulation
Projective Simulation (PS) [30] is a physics-inspired model for embodied reinforcement learning and agency that performs associative learning on a memory encoded as a graph. It is composed of a network of vertices, called clips, with a defined structure that gives each clip a semantic meaning, which can be assigned from the start or acquired progressively through past experiences. For example, a clip may represent a sensory state the agent’s sensors are capable of receiving, or it can inherit supplementary semantics from past experiences and reinforcements, through associations to other clips in the graph memory. Directed edges between clips are weighted and can be modified dynamically to learn and adapt to the environment. In particular, PS allows the simulation of percept-action loops to represent previous interactions with the environment and the associated decision process. The resulting graph is the Episodic and Compositional Memory (ECM). When a clip is visited, either because it is currently experienced, or because it is used for deliberation, it is excited. After a first clip is excited, deliberation takes place as a traceable random walk in the ECM originating from this initial clip. It ends when a decoupling criterion is met, which leads to an action on the environment.
For the purpose of solving different environments while retaining interpretability, the ECM can adopt different structures. In order to imitate a percept-action loop, the ECMs are often structured as bipartite graphs, where a first layer contains percepts and the second is composed of actions [77, 78]. In this case, the trained ECM is directly relatable to a policy. For more complex environments, or in order to extract abstract concepts from the percepts, an intermediate layer can be added between the percept and action layers [31, 79]. The ECM can either contain a fixed number of clips, or it can dynamically add and erase some of them when needed [79]. To consider composite percepts and actions, the ECM graph can be replaced by a hypergraph, where each hyperedge connects a set of clips to another set [80].
Each directed edge is equipped with at least one attribute to track learning and allow the agent to react adaptively to the environment. h-values increase as the edges are rewarded. They encode the strength of associations between clips that are useful to fulfill some task and record the learning of the agent. The probability $p(c_j \mid c_i)$ of the transition associated with the edge of h-value $h_{ij}$ connecting two clips $c_i$ and $c_j$ is inferred from the h-values:
$$p(c_j \mid c_i) = \frac{h_{ij}}{\sum_{k} h_{ik}} \qquad (5)$$
Alternatively, a softmax function can also be implemented to enhance the differences between probabilities, especially in large ECMs. Upon receiving a new percept, the agent deliberates by taking a random walk through the ECM, using the probabilities defined from the h-values to sample a new edge.
As the agent modeled with PS interacts with its environment, it receives rewards that are distributed over the different edges, which changes the corresponding h-values. Specifically, the h-value $h_{ij}$ of the edge $(c_i, c_j)$ is updated as follows:
$$h_{ij}^{(t+1)} = h_{ij}^{(t)} - \gamma\left(h_{ij}^{(t)} - h_{ij}^{(0)}\right) + R \qquad (6)$$
where γ is the forgetting parameter, $h_{ij}^{(0)}$ is the initial h-value of the edge, and R is the reward. When the reward is positive, the h-value of the corresponding edge is increased accordingly. If an edge is not visited or does not receive a positive reward, the corresponding h-value decays back to its initial value thanks to the forgetting mechanism in the second term. In order to accept continuous percepts and to unlock generalization on some tasks, neural networks have been used to update the h-values in some cases. Training was then implemented by minimizing a loss function [81].
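As a minimal illustration of Eqs 5 and 6, both rules can be written in a few lines for a dense matrix of h-values. The names are ours, and we assume that the reward is added only to the edges visited during the last deliberation while the forgetting term acts on all edges.

```python
import numpy as np

def transition_probs(h_values):
    """Eq 5: normalize the h-values of all edges leaving each clip."""
    return h_values / h_values.sum(axis=1, keepdims=True)

def update_h(h_values, h_init, visited, reward, gamma):
    """Eq 6: damp every h-value towards its initial value, then reward visited edges.

    visited : boolean mask marking the edges traversed in the last deliberation
    """
    h_new = h_values - gamma * (h_values - h_init)
    h_new[visited] += reward
    return h_new
```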
Projective simulation has been tested on multiple tasks, ranging from standard RL toy environments [77, 78] and robotics [82] to animal behavior simulations [83, 84] and quantum computation [85, 86]. Extensions of the model include modeling the ECM with a quantum photonic circuit [32] and considering composite concepts in the form of multiple jointly excited clips using hypergraphs [80].
4 The free energy projective simulation agent
We combine Projective Simulation with Active Inference, following the framework of the free energy principle. A FEPS agent is a model-based Projective Simulation agent, where the world model is an ECM with a clone-structured architecture [68, 70]. Consequently, clone clips inherit the semantics of the unique sensory state they relate to, and context creates distinctions between hidden states that emit the same observations. As in the FEP, the agent does not need external rewards. Instead, prediction accuracy, weighted with confidence, is used as a reinforcement signal. The world model is directly exploited by the agent to set the edges’ h-values in the policy with active inference.
4.1 Architecture of the agent
To mimic a system described by the FEP, the agent is composed of two structures: the world model and the policy, each represented by separate graphs with vertices corresponding to random variables, and edges that can be sampled to perform a random walk, as in Fig 1. Consider an agent that can perceive $N_S$ sensory states, has $N_B$ possible belief states and a repertoire of $N_A$ actions. Each state is supported by one vertex in a graph. The world model’s vertices can either support belief states or sensory states. The world model is the representation of the environment (see Eq 1) required by active inference. For reasons that will become clear shortly, the ECM of a FEPS agent is made of all vertices that support belief states, which we call clips, and of the edges that represent transitions between such clips. A belief state is then formally defined by the excitation configuration of clips at one step of the deliberation. We limit the number of excitations in the ECM at any given time to one. In this case, the excitation configuration on the clips is analogous, from a Bayesian perspective, to a belief (a probability distribution) over clips, where a single clip in the distribution is associated with a probability of 1. Such a vertex deserves the name "clip" because it receives an excitation when the corresponding representation of an event is revisited in order to make a decision. The policy covers all possible conditioned responses coming from any belief state, given the repertoire of actions of the agent.
Fig 1. Architecture and training of an FEPS agent.
a) Architecture of a FEPS agent, with four sensory states (squares) and two possible actions (diamonds). The agent has two main components: the world model and the policy. The world model is composed of vertices representing observations (squares) while clone clips represent all values a belief state can take (circles). As in a clone-structured graph, each clone clip b relates to exactly one observation s and the emission function is deterministic. The clone clips, together with the set of edges between them, form an ECM. A belief state, circled in purple, is designated by an excited clone clip. The weighted edges in the ECM encode the transition function and are trainable with reinforcement: there is one set of edges per action (light and dark turquoise arrows). The belief state in the ECM is an input to the policy, where the probability of sampling an action is a function of the EFE. In turn, the action that was selected determines the edge set to sample from in the world model in order to make a prediction for the next belief state and observation. b) Training of the world model of a FEPS agent. The agent interacts with the environment by receiving observations and implementing actions. When an action at is chosen, a corresponding edge is sampled in the world model, from the current to the next belief state, conditioned on the action. The observation st + 1 associated with the next belief state is the prediction for the next sensory state. Simultaneously, the action is applied to the environment and creates a transition in the hidden states of the environment, (bottom, green rectangle). This transition is perceived by the agent through the observation . Finally, the weights of the edges are updated. The reinforcement of an edge is proportional to the number of correct predictions it enabled in a row, as depicted with the thickness of the arrows in the world model. When the agent makes an incorrect prediction (the purple arrow), the reinforcements are applied to the edges that contributed to the trajectory. The last, incorrect, edge is not reinforced.
In the world model, two sets of edges can be traversed at different times during the deliberation: we denote them emission and transition edges respectively. They aim at predicting and explaining sensory signals received from the environment.
In the first set, emission edges relate belief states and sensory states, modeling the latter as parts of larger, possibly contextualized, hidden states by using clone-structured Hidden Markov Models (HMM) [68–70]. Each clip is bound to a single observation and is denoted a clone clip. A single edge in the emission set carries a non-zero probability for each clip, as shown in Fig 1a. Consequently, this set of edges defines a deterministic likelihood $p(s_t \mid b_t)$ in the world model in Eq 1, and the agent remains initially agnostic of the dynamical structure of the hidden states in the environment. Meanwhile, a clone clip can readily be interpreted as an augmented description of a specific observation. In particular, the additional information can relate to the cause or the context of the observation, such as the previous belief state and previous action, for example. Sampling some sensory state amounts to predicting the next observation: if it turns out to coincide with the actual observation perceived from the environment, a reward will be distributed to the edges contributing to the random walk that led to this prediction. For clarity, we denote predicted states with hatted lower-case letters, e.g. $\hat{s}$. We choose to associate each observation with a fixed number of clone clips, such that $N_B / N_S$ clone clips are bound to each observation. This approach works remarkably well for navigation problems [69, 70]. Transferring the dynamics learned on some set of sensory states to another set is also possible by keeping the transition function unchanged, but redistributing clone clips to the sensory states in the new set [70].
The purpose of the set of transition edges in the world model is to encode the presumed dynamics in the environment as transitions between belief states, conditioned on actions. They are represented as edges between clone clips in Fig 1a. In contrast to other sets of edges, transition edges are endowed with attributes such as h-values that enable learning with reinforcements (as in Eq 6 for example). Therefore, clone clips together with transition edges constitute an ECM for the FEPS agent. From a given clone clip, for each action, a set of edges points to the next possible estimated belief states. The h-values of those edges indicate how certain the agent is that taking a particular action from the current belief state will lead to any of the clone clips in the future. There can be at most $N_B$ edges in each such group of transition edges. The full set of transition edges (at most $N_A N_B^2$ edges in total) defines the transition function $p(b_{t+1} \mid b_t, a_t)$ in the world model in Eq 1 and corresponds to the trained part of the model. Before reinforcement, this distribution is referred to as the prior. The posterior, labeled $q(b_{t+1} \mid b_t, a_t)$, is the updated version of the transition function, after the agent distributed rewards to the relevant transition edges.
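A possible, simplified representation of such a clone-structured world model is sketched below: the deterministic emission is an integer map from clone clips to observations, and the transition edges are stored as a tensor of h-values with one slice per action. The class and attribute names are hypothetical and only serve to make the structure explicit.

```python
import numpy as np

class CloneWorldModel:
    """Illustrative container for a clone-structured ECM (not the authors' code)."""

    def __init__(self, n_obs, clones_per_obs, n_actions, h_init=1.0):
        self.n_beliefs = n_obs * clones_per_obs
        # deterministic emission: clone clip index -> observation index
        self.clip_to_obs = np.repeat(np.arange(n_obs), clones_per_obs)
        # trainable transition edges: h[b, a, b'], i.e. one edge set per action
        self.h = np.full((self.n_beliefs, n_actions, self.n_beliefs), h_init)
        self.h_init = h_init

    def transition_probs(self, b, a):
        """p(b_{t+1} | b_t = b, a_t = a), inferred from the h-values as in Eq 5."""
        row = self.h[b, a]
        return row / row.sum()

    def clones_of(self, obs):
        """All clone clips compatible with a given observation."""
        return np.flatnonzero(self.clip_to_obs == obs)
```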
The final component of the agent guides its behavior. The policy is modeled as a separate graph with two layers of $N_B$ and $N_A$ clips respectively. Given that the clone clip corresponding to belief state $b_t$ is excited in the world model, an action $a_t$ is sampled to be applied to the environment, based on how much surprise the agent expects from this decision. Each edge $(b_i, a_j)$ is weighted with the expected free energy $G(a_j, b_i)$, which defines the policy.
4.2 Reinforcement with prediction accuracy
Each interaction step with the environment involves a deliberation over three states: (1) the next belief state is proposed, (2) the next sensory state is predicted and (3) an action is chosen. The agent excites a belief state $b_{t+1}$ it believes it will transition to, given its current action $a_t$, by sampling a transition edge conditioned on that action. From there, the agent makes a prediction about the next sensory state. Meanwhile, an action $a_{t+1}$ is selected from the policy and applied to the environment, which emits a new observation. The interaction step ends by comparing the predicted and perceived sensory states, $\hat{s}_{t+1}$ and $s_{t+1}$.
The world model is trained without external rewards, and reinforcement is instead based on matching predictions and observations. We call a trajectory a sequence of transitions that led to correct predictions about sensory states. To record the trajectory, transition edges are equipped with a new attribute: the confidence, f. Initialized at zero, it increases for all transitions in the trajectory every time the prediction and the actual sensory state coincide. The more subsequent predictions an edge enabled, the higher the confidence for that edge: it reflects the number of correct predictions the edge enabled until the end of the trajectory. Formally, a trajectory $\tau_t$ is a sequence of transitions whose predictions were confirmed by the observations from the environment. If at step n in the t-th trajectory the sensory prediction was accurate, the confidence $f_e$ is enhanced for all edges e in the trajectory traversed so far:
$$f_e^{(n)} = f_e^{(n-1)} + 1, \qquad \forall\, e \in \tau_t \qquad (7)$$
When the prediction and observation do not match, the trajectory is interrupted, and the rewards are distributed to the transition edges’ h-values proportionally to the corresponding confidence:
$$h_e = h_e^{\mathrm{prev}} - \gamma\left(h_e^{\mathrm{prev}} - h_e^{(0)}\right) + R\, f_e \qquad (8)$$
where $h_e^{\mathrm{prev}}$ is the h-value at the end of the previous trajectory, $h_e^{(0)}$ the initial h-value of the edge, and R scales the reinforcement of the edges. Confidence values are reinitialized at zero to start the next trajectory. This mechanism provides a built-in learning schedule such that the scale of the reinforcement signals grows progressively: rewards are initially small when trajectories are short, and they become larger when transitions are accurately captured in the model. For the world model to yield accurate predictions, in the worst case each of the transition edges must have been visited and reinforced appropriately. As a result, the computational cost of learning scales at least quadratically in the number of observations and clones per observation. During the deliberations, states are sampled from the prior ECM, which has not yet received the rewards, while the posterior ECM is updated with confidence and rewards at the end of the trajectory. This is equivalent to sampling states from the prior ECM, but updating the posterior ECM with the rewards R for all edges in the trajectory every time a prediction was verified by the observation in the environment. Metaphorically speaking, this mechanism is analogous to layers of snow accumulating over time on salient features. At the end of a trajectory, the snow is cleared away, bringing all salient points back to an equal level.
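The confidence mechanism of Eqs 7 and 8 can be summarized as follows, reusing the hypothetical CloneWorldModel sketched above. The exact form of the update, in particular the forgetting term, follows our reading of Eq 8 and is an assumption rather than the authors' code.

```python
def reinforce_trajectory(model, trajectory, reward_scale, gamma):
    """Distribute rewards at the end of a trajectory, as in Eqs 7 and 8 (illustrative).

    trajectory : list of (b, a, b_next) transitions whose predictions were all correct;
                 the confidence of an edge is the number of correct predictions it
                 enabled until the end of the trajectory.
    """
    for rank, (b, a, b_next) in enumerate(trajectory):
        confidence = len(trajectory) - rank   # Eq 7: +1 per subsequent correct prediction
        h_old = model.h[b, a, b_next]
        # Eq 8 (assumed form): forgetting towards the initial h-value,
        # plus a confidence-weighted reward
        model.h[b, a, b_next] = (h_old - gamma * (h_old - model.h_init)
                                 + reward_scale * confidence)
```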
To complete the update of the FEPS agent, the policy is modified according to the EFE inferred from the new world model and can be adjusted to make behaviors more or less explorative. In particular, the h-value of an edge $(b_i, a_j)$ is set to the expected free energy in Eq 4 with a world model conditioned on $b_i$ and $a_j$. Each h-value directly carries the surprise expected from traversing the corresponding edge. As in [6], the policy is defined using a softmax function:
$$\pi(a_j \mid b_i) = \frac{e^{\zeta\, G(a_j, b_i)}}{\sum_{k=1}^{N_A} e^{\zeta\, G(a_k, b_i)}} \qquad (9)$$
where ζ is a (real-valued) scaling parameter and $G(a_j, b_i)$ is the value of the EFE for action $a_j$ coming from state $b_i$. In active inference, ζ is typically negative. When it becomes more negative, actions associated with small EFE receive a large probability. More specifically, looking at the decomposition of the EFE in Eq 4, actions associated with larger entropies, that is lower certainty, together with higher chances of landing on preferable states or observations, become more attractive during the deliberation. In contrast, if the scaling parameter is positive, large EFE values yield large probabilities in the policy, and actions with high certainty but also fewer chances of meeting preferences are more likely to be sampled. In this case, the agent implements a non-explorative policy that is confined to a known region of the environment at the expense of reaching the preferred states. When $\zeta = 0$, the policy is uniform, and the agent has no bias towards certainty nor utility.
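A minimal sketch of the softmax policy of Eq 9, with the role of the sign of ζ made explicit in the comments; the function name is ours.

```python
import numpy as np

def policy_from_efe(efe_values, zeta):
    """Eq 9: softmax policy over actions from the EFE values of one belief state.

    zeta < 0 favours low-EFE (preference-seeking) actions, zeta > 0 favours
    high-EFE actions, and zeta = 0 yields a uniform policy.
    """
    logits = zeta * np.asarray(efe_values, dtype=float)
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```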
5 Algorithmic features of the FEPS
The FEPS can be augmented with a number of techniques that take advantage of the world model and its interpretability. During learning, the internal model can be leveraged to identify transitions that are instrumental to gain information or to get closer to a preferred state. Furthermore, the performance of an agent in completing a task can be enhanced by evaluating the correct belief state accurately and quickly. Since the policy depends on the EFE, the preference distribution can be tuned according to the task: to seek information to complete the model, or to complete a given goal. Therefore, we propose to separate training into two phases, depending on how the preference distribution is constructed. We introduce a belief state estimation scheme that distributes belief states over multiple clone clips and eliminates those that are incompatible with new observations.
5.1 Leveraging preferences as a learning tool
So far, the preference distribution entering the EFE was not defined. One can optionally leverage it to define a goal in the environment, be it for the purpose of gaining information or to solve an actual task. Therefore, we propose to separate learning into two tasks: modeling the environment and attaining a goal in it. During the first phase, which we denote the exploration phase, the agent explores the environment without a prescribed goal. Instead, actions whose outcomes are expected to reduce prediction errors should be favored. This phase spreads over multiple episodes and relies on interacting with the environment. In contrast to the exploration phase, the second phase is dedicated to learning to complete a given task. For this purpose, we designed an algorithm to infer a goal-oriented policy from the world model in a single step and without further information.
5.1.1 Seek information gain about the world model.
Before designating any task bound to the environment as a preference, we investigate whether the preference distribution can be used to incentivize actions that minimize prediction errors, according to the current world model. This is directly related to the minimization of the VFE. Specifically, preferences should encourage the agent to seek transitions the world model associates with certainty – or equivalently with high probabilities, irrespective of actions. Sequences of interaction steps with the environment guided by this preference distribution belong to the exploration phase. To reflect the preference for highly probable transitions in the world model regardless of the action chosen, the preference distribution is constructed as the marginal of the world model over actions:
$$\tilde{p}(s_t, b_t \mid b_{t-1}, a_{t-1}) \equiv p(s_t, b_t \mid b_{t-1}) \qquad (10)$$

$$= \frac{1}{N_A} \sum_{a \in \mathcal{A}} p(s_t, b_t \mid b_{t-1}, a) \qquad (11)$$
Plugging this distribution into the expected free energy evaluated for an action a in Eq 4 results in the following:
$$G(a, b_{t-1}) = -H\!\left(B_t \mid b_{t-1}, a\right) - \mathbb{E}_{p(s_t, b_t \mid b_{t-1}, a)}\!\left[\log p(s_t, b_t \mid b_{t-1})\right] \qquad (12)$$

$$= -H\!\left(B_t \mid b_{t-1}, a\right) - \mathbb{E}_{p(b_t \mid b_{t-1}, a)}\!\left[\log p(b_t \mid b_{t-1})\right] \qquad (13)$$

$$= D_{\mathrm{KL}}\!\left[\,p(B_t \mid b_{t-1}, a) \,\big\|\, p(B_t \mid b_{t-1})\,\right] = I(B_t;\, A_{t-1} = a) \qquad (14)$$
where $I(X; Y = y)$ is the information gain about the random variable X when the value y of the second random variable Y is known. The dependency on the observations dropped from the first to the second line thanks to the constraints the clone structure imposes on the emission function. A complete derivation of this formula is provided in S2 Appendix.
If we follow the conventional formulation of active inference as in Sect 2, the agent should increase the probability of sampling an action that minimizes this EFE. Doing so during the exploration phase, the agent would therefore seek actions it estimates will yield the lowest information gain about belief states. As a result, the agent would stay in a region of the environment where it predicts it will receive the least surprise, according to its world model. This situation is sometimes referred to as the Dark Room problem [87]: an agent that adapts by minimizing its surprise about observations would act to stay in a dark room instead of using actions to explore other places that may be more surprising, but also more favorable to its survival, because all observations there would be predictable.
There is, however, an easy solution to this problem for FEPS. In order to avoid the dark room problem and to select actions that are expected to improve the model of the environment, the scaling parameter ζ in Eq 9 can be set to a positive value. In this case, the larger the EFE associated with an action (and hence the larger the estimated information gain about the next belief state), the larger the probability of this action in the policy. The scaling parameter ζ can be understood as a way to determine how greedy an agent is in its exploration, or how strongly the information gain associated with each action influences its behavior. We refer to the interaction steps in which the agent samples its actions from such a policy as the exploration phase.
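Under our reading of Eqs 12 to 14, the exploration-phase EFE of an action reduces to the information gain about the next belief state, that is, a KL divergence between the action-conditioned and the action-marginalized transition distributions. The sketch below computes this quantity; combined with a positive ζ in Eq 9 (e.g. through the policy_from_efe sketch above), actions with larger information gain become more probable. Names and the exact reduction are our assumptions.

```python
import numpy as np

def exploration_efe(p_next_given_a, p_next_marginal, eps=1e-12):
    """Exploration-phase EFE of one action (our reading of Eqs 12-14): the KL
    divergence between p(b_t | b_{t-1}, a) and the action-marginalized p(b_t | b_{t-1})."""
    p = np.asarray(p_next_given_a, dtype=float)
    q = np.asarray(p_next_marginal, dtype=float)
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))
```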
5.1.2 Task-oriented behavior by inferring preferences on belief states.
At the end of the exploration phase, a task is designated by encoding the associated targets with a high probability in the preference distribution. From there, the agent can plan, that is sample a sequence of actions to follow to achieve the goal. While the target is identified as a sensory state for the FEPS, transitions that are useful to reach it are deduced from the world model. In our framework, updating the policy takes a single step and does not require further interaction with the environment.
Though active inference commonly determines the behavior by planning sequences of actions, this becomes expensive for distant horizons $T_h$. A sequence of $T_h$ actions must be chosen out of $N_A^{T_h}$ possible combinations, by evaluating the EFE over an exponentially large space of possible outcomes for each sequence. Methods such as habitual tree search or sophisticated inference have been developed to mitigate this scaling issue [19, 76]. Alternative approaches are presented in [6].
Instead of planning by evaluating the generative distribution over all possible future sequences of outcomes, we propose to encode the long-term value of a state directly into the preference distribution. Our scheme is reminiscent of iterative value estimation [88, 89] and the successor representation [35, 90, 91], in that it estimates a value function given some expectations about occupancies of states in the future, either acquired by experience with reinforcement, for example, or by inverting a learned transition function. In contrast to searching a tree of future sequences of actions, this method does not rely on mental time traveling [92], to the extent that agents do not simulate possible future scenarios. Instead, they are “stuck in time”, and infer preferences in one go from stochastic quantities stored in the world model, in contrast to [20]. Our method also departs from sophisticated inference [28, 76]: instead of bootstrapping expected free energies in time, we bootstrap preferences over belief states and calculate the EFE only once.
We model the preference distribution to factorize over sensory and belief states [93], and we condition it on the current belief state bt:
$$\tilde{p}(s_{t+1}, b_{t+1} \mid b_t) = \tilde{p}(s_{t+1})\,\tilde{p}(b_{t+1} \mid b_t) \qquad (15)$$
The first part, $\tilde{p}(s_{t+1})$, is an absolute preference distribution over sensory states, which is independent of where the agent believes it is in the world model. More specifically, the target observation is associated with a high probability, while all other observations share the remaining probability mass uniformly. Since the target observation is given by design in the absolute preference distribution, it plays a role analogous to that of a reward function in reinforcement learning. In addition, the policy derived from Eq 9 aims at maximizing the probability of receiving the target observation: the more probable a transition to the target observation, the lower the EFE and the larger the probability of taking the corresponding action, thereby maximizing rewards. The second part, $\tilde{p}(b_{t+1} \mid b_t)$, reflects look-ahead preferences over belief states, that is, how useful the agent estimates a transition to be in order to satisfy its absolute preferences. In other words, the utility of belief states over longer horizons is inferred from the value associated with the observations they can transition to. This way, even if the goal might be far away in the world model, the preference for the target observation propagates to intermediate belief states that contribute to reaching it. For example, consider an animal in a maze. The target is manifested by a high preference towards observing food. However, the preference distribution over the locations does not indicate how to reach the food. To remedy this, the agent infers the value of belief states in order to reach the target: if a transition to some location brings the animal closer to the food, it is assigned a higher probability in the preference distribution. This way, the preference distribution highlights the path of relevant actions to the target.
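As an illustration, the absolute preference over observations can be built as follows; the value of p_target is an assumption, since the probability assigned to the target is not fixed here.

```python
import numpy as np

def absolute_preference(n_obs, target_obs, p_target=0.9):
    """Illustrative absolute preference over observations: the target observation gets
    a high probability and the remaining mass is spread uniformly over the others."""
    pref = np.full(n_obs, (1.0 - p_target) / (n_obs - 1))
    pref[target_obs] = p_target
    return pref
```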
We propose a heuristic algorithm to estimate the look-ahead preference distribution over transitions in the world model. The algorithm can update the policy up to a fixed number K of times if needed, to refine the preference distribution. The initial policy is uniform. During each update k, two quantities are calculated: (1) the look-ahead preference distribution results from the value of each transition, which estimates how useful a transition is to reach a target within a prediction horizon $T_h$, and (2) the policy is calculated with Eq 9 and the latest preference distribution.
To initialize each update step k, the policy is used together with the world model to evaluate how easily a belief state can be reached from the current state, through a distribution we denote the reachability:
$$\rho^{(k)}(b_{t+1} \mid b_t) = \sum_{a \in \mathcal{A}} p(b_{t+1} \mid b_t, a)\,\pi^{(k)}(a \mid b_t) \qquad (16)$$
The reachability of $b_{t+1}$ coming from $b_t$ is large if there exist transitions associated with high probabilities in the world model and the corresponding actions have high chances of being sampled from the policy.
In addition, the initial value of a belief state equals the value of the observation it corresponds to:
$$V^{(0)}(b_{t+1}) = \sum_{s \in \mathcal{S}} p(s \mid b_{t+1})\,\tilde{p}(s) \qquad (17)$$
For the clone-structured model, this sum reduces to a single term. At this stage, the only belief states associated with a high value are those that represent a target observation in the absolute preference distribution.
Next, for each iteration $n = 1, \ldots, T_h$, where $T_h$ is a prediction horizon, the value of a belief state is increased if it can lead to transitions that are useful to reach the target within n steps in the environment:
$$V^{(n)}(b_{t+1}) = \max\!\left(V^{(n-1)}(b_{t+1}),\; \lambda \max_{b^+ \in \mathrm{ch}(b_{t+1})} V^{(n-1)}(b^+)\right) \qquad (18)$$
for some discount factor $\lambda \in (0, 1)$ that makes the value of a state decrease with the number of steps between this state and the target. At each n, the value of a belief state $b_{t+1}$ can either keep its previous value $V^{(n-1)}(b_{t+1})$, or take the discounted value of the best state $b^+$ it can transition to. As a result, the value of a state can only increase. If n = 1, using this value function can point at the right decision when the agent is one step away from the target observation. However, it does not incentivize the correct action when starting further away from the goal. To mitigate this effect, a state can inherit the value of the states it can reach over larger time scales n > 1, thereby propagating the preference for the target to more distant but useful belief states.
When the prediction horizon is reached, a transition is associated with the following probability in the look-ahead preference distribution:
$$\tilde{p}(b_{t+1} \mid b_t) \propto \begin{cases} V^{(T_h)}(b_{t+1}) & \text{if } b_{t+1} \in \mathrm{ch}(b_t),\\ 0 & \text{otherwise,} \end{cases} \qquad \mathrm{ch}(b_t) = \left\{ b : \rho^{(k)}(b \mid b_t) \geq \bar{\rho}(b_t) \right\} \qquad (19)$$
where the set $\mathrm{ch}(b_t)$ contains the children of $b_t$, that is, the states that are easily reachable from $b_t$, and $\bar{\rho}(b_t)$ is the mean reachability of belief states coming from state $b_t$.
Finally, to conclude the k-th update step in the algorithm, the policy is calculated by using the preference distribution in the expected free energy, as in Eq 9. We show a possible look-ahead preference distribution in the world model in Fig 2. The computational time cost of this planning procedure scales linearly with the product of the prediction horizon $T_h$ and the number of update iterations K. Keeping the number of clones per observation constant, it also scales quadratically with the number of observations. This cost can easily become constant if the procedure is parallelized.
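The sketch below gathers Eqs 16 to 19 into a single update step of the heuristic, reusing the hypothetical CloneWorldModel introduced earlier. The threshold that defines the 'easily reachable' children through the mean reachability and the normalization of the look-ahead preference are our assumptions; the goal is only to make the flow of the computation explicit.

```python
import numpy as np

def lookahead_preferences(model, policy, abs_pref, horizon, discount=0.9):
    """One update step of the look-ahead preference heuristic (our reading of Eqs 16-19).

    model    : CloneWorldModel-like object with h[b, a, b'] and clip_to_obs
    policy   : pi(a | b), shape (N_B, N_A)
    abs_pref : absolute preference over observations, shape (N_S,)
    """
    trans = model.h / model.h.sum(axis=2, keepdims=True)    # p(b' | b, a)
    # Eq 16: reachability of b' from b under the current policy
    reach = np.einsum('bap,ba->bp', trans, policy)
    children = reach >= reach.mean(axis=1, keepdims=True)   # assumed "easily reachable" set
    # Eq 17: initial value of a belief state = preference of its observation
    value = abs_pref[model.clip_to_obs].astype(float)
    # Eq 18: propagate the discounted value of the best reachable state
    for _ in range(horizon):
        best_child = np.where(children, value[None, :], 0.0).max(axis=1)
        value = np.maximum(value, discount * best_child)
    # Eq 19: look-ahead preference proportional to the value of reachable children
    pref = np.where(children, value[None, :], 0.0)
    return pref / (pref.sum(axis=1, keepdims=True) + 1e-12)
```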
Fig 2. Estimation of belief states in superposition, after the world model has been trained.
To minimize its prediction error due to faulty belief state estimation, an agent considers multiple clone clips as candidate belief states simultaneously. For the initial observation, (on the left), the agent includes all corresponding clone clips to its hypothesis, as depicted on the right. Conditioned on the chosen action, a clone clip is sampled for each candidate belief state to represent the next one. Finally, all clone clips that are incompatible with the observation from the environment are eliminated from the hypothesis. The clips that remain become the current candidate belief states. In the world model, the thickness of the arrows represents the look-ahead preferences: the larger the arrow, the more advantageous is the transition in order to reach the target observation, s4 in this case.
5.2 Delineate belief states for the same observation
In spite of prescribing a method to sample actions, active inference does not include techniques to efficiently choose belief states when multiple of them could explain the current observation. In particular, models involving neural networks lack the interpretability to design suitable belief state selection methods. Therefore, letting the agent learn a world model with a clone-structured HMM has advantages beyond planning. We propose a technique to evaluate belief states in superposition, depicted in Fig 2. When placed in an environment and receiving its first observation, the agent makes an initial hypothesis about its belief state by distributing an excitation to any clone clip compatible with the observation, as shown in Fig 1c. At each step, an action is sampled from the policy for each excited clone clip. The resulting frequencies define a new distribution, from which an action is sampled before it is applied to the environment. For each compatible clone clip, the agent samples a new clip to represent the belief state it anticipates it would transition to if the clone clip under consideration stands for the correct belief state. The excitation jumps onto the new clip. After applying the action to the environment and receiving the resulting sensory signal, the agent takes away the excitation on any clip that does not match the current observation. Depending on the structure of the environment and the number of clones for each observation, the agent progressively narrows down its candidate belief states to a single clone clip, in spite of the initial uncertainty. This elimination process can be thought of as successively applying Bayes’ rule to the clone clips while keeping the posterior uniform as the agent receives more observations: thanks to the deterministic likelihood imposed by the clone structure of the ECM, the probability of any incompatible candidate clone clip becomes zero. In the event that the world model is imperfect and the agent has eliminated excitations on all clips, it starts its hypothesis over, and considers all the clone clips of its current, unpredicted observation as candidate belief states. This mechanism allows the agent to evolve in environments with ambiguous observations in the absence of more contextual information about its initial conditions. The computational cost of this method depends on the structure of the environment. More specifically, the maximum number of steps required to disambiguate hidden states is the length of the longest sequence of actions that would produce the same observation sequences when starting from those hidden states.
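The elimination scheme can be summarized as follows, again with the hypothetical CloneWorldModel; the restart rule for an empty hypothesis follows the description above. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def step_belief_superposition(model, candidates, action, obs, rng):
    """One elimination step over candidate clone clips (illustrative).

    candidates : set of clone clips currently compatible with past observations
    action     : action applied to the environment
    obs        : observation returned by the environment after the action
    """
    # for each candidate, sample the clone clip it is predicted to transition to
    predicted = {int(rng.choice(model.n_beliefs, p=model.transition_probs(b, action)))
                 for b in candidates}
    # keep only the predictions compatible with the actual observation
    survivors = {b for b in predicted if model.clip_to_obs[b] == obs}
    if not survivors:
        # imperfect model: restart the hypothesis from all clones of the observation
        survivors = set(int(b) for b in model.clones_of(obs))
    return survivors
```

For example, starting from candidates = set(model.clones_of(first_obs)) with rng = np.random.default_rng(), repeated calls progressively narrow the hypothesis down to a single clone clip when the world model is accurate.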
6 Numerical results
In this section, we present a numerical analysis of the model on environments inspired by behavioral psychology tasks, namely a timed response experiment in a Skinner box and a navigation task to forage for food. The parameters used for the simulations are given in S1 Table in the Appendix.
6.1 The timed response task
6.1.1 Learn short-term associations.
The timed response task is a minimal environment for an agent to learn to contextualize its observations with past states and actions when the sensory signals emitted by the environment are ambiguous. This environment simulates an animal standing in front of a door that can be opened with a lever. The goal is for the agent to learn a conditioned response and to press a lever at the right time in order to access food. The environment’s MDP is depicted in Fig 3. For this task, the observations combine two sensory inputs: {(light off, hungry), (light off, satiated), (light on, hungry)}. Since food can be consumed only when the light is off, the observation (light on, satiated) is excluded from the set. The actions are {wait, press the lever}. The environment is initialized in the (hidden) state E0, which emits observation (light off, hungry). From there, the light turns on, regardless of the action taken by the agent. Once the light has turned on, the agent must learn to wait one step before pressing the lever. If it presses too early, it gets back to the initial state. If the agent activates the lever on time, the environment transitions to the hidden state where the light is off and the agent is satiated.
Fig 3. MDP for the timed response environment.
This environment has four hidden states. The observations are compositional and contain information that is both external (light on or off) and internal (hungry or satiated) to the agent. Arrows correspond to the transitions that the agent’s actions can produce. In this environment, the agent can either wait or press a lever. In order to complete the task, the agent must reach the hidden state in which it feels satiated. The only way to do so is to follow the actions marked with thicker arrows. The observation (light on, hungry) is called ambiguous because it can be emitted by two hidden states, E1 and E2, which can only be distinguished with context.
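For reference, a minimal encoding of the timed response MDP consistent with our reading of Fig 3 is sketched below. The transitions for waiting once the lever can be pressed and for leaving the satiated state are not fully specified in the text, so they are marked as assumptions in the comments.

```python
# Illustrative encoding of the timed response MDP (our reading of Fig 3, not the
# authors' code). Observations: 0 = (light off, hungry), 1 = (light off, satiated),
# 2 = (light on, hungry). Actions: 0 = wait, 1 = press the lever.
EMISSION = {0: 0, 1: 2, 2: 2, 3: 1}       # hidden state E_i -> observation
TRANSITIONS = {                            # (hidden state, action) -> next hidden state
    (0, 0): 1, (0, 1): 1,                  # the light turns on regardless of the action
    (1, 0): 2, (1, 1): 0,                  # pressing too early resets the task
    (2, 0): 2, (2, 1): 3,                  # pressing on time gives access to food
    (3, 0): 0, (3, 1): 0,                  # assumed: the task restarts after eating
}
# Note: (2, 0) -> 2, i.e. staying put while waiting with the light on, is also an assumption.

def env_step(state, action):
    """Apply an action and return the next hidden state and the emitted observation."""
    next_state = TRANSITIONS[(state, action)]
    return next_state, EMISSION[next_state]
```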
For the simulations, we give each observation two clones. This makes it possible to accommodate enough candidate belief states to model up to two ambiguous hidden states that emit the same sensory signal, enabling the agent to adapt its policy, for example, to a one-step waiting time between the light turning on and the food becoming accessible. When observations are not ambiguous and can be emitted only by a single hidden state in the environment, some belief states become redundant. This redundancy can make the training more challenging to the extent that the agent has to find a convention and adapt its model accordingly before it can account for all transitions in the environment faithfully. The larger the waiting time n, the more clones might be necessary to learn. We train 100 agents for 4000 episodes of 80 steps in this environment and test two scenarios. In the first, the agent is directly given the preference for the target observation (light off, satiated); the EFE is then scaled with a negative scaling parameter ζ in the policy in Eq 9. In the second scenario, we test whether the agent can learn more efficiently if it first explores the environment aimlessly for the same number of episodes, without preference for the target and with a scaling parameter of 0, before adapting its policy to the task.
6.1.2 Simulation results.
The timed response environment is a minimal testbed for the FEPS, where some hidden states are not uniquely identified by the observation they emit but must also be distinguished using the recent past. The agent must learn two types of belief states. While clones for (light off, hungry) and (light off, satiated) only support transitions that are independent of the actions the agent takes, the clones for (light on, hungry) must appropriately use information about the previous observation and action to be contextualized and distinguished. Some results are reported in Fig 4.
Fig 4. Training FEPS agents for the timed response task.
a) Evolution of the variational free energy defined in Eq 2 (top) and of the expected free energy defined in Eq 4 (bottom) during training, averaged over 100 agents and a time window of 100 episodes. At each step, the VFE depends on the specific belief states and actions that were sampled. Two types of training are compared: a first set, "task" in dark purple, learned the model with a preference to fulfill a task in the environment, while the second set, "explore" in green, experimented aimlessly in the environment with a uniform policy before switching to the task. Both were trained for 4000 episodes before being tested on the task. The best and worst agents are represented with dashed and dotted lines respectively, and examples of individual agents are traced with transparent lines, full for task-oriented agents and dashed for exploring ones. When the VFE converges to its minimal value, the world model is precise enough for most belief states to make planning possible. As expected from the values chosen for the scaling parameter ζ, task-oriented agents select actions that minimize the EFE, while exploring agents select actions that maximize it. The EFE of exploring agents quickly plateaus at the limit derived in S3 Appendix. b) World model learned by one of the agents trained on the task, where each circle is a belief state, whose observation is indicated by its color and label. The numbers at the center of the circles are the clone indices of each clone clip. Arrows indicate the transitions learned in the world model, red for the action "press the lever" and blue for "wait". Dashed lines indicate that both actions lead to the same transition. The weight on an arrow indicates its probability in the world model. Stars mark transitions that were identified as useful to achieve the goal, with a probability of 1 in the preference distribution. The policy is indicated by the thickness of the arrows: a thick arrow corresponds to probabilities close to 1, a thinner one to probabilities close to 0.5.
During training, regardless of the strategy used to resolve the environment, all agents follow a similar learning pattern. The acquisition of the world model happens in stages, where the transition from one stage to the next manifests as a steep increase in the length of the trajectories of correct predictions, or equivalently, a steep decrease in the free energy, as in Fig 4a. First, agents quickly eliminate transitions that are impossible in the environment, leading to an initial drop in the VFE. For example, as shown in Fig 3, a direct transition from (light off, hungry) to (light off, satiated) is incompatible with the timed response task. During the second phase, the number of rewards the agent can collect is limited by the absence of a convention on the context-sensitive belief states, and a plateau is observed in the VFE. A final sharp decrease in the VFE signals the adoption of a convention between clones to accurately disentangle ambiguous hidden states that cannot be told apart from observations alone. The EFE evolves as expected during training: it decreases for the task-oriented agents, while it rapidly plateaus at its limit (see S3 Appendix) when training includes an exploration phase. However, in contrast to the VFE, the convergence of the EFE to its asymptotic value does not indicate that the model is good enough to make accurate predictions.
For most agents, training with or without a preference for the target does not influence the final world model. As shown in Fig 4a, the two best agents, that is, those that converge to free energy values below 1 the earliest, share similar learning curves. The averaged behaviors remain largely identical, except for the length of the second learning stage, during which the agents have yet to adopt a successful convention. In particular, in the absence of aimless experimentation with the environment, this second phase can last longer, such that convergence is not attained for a fraction of the agents trained in this way.
After convergence of the free energy, or equivalently of the number of successful predictions, the model has collapsed onto a single representation of the environment, in which the ambiguous observation is contextualized by the previous belief state and action. For example, the observation (light on, hungry) is divided into two belief states with different roles, reflecting the two hidden states in the environment. The corresponding clone clips carry more information than the observation to which they are linked. In Fig 4b, clone 2 of (light on, hungry) is the first hidden state encountered when the light turns on, and clone 1 can only be accessed by waiting from clone 2. When more clones are provided than necessary, two situations can arise: either the agent uses all the clones as duplicates of the same hidden state in the environment (as for the clones of (light off, hungry)), or a single clone is reachable (clone 2 of (light off, satiated)) and the other is never trained because it is never visited. When multiple clone clips participate equally in the representation of an observation, the actions that lead to it also split with equal probability (as derived in S3 Appendix).
Finally, for the agents that converged to an appropriate representation of the environment, the look-ahead preferences inferred from the world model result in an optimal policy when used in the EFE. A visualization of the preferences and the policy is provided in Fig 4. We chose a prediction horizon of 2 steps, and a single iteration was required to compute appropriate look-ahead preferences for the transitions. Only the transitions with maximal preference were kept, so that, from a given belief state, a single successor belief state is preferred over the others. The agents were tested for 1000 rounds, each starting in the environment state E0. For a prediction horizon of a single step, the edge between clone 1 of (light on, hungry) and clone 2 of (light off, satiated) would be the only edge to carry a probability larger than the other transitions in the preference distribution. Thanks to the look-ahead preferences (and the adoption of a convention between belief states), waiting between clones 1 and 2 of (light on, hungry) is also preferred. The policy resulting from this preference distribution is optimal for the task, even though the preference distribution initially provides no hint about the target before the last transition.
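The look-ahead preferences can be thought of as an iterative backup of the absolute preference through the world model. The sketch below illustrates this idea for a small prediction horizon; the maximum-based backup and the handling of normalization are simplifying assumptions, and the exact update rule is the one given by the equations referenced above.

```python
def look_ahead_preferences(world_model, belief_states, actions,
                           absolute_pref, horizon=2):
    """
    world_model[(b, a)] : dict mapping next belief state -> probability
    absolute_pref[b]    : preference assigned to the observation emitted by b
    Returns pref[(b, b_next)], a value for each transition reflecting how
    useful it is for reaching preferred observations within `horizon` steps.
    """
    value = dict(absolute_pref)   # value of a belief state, propagated backwards
    pref = {}
    for _ in range(horizon):
        new_value = {}
        for b in belief_states:
            best = 0.0
            for a in actions:
                for b_next, p in world_model[(b, a)].items():
                    # A transition is as useful as the (discounted) value it reaches.
                    pref[(b, b_next)] = max(pref.get((b, b_next), 0.0),
                                            p * value[b_next])
                    best = max(best, p * value[b_next])
            # Keep the state's own preference if it is already a target.
            new_value[b] = max(absolute_pref[b], best)
        value = new_value
    return pref
```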
6.2 Navigation task in a partially observable grid
6.2.1 Long-term planning in an ambiguous environment with symmetry.
The FEPS is further challenged in a larger navigation task, in which observations are shared among hidden states and, due to the symmetry of the environment, multiple sequences of actions can produce the same sequence of observations. In order to disentangle the hidden states, the agent must use long-term information about its past observations and actions to contextualize its current state in a way that is consistent across actions. In this environment, food is hidden at one position in a grid. Each location in a 3-by-3 grid world is associated with a smell intensity, an integer from 0 to 3 that increases from the lower left to the upper right corner according to its closeness to the food.
The world model is provided with 3 clones for each of the four observations, and the behavior repertoire comprises the directional actions {go right, go left, go up, go down}. 30 agents were trained for each choice of hyperparameters and preferences, for a total of 40000 episodes of 80 steps. The hyperparameters are provided in S1 Table in the Appendix. Since this environment is larger and more complex, we test different training techniques by changing the preference distribution and varying the scaling parameter. In particular, we test two scenarios. In the first, the agent is trained with preferences pointing at a target in the environment, while in the second, the preferences are identified with the marginal of the world model over actions, which incentivizes dissociating the effects of actions on the environment. In the first case, the agent tries to solve the task directly; in the second, it explores before learning the task. For each scenario, different scaling parameters were implemented, ranging from -3 to 3 for the task-oriented training and from 0 to 3 for training with an exploration phase. Note that choosing a scaling parameter of 0, whether training directly on the task or with a preceding exploration phase, results in the same uniform policy.
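For concreteness, a minimal version of this grid environment could look like the sketch below. The observation rule, intensity = max(0, 3 − Manhattan distance to the food), is our assumption: it reproduces the degeneracies discussed in the results (three cells emitting observation 0 near the lower left corner, a single cell emitting observation 3 at the food), but the rule used in the simulations may differ.

```python
ACTIONS = {"go right": (1, 0), "go left": (-1, 0), "go up": (0, 1), "go down": (0, -1)}

class SmellGridEnv:
    """3x3 grid; the observation is a smell intensity increasing near the food."""
    def __init__(self, food=(2, 2), size=3):
        self.size = size
        self.food = food
        self.pos = (0, 0)

    def observe(self):
        # Assumed rule: intensity = max(0, 3 - Manhattan distance to the food).
        d = abs(self.pos[0] - self.food[0]) + abs(self.pos[1] - self.food[1])
        return max(0, 3 - d)

    def step(self, action):
        dx, dy = ACTIONS[action]
        # Moves that would leave the grid keep the agent in place (assumption).
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        return self.observe()
```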
6.2.2 Simulation results.
Training task-oriented and exploring agents. Compared to agents trained with a uniform policy, agents using policies inferred from the EFE achieve longer trajectories. This is apparent in Fig 5a): with a scaling parameter of 0, the trajectories of agents reach a length of 60 on average by the end of the training, whereas, given proper tuning, agents using the EFE to define the policy can predict up to 69 transitions, regardless of the training method. Adapting the scaling parameter to the preference distribution is crucial for the agents to learn the model and to select actions that result in longer trajectories. For task-oriented agents, as expected from Eq 4, a scaling parameter of -3 is optimal, whereas +1 yields the best results for the exploring agents. For both training methods, setting it to +3 results in the shortest trajectories, 28 and 50 steps long, respectively. In the case of task-oriented agents, a possible explanation is that in this regime, when the preference distribution designates a target in the environment, the policy derived from the EFE in Eq 4 minimizes the drive of the agent to explore and to fulfill any preference in the environment. The agent therefore has no incentive to leave a region of the environment where it can already predict its observations, and does not try to learn the rest of it. In contrast, a large positive scaling parameter makes exploring agents very greedy; previously explored regions may then not be visited often enough to reinforce the correct associations in the world model in the long run. Finally, our simulations show that for the best parameter settings, the task-oriented agents converge on average faster than the exploring ones: while 17500 episodes are needed for the former to predict 65 steps, the latter cross this milestone at 28000 episodes.
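One consistent reading of how the scaling parameter enters the policy is as an inverse-temperature-like factor in a softmax over the expected free energy, as sketched below. The sign convention (negative values favour EFE-minimizing, task-oriented behaviour; positive values favour EFE-maximizing, information-seeking behaviour; zero gives a uniform policy) matches the results reported here, but the exact form of the policy is the one given by Eq 9.

```python
import math

def policy(efe_per_action, zeta):
    """
    efe_per_action: dict mapping action -> expected free energy G(b, a)
    zeta          : scaling parameter; zeta < 0 favours low-EFE actions,
                    zeta > 0 favours high-EFE actions, zeta = 0 is uniform.
    """
    logits = {a: zeta * g for a, g in efe_per_action.items()}
    m = max(logits.values())                      # for numerical stability
    weights = {a: math.exp(l - m) for a, l in logits.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

# Example: a task-oriented setting (zeta = -3) concentrates on the low-EFE action.
print(policy({"wait": 0.2, "press": 1.5}, zeta=-3))
```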
Fig 5. Training results for the grid world environment.
a) Evolution of the length of the trajectories during training, for scaling parameters ranging from −3 to 3 and for the two preference distributions: the agent either learns to complete the task from the start ("task") or first explores the grid ("explore"). We show running averages of the trajectory lengths over a time window of 1500 steps, averaged over 30 agents. These lengths depend on the specific belief states and actions sampled by the agents. b) Evolution of the variational and expected free energies during training for the two best settings in a): task-oriented preferences paired with a scaling parameter of −3, and exploration preferences with a parameter of +1. The thick lines represent the running average of the energies over a time window of 1500 steps, averaged over all 30 agents, while the dotted and dashed lines stand for the best and worst agents, that is, the agents whose VFE converges first and last, respectively, to a minimum. The transparent lines indicate the behavior of randomly selected agents: full lines for task-oriented agents and dashed lines for exploring agents. c) Comparison of the prediction accuracy of the models obtained with the two best parameter settings in a). Blue and red indicate explorative and task-oriented agents, respectively. Darker and lighter shades distinguish the belief state estimation methods, with "sup." designating superposition. The policies of the agents are uniform during this test, so the agents cannot rely on the actions yielding the most certain outcomes to validate their predictions. Two belief state estimation techniques are compared: standard belief state estimation samples a single clone clip at a time, whereas evaluation of belief states in superposition allows multiple clone clips to represent candidate belief states simultaneously, as long as they produce predictions compatible with the next observation. Each individual agent is tested over 1000 trajectories of at most 80 steps.
Looking at the evolution of the free energy during training in Fig 5b), one sees that in both cases the VFE decreases as the length of the trajectories increases: agents thus appear to minimize their free energy by maintaining a world model and improving their predictions about future sensory states, in agreement with the free energy principle. Comparing individual learning curves, exploring agents display more diverse behaviors than those trained on the task. We define the best and worst agents as those converging the earliest and the latest to a free energy minimum, respectively. The best exploring agents can converge faster than the task-oriented ones, whereas the worst exploring agent ends the training at a higher free energy. However, the free energy of the worst exploring agent either decreases or plateaus, but it does not rise again, as it does for the worst agent trained directly on the task. While the training scenario influences the scale of the EFE, no significant variation around the average EFE distinguishes successful agents from those that did not converge. In particular, the EFE of exploring agents converges very quickly to its asymptotic value, regardless of the accuracy of the world model.
In order to fairly compare the models obtained from each training method, note that whenever the length of trajectories or the free energy is optimized, two contributions are at play. On the one hand, training the world model to capture the environment more accurately decreases the uncertainty about the transitions following actions, and therefore improves the length of the trajectories and the free energy. On the other hand, as long as the policy deviates from the uniform distribution, it influences how much uncertainty the agent seeks out. Therefore, in order to evaluate how much the world model itself helps in making accurate predictions, we test the length of trajectories under a uniform policy for the agents trained in the two best parameter and preference settings. The trajectory lengths, averaged over all agents trained with the respective parameters, are provided in Fig 5c).
With the belief state estimation protocol in which a single clone clip is excited at each step, the trajectories achieved by FEPS agents are comparable in length for both model-learning strategies, exploration-oriented and task-oriented. More precisely, half of the agents are able to predict half of the maximum trajectory length, and the extreme cases are likewise comparable.
Estimating belief states in superposition, i.e. considering at each step all clone clips that are compatible with the current trajectory, doubles the length of the trajectories for both training strategies, with agents of both types reaching near-optimal lengths. For explorative agents, the top 50% of the agents achieve the same performance, showing a slight improvement over task-oriented agents. We expect this improvement to grow as environments reach larger scales and as the targets given to task-oriented agents become more ambiguous. The ability of the agents to adapt to different goals in the environment would also be affected.
The agent learns an interpretable map of the environment. After the free energy converges, the belief states in the world model no longer stand only for the observation they relate to, but also convey information about the paths in which they can appear. As an example, the world model of an agent trained directly on the task is represented in Fig 6a). Because the environment the agent was trained in is deterministic, each action tends to map onto a single successor belief state when the number of clones for an observation matches the number of hidden states in the environment.
Fig 6. Interpretability of the world model and robustness to reward reevaluation for the grid world environment.
a) Example of a world model learned by an agent trained directly on the task, with the target positioned in the top right corner of the grid. The circles represent the belief states as in Fig 1, numbered with clone indices and colored according to the observation they relate to in the grid. The arrows stand for the transition probabilities: the thicker the arrow, the more the agent believes that taking an action from a belief state will lead to the state the arrow points at. b) Mapping of the world model in a) onto the grid: each clone is associated with exactly one cell and can be interpreted as a single, specific hidden state in the environment. Stars stand for the preferred transitions, and the arrows for the policy resulting from these preferences. c) Median number of steps needed to reach the target from each initial position in the grid, with belief states estimated in superposition. Two targets, symbolized by gray triangles, are given to the agent: reach observation 3 (top right, red) and observation 0 (bottom left, blue), which require opposite policies. When the targets were swapped, no interaction with the environment was necessary to re-evaluate the transitions and the resulting policy. The median time to target is compared to that of a random agent with a uniform policy tested with the same procedure.
In order to understand how the world model can be interpreted, consider observation 0 (blue in Fig 6a) and b)), which can be emitted by three different hidden states in the lower left corner of the grid. After training, the roles of the three clone clips split based on what has been observed, including the sequences of actions and observations that led to each clone and those that were predicted from it. For example, in Fig 6a), going left or down from clone c3(0) maps it onto itself, suggesting that it represents the lower left corner of the grid. Going right or up from c3(0) predicts transitions to clones c2(0) and c1(0) respectively, making them candidates for the second positions on the first row and first column, respectively. In this way, each clone clip can be associated with a single location in the grid, as shown in Fig 6b). As a result, training the world model in this environment amounts to assigning a single hidden state to each clone clip in a manner that is consistent across actions.
In particular, despite being agnostic about the spatial nature of the environment, the world model can be interpreted as a topological map of the grid, as pictured in Fig 6b). Each clone maps onto a single location of the grid, and this mapping is consistent for all actions. When more clones are provided than the degeneracy of an observation requires, multiple clones become associated with a single location, as for observation 3 in the upper right corner of the grid. This consistent mapping suggests that the agent is able to contextualize its observations with a sequence of previous states and actions. Thanks to this mapping of clones onto single locations, the look-ahead preferences over belief states encode paths ending at the target and steer the policy toward an optimal one.
Exploiting the world model to complete tasks flexibly. Provided the target observation is signaled in the preference distribution, the agents can adapt their policy to reach it along near-optimal trajectories. In Fig 6c), agents whose model successfully converged are tested on reaching the targets with observations 3 (top right, red) and 0 (bottom left, blue), which require opposite policies. For both tasks, irrespective of the initial location of the agent, the performance is drastically improved compared to a random agent with a uniform policy. Moving the target required no interaction with the environment beyond changing the preference distribution over sensory states; no new observations were exchanged.
While the paths toward observation 0 are mostly optimal, reaching observation 3 incurs an overhead of at most one step. Since the policy itself is optimal, as shown in Fig 6b), we suggest that this overhead is due to the estimation of the belief states. In particular, the small size of the grid prevents agents from narrowing down their hypotheses to a single belief state early enough to choose the right action at the boundary of the grid. This could explain why, on average, agents take 1.5 steps to reach the target when starting directly below it: they have a 0.5 probability of holding the wrong belief state and therefore choose the wrong action 50% of the time. This eliminates the last erroneous state from the hypothesis set, and the agent behaves optimally in the next step.
7 Discussion and outlook
In this work, we propose an interpretable, minimal model of cognitive processes that combines associative learning and active inference to adapt to any environment, without prior knowledge about it and independently of any goal tied to it. We develop the Free Energy Projective Simulation model, a physics-inspired model that draws on current paradigms in cognitive science, namely active inference, which encompasses the Bayesian brain hypothesis (BBH) and fits more broadly into predictive processing. A FEPS agent is equipped with a memory structured as a graph, the ECM, in which the agent encodes associations between events represented as clone clips. As in models of associative learning in the cerebellum [94–96], internal reinforcement signals based on the prediction accuracy of the agent reinforce associations between observations. Clone clips are the support for belief states and acquire contextual meaning as the agent collects longer sequences of correct predictions about its observations and improves its world model. The behavior of FEPS agents does not depend on any reinforcement and is fully determined by active inference, such that the policy optimizes the expected free energy estimated for each action, given the current world model. The resulting model is interpretable in three ways: 1) the world model is readable, 2) deliberation is traceable, and 3) credit assignment during training is explainable. We perform a numerical analysis of the agents in two environments inspired by behavioral psychology: a timed response environment and a navigation task in a partially observable grid world.
Leveraging the interpretability of the model, we identified three capacities a FEPS agent requires in order to interact with an environment and reach a goal while controlling it, that is, while minimizing its surprise about future events. (1) The representation of the environment in the world model should be accurate enough, i.e. it should be capable of predicting future sensory signals. (2) The agent should be able to choose the appropriate belief state for aliased observations. (3) For a given time horizon, the agent should be equipped with an efficient mechanism to plan its course of actions.
First, in order to build a complete and accurate representation of the environment, we propose to start learning with an exploration phase that focuses on exploring the environment strategically rather than completing a task. For this, we provide a model-dependent preference distribution that evolves as the agent learns, and we show that the corresponding expected free energy of an action equals the information gain about belief states resulting from that action. As a result, the behavior of the agent is driven by the acquisition of information for its world model rather than by a target bound to the environment. Our numerical simulations showed that a model learned in this way predicts longer sequences of observations when actions are selected at random, compared to a model trained directly on a task. Previous works usually either adopt a different definition of the expected free energy [74] or, more commonly, add terms to it to encourage explorative behavior [20], leveraging the existing literature on curiosity mechanisms, for example [29, 43, 45, 97]. Alternatively, a count-based boredom mechanism [98] could be well suited for FEPS, as it could be implemented directly on the edge attributes, such as the confidence, to steer the agent away from transitions it has already resolved.
Second, we design a simple procedure to progressively disambiguate belief states using sequences of observations, which can be applied when interacting with the environment after the model has been trained. Leveraging the clone structure imposed on the clips in the memory of the agent, belief states can be estimated in superposition. As the predictions of each candidate belief state are validated or invalidated by the observation collected from the environment, the agent eliminates candidate belief states that are incompatible with the context provided by its current sensory state. We show in our numerical analysis that this simple mechanism doubles the number of correct, consecutive predictions, regardless of the technique used to learn the world model.
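A minimal sketch of this disambiguation step is given below, under the assumption that each clone clip is associated with a single observation and that the world model provides a distribution over next belief states; the fallback when all candidates are eliminated is also an assumption.

```python
def update_candidates(candidates, action, next_obs, world_model, obs_of):
    """
    candidates  : set of clone clips currently compatible with the trajectory
    world_model : dict (belief state, action) -> dict of next belief state -> prob
    obs_of      : dict mapping a belief state (clone clip) to its observation
    Returns the clones reachable from any candidate whose observation matches
    the one actually received; if none survive, fall back to all emitters.
    """
    survivors = set()
    for b in candidates:
        for b_next, p in world_model[(b, action)].items():
            if p > 0 and obs_of[b_next] == next_obs:
                survivors.add(b_next)
    if not survivors:
        # Every candidate's prediction failed: restart from all clones of next_obs.
        survivors = {b for b, o in obs_of.items() if o == next_obs}
    return survivors
```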
Third, we introduce a planning method based on the expected free energy. Instead of relying on a tree search or a simulation of future scenarios, we propose to encode the utility of a transition between two belief states in the preference distribution. For this, we factorize the preference distribution into an absolute preference distribution, which designates a target among the possible sensory states, and a look-ahead preference distribution. The latter assigns to transitions between belief states a probability commensurate with their estimated utility in reaching the target within a given number of actions. Using the world model, the value of each transition is determined iteratively. We tested this scheme in two numerical experiments and obtained optimal policies in both cases, provided the world model was sufficiently predictive. The closest model we are aware of is the successor representation [35, 90, 91], which has been hypothesized to account for so-called model-free learning in cognitive systems [90]. A major difference is that in the successor representation, the value of a transition depends on the prediction error over the expected reward in the environment, whereas we assign value to a belief state via the probability of the associated sensory state in the absolute preference distribution. A limitation of our scheme, however, is that a target can only be encoded as a possibly ambiguous observation. Using reinforcement and a few interactions with the environment, the preference for a particular hidden state could be encoded in the look-ahead preference distribution in a hybrid scheme.
The numerical analysis of FEPS in larger environments revealed that scaling the method requires modifications of the model in some cases, inheriting in this regard the scalability issues of tabular RL methods. More specifically, the main obstacle to efficiently learning a world model is the number of edges to reinforce, which scales quadratically with the number of clones and therefore with the number of distinguishable hidden states in the environment. Consequently, FEPS can learn a representation of a large environment with little to no ambiguity in the observations, but can fail to model a smaller one with few but very ambiguous observations. In addition, clone differentiation can become harder if many different paths in the environment can produce the same observation sequence using the same actions. Including a mechanism to add and delete clones, such that the world model approximates an ε-transducer for the environment [99] on finitely many degrees of freedom, could help mitigate this scaling issue. Modifications to the decision-making and reinforcement algorithms could also involve updating multiple edges at each step in order to speed up the learning process.
Conceptually, the FEPS framework fits into the field of NeuroAI [100], at the intersection of behavioral sciences, engineering, and neuroscience. While Projective Simulation can be used to learn in artificial environments, its vocation is to understand agency and the behavior of agents in the world. Furthermore, the FEPS attempts to give a biologically plausible account of learning and adaptive behavior, grounding internal computations in the active inference framework and the predictive processing paradigm. ECMs are potentially implementable on physical platforms and can be regarded as embodied structures underlying the memory of the agents. For example, a parallel can be drawn between the role the network of belief states plays for FEPS agents and that of place cells and grid cells in the hippocampus [69, 70, 72]: both integrate stimuli to create a contextualized representation of an event in order to make predictions about future stimuli. For the FEPS to provide a modeling platform of interest for cognitive and behavioral sciences, the next challenge is to implement learning on real-world tasks and in a fully embodied way, including the computation of coincidences between predictions and observations and the update of the associations between states.
Supporting information
(PDF)
(PDF)
(PDF)
Acknowledgments
The authors express their gratitude to Philip A. Lemaitre for fruitful discussions.
Data Availability
The code and data supporting this study are accessible at https://doi.org/10.5281/zenodo.15526876.
Funding Statement
This research was funded in whole, or in part, by the Austrian Science Fund (FWF) DOI 10.55776/F71 and 10.55776/WIT9503323. For the purpose of open access, the author has applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission. We gratefully acknowledge support from the European Union (ERC Advanced Grant, QuantAI, No. 101055129). The views and opinions expressed in this article are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council - neither the European Union nor the granting authority can be held responsible for them. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Sutton RS, Barto AG. Reinforcement learning: an introduction. Cambridge, MA: MIT Press; 1998.
- 2. Kirchhoff M, Parr T, Palacios E, Friston K, Kiverstein J. The Markov blankets of life: autonomy, active inference and the free energy principle. J R Soc Interface. 2018;15(138):20170792. doi: 10.1098/rsif.2017.0792
- 3. Linson A, Clark A, Ramamoorthy S, Friston K. The active inference approach to ecological perception: general information dynamics for natural and artificial embodied cognition. Front Robot AI. 2018;5:21. doi: 10.3389/frobt.2018.00021
- 4. Pezzulo G, Rigoli F, Friston K. Active inference, homeostatic regulation and adaptive behavioural control. Prog Neurobiol. 2015;134:17–35. doi: 10.1016/j.pneurobio.2015.09.001
- 5. Raja V, Valluri D, Baggs E, Chemero A, Anderson ML. The Markov blanket trick: on the scope of the free energy principle and active inference. Phys Life Rev. 2021;39:49–72. doi: 10.1016/j.plrev.2021.09.001
- 6. Mazzaglia P, Verbelen T, Çatal O, Dhoedt B. The free energy principle for perception and action: a deep learning perspective. Entropy (Basel). 2022;24(2):301. doi: 10.3390/e24020301
- 7. Tschantz A, Millidge B, Seth AK, Buckley CL. Reinforcement learning through active inference. arXiv preprint 2020. https://arxiv.org/abs/2002.12636
- 8. Friston K, FitzGerald T, Rigoli F, Schwartenbeck P, O'Doherty J, Pezzulo G. Active inference and learning. Neurosci Biobehav Rev. 2016;68:862–79. doi: 10.1016/j.neubiorev.2016.06.022
- 9. Bubic A, von Cramon DY, Schubotz RI. Prediction, cognition and the brain. Front Hum Neurosci. 2010;4:25. doi: 10.3389/fnhum.2010.00025
- 10. Vilares I, Kording K. Bayesian models: the structure of the world, uncertainty, behavior, and the brain. Ann N Y Acad Sci. 2011;1224(1):22–39. doi: 10.1111/j.1749-6632.2011.05965.x
- 11. Friston K. The free-energy principle: a unified brain theory? Nat Rev Neurosci. 2010;11(2):127–38. doi: 10.1038/nrn2787
- 12. Friston K, Kiebel S. Predictive coding under the free-energy principle. Philos Trans R Soc Lond B Biol Sci. 2009;364(1521):1211–21. doi: 10.1098/rstb.2008.0300
- 13. Pezzulo G, D'Amato L, Mannella F, Priorelli M, Van de Maele T, Stoianov IP, et al. Neural representation in active inference: using generative models to interact with, and understand, the lived world. Ann N Y Acad Sci. 2024;1534(1):45–68. doi: 10.1111/nyas.15118
- 14. Cullen M, Davey B, Friston KJ, Moran RJ. Active inference in OpenAI Gym: a paradigm for computational investigations into psychiatric illness. Biol Psychiatry Cogn Neurosci Neuroimaging. 2018;3:809–18. doi: 10.1016/j.bpsc.2018.06.010
- 15. McGovern HT, De Foe A, Biddell H, Leptourgos P, Corlett P, Bandara K, et al. Learned uncertainty: the free energy principle in anxiety. Front Psychol. 2022;13:943785. doi: 10.3389/fpsyg.2022.943785
- 16. Ramstead MJD, Badcock PB, Friston KJ. Answering Schrödinger's question: a free-energy formulation. Phys Life Rev. 2018;24:1–16. doi: 10.1016/j.plrev.2017.09.001
- 17. Heins C, Millidge B, Da Costa L, Mann RP, Friston KJ, Couzin ID. Collective behavior from surprise minimization. Proc Natl Acad Sci U S A. 2024;121(17):e2320239121. doi: 10.1073/pnas.2320239121
- 18. Mazzaglia P, Verbelen T, Dhoedt B. Contrastive active inference. In: Advances in Neural Information Processing Systems. vol. 34. Curran Associates, Inc.; 2021. p. 13870–82. https://proceedings.neurips.cc/paper/2021/hash/73c730319cf839f143bf40954448ce39-Abstract.html
- 19. Fountas Z, Sajid N, Mediano P, Friston K. Deep active inference agents using Monte-Carlo methods. In: Advances in Neural Information Processing Systems. vol. 33. Curran Associates, Inc.; 2020. p. 11662–75. https://proceedings.neurips.cc/paper/2020/hash/865dfbde8a344b44095495f3591f7407-Abstract.html
- 20. Nguyen VD, Yang Z, Buckley CL, Ororbia A. R-AIF: solving sparse-reward robotic tasks from pixels with active inference and world models. arXiv preprint 2024. http://arxiv.org/abs/2409.14216
- 21. de Tinguy D, Van de Maele T, Verbelen T, Dhoedt B. Spatial and temporal hierarchy for autonomous navigation using active inference in minigrid environment. Entropy (Basel). 2024;26(1):83. doi: 10.3390/e26010083
- 22. Clark A. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav Brain Sci. 2013;36(3):181–204. doi: 10.1017/S0140525X12000477
- 23. Clark A. How to knit your own Markov blanket: resisting the second law with metamorphic minds. Philosophy and Predictive Processing. 2017. doi: 10.15502/9783958573031
- 24. Kagan BJ, Kitchen AC, Tran NT, Habibollahi F, Khajehnejad M, Parker BJ, et al. In vitro neurons learn and exhibit sentience when embodied in a simulated game-world. Neuron. 2022;110(23):3952–69.e8. doi: 10.1016/j.neuron.2022.09.001
- 25. Smith R, Badcock P, Friston KJ. Recent advances in the application of predictive coding and active inference models within clinical neuroscience. Psychiatry Clin Neurosci. 2021;75(1):3–13. doi: 10.1111/pcn.13138
- 26. Lanillos P, Meo C, Pezzato C, Meera AA, Baioumy M, Ohata W. Active inference in robotics and artificial agents: survey and challenges. arXiv preprint 2021. http://arxiv.org/abs/2112.01871
- 27. Piriyakulkij T, Kuleshov V, Ellis K. Active preference inference using language models and probabilistic reasoning. arXiv preprint 2023. http://arxiv.org/abs/2312.12009
- 28. Kawahara D, Ozeki S, Mizuuchi I. A curiosity algorithm for robots based on the free energy principle. In: 2022 IEEE/SICE International Symposium on System Integration (SII). 2022. p. 53–9. doi: 10.1109/sii52469.2022.9708819
- 29. Tinker TJ, Doya K, Tani J. Intrinsic rewards for exploration without harm from observational noise: a simulation study based on the free energy principle. arXiv preprint 2024. http://arxiv.org/abs/2405.07473
- 30. Briegel HJ, De las Cuevas G. Projective simulation for artificial intelligence. Sci Rep. 2012;2:400. doi: 10.1038/srep00400
- 31. Eva B, Ried K, Müller T, Briegel HJ. How a minimal learning agent can infer the existence of unobserved variables in a complex environment. Minds Mach (Dordr). 2023;33(1):185–219. doi: 10.1007/s11023-022-09619-5
- 32. Flamini F, Krumm M, Fiderer LJ, Müller T, Briegel HJ. Towards interpretable quantum machine learning via single-photon quantum walks. arXiv preprint 2023. http://arxiv.org/abs/2301.13669
- 33. Daw ND, Dayan P. The algorithmic anatomy of model-based evaluation. Philos Trans R Soc Lond B Biol Sci. 2014;369(1655):20130478. doi: 10.1098/rstb.2013.0478
- 34. Sutton RS. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bull. 1991;2(4):160–3. doi: 10.1145/122344.122377
- 35. Dayan P. Improving generalization for temporal difference learning: the successor representation. Neural Computation. 1993;5(4):613–24. doi: 10.1162/neco.1993.5.4.613
- 36. Gu S, Lillicrap T, Sutskever I, Levine S. Continuous deep Q-learning with model-based acceleration. In: Proceedings of the 33rd International Conference on Machine Learning. 2016. p. 2829–38. https://proceedings.mlr.press/v48/gu16.html
- 37. Ha D, Schmidhuber J. World models. arXiv preprint 2018. https://arxiv.org/abs/1803.10122
- 38. Mendonca R, Rybkin O, Daniilidis K, Hafner D, Pathak D. Discovering and achieving goals via world models. In: Advances in Neural Information Processing Systems. 2021. p. 24379–91. https://proceedings.neurips.cc/paper/2021/hash/cc4af25fa9d2d5c953496579b75f6f6c-Abstract.html
- 39. Hafner D, Lillicrap T, Ba J, Norouzi M. Dream to control: learning behaviors by latent imagination. arXiv preprint 2020. http://arxiv.org/abs/1912.01603
- 40. Kaiser L, Babaeizadeh M, Milos P, Osinski B, Campbell RH, Czechowski K. Model-based reinforcement learning for Atari. arXiv preprint 2024. http://arxiv.org/abs/1903.00374
- 41. Ladosz P, Weng L, Kim M, Oh H. Exploration in deep reinforcement learning: a survey. Information Fusion. 2022;85:1–22. doi: 10.1016/j.inffus.2022.03.003
- 42. Schmidhuber J. Curious model-building control systems. In: Proceedings of the 1991 IEEE International Joint Conference on Neural Networks. 1991. p. 1458–63. https://ieeexplore.ieee.org/document/170605/
- 43. Oudeyer P-Y, Kaplan F. What is intrinsic motivation? A typology of computational approaches. Front Neurorobot. 2007;1:6. doi: 10.3389/neuro.12.006.2007
- 44. Aubret A, Matignon L, Hassas S. A survey on intrinsic motivation in reinforcement learning. arXiv preprint 2019. http://arxiv.org/abs/1908.06976
- 45. Pathak D, Agrawal P, Efros AA, Darrell T. Curiosity-driven exploration by self-supervised prediction. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Honolulu, HI, USA: IEEE; 2017. p. 488–9. http://ieeexplore.ieee.org/document/8014804/
- 46. Klyubin AS, Polani D, Nehaniv CL. Empowerment: a universal agent-centric measure of control. In: 2005 IEEE Congress on Evolutionary Computation. 2005. p. 128–35. doi: 10.1109/cec.2005.1554676
- 47. Mohamed S, Rezende DJ. Variational information maximisation for intrinsically motivated reinforcement learning. arXiv preprint 2015. http://arxiv.org/abs/1509.08731
- 48. Gregor K, Rezende DJ, Wierstra D. Variational intrinsic control. arXiv preprint 2016. http://arxiv.org/abs/1611.07507
- 49. Kim H, Kim J, Jeong Y, Levine S, Song HO. EMI: exploration with mutual information. arXiv preprint 2019. http://arxiv.org/abs/1810.01176
- 50. Ueltzhöffer K. Deep active inference. Biol Cybern. 2018;112(6):547–73. doi: 10.1007/s00422-018-0785-7
- 51. Shin JY, Kim C, Hwang HJ. Prior preference learning from experts: designing a reward with active inference. arXiv preprint 2021. http://arxiv.org/abs/2101.08937
- 52. Schwartenbeck P, Passecker J, Hauser TU, FitzGerald TH, Kronbichler M, Friston KJ. Computational mechanisms of curiosity and goal-directed exploration. Elife. 2019;8:e41703. doi: 10.7554/eLife.41703
- 53. Sajid N, Tigas P, Friston K. Active inference, preference learning and adaptive behaviour. IOP Conf Ser: Mater Sci Eng. 2022;1261(1):012020. doi: 10.1088/1757-899x/1261/1/012020
- 54. Paul A, Sajid N, Da Costa L, Razi A. On efficient computation in active inference. Expert Systems with Applications. 2024;253:124315. doi: 10.1016/j.eswa.2024.124315
- 55. Millidge B. Deep active inference as variational policy gradients. arXiv preprint 2019. http://arxiv.org/abs/1907.03876
- 56. Heins RC, Mirza MB, Parr T, Friston K, Kagan I, Pooresmaeili A. Deep active inference and scene construction. Front Artif Intell. 2020;3:509354. doi: 10.3389/frai.2020.509354
- 57. Sancaktar C, van Gerven MAJ, Lanillos P. End-to-end pixel-based deep active inference for body perception and action. In: 2020 Joint IEEE 10th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob). 2020. p. 1–8. https://ieeexplore.ieee.org/document/9278105/
- 58. Tschantz A, Baltieri M, Seth AK, Buckley CL. Scaling active inference. arXiv preprint 2019. http://arxiv.org/abs/1911.10601
- 59. van der Himst O, Lanillos P. Deep active inference for partially observable MDPs. arXiv preprint 2020. http://arxiv.org/abs/2009.03622
- 60. Catal O, Verbelen T, Nauta J, Boom CD, Dhoedt B. Learning perception and planning with deep active inference. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020. doi: 10.1109/icassp40776.2020.9054364
- 61. Millidge B, Buckley CL. Successor representation active inference. arXiv preprint 2022. http://arxiv.org/abs/2207.09897
- 62. Tolman EC. Cognitive maps in rats and men. Psychol Rev. 1948;55(4):189–208. doi: 10.1037/h0061626
- 63. Stachenfeld KL, Botvinick MM, Gershman SJ. The hippocampus as a predictive map. Nat Neurosci. 2017;20(11):1643–53. doi: 10.1038/nn.4650
- 64. Ekstrom AD, Ranganath C. Space, time, and episodic memory: the hippocampus is all over the cognitive map. Hippocampus. 2018;28(9):680–7. doi: 10.1002/hipo.22750
- 65. Behrens TEJ, Muller TH, Whittington JCR, Mark S, Baram AB, Stachenfeld KL, et al. What is a cognitive map? Organizing knowledge for flexible behavior. Neuron. 2018;100(2):490–509. doi: 10.1016/j.neuron.2018.10.002
- 66. Rueckemann JW, Sosa M, Giocomo LM, Buffalo EA. The grid code for ordered experience. Nat Rev Neurosci. 2021;22(10):637–49. doi: 10.1038/s41583-021-00499-9
- 67. Momennejad I. Memory, space, and planning: multiscale predictive representations. arXiv preprint 2024. http://arxiv.org/abs/2401.09491
- 68. Dedieu A, Gothoskar N, Swingle S, Lehrach W, Lázaro-Gredilla M, George D. Learning higher-order sequential structure with cloned HMMs. arXiv preprint 2019. http://arxiv.org/abs/1905.00507
- 69. George D, Rikhye RV, Gothoskar N, Guntupalli JS, Dedieu A, Lázaro-Gredilla M. Clone-structured graph representations enable flexible learning and vicarious evaluation of cognitive maps. Nat Commun. 2021;12(1):2392. doi: 10.1038/s41467-021-22559-5
- 70. Guntupalli JS, Raju RV, Kushagra S, Wendelken C, Sawyer D, Deshpande I. Graph schemas as abstractions for transfer learning, inference, and planning. arXiv preprint 2023. http://arxiv.org/abs/2302.07350
- 71. de Tinguy D, Verbelen T, Dhoedt B. Exploring and learning structure: active inference approach in navigational agents. In: Buckley CL, Cialfi D, Lanillos P, Pitliya RJ, Sajid N, Shimazaki H, editors. Active Inference. Cham: Springer; 2025. p. 105–18.
- 72. Whittington JCR, McCaffary D, Bakermans JJW, Behrens TEJ. How to build a cognitive map. Nat Neurosci. 2022;25(10):1257–72. doi: 10.1038/s41593-022-01153-y
- 73. Whittington JCR, Muller TH, Mark S, Chen G, Barry C, Burgess N, et al. The Tolman-Eichenbaum machine: unifying space and relational memory through generalization in the hippocampal formation. Cell. 2020;183(5):1249–1263.e23. doi: 10.1016/j.cell.2020.10.024
- 74. Millidge B, Tschantz A, Buckley CL. Whence the expected free energy? arXiv preprint 2020. http://arxiv.org/abs/2004.08128
- 75. Da Costa L, Parr T, Sajid N, Veselic S, Neacsu V, Friston K. Active inference on discrete state-spaces: a synthesis. J Math Psychol. 2020;99:102447. doi: 10.1016/j.jmp.2020.102447
- 76. Friston K, Da Costa L, Hafner D, Hesp C, Parr T. Sophisticated inference. Neural Computation. 2021;33(3):713–63.
- 77. Melnikov AA, Makmal A, Briegel HJ. Projective simulation applied to the grid-world and the mountain-car problem. arXiv preprint 2014. http://arxiv.org/abs/1405.5459
- 78. Mautner J, Makmal A, Manzano D, Tiersch M, Briegel HJ. Projective simulation for classical learning agents: a comprehensive investigation. New Gener Comput. 2015;33(1):69–114. doi: 10.1007/s00354-015-0102-0
- 79. Melnikov AA, Makmal A, Dunjko V, Briegel HJ. Projective simulation with generalization. Sci Rep. 2017;7(1):14430. doi: 10.1038/s41598-017-14740-y
- 80. LeMaitre PA, Krumm M, Briegel HJ. Multi-excitation projective simulation with a many-body physics inspired inductive bias. arXiv preprint 2024. http://arxiv.org/abs/2402.10192
- 81. Jerbi S, Trenkwalder LM, Poulsen Nautrup H, Briegel HJ, Dunjko V. Quantum enhancements for deep reinforcement learning in large spaces. PRX Quantum. 2021;2(1):010328. doi: 10.1103/prxquantum.2.010328
- 82. Hangl S, Dunjko V, Briegel HJ, Piater J. Skill learning by autonomous robotic playing using active learning and exploratory behavior composition. Front Robot AI. 2020;7:42. doi: 10.3389/frobt.2020.00042
- 83. López-Incera A, Ried K, Müller T, Briegel HJ. Development of swarm behavior in artificial learning agents that adapt to different foraging environments. PLoS One. 2020;15(12):e0243628. doi: 10.1371/journal.pone.0243628
- 84. López-Incera A, Nouvian M, Ried K, Müller T, Briegel HJ. Honeybee communication during collective defence is shaped by predation. BMC Biol. 2021;19(1):106. doi: 10.1186/s12915-021-01028-x
- 85. Tiersch M, Ganahl EJ, Briegel HJ. Adaptive quantum computation in changing environments using projective simulation. Sci Rep. 2015;5:12874. doi: 10.1038/srep12874
- 86. Trenkwalder LM, López-Incera A, Nautrup HP, Flamini F, Briegel HJ. Automated gadget discovery in science. arXiv preprint 2022. http://arxiv.org/abs/2212.12743
- 87. Friston K, Thornton C, Clark A. Free-energy minimization and the dark-room problem. Front Psychol. 2012;3:130. doi: 10.3389/fpsyg.2012.00130
- 88. Åström KJ. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications. 1965;10(1):174–205. doi: 10.1016/0022-247x(65)90154-x
- 89. Silver D, Veness J. Monte-Carlo planning in large POMDPs. In: Advances in Neural Information Processing Systems. 2010. https://papers.nips.cc/paper_files/paper/2010/hash/edfbe1afcf9246bb0d40eb4d8027d90f-Abstract.html
- 90. Russek EM, Momennejad I, Botvinick MM, Gershman SJ, Daw ND. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLoS Comput Biol. 2017;13(9):e1005768. doi: 10.1371/journal.pcbi.1005768
- 91. Momennejad I, Russek EM, Cheong JH, Botvinick MM, Daw ND, Gershman SJ. The successor representation in human reinforcement learning. Nat Hum Behav. 2017;1(9):680–92. doi: 10.1038/s41562-017-0180-8
- 92. Roberts WA. Are animals stuck in time? Psychol Bull. 2002;128(3):473–89. doi: 10.1037/0033-2909.128.3.473
- 93. Wei R. Value of information and reward specification in active inference and POMDPs. arXiv preprint 2024. http://arxiv.org/abs/2408.06542
- 94. Thompson RF, Thompson JK, Kim JJ, Krupa DJ, Shinkman PG. The nature of reinforcement in cerebellar learning. Neurobiology of Learning and Memory. 1998;70(1–2):150–76. doi: 10.1006/nlme.1998.3845
- 95. Wagner MJ, Luo L. Neocortex-cerebellum circuits for cognitive processing. Trends Neurosci. 2020;43(1):42–54. doi: 10.1016/j.tins.2019.11.002
- 96. Rao RP, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat Neurosci. 1999;2(1):79–87. doi: 10.1038/4580
- 97. Kim Y, Nam W, Kim H, Kim JH, Kim G. Curiosity-bottleneck: exploration by distilling task-specific novelty. In: Proceedings of the 36th International Conference on Machine Learning. 2019. p. 3379–88. https://proceedings.mlr.press/v97/kim19c.html
- 98. Bellemare M, Srinivasan S, Ostrovski G, Schaul T, Saxton D, Munos R. Unifying count-based exploration and intrinsic motivation. In: Advances in Neural Information Processing Systems. 2016. https://proceedings.neurips.cc/paper_files/paper/2016/hash/afda332245e2af431fb7b672a68b659d-Abstract.html
- 99. Barnett N, Crutchfield JP. Computational mechanics of input–output processes: structured transformations and the ε-transducer. J Stat Phys. 2015;161(2):404–51. doi: 10.1007/s10955-015-1327-5
- 100. Momennejad I. A rubric for human-like agents and NeuroAI. Philos Trans R Soc Lond B Biol Sci. 2023;378(1869):20210446. doi: 10.1098/rstb.2021.0446