PLOS Computational Biology. 2020 Apr 23;16(4):e1007805. doi: 10.1371/journal.pcbi.1007805

Learning action-oriented models through active inference

Alexander Tschantz 1,2,*, Anil K Seth 1,2,3, Christopher L Buckley 2,4,*
Editor: Natalia L Komarova
PMCID: PMC7200021  PMID: 32324758

Abstract

Converging theories suggest that organisms learn and exploit probabilistic models of their environment. However, it remains unclear how such models can be learned in practice. The open-ended complexity of natural environments means that it is generally infeasible for organisms to model their environment comprehensively. Alternatively, action-oriented models attempt to encode a parsimonious representation of adaptive agent-environment interactions. One approach to learning action-oriented models is to learn online in the presence of goal-directed behaviours. This constrains an agent to behaviourally relevant trajectories, reducing the diversity of the data a model needs to account for. Unfortunately, this approach can cause models to prematurely converge to sub-optimal solutions, through a process we refer to as a bad-bootstrap. Here, we exploit the normative framework of active inference to show that efficient action-oriented models can be learned by balancing goal-oriented and epistemic (information-seeking) behaviours in a principled manner. We illustrate our approach using a simple agent-based model of bacterial chemotaxis. We first demonstrate that learning via goal-directed behaviour indeed constrains models to behaviourally relevant aspects of the environment, but that this approach is prone to sub-optimal convergence. We then demonstrate that epistemic behaviours facilitate the construction of accurate and comprehensive models, but that these models are not tailored to any specific behavioural niche and are therefore less efficient in their use of data. Finally, we show that active inference agents learn models that are parsimonious, tailored to action, and which avoid bad bootstraps and sub-optimal convergence. Critically, our results indicate that models learned through active inference can support adaptive behaviour in spite of, and indeed because of, their departure from veridical representations of the environment. Our approach provides a principled method for learning adaptive models from limited interactions with an environment, highlighting a route to sample-efficient learning algorithms.

Author summary

Within the popular framework of ‘active inference’, organisms learn internal models of their environments and use the models to guide goal-directed behaviour. A challenge for this framework is to explain how such models can be learned in practice, given (i) the rich complexity of natural environments, and (ii) the circular dependence of model learning and sensory sampling, which may lead to behaviourally suboptimal models being learned. Here, we develop an approach in which organisms selectively model those aspects of the environment that are relevant for acting in a goal-directed manner. Learning such ‘action-oriented’ models requires that agents balance information-seeking and goal-directed actions in a principled manner, such that both learning and information seeking are contextualised by goals. Using a combination of theory and simulation modelling, we show that this approach allows simple but effective models to be learned from relatively few interactions with the environment. Crucially, our results suggest that action-oriented models can support adaptive behaviour in spite of, and indeed because of, their departure from accurate representations of the environment.

Introduction

In order to survive, biological organisms must be able to efficiently adapt to and navigate in their environment. Converging research in neuroscience, biology, and machine learning suggests that organisms achieve this feat by exploiting probabilistic models of their world [1–8]. These models encode statistical representations of the states and contingencies in an environment and agent-environment interactions. Such models plausibly endow organisms with several advantages. For instance, probabilistic models can be used to perform perceptual inference, implement anticipatory control, overcome sensory noise and delays, and generalize existing knowledge to new tasks and environments. While encoding a probabilistic model can be advantageous in these and other ways, natural environments are extremely complex and it is infeasible to model them in their entirety. Thus it is unclear how organisms with limited resources could exploit probabilistic models in rich and complex environments.

One approach to this problem is for organisms to selectively model their world in a way that supports action [9–14]. We refer to such models as action-oriented, as their functional purpose is to enable adaptive behaviour, rather than to represent the world in a complete or accurate manner. An action-oriented representation of the world can depart from a veridical representation in a number of ways. First, because only a subset of the states and contingencies in an environment will be relevant for behaviour, action-oriented models need not exhaustively model their environment [11]. Moreover, specific misrepresentations may prove to be useful for action [15–18], indicating that action-oriented models need not be accurate. By reducing the need for models to be isomorphic with their environment, an action-oriented approach can increase the tractability of the model learning process [19–24], especially for organisms with limited resources.

Within an action-oriented approach, an open question is how action-oriented models can be learned from experience. The environment, in and of itself, provides no distinction between states and contingencies that are relevant for behaviour and those which are not. However, organisms do not receive information passively. Rather, organisms actively sample information from their environment, a process which plays an important role in both perception and learning [23, 25–27]. One way that active sampling can facilitate the learning of efficient action-oriented models is to learn online in the presence of goal-directed actions. Performing goal-directed actions restricts an organism to behaviourally relevant trajectories through an environment. This, in turn, structures sensory data in a behaviourally relevant way, thereby reducing the diversity and dimensionality of the sampled data (see Fig 1). Therefore, this approach offers an effective mechanism for learning parsimonious models that are tailored to an organism’s adaptive requirements [19, 20, 23, 24, 28, 29].

Fig 1. The coupling of learning and control.


(A) Goal-directed cycle of learning and control. A schematic overview of the coupling between a model and its environment when learning takes place in the presence of goal-directed actions. Here, a model is learned based on sampled observations. This model is then used to determine goal-directed actions, causing goal-relevant transitions in the environment, which in turn generate goal-relevant observations. (B) Maladaptive cycle of learning and control. A schematic overview of the model-environment coupling when learning in the presence of goal-directed actions, but for the case where a maladaptive model has been initially learned. The feedback inherent in the online learning scheme means that the model samples sub-optimal observations, which are subsequently used to update the model, thus entrenching maladaptive cycles of learning and control (bad bootstraps). (C) Observations sampled from random actions. The spread of observations covers the space of possible observations uniformly, meaning that a model of these observations must account for a diverse and distributed set of data, increasing the model’s complexity. The red circle in the upper-right quadrant indicates the region of observation space associated with optimal behaviour, which is only sparsely sampled. Note these are taken from a fictive simulation and are purely illustrative. (D) Observations sampled from sub-optimal goal-directed actions. Only a small portion of observation space is sampled. A model of this data would, therefore, be more parsimonious in its representation of the environment. However, the model prescribes actions that cause the agent to selectively sample a sub-optimal region of observation space (i.e. outside the red circle in the upper-right quadrant). As the agent only samples this portion of observation space, the model does not learn about more optimal behaviours. (E) Observations sampled from optimal goal-directed actions. Here, as in D, the goal-directed nature of action ensures that only a small portion of observation space is sampled. However, unlike D, this portion is associated with optimal behaviours.

Learning probabilistic models to optimise behaviour has been extensively explored in the model-based reinforcement learning (RL) literature [8, 30–32]. A significant drawback to existing methods is that they tend to prematurely converge to sub-optimal solutions [33]. One reason this occurs is the inherent coupling between action selection and model learning. At the onset of learning, agents must learn from limited data, and this can lead to models that initially overfit the environment and, as a consequence, make sub-optimal predictions about the consequences of action. Subsequently using these models to determine goal-oriented actions can result in biased and sub-optimal samples from the environment, further compounding the model’s inefficiencies, and ultimately entrenching maladaptive cycles of learning and control, a process we refer to as a “bad-bootstrap” (see Fig 1).

One obvious approach to resolving this problem is for an organism to perform some actions, during learning, that are not explicitly goal-oriented. For example, heuristic methods, such as ε-greedy [34], utilise noise to enable exploration at the start of learning. However, random exploration of this sort is likely to be inefficient in rich and complex environments. In such environments, a more powerful method is to utilize the uncertainty quantified by probabilistic models to determine epistemic (or intrinsic, information-seeking, uncertainty reducing) actions that attempt to minimize the model uncertainty in a directed manner [35–40]. While epistemic actions can help avoid bad-bootstraps and sub-optimal convergence, such actions necessarily increase the diversity and dimensionality of sampled data, thus sacrificing the benefits afforded by learning in the presence of goal-directed actions. Accordingly, a principled and pragmatic method is needed to learn action-oriented models in the presence of both goal-directed and epistemic actions.

In this paper, we develop an effective method for learning action-oriented models. This method balances goal-directed and epistemic actions in a principled manner, thereby ensuring that an agent’s model is tailored to goal-relevant aspects of the environment, while also ensuring that epistemic actions are contextualized by and directed towards an agent’s adaptive requirements. To achieve this, we exploit the theoretical framework of active inference, a normative theory of perception, learning and action [41–43]. Active inference proposes that organisms maintain and update a probabilistic model of their typical (habitable) environment and that the states of an organism change to maximize the evidence for this model. Crucially, both goal-oriented and epistemic actions are complementary components of a single imperative to maximize model evidence—and are therefore evaluated in a common (information-theoretic) currency [38, 40, 43].

We illustrate this approach with a simple agent-based model of bacterial chemotaxis. This model is not presented as a biologically-plausible account of chemotaxis, but instead, is used as a relatively simple behaviour to evaluate the hypothesis that adaptive action-oriented models can be learned via active inference. First, we confirm that learning in the presence of goal-directed actions leads to parsimonious models that are tailored to specific behavioural niches. Next, we demonstrate that learning in the presence of goal-directed actions alone can cause agents to engage in maladaptive cycles of learning and control—‘bad bootstraps’—leading to premature convergence on sub-optimal solutions. We then show that learning in the presence of epistemic actions allows agents to learn accurate and exhaustive models of their environment, but that the learned models are not tailored to any behavioural niche, and are therefore inefficient and unlikely to scale to complex environments. Finally, we demonstrate that balancing goal-directed and epistemic actions through active inference provides an effective method for learning efficient action-oriented models that avoid maladaptive patterns of learning and control. ‘Active inference’ agents learn well-adapted models from a relatively limited number of agent-environment interactions and do so in a way that benefits from systematic representational inaccuracies. Our results indicate that probabilistic models can support adaptive behaviour in spite of, and moreover because of, the fact that they depart from veridical representations of the external environment.

The structure of the paper is as follows. In section two, we outline the active inference formalism, with a particular focus on how it prescribes both goal-directed and epistemic behaviour. In section three, we present the results of our agent-based simulations, and in section four, we discuss these results and outline some broader implications. In section five, we outline the methods used in our simulations, which are based on the Partially Observable Markov Decision Process (POMDP) framework, a popular method for modelling choice behaviour under uncertainty.

Results

Formalism

Active inference is a normative theory that unifies perception, action and learning under a single imperative—the minimization of variational free energy [42, 43]. Free energy F(ϕ,o) is defined as:

F(\phi, o) = \mathrm{KL}\big[\,Q(x \mid \phi) \,\|\, P(x, o)\,\big] = \mathrm{KL}\big[\,Q(x \mid \phi) \,\|\, P(x \mid o)\,\big] - \ln P(o) \qquad (1)

where KL is the Kullback-Leibler divergence (KL-divergence) between two probability distributions, both of which are parameterized by the internal states of an agent. The first is the approximate posterior distribution, Q(x|ϕ), often referred to as the recognition distribution, which is a distribution over unknown or latent variables x with sufficient statistics ϕ. This distribution encodes an agent’s ‘beliefs’ about the unknown variables x. Here, the term ‘belief’ does not necessarily refer to beliefs in the cognitive sense but instead implies a probabilistic representation of unknown variables. The second distribution is the generative model, P(x, o), which is the joint distribution over unknown variables x and observations o. This distribution encodes an agent’s probabilistic model of its (internal and external) environment. We provide two additional re-arrangements of Eq 1 in Appendix 1.
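
As a concrete illustration of Eq 1 (not taken from the authors' implementation), the following sketch evaluates the free energy of a categorical approximate posterior against a small tabular generative model; the toy joint distribution and all variable names are assumptions for illustration.

```python
# Minimal sketch of Eq 1 for a discrete latent variable x and one observation o.
# The toy joint P(x, o) and all names are illustrative, not the paper's code.
import numpy as np

def free_energy(q_x, p_joint, o):
    """F(phi, o) = E_Q[ln Q(x|phi) - ln P(x, o)] for categorical distributions.

    q_x     : (n_x,) approximate posterior Q(x | phi)
    p_joint : (n_x, n_o) generative model P(x, o)
    o       : index of the observed outcome
    """
    eps = 1e-16  # guard against log(0)
    return float(np.sum(q_x * (np.log(q_x + eps) - np.log(p_joint[:, o] + eps))))

p_joint = np.array([[0.4, 0.1],
                    [0.2, 0.3]])                  # P(x, o)
o = 0
exact_posterior = p_joint[:, o] / p_joint[:, o].sum()

# At the exact posterior, F equals the surprisal -ln P(o) (Eq 1's second equality).
print(free_energy(exact_posterior, p_joint, o))   # ~0.511
print(-np.log(p_joint[:, o].sum()))               # ~0.511
```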

Minimizing free energy has two functional consequences. First, it minimizes the divergence between the approximate posterior distribution Q(x|ϕ) and the true posterior distribution P(x|o), thereby implementing a tractable form of approximate Bayesian inference known as variational Bayes [44]. On this view, perception can be understood as the process of maintaining and updating beliefs about hidden state variables s, where s ∈ S. The hidden state variables can either be a compressed representation of the potentially high-dimensional observations (e.g. representing an object), or they can represent quantities that are not directly observable (e.g. velocity). This casts perception as a process of approximate inference, connecting active inference to influential theories such as the Bayesian brain hypothesis [45, 46] and predictive coding [47]. Under active inference, learning can also be understood as a process of approximate inference [43]. This can be formalized by assuming that agents maintain and update beliefs over the parameters θ of their generative model, where θ ∈ Θ. Finally, action can be cast as a process of approximate inference by assuming that agents maintain and update beliefs over control states u, where u ∈ U, which prescribe actions a, where a ∈ A. The delineation of control states from actions helps highlight the fact that actions are something which occur ‘in the world’, whereas control states are unknown random variables that the agent must infer. Together, this implies that x = (s, θ, u). Approximate inference, encompassing perception, action, and learning, can then be achieved through the following scheme:

\phi^{*} = \underset{\phi}{\operatorname{arg\,min}} \; F(\phi, o) \qquad (2)

In other words, as new observations are sampled, the sufficient statistics ϕ are updated in order to minimize free energy (see the Methods section for the implementation used in the current simulations, or [48] for an alternative implementation based on the Laplace approximation). Once the optimal sufficient statistics ϕ* have been identified, the approximate posterior will become an approximation of the true posterior distribution Q(x|ϕ*)≈P(x|o), meaning that agents will encode approximately optimal beliefs over hidden states s, model parameters θ and control states u.

The second consequence of minimizing free energy is that it maximizes the Bayesian evidence for an agent’s generative model, or equivalently, minimizes ‘surprisal’ −ln P(o), which is the information-theoretic surprise of sampled observations (see Appendix 1). Active inference proposes that an agent’s goals, preferences and desires are encoded in the generative model as a prior preference for favourable observations (e.g. blood temperature at 37°C) [49]. In other words, it proposes that an agent’s generative model is biased towards favourable states of affairs. These prior preferences could be learned from experience, or alternatively, acquired through processes operating on evolutionary timescales. The process of actively minimizing free energy will, therefore, ensure that these favourable (i.e. probable) observations are preferentially sampled [50]. However, model evidence cannot be directly maximized through the inference scheme described by Eq 2, as the marginal probability of observations P(o) is independent of the sufficient statistics ϕ. Therefore, to maximize model evidence, agents must act in order to change their observations. This process can be achieved in a principled manner by selecting actions in order to minimize expected free energy, which is the free energy that is expected to occur from executing some (sequence of) actions [38, 51].

Expected free energy

To ensure that actions minimize (the path integral of) free energy, an agent’s generative model should specify that control states are a-priori more likely if they are expected to minimize free energy in the future, thus ensuring that the process of approximate inference assigns a higher posterior probability to the control states that are expected to minimize free energy [52]. The expected free energy for a candidate control state Gτ(ϕτ, ut) quantifies the free energy expected at some future time τ given the execution of some control state ut, where t is the current time point and:

\begin{aligned} G_{\tau}(\phi_{\tau}, u_t) &= \mathbb{E}_{Q(o_{\tau}, x_{\tau} \mid u_t, \phi_{\tau})}\big[\ln Q(x_{\tau} \mid u_t, \phi_{\tau}) - \ln P(o_{\tau}, x_{\tau} \mid u_t)\big] \\ &\approx \underbrace{\mathbb{E}_{Q(o_{\tau}, x_{\tau} \mid u_t, \phi_{\tau})}\big[\ln Q(x_{\tau} \mid u_t, \phi_{\tau}) - \ln Q(x_{\tau} \mid o_{\tau}, u_t, \phi_{\tau})\big]}_{\text{(Negative) epistemic value}} - \underbrace{\mathbb{E}_{Q(o_{\tau}, x_{\tau} \mid u_t, \phi_{\tau})}\big[\ln P(o_{\tau})\big]}_{\text{(Negative) instrumental value}} \end{aligned} \qquad (3)

We describe the formal relationship between free energy and expected free energy in Appendix 2. In order to evaluate expected free energy, agents must first evaluate the expected consequences of control, or formally, evaluate the predictive approximate posterior Q(oτ, xτ|ut, ϕτ). We refer readers to the Methods section for a description of this process.

The second (approximate) equality of Eq 3 demonstrates that expected free energy is composed of an instrumental (or extrinsic, pragmatic, goal-directed) component and an epistemic (or intrinsic, uncertainty-reducing, information-seeking) component. Note that under active inference, agents are mandated to minimize expected free energy, and as both the instrumental and epistemic terms are in a negative form in Eq 3, expected free energy will be minimized when instrumental and epistemic value are maximized. We provide a full derivation of the second equality in Appendix 3, but note here that the decomposition of expected free energy into instrumental and epistemic value affords an intuitive explanation. Namely, as free energy quantifies the divergence between an agent’s current beliefs and its model of the world, this divergence can be minimized via two methods: by changing beliefs such that they align with observations (associated with maximizing epistemic value), or by changing observations such that they align with beliefs (associated with maximizing instrumental value).

Formally, instrumental value quantifies the degree to which the predicted observations oτ—given by the predictive approximate posterior Q(oτ, xτ|ut, ϕτ)—are consistent with the agent’s prior beliefs P(oτ). In other words, this term will be maximized when an agent expects to sample observations that are consistent with its prior beliefs. As an agent’s generative model assigns a higher prior probability to favourable observations (i.e. goals and desires), maximizing instrumental value can be associated with promoting ‘goal-directed’ behaviours. This formalizes the notion that, under active inference, agents seek to maximize the evidence for their (biased) model of the world, rather than seeking to maximize reward as a separate construct (as in, e.g., reinforcement learning) [49].

Conversely, epistemic value quantifies the expected reduction in uncertainty in the beliefs over unknown variables x. Formally, it quantifies the expected information gain for the predictive approximate posterior Q(xτ|ut, ϕτ). By noting that x can be factorized into hidden states s and model parameters θ, we can rewrite positive epistemic value (i.e. the term to be maximized) as:

\begin{aligned} \underbrace{\mathbb{E}_{Q(o_{\tau}, s_{\tau}, \theta \mid u_t, \phi_{\tau})}\big[\ln Q(s_{\tau} \mid o_{\tau}, u_t, \phi_{\tau}) - \ln Q(s_{\tau} \mid u_t, \phi_{\tau})\big]}_{\text{State epistemic value}} + \underbrace{\mathbb{E}_{Q(o_{\tau}, s_{\tau}, \theta \mid u_t, \phi_{\tau})}\big[\ln Q(\theta \mid s_{\tau}, o_{\tau}, u_t, \phi_{\tau}) - \ln Q(\theta \mid \phi_{\tau})\big]}_{\text{Parameter epistemic value}} \end{aligned} \qquad (4)

We provide a full derivation of Eq 4 in Appendix 4 and discuss its relationship to several established formalisms. Here, we have decomposed epistemic value into state epistemic value, or salience, and parameter epistemic value, or novelty [53]. State epistemic value quantifies the degree to which the expected observations oτ reduce the uncertainty in an agent’s beliefs about the hidden states sτ. In contrast, parameter epistemic value quantifies the degree to which the expected observations oτ and expected hidden states sτ reduce the uncertainty in an agent’s beliefs about model parameters θ. Thus, by maintaining a distribution over model parameters, the uncertainty in an agent’s generative model can be quantified, allowing for ‘known unknowns’ to be identified and subsequently acted upon [40]. Maximizing parameter epistemic value, therefore, causes agents to sample novel agent-environment interactions, promoting the exploration of the environment in a principled manner.
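
To make Eqs 3 and 4 concrete in the discrete setting used here, the sketch below scores a single control state by its instrumental value plus its parameter epistemic value (novelty). It assumes an identity likelihood (predicted states map directly onto predicted observations), a Dirichlet belief over each column of the transition model, a one-step horizon, and a MAP estimate of the current state; the exact-KL form of the novelty term and all names are our assumptions rather than the authors' implementation.

```python
# Sketch of the (negative) expected free energy of one control state; illustrative,
# not the paper's code (identity likelihood, Dirichlet transition beliefs, one-step
# horizon, MAP current state are simplifying assumptions).
import numpy as np
from scipy.special import gammaln, digamma

def kl_dirichlet(beta, alpha):
    """KL[Dir(beta) || Dir(alpha)]."""
    return (gammaln(beta.sum()) - gammaln(alpha.sum())
            - np.sum(gammaln(beta)) + np.sum(gammaln(alpha))
            + np.sum((beta - alpha) * (digamma(beta) - digamma(beta.sum()))))

def neg_expected_free_energy(alpha_u, q_s, log_prior_o):
    """alpha_u: (n_s', n_s) Dirichlet counts for P(s'|s, u); q_s: belief over the
    current state; log_prior_o: ln P(o) over outcomes (the biased prior preferences)."""
    s = int(np.argmax(q_s))                           # assumed current state
    counts = alpha_u[:, s]
    p_next = counts / counts.sum()                    # predicted next state / observation

    # Instrumental value: expected log prior probability of predicted observations.
    instrumental = float(np.sum(p_next * log_prior_o))

    # Parameter epistemic value ("novelty"): expected KL between the posterior and
    # prior Dirichlet counts, averaged over the predicted next state.
    epistemic = 0.0
    for s_next in range(len(p_next)):
        updated = counts.copy()
        updated[s_next] += 1.0                        # counts after seeing s -> s_next
        epistemic += p_next[s_next] * kl_dirichlet(updated, counts)

    return instrumental + epistemic                   # = -G for this control state

# Example usage with flat (unexplored) counts for one control state.
q_s = np.array([1.0, 0.0])                            # believes it is in a positive gradient
log_prior_o = np.log([0.9, 0.1])                      # prefers positive gradients
print(neg_expected_free_energy(np.ones((2, 2)), q_s, log_prior_o))
```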

Summary

In summary, active inference proposes that agents learn and update a probabilistic model of their world, and act to maximize the evidence for this model. However, in contrast to previous ‘perception-oriented’ approaches to constructing probabilistic models [11], active inference requires an agent’s model to be intrinsically biased towards certain (favourable) observations. Therefore, the goal is not necessarily to construct a model that accurately captures the true causal structure underlying observations, but is instead to learn a model that is tailored to a specific set of prior preferences, and thus tailored to a specific set of agent-environment interactions. Moreover, by ensuring that actions maximize evidence for a (biased) model of the world, active inference prescribes a trade-off between instrumental and epistemic actions. Crucially, the fact that actions are selected based on both instrumental and epistemic value means that epistemic foraging will be contextualized by an agent’s prior preferences. Specifically, epistemic foraging will be biased towards parts of the environment that also provide instrumental value, as these parts will entail a lower expected free energy relative to those that provide no instrumental value. Moreover, the degree to which epistemic value determines the selection of actions will depend on instrumental value. Thus, when the instrumental value afforded by a set of actions is low, epistemic value will dominate action selection, whereas if actions afford a high degree of instrumental value, epistemic value will have less influence on action selection. Finally, as agents maintain beliefs about (and thus quantify the uncertainty of) the hidden state of the environment and the parameters of their generative model, epistemic value drives agents to actively reduce the uncertainty in both of these beliefs.

Simulation details

To test our hypothesis that acting to minimize expected free energy will lead to the learning of well-adapted action-oriented models, we empirically compare the types of model that are learned under four different action strategies. These are the (i) minimization of expected free energy, (ii) maximization of instrumental value, (iii) maximization of epistemic value, and (iv) random action selection, where the minimization of expected free energy (i) corresponds to a combination of the instrumental (ii) and epistemic (iii) strategies. For each strategy, we assess model performance after a range of model learning durations. We assess model performance across several criteria, including whether or not the models can prescribe well-adapted behaviour, the complexity and accuracy of the learned models, whether the models are tailored to a behavioural niche, and whether or not the models become entrenched in maladaptive cycles of learning and control (‘bad-bootstraps’).

We implement a simple agent-based model of bacterial chemotaxis that infers and learns based on the active inference scheme described above. Specifically, our model implements the ‘adaptive gradient climbing’ behaviour of E. coli. Note that we do not propose our model as a biologically realistic account of chemotaxis. Instead, we use chemotaxis as a relatively simple behaviour that permits a thorough analysis of the learned models. However, the active inference scheme described in this paper has a degree of biological plausibility [54], and there is some evidence to suggest that bacteria engage in model-based behaviours [55–58]. This gradient-climbing behaviour depends on the chemical gradient at the bacteria’s current orientation. In positive chemical gradients, bacteria ‘run’ forward in the direction of their current orientation. In negative chemical gradients, bacteria ‘tumble’, resulting in a new orientation being sampled. This behaviour, therefore, implements a rudimentary biased random-walk towards higher concentrations of chemicals. To simulate the adaptive gradient climbing behaviour of E. coli, we utilize the partially observable Markov decision process (POMDP) framework [59]. This framework implies that agents do not have direct access to the true state of the environment, that the state of the environment only depends on the previous state and the agent’s previous action, and that all variables and time are discrete. Note that while agents operate on discrete representations of the environment, the true states of the environment (i.e. the agent’s position, the location of the chemical source, and the chemical concentrations) are continuous.

At each time step t, agents receive one of two observations, either a positive chemical gradient opos or a negative chemical gradient oneg. The chemical gradient is computed as a function of space (whether the agent is facing towards the chemical source) rather than time (whether the agent is moving towards the chemical source) [60], and thus only depends on the agent’s current position and orientation, and the position of the chemical source. After receiving an observation, agents update their beliefs in order to minimize free energy. In the current simulations, agents maintain and update beliefs over three variables. The first is the hidden state variable s, which represents the agent’s belief about the local chemical gradient, and which has a domain of {spos, sneg}, representing positive and negative chemical gradients, respectively. The second belief is over the parameters θ of the agent’s generative model, which describe the probability of transitions in the environment, given action. The final belief is over the control variable u, which has the domain of {urun, utumble}, representing running and tumbling respectively. Agents are also endowed with the prior belief that observing positive chemical gradients opos is a-priori more likely, such that the evidence for an agent’s model is maximized (and free energy minimized) when sampling positive chemical gradients.
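
For concreteness, the discrete quantities just described can be sketched as follows; the array layout and the strength of the bias towards positive gradients (0.9 vs 0.1) are assumptions for illustration rather than the values used in the paper.

```python
# Illustrative set-up of the agent's discrete beliefs and biased prior; the array
# layout and the 0.9/0.1 preference strength are assumptions, not the paper's values.
import numpy as np

S_POS, S_NEG = 0, 1              # hidden states: positive / negative local gradient
U_RUN, U_TUMBLE = 0, 1           # control states: run / tumble

# Dirichlet counts alpha[u, s_next, s_prev] over the transition model P(s'|s, u);
# flat counts encode maximal initial uncertainty about the effects of each action.
alpha = np.ones((2, 2, 2))

# Biased generative model: positive-gradient observations are a-priori more likely,
# so sampling them maximizes model evidence (these are the prior preferences).
log_prior_o = np.log(np.array([0.9, 0.1]))

# Belief over the current hidden state, initially uninformative.
q_s = np.array([0.5, 0.5])
```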

Once beliefs have been updated, agents execute one of two actions, either run arun or tumble atumble, depending on which of the corresponding control states was inferred to be more likely. Running causes the agent to move forward one unit in the direction of their current orientation, whereas tumbling causes the agent to sample a new orientation at random. The environment is then updated and a new time step begins. We refer the reader to the Methods section for a full description of the agent’s generative model, approximate posterior, and the corresponding update equations for inference, learning and action.
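
The full agent-environment loop can be sketched as below; the radially decaying chemical field, the unit step length, and the placeholder action rule (run on positive gradients, tumble on negative ones) are illustrative assumptions standing in for the free-energy-minimizing inference described above.

```python
# Illustrative run/tumble environment and interaction loop; the chemical field,
# step size, and placeholder action rule are assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
source = np.array([0.0, 0.0])

def concentration(pos):
    return -np.linalg.norm(pos - source)              # increases towards the source

def observe(pos, heading):
    """Spatial gradient along the current heading: 0 = positive, 1 = negative."""
    ahead = pos + np.array([np.cos(heading), np.sin(heading)])
    return 0 if concentration(ahead) > concentration(pos) else 1

def step(pos, heading, action):
    if action == 0:                                   # run: one unit forward
        pos = pos + np.array([np.cos(heading), np.sin(heading)])
    else:                                             # tumble: resample orientation
        heading = rng.uniform(0.0, 2.0 * np.pi)
    return pos, heading

pos, heading = np.array([400.0, 0.0]), rng.uniform(0.0, 2.0 * np.pi)
for t in range(1000):
    o = observe(pos, heading)
    # ... update beliefs by minimizing free energy and infer the control state ...
    action = 0 if o == 0 else 1                       # placeholder: run up, tumble down
    pos, heading = step(pos, heading, action)
print(np.linalg.norm(pos - source))                   # final distance from the source
```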

Agents

All of the action strategies we compare infer posterior beliefs over hidden states, model parameters and control states via the minimization of free energy. However, they differ in how they assign prior (and thus posterior) probability to control states. The first strategy we consider is based on the minimization of expected free energy, which entails the following prior over control states:

P_{\mathrm{EFE}}(u_t) = \sigma\Big( \mathbb{E}_{Q(o_{\tau}, s_{\tau}, \theta \mid u_t, \phi_{\tau})}\big[\ln Q(\theta \mid s_{\tau}, o_{\tau}, u_t, \phi_{\tau}) - \ln Q(\theta \mid \phi_{\tau})\big] + \mathbb{E}_{Q(o_{\tau}, s_{\tau}, \theta \mid u_t, \phi_{\tau})}\big[\ln P(o_{\tau})\big] \Big) \qquad (5)

where σ(⋅) is the softmax function, which ensures that PEFE(ut) is a valid distribution. The first term corresponds to parameter epistemic value, or ‘novelty’, and quantifies the amount of information the agent expects to gain about its (beliefs about its) model parameters θ. The second term corresponds to instrumental value and quantifies the degree to which the expected observations conform to prior beliefs. Therefore, the expected free energy agent selects actions that are expected to result in probable (‘favourable’) observations, and that are expected to disclose maximal information about the consequences of action. Note that in the following simulations, agents have no uncertainty in their likelihood distribution, which describes the relationship between the hidden state variables s and the observations o (see Methods). As such, the expected free energy agent does not assign probability to control states based on state epistemic value. Formally, when there is no uncertainty in the likelihood distribution, state epistemic value reduces to the entropy of the predictive approximate posterior over s (see [38]). For simplicity, we have omitted this term from the current simulations.

The second strategy is the instrumental, or ‘goal-directed’, strategy, which utilizes the following prior over control states:

P_{\mathrm{Instrumental}}(u_t) = \sigma\Big( \mathbb{E}_{Q(o_{\tau}, s_{\tau}, \theta \mid u_t, \phi_{\tau})}\big[\ln P(o_{\tau})\big] \Big) \qquad (6)

The instrumental agent, therefore, selects actions that are expected to give rise to favourable observations. The third strategy is the epistemic, or ‘information-seeking’, strategy, which is governed by the following prior over control states:

P_{\mathrm{Epistemic}}(u_t) = \sigma\Big( \mathbb{E}_{Q(o_{\tau}, s_{\tau}, \theta \mid u_t, \phi_{\tau})}\big[\ln Q(\theta \mid s_{\tau}, o_{\tau}, u_t, \phi_{\tau}) - \ln Q(\theta \mid \phi_{\tau})\big] \Big) \qquad (7)

The epistemic agent selects actions that are expected to disclose maximal information about model parameters. The final strategy is the random strategy, which assigns prior probability to actions at random. These models were chosen to explore the relative contributions of instrumental and epistemic value to model learning, and crucially, to understand their combined influence. We predict that, when acting to minimize expected free energy, agents will engage in a form of goal-directed exploration that is biased by their prior preferences, leading to adaptive action-oriented models. In contrast, we expect that (i) the instrumental agent will occasionally become entrenched in bad-bootstraps, due to the lack of exploration, and (ii) the epistemic agent will explore portions of state space irrelevant to behaviour, leading to slower learning. An overview of the model can be found in Fig 2 and implementation details for all four strategies are provided in the Methods section.
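
The four strategies therefore differ only in which terms enter the softmax over control states, as illustrated in the sketch below (the per-control-state values are assumed to have been computed elsewhere, and the sampling scheme used for the random strategy is our assumption).

```python
# Sketch of the four priors over control states (Eqs 5-7 plus a random baseline),
# given per-control-state instrumental and epistemic values computed elsewhere.
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def control_prior(strategy, instrumental, epistemic, rng=np.random.default_rng()):
    if strategy == "expected_free_energy":            # Eq 5: instrumental + novelty
        return softmax(instrumental + epistemic)
    if strategy == "instrumental":                    # Eq 6: goal-directed only
        return softmax(instrumental)
    if strategy == "epistemic":                       # Eq 7: information-seeking only
        return softmax(epistemic)
    return rng.dirichlet(np.ones(len(instrumental)))  # random: arbitrary valid prior

# Example: the same values yield different action priors under each strategy.
instrumental, epistemic = np.array([-0.2, -1.2]), np.array([0.1, 0.6])
for s in ["expected_free_energy", "instrumental", "epistemic", "random"]:
    print(s, control_prior(s, instrumental, epistemic))
```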

Fig 2. Simulation & model details.


(A) Agent overview. Agents act in an environment which is described by states ψ, which are unknown to the agent but generate observations o. The agent maintains beliefs about the state of the environment s, however, s and ψ need not be homologous. Agents also maintain beliefs about control states u, which in turn prescribe actions a. Finally, the agent maintains beliefs over model parameters θ, which describe the probability of transitions in s under different control states u. (B) Actions. At each time step, agents can either run, which moves them forward one unit in the direction of their current orientation, or tumble, which causes a new orientation to be sampled at random. (C) Approximate posterior. The factorization of the approximate posterior, and the definition of each factor. In this figure, x denotes the variables that an agent infers and ϕ denotes the parameters of the approximate posterior. We refer readers to the Methods section for a full description of these distributions. (D) Generative model. The factorization of the generative model and the definition of each factor. Here, λ denotes the parameters of the likelihood distribution and α denotes the parameters of the prior distribution over parameters. We again refer readers to the Methods section for full descriptions of these distributions. (E) Free energy minimization. The general scheme for free energy minimization under the mean-field assumption. We refer readers to the Methods section for further details. (F) Control state inference. The update equation for control state inference, where Q˜=Q(oτ,sτ,θ|ut). This equation highlights the difference between the three action strategies considered in the following simulations.

Model performance

We first assess whether the learned models can successfully generate chemotactic behaviour. We quantify this by measuring an agent’s distance from the source after an additional (i.e., post-learning) testing phase. Each testing phase begins by placing an agent at a random location and orientation 400 units from the chemical source. The agent is then left to act in the environment for 1000 time steps, utilizing the model that was learned during the preceding learning phase. No additional learning takes place during the testing phase. As the epistemic and random action strategies do not assign any instrumental (goal-oriented) value to actions, there is no tendency for them to navigate towards the chemical source. Therefore, to ensure a fair comparison between action strategies, all agents select actions based on the minimization of expected free energy during the testing phase. This allows us to assess whether the epistemic and random strategies can learn models that can support chemotactic behaviour, and ensures that any observed differences are determined solely by attributes of the learned models.

Fig 3a shows the final distance from the source at the end of the testing phase, plotted against the duration of the preceding learning phase, and averaged over 300 learned models for each action strategy and learning duration. The final distance of the expected free energy, epistemic and random strategies decreases with the amount of time spent learning, meaning that these action strategies were able to learn models which support chemotactic behaviour. However, the instrumental strategy shows little improvement over baseline performance, irrespective of the amount of time spent learning. Note that the first learning period consists of zero learning steps, meaning that the corresponding distance gives the (averaged) baseline performance for a randomly initialized model. This is less than the initial distance (400 units) as some of the randomly initialized models can support chemotaxis without any learning. The final distance from the source for the expected free energy, epistemic and random agents is not zero due to the nature of the adaptive hill-climbing chemotaxis strategy, which causes agents not to settle directly on the source, but instead navigate around its local vicinity. Models learned by the expected free energy strategy consistently finish close to the chemical source, and learn chemotactic behaviour after fewer learning steps relative to the other strategies.

Fig 3.


(A) Chemotactic performance. The average final distance from the chemical source after an additional testing phase, in which agents utilized the models learned in the corresponding learning phase. The average distance is plotted against the number of steps in the corresponding learning phase and is averaged over 300 models for each strategy and learning duration. Note that the x-axis denotes the number of time steps in the learning phase, rather than the number of time steps in the subsequent testing phase. Filled regions show ±SEM. (B) Example trajectories. The spatial trajectories of agents who successfully navigated up the chemical gradient towards the chemical source.

Model accuracy

We now move on to consider whether learning in the presence of goal-oriented behaviour leads to models that are tailored to a behavioural niche. First, we assess how each action strategy affects the overall accuracy of the learned models. To test this, we measure the KL-divergence between the learned models and a ‘true’ model of agent-environment dynamics. Here, a ‘true’ model describes a model that has the same variables, structure and fixed parameters, but which has had infinite training data over all possible action-state contingencies. Due to the fact that the true generative process does not admit the notion of a prior, we measure the accuracy of the expectation of the approximate posterior distribution over parameters θ, i.e. E[Q(θ|ϕα)]. Fig 4a shows the average accuracy of the learned models for each action strategy, plotted against the amount of time spent learning. These results demonstrate that the epistemic and random strategies consistently learn the most accurate models while the instrumental strategy consistently learns the least accurate models. However, the expected free energy strategy learns a model that is significantly less accurate than both the epistemic and random strategies, indicating that the most well-adapted models are not necessarily the most accurate.
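
The accuracy measure can be illustrated as a sum of KL divergences between a 'true' transition model and the expectation of the learned Dirichlet model, as in the sketch below; the direction of the divergence, the array layout, and the example 'true' transition probabilities are our assumptions.

```python
# Sketch of the model-accuracy measure: KL divergence between a 'true' transition
# model and E[Q(theta)], summed over actions and previous states. The direction of
# the KL and the array layout (u, s_next, s_prev) are assumptions.
import numpy as np

def negative_model_accuracy(alpha, true_trans, eps=1e-16):
    """alpha: (n_u, n_s, n_s) Dirichlet counts; true_trans: matching 'true' P(s'|s, u)."""
    learned = alpha / alpha.sum(axis=1, keepdims=True)    # expected transition probabilities
    return float(np.sum(true_trans * (np.log(true_trans + eps) - np.log(learned + eps))))

# Example: a model trained mostly on 'run in positive gradients' is accurate for that
# column but remains close to its flat prior elsewhere (illustrative numbers only).
true_trans = np.array([[[0.95, 0.30], [0.05, 0.70]],      # run
                       [[0.50, 0.50], [0.50, 0.50]]])     # tumble
alpha = np.ones((2, 2, 2))
alpha[0, :, 0] += np.array([190.0, 10.0])                 # many positive->positive runs observed
print(negative_model_accuracy(alpha, true_trans))
```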

Fig 4. Model accuracy.


(A) Model accuracy. The average negative model accuracy, measured as the KL-divergence from a ‘true’ model of agent-environment dynamics. The accuracy is plotted against the number of steps in the corresponding learning phase and is averaged over 300 models for each strategy. Filled regions show ±SEM. (B) Distributions of state transitions. The distribution of action-dependent state transitions for each strategy over 1000 learning steps, averaged over 300 models for each strategy. Here, columns indicate the state at the previous time step, whereas rows indicate the state following the transition. The top matrices display transitions that follow from tumbling, whereas the bottom matrices display transitions that follow from running. The numbers indicate the percentage of time that the corresponding state transition was encountered. For instance, the top left box denotes the percentage of time the agent experienced negative to negative state transitions following a tumbling action. Note that the distribution of transitions encountered by the epistemic and random strategies corresponds, within a small margin of error, to the distribution of transitions encountered by a ‘true’ model, i.e. a model that has been learned from infinite transitions with no behavioural biases. (C) Change in distributions. The average change in each of the distributions of the full learned model, measured as the KL-divergence between the original (randomly-initialized) distributions and the final (post-learning) distribution. Refer to the Methods section for a description of these distributions. (D & E) Reversed preferences. These results are the same as for panels B & C, but for the case where agents have reversed preferences (i.e. priors). Here, agents believe running down chemical gradients to be more likely. The results demonstrate that the models of the expected free energy and instrumental agents are sensitive to prior preferences. (F) Active/passive prediction error. The cumulative mean squared error of counterfactual predictions about state transitions, over 1000 learning steps and averaged over 300 agents. The active condition describes predictions of state-transitions following self-determined actions, whereas the passive condition describes predictions following random actions.

Fig 4a additionally suggests that the epistemic and random strategies learn equally accurate models. This result may appear surprising, as the epistemic strategy actively seeks out transitions that are expected to improve model accuracy. However, given the limited number of possible state transitions in the current simulation, it is plausible that a random strategy offers a near-optimal solution to exploration. To confirm this, we evaluated the accuracy of models learned by the epistemic and random strategies in high-dimensional state space. The results of this experiment are given in Appendix 6, where it can be seen that the epistemic strategy does indeed learn models that are considerably more accurate than the random strategy.

We hypothesized that the expected free energy and instrumental strategies learned less accurate models because they were acting in a goal-oriented manner while learning. This, in turn, may have caused these strategies to selectively sample particular (behaviourally-relevant) transitions, at the cost of sampling other (behaviourally-irrelevant) transitions less frequently. To confirm this, we measured the distribution of state transitions sampled by each of the strategies after 1000 time steps of learning, averaged over 300 agents. Because agents learn an action-conditioned representation of state transitions, i.e. P(st|st−1, ut−1, θ), we separate state transitions that follow agents running from those that follow agents tumbling. Here, the notion of a state transition refers to a change in the state of the environment as a function of time, i.e. a positive to negative state transition implies that the agent was in a positive chemical gradient at time t and a negative chemical gradient at t + 1. These results are shown in Fig 4b. For the epistemic and random strategies, the distribution is uniformly spread over (realizable) state transitions (running-induced transitions from positive to negative and negative to positive gradients are rare for all strategies, as such transitions can only occur in small portions of the environment). In contrast, the distributions sampled by the expected free energy and instrumental strategies are heavily biased towards running-induced transitions from positive to positive gradients. This is the transition that occurs when an agent is ‘running up the chemical gradient’, i.e., performing chemotaxis. The bias means that the remaining transitions between states are sampled less frequently, relative to the epistemic and random strategies.

How do the learned models differ among the four action strategies? To address this question, we measured the post-learning change in different distributions of the full model. This change reflects a measure of ‘how much’ an agent has learned about that particular distribution. As described in the Methods, the full transition model P(st|st−1, ut−1, θ) is composed of four separate categorical distributions. The first describes the effects of tumbling in negative gradients, the second describes the effects of tumbling in positive gradients, the third describes the effects of running in negative gradients, and the fourth describes the effects of running in positive gradients. Fig 4c plots the KL-divergence between each of the original (randomly-initialized) distributions and the subsequent (post-learning) distributions. These results show that the expected free energy and instrumental strategies learn substantially less about three of the distributions, compared to the epistemic and random agents, explaining the overall reduction of accuracy displayed in Fig 4a. However, for the distribution describing the effects of running in positive gradients, the instrumental strategy learns as much as the epistemic and random strategies, while the expected free energy strategy learns substantially more. These results, therefore, demonstrate that acting in a goal-oriented manner biases an agent to preferentially sample particular (goal-relevant) transitions in the environment and that this, in turn, causes agents to learn more about these (goal-relevant) transitions.

To further verify this result, we repeated the analysis described in Fig 4b and 4c, but for the case where agents learn in the presence of reversed prior preferences (i.e. the agents believe that observing negative chemical gradients is a-priori more likely, and thus preferable). The results for these simulations are shown in Fig 4d and 4e, where it can be seen that the expected free energy and instrumental strategies now preferentially sample running-induced transitions from negative to negative gradients, and learn more about the distribution describing the effects of running in negative gradients. This is the distribution relevant to navigating down the chemical gradient, a result that is expected if the learned models are biased towards prior preferences. By contrast, the models learned by the epistemic and random agents are not dependent on their prior beliefs or preferences.

Active and passive accuracy

The previous results suggest that learning in the presence of goal-directed behaviour leads to models that are biased towards certain patterns of agent-environment interaction. To further elucidate this point, we distinguish between active accuracy and passive accuracy. We define active accuracy as the accuracy of a model in the presence of the agent’s own self-determined actions (i.e. the actions chosen according to the agent’s strategy), and passive accuracy as the accuracy of a model in the presence of random actions. We measured both the passive and active accuracy of the models learned under different action strategies following 300 time steps of learning. To do this, we let agents act in their environment for an additional 1000 time steps according to their action strategy, and, at each time step, measured the accuracy of their counterfactual predictions about state transitions. In the active condition, agents predicted the consequence of a self-determined action, whereas, in the passive condition, agents predicted the consequence of a randomly selected action. We then measured the mean squared error between the agents’ predictions and the ‘true’ predictions (i.e. the predictions given by the ‘true’ model, as described for Fig 4a). The accumulated prediction errors for the passive and active conditions are shown in Fig 4f, averaged over 300 learned models for each strategy. As expected, there is no difference between the passive and active condition for the random strategy, as this strategy selects actions at random. The epistemic strategy shows the highest active error, which is due to the fact that the epistemic strategy seeks out novel (and thus less predictable) transitions. The instrumental strategy has the lowest active prediction error, and therefore the highest active accuracy. This is consistent with the view that learning in the presence of goal-directed behaviour allows agents to learn models that are accurate in the presence of their self-determined behaviour. Finally, the expected free energy strategy has an active error that is lower than the epistemic and random strategies, but higher than the instrumental strategy. This arises from the fact that the expected free energy strategy balances both goal-directed and epistemic actions. Note that, in the current context, active accuracy is improved at the cost of passive accuracy. While the instrumental strategy learns the least accurate model, it is the most accurate at predicting the consequences of its self-determined actions.
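
A sketch of this comparison is given below; the mean-squared-error over predicted next-state distributions and the array layout are assumptions, not the authors' implementation.

```python
# Sketch of the active/passive accuracy measure; the MSE over predicted next-state
# distributions and the array layout (u, s_next, s_prev) are assumptions.
import numpy as np

def counterfactual_error(alpha, true_trans, s, u):
    """Squared error between the learned and 'true' prediction for (state s, action u)."""
    learned = alpha[u, :, s] / alpha[u, :, s].sum()
    return float(np.mean((learned - true_trans[u, :, s]) ** 2))

# Active:  accumulate counterfactual_error(alpha, true_trans, s, chosen_action)
# Passive: accumulate counterfactual_error(alpha, true_trans, s, randomly_drawn_action)
```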

Pruning parameters

We now consider whether learning in the presence of goal-directed behaviour leads to simpler models of agent-environment dynamics. A principled way to approach this question is to ask whether each of the model’s parameters increases or decreases the Bayesian evidence for the overall model, which provides a measure of both the accuracy and the complexity of a model. In brief, if a parameter decreases model evidence, then removing—or ‘pruning’—that parameter results in a model with higher evidence. This procedure can, therefore, provide a measure of how many ‘redundant’ parameters a model has, which, in turn, provides a measure of the complexity of a model (assuming that redundant parameters can, and should, be removed). We utilise the method of Bayesian model reduction [61] to evaluate the evidence for models with removed parameters. This procedure allows us to evaluate the evidence for reduced models without having to refit the model’s parameters.

We first let each of the strategies learn a model for 500 time-steps. The parameters optimized during this learning period are then treated as priors for an additional (i.e., post-learning) testing phase. During this testing phase, agents act according to their respective strategies for an additional 500 time-steps, resulting in posterior estimates of the parameters.

Given the prior parameters α and posterior parameters ϕα, we can evaluate an approximation for the change in model evidence under a reduced model through the equation:

\Delta F = \ln B(\phi_{\alpha}) + \ln B(\alpha') - \ln B(\alpha) - \ln B(\phi_{\alpha} + \alpha' - \alpha) \qquad (8)

where B(⋅) is the multivariate beta function, α′ are the prior parameters of the reduced model, and F is the variational free energy, which provides a tractable approximation of the Bayesian model evidence. See [40] for a derivation of Eq 8. If ΔF is positive, then the reduced model—described by the reduced priors α′—has less evidence than the full model, and vice versa. We remove each of the prior parameters individually by setting their value to zero and evaluate Eq 8. Fig 5a shows the percentage of trials that each parameter was pruned for each of the action strategies, averaged over 300 trials for each strategy. For the instrumental and epistemic agents, the parameters describing the effects of running in negative gradients and tumbling in positive gradients are most often pruned, as these are the parameters that are irrelevant to chemotaxis (which involves running in positive chemical gradients and tumbling in negative chemical gradients). In Fig 5b we plot the total number of parameters pruned, averaged over 300 agents. These results demonstrate that the expected free energy strategy entails models that have the highest number of redundant parameters, followed by the instrumental strategy. Under the assumption that redundant parameters can, and should, be pruned, the expected free energy and instrumental strategies learn simpler models, compared to the epistemic and random strategies. These results additionally suggest that pruning parameters will prove to be more beneficial (in terms of model complexity) for action-oriented models.
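
Eq 8 can be evaluated directly with the log multivariate beta function, as in the following sketch; here pruning is approximated numerically by shrinking a prior count towards zero (an exact zero is undefined for the beta function), and the example counts are purely illustrative.

```python
# Sketch of Eq 8 (Bayesian model reduction) for Dirichlet parameters. Pruning is
# approximated by shrinking a prior count towards zero; example counts are illustrative.
import numpy as np
from scipy.special import gammaln

def log_beta(a):
    a = np.asarray(a, dtype=float)
    return float(np.sum(gammaln(a)) - gammaln(np.sum(a)))

def delta_F(phi_alpha, alpha, alpha_reduced):
    """Positive: the reduced model has less evidence; negative: pruning increases evidence."""
    return (log_beta(phi_alpha) + log_beta(alpha_reduced)
            - log_beta(alpha) - log_beta(phi_alpha + alpha_reduced - alpha))

alpha = np.array([1.0, 1.0])                  # flat prior counts (full model)
eps = 1e-8                                    # near-zero stand-in for a removed prior

# A parameter that accumulated evidence during the testing phase should not be pruned.
print(delta_F(np.array([9.0, 1.0]), alpha, np.array([eps, 1.0])))   # > 0

# A parameter that gathered no evidence is redundant, and pruning it increases evidence.
print(delta_F(np.array([1.0, 9.0]), alpha, np.array([eps, 1.0])))   # < 0
```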

Fig 5. Model complexity.


(A) Number of pruned parameters. Percentage of times each parameter was pruned, averaged over 300 agents. A parameter was pruned if it decreased the evidence for the agent’s model. (B) Total pruned parameters. The average total number of pruned parameters, averaged over 300 agents.

Bad bootstraps and sub-optimal convergence

In the Introduction, we hypothesized that ‘bad-bootstraps’ occur when agents (and their models) become stuck in maladaptive cycles of learning and control, resulting in an eventual failure to learn well-adapted models. To test for the presence of bad-bootstraps, we allowed agents to learn models over an extended period of 4,000 time steps. We allowed this additional time to exclude the possibility that opportunities to learn had not been fully exploited by agents. (We additionally conducted the same experiment with 10,000 time steps; results were unchanged). We then tested the learned models on their ability to support chemotaxis, by allowing them to interact with their environment for an additional 1,000 time steps using the expected free energy action strategy. To quantify whether the learned models were able to perform chemotaxis in any form, we measured whether the agent had moved more than 50 units towards the source by the end of the testing period.

After 4,000 learning steps, all the agents that had learned models using the expected free energy, epistemic or random strategies were able to perform at least some chemotaxis. In contrast, 36% of the agents that had learned models under maximization of instrumental value did not engage in any chemotaxis at all. To better understand why instrumental agents frequently failed to learn well-adapted models, even after significant learning, we provide an analysis of a randomly selected failed model. This model prescribes a behavioural profile whereby agents continually tumble, even in positive chemical gradients. This arises from the belief that tumbling is more likely to give rise to positive gradients, even when the agent is in positive gradients. In other words, the model encodes the erroneous belief that, in positive gradients, running will be less likely to give rise to positive chemical gradients, relative to tumbling. Given this belief, the agent continually tumbles, and therefore never samples information that disconfirms this maladaptive belief. This exemplifies a ‘bad bootstrap’ arising from the goal-directed nature of the agent’s action strategy.

Finally, we explore how assigning epistemic value to actions can help overcome bad bootstraps. We analyse an agent which acts to minimize expected free energy, quantifying the relative contributions of epistemic and instrumental value to running and tumbling. We initialize an agent with a randomly selected maladapted model and allow the agent to interact with (and learn from) the environment according to the expected free energy action strategy (i.e. using the expected free energy agent). In Fig 6a, we plot the (negative) expected free energy of the running and tumbling control states over time, along with the relative contributions of instrumental and epistemic value. These results show that the (negative) expected free energy for the tumble control state is initially higher than that of the running control state because the agent believes there is less instrumental value in running. This causes the agent to tumble, which in turn causes the agent to gather information about the effects of tumbling. Consequently, the model becomes less uncertain about the expected effects of tumbling, thereby decreasing the epistemic value of tumbling (and thus the (negative) expected free energy of tumbling). This continues until the negative expected free energy of tumbling becomes less than that of running, which has remained constant (since the agent has not yet gained any new information about running). At this point, the agent infers running to be the more likely action, which causes the agent to run. The epistemic value of running now starts to decrease, but as it does so the new sampled observations disclose information that running is very likely to cause transitions from positive to positive gradients (i.e., to maintain positive gradients). The instrumental value of running (and thus the negative expected free energy of running) therefore sharply increases in positive gradients, causing the agent to continue to run in positive gradients. Note that this agent did not fully resolve its uncertainty about tumbling. This highlights the fact that, under active inference, the epistemic value of an action is contextualized by current instrumental imperatives.

Fig 6. Overcoming bad-bootstraps.


(A) Expected free energy. A plot of expected free energy for run and tumble control states over time for an agent with an initially maladapted model. This model encodes the erroneous belief that running is less likely to give rise to positive chemical gradients, relative to tumbling. Therefore, at the start of the trial, the instrumental value of tumbling (green dotted line) is higher than the instrumental value of running (purple dotted line). The epistemic value of both running and tumbling (brown and red dotted lines, respectively) is initially the same. As the (negative) expected free energy for tumbling (orange line) is higher than the (negative) expected free energy for running (blue line), the agent tumbles for the first 900 time steps. During this time, agents gain information about the effects of tumbling, and the epistemic value of tumbling decreases, causing the negative expected free energy for tumbling to also decrease. This continues until the negative expected free energy for tumbling is lower than the negative expected free energy for running, which has remained constant. Agents then run and gather information about the effects of running. This causes the epistemic value of running to decrease, but also causes the instrumental value of running to sharply increase, as the new information disconfirms their erroneous belief that running will not give rise to positive gradients.

Discussion

Equipping agents with generative models provides a powerful solution to prescribing well-adapted behaviour in structured environments. However, these models must, at least in part, be learned. For behaving agents—i.e., biological agents—the learning of generative models necessarily takes place in the presence of actions; i.e., in an ‘online’ fashion, during ongoing behaviour. Such models must also be geared towards prescribing actions that are useful for the agent. How to learn such ‘action-oriented’ models poses significant challenges for both computational biology and model-based reinforcement learning (RL).

In this paper, we have demonstrated that the active inference framework provides a principled and pragmatic approach to learning adaptive action-oriented models. Under this approach, the minimization of expected free energy prescribes an intrinsic and context-sensitive balance between goal-directed (instrumental) and information-seeking (epistemic) behaviours, thereby shaping the learning of the underlying generative models. After developing the formal framework, we illustrated its utility using a simple agent-based model of bacterial chemotaxis. We compared three situations. When agents learned solely in the presence of goal-directed actions, the learned models were specialized to the agent’s behavioural niche but were prone to converging to sub-optimal solutions, due to the instantiation of ‘bad-bootstraps’. Conversely, when agents learned solely in the presence of epistemic (information-seeking) actions, they learned accurate models which avoided sub-optimal convergence, but at the cost of reduced sample efficiency due to the lack of behavioural specialisation.

Finally, we showed that the minimisation of expected free energy effectively balanced goal-directed and information-seeking actions, and that the models learned in the presence of these actions were tailored to the agent’s behaviours and goals, and were also robust to bad-bootstraps. Learning took place efficiently, requiring fewer interactions with the environment. The learned models were also less complex, relative to other strategies. Importantly, models learned via active inference departed in systematic ways from a veridical representation of the environment’s true structure. For these agents, the learned models supported adaptive behaviour not only in spite of, but because of, their departure from veridicality.

Learning action-oriented models: Good and bad bootstraps

When learning generative models online in the presence of actions, there is a circular dynamic in which learning is coupled to behaviour. The (partially) learned models are used to specify actions, and these actions provide new data which is then used to update the model. This circular dynamic (or ‘information self-structuring’ [20]) raises the potential for both ‘good’ and ‘bad’ bootstraps.

If actions are selected based purely on (expected) instrumental value, then the resulting learned models will be biased towards an agent’s behavioural profile and goals (or prior preferences under the active inference framework—see Fig 4c & 4e), but will also be strongly constrained by the model’s initial conditions. In our simulations, we showed that learning from instrumental actions was prone to the instantiation of ‘bad-bootstraps’. Specifically, we demonstrated that these agents typically learned an initially maladapted model due to insufficient data or sub-optimal initialisation, and then subsequently used this model to determine goal-directed actions. This resulted in agents engaging with the environment in a sub-optimal and biased manner, thereby reintroducing sub-optimal data and causing models to become entrenched within local minima. Recent work in model-based RL has identified this coupling to be one of the major obstacles facing current model-based RL algorithms [62]. More generally, it is likely that bad-bootstraps are a prevalent phenomenon whenever parameters are used to determine the data from which the parameters are learned. Indeed, this problem played a significant role in motivating the (now common) use of ‘experience replay’ in model-free RL [63]. Experience replay describes the method of storing past experiences to be later sampled from for learning, thus breaking the tight coupling between learning and behaviour.
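Experience replay is not used in the present model but, for readers unfamiliar with the idea, a minimal Python sketch (our own illustration, with transitions stored as simple tuples; not taken from any particular RL library) is:

    import random
    from collections import deque

    class ReplayBuffer:
        """Store past transitions and sample them uniformly at random,
        decoupling the data used for learning from the agent's current behaviour."""

        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded first

        def push(self, state, action, next_state):
            self.buffer.append((state, action, next_state))

        def sample(self, batch_size):
            return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))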

In the context of online learning, one way to avoid bad-bootstraps is to select actions based on (expected) epistemic value [37, 40, 53], where agents seek out novel interactions based on counterfactually informed beliefs about which actions will lead to informative transitions. By utilising the uncertainty encoded by (beliefs about) model parameters, this approach can proactively identify optimally informative transitions. In our simulations, we showed that agents using this strategy learned models that asymptoted towards veridicality and, as such, were not tuned to any specific behavioural niche. This occurred because pure epistemic exploration treats all uncertainties as equally important, meaning that agents were driven to resolve uncertainty about all possible agent-environment contingencies. While models learned using this strategy were able to support chemotactic behaviour (Fig 3a), learning was highly sample-inefficient.

We have argued that a more suitable approach is to balance instrumental and epistemic actions in a principled way during learning. This is what is achieved by the active inference framework, via minimization of expected free energy. Minimizing expected free energy means that the model uncertainties associated with an agent’s goals and desires are prioritised over those which are not. Furthermore, it means that model uncertainties are only resolved until an agent (believes that it) is sufficiently able to achieve its goals, such that agents need not resolve all of their model uncertainty. In our simulations, we showed that active inference agents learned models in a sample-efficient way, avoided being caught up in bad bootstraps, and generated well-adapted behaviour in our chemotaxis setting. Our data, therefore, support the hypothesis that learning via active inference provides a principled and pragmatic approach to the learning of well-adapted action-oriented generative models.

Exploration vs. exploitation

Balancing epistemic and instrumental actions recalls the well-known trade-off between exploration and exploitation in reinforcement learning. In this context, the simplest formulation of this trade-off can be construed as a model-free notion in which exploration involves random actions. One example of this simple formulation is the ε-greedy algorithm, which injects noise into the action selection process to overcome premature sub-optimal convergence [34]. While an ε-greedy strategy might help overcome ‘bad-bootstraps’ by occasionally promoting exploratory actions, the undirected nature of random exploration is unlikely to scale to complex environments.
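As a point of comparison, a minimal sketch of ε-greedy action selection in the present run/tumble setting (our own illustration; not part of the models simulated in this paper) is:

    import numpy as np

    def epsilon_greedy(instrumental_value, epsilon=0.1, rng=None):
        """Return the index of the action with the highest estimated value,
        acting at random with probability epsilon."""
        rng = rng or np.random.default_rng()
        if rng.random() < epsilon:
            return int(rng.integers(len(instrumental_value)))   # undirected exploration
        return int(np.argmax(instrumental_value))               # exploit current estimates

    # e.g. epsilon_greedy(np.array([0.2, 0.7])) usually returns 1,
    # but occasionally returns a random action.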

The balance between epistemic and instrumental actions in our active inference agents is more closely connected to the exploration-exploitation trade-off in model-based RL. As in our agents, model-based RL agents often employ exploratory actions that are selected to resolve model uncertainty. As we have noted, such actions can help avoid sub-optimal convergence (bad bootstraps), especially at the early stages of learning where data is sparse. However, in model-based RL it is normally assumed that, in the limit, a maximally comprehensive and maximally accurate (i.e., veridical) model would be optimal. This is exemplified by approaches that conduct an initial ‘exploration’ phase—in which the task is to construct a veridical model from as few samples as possible—followed by a subsequent ‘exploitation’ phase. By contrast, our approach highlights the importance of ‘goal-directed exploration’, in which the aim is not to resolve all uncertainty to construct a maximally accurate representation of the environment, but is instead to selectively resolve uncertainty until adaptive behaviour is (predicted to be) possible. Moreover, we have demonstrated that goal-directed exploration allows exploration to be contextualised by an agent’s goals. Specifically, we have shown that acting to simultaneously explore and exploit the environment causes exploration to be biased towards parts of state space that are relevant for goal-directed behaviour, thereby increasing the efficiency of exploration. Therefore, our work suggests that acting to minimise expected free energy can benefit learning by naturally affording an efficient form of goal-directed exploration.

This kind of goal-directed exploration highlights an alternative perspective on the exploration-exploitation trade-off. We demonstrated that “exploitation”—traditionally associated with exploiting the agent’s current knowledge to accumulate reward—can also lead to a type of constrained learning that leads to ‘action-oriented’ representations of the environment. In other words, our results suggest that, in the context of model-learning, the “explore-exploit” dilemma additionally entails an “explore-constrain” dilemma. This is granted a formal interpretation under the active inference framework—as instrumental actions are associated with soliciting observations that are consistent with the model’s prior expectations. However, given the formal relationship between instrumental value in active inference and the Bellman equations [43], a similar trade-off can be expected to arise in any model-based RL paradigm.

Model non-veridicality

In our simulations, models learned through active inference were able to support adaptive behaviour even when their structure and variables departed significantly from an accurate representation of the environment. By design, the models utilized a severely impoverished representation of the environment. An exhaustive representation would have required models to encode information about the agent’s position, orientation, the position of the chemical source, as well as a spatial map of the chemical concentrations, such that determining an adaptive action would have required a complex transformation of these variables. In contrast, our model was able to support adaptive behaviour by simply encoding a representation of the instantaneous effects of action on the local chemical gradient. Therefore, rather than encoding a rich and exhaustive internal mirror of nature, the model encoded a parsimonious representation of sensorimotor couplings that were relevant for enabling action [64]. While this particular ‘action-oriented’ representation was built-in through the design of the generative model, it nonetheless underlines that models need not be homologous with their environment if they are to support adaptive behaviour.

By evaluating the number of ‘redundant’ model parameters (as identified through Bayesian model reduction), we further demonstrated that learning in the presence of goal-directed behaviour led to models that were more parsimonious in their representation of the environment, relative to other strategies (Fig 5b). Moreover, we showed that this strategy led to models that did not asymptote to veridicality, in terms of the accuracy of the model’s parameters (Fig 4a). Interestingly, these agents nevertheless displayed high ‘active accuracy’ (i.e., the predictive accuracy in the presence of self-determined actions), highlighting the importance of contextualising model accuracy in terms of an agent’s actions and goals.

While these results demonstrate that models can support adaptive behaviour in spite of their misrepresentation of the environment and that these misrepresentations afforded benefits in terms of sample efficiency and model complexity, the active inference framework additionally provides a mechanism whereby misrepresentation enables adaptive behaviour. Active inference necessarily requires an organism’s model to include systematic misrepresentations of the environment, by virtue of the organism’s existence. Specifically, an organism’s generative model must encode a set of prior beliefs that distinguish it from its external environment. For instance, the chemotaxis agents in the current simulation encoded the belief that observing positive chemical gradients was a-priori more likely. From an objective and passive point of view, these prior beliefs are, by definition, false. However, these systematic misrepresentations can be realized through action, thereby giving rise to apparently purposeful and autopoietic behaviour. Thus, under active inference, adaptive behaviour is achieved because of, and not just in spite of, a model’s departure from veridicality [15].

Encoding frugal and parsimonious models plausibly affords organisms several evolutionary advantages. First, the number of model parameters will likely correlate with the metabolic cost of that model. Moreover, simpler models will be quicker to deploy in the service of action and perception and will be less likely to overfit the environment. This perspective, therefore, suggests that the degree to which exhaustive and accurate models are constructed should be mandated by the degree to which they are necessary for on-going survival. If the mapping between the external environment and allostatic responses is complex and manifold, then faithfully modelling features of the environment may pay dividends. However, in the case that frugal approximations and rough heuristics can be employed in the service of adaptive behaviour, such faithful modelling should be avoided. We showed that such “action-oriented” models arise naturally under ecologically valid learning conditions, namely, learning online in the presence of goal-directed behaviour. However, action-oriented behaviour that was adapted to the agent’s goals only arose under the minimisation of expected free energy.

It is natural to ask whether there are scenarios in which action-oriented models might impede effective learning and adaptation. One such candidate scenario is transfer learning [65], whereby existing knowledge is reapplied to novel tasks or environments. This form of learning is likely to be important in biology, as for many organisms preferences can change over time. If the novel task or environment requires a pattern of sensorimotor coordination that is distinct from learned patterns of sensorimotor coordination, then a more exhaustive model of the environment might indeed facilitate transfer learning. However, if adaptation in the novel task or environment can be achieved through a subset of existing patterns of sensorimotor coordination (e.g. going from walking to running), then one might expect an action-oriented representation to facilitate transfer learning, in so far as such representations reduce the search space for learning the new behaviour. This type of transfer learning is closely related to curriculum learning [66], whereby complex behaviours are learned progressively by first learning a series of simpler behaviours. We leave it to future work to explore the scenarios in which action-oriented models enable efficient transfer and curriculum learning.

Active inference

While any approach to balancing exploration and exploitation could, in principle, realise the benefits described in the previous sections, we have focused on the normative principle of active inference. From a purely theoretical perspective, active inference re-frames the exploration-exploitation dilemma by suggesting that both exploration and exploitation are complementary perspectives on a single objective function: the minimization of expected free energy. However, an open question remains as to whether this approach provides a practical solution to balancing exploration and exploitation. On the one hand, it provides a practically useful recipe by casting both epistemic and instrumental value in the same (information-theoretic) currency. On the other hand, the balance will necessarily depend on the shape of the agent’s beliefs about hidden states, beliefs about model parameters, and prior beliefs about preferable observations. In the current work, we introduced an artificial weighting term to keep the epistemic and instrumental value within the same range. The same effect could have been achieved by adjusting the shape (i.e. variance) of the prior preferences P(o).

Active inference also provides a suitable framework for investigating the emergence of action-oriented models. Previous work has highlighted the fact that active inference is consistent with, and necessarily prescribes, frugal and parsimonious generative models, thus providing a potential bridge between ‘representation-hungry’ approaches to cognition espoused by classical cognitivism and the ‘representation-free’ approaches advocated by embodied and enactive approaches [6, 12, 13, 64, 67–75].

This perspective has been motivated by at least three reasons. First, active inference is proposed as a description of self-organization in complex systems [6]. Deploying generative models and minimizing free energy are construed as emergent features of a more fundamental drive towards survival. On this account, the purpose of representation is not to construct a rich internal world model, but instead to capture the environmental regularities that allow the organism to act adaptively.

The second reason is that minimizing free energy implicitly penalizes the complexity of the generative model (see Appendix 1). This implies that minimizing free energy will reduce the complexity (or number of parameters) required to go from prior beliefs to (approximately) posterior beliefs, i.e. in explaining some observations. This occurs under the constraint of accuracy, which ensures that the inferred variables can sufficiently account for the observations. In other words, minimizing free energy ensures that organisms maximize the accuracy of their predictions while minimizing the complexity of the models that are used to generate those predictions.
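For reference, one standard rearrangement of the variational free energy (written here for a generic latent variable s; S1 Appendix provides the full derivations used in this paper) makes the complexity and accuracy terms explicit:

$$F = \underbrace{D_{\mathrm{KL}}\big[Q(s)\,\|\,P(s)\big]}_{\text{complexity}} - \underbrace{\mathbb{E}_{Q(s)}\big[\ln P(o \mid s)\big]}_{\text{accuracy}}$$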

As discussed in the previous section, active inference also requires agents to encode systematic misrepresentations of their environment. Our work has additionally introduced a fourth motivation for linking active inference to adaptive action-oriented models, namely, that the minimization of expected free energy induces a balance between self-sustaining (and thus constrained) patterns of agent-environment interaction and goal-directed exploration.

The arguments and simulations presented in this paper resonate with previous work which views an active inference agent as a ‘crooked scientist’ [76, 77]. Here, an agent is seen as a ‘scientist’ insofar as it seeks out information to enable more accurate predictions. However, this work additionally highlights the fact that agents are biased by their own non-negotiable prior beliefs and preferences, leading them to seek out evidence for these hypotheses. We have built upon this previous work by exploring the types of models that are learned when an agent acts as a ‘crooked scientist’.

Conclusion

In this paper, we have demonstrated that the minimization of expected free energy (through active inference) provides a principled and pragmatic solution to learning action-oriented probabilistic models. Such action-oriented models can make learning in complex natural environments tractable, and provide a potential bridge between ‘representation-hungry’ approaches to cognition and those espoused by enactive and embodied disciplines. Moreover, we showed how learning online in the presence of behaviour can give rise to ‘bad-bootstraps’, a phenomenon that has the potential to be problematic whenever learning is coupled with behaviour. Epistemic or information-seeking actions provide a plausible mechanism for overcoming bad-bootstraps. However, for exploration to be efficient, the epistemic value of actions must be contextualized by an agent’s goals and desires. The ability to learn adapted models that are tailored to action provides a potential route to tractable and sample-efficient learning algorithms in a variety of contexts, including computational biology and model-based RL.

Methods

The generative model

The agent’s generative model specifies the joint probability over observations o, hidden state variables s, control variables u and parameter variables θ. To account for temporal dependencies among variables, we consider a generative model over a sequence of variables through time, i.e. x̃ = {x1, ..., xt}, where the tilde notation indicates a sequence of variables up to the current time t, and xt denotes the value of x at time t. The generative model is given by the joint probability distribution P(õ, s̃, ũ, θ | λ, α), where:

$$\begin{aligned}
P(\tilde{o},\tilde{s},\tilde{u},\theta \mid \lambda,\alpha) &= P(\theta \mid \alpha)\prod_{t=1}^{T} P(o_t \mid s_t,\lambda)\,P(s_t \mid s_{t-1},u_{t-1},\theta)\,P(u_t) \\
P(o_t \mid s_t,\lambda) &= \mathrm{Cat}(\lambda) \\
P(s_t \mid s_{t-1},u_{t-1},\theta) &= \mathrm{Cat}(\theta) \\
P(\theta \mid \alpha) &= \mathrm{Dir}(\alpha) \\
P(u_t) &= \sigma(-\tilde{G})
\end{aligned} \qquad (9)$$

where σ(⋅) is the softmax function. For simplicity, we initialize P(st=0) as a uniform distribution, and therefore exclude it from Eq 9.
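For completeness, the softmax function maps a vector of (log-)values to a normalized categorical distribution:

$$\sigma(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$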

The likelihood distribution specifies the probability of observing some chemical gradient ot given a belief about the chemical gradient st. This distribution is described by a set of categorical distributions, denoted Cat(⋅), where each categorical distribution is a distribution over k discrete and exclusive possibilities. The parameters of a categorical distribution can be represented as a vector with each entry describing the probability of some event pi, with ∑i pi = 1. As the likelihood distribution is a conditional distribution, a separate categorical distribution is maintained for each hidden state in S (i.e. spos and sneg), where each of these distributions specifies the conditional probability of observing some chemical gradient (either opos or oneg). The parameters of the likelihood distribution can therefore be represented as a 2 x 2 matrix where each column j is a categorical distribution that describes P(ot|st = j, λ). For the current simulations, we provide agents with the parameters λ and do not require these parameters to be learned. The provided parameters encode the belief that there is an unambiguous mapping between spos and opos, and between sneg and oneg, meaning that λ can be encoded as an identity matrix.
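As a concrete illustration (a sketch using our own variable names rather than those of the accompanying code), this fixed likelihood can be written as:

    import numpy as np

    # Rows index observations (o_pos, o_neg); columns index hidden states (s_pos, s_neg).
    # Each column is the categorical distribution P(o_t | s_t = j, lambda).
    likelihood = np.eye(2)   # unambiguous mapping: s_pos -> o_pos, s_neg -> o_neg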

The prior probability over hidden states st is given by the transition distribution P(st|st−1, ut−1, θ), which specifies the probability of the current hidden state, given beliefs about the previous hidden state and the previous control state. In other words, this distribution describes an agent’s beliefs about how running and tumbling will cause changes in the chemical gradient. Following previous work [38], we assume that agents know which control state was executed at the previous time step. As with the likelihood distribution, the prior distribution is described by a set of categorical distributions. Each categorical distribution j specifies the probability distribution P(st|st−1 = j, θ), such that P(st|st−1, θ) can again be represented as a 2 x 2 matrix. However, the transition distribution is also conditioned on control states u, meaning a separate transition matrix is maintained for both urun and utumble, such that the transition distribution can be represented as a 2 x 2 x 2 tensor. Agents, therefore, maintain separate beliefs about how the environment is likely to change for each control state.

We require agents to learn the parameters θ of the transition distribution. At the start of each learning period, we randomly initialize θ, such that agents start out with random beliefs about how actions cause transitions in the chemical gradient. To enable these parameters to be learned, the generative model encodes (time-invariant) prior beliefs over θ in the distribution P(θ|α). This distribution is modelled as a Dirichlet distribution, denoted Dir(⋅), where α are the parameters of this distribution. A Dirichlet distribution represents a distribution over the parameters of another distribution. In other words, sampling from this distribution returns a vector of parameters, rather than a scalar. By maintaining a distribution over θ, the task of learning about the environment is transformed into a task of inferring unknown variables.
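A minimal sketch of this parameterisation (variable names and the exact initialisation scheme are our own, for illustration only) is:

    import numpy as np

    rng = np.random.default_rng(0)

    # Dirichlet counts over the transition parameters: one 2 x 2 matrix per control state
    # (run, tumble), giving a 2 x 2 x 2 tensor. Within each matrix, rows index the current
    # hidden state and columns index the previous hidden state.
    phi_alpha = rng.uniform(0.1, 1.0, size=(2, 2, 2))

    # The expected transition distributions are the normalised counts
    # (each column of each matrix sums to one).
    expected_theta = phi_alpha / phi_alpha.sum(axis=1, keepdims=True)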

Finally, the prior probability of control states is proportional to a softmax transformation of -G˜, which is a vector of (negative) expected free energies, with one entry for each control state. This formalizes the notion that control states are a-priori more likely if they are expected to minimize free energy. We provide a full specification of expected free energy in the following sections.

The approximate posterior

The approximate posterior encodes an agent’s current approximate posterior beliefs about the chemical gradient s, the control state u and model parameters θ. As with the generative model, the approximate posterior is over a sequence of variables Q(s̃, ũ, θ | ϕ), where ϕ are the sufficient statistics of the distribution.

In order to make inference tractable, we utilize the mean-field approximation to factorize the approximate posterior. This approximation treats a potentially high-dimensional distribution as a product of a number of simpler marginal distributions. Heuristically, this treats certain variables as statistically independent. Practically, it allows us to infer individual variables while keeping the remaining variables fixed. This approximation makes inference tractable, at the (potential) price of making inference sub-optimal. For inference to be optimal, the factorization of the approximate posterior must match the factorization of the true posterior.

Here, we factorize over time, the beliefs about the chemical gradient, the beliefs about model parameters and the beliefs about control states:

$$\begin{aligned}
Q(\tilde{s},\tilde{u},\theta \mid \phi) &= Q(\theta \mid \phi_{\alpha})\prod_{t=0}^{T} Q(s_t \mid \phi_{s_t})\,Q(u_t \mid \phi_{u_t}) \\
Q(\theta \mid \phi_{\alpha}) &= \mathrm{Dir}(\phi_{\alpha}) \\
Q(s_t \mid \phi_{s_t}) &= \mathrm{Cat}(\phi_{s_t}) \\
Q(u_t \mid \phi_{u_t}) &= \mathrm{Cat}(\phi_{u_t})
\end{aligned} \qquad (10)$$

Inference, learning and action

Having defined the generative model and the approximate posterior, we can now specify how free energy can be minimized. In brief, this involves updating the sufficient statistics of the approximate posterior ϕ as new observations are sampled. To minimize free energy, we take the derivative of free energy with respect to the sufficient statistics, ∂F(ϕ, o)/∂ϕ, set it to zero, i.e. ∂F(ϕ, o)/∂ϕ = 0, and rearrange to give the variational updates that minimize free energy. Given the mean-field assumption, we can perform this scheme separately for each of the partitions of ϕ, i.e. ϕst, ϕut and ϕα.

For the current scheme, the update equations for the hidden state parameters ϕs are (see Appendix 5 for a full derivation):

$$\phi_{s_t} = \sigma\big(\ln P(o_t \mid s_t,\lambda) + \ln P(s_t \mid s_{t-1},u_{t-1},\theta)\big) \qquad (11)$$

This equation corresponds to state estimation or ‘perception’ and can be construed as a Bayesian filter that combines the likelihood of the current observation with a prior belief that is based on the previous hidden state and the previous control state. To implement this update in practice, we rewrite Eq 11 in terms of the relevant parameters and sufficient statistics (see Appendix 5):

$$\begin{aligned}
\phi_{s_t} &= \sigma\big(\ln\lambda \cdot o_t + \bar{\theta}_{u_{t-1}} \cdot \phi_{s_{t-1}}\big) \\
\bar{\theta}_{u_{t-1}} &= \mathbb{E}_{Q(\theta \mid \phi_{\alpha})}\big[\ln\theta_{u_{t-1}}\big] \\
&= \psi\big(\phi_{\alpha_{ij}}^{u_{t-1}}\big) - \psi\Big(\sum_{i=1}^{n}\phi_{\alpha_{ij}}^{u_{t-1}}\Big)
\end{aligned} \qquad (12)$$

Here, ot is a one-hot encoded vector specifying the current observation, θu specifies the transition distribution corresponding to control state u, and ψ(⋅) is the digamma function. Note that the parameters of the likelihood distribution λ are point estimates of a categorical distribution, meaning it is possible to straightforwardly take the logarithm of this distribution. However, the beliefs about θ are described by the Dirichlet distribution Q(θ|ϕα), meaning that the mean of the logarithm of this distribution (denoted θ̄) must be evaluated (leading to lines two and three of Eq 12).
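A Python sketch of this update (assuming the array conventions illustrated above; function and variable names are our own) is:

    import numpy as np
    from scipy.special import digamma, softmax

    def infer_state(o_t, phi_s_prev, phi_alpha_u_prev, likelihood):
        """Posterior beliefs over the chemical gradient (Eq 12).

        o_t              : one-hot observation vector, shape (2,)
        phi_s_prev       : previous posterior over hidden states, shape (2,)
        phi_alpha_u_prev : Dirichlet counts for the previously executed control state, shape (2, 2)
        likelihood       : likelihood matrix lambda, shape (2, 2)
        """
        # Expected log transition parameters under the Dirichlet beliefs (the digamma terms of Eq 12).
        log_theta_bar = digamma(phi_alpha_u_prev) - digamma(phi_alpha_u_prev.sum(axis=0, keepdims=True))
        empirical_prior = log_theta_bar @ phi_s_prev            # prediction from the previous state
        log_likelihood = np.log(likelihood + 1e-16).T @ o_t     # evidence from the current observation
        return softmax(log_likelihood + empirical_prior)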

Learning can be conducted in a similar manner by updating the parameters ϕα (see Appendix 5 for a full derivation):

$$\phi_{\alpha}^{u} = \alpha^{u} + \sum_{t=1}^{T}\big[a_{t-1} = u_{t-1}\big]\cdot\xi\,\phi_{s_t}\otimes\phi_{s_{t-1}} \qquad (13)$$

where [⋅] is an Iverson bracket that returns one if the statement inside the bracket is true and zero otherwise, and ξ is an artificial learning rate, set to 0.001 for all simulations. Note that we update the parameters ϕα after each iteration, but use a small learning rate to simulate the difference in time scales implied by the factorization of the generative model and approximate posterior. This update bears a resemblance to Hebbian plasticity, in the sense that the probability of each parameter increases if the corresponding transition is observed (i.e. ‘fire together wire together’).
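A corresponding sketch of the learning update (reading the product of posteriors in Eq 13 as an outer product; names are our own):

    import numpy as np

    def update_dirichlet(phi_alpha, a_prev, phi_s_t, phi_s_prev, lr=0.001):
        """Accumulate Dirichlet counts for the transition just (softly) observed (Eq 13).

        phi_alpha  : Dirichlet counts, shape (2 control states, 2 current states, 2 previous states)
        a_prev     : index of the control state executed at the previous time step
        phi_s_t    : current posterior over hidden states, shape (2,)
        phi_s_prev : previous posterior over hidden states, shape (2,)
        """
        phi_alpha = phi_alpha.copy()
        # Only the matrix of the executed control state is updated (the Iverson bracket of Eq 13);
        # the outer product implements the Hebbian-like co-occurrence term, scaled by the learning rate.
        phi_alpha[a_prev] += lr * np.outer(phi_s_t, phi_s_prev)
        return phi_alpha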

Finally, actions can be inferred by updating the parameters ϕut, where the update is given by (see Appendix 5 for a full derivation):

$$\phi_{u_t} = \sigma(-\tilde{G}) \qquad (14)$$

This equation demonstrates that the (approximately) posterior beliefs over control states are proportional to the vector of negative expected free energies. In other words, the posterior and prior beliefs about control states are identical.

Expected free energy

In this section, we describe how to evaluate the vector -G̃. This is a vector of negative expected free energies, with one entry for each control state u ∈ U. As specified in the formalism, the negative expected free energy for a single control state is defined as −Gτ(ut), where τ is some future time point, and, for the current simulations:

$$-G_{\tau}(u_t) = \underbrace{\mathbb{E}_{Q(o_{\tau},s_{\tau},\theta \mid u_t,\phi_{\tau})}\big[\ln Q(\theta \mid s_{\tau},o_{\tau},u_t,\phi_{\tau}) - \ln Q(\theta \mid \phi_{\tau})\big]}_{\text{Parameter epistemic value}} + \underbrace{\mathbb{E}_{Q(o_{\tau},s_{\tau},\theta \mid u_t,\phi_{\tau})}\big[\ln P(o_{\tau})\big]}_{\text{Instrumental value}} \qquad (15)$$

As described in the results section, we ignore the epistemic value for hidden states, as there is no uncertainty in the likelihood distribution. Moreover, for all simulations, τ = t + 1, such that we only consider the immediate effects of action. This scheme is, however, entirely consistent with a sequence of actions, i.e. a policy.

In order to evaluate expected free energy, we rewrite Eq 15 in terms of parameters. By noting that the expectation of ln P(oτ) under Q(oτ, sτ, θ | ut, ϕτ) is equal to its expectation under Q(oτ | ut, ϕτ), we can write instrumental value as:

$$\mathbb{E}_{Q(o_{\tau} \mid u_t,\phi_{\tau})}\big[\ln P(o_{\tau})\big] = \phi_{o_{\tau}} \cdot \rho \qquad (16)$$

where ϕoτ are the sufficient statistics of Q(oτ|ut, ϕτ), and ρ are the parameters of P(oτ), which is a categorical distribution, such that ρ is a vector with one entry for each o ∈ O. In order to evaluate parameter epistemic value, we utilise the following approximation:

$$\begin{aligned}
\mathbb{E}_{Q(o_{\tau},s_{\tau},\theta \mid u_t,\phi_{\tau})}\big[\ln Q(\theta \mid s_{\tau},o_{\tau},u_t,\phi_{\tau}) - \ln Q(\theta \mid \phi_{\tau})\big] &\approx \phi_{s_{\tau}} \cdot W_{u_t} \cdot \phi_{s_t} \\
W_{u_t} &= \Big(\sum_{i=1}^{n}\phi_{\alpha_{ij}}\Big)^{-1} - \phi_{\alpha}^{-1}
\end{aligned} \qquad (17)$$

For details of this approximation, we refer the reader to [40]. For a given control state ut, negative expected free energy can, therefore, be calculated as:

$$-G_{\tau}(u_t) = \phi_{s_{\tau}} \cdot W_{u_t} \cdot \phi_{s_t} + \delta\,\big(\phi_{o_{\tau}} \cdot \rho\big) \qquad (18)$$

where ϕsτ are the sufficient statistics of Q(sτ|ut, ϕτ) and δ is an optional weighting term. For all simulations, this is set to 1/10. To calculate Eq 18, it is first necessary to evaluate the expected beliefs Q(sτ|ut, ϕτ) and Q(oτ|ut, ϕτ). The expected distribution over hidden states Q(sτ|ut, ϕτ) is given by the expectation of P(sτ|st, ut, θ) under Q(st|ut, ϕτ). Given these beliefs over future hidden states, we can evaluate Q(oτ|ut, ϕτ) as the expectation of P(oτ|sτ, λ) under Q(sτ|ut, ϕτ).
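Putting Eqs 16-18 together, a sketch of the negative expected free energy for a single control state (with the counts-based matrix W of Eq 17 passed in as an argument; names are our own):

    import numpy as np

    def negative_efe(phi_s_t, W_u, theta_u, likelihood, log_prior_obs, delta=0.1):
        """Negative expected free energy for one control state (Eq 18).

        phi_s_t       : current posterior over hidden states, shape (2,)
        W_u           : counts-based matrix of Eq 17 for this control state, shape (2, 2)
        theta_u       : expected transition matrix for this control state, shape (2, 2)
        likelihood    : likelihood matrix lambda, shape (2, 2)
        log_prior_obs : rho, log prior preference over observations, shape (2,)
        delta         : weighting on instrumental value (1/10 in the simulations)
        """
        phi_s_next = theta_u @ phi_s_t              # Q(s_tau | u_t, phi_tau)
        phi_o_next = likelihood @ phi_s_next        # Q(o_tau | u_t, phi_tau)
        epistemic = phi_s_next @ W_u @ phi_s_t      # parameter epistemic value (Eq 17)
        instrumental = phi_o_next @ log_prior_obs   # instrumental value (Eq 16)
        return epistemic + delta * instrumental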

The full update scheme for the agents is provided in algorithm 1:

Algorithm 1 Active inference MDP algorithm

Require: parameters of likelihood distribution λ, parameters of prior distribution over transition distribution parameters α, prior probability of observations ρ

1: for t in T do
2:    $o_t \leftarrow \text{env.observe()}$    ⊲ Sample observation from environment
3:    $\phi_{s_t} = \sigma(\ln\lambda \cdot o_t + \bar{\theta}_{u_{t-1}} \cdot \phi_{s_{t-1}})$    ⊲ Hidden state inference
4:    $\phi_{u_t} = \sigma(-\tilde{G})$    ⊲ Control state inference
5:      where $-G_{\tau}(u_t) = \underbrace{\phi_{s_{\tau}} \cdot W_{u_t} \cdot \phi_{s_t}}_{\text{Epistemic agent}} + \underbrace{\phi_{o_{\tau}} \cdot \rho}_{\text{Instrumental agent}}$ (the expected free energy agent uses both terms)
6:    $\phi_{\alpha}^{u} = \alpha^{u} + \sum_{t=1}^{T}[a_{t-1} = u_{t-1}] \cdot \xi\,\phi_{s_t} \otimes \phi_{s_{t-1}}$    ⊲ Parameter learning
7:    $a_t \sim Q(u_t \mid \phi_{u_t})$    ⊲ Sample action
8:    env.update($a_t$)    ⊲ Perform action
9: end for
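For orientation, a structural Python sketch of this loop follows. It is not the authors' implementation (see the repository listed under Data Availability for the actual code); it assumes the helper functions sketched above (infer_state, update_dirichlet, negative_efe), a function w_matrix implementing Eq 17, and a hypothetical env object exposing observe() and update().

    import numpy as np
    from scipy.special import softmax

    def run_agent(env, likelihood, alpha, log_prior_obs, T=1000, rng=None):
        rng = rng or np.random.default_rng()
        phi_alpha = alpha.copy()                    # Dirichlet counts, shape (2, 2, 2)
        phi_s = np.ones(2) / 2                      # uniform initial state beliefs
        a_prev = int(rng.integers(2))               # arbitrary first action

        for _ in range(T):
            o_t = env.observe()                                                  # sample observation
            phi_s_new = infer_state(o_t, phi_s, phi_alpha[a_prev], likelihood)   # hidden state inference
            theta = phi_alpha / phi_alpha.sum(axis=1, keepdims=True)             # expected transitions
            neg_G = np.array([negative_efe(phi_s_new, w_matrix(phi_alpha[u]),
                                           theta[u], likelihood, log_prior_obs)
                              for u in range(2)])
            phi_u = softmax(neg_G)                                               # control state inference
            phi_alpha = update_dirichlet(phi_alpha, a_prev, phi_s_new, phi_s)    # learning
            a_t = int(rng.choice(2, p=phi_u))                                    # sample action
            env.update(a_t)                                                      # perform action
            phi_s, a_prev = phi_s_new, a_t
        return phi_alpha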

Supporting information

S1 Appendix. Rearrangements of the free energy functional.

In this appendix, we provide derivations for three arrangements of the free energy functional.

(PDF)

S2 Appendix. Derivation of expected free energy.

In this appendix, we formally describe the relationship between free energy and expected free energy.

(PDF)

S3 Appendix. Deriving instrumental and epistemic value.

In this appendix, we decompose expected free energy into instrumental and epistemic value.

(PDF)

S4 Appendix. Relationship of epistemic value to established formalisms.

In this appendix, we demonstrate that epistemic value—a component of expected free energy—is equivalent to a number of established formalisms.

(PDF)

S5 Appendix. Deriving the variational update equations for inference, learning and action.

In this appendix, we derive the update equations for beliefs about hidden states, control states and model parameters.

(PDF)

S6 Appendix. Learning in a high-dimensional state space.

In this appendix, we present results for an additional experiment where we compare learning under epistemic and random action strategies in a high-dimensional state space.

(PDF)

Data Availability

The Python code used to simulate the models and the simulation data are available at https://github.com/alec-tschantz/action-oriented.

Funding Statement

AT is funded by a PhD studentship from the Sackler Foundation and the School of Engineering and Informatics at the University of Sussex. CLB is supported by BBSRC grant number BB/P022197/1. We are grateful to the Dr. Mortimer and Theresa Sackler Foundation, which supports the Sackler Centre for Consciousness Science. AKS is additionally grateful to the Canadian Institute for Advanced Research (Azrieli Programme on Brain, Mind, and Consciousness). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Doll BB, Simon DA, Daw ND. The ubiquity of model-based reinforcement learning. Current opinion in neurobiology. 2012;22(6):1075–1081. 10.1016/j.conb.2012.08.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Dayan P, Berridge KC. Model-based and model-free Pavlovian reward learning: revaluation, revision, and revelation. Cognitive, Affective & Behavioral Neuroscience. 2014;14(2):473–492. 10.3758/s13415-014-0277-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Botvinick M, Weinstein A. Model-based hierarchical reinforcement learning and human action control. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences. 2014;369 (1655). 10.1098/rstb.2013.0480 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Dolan R, Dayan P. Goals and Habits in the Brain. Neuron. 2013;80(2):312–325. 10.1016/j.neuron.2013.09.007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Conant RC, Ashby WR. Every good regulator of a system must be a model of that system. International Journal of Systems Science. 1970;1(2):89–97. 10.1080/00207727008920220 [DOI] [Google Scholar]
  • 6. Friston K. Life as we know it. Journal of the Royal Society, Interface. 2013;10(86):20130475 10.1098/rsif.2013.0475 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Kuvayev L, Sutton RS. Model-Based Reinforcement Learning with an Approximate, Learned Model. In: in Proceedings of the Ninth Yale Workshop on Adaptive and Learning Systems; 1996. p. 101–105.
  • 8. Deisenroth MP. A Survey on Policy Search for Robotics. Foundations and Trends in Robotics. 2011;2(1-2):1–142. 10.1561/2300000021 [DOI] [Google Scholar]
  • 9. Seth AK. The cybernetic Bayesian brain. In: Open MIND; 2015. p. 1–24. [Google Scholar]
  • 10. Seth AK, Tsakiris M. Being a Beast Machine: The Somatic Basis of Selfhood. Trends in Cognitive Sciences. 2018;22(11):969–981. 10.1016/j.tics.2018.08.008 [DOI] [PubMed] [Google Scholar]
  • 11. Baltieri M, Buckley CL. An active inference implementation of phototaxis. The 2018 Conference on Artificial Life: A Hybrid of the European Conference on Artificial Life (ECAL) and the International Conference on the Synthesis and Simulation of Living Systems (ALIFE). 2017;29:36–43. [Google Scholar]
  • 12. Clark A. Radical Predictive Processing. Southern Journal of Philosophy. 2015;53(S1):3–27. 10.1111/sjp.12120 [DOI] [Google Scholar]
  • 13. Pezzulo G, Donnarumma F, Iodice P, Maisto D, Stoianov I. Model-Based Approaches to Active Perception and Control. Entropy. 2017;19:266 10.3390/e19060266 [DOI] [Google Scholar]
  • 14. Gibson JJ. The Ecological Approach to Visual Perception: Classic Edition. Psychology Press; 2014. [Google Scholar]
  • 15. Wiese W. Action Is Enabled by Systematic Misrepresentations. Erkenntnis. 2017;82(6):1233–1252. 10.1007/s10670-016-9867-x [DOI] [Google Scholar]
  • 16. McKay RT, Dennett DC. The evolution of misbelief. The Behavioral and Brain Sciences. 2009;32(6):493–510; discussion 510–561. 10.1017/S0140525X09990975 [DOI] [PubMed] [Google Scholar]
  • 17. Mendelovici A. Reliable Misrepresentation and Tracking Theories of Mental Representation. Philosophical Studies. 2013;165(2):421–443. 10.1007/s11098-012-9966-8 [DOI] [Google Scholar]
  • 18. Zehetleitner M, Schönbrodt FB. When misrepresentations are successful In: Epistemological Dimensions of Evolutionary Psychology. New York: Springer; 2015. [Google Scholar]
  • 19. Verschure PFMJ, Voegtlin T, Douglas RJ. Environmentally mediated synergy between perception and behaviour in mobile robots. Nature. 2003;425(6958):620–624. 10.1038/nature02024 [DOI] [PubMed] [Google Scholar]
  • 20. Montúfar G, Ghazi-Zahedi K, Ay N. A Theory of Cheap Control in Embodied Systems. PLOS Computational Biology. 2015;11(9):e1004427 10.1371/journal.pcbi.1004427 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Thornton C. Gauging the value of good data: Informational embodiment quantification. Adaptive Behavior. 2010;18(5):389–399. 10.1177/1059712310383914 [DOI] [Google Scholar]
  • 22.Ruesch J, Ferreira R, Bernardino A. A measure of good motor actions for active visual perception. In: 2011 IEEE International Conference on Development and Learning (ICDL). vol. 2; 2011. p. 1–6.
  • 23.Lungarella M, Sporns O. Information Self-Structuring: Key Principle for Learning and Development. In: Proceedings. The 4th International Conference on Development and Learning, 2005; 2005. p. 25–30.
  • 24. Lungarella M, Sporns O. Mapping Information Flow in Sensorimotor Networks. PLOS Computational Biology. 2006;2(10):e144 10.1371/journal.pcbi.0020144 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Yang SCH, Wolpert DM, Lengyel M. Theoretical perspectives on active sensing. Current opinion in behavioral sciences. 2018;11:100–108. 10.1016/j.cobeha.2016.06.009 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Gottlieb J, Oudeyer PY. Towards a neuroscience of active sampling and curiosity. Nature Reviews Neuroscience. 2018;19(12):758 10.1038/s41583-018-0078-0 [DOI] [PubMed] [Google Scholar]
  • 27. Friston K, Adams RA, Perrinet L, Breakspear M. Perceptions as Hypotheses: Saccades as Experiments. Frontiers in Psychology. 2012;3 10.3389/fpsyg.2012.00151 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Barandiaran XE. Autonomy and Enactivism: Towards a Theory of Sensorimotor Autonomous Agency. Topoi. 2017;36(3):409–430. 10.1007/s11245-016-9365-4 [DOI] [Google Scholar]
  • 29. Egbert MD, Barandiaran XE. Modeling habits as self-sustaining patterns of sensorimotor behavior. Frontiers in Human Neuroscience. 2014;8 10.3389/fnhum.2014.00590 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Polydoros AS, Nalpantidis L. Survey of Model-Based Reinforcement Learning: Applications on Robotics. Journal of Intelligent & Robotic Systems. 2017;86(2):153–173. 10.1007/s10846-017-0468-y [DOI] [Google Scholar]
  • 31.Atkeson CG, Santamaria JC. A Comparison of Direct and Model-Based Reinforcement Learning. In: In International Conference on Robotics and Automation. IEEE Press; 1997. p. 3557–3564.
  • 32.Ha D, Schmidhuber J. World Models. arXiv:180310122. 2018;.
  • 33.Chua K, Calandra R, McAllister R, Levine S. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. arXiv:180512114. 2018;.
  • 34.Watkins CJCH. Learning from delayed rewards. Ph D thesis, King’s College, University of Cambridge. 1989;.
  • 35.Stadie BC, Levine S, Abbeel P. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models. arXiv:150700814 [cs, stat]. 2015;.
  • 36.Houthooft R, Chen X, Duan Y, Schulman J, De Turck F, Abbeel P. VIME: Variational Information Maximizing Exploration. arXiv:160509674 [cs, stat]. 2016;.
  • 37.Sun Y, Gomez F, Schmidhuber J. Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments. arXiv:11035708 [cs, stat]. 2011;.
  • 38. Friston K, Rigoli F, Ognibene D, Mathys C, Fitzgerald T, Pezzulo G. Active inference and epistemic value. Cognitive Neuroscience. 2015;6(4):187–214. 10.1080/17588928.2015.1020053 [DOI] [PubMed] [Google Scholar]
  • 39.Burda Y, Edwards H, Pathak D, Storkey A, Darrell T, Efros AA. Large-Scale Study of Curiosity-Driven Learning. arXiv:180804355 [cs, stat]. 2018;.
  • 40. Friston KJ, Lin M, Frith CD, Pezzulo G, Hobson JA, Ondobaka S. Active Inference, Curiosity and Insight. Neural Computation. 2017;29(10):2633–2683. 10.1162/neco_a_00999 [DOI] [PubMed] [Google Scholar]
  • 41. Friston KJ, Stephan KE. Free-energy and the brain. Synthese. 2007;159(3):417–458. 10.1007/s11229-007-9237-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Friston K. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience. 2010;11(2):127–138. 10.1038/nrn2787 [DOI] [PubMed] [Google Scholar]
  • 43. Friston K, FitzGerald T, Rigoli F, Schwartenbeck P, O’Doherty J, Pezzulo G. Active inference and learning. Neuroscience & Biobehavioral Reviews. 2016;68:862–879. 10.1016/j.neubiorev.2016.06.022 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Hinton GE, van Camp D. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory. COLT’93. New York, NY, USA: ACM; 1993. p. 5–13. Available from: http://doi.acm.org/10.1145/168304.168306.
  • 45. Knill DC, Pouget A. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosciences. 2004;27(12):712–719. 10.1016/j.tins.2004.10.007 [DOI] [PubMed] [Google Scholar]
  • 46. Gregory RL. Perceptions as hypotheses. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences. 1980;290(1038):181–197. 10.1098/rstb.1980.0090 [DOI] [PubMed] [Google Scholar]
  • 47. Rao RP, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience. 1999;2(1):79–87. 10.1038/4580 [DOI] [PubMed] [Google Scholar]
  • 48. Buckley CL, Kim CS, McGregor S, Seth AK. The free energy principle for action and perception: A mathematical review. Journal of Mathematical Psychology. 2017;81:55–79. 10.1016/j.jmp.2017.09.004 [DOI] [Google Scholar]
  • 49. Friston KJ, Daunizeau J, Kiebel SJ. Reinforcement Learning or Active Inference? PLOS ONE. 2009;4(7):e6421 10.1371/journal.pone.0006421 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Friston K, Adams R, Montague R. What is value-accumulated reward or evidence? Frontiers in Neurorobotics. 2012;6:11 10.3389/fnbot.2012.00011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Friston K, Thomas FitzGerald, Michael Moutoussis, Timothy Behrens, Dolan Raymond J. The anatomy of choice: dopamine and decision-making. Philosophical Transactions of the Royal Society B: Biological Sciences. 2014;369(1655):20130481 10.1098/rstb.2013.0481 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Parr T, Friston KJ. Generalised free energy and active inference: can the future cause the past? bioRxiv. 2018;. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Schwartenbeck P, Passecker J, Hauser T, FitzGerald THB, Kronbichler M, Friston KJ. Computational mechanisms of curiosity and goal-directed exploration. bioRxiv. 2018;. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Friston KJ, Rosch R, Parr T, Price C, Bowman H. Deep temporal models and active inference. Neuroscience & Biobehavioral Reviews. 2018;90:486–501. 10.1016/j.neubiorev.2018.04.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Mitchell A, Romano GH, Groisman B, Yona A, Dekel E, Kupiec M, et al. Adaptive prediction of environmental changes by microorganisms. Nature. 2009;460(7252):220–224. 10.1038/nature08112 [DOI] [PubMed] [Google Scholar]
  • 56. Mitchell A, Lim W. Cellular perception and misperception: Internal models for decision-making shaped by evolutionary experience. BioEssays: News and Reviews in Molecular, Cellular and Developmental Biology. 2016;38(9):845–849. 10.1002/bies.201600090 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Freddolino PL, Tavazoie S. Beyond homeostasis: a predictive-dynamic framework for understanding cellular behavior. Annual Review of Cell and Developmental Biology. 2012;28:363–384. 10.1146/annurev-cellbio-092910-154129 [DOI] [PubMed] [Google Scholar]
  • 58. Berg HC, Brown DA. Chemotaxis in Escherichia coli analysed by three-dimensional tracking. Nature. 1972;239(5374):500–504. 10.1038/239500a0 [DOI] [PubMed] [Google Scholar]
  • 59. Puterman ML. Markov Decision Processes: Discrete Stochastic Dynamic Programming. 1st ed New York, NY, USA: John Wiley & Sons, Inc; 1994. [Google Scholar]
  • 60. Thar R, Kuhl M. Bacteria are not too small for spatial sensing of chemical gradients: an experimental evidence. Proceedings of the National Academy of Sciences of the United States of America. 2003;100(10):5748–5753. 10.1073/pnas.1030795100 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Friston KJ, Litvak V, Oswal A, Razi A, Stephan KE, van Wijk BCM, et al. Bayesian model reduction and empirical Bayes for group (DCM) studies. NeuroImage. 2016;128:413–431. 10.1016/j.neuroimage.2015.11.015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Wang T, Bao X, Clavera I, Hoang J, Wen Y, Langlois E, et al. Benchmarking Model-Based Reinforcement Learning. arXiv:190702057 [cs, stat]. 2019;.
  • 63.Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al. Playing Atari with Deep Reinforcement Learning. arXiv:13125602 [cs]. 2013;.
  • 64.Baltieri M, Buckley CL. Generative models as parsimonious descriptions of sensorimotor loops. arXiv:190412937 [cs, q-bio]. 2019;. [DOI] [PubMed]
  • 65. Lu J, Behbood V, Hao P, Zuo H, Xue S, Zhang G. Transfer learning using computational intelligence: A survey. Knowledge-Based Systems. 2015;80:14–23. 10.1016/j.knosys.2015.01.010 [DOI] [Google Scholar]
  • 66.Bengio Y, Louradour J, Collobert R, Weston J. Curriculum learning. In: Proceedings of the 26th annual international conference on machine learning; 2009. p. 41–48.
  • 67. Kiverstein J. Free Energy and the Self: An Ecological–Enactive Interpretation. Topoi. 2018;. 10.1007/s11245-018-9561-5 [DOI] [Google Scholar]
  • 68. Kirchhoff MD, Robertson I. Enactivism and predictive processing: a non-representational view. Philosophical Explorations. 2018;21(2):264–281. 10.1080/13869795.2018.1477983 [DOI] [Google Scholar]
  • 69. Negru T. Self-organization, Autopoiesis, Free-energy Principle and Autonomy. Organon F. 2018;25(2):215–243. [Google Scholar]
  • 70. Linson A, Clark A, Ramamoorthy S, Friston K. The Active Inference Approach to Ecological Perception: General Information Dynamics for Natural and Artificial Embodied Cognition. Frontiers in Robotics and AI. 2018;5 10.3389/frobt.2018.00021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Williams D. Predictive Processing and the Representation Wars. Minds and Machines. 2018;28(1):141–172. 10.1007/s11023-017-9441-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Michael Kirchhoff, Thomas Parr, Ensor Palacios, Karl Friston, Julian Kiverstein. The Markov blankets of life: autonomy, active inference and the free energy principle. Journal of The Royal Society Interface. 2018;15(138):20170792 10.1098/rsif.2017.0792 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Kirchhoff MD, Froese T. Where There is Life There is Mind: In Support of a Strong Life-Mind Continuity Thesis. Entropy. 2017;19(4):169 10.3390/e19040169 [DOI] [Google Scholar]
  • 74. Baltieri M, Buckley C. The dark room problem in predictive processing and active inference, a legacy of cognitivism? PsyArXiv; 2019. Available from: https://osf.io/p4z8f. [Google Scholar]
  • 75.Baltieri M, Buckley CL. Nonmodular architectures of cognitive systems based on active inference. arXiv:190309542 [cs, q-bio]. 2019;.
  • 76. Bruineberg J, Rietveld E, Parr T, van Maanen L, Friston KJ. Free-energy minimization in joint agent-environment systems: A niche construction perspective. Journal of theoretical biology. 2018;455:161–178. 10.1016/j.jtbi.2018.07.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Bruineberg J, Kiverstein J, Rietveld E. The anticipating brain is not a scientist: the free-energy principle from an ecological-enactive perspective. Synthese. 2018;195(6):2417–2444. 10.1007/s11229-016-1239-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007805.r001

Decision Letter 0

Natalia L Komarova

11 Feb 2020

Dear Mr Tschantz,

Thank you very much for submitting your manuscript "Learning action-oriented models through active inference" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Natalia L. Komarova

Deputy Editor

PLOS Computational Biology


***********************


Reviewer's Responses to Questions

Comments to the Authors:


Reviewer #1: This is a very interesting and well written article. It provides compelling theoretical arguments and computer simulations supporting the view that learning should balance instrumental and epistemic imperatives (as exemplified by active inference); and in favor of action-oriented models.

I have some comments on the theoretical part and the simulations, which I hope will help improve the manuscript.

Major comments

- The paper does not report how many control states are used for the simulation (nor the grid size, number of transition functions and parameters). By reading the paper I had the impression that only 2 control states were used. However, Appendix 6 mentions that the simulation is the same as the main text but with a bigger grid (15x15); and in this case 255 control states were used (which is somewhat odd, given that at any moment the agent can select between run and tumble; is this correct?). Was the same method (one control state for each grid position) used also in the main simulation? And in that case, how did the agent know about its grid position? This point also has some theoretical implications: if so many control states are necessary, the idea that action-oriented models are more parsimonious than models encoding (for example) position is questionable.

- It would be interesting to discuss (or even better to simulate, but this is fully optional) how well learned models afford transfer learning. For example, how fast the different agents readapt to changes of prior preferences for gradients. This is important as for many biological organisms preferences can change over time.

- In the light of the author's discussion about exploration and exploitation, the most useful comparison would be with an "epsilon-greedy instrumental" agent. Is the epsilon-greedy mechanism (which is much simpler than epistemic exploration) sufficient to prevent bad bootstrap in instrumental agents? This could be discussed or (optionally) simulated.

Minor comments

- The authors cast the problem as a POMDP. However, there is a one-to-one mapping between states and observations. Why not cast it as an MDP?

- The idea of representational inaccuracies is interesting. Is it fair to say that the inaccuracies that emerge in the transition functions are (just) ignorance, deriving from not having sampled some of the (infrequent or unselected) transitions?

- Page 7, around line 133. Please explain better the difference between control states and actions (and why the former are necessary in active inference but not in other frameworks), as these may sound confusing to most readers.

- Page 11, around line 117. The authors argue that "The goal is not, therefore, to construct a model that accurately captures the true causal structure underlying observations". This is correct, but it may be argued that "the goal is not necessarily" to do so. To the extent that knowing the true causal structure (or a close approximation) is useful, it could be learned by active inference agents.

- Page 11, around line 238. I think that for the sake of clarity the authors could reiterate that agent (i) is the "sum" of agents (ii) and (iii).

- Page 12, around line 268. Please explain "spatially rather than temporally"

- Figure 2 is interesting and useful. However, it (or its caption) should explicitly mention and explain all the parameters. There are some parameters like lambda and alpha that are not introduced in the caption.

- Figure 3 is well explained in the text. Despite this, it risks being confusing (for distracted readers) as the text mentions that agents can learn for 1000 time steps and the x axis also reports an interval between 0 and 1000 (but it is about learning steps) - hence some readers may conflate the two and believe that the x axis reports time steps for action and not learning. Maybe you could simply change the interval in the x axis?

- The reported results seem to show that the overall performance of (all) agents is not excellent. Could you comment on that?

- Page 16, around line 354. The authors mention that they "measure the accuracy of the expectation of the approximate posterior distribution". Please clarify why you measure the expectation.

- The (interesting) results reported in Figure 4 suggest that the EFE agent samples neg-neg transitions less often than the instrumental agent. Why is this?

- In Figure 4B-D, it would be useful to show the distribution over states of the "true model" (the same model used to measure model accuracy in Figure 4A). This would help make sense of some results; for example, the error of the instrumental agent is lower than that of the EFE agent; does this imply that the distributions it recovers (e.g., 53) are correct? Note also that these numbers are somewhat difficult to interpret given that (as mentioned above) details such as the size of the grid are not reported. Adding more details (and perhaps a figure) about the simulation scenario would make the results easier to interpret.

- Page 21, around line 485. The discussion of model complexity is interesting but could appear counterintuitive; after all, more parameters mean more complexity. It may be useful to remark that pruning is necessary in this kind of model.

- Page 22, around line 519. It would be useful to mention that the agent used in the simulation is the EFE agent.

Reviewer #2: Review of: Learning action-oriented models through active inference

I have read this paper with great interest. It is one of the best computational papers on the free-energy principle I have seen. I would like to thank the authors for their clarity and thorough analysis of the system under scrutiny. I very much like the didactic nature of the paper. The argumentative steps, developed along with an explanation of the mathematical constructs, make for a very technical, yet highly readable paper. The chemotaxis model serves just as a toy model; precisely because of its simplicity, it is able to make clear a number of potentially counterintuitive aspects of the behaviour of expected-free-energy-minimizing agents.

I take the main result of the paper to be that EFE agents learn qualitatively different and systematically biased models of their environment, which from a representational perspective are unexpected, but from an action-oriented and embodied perspective are less counterintuitive. Furthermore, learning is quite often unaddressed in the FEP literature, but it plays a key role in this paper.

I have no major points of disagreement with the authors; below are a number of smaller points that I think could serve to improve the paper:

As an overall comment, some of the figures were very difficult to read (for example 4B). I could follow the narrative in the paper, but in these cases the figures were not very helpful.

p.8: The authors rightfully point out that active inference “proposes that an agent’s generative model is biased towards favourable states of affairs”. However, in the current simulation these “prior preferences” are taken as a given. Given that learning takes centre stage in their paper, and given that the authors spell out all the other argumentative steps, perhaps they can say something about how these preferences are acquired (learning seems not to be an option here, especially because of the bad-bootstrapping kind of scenarios the authors point out).
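
Just to make explicit what "taken as a given" amounts to, here is a minimal sketch of a fixed preference distribution over the two gradient observations (the indexing and the numbers are purely illustrative and not taken from the paper):

import numpy as np

# Hand-specified prior preferences over observations
# (index 0 = positive gradient, index 1 = negative gradient)
log_preferences = np.array([2.0, 0.0])    # illustrative values
preferences = np.exp(log_preferences)
preferences /= preferences.sum()
print(preferences)                        # ~[0.88, 0.12]: positive gradients are treated as the 'expected' outcome

My question is where a vector like this would come from if it is not itself learned.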

p.14: The authors introduce four learning strategies. At first I thought that the results of the paper were trivial because, for example, an epistemic agent is explicitly not made to seek out the goal state. But since these strategies are only employed in the learning stage and not in the evaluation stage, this worry was unnecessary. Still, the authors might say something about why specifically these models were chosen, and whether the results they obtained were anticipated.

p.17: “In contrast, the distributions sampled by the expected free energy and instrumental strategies are heavily biased towards a running-induced transition from positive to positive gradients. This is the transition that occurs when an agent is ‘running up the chemical gradient’, i.e., performing chemotaxis. The bias means that the remaining state transitions are sampled less, relative to the epistemic and random strategies.”

I found it hard to parse these few sentences. I am not sure whether the “positive” to “positive” is a mistake (or whether a different meaning of transition is implied here). It might be good to briefly explain the idea of a state transition in the paper: it carries the connotations of state transitions in physics, while what seems to be implied here is a qualitative change in behaviour (i.e., tumbling or running).
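
In case it helps the rewrite, this is how I would gloss a "state transition" in this setting, with purely invented numbers (none of them come from the paper): the states are the gradient signs, and the control state (run or tumble) determines how likely each sign is to follow the current one.

import numpy as np

# Illustrative transition probabilities p(s_next | s_prev, u) over the two gradient
# states (index 0 = positive gradient, index 1 = negative gradient);
# rows index s_next, columns index s_prev
B_run = np.array([[0.9, 0.3],
                  [0.1, 0.7]])      # running tends to preserve the current gradient sign
B_tumble = np.array([[0.5, 0.5],
                     [0.5, 0.5]])   # tumbling reorients, so the next sign is roughly unpredictable

# The "run-induced transition from positive to positive gradients" is the single
# entry B_run[0, 0]: what changes (or not) is the gradient state, not the behaviour.
print(B_run[0, 0])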

p.26: I think the “goal-directed exploration” point, in contrast to more traditional solutions to the explore-exploit trade-off, can be highlighted a bit more. Does EFE provide a more principled or integrated solution to this trade-off than more discrete ones (i.e., those in which there is an irreversible switch from exploring to exploiting)?
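
To sharpen what I mean by "integrated": as I understand it (cf. S3 Appendix), expected free energy scores every control state by the sum of its instrumental and epistemic value, so the trade-off is renegotiated on every step rather than resolved by a one-off switch. A minimal sketch, with names of my own choosing and a softmax action rule that may differ in detail from the authors' implementation:

import numpy as np

def efe_action_selection(instrumental_value, epistemic_value, precision=1.0, rng=None):
    """Sample a control state from a softmax over summed instrumental and epistemic value."""
    rng = rng or np.random.default_rng()
    scores = np.asarray(instrumental_value) + np.asarray(epistemic_value)  # additive combination
    probs = np.exp(precision * (scores - scores.max()))                    # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(scores), p=probs))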

I think the authors are very well on top of the literature, and I appreciate them engaging with the more theoretical/philosophical literature out there. I think the central idea of the paper can very well be connected to what Bruineberg, Kiverstein and Rietveld (2016) call the “crooked scientist”: the EFE agent (1) has non-negotiable preferences for the sensations it expects/wishes to encounter and (2), through this, explores only that part of state space that is relevant for its goal-directed activities. This second point really comes out nicely in the authors’ paper and is a great continuation of the “crooked scientist” line of thinking. Although that paper does not discuss expected free energy, a follow-up paper, Bruineberg et al. (2018), does. I think the points made there (in the context of niche construction) are very much in line with and complementary to the current paper.

References:

Bruineberg, J., Kiverstein, J., & Rietveld, E. (2018). The anticipating brain is not a scientist: the free-energy principle from an ecological-enactive perspective. Synthese, 195(6), 2417-2444.

Bruineberg, J., Rietveld, E., Parr, T., van Maanen, L., & Friston, K. J. (2018). Free-energy minimization in joint agent-environment systems: A niche construction perspective. Journal of theoretical biology, 455, 161-178.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007805.r003

Decision Letter 1

Natalia L Komarova

19 Mar 2020

Dear Mr Tschantz,

We are pleased to inform you that your manuscript 'Learning action-oriented models through active inference' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Natalia L. Komarova

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1007805.r004

Acceptance letter

Natalia L Komarova

15 Apr 2020

PCOMPBIOL-D-19-01647R1

Learning action-oriented models through active inference

Dear Dr Tschantz,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Rearrangements of the free energy functional.

    In this appendix, we provide derivations for three arrangements of the free energy functional.

    (PDF)

    S2 Appendix. Derivation of expected free energy.

    In this appendix, we formally describe the relationship between free energy and expected free energy.

    (PDF)

    S3 Appendix. Deriving instrumental and epistemic value.

    In this appendix, we decompose expected free energy into instrumental and epistemic value.

    (PDF)

    S4 Appendix. Relationship of epistemic value to established formalisms.

    In this appendix, we demonstrate that epistemic value—a component of expected free energy—is equivalent to a number of established formalisms.

    (PDF)

    S5 Appendix. Deriving the variational update equations for inference, learning and action.

    In this appendix, we derive the update equations for beliefs about hidden states, control states and model parameters.

    (PDF)

    S6 Appendix. Learning in a high-dimensional state space.

    In this appendix, we present results for an additional experiment where we compare learning under epistemic and random action strategies in a high-dimensional state space.

    (PDF)

    Attachment

    Submitted filename: Response.pdf

    Data Availability Statement

    The Python code used to simulate the models and the simulation data are available at https://github.com/alec-tschantz/action-oriented.

