Published in final edited form as: Adv Neural Inf Process Syst. 2020 Dec;33:7898–7909.

Inverse Rational Control with Partially Observable Continuous Nonlinear Dynamics

Minhae Kwon 1, Saurabh Daptardar 2, Paul Schrater 3, Xaq Pitkow 4

Abstract

A fundamental question in neuroscience is how the brain creates an internal model of the world to guide actions using sequences of ambiguous sensory information. This is naturally formulated as a reinforcement learning problem under partial observations, where an agent must estimate relevant latent variables in the world from its evidence, anticipate possible future states, and choose actions that optimize total expected reward. This problem can be solved by control theory, which allows us to find the optimal actions for a given system dynamics and objective function. However, animals often appear to behave suboptimally. Why? We hypothesize that animals have their own flawed internal model of the world, and choose actions with the highest expected subjective reward according to that flawed model. We describe this behavior as rational but not optimal. The problem of Inverse Rational Control (IRC) aims to identify which internal model would best explain an agent’s actions. Our contribution here generalizes past work on Inverse Rational Control which solved this problem for discrete control in partially observable Markov decision processes. Here we accommodate continuous nonlinear dynamics and continuous actions, and impute sensory observations corrupted by unknown noise that is private to the animal. We first build an optimal Bayesian agent that learns an optimal policy generalized over the entire model space of dynamics and subjective rewards using deep reinforcement learning. Crucially, this allows us to compute a likelihood over models for experimentally observable action trajectories acquired from a suboptimal agent. We then find the model parameters that maximize the likelihood using gradient ascent. Our method successfully recovers the true model of rational agents. This approach provides a foundation for interpreting the behavioral and neural dynamics of animal brains during complex tasks.

1. Introduction

Brains evolved to understand, interpret, and act upon the physical world. To thrive and reproduce in a harsh and dynamic natural environment, brains therefore evolved flexible, robust controllers. To serve as such a controller, the brain's fundamental function is to organize sensory data into an internal model of the outside world. Animals never have access to complete information about the world; they receive only partial and noisy observations of it. The brain must therefore build its own internal model, which necessarily includes uncertainty about the outside world, and base its decisions upon that model [1]. However, we hypothesize that this internal model is not always correct, yet animals still behave rationally — meaning that they act optimally according to their own internal model of the world, which may differ from the true world.

The goal of this paper is to identify the internal model of an agent by observing its actions. Unlike Inverse Reinforcement Learning (IRL) [2, 3, 4], which aims to learn only the reward function of the target agent, or Inverse Optimal Control (IOC) [5, 6], which infers only an unknown dynamics model, we use Inverse Rational Control (IRC) [7] to infer both. Since we consider neuroscience tasks that involve naturalistic controls and the complex physics of the world, we substantially extend past work [7] to continuous state, action, and parameter spaces with nonlinear dynamics. We parameterize nonlinear task dynamics and reward functions based on a physics model, such that the family of tasks shares an overall structure but differs in its model parameters. In our framework, an experimentalist can observe the state of the environment and the actions taken by the agent, but cannot observe the agent's internal model, including its observations and beliefs. IRC infers this latent internal information of the agent from the data observable by the experimentalist.

The task is formulated as a Partially Observable Markov Decision Process (POMDP) [8, 9], a powerful framework for modeling agent behavior under uncertainty. To model an animal's cognitive process, in which decision-making is based on the animal's own beliefs about the world, we reformulate the POMDP as a belief Markov Decision Process (belief MDP) [10, 11]. The agent builds its belief (i.e., its posterior distribution over world states) from partial, noisy observations and its internal model, and bases its decisions on that belief.

We construct a Bayesian agent to learn optimal policies and value functions over an entire parameterized family of models, which can be viewed as an optimized ensemble of agents each dedicated to one task. This then allows us to maximize the likelihood of the state-action trajectories generated by a target agent, by finding which parameters from the ensemble best explain the target agent’s data.

The main contributions of this paper are the following. First, our work finds both the reward function and the internal dynamics model simultaneously in continuous nonlinear tasks. Note that continuous nonlinear dynamical systems are the most general form of task, so discrete and/or linear systems are straightforward special cases for the proposed approach. Second, we propose a novel approach to implement Bayesian optimal control ensembles, including a belief representation and a belief-updating method that use estimators with constrained representational capacity (e.g., an extended Kalman filter). This allows us to build an algorithm that imitates the bounded rational cognitive process of the brain [12] and to perform belief-based decision-making. Lastly, we propose a novel approach to IRC that combines Maximum Likelihood Estimation (MLE) and Monte Carlo Expectation-Maximization (MCEM). This method infers the reward function and internal model parameters of the target agent by maximizing the likelihood of its state-action trajectories under the assumption of rationality, while marginalizing over latent sensory observations. Importantly, this is possible because we train ensembles of agents over the entire parameter space using flexible function approximators. To the best of our knowledge, our work is the first to infer both the reward and the internal model of an unknown agent under partially observable continuous nonlinear dynamics.

2. Related Work

Inverse reinforcement learning (IRL).

The goal of IRL or imitation learning is to learn a reward function or a policy from expert demonstrations, and the goal of Inverse Optimal Control (IOC) is to infer an unknown dynamics model. Both approaches solve aspects of the general problem of inferring the internal model of an observed agent. For example, IRL works such as [13, 14, 15, 16] formulate optimization problems that find the features of a reward or cost function that best explain the target agent's state-action trajectories. Specifically, [13] finds reward features by solving a linear programming problem, and [15] uses a quadratic programming method to learn mappings from features to a cost function. In addition, [17] brings the principle of maximum entropy [18] to IRL, so that the solution is as random as possible while still matching reward features; this guards against the worst-case policy [19, 20]. Another stream of IRL is imitation learning [21, 22, 23, 24]. Typical IRL approaches use a two-step process: first learn the expert's reward function, and then train a policy with the learned reward. Because this can be slow, [21] directly extracts a policy from data. Across all of these methods, there is no complete inverse solution that can learn how an agent models rewards, dynamics, and uncertainty in a partially observable task with continuous nonlinear dynamics and continuous controls.

Meta reinforcement learning (Meta RL).

The fundamental objective of Meta RL is to learn new skills or adapt to a new environment rapidly from a few training experiences. To adapt efficiently to new tasks or environments, some Meta RL works try to infer the tasks or the meta-parameters that govern the general task. For example, optimization-based Meta RL works such as [25, 26, 27, 28] include a so-called 'outer loop' that optimizes the meta-parameters. In this sense, Meta RL is related to our goal, since both lines of work aim to infer task parameters, although we use this parameterization to explain the actions of an agent. However, few studies consider a partially observable setting for the agent. [29] combines the POMDP framework with meta-learning, but the partially observable information is the task, not the state. [30] also takes a Bayesian approach to meta-learning, but uses Bayesian reasoning to infer unseen tasks and learn them quickly. Therefore, our paper differs from other Meta RL works in both its task structure and its goal. We allow partial observability of the world state, as occurs naturally in an animal's decision-making process. More fundamentally, the goal of our work is not to find smarter agents, but rather to infer the internal model of an existing agent and to explain its behaviors.

Neuroscience and cognitive science.

Neuroscientists aim to understand how the brain selects actions based on noisy sensory information and incomplete knowledge of the world. The Bayesian brain hypothesis [31] has been proposed to explain the brain's function in terms of Bayesian inference over probabilistic representations of the animal's internal state. Several studies propose mechanisms by which neurons could implement optimal behaviors despite pervasive uncertainty [11, 32, 33]. Despite the utility of behavioral benchmarks based on optimality, animals often appear to behave suboptimally. Such suboptimality might come from a wrong internal model [7, 34] induced by the animal's subjective prior beliefs [35, 36], or from suboptimal inference [37]. The main goal of this paper is to infer the internal model of suboptimal agents using state-of-the-art deep reinforcement learning techniques, and thereby to provide a theoretical tool for interpreting behavioral and neural data obtained from ongoing neuroscience research.

For this reason, we test our approach by simulating an existing neuroscience task called 'catching fireflies' [38, 39], which is complex enough to require a sophisticated internal model, yet restrictive enough that animals can learn it and models of this behavior can be adequately constrained with feasible data volumes. Ultimately we will apply our approach to understand the internal models of behaving animals, for which we do not know the ground truth. Before doing that, it is important to validate the method on simulated agents for which we do know the ground truth. A similar recent effort to build an AI-relevant testbed for animal cognition and behavior is presented in [40, 41].

3. Bayesian Optimal Control Ensembles

Our method can be viewed as a search over an ensemble of agents, each optimally trained on a different POMDP task, to find which of these agents best explains the experimentally observed behavior of a target agent. The experimentalist is an external observer who has information about the world states and agent actions, but not about the agent's internal model, noisy sensory observations, or beliefs.

3.1. Belief Markov Decision Process and optimal control

A POMDP is defined as a tuple M = (S, A, Ω, R, T, O, γ) that includes states s_t ∈ S, actions a_t ∈ A, observations o_t ∈ Ω, a reward function R(s_t, a_t, s_{t+1}; θ), state transition probabilities T(s_{t+1}|s_t, a_t; θ), observation probabilities O(o_t|s_t; θ) at time t, and a temporal discount factor γ. Here, θ ∈ Θ denotes a vector of model parameters defining the rewards, state transitions, and observations, and the state space S and action space A are continuous. Thus, θ parameterizes a POMDP family. A graphical model of a POMDP is presented in Figure 1.

Figure 1: Graphical model of a POMDP. Solid circles denote variables observable to an experimentalist, and empty circles denote latent variables.
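To make this parameterization concrete, the following is a minimal sketch of how a θ-parameterized POMDP family could be represented in code. The additive dynamics, Gaussian noise, and stop-near-the-target reward are illustrative assumptions for exposition, not the paper's exact task models; only the interface (transition, observation, and reward functions indexed by θ) mirrors the definitions above.

```python
# Minimal sketch (not the paper's code): a theta-parameterized POMDP family.
# The concrete forms below (additive dynamics, Gaussian noise, stop-near-goal
# reward) are illustrative assumptions; only the theta-indexed interface
# reflects the definitions in the text.
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDPFamily:
    gamma: float = 0.95                              # temporal discount factor

    def transition(self, s, a, theta, rng):
        """Sample s_{t+1} ~ T(. | s_t, a_t; theta)."""
        process_noise = theta[0]                     # assumed meaning of theta[0]
        return s + a + process_noise * rng.standard_normal(s.shape)

    def observe(self, s, theta, rng):
        """Sample o_t ~ O(. | s_t; theta); this noise is private to the agent."""
        obs_noise = theta[1]                         # assumed meaning of theta[1]
        return s + obs_noise * rng.standard_normal(s.shape)

    def reward(self, s, a, s_next, theta):
        """R(s_t, a_t, s_{t+1}; theta): reward for stopping near a target at the origin."""
        goal_radius = theta[2]                       # assumed meaning of theta[2]
        stopped = np.linalg.norm(a) < 1e-3
        return float(stopped and np.linalg.norm(s_next) < goal_radius)

# One member of the task family, indexed by a sampled parameter vector theta.
rng = np.random.default_rng(0)
theta = np.array([0.1, 0.3, 0.5])
env = POMDPFamily()
s, a = rng.standard_normal(2), np.zeros(2)
s_next = env.transition(s, a, theta, rng)
o = env.observe(s_next, theta, rng)
```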

The state s_t is a representation of the environment, which can live in a high-dimensional space. It may be fully accessible to the experimentalist but not to the agent. The agent receives an observation o_t of the environment in state s_t, which is a partial and noisy version of that state. Because of this partial observability, the dimension of o_t may be lower than the dimension of s_t. The observation process is modeled by the observation function O(o_t|s_t; θ). Note that any noise added between state and observation is internal to the agent, i.e., it arises within the agent's nervous system. Because of this noise, the observation is known only to the agent, and the experimentalist can never access it directly.

Based on its observations and actions up to time t, a rational agent builds a posterior distribution B(s_t|o_{1:t}, a_{1:t−1}; θ) over the world state given the history of observations and actions, and it bases its actions upon that posterior. In practice this posterior is summarized by a belief b_t, defined as sufficient statistics for the posterior distribution over states, i.e., B(s_t|b_t) = B(s_t|o_{1:t}, a_{1:t−1}; θ). In principle, a belief b_t over a general continuous state could be infinite-dimensional, but we assume that the belief is continuous and finite-dimensional. Let B(s_t|b_t) be the probability that the environment is in state s_t when the agent's belief is b_t. By the Markov property, b_t is determined by b_{t−1}, a_{t−1}, and o_t, so B(s_t|b_t) can be calculated as follows:

B(s_t | b_t) = B(s_t | b_{t−1}, a_{t−1}, o_t; θ) (1)
= (1/Z) O(o_t | s_t; θ) ∫ ds_{t−1} T(s_t | s_{t−1}, a_{t−1}; θ) B(s_{t−1} | b_{t−1}) (2)

where Z = ∫ ds_t O(o_t | s_t; θ) ∫ ds_{t−1} T(s_t | s_{t−1}, a_{t−1}; θ) B(s_{t−1} | b_{t−1}) is a normalizing constant. In general this recursion is intractable, so we approximate it under tractable model assumptions, as we do in our application below. By replacing the state of the environment with the belief of the agent, the POMDP can be reformulated as a belief MDP, and the optimal policy can be found by applying well-known MDP solvers [42, 43, 44, 45] to the fully observed belief state.
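As a concrete illustration of the recursion in (1)–(2), the sketch below approximates the belief with weighted samples (a particle-style update); the paper's own implementation instead uses the extended Kalman filter described in Section 3.2. The Gaussian observation likelihood and the reuse of the POMDPFamily sketch above are illustrative assumptions.

```python
# Sketch of the belief recursion (1)-(2) with a particle approximation.
# Reuses the POMDPFamily sketch above; the Gaussian observation likelihood
# and multinomial resampling are illustrative choices, not the paper's method.
import numpy as np

def belief_update(particles, weights, a_prev, o, env, theta, rng):
    """One step of B(s_t|b_t) ∝ O(o_t|s_t) ∫ ds_{t-1} T(s_t|s_{t-1},a_{t-1}) B(s_{t-1}|b_{t-1})."""
    # Predict: push each particle through the assumed transition model T.
    predicted = np.stack([env.transition(p, a_prev, theta, rng) for p in particles])
    # Correct: reweight by the assumed Gaussian observation likelihood O(o|s; theta).
    obs_noise = theta[1]
    log_lik = -0.5 * np.sum((o - predicted) ** 2, axis=1) / obs_noise ** 2
    w = weights * np.exp(log_lik - log_lik.max())
    w /= w.sum()                                     # division by the normalizer Z
    # Resample so the particle set stays evenly weighted.
    idx = rng.choice(len(predicted), size=len(predicted), p=w)
    return predicted[idx], np.full(len(predicted), 1.0 / len(predicted))
```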

The optimal policy π*(a_t|b_t; θ) defines how the agent chooses an action a_t* that maximizes the temporally discounted total expected future reward, given the current belief b_t and internal model θ. This defines the Q-value Q(b_t, a_t; θ) as a belief-action value:

Q(b_t, a_t; θ) = ∫ db_{t+1} T̄(b_{t+1} | b_t, a_t; θ) ( R̄(b_t, a_t, b_{t+1}; θ) + γ max_{a′} Q(b_{t+1}, a′; θ) ) (3)

where T̄(b_{t+1} | b_t, a_t; θ) is the belief transition probability and R̄(b_t, a_t, b_{t+1}; θ) is the reward as a function of beliefs, defined as follows:

T̄(b_{t+1} | b_t, a_t; θ) = ∫ ds_t ds_{t+1} do_{t+1} B(s_t | b_t) T(s_{t+1} | s_t, a_t; θ) O(o_{t+1} | s_{t+1}; θ) p(b_{t+1} | b_t, a_t, o_{t+1}; θ) (4)
R̄(b_t, a_t, b_{t+1}; θ) = ∫ ds_t ds_{t+1} B(s_t | b_t) B(s_{t+1} | b_{t+1}) R(s_t, a_t, s_{t+1}; θ)

In (4), the belief update is expressed in a generalized form p(b_{t+1}|b_t, a_t, o_{t+1}; θ) that allows deterministic optimal belief updates, and can also account for other constraints on the inference process, including stochasticity.

The optimal action from a belief state is then given by a deterministic policy π*(a_t|b_t; θ) = δ(a_t* = arg max_a Q(b_t, a; θ)). With continuous belief and action spaces, it is hard both to compute the optimal Q-function and to maximize it. Thus, we approximate both using neural networks.

3.2. Training Bayesian optimal control ensembles with partial noisy observations

To successfully design and train an ensemble of agents, we identify three major challenges and provide solutions.

First, how can we construct an optimal control ensemble that can solve a family of tasks? As discussed, each task is parameterized by the model parameters θ ∈ Θ, such that the family of tasks shares a model structure but differs in its model parameters. We use these model parameters as an additional input to flexible function approximators (neural networks) that estimate values and policies (Critic and Actor), so the agent can be trained over the whole parameter space. As presented in Figure 2, the Critic and Actor both take the parameter vector θ as an input, and respectively compute the Q-value and the best action for the task with that θ.

Figure 2: A block diagram of Algorithm 1.
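To make the first point concrete, the sketch below shows Actor and Critic networks that receive the task parameters θ as an extra input alongside a vector summary of the belief. The architecture and layer sizes are arbitrary illustrative choices, not the networks used in the paper.

```python
# Sketch of theta-conditioned Actor and Critic networks (PyTorch). The belief
# is assumed to be summarized as a fixed-size vector (e.g., Gaussian mean and
# covariance entries); hidden sizes are arbitrary.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, belief_dim, theta_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(belief_dim + theta_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())   # bounded continuous action

    def forward(self, belief, theta):
        return self.net(torch.cat([belief, theta], dim=-1))

class Critic(nn.Module):
    def __init__(self, belief_dim, theta_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(belief_dim + theta_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                       # scalar Q(b, a; theta)

    def forward(self, belief, theta, action):
        return self.net(torch.cat([belief, theta, action], dim=-1))
```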

Second, how should we represent and update the agent's belief? For our concrete example application below, we use an extended Kalman filter [46] to provide a tractable Gaussian approximation of the belief state and its nonlinear dynamics. The resulting belief update is deterministic, p(b_{t+1}|b_t, a_t, o_{t+1}; θ) = δ(b_{t+1} = f(b_t, a_t, o_{t+1}; θ)). Tests with more flexible particle filters showed that this approximation is reasonable in our target application. For other applications, different belief representations and dynamics may be more accurate [47], and in principle a family of agents could use representation learning [48].
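A minimal sketch of such an extended-Kalman-filter belief update, b_{t+1} = f(b_t, a_t, o_{t+1}; θ), is shown below. The dynamics map g, observation map h, their Jacobians G and H, and the noise covariances Q and R are placeholders that a specific θ-parameterized task model would supply; only the EKF structure itself is shown.

```python
# Sketch of a deterministic EKF belief update b_{t+1} = f(b_t, a_t, o_{t+1}; theta).
# g, h are the (assumed) dynamics and observation maps; G, H return their
# Jacobians at a point; Q, R are process and observation noise covariances.
import numpy as np

def ekf_belief_update(mean, cov, a, o, g, G, h, H, Q, R):
    # Predict under the agent's internal dynamics model.
    mean_pred = g(mean, a)
    Gm = G(mean, a)
    cov_pred = Gm @ cov @ Gm.T + Q
    # Correct with the privately observed noisy observation o.
    Hm = H(mean_pred)
    S = Hm @ cov_pred @ Hm.T + R
    K = cov_pred @ Hm.T @ np.linalg.inv(S)           # Kalman gain
    mean_new = mean_pred + K @ (o - h(mean_pred))
    cov_new = (np.eye(len(mean)) - K @ Hm) @ cov_pred
    return mean_new, cov_new                         # Gaussian belief b_{t+1}
```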

Lastly, how should we train a rational model agent ensemble with continuous belief and action spaces? Here we use a model-free deep reinforcement learning algorithm, Deep Deterministic Policy Gradient (DDPG) [49]. This method approximates the value function over continuous belief states, actions, and task parameters with a single neural network (the Critic), and uses it to train a policy network (the Actor), which also receives the current belief and task parameters as inputs. Viable alternatives for continuous control in the deep reinforcement learning literature include [50, 51, 52, 53].

The training process for the optimal control ensemble is summarized in Algorithm 1, and a block diagram is provided in Figure 2. The agent is trained on simulated experience. Given the belief b_t and parameters θ, the Actor returns the best action a_t. As the agent performs the action a_t, the world state changes to s_{t+1} according to the state transition probabilities T. The reward R from the world is given to the agent and fed back to the Critic to obtain a better estimate of the Q-value, which in turn improves the Actor's action selection. From the new state s_{t+1}, the agent receives a partial and noisy observation o_{t+1} according to the observation probabilities O. Then the Gaussian belief state is updated by the extended Kalman filter f, b_{t+1} = f(b_t, a_t, o_{t+1}; θ). A new action a_{t+1} is selected by the Actor, and these steps are iterated until the neural networks are fully trained. During training, we sample new model parameters θ every episode, so the agent experiences the entire space of tasks and thus generalizes better over that space.

Algorithm 1: Training Bayesian optimal control ensembles (pseudocode reproduced as an image in the published version).
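The sketch below illustrates the overall structure of this training loop: sample a new θ each episode, roll out under the belief dynamics, and update the Critic and Actor in a DDPG-like fashion. It reuses the Actor/Critic sketch above, assumes a caller-supplied rollout function that yields (belief, action, reward, next belief) tuples as NumPy arrays, and omits the replay buffer, target networks, and exploration noise that a full DDPG implementation would include.

```python
# Structural sketch of the ensemble training loop (cf. Algorithm 1), with a
# simplified one-step actor-critic update in place of full DDPG machinery.
import torch
import numpy as np

def train_ensemble(env, actor, critic, theta_sampler, rollout,
                   gamma=0.95, episodes=1000, lr=1e-3):
    opt_a = torch.optim.Adam(actor.parameters(), lr=lr)
    opt_c = torch.optim.Adam(critic.parameters(), lr=lr)
    for _ in range(episodes):
        theta = theta_sampler()                      # new task parameters every episode
        th = torch.as_tensor(theta, dtype=torch.float32)
        for b, a, r, b_next in rollout(env, actor, theta):
            b, a, b_next = (torch.as_tensor(x, dtype=torch.float32)
                            for x in (b, a, b_next))
            # Critic: one-step TD target using the current Actor (no target nets).
            with torch.no_grad():
                target = r + gamma * critic(b_next, th, actor(b_next, th))
            critic_loss = (critic(b, th, a) - target).pow(2).mean()
            opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
            # Actor: ascend the Critic's estimate of Q(b, Actor(b, theta); theta).
            actor_loss = -critic(b, th, actor(b, th)).mean()
            opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```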

4. Inverse Rational Control with Maximum Likelihood Estimation

Once an agent ensemble is fully trained over the entire parameter space, we can use this ensemble to find the internal model parameters of the best-fitting rational agent in that model family. We solve the continuous Inverse Rational Control problem by finding the parameters θ that have the highest likelihood for explaining an agent’s measured behavior.

4.1. Discrepancy between the true world and internal model

Recall that our core hypothesis is that animals have their own internal model of the world, which may not always be correct, but that they still behave rationally, choosing actions with the highest expected subjective reward according to that internal model. We must therefore distinguish between two kinds of model parameters: the true parameters ϕ, which determine the world dynamics and are known to the experimentalist, and the agent's internal model parameters θ, which are latent to the experimentalist but govern all of the agent's cognitive processes (Figure 3). The world parameters ϕ govern the world dynamics, namely the state transition probability T(s_{t+1}|s_t, a_t; ϕ) and the reward function R(s_t, a_t, s_{t+1}; ϕ). The internal parameters θ govern the agent's internal processes, namely the observation probability O(o_t|s_t; θ), the belief transition probability T̄(b_{t+1}|b_t, a_t; θ), and the subjective reward as a function of beliefs R̄(b_t, a_t, b_{t+1}; θ), leading to a subjective belief update probability p(b_{t+1}|b_t, a_t, o_{t+1}; θ) and a rational policy π(a_t|b_t; θ).

Figure 3: An illustrative explanation of the model discrepancy. The solid lines and circles are governed by the true world parameters ϕ, which are known to the experimentalist. The dashed lines and empty circles are governed by the internal model parameters θ, which are latent to the experimentalist and may differ from ϕ.
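To make this separation concrete, a simulated trial might look like the following sketch: the world evolves under the true parameters ϕ, while observations, belief updates, and actions are generated under the agent's internal parameters θ. It reuses the earlier environment sketch; actor and belief_update stand for any callables implementing the policy and belief dynamics, and all names are illustrative.

```python
# Sketch of data generation: the experimentalist records states and actions,
# while the world uses phi and the agent's observation/belief/policy use theta.
def simulate_trial(env, actor, belief_update, phi, theta, s0, b0, T, rng):
    s, b = s0, b0
    states, actions = [s0], []
    for t in range(T):
        a = actor(b, theta)                          # rational policy under theta
        s_next = env.transition(s, a, phi, rng)      # true world dynamics use phi
        o = env.observe(s_next, theta, rng)          # private noisy observation (theta)
        b = belief_update(b, a, o, theta)            # internal belief update (theta)
        states.append(s_next); actions.append(a)
        s = s_next
    return states, actions                           # what the experimentalist can record
```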

4.2. Inferring internal model parameter θ

To find the internal model parameters θ that maximize the log-likelihood of the experimentally observable data (s, a)_{1:T}, θ̂ = arg max_θ ln p(s_{1:T}, a_{1:T} | ϕ, θ), we use the Monte Carlo Expectation Maximization (MCEM) algorithm [54] to marginalize the complete-data log-likelihood over the latent observations o_{1:T} and beliefs b_{1:T}. This yields an iterative algorithm, which repeatedly maximizes

θ̂_{k+1} = arg max_θ ∫ do_{1:T} db_{1:T} p(o_{1:T}, b_{1:T} | s_{1:T}, a_{1:T}; θ_k) ln p(s_{1:T}, o_{1:T}, b_{1:T}, a_{1:T} | ϕ, θ) (5)
≈ arg max_θ (1/L) Σ_{l=1}^{L} ln p(s_{1:T}, o_{1:T}^{(l)}, b_{1:T}^{(l)}, a_{1:T} | ϕ, θ) (6)

where the sum is over samples (o^{(l)}, b^{(l)})_{1:T} drawn from the posterior distribution p(o_{1:T}, b_{1:T} | s_{1:T}, a_{1:T}; θ_k) determined by the parameters θ_k from the previous iteration.

The log-likelihood of the complete data (including the l-th samples of observations and beliefs drawn under parameters θ_k) can be decomposed using the Markov property into

ln p(s_{1:T}, o_{1:T}^{(l)}, b_{1:T}^{(l)}, a_{1:T} | ϕ, θ) (7)
= ln p(s_0, o_0^{(l)}, b_0^{(l)}, a_0) + Σ_{t=1}^{T} ( ln T(s_t | s_{t−1}, a_{t−1}; ϕ) + ln O(o_t^{(l)} | s_t; θ) + ln p(b_t^{(l)} | b_{t−1}^{(l)}, a_{t−1}, o_t^{(l)}; θ) + ln π(a_t | b_t^{(l)}; θ) ). (8)

Note that the only terms depending on the agent's parameters θ are the latent observation probabilities, the belief dynamics, and the policy; when we optimize over θ, all other terms vanish. Moreover, since we use deterministic belief updates, the belief-update term in (8) is also independent of θ when evaluated on sampled beliefs. The only terms that survive are

θ̂ = arg max_θ Σ_{l=1}^{L} Σ_{t=1}^{T} ( ln O(o_t^{(l)} | s_t; θ) + ln π(a_t | b_t^{(l)}; θ) ). (9)

To optimize (9), we use gradient ascent over the parameter space, θ ← θ + α ∇_θ L(θ), with learning rate α. This procedure is summarized in Algorithm 2.

Algorithm 2: Inverse Rational Control via maximum likelihood estimation with MCEM (pseudocode reproduced as an image in the published version).
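A compact sketch of one iteration of this E-step/M-step loop is given below, assuming differentiable (PyTorch) implementations of the observation log-density ln O(o|s; θ) and policy log-density ln π(a|b; θ), plus a sampler of latent observations and beliefs consistent with the recorded states and actions. All function names are placeholders, not the paper's code.

```python
# Sketch of one MCEM iteration for IRC (cf. Algorithm 2): sample latent
# observation/belief trajectories under theta_k (E-step), then gradient-ascend
# the surviving terms of Eq. (9) with respect to theta (M-step).
import torch

def irc_step(theta, states, actions, sample_latents, log_obs_prob, log_policy,
             n_samples=16, lr=1e-2, inner_steps=50):
    theta = theta.clone().requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=lr)
    # E-step: Monte Carlo samples of (o, b) trajectories given (s, a) and theta_k.
    latents = [sample_latents(states, actions, theta.detach())
               for _ in range(n_samples)]
    # M-step: maximize the sampled approximation of Eq. (9) over theta.
    for _ in range(inner_steps):
        loglik = 0.0
        for obs, beliefs in latents:
            loglik = loglik + log_obs_prob(obs, states, theta).sum() \
                            + log_policy(actions, beliefs, theta).sum()
        loss = -loglik / len(latents)                # gradient ascent via minimizing -L
        opt.zero_grad(); loss.backward(); opt.step()
    return theta.detach()                            # theta_{k+1}
```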

5. Demonstration task: ‘Catching fireflies’

To verify the proposed method, we carefully select a relevant task. Our focus on continuous world states, actions, and beliefs makes standard RL testbeds (e.g., Nintendo games, MuJoCo) less suitable. Common tasks like gridworld or tiger lack these continuous properties and remain excessively small toy problems. Standard continuous control tasks do not involve partial observability, and tasks that do would likely generate beliefs that are substantially harder to interpret. In addition, there is a ready application to existing neuroscience experiments based on 'catching fireflies' in virtual reality [55, 39], which is complex enough to be interesting to animals, requires a continuous representation of uncertainty and continuous control, and yet remains tractable enough that we can assess the fidelity of the recovered beliefs.

In our task, an agent must navigate through a virtual environment to reach a transiently visible target, called the 'firefly' (Figure 4A). At the beginning of each trial, a firefly blinks briefly at a random location on the ground plane. The agent controls its forward and angular velocities to navigate freely through the world. If the agent stops sufficiently close to the now-invisible target, it receives a reward. As the agent moves, a sparse ground plane texture generates an optic flow pattern, a vector field of local image motion, which allows the agent to estimate its speed up to some perceptual uncertainty. However, the agent has no direct access to its current location, because the ground plane texture is transient and provides no spatial landmarks. Thus, the agent must integrate the optic flow to estimate its current position relative to the firefly target, as well as its uncertainty about that position.
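The snippet below gives an illustrative parameterization of these dynamics (unicycle kinematics driven by forward and angular velocity, with optic flow providing only a noisy velocity estimate, and a reward for stopping within a radius of the hidden target). It is meant only to make the state, observation, and reward structure concrete; the exact equations and noise models used in the experiments may differ.

```python
# Illustrative 'catching fireflies' dynamics, not the exact task equations:
# state = (x, y, heading) relative to a target at the origin, action = (v, w).
import numpy as np

def firefly_step(state, action, theta, dt, rng):
    """theta = (obs_noise, goal_radius) is an assumed, simplified parameterization."""
    x, y, heading = state
    v, w = action
    x += v * np.cos(heading) * dt                    # forward motion
    y += v * np.sin(heading) * dt
    heading += w * dt                                # angular velocity
    # Optic flow yields a noisy estimate of self-motion, not of position.
    obs = np.array([v, w]) + theta[0] * rng.standard_normal(2)
    stopped = abs(v) < 1e-3 and abs(w) < 1e-3
    reward = float(stopped and np.hypot(x, y) < theta[1])   # stop near the target
    return np.array([x, y, heading]), obs, reward
```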

Figure 4: A. An illustration of the 'catching fireflies' task from the agent's point of view. To reach the transiently visible firefly target, an agent must navigate by optic flow over a dynamically textured ground plane. The agent is rewarded if it stops close enough to the target location. B. Converging trajectory of IRC estimates of the agent's parameters θ. We use gradient ascent to find the θ that maximizes the approximate log-likelihood L(θ) in Algorithm 2. C. Successful recovery of individual agent parameters. The black line is the identity, meaning that true and estimated values are equal. Across all parameters, the proposed approach accurately recovers the agent's internal model parameters from limited data.

We demonstrate the efficacy of our approach using a simulated agent for which the ground truth is known, verifying the method by showing successful recovery of the internal model parameters. Note that there are no comparisons to alternative methods because no other algorithm solves the IRC problem in continuous state and action spaces. Figure 4B shows a two-dimensional contour plot of the approximate log-likelihood of the observable data, L(θ). Recall that the model parameters θ are high-dimensional, so here we plot only two dimensions of θ. The red line shows an example trajectory of the parameters θ as IRC Algorithm 2 converges. Our approach estimates the θ that maximizes the log-likelihood of the observable data L(θ). Figure 4C shows that the parameters recovered by our algorithm closely match the agent's true parameters.

6. Conclusion

This paper introduces a novel framework to infer the internal model of agents from their behaviors. We infer not only the subjective reward function of the agent, but we also simultaneously infer the task dynamics that the agent assumes. To accomplish this, we first train Bayesian optimal control ensembles that generalize over the space of task parameters. Since the target agent is only exposed to partial information about the world state, the agent chooses the best action based on its belief about the world and its assumptions about the task. By using this optimally trained agent ensemble, our approach to Inverse Rational Control with continuous state and action spaces can infer the internal model parameters that best explain the collected behavioral data. With a simulated agent where we know the ground truth, we confirm that our approach successfully recovers the internal models. This success encourages us to apply this method to real behavioral data as well as to new tasks and applications.

Broader Impact

We have implemented IRC for neuroscience applications, but the core principles have value in other fields as well. We can view IRC as a form of Theory of Mind, whereby one agent (a neuroscientist) creates a model of another agent's mind (that of a behaving animal). Theory of Mind is a prominent component of human social interactions, and imputing rational motivations to actions provides a useful description of how people think [56, 57, 58]. Using IRC to better understand people's motivations could yield important insights for understanding and improving social and political interactions, while also raising possible ethical concerns if used for manipulation. The design of agents that interact with humans would also benefit from being able to attribute rational strategies to others; for example, recent work uses a related approach to impute purpose to a neural network [16]. One important practical example is self-driving cars, which currently struggle with the perceived unpredictability of humans. While humans do indeed behave unpredictably, some of this perception may stem from ignorance of the rational computation that drives their actions. IRC provides a framework for interpreting agents, and serves as a valuable tool for understanding unifying principles of control.

Appendix. A Derivation of Monte Carlo Expectation Maximization (MCEM)

The derivation from (5) to (6) is based on MCEM; here we provide more details.

Let x be the observable data, z the latent variables, and θ the parameters that govern the process. The goal is to find the θ that maximizes the log-likelihood of the observable data:

θ* = arg max_θ ln p(x | θ)

The log likelihood of the observable data can be reformulated as follows.

ln p(x | θ) = ∫ dz q(z) ln p(x | θ)
= ∫ dz q(z) [ ln p(x, z | θ) − ln p(z | x, θ) ]
= ∫ dz q(z) [ ln p(x, z | θ) − ln q(z) + ln q(z) − ln p(z | x, θ) ]
= ∫ dz q(z) ln [ p(x, z | θ) / q(z) ] − ∫ dz q(z) ln [ p(z | x, θ) / q(z) ] (10)
= L(q, θ) + KL(q ‖ p) (11)

Since the KL divergence is always non-negative, L(q, θ) is a lower bound on ln p(x | θ). The complete-data log-likelihood ln p(x, z | θ) is easier to handle than the observed-data log-likelihood ln p(x | θ). Thus, instead of maximizing ln p(x | θ), we aim to maximize its lower bound L(q, θ) = ∫ dz q(z) ln [ p(x, z | θ) / q(z) ].

A.1 E-step

As KL(q ‖ p) gets smaller, the lower bound becomes tighter. If KL(q ‖ p) = 0, then ln p(x | θ) = L(q, θ). KL(q ‖ p) = 0 holds only if q = p, so q(z) = p(z | x, θ) from (10). This is the E-step of the EM algorithm [59]. Note that in this step q(z) is a function only of z; both x and θ are treated as given. We therefore write θ_old for the fixed parameters used to specify q(z). Once q(z) = p(z | x, θ_old) is substituted into L(q, θ) in (11), ln p(x | θ) can be expressed as follows.

ln p(x | θ) = L(q, θ) = ∫ dz p(z | x, θ_old) ln [ p(x, z | θ) / p(z | x, θ_old) ]
= ∫ dz p(z | x, θ_old) ln p(x, z | θ) − ∫ dz p(z | x, θ_old) ln p(z | x, θ_old)
= ∫ dz p(z | x, θ_old) ln p(x, z | θ) + H(z | x, θ_old)
= Q(θ, θ_old) + H(z | x, θ_old) (12)

A.2 M-step

Next, we want to find θ that maximizes ln p(x|θ). This is the M-step of the EM algorithm. Since H(z|x, θold) is a constant (i.e., not a function of θ),

θ* = arg max_θ ln p(x | θ) = arg max_θ Q(θ, θ_old) = arg max_θ ∫ dz p(z | x, θ_old) ln p(x, z | θ). (13)

If p(z | x, θ_old) is hard to obtain analytically, (13) can be approximated by a Monte Carlo approach. The resulting optimization is called the MCEM algorithm.

θ* = arg max_θ ∫ dz p(z | x, θ_old) ln p(x, z | θ) ≈ arg max_θ (1/L) Σ_{l=1}^{L} ln p(x, z^{(l)} | θ) (14)

where z^{(l)} is the l-th sample of the latent variable z.
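As a numerical sanity check of recipe (14) on a toy problem unrelated to the control task, the snippet below runs MCEM on a conjugate Gaussian model in which the E-step posterior is available in closed form; all quantities are illustrative.

```python
# Toy MCEM demo: latent z ~ N(mu, 1), observed x = z + N(0, 1); recover mu.
import numpy as np

rng = np.random.default_rng(0)
mu_true = 2.0
z = rng.normal(mu_true, 1.0, size=500)
x = z + rng.normal(0.0, 1.0, size=500)

mu = 0.0                                             # initial guess
for _ in range(20):
    # E-step: for this conjugate model, p(z | x, mu_old) = N((x + mu)/2, 1/2).
    z_samples = rng.normal((x + mu) / 2.0, np.sqrt(0.5), size=(100, x.size))
    # M-step: argmax_mu of the averaged complete-data log-likelihood is the sample mean of z.
    mu = z_samples.mean()
print(round(float(mu), 2))                           # converges near mu_true = 2.0
```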

Contributor Information

Minhae Kwon, School of Electronic Engineering, Soongsil University, Seoul, Republic of Korea.

Saurabh Daptardar, Google Inc., Mountain View, CA, USA.

Paul Schrater, Department of Computer Science, University of Minnesota, Minneapolis, MN, USA.

Xaq Pitkow, Electrical and Computer Engineering, Rice University, Houston, TX, USA.

References

  • [1].Fahlman Scott E, Hinton Geoffrey E, and Sejnowski Terrence J. Massively parallel architectures for AI: NETL, Thistle, and Boltzmann machines. In National Conference on Artificial Intelligence, AAAI, 1983. [Google Scholar]
  • [2].Russell Stuart. Learning agents for uncertain environments. In Proceedings of the eleventh annual conference on Computational learning theory, pages 101–103. ACM, 1998. [Google Scholar]
  • [3].Choi Jaedeug and Kim Kee-Eung. Inverse reinforcement learning in partially observable environments. Journal of Machine Learning Research, 12(Mar):691–730, 2011. [Google Scholar]
  • [4].Babes Monica, Marivate Vukosi, Subramanian Kaushik, and Littman Michael L. Apprenticeship learning about multiple intentions. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 897–904, 2011. [Google Scholar]
  • [5].Dvijotham Krishnamurthy and Todorov Emanuel. Inverse optimal control with linearly-solvable mdps. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 335–342, 2010. [Google Scholar]
  • [6].Schmitt Felix, Bieg Hans-Joachim, Herman Michael, and Rothkopf Constantin A. I see what you see: Inferring sensor and policy models of human real-world motor behavior. In AAAI, pages 3797–3803, 2017. [Google Scholar]
  • [7].Wu Z, Kwon M, Daptardar S, Schrater P, and Pitkow X. Rational thoughts in neural codes. Proceedings of the National Academy of Sciences of the United States of America, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Sutton Richard S, Barto Andrew G, et al. Reinforcement learning: An introduction. MIT press, 1998. [Google Scholar]
  • [9].Åström Karl Johan. Optimal control of markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965. [Google Scholar]
  • [10].Kaelbling Leslie Pack, Littman Michael L, and Cassandra Anthony R. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1–2):99–134, 1998. [Google Scholar]
  • [11].Rao Rajesh PN. Decision making under uncertainty: a neural model based on partially observable markov decision processes. Frontiers in computational neuroscience, 4:146, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Simon Herbert A. Theories of bounded rationality. Decision and organization, 1(1):161–176, 1972. [Google Scholar]
  • [13].Ng Andrew Y, Russell Stuart J, et al. Algorithms for inverse reinforcement learning. In Icml, pages 663–670, 2000. [Google Scholar]
  • [14].Abbeel Pieter and Ng Andrew Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004. [Google Scholar]
  • [15].Ratliff Nathan D, Bagnell J Andrew, and Zinkevich Martin A. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pages 729–736, 2006. [Google Scholar]
  • [16].Chalk Matthew, Tkačik Gašper, and Marre Olivier. Inferring the function performed by a recurrent neural network. bioRxiv, page 598086, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [17].Ziebart Brian D, Maas Andrew L, Bagnell J Andrew, and Dey Anind K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008. [Google Scholar]
  • [18].Jaynes Edwin T. Information theory and statistical mechanics. Physical review, 106(4):620, 1957. [Google Scholar]
  • [19].Vazquez-Chanlatte Marcell, Jha Susmit, Tiwari Ashish, Ho Mark K, and Seshia Sanjit. Learning task specifications from demonstrations. In Advances in Neural Information Processing Systems, pages 5367–5377, 2018. [Google Scholar]
  • [20].Scobee Dexter RR and Sastry S Shankar. Maximum likelihood constraint inference for inverse reinforcement learning. In Proceedings of 8th International Conference on Learning Representations, 2020. [Google Scholar]
  • [21].Ho Jonathan and Ermon Stefano. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016. [Google Scholar]
  • [22].Jeon Wonseok, Seo Seokin, and Kim Kee-Eung. A bayesian approach to generative adversarial imitation learning. In Advances in Neural Information Processing Systems, 2018. [Google Scholar]
  • [23].Ravindran Balaraman and Levine Sergey. Adail: Adaptive adversarial imitation learning. In NeurIPS Workshop on Learning Transferable Skills, 2019. [Google Scholar]
  • [24].Gangwani Tanmay and Peng Jian. State-only imitation with transition dynamics mismatch. In Proceedings of 8th International Conference on Learning Representations, 2020. [Google Scholar]
  • [25].Finn Chelsea, Abbeel Pieter, and Levine Sergey. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126–1135. JMLR. org, 2017. [Google Scholar]
  • [26].Zintgraf Luisa M, Shiarlis Kyriacos, Kurin Vitaly, Hofmann Katja, and Whiteson Shimon. Fast context adaptation via meta-learning. In Proceedings of the 36th International Conference on Machine Learning, 2019. [Google Scholar]
  • [27].Alet Ferran, Schneider Martin F, Lozano-Perez Tomas, and Kaelbling Leslie Pack. Meta-learning curiosity algorithms. In Proceedings of 8th International Conference on Learning Representations, 2020. [Google Scholar]
  • [28].Fakoor Rasool, Chaudhari Pratik, Soatto Stefano, and Smola Alexander J. Meta-q-learning. In Proceedings of 8th International Conference on Learning Representations, 2020. [Google Scholar]
  • [29].Humplik Jan, Galashov Alexandre, Hasenclever Leonard, Ortega Pedro A, Teh Yee Whye, and Heess Nicolas. Meta reinforcement learning as task inference. In Proceedings of 7th International Conference on Learning Representations, 2019. [Google Scholar]
  • [30].Yoon Jaesik, Kim Taesup, Dia Ousmane, Kim Sungwoong, Bengio Yoshua, and Ahn Sungjin. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 7332–7342, 2018. [Google Scholar]
  • [31].Doya Kenji, Ishii Shin, Pouget Alexandre, and Rao Rajesh PN. Bayesian brain: Probabilistic approaches to neural coding. MIT press, 2007. [Google Scholar]
  • [32].Dayan Peter and Daw Nathaniel D. Decision theory, reinforcement learning, and the brain. Cognitive, Affective, & Behavioral Neuroscience, 8(4):429–453, 2008. [DOI] [PubMed] [Google Scholar]
  • [33].Huang Yanping and Rao Rajesh PN. Reward optimization in the primate brain: A probabilistic model of decision making under uncertainty. PloS one, 8(1), 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [34].Houlsby Neil MT, Huszár Ferenc, Ghassemi Mohammad M, Orbán Gergő, Wolpert Daniel M, and Lengyel Máté. Cognitive tomography reveals complex, task-independent mental representations. Current Biology, 23(21):2169–2175, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [35].Daunizeau Jean, Den Ouden Hanneke EM, Pessiglione Matthias, Kiebel Stefan J, Stephan Klaas E, and Friston Karl J. Observing the observer (i): meta-bayesian models of learning and decision-making. PloS one, 5(12), 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [36].Daunizeau Jean, Den Ouden Hanneke EM, Pessiglione Matthias, Kiebel Stefan J, Friston Karl J, and Stephan Klaas E. Observing the observer (ii): deciding when to decide. PLoS one, 5(12), 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [37].Beck Jeffrey M, Ma Wei Ji, Pitkow Xaq, Latham Peter E, and Pouget Alexandre. Not noisy, just wrong: the role of suboptimal inference in behavioral variability. Neuron, 74(1):30–39, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [38].Lakshminarasimhan Kaushik J., Petsalis Marina, Park Hyeshin, DeAngelis Gregory C., Pitkow Xaq, and Angelaki Dora E.. A dynamic bayesian observer model reveals origins of bias in visual path integration. Neuron, 99(1):194–206.e5, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [39].Lakshminarasimhan Kaushik J, Avila Eric, Neyhart Erin, DeAngelis Gregory C, Pitkow Xaq, and Angelaki Dora E. Tracking the mind’s eye: Primate gaze behavior during virtual visuomotor navigation reflects belief dynamics. Neuron, pages 662–674, May 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [40].Crosby Matthew, Beyret Benjamin, and Halina Marta. The animal-ai olympics. Nature Machine Intelligence, 1(5):257–257, 2019. [Google Scholar]
  • [41].The animal-AI testbed. http://animalaiolympics.com/AAI/.
  • [42].Bellman Richard. A markovian decision process. Journal of mathematics and mechanics, pages 679–684, 1957. [Google Scholar]
  • [43].Howard Ronald A. Dynamic programming and markov processes. 1960.
  • [44].Van Nunen JAEE. A set of successive approximation methods for discounted markovian decision problems. Zeitschrift fuer operations research, 20(5):203–208, 1976. [Google Scholar]
  • [45].Puterman Martin L and Shin Moon Chirl. Modified policy iteration algorithms for discounted markov decision problems. Management Science, 24(11):1127–1137, 1978. [Google Scholar]
  • [46].Julier Simon J and Uhlmann Jeffrey K. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401–422, 2004. [Google Scholar]
  • [47].Vértes Eszter and Sahani Maneesh. Flexible and accurate inference and learning for deep generative models. In Advances in Neural Information Processing Systems, pages 4166–4175, 2018. [Google Scholar]
  • [48].Bengio Yoshua, Courville Aaron, and Vincent Pascal. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013. [DOI] [PubMed] [Google Scholar]
  • [49].Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, and Wierstra D. Continuous control with deep reinforcement learning. In ICLR, 2016. [Google Scholar]
  • [50].Lazaric Alessandro, Restelli Marcello, and Bonarini Andrea. Reinforcement learning in continuous action spaces through sequential monte carlo methods. In Advances in neural information processing systems, pages 833–840, 2008. [Google Scholar]
  • [51].Konda Vijay R and Tsitsiklis John N. Actor-critic algorithms. In Advances in neural information processing systems, pages 1008–1014, 2000. [Google Scholar]
  • [52].Haarnoja Tuomas, Zhou Aurick, Abbeel Pieter, and Levine Sergey. Soft actor-critic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018. [Google Scholar]
  • [53].Simmons-Edler Riley, Eisner Ben, Mitchell Eric, Seung Sebastian, and Lee Daniel. Q-learning for continuous actions with cross-entropy guided policies. arXiv preprint arXiv:1903.10605, 2019. [Google Scholar]
  • [54].Bishop Christopher M. Pattern recognition and machine learning. springer, 2006. [Google Scholar]
  • [55].Lakshminarasimhan Kaushik J, Petsalis Marina, Park Hyeshin, DeAngelis Gregory C, Pitkow Xaq, and Angelaki Dora E. A dynamic bayesian observer model reveals origins of bias in visual path integration. Neuron, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [56].Baker Chris L, Jara-Ettinger Julian, Saxe Rebecca, and Tenenbaum Joshua B. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1(4):1–10, 2017. [Google Scholar]
  • [57].Rafferty Anna N, LaMar Michelle M, and Griffiths Thomas L. Inferring learners’ knowledge from their actions. Cognitive Science, 39(3):584–618, 2015. [DOI] [PubMed] [Google Scholar]
  • [58].Baker Chris, Saxe Rebecca, and Tenenbaum Joshua. Bayesian theory of mind: Modeling joint belief-desire attribution. In Proceedings of the annual meeting of the cognitive science society, volume 33, 2011. [Google Scholar]
  • [59].Dempster Arthur P, Laird Nan M, and Rubin Donald B. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977. [Google Scholar]
