Proc Natl Acad Sci U S A. 2016 Oct 24;113(45):12868–12873. doi: 10.1073/pnas.1609094113

Fig. 1.

Schematic of the algorithm in an example decision problem (see SI Appendix for the general formal algorithm). Assume an individual has a “mental model” of the reward and transition consequent on taking each action at each state in the environment. The value of taking action a at the current state s is denoted by Q(s,a) and is defined as the sum of rewards (temporally discounted by a factor of 0 ≤ γ ≤ 1 per step) that are expected to be received upon performing that action. Q(s,a) can be estimated in different ways. (A) “Planning” involves simulating the tree of future states and actions to arbitrary depths (k) and summing up all of the expected discounted consequences, given a behavioral policy. (B) An intermediate form of control (i.e., plan-until-habit) involves limited-depth forward simulations (k = 3 in our example) to foresee the expected consequences of actions up to that depth (i.e., up to state s′). The sum of those foreseen consequences (r_0 + γr_1 + γ^2 r_2) is then added to the cached habitual assessment [γ^k Q_habit(s′,a′)] of the consequences of the remaining choices starting from the deepest explicitly foreseen states (s′). (C) At the other end of the depth-of-planning spectrum, “habitual control” avoids planning (k = 0) by relying instead on estimates Q_habit(s,a) that are cached from previous experience. These cached values are updated based on rewards obtained when making a choice.
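To make the depth-k bookkeeping in panel B concrete, the following is a minimal Python sketch of the plan-until-habit value estimate. It is an illustration under assumptions, not the authors' implementation: the dictionaries model, actions, and q_habit, the deterministic transitions, and the greedy continuation policy are all hypothetical simplifications introduced here.

```python
# Minimal sketch of the plan-until-habit estimate in Fig. 1B (illustrative only).
# Assumed inputs (hypothetical, not from the paper):
#   model[(s, a)]   -> (reward, next_state)   deterministic one-step model
#   actions[s]      -> list of actions available in state s ([] or missing = terminal)
#   q_habit[(s, a)] -> cached habitual value of taking a in s
def plan_until_habit_q(s, a, k, model, actions, q_habit, gamma=0.9):
    """Value of action a in state s from depth-k planning plus cached values.

    k = 0                -> purely habitual control (panel C)
    0 < k < tree depth   -> intermediate plan-until-habit control (panel B)
    k >= tree depth      -> full planning (panel A)
    """
    if k == 0:
        # No explicit simulation left: bootstrap from the cached habitual estimate.
        return q_habit[(s, a)]

    reward, next_state = model[(s, a)]
    if not actions.get(next_state):
        return reward  # terminal state: no further consequences to evaluate

    # Simulate one step explicitly, then recurse with one fewer step of depth.
    # A greedy choice over continuation values stands in for the "behavioral
    # policy" mentioned in the caption.
    continuation = max(
        plan_until_habit_q(next_state, a2, k - 1, model, actions, q_habit, gamma)
        for a2 in actions[next_state]
    )
    return reward + gamma * continuation
```

Unrolling the recursion for k = 3 reproduces the caption's decomposition, r_0 + γr_1 + γ^2 r_2 + γ^3 Q_habit(s′,a′): three explicitly simulated rewards plus the discounted cached assessment at the deepest foreseen state. Setting k large enough to reach the leaves recovers full planning (A), while k = 0 reduces to purely habitual control (C).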