Front Psychol. 2017 Nov 23;8:2048. doi: 10.3389/fpsyg.2017.02048

Figure 2.

(A) Development of a state-action value for two different learning rates. For the purpose of illustration, we assume that the agent makes identical choices across all trials. Filled and empty circles indicate trials in which the action was rewarded (r = 1) or not rewarded (r = 0), respectively. With a high learning rate (light line), the state-action value estimate fluctuates strongly, tracking the rewards of the most recent trials. In contrast, with a low learning rate (dark line), the state-action value is more stable because it pools over a larger number of previous trials. (B) The higher the inverse softmax temperature, the more likely the agent is to prefer an action with a state-action value of 1 over an action with a state-action value of 0.
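To make the two quantities in the caption concrete, the following is a minimal sketch, assuming the standard delta-rule update Q ← Q + α(r − Q) for panel A and a two-option softmax choice rule with inverse temperature β for panel B. The function names, the simulated reward sequence, and the parameter values are illustrative and not taken from the article.

```python
import numpy as np

def simulate_value_trace(rewards, alpha, q0=0.5):
    """Delta-rule update of a single state-action value.

    rewards : sequence of 0/1 outcomes for the repeatedly chosen action
    alpha   : learning rate in [0, 1]
    """
    q = q0
    trace = []
    for r in rewards:
        q = q + alpha * (r - q)  # move the estimate toward the latest reward
        trace.append(q)
    return trace

def softmax_choice_prob(q_a, q_b, beta):
    """Probability of choosing action A over action B under a softmax rule
    with inverse temperature beta (higher beta = more deterministic)."""
    return 1.0 / (1.0 + np.exp(-beta * (q_a - q_b)))

# Panel A: same (hypothetical) reward sequence, low vs. high learning rate
rng = np.random.default_rng(0)
rewards = rng.integers(0, 2, size=40)             # illustrative 0/1 outcomes
low = simulate_value_trace(rewards, alpha=0.1)    # stable, pools over many trials
high = simulate_value_trace(rewards, alpha=0.8)   # fluctuates with recent rewards

# Panel B: preference for Q = 1 over Q = 0 grows with the inverse temperature
for beta in (0.5, 2.0, 10.0):
    print(beta, softmax_choice_prob(1.0, 0.0, beta))
```

With these illustrative settings, the low-alpha trace changes slowly while the high-alpha trace jumps toward each new outcome, and the printed choice probabilities rise from near 0.5 toward 1 as beta increases, mirroring the two panels described in the caption.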