Table 3.
Model element | General Description | Model specification |
---|---|---|
ot | One vector per category of possible observations. Each vector contains entries corresponding to the possible observable stimuli for that category at time t. | Possible observations for reward included observing a win or a loss on each trial. Possible observations for choice corresponded to the start state and the choice of each of the three bandits. |
st | A vector containing entries corresponding to the probability of each possible state that could be occupied at time t. | Possible states included the start state and the three bandit choice states. |
A, P(ot\|st) | A matrix encoding the relationship between states and observations (one matrix per observation category). | The learnable reward probabilities within this mapping were specified via the Dirichlet priors a (see next row). |
a | Dirichlet priors associated with the A matrix that specify beliefs about the mapping from states to observations. Learning corresponds to updating the concentration parameters for these priors after each observation, where the magnitude of the updates is controlled by a learning rate parameter η (see Supplementary Materials and Figure 1). | Each entry for the learnable reward probabilities began with a uniform concentration parameter of magnitude a0 and was updated after each observed win or loss on the task. The learning rate η and a0 (which can be understood as a measure of sensitivity to new information; see Supplementary Materials) were fit to participant behavior. A minimal illustration of this update rule follows the table. |
B, P(st+1\|st,π) | A set of matrices encoding the probability of transitioning from one state to another given the chosen policy (π). Here, policies simply correspond to the choice of each bandit. | Transition probabilities were deterministic mappings based on a participant's choices such that, for example, P(sbandit 1 \| sstart, πbandit 1) = 1 (and 0 for all other transitions), and similarly for the other possible choices. |
C, lnP(o) | One vector per observation category encoding the preference (reward value) for each possible observation within that category. | The value of observing a win was a model parameter cr reflecting reward sensitivity; the value of all other observations was set to 0. The value of cr was fit to participant behavior. Crucially, higher cr values reduce goal-directed exploration, because the probability of each choice (based on expected free energy Gπ) becomes more driven by reward than by information-seeking (see Supplementary Materials, Figure 1, and the policy-selection sketch after this table). |
D, P(st=1) | A vector encoding prior probabilities over states. | This encoded a probability of 1 that the participant began in the start state. |
π | A vector encoding the probability of selecting each allowable policy (one entry per policy). The value of each policy is determined by its expected free energy (Gπ), which depends on a combination of expected reward and expected information gain. Actions at each time point are chosen by sampling from the distribution over policies, π = σ(Gπ); the determinacy of action selection is modulated by an inverse temperature or action precision parameter α (see Supplementary Materials and Figure 1). | This included three allowable policies, corresponding to the choice of transitioning to each of the three bandit choice states. The action precision parameter α was fit to participant behavior. A minimal illustration of this policy-selection step follows the table. |
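
For concreteness, the Dirichlet learning described in row a can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's implementation; the array layout and the example values of a0 and η below are assumptions chosen only to show the mechanics of the update.

```python
import numpy as np

# Sketch (not the authors' code) of the concentration-parameter learning in row "a":
# after each observed win or loss, the concentration parameter for the observed
# outcome on the chosen bandit is incremented by the learning rate eta.

n_bandits = 3
a0 = 0.25   # assumed initial concentration magnitude ("sensitivity to new information")
eta = 0.5   # assumed learning rate

# a[outcome, bandit]: row 0 = win, row 1 = loss, one column per bandit.
a = np.full((2, n_bandits), a0)

def reward_probabilities(a):
    """Win/loss probabilities implied by the current concentration parameters."""
    return a / a.sum(axis=0, keepdims=True)

def update_a(a, chosen_bandit, won, eta):
    """Increment the concentration parameter for the observed outcome by eta."""
    updated = a.copy()
    updated[0 if won else 1, chosen_bandit] += eta
    return updated

# Example: a single win on bandit 1 (index 0) sharpens beliefs about that bandit only.
a = update_a(a, chosen_bandit=0, won=True, eta=eta)
print(reward_probabilities(a))
```

With a small a0, a single observation moves the implied reward probability substantially (here from 0.5 to 0.75 for the chosen bandit), which is why a0 can be read as a sensitivity-to-new-information parameter.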
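
Rows C and π describe how choices trade off expected reward (scaled by cr) against expected information gain, with determinacy set by the action precision α. The sketch below shows one standard way to realize that trade-off under the same assumptions as the sketch above: expected information gain about each bandit's reward probabilities is computed as the expected KL divergence between updated and current Dirichlet beliefs, and policies are sampled from a softmax over the negative expected free energies scaled by α. The paper's exact expression for Gπ is given in its Supplementary Materials; this is an illustrative approximation, not the authors' implementation.

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_kl(alpha_post, alpha_prior):
    """KL divergence KL( Dir(alpha_post) || Dir(alpha_prior) )."""
    a0_post, a0_prior = alpha_post.sum(), alpha_prior.sum()
    return (gammaln(a0_post) - gammaln(alpha_post).sum()
            - gammaln(a0_prior) + gammaln(alpha_prior).sum()
            + ((alpha_post - alpha_prior)
               * (digamma(alpha_post) - digamma(a0_post))).sum())

def neg_expected_free_energy(a_bandit, cr, eta):
    """-G for choosing one bandit: expected reward plus expected information gain
    about that bandit's win probability (a generic decomposition, for illustration)."""
    p = a_bandit / a_bandit.sum()          # predicted win/loss probabilities
    expected_reward = p[0] * cr            # only a win carries preference cr
    info_gain = 0.0
    for outcome in range(2):               # average over possible outcomes
        a_updated = a_bandit.copy()
        a_updated[outcome] += eta
        info_gain += p[outcome] * dirichlet_kl(a_updated, a_bandit)
    return expected_reward + info_gain

def policy_probabilities(a, cr, eta, alpha):
    """Softmax over the three bandit policies, with action precision alpha."""
    neg_G = np.array([neg_expected_free_energy(a[:, b], cr, eta)
                      for b in range(a.shape[1])])
    z = np.exp(alpha * (neg_G - neg_G.max()))
    return z / z.sum()

# Example beliefs: bandit 1 has been sampled often and looks good; bandits 2 and 3
# are barely sampled, so they carry more expected information gain.
a = np.array([[3.0, 0.5, 0.5],   # win counts per bandit
              [1.0, 0.5, 0.5]])  # loss counts per bandit
print(policy_probabilities(a, cr=2.0, eta=0.5, alpha=4.0))
print(policy_probabilities(a, cr=8.0, eta=0.5, alpha=4.0))
```

With the lower cr, the rarely sampled bandits retain noticeable choice probability through their larger expected information gain; raising cr concentrates the distribution on the high-win-rate bandit, which is the reduction in goal-directed exploration noted in row C.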