Table 3.
Model element | General Description | Model specification |
---|---|---|
ot | One vector per category of possible observations. Each vector contains entries corresponding to the possible observable stimuli for that category at time t. | Possible observations for reward included observing a win or a loss on each trial. Possible observations for choice corresponded to the start state and the choice of each of the three bandits. |
st | A vector containing entries corresponding to the probability of each possible state that could be occupied at time t. | Possible states included the start state and the three bandit choice states. |
A, P(ot\|st) | A matrix encoding the relationship between states and observations (one matrix per observation category). | The learnable reward probabilities within this mapping were specified via the Dirichlet priors a (see next row). |
a | Dirichlet priors associated with the A matrix that specify beliefs about the mapping from states to observations. Learning corresponds to updating the concentration parameters for these priors after each observation, where the magnitude of the updates is controlled by a learning rate parameter η (see Supplementary Materials and Figure 1). | Each entry for the learnable reward probabilities began with a uniform concentration parameter of magnitude a0 and was updated after each observed win or loss on the task. The learning rate η and a0 (which can be understood as a measure of sensitivity to new information; see Supplementary Materials) were fit to participant behavior. A minimal illustration of this update rule follows the table. |
B, P(st+1\|st,π) | A set of matrices encoding the probability of transitioning from one state to another given the chosen policy (π). Here, policies simply correspond to the choice of each bandit. | Transition probabilities were deterministic mappings based on a participant's choices such that, for example, P(sbandit 1 \| sstart, πbandit 1) = 1 (and 0 for all other transitions), and similarly for the other possible choices. |
C, lnP(o) | One vector per observation category encoding the preference (reward value) for each possible observation within that category. | The value of observing a win was a model parameter cr reflecting reward sensitivity; the value of all other observations was set to 0. The value of cr was fit to participant behavior. Crucially, higher cr values reduce goal-directed exploration, because the probability of each choice (based on expected free energy Gπ) becomes more driven by reward than by information-seeking (see Supplementary Materials, Figure 1, and the policy-selection sketch after this table). |
D, P(st=1) | A vector encoding prior probabilities over states. | This encoded a probability of 1 that the participant began in the start state. |
π | A vector encoding the probability of selecting each allowable policy (one entry per policy). The value of each policy is determined by its expected free energy (Gπ), which depends on a combination of expected reward and expected information gain. Actions at each time point are chosen by sampling from the distribution over policies, π = σ(Gπ); the determinacy of action selection is modulated by an inverse temperature or action precision parameter α (see Supplementary Materials and Figure 1). | This included three allowable policies, corresponding to the choice of transitioning to each of the three bandit choice states. The action precision parameter α was fit to participant behavior. A minimal illustration of this policy-selection step follows the table. |
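
For concreteness, the Dirichlet learning described in row a can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's implementation; the array layout and the example values of a0 and η below are assumptions chosen only to show the mechanics of the update.

```python
import numpy as np

# Sketch (not the authors' code) of the concentration-parameter learning in row "a":
# after each observed win or loss, the concentration parameter for the observed
# outcome on the chosen bandit is incremented by the learning rate eta.

n_bandits = 3
a0 = 0.25   # assumed initial concentration magnitude ("sensitivity to new information")
eta = 0.5   # assumed learning rate

# a[outcome, bandit]: row 0 = win, row 1 = loss, one column per bandit.
a = np.full((2, n_bandits), a0)

def reward_probabilities(a):
    """Win/loss probabilities implied by the current concentration parameters."""
    return a / a.sum(axis=0, keepdims=True)

def update_a(a, chosen_bandit, won, eta):
    """Increment the concentration parameter for the observed outcome by eta."""
    updated = a.copy()
    updated[0 if won else 1, chosen_bandit] += eta
    return updated

# Example: a single win on bandit 1 (index 0) sharpens beliefs about that bandit only.
a = update_a(a, chosen_bandit=0, won=True, eta=eta)
print(reward_probabilities(a))
```

With a small a0, a single observation moves the implied reward probability substantially (here from 0.5 to 0.75 for the chosen bandit), which is why a0 can be read as a sensitivity-to-new-information parameter.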
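
Rows C and π describe how choices trade off expected reward (scaled by cr) against expected information gain, with determinacy set by the action precision α. The sketch below shows one standard way to realize that trade-off under the same assumptions as the sketch above: expected information gain about each bandit's reward probabilities is computed as the expected KL divergence between updated and current Dirichlet beliefs, and policies are sampled from a softmax over the negative expected free energies scaled by α. The paper's exact expression for Gπ is given in its Supplementary Materials; this is an illustrative approximation, not the authors' implementation.

```python
import numpy as np
from scipy.special import digamma, gammaln

def dirichlet_kl(alpha_post, alpha_prior):
    """KL divergence KL( Dir(alpha_post) || Dir(alpha_prior) )."""
    a0_post, a0_prior = alpha_post.sum(), alpha_prior.sum()
    return (gammaln(a0_post) - gammaln(alpha_post).sum()
            - gammaln(a0_prior) + gammaln(alpha_prior).sum()
            + ((alpha_post - alpha_prior)
               * (digamma(alpha_post) - digamma(a0_post))).sum())

def neg_expected_free_energy(a_bandit, cr, eta):
    """-G for choosing one bandit: expected reward plus expected information gain
    about that bandit's win probability (a generic decomposition, for illustration)."""
    p = a_bandit / a_bandit.sum()          # predicted win/loss probabilities
    expected_reward = p[0] * cr            # only a win carries preference cr
    info_gain = 0.0
    for outcome in range(2):               # average over possible outcomes
        a_updated = a_bandit.copy()
        a_updated[outcome] += eta
        info_gain += p[outcome] * dirichlet_kl(a_updated, a_bandit)
    return expected_reward + info_gain

def policy_probabilities(a, cr, eta, alpha):
    """Softmax over the three bandit policies, with action precision alpha."""
    neg_G = np.array([neg_expected_free_energy(a[:, b], cr, eta)
                      for b in range(a.shape[1])])
    z = np.exp(alpha * (neg_G - neg_G.max()))
    return z / z.sum()

# Example beliefs: bandit 1 has been sampled often and looks good; bandits 2 and 3
# are barely sampled, so they carry more expected information gain.
a = np.array([[3.0, 0.5, 0.5],   # win counts per bandit
              [1.0, 0.5, 0.5]])  # loss counts per bandit
print(policy_probabilities(a, cr=2.0, eta=0.5, alpha=4.0))
print(policy_probabilities(a, cr=8.0, eta=0.5, alpha=4.0))
```

With the lower cr, the rarely sampled bandits retain noticeable choice probability through their larger expected information gain; raising cr concentrates the distribution on the high-win-rate bandit, which is the reduction in goal-directed exploration noted in row C.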