Abstract
The dilemma between information gathering (exploration) and reward seeking (exploitation) is a fundamental problem for reinforcement learning agents. How humans resolve this dilemma is still an open question, because experiments have provided equivocal evidence about the underlying algorithms used by humans. We show that two families of algorithms can be distinguished in terms of how uncertainty affects exploration. Algorithms based on uncertainty bonuses predict a change in response bias as a function of uncertainty, whereas algorithms based on sampling predict a change in response slope. Two experiments provide evidence for both bias and slope changes, and computational modeling confirms that a hybrid model is the best quantitative account of the data.
Keywords: explore-exploit dilemma, reinforcement learning, Bayesian inference
1 Introduction
When rewards are uncertain, a reinforcement learning agent faces the explore-exploit dilemma: should she exploit the option with the highest expected reward, possibly foregoing even higher rewards from other options, or should she explore other options to gather more accurate information about their values? How humans resolve this dilemma has long puzzled psychologists and neuroscientists (Cohen, McClure, & Yu, 2007; Mehlhorn et al., 2015). Because the optimal solution is computationally intractable, humans must employ approximations or heuristics. A large menu of algorithmic possibilities has been developed in the machine learning literature, and some of these have been studied experimentally. However, these algorithms can be difficult to disentangle empirically because they seem to make similar predictions. The key contribution of this work is to show how different algorithms can in fact make quite different predictions when viewed through the appropriate analytical lens, providing new insights into how humans resolve the explore-exploit dilemma.
Research on exploration has coalesced around two big ideas. The first is that humans engage in “directed” exploration, seeking out options that are highly informative about the underlying reward distribution. This is commonly implemented by adding an “uncertainty” or “information” bonus to the estimates of expected reward (Auer, Cesa-Bianchi, & Fischer, 2002). This scheme has the virtue that exploration will decrease with uncertainty, so that eventually choices will be purely exploitative once enough information has been gathered. A number of studies have found evidence for uncertainty bonuses in human decision making (Frank, Doll, Oas-Terpstra, & Moreno, 2009; Krueger, Wilson, & Cohen, 2017; Lee, Zhang, Munro, & Steyvers, 2011; Meyer & Shi, 1995; Wilson, Geana, White, Ludvig, & Cohen, 2014; Zhang & Yu, 2013), but some have not (Daw, O'Doherty, Dayan, Seymour, & Dolan, 2006; Payzan-LeNestour & Bossaerts, 2011). Wilson et al. (2014) suggested one reason why evidence for uncertainty bonuses has been equivocal: uncertainty and reward are confounded, because people tend to choose more rewarding options and hence have less uncertainty about those options. They demonstrated that decisive evidence for uncertainty bonuses can be obtained when this confound is controlled.
The second big idea is that humans engage in “random” exploration, produced by injecting stochasticity into their choices (Daw et al., 2006). The most widely adopted techniques use a fixed source of stochasticity (see next section), but some evidence suggests that humans use a more sophisticated form of stochasticity, which adapts to their uncertainty level (Schulz, Konstantinidis, & Speekenbrink, 2015; Speekenbrink & Konstantinidis, 2015). This is in fact one of the oldest exploration strategies in reinforcement learning, dating back to the pioneering work of Thompson (1933). However, relatively few studies have attempted to tease apart the effects of uncertainty on directed and random exploration.
The purpose of this paper is to characterize some qualitative properties of particular directed and random exploration strategies, which make them empirically distinguishable. We first provide a formal description of these strategies, and then present the results of two experiments that suggest the use of both strategies. A hybrid of directed and random exploration strategies is anticipated by recent reinforcement learning algorithms (Chapelle & Li, 2011; May, Korda, Lee, & Leslie, 2012), which have been shown to have attractive empirical and theoretical properties.
2 Algorithms for exploration
We focus on the two-armed bandit, in which an agent on trial t selects an action at ∈ {1, 2} and observes a reward rt. Most models of human exploration have assumed some form of fixed random exploration (e.g., Daw et al., 2006), typically choosing action k on trial t with probability given by a softmax distribution:
$$P(a_t = k) = \frac{\exp[\beta Q_t(k)]}{\sum_{k'} \exp[\beta Q_t(k')]} \tag{1}$$
where Qt(k) is an estimate of the expected reward (value) 𝔼[rt|at = k] and β is an inverse temperature parameter that controls the stochasticity of action selection: lower values of β produce more stochasticity. Other forms of random exploration have also been considered: for example, Gonzalez and colleagues (Gonzalez & Dutt, 2011; Lejarraga, Dutt, & Gonzalez, 2012) add noise via an instance-based retrieval mechanism, and Barron and Erev (2003) use an ε-greedy strategy with a time-dependent exploration parameter. One problem with all of these strategies is that they do not take the agent’s uncertainty into account. If an agent samples an action 10 times, she should have more uncertainty about its value than if she samples it 100 times, but the softmax policy will produce the same action probabilities as long as the value estimates are the same. Intuitively, the agent should be more exploratory under high uncertainty, as indeed the optimal exploration policy dictates (Wilson et al., 2014).
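As a concrete illustration, a minimal sketch of the fixed softmax policy in Equation 1 might look as follows (the value estimates and inverse temperature here are arbitrary):

```python
import numpy as np

def softmax_policy(Q, beta):
    """Fixed-temperature softmax: P(a = k) is proportional to exp(beta * Q[k]).

    Q    : array of value estimates, one per arm
    beta : inverse temperature (lower = more random choice)
    """
    logits = beta * np.asarray(Q, dtype=float)
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical value estimates: the same probabilities result whether an arm
# was sampled 10 times or 100 times, because the policy ignores uncertainty.
print(softmax_policy([0.3, 0.0], beta=2.0))
```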
An alternative strategy is to adopt the principle of optimism in the face of uncertainty: prefer arms with greater uncertainty. The most famous operationalization of this principle is the Upper Confidence Bound (UCB) algorithm (Auer et al., 2002), which selects actions deterministically according to:
$$a_t = \underset{k}{\operatorname{argmax}} \left[ Q_t(k) + U_t(k) \right] \tag{2}$$
where Ut(k) is the upper confidence bound that plays the role of an uncertainty bonus. The classic version of UCB (known as UCB1) uses
$$U_t(k) = \sqrt{\frac{2 \log t}{N_t(k)}} \tag{3}$$
where Nt(k) is the number of times action k was chosen. While UCB1 is based on a frequentist confidence interval, Bayesian variants have been developed in which Qt(k) corresponds to the posterior mean and the uncertainty bonus is proportional to the posterior standard deviation σt(k) (Srinivas, Krause, Seeger, & Kakade, 2010). Details of this computation can be found in the Appendix.
UCB can be understood as a form of directed exploration without any random component, whereas softmax is a form of random exploration without any directed component. A more sophisticated form of random exploration is Thompson sampling (Thompson, 1933), which draws random values from the posterior and then chooses greedily with respect to these random values. The key property of Thompson sampling that distinguishes it from softmax exploration is the uncertainty-based determination of choice stochasticity: the agent will explore more when she is more uncertain.
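To make the contrast concrete, the two choice rules can be sketched as follows (an illustrative sketch, not the code used for the analyses; posterior means and standard deviations are assumed to be given):

```python
import numpy as np

rng = np.random.default_rng(0)

def ucb_choice(Q, sigma, gamma=1.0):
    """Directed exploration (Equation 2): deterministically pick the arm
    with the largest posterior mean plus uncertainty bonus."""
    return int(np.argmax(np.asarray(Q) + gamma * np.asarray(sigma)))

def thompson_choice(Q, sigma):
    """Random exploration: draw one sample from each arm's posterior and
    pick the arm with the largest sample; choice stochasticity scales
    with posterior uncertainty."""
    return int(np.argmax(rng.normal(Q, sigma)))

# Arbitrary posterior summaries: arm 1 slightly better on average,
# but much more uncertain.
Q, sigma = [0.2, 0.0], [1.0, 0.1]
print(ucb_choice(Q, sigma), thompson_choice(Q, sigma))
```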
Viewed as hypotheses about human behavior, both UCB and Thompson sampling predict that exploration will increase with uncertainty, but they make qualitatively different predictions about the nature of this increase. This can be seen clearly by examining choice probability as a function of the estimated difference in value between the two arms (Figure 1), where for simplicity we have assumed that arm 1 has an unknown mean and arm 2 has a known mean of 0. If we add a fixed random component to UCB (see Appendix), then both algorithms produce sigmoidal choice probability curves. When uncertainty increases, UCB predicts a response bias (intercept) shift. This arises because the uncertainty combines additively with the value estimate, reducing the size of the value difference necessary to be indifferent between the two arms. Thompson sampling, by contrast, predicts a slope shift, because the (inverse) uncertainty combines multiplicatively with estimated value difference. Our experiments, described in the next section, capitalize on these qualitative differences to deconstruct the algorithms underlying human exploration.
Figure 1. Effects of uncertainty on choice probability for upper confidence bound (Left) and Thompson sampling (Right) algorithms.
Each curve shows the probability of choice as a function of the difference in estimated value between the two arms, plotted separately for low/high posterior standard deviation (SD).
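The qualitative pattern in Figure 1 can be reproduced with a short sketch, assuming the probit forms derived in the Appendix and illustrative parameter values:

```python
import numpy as np
from scipy.stats import norm

gamma, lam = 1.0, 1.0              # illustrative UCB parameters
V = np.array([0.0, 1.0])           # estimated value differences (arm 1 - arm 2)

for sd in (0.5, 1.5):              # posterior SD of arm 1 (arm 2 has a known mean)
    p_ucb = norm.cdf((V + gamma * sd) / lam)   # bonus adds to V: intercept shift
    p_thompson = norm.cdf(V / sd)              # uncertainty rescales V: slope shift
    print(f"SD={sd}: UCB {p_ucb.round(2)}, Thompson {p_thompson.round(2)}")
```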
3 Methods
Because the two experiments use the same methods, differing only in their reward distributions, we describe them here together. On each trial, participants chose between two arms and received reward feedback, drawn from an arm-specific Gaussian distribution, which did not change within a block (similar procedures have been used in the “decisions from experience” literature; e.g., Barron & Erev, 2003; Gonzalez & Dutt, 2011; Lejarraga et al., 2012). In Experiment 1, one arm yielded stochastic rewards and the other arm yielded a fixed reward of 0. In Experiment 2, both arms yielded stochastic rewards. We used probit regression to quantify bias and slope shifts, thereby allowing us to test the predictions of UCB and Thompson sampling algorithms.
3.1 Participants
Forty-four participants in Experiment 1 and forty-five participants in Experiment 2 were recruited via the Amazon Mechanical Turk web service and paid $1.50. The experiments were approved by the Harvard Institutional Review Board.
3.2 Stimuli and Procedure
Participants played 20 two-armed bandits in blocks of 10 trials. On each trial, participants chose one of the arms and received reward feedback (points). They were instructed to choose the arm that maximizes their total points. In Experiment 1, the mean reward μ(1) for arm 1 on each block was drawn from a Gaussian with mean 0 and variance σ0² (thus each block had a different mean reward for arm 1), and the reward for arm 2 was fixed at 0. When participants chose arm 1, they received stochastic rewards drawn from a Gaussian with mean μ(1) and variance τ²(1) = 10. When they chose arm 2, they always received a reward of 0. The structure of Experiment 2 was identical, except that the mean rewards μ(k) for both arms were drawn from the same distribution, with mean 0 and variance σ0², and likewise the reward feedback on each trial was drawn from a Gaussian with mean μ(k) when arm k was selected, with variance τ²(k) = 10.
The exact instructions for participants in Experiment 1 were as follows:
In this task, you have a choice between two slot machines, represented by colored buttons. When you choose the left (variable) machine, you will win or lose points. The left machine will not always give you the same points, but it will tend to give points around its average value. When you choose the right (fixed) machine, you will always get 0 points. Your goal is to choose the slot machine that will give you the most points. After making your choice, you will receive feedback about the outcome. You will play 20 games, each with a different left (variable) slot machine (the right, fixed machine will always stay the same). Each game will consist of 10 trials.
The exact instructions for participants in Experiment 2 were as follows:
In this task, you have a choice between two slot machines, represented by colored buttons. When you click one of the buttons, you will win or lose points. Choosing the same slot machine will not always give you the same points, but one slot machine is always better than the other. Your goal is to choose the slot machine that will give you the most points. After making your choice, you will receive feedback about the outcome. You will play 20 games, each with a different pair of slot machines. Each game will consist of 10 trials.
3.3 Analysis
To estimate the bias and slope of the choice probability functions, we used probit regression. As shown in the Appendix, UCB and Thompson sampling can (under appropriate assumptions) be exactly formalized as probit regression models. In particular, we entered the following regressors into the probit model:
Estimated value difference (V): Qt(1) − Qt(2).
Relative uncertainty (RU): σt(1) − σt(2).
Total uncertainty (TU): √(σt²(1) + σt²(2)).
The reason total uncertainty is defined as the square root of the summed variances rather than the sum of the standard deviations (which would make it symmetric with RU) is that this definition of TU allows us to directly relate it to Thompson sampling. In particular, if V is the estimated value difference between the two arms, then choice probability is a sigmoidal function of V/TU (see Appendix for details). Thus, Thompson sampling predicts a significant positive effect of V/TU on choice probability, but not of RU or V (Figure 2, top right). According to UCB, choice probability is a sigmoidal function of V+RU. Thus, UCB predicts a significant positive effect of both V and RU, but not of V/TU (Figure 2, top left).
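A minimal sketch of this regression analysis, assuming trial-by-trial posterior means and standard deviations from the Kalman filter described in the Appendix (the use of statsmodels here is illustrative, not the paper's own code):

```python
import numpy as np
import statsmodels.api as sm

def probit_regressors(Q, sigma):
    """Build V, RU and V/TU from posterior means (Q) and standard
    deviations (sigma), each of shape (n_trials, 2)."""
    V = Q[:, 0] - Q[:, 1]                          # estimated value difference
    RU = sigma[:, 0] - sigma[:, 1]                 # relative uncertainty
    TU = np.sqrt(sigma[:, 0]**2 + sigma[:, 1]**2)  # total uncertainty
    return np.column_stack([V, RU, V / TU])

def fit_hybrid_probit(choices, Q, sigma):
    """Probit regression of choosing arm 1 on [V, RU, V/TU] (Equation 4).
    `choices` is an array of 1s and 2s; the fitted params are w1, w2, w3."""
    X = probit_regressors(Q, sigma)
    y = (choices == 1).astype(int)
    return sm.Probit(y, X).fit(disp=False)
```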
Figure 2. Experiment 1: regression results.
Choice probability was modeled using probit regression with 3 regressors: value difference (V), relative uncertainty (RU), and value difference normalized by total uncertainty (V/TU). The first 3 panels show the regression coefficients plotted separately for simulated data (UCB and Thompson sampling, top panels) and the empirical data from Experiment 1 (bottom left). The bottom right panel shows the empirical choice probability functions separately for low and high TU, based on a median split. Error bars represent standard error of the mean.
For each model, we used maximum likelihood estimation to fit the coefficients (w) in the probit regression model. The choice probability on trial t was modeled as:
$$P(a_t = 1) = \Phi\!\left( w_1 V_t + w_2 \mathrm{RU}_t + w_3 \frac{V_t}{\mathrm{TU}_t} \right) \tag{4}$$
where Φ(·) is the cumulative distribution function of the standard Gaussian distribution (mean 0 and variance 1). We compared this full regression model (which we refer to as the “hybrid” model) to two nested models: UCB (w3 fixed to 0) and Thompson sampling (w2 fixed to 0). For each regression model, we computed the Bayesian information criterion approximation of the log marginal likelihood (model evidence; Bishop, 2006):
$$\log P(\mathbf{a}) \approx \log P(\mathbf{a} \mid \mathbf{w}^*) - \frac{D}{2} \log T \tag{5}$$
where a denotes the set of all actions, D is the number of parameters, T is the number of trials, and w* = argmaxw P(a|w) is the maximum likelihood estimate of the coefficients. The model evidence was then entered into a random-effects model selection procedure that estimated the frequency of each model in the population (Rigoux, Stephan, Friston, & Daunizeau, 2014; Stephan, Penny, Daunizeau, Moran, & Friston, 2009). This procedure assumes that each participant’s behavior may have been generated by a different model, drawn from some unknown population distribution. We report the protected exceedance probability (PXP) for each model, defined as the posterior probability that the model has a higher frequency than all the other models under consideration (accounting for the possibility that the differences between models arose from chance). Code for performing model comparison is included in the GitHub repository listed in the Supplementary material below.
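For concreteness, the model evidence in Equation 5 can be computed along the following lines (the log-likelihoods below are hypothetical; the random-effects comparison itself requires a separate routine implementing the procedure of Rigoux et al., 2014):

```python
import numpy as np

def log_model_evidence(loglik, n_params, n_trials):
    """BIC approximation to the log marginal likelihood (Equation 5):
    log P(a) ~ log P(a | w*) - (D / 2) * log T."""
    return loglik - 0.5 * n_params * np.log(n_trials)

# Hypothetical maximum log-likelihoods and parameter counts for one
# participant with T = 200 trials (D = 2 for the nested models, 3 for hybrid).
fits = {"UCB": (-95.2, 2), "Thompson": (-97.8, 2), "Hybrid": (-92.1, 3)}
evidence = {m: log_model_evidence(ll, D, n_trials=200) for m, (ll, D) in fits.items()}
print(evidence)
```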
3.4 Supplementary material
Code and data for reproducing all analyses reported in this paper are available at https://github.com/sjgershm/exploration.
4 Results
4.1 Modeling choice probability
The probit regression analysis of the Experiment 1 data (Figure 2, bottom) revealed effects of both RU [t(44) = 5.41, p < 0.001] and V/TU [t(44) = 3.28, p < 0.005], but not of V (p = 0.34). These results demonstrate that choices were not consistent with either a pure UCB or a pure Thompson sampling algorithm, but instead exhibited features of both. Specifically, the RU effect is consistent with the uncertainty bonus predicted by UCB, while the V/TU effect is consistent with the uncertainty-dependent stochasticity predicted by Thompson sampling. Furthermore, the absence of an effect of V is consistent with Thompson sampling, according to which the effect of value on choice is mediated by TU, such that V by itself does not explain any additional variance.
The probit regression analysis of the Experiment 2 data (Figure 3, bottom) was largely consistent with the results of Experiment 1, revealing effects of both RU [t(43) = 5.16, p < 0.001] and V/TU [t(43) = 5.02, p < 0.001], but not of V (p = 0.72). Qualitatively, Experiment 2 appeared to produce a more pronounced slope shift compared to Experiment 1 (compare Figures 2 and 3, bottom right).
Figure 3. Experiment 2: regression results.
Choice probability was modeled using probit regression with 3 regressors: value difference (V), relative uncertainty (RU), and value difference normalized by total uncertainty (V/TU). The first 3 panels show the regression coefficients plotted separately for simulated data (UCB and Thompson sampling, top panels) and the empirical data from Experiment 2 (bottom left). The bottom right panel shows the empirical choice probability functions separately for low and high TU, based on a median split. Error bars represent standard error of the mean.
In addition to measuring the qualitative effects on bias and slope, we used Bayesian random-effects model comparison (Rigoux et al., 2014; Stephan et al., 2009) to compare UCB and Thompson sampling quantitatively with a hybrid model which included both directed and random exploration components (see Methods for a summary of the model comparison procedure, and the Appendix for details of the Hybrid model). In addition, we included a “value-directed exploration” model in which the choice probability was only a function of the value difference (i.e., no dependence on uncertainty). For both studies, the PXPs favored the hybrid model (Figure 4), indicating that both directed and random exploration are needed to adequately describe choice behavior in these tasks.
Figure 4. Bayesian model comparison.
Protected exceedance probability (PXP) for each model, estimated separately for Experiments 1 and 2. “Value” refers to the value-directed exploration model, in which choice probability is a function of the difference in value between the two arms.
4.2 Deconfounding uncertainty and reward
Because participants tend to select rewarding arms, they will have less uncertainty about these arms, thus creating an uncertainty-reward confound (Wilson et al., 2014). To address this, we measured the probability of choosing arm 1 on the first trial of each block, when expected reward is equated across the two arms (i.e., the posterior mean for both arms is 0 on the first trial, because we assume a zero-mean prior). In Experiment 1, uncertainty is higher for arm 1 (which produces variable rewards) compared to arm 2 (which produces fixed rewards), but their expected reward (0) is equated at the beginning of each block. Thus, a preference for arm 1 on the first trial of each block is consistent with an uncertainty bonus, as in UCB. Consistent with this hypothesis, we found that participants preferred arm 1 on the first trial [t(44) = 9.68, p < 0.001; Figure 5]. In Experiment 2, no such differential uncertainty exists, because both arms are equally variable. Accordingly, we found no preference for arm 1 in Experiment 2 (p = 0.95; Figure 5).
Figure 5. Choice probabilities on the first trial of each block.
(Left) Participants preferentially chose arm 1 in Experiment 1 (where it has higher uncertainty than arm 2) but not in Experiment 2 (where it has equal uncertainty). (Middle) The relative uncertainty (RU) coefficient, which captures the degree to which an individual relies on directed exploration, correlated with individual differences in arm 1 preference on the first trial. (Right) Average probability of choosing the risky option on each trial of the experiment.
One potential concern is that participants may have adopted a heuristic strategy in Experiment 1, whereby they preferred the stochastic arm on the first trial because there was no way they could “maximize rewards” by choosing the fixed arm, which always delivered 0 rewards. If participants are using this heuristic, then their exploratory tendency on the first trial should be uncorrelated with their uncertainty bonus fit to all the other trials. In fact, there is a significant correlation in Experiment 1 (r = 0.6, p < 0.0001; Figure 5), demonstrating that participants who show a larger uncertainty bonus also show a greater preference for the risky option on the first trial. Moreover, participants showed a declining preference for the risky choice over the course of each block (Figure 5), consistent with the UCB strategy. Note also that this is not an averaging artifact: the median probability of choosing arm 1 (across all trials) is 0.72, indicating that participants did not adopt a heuristic of always choosing one arm or the other.
4.3 Modeling choice response times
Response times offer an additional source of data to disambiguate directed and random exploration. Models of value-based decision making, drawing an analogy to perceptual decision making, have asserted that responses are generated when an evidence accumulation process crosses a decision threshold (Milosavljevic, Malmaud, Huth, Koch, & Rangel, 2010; Ratcliff & Frank, 2012; Summerfield & Tsetsos, 2012; Tajima, Drugowitsch, & Pouget, 2016). The “evidence” in these models corresponds to noisy samples of the value difference, weighted by their reliability. More generally, many studies have found that response time increases with uncertainty (e.g., Gershman, Tenenbaum, & Jäkel, 2016; Grossman, 1953; Hick, 1952; Hyman, 1953). Thus, we might reasonably predict that response times should be slower when TU is high. In contrast, an uncertainty bonus should alter the evidence itself, such that higher RU should produce faster response times.
We tested these predictions by modeling log response times using linear regression with V, RU and TU as regressors. We found that there was a significant effect of TU in both experiments [Experiment 1: t(44) = 12.64, p < 0.0001; Experiment 2: t(43) = 15.52, p < 0.0001], but only a significant effect of RU in Experiment 2 [t(43) = 2.39, p < 0.05; Experiment 1: p = 0.25]. Nonetheless, variation in the choice coefficient for RU across participants predicted variation in the response time coefficient in both experiments (Figure 6): relative uncertainty sped up participants to the extent that it also biased subjects towards choosing more uncertain arms [Experiment 1: r = −0.3, p < 0.05; Experiment 2: r = −0.69, p < 0.001].
Figure 6. Response time regression results.
Log response times were modeled using linear regression with value difference (V), relative uncertainty (RU), and total uncertainty (TU) regressors. Top left: coefficients estimated from Experiment 1 data. Error bars represent standard error of the mean. Top right: coefficients estimated from Experiment 2 data. Bottom panels show the relationship between choice and response time (RT) coefficients for the relative uncertainty regressor, with superimposed least-squares line.
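A sketch of this response time analysis (again illustrative, using the same trial-wise regressors as in the choice analysis):

```python
import numpy as np
import statsmodels.api as sm

def fit_rt_regression(rt, V, RU, TU):
    """Linear regression of log response times on V, RU and TU
    (with an intercept); `rt` is an array of per-trial response times."""
    X = sm.add_constant(np.column_stack([V, RU, TU]))
    return sm.OLS(np.log(rt), X).fit()
```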
4.4 Re-analysis of Wilson et al. (2014)
To obtain further validation of the hybrid model, we fit it to the data collected by Wilson et al. (2014). In this task, participants chose between two options with rewards drawn from a Gaussian distribution (truncated between 1 and 100, and rounded to the nearest integer). The means for the different options were different (either 40 or 60), and the standard deviation for both was set to 8. Both parameters remained fixed for each block, which lasted between 5 and 10 trials. The first 4 trials of each block were “forced choice,” a feature of the task designed to control participants’ uncertainty prior to making free choices. Participants played a total of 320 blocks (see Wilson et al., 2014, for more details).
We fixed the prior variance σ0² to a constant and fit the reward variance τ² using grid search. The rewards were mean-centered prior to fitting, which is equivalent to setting the prior mean to 50. Only data from the free choice trials were fit, but the model used the data from the forced choice trials to update the posterior. The 4 models (UCB, Thompson, hybrid, and value-directed) were again compared using random-effects Bayesian model comparison. Recapitulating the results from Experiments 1 and 2, the data from Wilson et al. (2014) were decisively better explained by the hybrid model, with a protected exceedance probability of 0.999.
5 Discussion
The studies reported in this paper demonstrate that choice behavior on bandit problems is best explained (among the models we considered) by a combination of directed and random exploration strategies, consistent with the findings of Wilson et al. (2014) and Krueger et al. (2017). Expanding on this idea, we showed that two specific algorithmic implementations of these strategies (UCB and Thompson sampling) produce qualitatively different effects on choice probability functions, both of which were discernible in the data. In particular, we found a downward shift in the response bias (intercept) with greater uncertainty (consistent with UCB), and an increase in choice stochasticity with greater uncertainty (consistent with Thompson sampling). Accordingly, quantitative model comparison (including a re-analysis of the data reported in Wilson et al., 2014) supported a hybrid model that combined both strategies. Finally, we found signatures of the two strategies in response times: relative uncertainty was correlated with faster response times (consistent with UCB), while total uncertainty was correlated with slower response times (consistent with Thompson sampling).
These discoveries were enabled by a detailed analysis of particular computational models, which allowed us to produce model-based uncertainty and value predictors of choice. However, we did not exhaustively survey the space of plausible models, and other models (e.g., Frank et al., 2009; Lee et al., 2011; Payzan-LeNestour & Bossaerts, 2011, 2012; Speekenbrink & Konstantinidis, 2015; Zhang & Yu, 2013) might produce empirically similar results. We conjecture that our conclusions about the signatures of directed and random exploration will hold at least qualitatively for other models which include both strategies. Conversely, our results rule out models that fail to include both strategies, such as the softmax policy with fixed inverse temperature used in the vast majority of human studies. In other words, it would be unwise at this point to make a strong claim that the uncertainty bonus and source of randomness take the particular functional forms instantiated here; rather, we are making the claim that a specific class of hybrid algorithms can explain the choice behavior reported here, and other classes cannot. So for example the knowledge gradient algorithm advocated by Zhang and Yu (2013), which implements a different form of directed exploration, would be embraced by our theoretical framework provided that it is augmented with a form of posterior sampling.
One complexity in making these comparisons is that models based on normative considerations are typically adapted to specific environmental assumptions and task setups. For example, some models were designed specifically for non-stationary reward distributions (Payzan-LeNestour & Bossaerts, 2011, 2012; Speekenbrink & Konstantinidis, 2015; Zhang & Yu, 2013), whereas others were designed for contextual bandits, where rewards depend on contextual information provided to the agent (Schulz et al., 2015). More research is needed to understand to what extent a combination of directed and random exploration obtains in these other environments.
Wilson et al. (2014) pointed out that some studies (Daw et al., 2006; Payzan-LeNestour & Bossaerts, 2011) may have failed to support uncertainty bonuses because they were masked by a confound between uncertainty and reward: because highly rewarding arms are chosen more frequently than less rewarding arms, the highly rewarding arms will be associated with less uncertainty. Indeed, Payzan-LeNestour and Bossaerts (2012) showed that parsing uncertainty into “expected” and “unexpected” components supported a superior model that included uncertainty bonuses. We addressed this issue in our data by measuring exploration on the first trial of each block, before participants have gathered information about reward values, showing clear evidence for an uncertainty bonus.
One lingering question is why people would use a hybrid algorithm of the sort that best predicted choice behavior. Several papers in the machine learning literature have suggested that a hybrid approach can outperform both UCB and Thompson sampling under some conditions (Chapelle & Li, 2011; May et al., 2012), although we still lack a comprehensive understanding of what those conditions are. Indeed, it is possible that people use different exploration algorithms in different situations. To partially address the normative question, we evaluated the performance of the three algorithms on the two-armed bandit task used in Experiment 2. The results show that in general the hybrid algorithm does in fact outperform UCB and Thompson sampling (Figure 7), thus supporting the hypothesis that people are employing a normatively justifiable exploration strategy.
Figure 7. Relative performance of models.
Probability of choosing the optimal arm is plotted as a function of γ (undirected exploration coefficient), averaged across 500 simulated participants. The hybrid algorithm (with β = 4; see Appendix) consistently outperforms both the UCB and Thompson sampling algorithms on the task used in Experiment 2.
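The following sketch illustrates the kind of simulation summarized in Figure 7, using the probit policies from the Appendix with β = 4 and otherwise illustrative parameter values (the prior variance shown here is a placeholder):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def run_block(policy, n_trials=10, sigma0_sq=100.0, tau2=10.0,
              gamma=1.0, lam=1.0, beta=4.0):
    """Simulate one block of the Experiment 2 task and return the
    fraction of choices of the better arm (parameters illustrative)."""
    mu = rng.normal(0.0, np.sqrt(sigma0_sq), size=2)   # latent mean rewards
    Q, var = np.zeros(2), np.full(2, sigma0_sq)        # prior mean and variance
    correct = 0
    for _ in range(n_trials):
        sd = np.sqrt(var)
        V, RU, TU = Q[0] - Q[1], sd[0] - sd[1], np.sqrt(var.sum())
        if policy == "ucb":
            p1 = norm.cdf((V + gamma * RU) / lam)          # Eq. 10
        elif policy == "thompson":
            p1 = norm.cdf(V / TU)                          # Eq. 9
        else:
            p1 = norm.cdf(beta * V / TU + gamma * RU)      # Eq. 11 (hybrid)
        a = 0 if rng.random() < p1 else 1
        r = rng.normal(mu[a], np.sqrt(tau2))
        alpha = var[a] / (var[a] + tau2)                   # Kalman update (Eqs. 6-8)
        Q[a] += alpha * (r - Q[a])
        var[a] -= alpha * var[a]
        correct += int(a == int(np.argmax(mu)))
    return correct / n_trials

for policy in ("ucb", "thompson", "hybrid"):
    print(policy, round(float(np.mean([run_block(policy) for _ in range(500)])), 3))
```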
Armed with fine-grained algorithmic hypotheses, it will be fruitful for future research to revisit a number of issues in contemporary research on exploration. For example, Somerville et al. (2017) found that directed exploration emerges over the course of adolescence, whereas random exploration appears to be present at the same level across developmental stages. The analyses developed in the present paper could be used to investigate whether Thompson sampling provides a good quantitative account of pre-adult exploration, with a UCB component developing later. The analyses could also be applied to relating different exploration strategies to individual differences, clinical symptoms, and neural correlates. Recent data suggest a causal role of norepinephrine in random exploration (Warren et al., 2017) and of frontopolar cortex in directed exploration (Zajkowski, Kossut, & Wilson, 2017); these neural correlates are thus good candidates for studying the modulatory role of uncertainty.
While we have studied very simple (stationary, non-contextual, uncorrelated) bandit tasks, human reinforcement learning appears to make use of much more sophisticated knowledge structures. For example, humans use structured knowledge about the environment (Acuña & Schrater, 2010; Knox, Otto, Stone, & Love, 2011; Otto, Knox, Markman, & Love, 2014) and hierarchically organized priors (Gershman & Niv, 2015) to guide exploration. In principle, UCB and Thompson sampling can be applied to more structured state spaces and probabilistic models, provided that the posterior distribution over parameters can be computed. Thus, these techniques may provide a bridge between classical “model-free” reinforcement learning algorithms and structured “model-based” algorithms in the human brain (Gershman, 2017). The idea that simple, generic algorithms like UCB and Thompson sampling can be applied to a diverse range of probabilistic models provides an appealing architectural principle for the brain.
It is important, however, to recognize the limitations of algorithms like UCB and Thompson sampling. The participants in Wilson et al. (2014) modulated their exploration policy as a function of the horizon (how many trials they knew they were going to play in each block). This is at least qualitatively consistent with the optimal policy, but problematic for most implementations of UCB and Thompson sampling, which do not take the horizon into account. Thus, people may be employing model-based algorithms like tree search or dynamic programming, as has been suggested in the sequential decision making literature (Daw & Dayan, 2014; Gershman, Markman, & Otto, 2014), or they may be using hitherto unstudied approximations of these algorithms.
In summary, the key finding reported here is that reward uncertainty has two distinct effects on choice behavior: (1) It acts as a pseudo-reward, encouraging exploration of options with high uncertainty, and (2) it acts as a driver of stochasticity in choice behavior, causing choice to become more random. These dual effects suggest that humans employ a hybrid exploration algorithm, combining directed and random exploration. This hybrid algorithm can outperform pure directed and pure random exploration algorithms on the simple bandit tasks studied here, providing a normative justification for combining the two forms of exploration.
Highlights.
Theoretical analysis shows that exploration algorithms can be distinguished in terms of the bias and slope of choice functions.
Two experiments show evidence for both directed and random exploration.
A hybrid algorithm provides the best quantitative model of the choice data.
Acknowledgments
I am grateful to Eric Schulz and Felix Mauersberger for helpful discussions, and to Bob Wilson for generously sharing his data and ideas. This research was supported by the NSF Collaborative Research in Computational Neuroscience (CRCNS) Program Grant IIS-1207833.
Appendix: modeling details
Belief updating
Under the Gaussian assumptions stipulated by our task, the posterior over the value of arm k is Gaussian with mean Qt(k) and variance σt²(k). Following earlier work (Daw et al., 2006; Pearson, Hayden, Raghavachari, & Platt, 2009), we used the Kalman filtering equations (Bishop, 2006) to recursively compute the posterior mean and variance for each arm:
$$Q_{t+1}(k) = Q_t(k) + \alpha_t \left[ r_t - Q_t(k) \right] \mathbb{I}[a_t = k] \tag{6}$$
$$\sigma^2_{t+1}(k) = \sigma^2_t(k) - \alpha_t \sigma^2_t(k)\, \mathbb{I}[a_t = k] \tag{7}$$
where 𝕀[·] equals 1 when its argument is true and 0 otherwise, and the learning rate αt is given by:
$$\alpha_t = \frac{\sigma^2_t(a_t)}{\sigma^2_t(a_t) + \tau^2(a_t)} \tag{8}$$
The initial values were set to the prior means, Q1(k) = 0 for all k, and the initial variances were set to the prior variance, σ1²(k) = σ0². In Experiment 1, τ²(1) = 10 and τ²(2) = 0; in Experiment 2, τ²(1) = τ²(2) = 10.
Note that although the Kalman filter is often used to model learning in dynamically changing environments, we apply it to a static environment (the reward distribution was fixed within a block). Thus, in this setting the Kalman filter is simply a recursive implementation of Bayesian inference for the mean of a Gaussian distribution.
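A minimal sketch of these updates (Equations 6-8), with each block initialized at the prior mean and variance (the numerical prior variance below is a placeholder):

```python
import numpy as np

def kalman_update(Q, var, chosen, reward, tau2):
    """One Kalman filter update of the posterior means (Q) and variances
    (var); unchosen arms are left untouched."""
    Q, var = Q.copy(), var.copy()
    alpha = var[chosen] / (var[chosen] + tau2)    # learning rate (Eq. 8)
    Q[chosen] += alpha * (reward - Q[chosen])     # posterior mean (Eq. 6)
    var[chosen] -= alpha * var[chosen]            # posterior variance (Eq. 7)
    return Q, var

# Start of a block: prior mean 0 for both arms; the prior variance value
# used here is a placeholder.
Q, var = np.zeros(2), np.full(2, 100.0)
Q, var = kalman_update(Q, var, chosen=0, reward=3.2, tau2=10.0)
```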
Action policies
For the Thompson sampling policy, a random sample Q̃t(k) ∼ 𝒩(Qt(k), σt²(k)) is drawn from the posterior for each arm k, and then the arm with the highest Q̃t(k) is chosen. Marginalizing over Ṽt = Q̃t(1) − Q̃t(2) gives a closed-form expression for the choice probability:
$$P(a_t = 1) = P(\tilde{V}_t > 0) = \Phi\!\left( \frac{V_t}{\sqrt{\sigma^2_t(1) + \sigma^2_t(2)}} \right) \tag{9}$$
where Vt = 𝔼[Ṽt] = Qt(1) − Qt(2). This policy gives a sigmoidal “probit” function of Vt, much like softmax (which is equivalent to a logistic sigmoid for two-armed bandits), with stochasticity proportional to the total uncertainty √(σt²(1) + σt²(2)).
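A quick Monte Carlo check of this marginalization, with arbitrary posterior parameters:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
Q, sigma = np.array([0.5, 0.0]), np.array([1.2, 0.8])   # arbitrary posteriors

# Empirical: sample both arms' values from their posteriors, choose the larger
samples = rng.normal(Q, sigma, size=(100_000, 2))
p_empirical = np.mean(samples[:, 0] > samples[:, 1])

# Closed form (Equation 9)
p_closed = norm.cdf((Q[0] - Q[1]) / np.sqrt(np.sum(sigma**2)))
print(round(float(p_empirical), 3), round(float(p_closed), 3))
```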
For the UCB policy, we relaxed the assumption that actions are chosen deterministically, in order to accommodate human choice stochasticity and also to facilitate the comparison with Thompson sampling. Specifically, we again used the Gaussian CDF (probit) policy:
$$P(a_t = 1) = \Phi\!\left( \frac{V_t + \gamma \left[ \sigma_t(1) - \sigma_t(2) \right]}{\lambda} \right) \tag{10}$$
where γ is a weighting factor on the uncertainty bonus, and λ controls the choice stochasticity (analogous to temperature in the softmax policy).
For the hybrid model, we combined the directed exploration component of UCB with the random component of Thompson sampling:
$$P(a_t = 1) = \Phi\!\left( \beta \frac{V_t}{\sqrt{\sigma^2_t(1) + \sigma^2_t(2)}} + \gamma \left[ \sigma_t(1) - \sigma_t(2) \right] \right) \tag{11}$$
where β is an additional free parameter that controls (along with γ) the balance between directed and random exploration. This is similar in spirit to the “optimistic” Thompson sampling algorithm developed by May et al. (2012).
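The three policies (Equations 9-11) can be collected into a single sketch; parameter values are left to the caller:

```python
import numpy as np
from scipy.stats import norm

def choice_prob_arm1(Q, sigma, policy, gamma=1.0, lam=1.0, beta=1.0):
    """Probability of choosing arm 1 under the three probit policies
    (Equations 9-11); Q and sigma are length-2 posterior summaries."""
    V = Q[0] - Q[1]                              # estimated value difference
    RU = sigma[0] - sigma[1]                     # relative uncertainty
    TU = np.sqrt(sigma[0]**2 + sigma[1]**2)      # total uncertainty
    if policy == "thompson":
        return norm.cdf(V / TU)                  # slope set by total uncertainty
    if policy == "ucb":
        return norm.cdf((V + gamma * RU) / lam)  # additive bonus, fixed noise
    if policy == "hybrid":
        return norm.cdf(beta * V / TU + gamma * RU)
    raise ValueError(policy)
```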
References
- Acuña D, Schrater P. Structure learning in human sequential decision-making. PLoS Computational Biology. 2010;6:e1001003. doi: 10.1371/journal.pcbi.1001003.
- Auer P, Cesa-Bianchi N, Fischer P. Finite-time analysis of the multiarmed bandit problem. Machine Learning. 2002;47:235–256.
- Barron G, Erev I. Small feedback-based decisions and their limited correspondence to description-based decisions. Journal of Behavioral Decision Making. 2003;16:215–233.
- Bishop CM. Pattern recognition and machine learning. Springer; 2006.
- Chapelle O, Li L. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems. 2011:2249–2257.
- Cohen JD, McClure SM, Yu AJ. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society of London B: Biological Sciences. 2007;362:933–942. doi: 10.1098/rstb.2007.2098.
- Daw ND, Dayan P. The algorithmic anatomy of model-based evaluation. Philosophical Transactions of the Royal Society B. 2014;369:20130478. doi: 10.1098/rstb.2013.0478.
- Daw ND, O'Doherty JP, Dayan P, Seymour B, Dolan RJ. Cortical substrates for exploratory decisions in humans. Nature. 2006;441:876–879. doi: 10.1038/nature04766.
- Frank MJ, Doll BB, Oas-Terpstra J, Moreno F. Prefrontal and striatal dopaminergic genes predict individual differences in exploration and exploitation. Nature Neuroscience. 2009;12:1062–1068. doi: 10.1038/nn.2342.
- Gershman SJ. Reinforcement learning and causal models. In: Waldmann M, editor. The Oxford Handbook of Causal Reasoning. Oxford University Press; 2017.
- Gershman SJ, Markman AB, Otto AR. Retrospective revaluation in sequential decision making: A tale of two systems. Journal of Experimental Psychology: General. 2014;143:182–194. doi: 10.1037/a0030844.
- Gershman SJ, Niv Y. Novelty and inductive generalization in human reinforcement learning. Topics in Cognitive Science. 2015;7:391–415. doi: 10.1111/tops.12138.
- Gershman SJ, Tenenbaum JB, Jäkel F. Discovering hierarchical motion structure. Vision Research. 2016;126:232–241. doi: 10.1016/j.visres.2015.03.004.
- Gonzalez C, Dutt V. Instance-based learning: Integrating sampling and repeated decisions from experience. Psychological Review. 2011;118:523–551. doi: 10.1037/a0024558.
- Grossman E. Entropy and choice time: The effect of frequency unbalance on choice-response. Quarterly Journal of Experimental Psychology. 1953;5:41–51.
- Hick WE. On the rate of gain of information. Quarterly Journal of Experimental Psychology. 1952;4:11–26.
- Hyman R. Stimulus information as a determinant of reaction time. Journal of Experimental Psychology. 1953;45:188–196. doi: 10.1037/h0056940.
- Knox WB, Otto AR, Stone P, Love BC. The nature of belief-directed exploratory choice in human decision-making. Frontiers in Psychology. 2011;2. doi: 10.3389/fpsyg.2011.00398.
- Krueger PM, Wilson RC, Cohen JD. Strategies for exploration in the domain of losses. Judgment and Decision Making. 2017;12:104–117.
- Lee MD, Zhang S, Munro M, Steyvers M. Psychological models of human and optimal performance in bandit problems. Cognitive Systems Research. 2011;12:164–174.
- Lejarraga T, Dutt V, Gonzalez C. Instance-based learning: A general model of repeated binary choice. Journal of Behavioral Decision Making. 2012;25:143–153.
- May BC, Korda N, Lee A, Leslie DS. Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research. 2012;13:2069–2106.
- Mehlhorn K, Newell BR, Todd PM, Lee MD, Morgan K, Braithwaite VA, … Gonzalez C. Unpacking the exploration–exploitation tradeoff: A synthesis of human and animal literatures. Decision. 2015;2:191–215.
- Meyer RJ, Shi Y. Sequential choice under ambiguity: Intuitive solutions to the armed-bandit problem. Management Science. 1995;41:817–834.
- Milosavljevic M, Malmaud J, Huth A, Koch C, Rangel A. The drift diffusion model can account for value-based choice response times under high and low time pressure. Judgment and Decision Making. 2010;5:437–449.
- Otto AR, Knox WB, Markman AB, Love BC. Physiological and behavioral signatures of reflective exploratory choice. Cognitive, Affective, & Behavioral Neuroscience. 2014;14:1167–1183. doi: 10.3758/s13415-014-0260-4.
- Payzan-LeNestour E, Bossaerts P. Risk, unexpected uncertainty, and estimation uncertainty: Bayesian learning in unstable settings. PLoS Computational Biology. 2011;7:e1001048. doi: 10.1371/journal.pcbi.1001048.
- Payzan-LeNestour E, Bossaerts P. Do not bet on the unknown versus try to find out more: Estimation uncertainty and “unexpected uncertainty” both modulate exploration. Frontiers in Neuroscience. 2012;6. doi: 10.3389/fnins.2012.00150.
- Pearson JM, Hayden BY, Raghavachari S, Platt ML. Neurons in posterior cingulate cortex signal exploratory decisions in a dynamic multioption choice task. Current Biology. 2009;19:1532–1537. doi: 10.1016/j.cub.2009.07.048.
- Ratcliff R, Frank MJ. Reinforcement-based decision making in corticostriatal circuits: Mutual constraints by neurocomputational and diffusion models. Neural Computation. 2012;24:1186–1229. doi: 10.1162/NECO_a_00270.
- Rigoux L, Stephan KE, Friston KJ, Daunizeau J. Bayesian model selection for group studies revisited. NeuroImage. 2014;84:971–985. doi: 10.1016/j.neuroimage.2013.08.065.
- Schulz E, Konstantinidis E, Speekenbrink M. Learning and decisions in contextual multi-armed bandit tasks. Proceedings of the 37th Annual Conference of the Cognitive Science Society; 2015. pp. 2122–2127.
- Somerville LH, Sasse SF, Garrad MC, Drysdale AT, Abi Akar N, Insel C, Wilson RC. Charting the expansion of strategic exploratory behavior during adolescence. Journal of Experimental Psychology: General. 2017;146:155–164. doi: 10.1037/xge0000250.
- Speekenbrink M, Konstantinidis E. Uncertainty and exploration in a restless bandit problem. Topics in Cognitive Science. 2015;7:351–367. doi: 10.1111/tops.12145.
- Srinivas N, Krause A, Seeger M, Kakade SM. Gaussian process optimization in the bandit setting: No regret and experimental design. Proceedings of the 27th International Conference on Machine Learning; 2010. pp. 1015–1022.
- Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ. Bayesian model selection for group studies. NeuroImage. 2009;46:1004–1017. doi: 10.1016/j.neuroimage.2009.03.025.
- Summerfield C, Tsetsos K. Building bridges between perceptual and economic decision-making: Neural and computational mechanisms. Frontiers in Neuroscience. 2012;6. doi: 10.3389/fnins.2012.00070.
- Tajima S, Drugowitsch J, Pouget A. Optimal policy for value-based decision-making. Nature Communications. 2016;7. doi: 10.1038/ncomms12400.
- Thompson WR. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika. 1933;25:285–294.
- Warren CM, Wilson RC, van der Wee NJ, Giltay EJ, van Noorden MS, Cohen JD, Nieuwenhuis S. The effect of atomoxetine on random and directed exploration in humans. PLoS One. 2017;12:e0176034. doi: 10.1371/journal.pone.0176034.
- Wilson RC, Geana A, White JM, Ludvig EA, Cohen JD. Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology: General. 2014;143:2074–2081. doi: 10.1037/a0038199.
- Zajkowski WK, Kossut M, Wilson RC. A causal role for right frontopolar cortex in directed, but not random, exploration. eLife. 2017;6:e27430. doi: 10.7554/eLife.27430.
- Zhang S, Yu AJ. Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting. Advances in Neural Information Processing Systems. 2013:2607–2615.