Optimal policy for attention-modulated decisions explains human fixation behavior

Anthony I Jang; Ravi Sharma; Jan Drugowitsch

doi:10.7554/eLife.63436

. 2021 Mar 26;10:e63436. doi: 10.7554/eLife.63436

Optimal policy for attention-modulated decisions explains human fixation behavior

Anthony I Jang ¹, Ravi Sharma ², Jan Drugowitsch ^1,^✉

Editors: Konstantinos Tsetsos³, Joshua I Gold⁴

PMCID: PMC8064754 PMID: 33769284

Abstract

Traditional accumulation-to-bound decision-making models assume that all choice options are processed with equal attention. In real life decisions, however, humans alternate their visual fixation between individual items to efficiently gather relevant information (Yang et al., 2016). These fixations also causally affect one’s choices, biasing them toward the longer-fixated item (Krajbich et al., 2010). We derive a normative decision-making model in which attention enhances the reliability of information, consistent with neurophysiological findings (Cohen and Maunsell, 2009). Furthermore, our model actively controls fixation changes to optimize information gathering. We show that the optimal model reproduces fixation-related choice biases seen in humans and provides a Bayesian computational rationale for this phenomenon. This insight led to additional predictions that we could confirm in human data. Finally, by varying the relative cognitive advantage conferred by attention, we show that decision performance is benefited by a balanced spread of resources between the attended and unattended items.

Research organism: Human

Introduction

Would you rather have a donut or an apple as a mid-afternoon snack? If we instantaneously knew their associated rewards, we could immediately choose the higher-rewarding option. However, such decisions usually take time and are variable, suggesting that they arise from a neural computation that extends over time (Rangel and Hare, 2010; Shadlen and Shohamy, 2016). In the past, such behavior has been modeled descriptively with accumulation-to-bound models that continuously accumulate noisy evidence from each choice option, until a decision boundary is reached in favor of a single option over its alternatives. Such models have been successful at describing accuracy and response time data from human decision makers performing in both perceptual and value-based decision tasks (Ratcliff and McKoon, 2008; Milosavljevic et al., 2010). Recently, we and others showed that, if we assume these computations to involve a stream of noisy samples of each item’s perceptual feature (for perceptual decisions) or underlying value (for value-based decisions), then the normative strategy could be implemented as an accumulation-to-bound model (Bogacz et al., 2006; Drugowitsch et al., 2012; Tajima et al., 2016). Specifically, the normative strategy could be described with the diffusion decision model (Ratcliff and McKoon, 2008) with time-varying decision boundaries that approach each other over time.

Standard accumulation-to-bound models assume that all choice options receive equal attention during decision-making. However, the ability to drive one’s attention amidst multiple, simultaneous trains of internal and external stimuli is an integral aspect of everyday life. Indeed, humans tend to alternate between fixating on different items when making decisions, suggesting that control of overt visual attention is intrinsic to the decision-making process (Kustov and Robinson, 1996; Mohler and Wurtz, 1976). Furthermore, their final choices are biased toward the item that they looked at longer, a phenomenon referred to as a choice bias (Shimojo et al., 2003; Krajbich et al., 2010; Krajbich and Rangel, 2011; Cavanagh et al., 2014). While several prior studies have developed decision-making models that incorporate attention (Yu et al., 2009; Krajbich et al., 2010; Towal et al., 2013; Cassey et al., 2013; Gluth et al., 2020), our goal was to develop a normative framework that incorporates control of attention as an integral aspect of the decision-making process, such that the agent must efficiently gather information from all items while minimizing the deliberation time, akin to real life decisions. In doing so, we hoped to provide a computational rationale for why fixation-driven choice biases seen in human behavior may arise from an optimal decision strategy. For example, the choice bias has been previously replicated with a modified accumulation-to-bound model, but the model assumed that fixations are driven by brain processes that are exogenous to the computations involved in decision-making (Krajbich et al., 2010). This stands in contrast to studies of visual attention where fixations appear to be controlled to extract choice-relevant information in a statistically efficient manner, suggesting that fixations are driven by processes endogenous to the decision (Yang et al., 2016; Hoppe and Rothkopf, 2016; Hayhoe and Ballard, 2005; Chukoskie et al., 2013; Corbetta and Shulman, 2002).

We asked if the choice bias associated with fixations can be explained with a unified framework in which fixation changes and decision-making are part of the same process. To do so, we endowed normative decision-making models (Tajima et al., 2016) with attention that boost the amount of information one collects about each choice option, in line with neurophysiological findings (Averbeck et al., 2006; Cohen and Maunsell, 2009; Mitchell et al., 2009; Wittig et al., 2018). We furthermore assumed that this attention was overt (Posner, 1980; Geisler and Cormack, 2012), and thus reflected in the decision maker’s gaze which was controlled by the decision-making process.

We first derive the complex normative decision-making strategy arising from these assumptions and characterize its properties. We then show that this strategy featured the same choice bias as observed in human decision makers: it switched attention more frequently when deciding between items with similar values, and was biased toward choosing items that were attended last, and attended longer. It furthermore led to new predictions that we could confirm in human behavior: choice biases varied based on the amount of time spent on the decision and the average desirability across both choice items. Lastly, it revealed why the observed choice biases might, in fact, be rational. Overall, our work provides a unified framework in which the optimal, attention-modulated information-seeking strategy naturally leads to biases in choice that are driven by visual fixations, as observed in human decisions.

Results

An attention-modulated decision-making model

Before describing our attention-modulated decision-making model, we will first briefly recap the attention-free value-based decision-making model (Tajima et al., 2016) that ours builds upon. This model assumes that for each decision trial, a true value associated with each item (z₁,z₂) is drawn from a normal prior distribution with mean $\bar{z}$ and variance $σ_{z}^{2}$ . Therefore, $z_{j} \sim 𝒩 (\bar{z}, σ_{z}^{2})$ for both $j \in {1, 2}$ . The smaller the $σ_{z}^{2}$ , the more information this prior provides about the true values. We assume the decision maker knows the shape of the prior, but can’t directly observe the drawn true values. In other words, the decision maker a priori knows the range of values associated with the items they need to compare, but does not know what exact items to expect nor what their associated rewards will be. For example, one such draw might result in a donut and an apple, each of which has an associated value to the decision maker (i.e. satisfaction upon eating it). In each nth time step of length $δ t$ , they observe noisy samples centered around the true values, called momentary evidence, $δ x_{j, n} | z_{j} \sim 𝒩 (z_{j} δ t, 2 σ_{x}^{2} δ t)$ . In Tajima et al., 2016 , the variance of the momentary evidence was $σ_{x}^{2} δ t$ rather than $2 σ_{x}^{2} δ t$ . We here added the factor 2 without loss of generality to relate it more closely to the attention-modulated version we introduce further below. The variance $2 σ_{x}^{2}$ here controls how informative the momentary evidence is about the associated true value. A large $σ_{x}^{2}$ implies larger noise, and therefore less information provided by each of the momentary evidence samples. While the model is agnostic to the origin of these samples, they might arise from computations to infer the items’ values (e.g. how much do I currently value the apple?), memory recall (e.g. how much did I previously value the apple?), or a combination thereof (Shadlen and Shohamy, 2016). As the decision maker’s aim is to choose the higher valued item, they ought to accumulate evidence for some time to refine their belief in the items’ values. Once they have accumulated evidence for $t = N δ t$ seconds, their posterior belief for the value associated with either item is

z_{j} | δ x_{j, 1 : N} \sim 𝒩 (\frac{σ_{x}^{2} σ_{z}^{- 2} \bar{z} + \frac{1}{2} x_{j} (t)}{σ_{x}^{2} σ_{z}^{- 2} + \frac{1}{2} t}, \frac{σ_{x}^{2}}{σ_{x}^{2} σ_{z}^{- 2} + \frac{1}{2} t}),

(1)

where $x_{j} (t) = \sum_{n = 1}^{N} δ x_{j, n}$ is the accumulated evidence for item $j$ (Tajima et al., 2016). The mean of this posterior (i.e. the first fraction in brackets) is a weighted sum of the prior mean, $\bar{z}$ , and the accumulated evidence, $x_{j} (t)$ . The weights are determined by the accumulation time ( $t$ ), and the variances of the prior ( $σ_{z}^{2}$ ) and the momentary evidence ( $σ_{x}^{2}$ ), which control their respective informativeness. Initially, $t = 0$ and $x_{j} (t) = 0$ , such that the posterior mean equals that of the prior, $\bar{z}$ . Over time, with increasing $t$ , the influence of $x_{j} (t)$ becomes dominant, and the mean approaches $x_{j} (t) / t$ (i.e. the average momentary evidence) for a large $t$ , at which point the influence of the prior becomes negligible. The posterior’s variance (i.e. the second fraction in brackets) reflects the uncertainty in the decision maker’s value inference. It initially equals the prior variance, $σ_{z}^{2}$ , and drops toward zero once $t$ becomes large. In this attention-free model, uncertainty monotonically decreases identically over time for both items, reflecting the standard assumption of accumulation-to-bound models that, in each small time period, the same amount of evidence is gathered for either choice item.

To introduce attention-modulation, we assume that attention limits information about the unattended item (Figure 1). This is consistent with behavioral and neurophysiological findings showing that attention boosts behavioral performance (Cohen and Maunsell, 2009; Cohen and Maunsell, 2010; Wang and Krauzlis, 2018) and the information encoded in neural populations (Ni et al., 2018; Ruff et al., 2018; Wittig et al., 2018). To achieve this, we first assume that the total rate of evidence across both items, as controlled by $σ_{x}^{2}$ , is fixed, and that attention modulates the relative amount of information gained about the attended versus unattended item. This 'attention bottleneck' is controlled by $κ$ ( $0 \leq κ \leq 1$ ), such that $κ$ represents the proportion of total information received for the unattended item, versus $1 - κ$ for the attended item. The decision maker can control which item to attend to, but has no control over the value of $κ$ , which we assume is fixed and known. To limit information, we change the momentary evidence for the attended item $j$ to $δ x_{j, n} \sim 𝒩 (z_{j} δ t, \frac{1}{1 - κ} σ_{x}^{2} δ t)$ , and for the unattended item $k = 3 - j$ to $δ x_{k, n} \sim 𝒩 (z_{k} δ t, \frac{1}{κ} σ_{x}^{2} δ t)$ . Therefore, if $κ \leq \frac{1}{2}$ , the variance of the unattended item increases (i.e. noisier evidence) relative to the attended item. This makes the momentary evidence less informative about z_k, and more informative about z_j, while leaving the overall amount of information unchanged (see Materials and methods). Setting $κ = \frac{1}{2}$ indicates equally informative momentary evidence for both items, and recovers the attention-free scenario (Tajima et al., 2016).

Lowering information for the unattended item impacts the value posteriors as follows. If the decision maker again accumulates evidence for some time $t = N δ t$ , their belief about item $j = 1$ ’s value changes from Equation (1) to

z_{1} | δ x_{1, 1 : N} \sim 𝒩 (\frac{σ_{x}^{2} σ_{z}^{- 2} \bar{z} + (1 - κ) X_{1} (t)}{σ_{x}^{2} σ_{z}^{- 2} + (1 - κ) t_{1} + κ t_{2}}, \frac{σ_{x}^{2}}{σ_{x}^{2} σ_{z}^{- 2} + (1 - κ) t_{1} + κ t_{2}}),

(2)

where t₁ and t₂, which sum up to the total accumulation time ( $t = t_{1} + t_{2}$ ), are the durations that items 1 and 2 have been attended, respectively. The accumulated evidence $X_{1} (t)$ now isn’t simply the sum of all momentary pieces of evidence, but instead down-weights them by $\frac{1 - κ}{κ}$ if the associated item is unattended (see Materials and methods). This prevents the large inattention noise from swamping the overall estimate (Drugowitsch et al., 2014). An analogous expression provides the posterior $z_{2} | δ x_{2, 1 : N}$ for item 2 (see Appendix 1).

The attention modulation of information is clearly observable in the variance of the value’s posterior for item 1 (Equation 2). For $κ < \frac{1}{2}$ , this variance, which is proportional to the decision maker’s uncertainty about the option’s value, drops more quickly over time if item 1 rather than item 2 is attended (i.e. if t₁ rather than t₂ increases). Therefore, it depends on how long each of the two items have been attended to, and might differ between the two items across time (Figure 1C). As a result, decision performance depends on how much time is allocated to attending to each item.

The decision maker’s best choice at any point in time is to choose the item with the larger expected value, as determined by the value posterior. However, the posterior by itself does not determine when it is best to stop accumulating evidence. In our previous attention-free model, we addressed the optimal stopping time by assuming that accumulating evidence comes at cost $c$ per second, and found the optimal decision policy under this assumption (Tajima et al., 2016). Specifically, at each time step of the decision-making process, the decision maker could choose between three possible actions. The first two actions involve immediately choosing one of the two items, which promises the associated expected rewards. The third action is to accumulate more evidence that promises more evidence, better choices, and higher expected reward, but comes at a higher cost for accumulating evidence. We found the optimal policy using dynamic programming that solves this arbitration by constructing a value function that, for each stage of the decision process, returns all expected rewards and costs from that stage onward (Bellman, 1952; Bertsekas, 1995). The associated policy could then be mechanistically implemented by an accumulation-to-bound model that accumulates the difference in expected rewards, $Δ = ⟨ z_{2} | δ x_{2, 1 : N} ⟩ - ⟨ z_{1} | δ x_{1, 1 : N} ⟩$ , and triggers a choice once one of two decision boundaries, which collapse over time, is reached (Tajima et al., 2016).

Once we introduce attention, a fourth action becomes available: the decision maker can choose to switch attention to the currently unattended item (Figure 1B). If such a switch comes at no cost, then the optimal strategy would be to continuously switch attention between both items to sample them evenly across time. We avoid this physically unrealistic scenario by introducing a cost c_s for switching attention. This cost may represent the physical effort of switching attention, the temporal cost of switching (Wurtz, 2008; Cassey et al., 2013), or a combination of both. Overall, this leads to a value function defined over a four-dimensional space: the expected reward difference $Δ$ , the evidence accumulation times t₁ and t₂, and the currently attended item $y \in {1, 2}$ (see Appendix 1). As the last dimension can only take one of two values, we can equally use two three-dimensional value functions. This results in two associated policies that span the three-dimensional state space $(Δ, t_{1}, t_{2})$ (Figure 2).

Figure 2. — (A) The optimal policy space. The policy space can be divided into regions associated with different optimal actions (choose item 1 or 2, accumulate more evidence, switch attention). The boundaries between these regions can be visualized as contours in this space. The three panels on the right show cross-sections after slicing the space at different $Δ$ values, indicated by the gray slices in the left panel. Note that when $Δ = 0$ (middle panel), the two items have equal value and therefore there is no preference for one item over the other. (B) Optimal policy spaces for different values of $y$ (currently attended item). The two policy spaces are mirror-images of each other. (C) Example deliberation process of a single trial demonstrated by a particle that diffuses across the optimal policy space. In this example, the model starts by attending to item 1, then makes two switches in attention before eventually choosing item 1. The bottom row shows the plane in which the particle diffuses. Note that the particle diffuses on the (gray, shaded) plane perpendicular to the time axis of the unattended item, such that it only increases in t_j when attending to item j. Also note that the policy space changes according to the item being attended to, as seen in (B). See results text for more detailed description. See Figure 2—figure supplement 1 to view changes in the optimal policy space depending on changes to model parameters.

Figure 2—figure supplement 1. — (A) The optimal policy space. The policy space can be divided into regions associated with different optimal actions (choose item 1 or 2, accumulate more evidence, switch attention). The boundaries between these regions can be visualized as contours in this space. The three panels on the right show cross-sections after slicing the space at different $Δ$ values, indicated by the gray slices in the left panel. Note that when $Δ = 0$ (middle panel), the two items have equal value and therefore there is no preference for one item over the other. (B) Optimal policy spaces for different values of $y$ (currently attended item). The two policy spaces are mirror-images of each other. (C) Example deliberation process of a single trial demonstrated by a particle that diffuses across the optimal policy space. In this example, the model starts by attending to item 1, then makes two switches in attention before eventually choosing item 1. The bottom row shows the plane in which the particle diffuses. Note that the particle diffuses on the (gray, shaded) plane perpendicular to the time axis of the unattended item, such that it only increases in t_j when attending to item j. Also note that the policy space changes according to the item being attended to, as seen in (B). See results text for more detailed description. See Figure 2—figure supplement 1 to view changes in the optimal policy space depending on changes to model parameters.

Features of the optimal policy

At any point within a decision, the model’s current state is represented by a location in this 3D policy space, such that different regions in this space designate the optimal action to perform (i.e. choose, accumulate, switch). The boundaries between these regions can be visualized as contours in this 3D state space (Figure 2A). As previously discussed, there are two distinct policy spaces for when the decision maker is attending to item 1 versus item 2 that are symmetric to each other (Figure 2B).

Within a given decision, the deliberation process can be thought of as a particle that drifts and diffuses in this state space. The model starts out attending to an item at random ( $y \in 1, 2$ ), which determines the initial policy space (Figure 2B). Assume an example trial where the model attends to item 1 initially ( $y = 1$ ). At trial onset, the decision maker holds the prior belief, such that the particle starts on the origin ( $Δ = 0$ , $t_{1} = t_{2} = 0$ ) which is within the ‘accumulate’ region. As the model accumulates evidence, the particle will move on a plane perpendicular to $t_{2} = 0$ , since t₂ remains constant while attending to item 1 (Figure 2C, first column). During this time, evidence about the true values of both items will be accumulated, but information regarding item 2 will be significantly noisier (as controlled by $κ$ ). Depending on the evidence accumulated regarding both items, the particle may hit the boundary for ‘choose 1’, ‘choose 2’, or 'switch (attention)'. Assume the particle hits the ‘switch’ boundary, indicating that the model is not confident enough to make a decision after the initial fixation to item 1. In other words, the difference in expected rewards between the two items is too small to make an immediate decision, and it is deemed advantageous to collect more information about the currently unattended item. Now, the model is attending to item 2, and the policy space switches accordingly ( $y = 2$ ). The particle, starting from where it left off, will now move on a plane perpendicular to the t₁ axis (Figure 2C, second column). This process is repeated until the particle hits a decision boundary (Figure 2C, third column). Importantly, these shifts in attention are endogenously generated by the model as a part of the optimal decision strategy — it exploits its ability to control how much information it receives about either item’s value.

The optimal policy space shows some notable properties. As expected, the ‘switch’ region in a given policy space is always encompassed in the ‘accumulate’ region of the other policy space, indicating that the model never switches attention or makes a decision immediately after an attention switch. Furthermore, the decision boundaries in 3D space approach each other over time, consistent with previous work that showed a collapsing 2D boundary for optimal value-based decisions without attention (Tajima et al., 2016). The collapsing bound reflects the model’s uncertainty regarding the difficulty of the decision task (Drugowitsch et al., 2012). In our case, this difficulty depends on how different the true item values are, as items of very different values are easier to distinguish than those of similar value. If the difficulty is known within and across choices, the boundaries will not collapse over time, and their (fixed) distance will reflect the difficulty of the choice. However, since the difficulty of individual choices varies and is a priori unknown to the decision maker in our task, the decision boundary collapses so that the model minimizes wasting time on a choice that is potentially too difficult.

The optimal model had five free parameters that affect its behavior: (1) variance of evidence accumulation ( $σ_{x}^{2}$ ), (2) variance of the prior distribution ( $σ_{z}^{2}$ ), (3) cost of evidence accumulation ( $c [s^{- 1}]$ ), (4) cost of switching attention (c_s), and (5) relative information gain from the attended vs. unattended items ( $κ$ ). The contour of the optimal policy boundaries changes in intuitive ways as these parameters are adjusted (Figure 2—figure supplement 1). Increasing the noisiness of evidence accumulation ( $σ_{x}^{2}$ ) causes an overall shrinkage of the evidence accumulation space. This allows the model to reach a decision boundary more quickly under a relatively higher degree of uncertainty, given that evidence accumulation is less reliable but equally costly. Similarly, increasing the cost of accumulating evidence ( $c$ ) leads to a smaller accumulation space so that the model minimizes paying a high cost for evidence accumulation. Increasing the switch cost c_s leads to a smaller policy space for the ‘switch’ behavior, since there is an increased cost for switching attention. Similarly, decreasing the inattention noise by setting $κ$ closer to $\frac{1}{2}$ leads to a smaller ‘switch’ space because the model can obtain more reliable information from the unattended item, reducing the necessity to switch attention. To find a set of parameters that best mimic human behavior, we performed a random search over a large parameter space and selected the parameter set that best demonstrated the qualitative aspects of the behavioral data (see Appendix 1).

The optimal policy replicates human behavior

To assess if the optimal policy features the same decision-making characteristics as human decision makers, we used it to simulate behavior in a task analogous to the value-based decision task performed by humans in Krajbich et al., 2010. Briefly, in this task, participants first rated their preference of different snack items on a scale of −10 to 10. Then, they were presented with pairs of different snacks after excluding the negatively rated items and instructed to choose the preferred item. While they deliberate on their choice, the participants’ eye movements were tracked and the fixation duration to each item was used as a proxy for visual attention.

We simulated decision-making behavior using value distributions similar to those used in the human experiment (see Materials and methods), and found that the model behavior qualitatively reproduces essential features of human choice behavior (Figure 3, Figure 3—figure supplement 1). As expected in value-based decisions, a larger value difference among the compared items made it more likely for the model to choose the higher-valued item (Figure 3A; $t (38) = 105.7, p < 0.001$ ). Furthermore, the model’s mean response time (RT) decreased with increasing value difference, indicating that less time was spent on trials that were easier (Figure 3B; $t (38) = - 11.1, p < 0.001$ ). Of note, while human RTs appeared to drop linearly with increasing value difference, our model’s drop was concave across a wide range of model parameters (Figure 3—figure supplement 1C). The model also switched attention less for easier trials, indicating that difficult trials required more evidence accumulation from both items, necessitating multiple switches in attention (Figure 3C; $t (38) = - 8.10, p < 0.001$ ). Since the number of switches is likely correlated with response time, we also looked at switch rate (number of switches divided by response time). Here, although human data showed no relationship between switch rate and trial difficulty, model behavior showed a positive relationship, suggesting an increased rate of switching for easier trials. However, this effect was absent when using the same number of trials as humans, and did not generalize across all model parameter values (Figure 3—figure supplement 1D–G).

Figure 3. — (A) Monotonic increase in probability of choosing item 1 as a function of the difference in value between item 1 and 2 ( $t (38) = 105.7, p < 0.001$ ). (B) Monotonic decrease in response time (RT) as a function of trial difficulty ( $t (38) = - 11.1, p < 0.001$ ). RT increases with increasing difficulty. (C) Decrease in the number of attention switches as a function of trial difficulty. More switches are made for harder trials ( $t (38) = - 8.10, p < 0.001$ ). (D) Effect of last fixation location on item preference. The item that was fixated on immediately prior to the decision was more likely to be chosen. (E) Attention’s biasing effect on item preference. The item was more likely to be chosen if it was attended for a longer period of time ( $t (38) = 5.32, p < 0.001$ ). Since the probability of choosing item 1 depends on the degree of value difference between the two items, we normalized the p(choose item 1) by subtracting the average probability of choosing item 1 for each difference in item value. (F) Replication of fixation pattern during decision making. Both model and human data showed a fixation pattern where a short initial fixation was followed by a longer, then medium-length fixation. Error bars indicate standard error of the mean (SEM) across both human and simulated participants ( $N = 39$ for both). See Figure 3—figure supplement 2 for an analogous figure for the perceptual decision task.

Figure 3—figure supplement 1. — (A) Monotonic increase in probability of choosing item 1 as a function of the difference in value between item 1 and 2 ( $t (38) = 105.7, p < 0.001$ ). (B) Monotonic decrease in response time (RT) as a function of trial difficulty ( $t (38) = - 11.1, p < 0.001$ ). RT increases with increasing difficulty. (C) Decrease in the number of attention switches as a function of trial difficulty. More switches are made for harder trials ( $t (38) = - 8.10, p < 0.001$ ). (D) Effect of last fixation location on item preference. The item that was fixated on immediately prior to the decision was more likely to be chosen. (E) Attention’s biasing effect on item preference. The item was more likely to be chosen if it was attended for a longer period of time ( $t (38) = 5.32, p < 0.001$ ). Since the probability of choosing item 1 depends on the degree of value difference between the two items, we normalized the p(choose item 1) by subtracting the average probability of choosing item 1 for each difference in item value. (F) Replication of fixation pattern during decision making. Both model and human data showed a fixation pattern where a short initial fixation was followed by a longer, then medium-length fixation. Error bars indicate standard error of the mean (SEM) across both human and simulated participants ( $N = 39$ for both). See Figure 3—figure supplement 2 for an analogous figure for the perceptual decision task.

The model also reproduced the biasing effects of fixation on preference seen in humans (Krajbich et al., 2010). An item was more likely to be chosen if it was the last one to be fixated on (Figure 3D), and if it was viewed for a longer time period (Figure 3E; $t (38) = 5.32, p < 0.001$ ). Interestingly, the model also replicated a particular fixation pattern seen in humans, where a short first fixation is followed by a significantly longer second fixation, which is followed by a medium-length third fixation (Figure 3F). We suspect this pattern arises due to the shape of the optimal decision boundaries, where the particle is more likely to hit the ‘switch’ boundary in a shorter time for the first fixation, likely reflecting the fact that the model prefers to sample from both items at least once. Consistent with this, Figure 3C shows that the ‘accumulate’ space is larger for the second fixation compared to the first fixation. Of note, the attentional drift diffusion model (aDDM) that was initially proposed to explain the observed human data did not generate its own fixations, but rather used fixations sampled from the empirical distribution of human subjects. Furthermore, they were only able to achieve this fixation pattern by sampling the first fixation, which was generally shorter than the rest, separately from the remaining fixation durations (Krajbich et al., 2010; Figure 4—figure supplement 3E).

One feature that distinguishes our model from previous attention-based decision models is that attention only modulates the variance of momentary evidence without explicitly down-weighting the value of the unattended item (Krajbich et al., 2010; Song et al., 2019). Therefore, at first glance, preference for the more-attended item is not an obvious feature since our model does not appear to boost its estimated value. However, under the assumption that decision makers start out with a zero-mean prior, Bayesian belief updating with attention modulation turns out to effectively account for a biasing effect of fixation on the subjective value of items (Li and Ma, 2019). For instance, consider choosing between two items with equal underlying value. Without an attention-modulated process, the model will accumulate evidence from both items simultaneously, and thus have no preference for one item over the other. However, once attention is introduced and the model attends to item 1 longer than item 2, it will have acquired more evidence about item 1’s value. This will cause item 1 to have a sharper, more certain likelihood function compared to item 2 (Figure 4A). As posterior value estimates are formed by combining priors and likelihoods in proportion to their associated certainties, the posterior of item 1 will be less biased towards the prior than that of item 2. This leads to a higher subjective value of item 1 compared to that of item 2 even though their true underlying values are equal.

Figure 4. — (A) Bayesian explanation of attention-driven value preference. Attending to one of two equally-valued items for a longer time (red vs. blue) leads to a more certain (i.e. narrower) likelihood and weaker bias of its posterior towards the prior. This leads to a subjectively higher value for the longer attended item. (B) Effect of response time (RT; left panel; $t (38) = - 3.25, p = 0.0024$ ) and sum of the two item values (value sum; right panel; $t (38) = 2.95, p = 0.0054$ ) on attention-driven choice bias in humans. This choice bias quantifies the extent to which fixations affect choices for the chosen subset of trials (see Materials and methods). (C) Effect of response time (left panel; $t (38) = - 32.0, p < 0.001$ ) and sum of the two item values (right panel; $t (38) = 11.4, p < 0.001$ ) on attention-driven choice bias in the optimal model. See Materials and methods for details on how the choice bias coefficients were computed. For (B) and (C), for the left panels, the horizontal axis is binned according to the number of total fixations in a given trial. For the right panels, the horizontal axis is binned to contain the same number of trials per bin. Horizontal error bars indicate SEM across participants of the mean x-values within each bin. Vertical error bars indicate SEM across participants. (D) Comparing decision performance between the optimal policy and the original aDDM model. Performance of the aDDM was evaluated for different boundary heights (error bars = SEM across simulated participants). Even for the reward-maximizing aDDM boundary height, the optimal model significantly outperformed the aDDM ( $t (38) = 3.01, p = 0.0027$ ). (E) Decision performance for different degrees of the attention bottleneck ( $κ$ ) while leaving the overall input information unchanged (error bars = SEM across simulated participants). The performance peak at $κ = 0.5$ indicates that allocating similar amounts of attentional resource to both items is beneficial ( $t (38) = - 8.51, p < 0.001$ ).

Figure 4—figure supplement 1. — (A) Bayesian explanation of attention-driven value preference. Attending to one of two equally-valued items for a longer time (red vs. blue) leads to a more certain (i.e. narrower) likelihood and weaker bias of its posterior towards the prior. This leads to a subjectively higher value for the longer attended item. (B) Effect of response time (RT; left panel; $t (38) = - 3.25, p = 0.0024$ ) and sum of the two item values (value sum; right panel; $t (38) = 2.95, p = 0.0054$ ) on attention-driven choice bias in humans. This choice bias quantifies the extent to which fixations affect choices for the chosen subset of trials (see Materials and methods). (C) Effect of response time (left panel; $t (38) = - 32.0, p < 0.001$ ) and sum of the two item values (right panel; $t (38) = 11.4, p < 0.001$ ) on attention-driven choice bias in the optimal model. See Materials and methods for details on how the choice bias coefficients were computed. For (B) and (C), for the left panels, the horizontal axis is binned according to the number of total fixations in a given trial. For the right panels, the horizontal axis is binned to contain the same number of trials per bin. Horizontal error bars indicate SEM across participants of the mean x-values within each bin. Vertical error bars indicate SEM across participants. (D) Comparing decision performance between the optimal policy and the original aDDM model. Performance of the aDDM was evaluated for different boundary heights (error bars = SEM across simulated participants). Even for the reward-maximizing aDDM boundary height, the optimal model significantly outperformed the aDDM ( $t (38) = 3.01, p = 0.0027$ ). (E) Decision performance for different degrees of the attention bottleneck ( $κ$ ) while leaving the overall input information unchanged (error bars = SEM across simulated participants). The performance peak at $κ = 0.5$ indicates that allocating similar amounts of attentional resource to both items is beneficial ( $t (38) = - 8.51, p < 0.001$ ).

This insight leads to additional predictions for how attention-modulated choice bias should vary with certain trial parameters. For instance, the Bayesian account predicts that trials with longer response times should have a weaker choice bias than trials with shorter response times. This is because the difference in fixation times between the two items will decrease over time as the model has more opportunities to switch attention. Both the human and model behavior robustly showed this pattern (Figure 4B; human, $t (38) = - 3.25, p = 0.0024$ ; model, $t (38) = - 32.0, p < 0.001$ ). Similarly, choice bias should increase for trials with higher valued items. In this case, since the evidence distribution is relatively far away from the prior distribution, the posterior distribution is ‘pulled away’ from the prior distribution to a greater degree for the attended versus unattended item, leading to greater choice bias. Both human and model data confirmed this behavioral pattern (Figure 4C; human, $t (38) = 2.95, p = 0.0054$ ; model, $t (38) = 11.4, p < 0.001$ ). Since response time may be influenced by the sum of the two item values and vice versa, we repeated the above analyses using a regression model that includes both value sum and response time as independent variables (see Materials and methods). The results were largely consistent for both model (effect of RT on choice bias: $t (38) = - 5.73, p < 0.001$ , effect of value sum: $t (38) = 7.88, p < 0.001$ ) and human (effect of RT: $t (38) = - 1.32, p = 0.20$ , effect of value sum: $t (38) = 2.91, p = 0.006$ ) behavior.

Next, we assessed how the behavioral predictions arising from the optimal model differed from those of the original attentional drift diffusion model (aDDM) proposed by Krajbich et al., 2010. Unlike our model, the aDDM follows from traditional diffusion models rather than Bayesian models. It assumes that inattention to an item diminishes its value magnitude rather than increasing the noisiness of evidence accumulation. Despite this difference, the aDDM produced qualitatively similar behavioral predictions as the optimal model (see Figure 4—figure supplement 1, Figure 4—figure supplement 2, and Figure 4—figure supplement 3 for additional behavioral comparisons between human data, the optimal model, and aDDM). We also tested to which degree the optimal model yielded a higher mean reward than the aDDM, which, despite its simpler structure, could nonetheless collect competitive amounts of reward. Given that our model provides the optimal solution to the decision problem under the current assumptions, it is expected to outperform, or at least match, the performance of alternative models. To ensure a fair comparison, we adjusted the aDDM model parameters (i.e. attentional value discounting and the noise variance) so that the momentary evidence provided to the two models has equivalent signal-to-noise ratios (see Appendix 1). Using the same parameters fit to human behavior without this adjustment in signal-to-noise ratio yielded a higher mean reward for the aDDM model ( $t (76) = - 14.8, p < 0.001$ ), since the aDDM receives more value information at each time point than the optimal model. The original aDDM model fixed the decision boundaries at ±1 and subsequently fit model parameters to match behavioral data. Since we were interested in comparing mean reward, we simulated model behavior using incrementally increasing decision barrier heights, looking for the height that yields the maximum mean reward (Figure 4D). We found that even for the best-performing decision barrier height, the signal-to-noise ratio-matched aDDM model yielded a significantly lower mean reward compared to that of the optimal model ( $t (76) = 3.01, p = 0.0027$ ).

Recent advances in artificial intelligence used attentional bottlenecks to regulate information flow with significant associated performance gains (Bahdanau et al., 2015; Gehring et al., 2017; Mnih et al., 2014; Ba et al., 2015; Sorokin et al., 2015). Analogously, attentional bottlenecks might also be beneficial for value-based decision-making. To test this, we asked if paying relatively full attention on a single item at a time confers any advantages over the ability to pay relatively less reliable, but equal attention to multiple options in parallel. To do so, we varied the amount of momentary evidence provided about both the attended and unattended items while keeping the overall amount of evidence, as controlled by $σ_{x}^{2}$ , fixed. This was accomplished by varying the $κ$ term. The effect of $κ$ on the optimal policy was symmetric around $κ = 0.5$ , such that information gained from attended item at $κ = 0.2$ is equal to that of the unattended item at $κ = 0.8$ . Setting $κ = 0.5$ resulted in equal momentary evidence about both items, such that switching attention had no effect on the evidence collected about either item. When tuning model parameters to best match human behavior, we found a low $κ \approx 0.004$ , suggesting that humans tend to allocate the majority of their presumably fixed cognitive resources to the attended item. This allows for reliable evidence accumulation for the attended item, but is more likely to necessitate frequent switching of attention.

To investigate whether widening this attention bottleneck leads to changes in decision performance, we simulated model behavior for different values of $κ$ (0.1 to 0.9, in 0.1 increments). Interestingly, we found that mean reward from the simulated trials is greatest at $κ = 0.5$ and decreases for more extreme values of $κ$ , suggesting that a more even distribution of attentional resources between the two items is beneficial for maximizing reward ( $t (38) = - 8.51, p < 0.001$ ).

Optimal attention-modulated policy for perceptual decisions

The impact of attention is not unique to value-based decisions. In fact, recent work showed that fixation can bias choices in a perceptual decision-making paradigm (Tavares et al., 2017). In their task, participants were first shown a target line with a certain orientation, then shown two lines with slightly different orientations. The goal was to choose the line with the closest orientation to the previously shown target. Consistent with results in the value-based decision task, the authors demonstrated that the longer fixated option was more likely to be chosen.

We modified our attention-based optimal policy to perform in such perceptual decisions, in which the goal was to choose the option that is the closest in some quantity to the target, rather than choosing the higher valued option. Therefore, our model can be generalized to any task that requires a binary decision based on some perceptual quality, whether that involves finding the brighter dot between two dots on a screen, or identifying which of the two lines on the screen is longer. Similar to our value-based case, the optimal policy for perceptual decisions was successful at reproducing the attention-driven biases seen in humans in Tavares et al., 2017, (Figure 3—figure supplement 2).

Discussion

In this work, we derive a novel normative decision-making model with an attentional bottleneck, and show that it is able to reproduce the choice and fixation patterns of human decision makers. Our model significantly extends prior attempts to incorporate attention into perceptual and value-based decision-making in several ways. First, we provide a unified framework in which fixations are endogenously generated as a core component of the normative decision-making strategy. This is consistent with previous work that showed that fixation patterns were influenced by variables relevant for the decision, such as trial difficulty or the value of each choice item (Krajbich et al., 2010; Krajbich and Rangel, 2011). However, prior models of such decisions assumed an exogenous source of fixations (Krajbich et al., 2010; Krajbich and Rangel, 2011) or generated fixations using heuristics that relied on features such as the salience or value estimates of the choice options (Towal et al., 2013; Gluth et al., 2020). Other models generated fixations under the assumption that fixation duration should depend on the expected utility or informativeness of the choice items (Cassey et al., 2013; Ke et al., 2016; Song et al., 2019). For example, (Cassey et al., 2013) assumed that the informativeness of each item differed, which means the model should attend to the less informative item longer in general. Furthermore, since their decision task involved a fixed-duration, attention switches also occurred at fixed times rather than being dynamically adjusted across time, as in our case with a free-response paradigm. A recent normative model supported a continuous change of attention across choice items, and so could not relate attention to the observed discrete fixation changes (Hébert and Woodford, 2019). Our work significantly builds on these prior models by identifying the exact optimal policy using dynamic programming, demonstrating that fixation patterns could reflect active information gathering through controlling an attentional bottleneck. This interpretation extends previous work on visual attention to the realm of value-based and perceptual decision-making (Yang et al., 2016; Hoppe and Rothkopf, 2016; Hayhoe and Ballard, 2005; Chukoskie et al., 2013; Corbetta and Shulman, 2002).

Second, our model posits that attention lowers the variance of the momentary evidence associated with the attended item, which enhances the reliability of its information (Drugowitsch et al., 2014). In contrast, previous models accounted for attention by down-weighting the value of the unattended item (Krajbich et al., 2010; Krajbich and Rangel, 2011; Song et al., 2019), where one would a priori assume fixations to bias choices. Our approach was inspired by neurophysiological findings demonstrating that visual attention selectively increases the firing rate of neurons tuned to task-relevant stimuli (Reynolds and Chelazzi, 2004), decreases the mean-normalized variance of individual neurons (Mitchell et al., 2007; Wittig et al., 2018), and reduces the correlated variability of neurons at the population level (Cohen and Maunsell, 2009; Mitchell et al., 2009; Averbeck et al., 2006). In essence, selective attention appears to boost the signal-to-noise ratio, or the reliability of information encoded by neuronal signals rather than alter the magnitude of the value encoded by these signals. One may argue that we could have equally chosen to boost the evidence’s mean while keeping its variance constant to achieve a similar boost in signal-to-noise ratio of the attended item. However, doing so would still distinguish our model from previous accumulation-to-bound models, as Bayes-optimal evidence accumulation in this model variant nonetheless demands the use of at least three dimensions (see Figure 2), and could not be achieved in the two dimensions used by previous models. Furthermore, this change would have resulted in less intuitive equations for the value posterior (Equation 2).

Under this framework, we show that the optimal policy can be implemented as a four-dimensional accumulation-to-bound model where the particle drifts and diffuses according to the fixation duration to either item, the currently attended item, and the difference the in items’ value estimates. This policy space is significantly more complex compared to previous attention-free normative models, which can be implemented in a two-dimensional space. Nevertheless, the attention-modulated optimal policy still featured a collapsing boundary in time consistent with the attention-free case (Drugowitsch et al., 2012; Tajima et al., 2016).

When designing our model, we took the simplest possible approach to introduce an attentional bottleneck into normative models of decision-making. Our aim was to provide a precise (i.e. without approximations), normative explanation for how fixation changes qualitatively interact with human decisions rather than quantitatively capture all details of human behavior, which is likely driven by additional heuristics and features beyond the scope of our model (Acerbi et al., 2014; Drugowitsch et al., 2016). For instance, it has been suggested that normative allocation of attention should also depend on the item values to eliminate non-contenders, which we did not incorporate as a part of our model (Towal et al., 2013; Gluth et al., 2020). Perhaps as a result of this approach, our model did not provide the best quantitative fit and was unable to capture all of the nuances of the psychometric curves from human behavior, including a seemingly linear relationship between RT and trial difficulty (Figure 3). As such, we expect other models using approximations to have a better quantitative fit to human data (Krajbich et al., 2010; Callaway et al., 2020). Instead, a normative understanding can provide a basis for understanding limitations and biases that emerge in human behavior. Consistent with this goal, we were able to qualitatively capture a wide range of previously observed features of human decisions (Figure 3), suggest a computational rationale for fixation-based choice biases (Figure 4A), and confirm new predictions arising from our theory (Figure 4B–C). In addition, our framework is compatible with recent work by Sepulveda et al., 2020 that demonstrated that attention can bias choices toward the lower-valued option if participants are instructed to choose the less desirable item (see Appendix 1).

Due to the optimal policy’s complexity (Figure 2), we expect the nervous system to implement it only approximately (e.g. similar to Tajima et al., 2019 for multi-alternative decisions). Such an approximation has been recently suggested by Callaway et al., 2020, where they proposed a model of N-alternative choice using approaches from rational inattention to approximate optimal decision-making in the presence of an attentional bottleneck. Unlike our work, they assumed that the unattended item is completely ignored, and therefore could not investigate the effect of graded shifts of attentional resources between items (Figure 4E). In addition, their model did not predict a choice bias in binary choices due to a different assumption about the Bayesian prior.

In our model, we assumed the decision maker’s prior belief about the item values is centered at zero. In contrast, Callaway et al., 2020 chose a prior distribution based on the choice set, centered on the average value of only the tested items. While this is also a reasonable assumption (Shenhav et al., 2018), it likely contributed to their inability to demonstrate the choice bias for binary decisions. Under the assumption of our zero-mean prior, formulating the choice process through Bayesian inference revealed a simple and intuitive explanation for choice biases (Figure 4A) (see also Li and Ma, 2020). This explanation required the decision maker to a-priori believe the items’ values to be lower than they actually are when choosing between appetitive options, consistent with evidence that item valuations vary inversely with the average value of recently observed items (Khaw et al., 2017). The zero-mean prior also predicts an opposite effect of the choice bias when deciding between aversive items, such that less-fixated items should become the preferred choice. This is exactly what has been observed in human decision makers (Armel et al., 2008). We justified using a zero-mean bias by pointing out that participants in the decision task were allowed to rate items as having both positive or negative valence (negative-valence items were excluded from the binary decision task). However, there is some evidence that humans also exhibit choice biases when only choosing between appetitive items (Cavanagh et al., 2014; Smith and Krajbich, 2018; Smith and Krajbich, 2019). Although our setup suggests a zero-mean prior is required to reproduce the choice bias, the exact features and role of the Bayesian prior in human decisions still remains an open question for future work.

We show that narrowing the attentional bottleneck by setting $κ$ to values closer to 0 or 1 does not boost performance of our decision-making model (Figure 4E). Instead, spreading a fixed cognitive reserve evenly between the attended and unattended items maximized performance. This is consistent with prior work that showed that a modified drift diffusion model with a continuously varying attention would perform optimally when attention is always equally divided (Fudenberg et al., 2018). However, this does not necessarily imply that equally divided attention always constitutes the normative behavior. If the decision maker has already paid more attention to one item over the other within a decision, it may be optimal to switch attention and gain more information about the unattended item rather than to proceed with equally divided attention.

Parameters fit to human behavior reveal that humans tend to allocate a large proportion of their cognitive resource toward the attended item, suggesting that the benefits of an attentional bottleneck might lie in other cognitive processes. Indeed, machine learning applied to text translation (Bahdanau et al., 2015; Gehring et al., 2017), object recognition (Mnih et al., 2014; Ba et al., 2015), and video-game playing (Sorokin et al., 2015) benefits from attentional bottlenecks that allow the algorithm to focus resources on specific task subcomponents. For instance, image classification algorithms that extract only the relevant features of an image for high-resolution processing demonstrated improved performance and reduced computational cost compared to those without such attentional features (Mnih et al., 2014). Similarly, attentional bottlenecks that appear to limit human decision-making performance might have beneficial effects on cognitive domains outside the scope of binary value-based decisions. This is consistent with the idea that the evolutionary advantage of selective attention involves the ability to rapidly fixate on salient features in a cluttered environment, thereby limiting the amount of information that reaches upstream processing and reducing the overall computational burden (Itti and Koch, 2001).

An open question is whether our findings can be generalized to multi-alternative choice paradigms (Towal et al., 2013; Ke et al., 2016; Gluth et al., 2020; Tajima et al., 2019). While implementing the optimal policy for such choices may be analytically intractable, we can reasonably infer that a choice bias driven by a zero-mean prior would generalize to decisions involving more than two options. However, in a multi-alternative choice paradigm where heuristics involving value and salience of items may influence attention allocation, it is less clear whether an equally divided attention among all options would still maximize reward. We hope this will motivate future studies that investigate the role of attention in more realistic decision scenarios.

Materials and methods

Here, we provide an outline of the framework and its results. Detailed derivations are provided in Appendix 1.

Attention-modulated decision-making model

Before each trial, z₁ and z₂ are drawn from $z_{j} \sim 𝒩 (\bar{z}, σ_{z}^{2})$ . z₁ and z₂ correspond to the value of each item. In each time-step $n > 0$ of duration $δ t$ , the decision maker observes noisy samples of each z_j. This momentary evidence is drawn from $δ x_{j, n} | z_{j} \sim 𝒩 (z_{j} δ t, \frac{1}{1 - κ} σ_{x}^{2} δ t)$ for the attended item $j = y_{n}$ , and $δ x_{k, n} | z_{k} \sim 𝒩 (z_{k} δ t, \frac{1}{κ} σ_{x}^{2} δ t)$ for the unattended item $k \neq y_{n}$ . We measure how informative a single momentary evidence sample is about the associated true value by computing the Fisher information it provides about this value. This Fisher information sums across independent pieces of information. This makes it an adequate measure for assessing the informativeness of momentary evidence, which we assume to be independent across time and items. Computing the Fisher information results in $(1 - κ) σ_{x}^{- 2} δ t$ in $δ x_{j, n}$ about z_j for the attended item, and in $κ σ_{x}^{- 2} δ t$ in $δ x_{k, n}$ about z_k for the unattended item. Therefore, setting $κ \leq \frac{1}{2}$ boosts the information of the attended, and reduces the information of the unattended item, while keeping the total information about both items at a constant $(1 - κ) σ_{x}^{- 2} δ t + κ σ_{x}^{- 2} δ t = σ_{x}^{- 2} δ t$ . The posterior z_j for $j \in {1, 2}$ after $t = N δ t$ seconds is found by Bayes’ rule, $p (z_{j} | δ x_{j, 1 : N}, y_{1 : N}) \propto p (z_{j}) \prod_{n = 1}^{N} p (δ x_{j, n} | z_{j}, y_{n})$ , which results in Equation (2). If $y_{n} \in {1, 2}$ identifies the attended item in each time-step, the attention times in this posterior are given by $t_{1} = δ t \sum_{n = 1}^{N} (2 - y_{n})$ and $t_{2} = δ t \sum_{n = 1}^{N} (y_{n} - 1)$ . The attention-weighted accumulated evidence is $X_{1} (t) = \sum_{n = 1}^{N} {(\frac{1 - κ}{κ})}^{y_{n} - 1} δ x_{1, n}$ and $X_{2} (t) = \sum_{n = 1}^{N} {(\frac{1 - κ}{κ})}^{2 - y_{n}} δ x_{2, n}$ , down-weighting the momentary evidence for periods when the item is unattended. Fixing $κ = 1 / 2$ recovers the attention-free case of Tajima et al., 2016, and the associated posterior, Equation (1).

We found the optimal policy by dynamic programming (Bellman, 1952; Drugowitsch et al., 2012), which, at each point in time, chooses the action that promises the largest expected return, including all rewards and costs from that point into the future. Its central component is the value function that specifies this expected return for each value of the sufficient statistics of the task. In our task, the sufficient statistics are the two posterior means, $⟨ z_{j} | X_{j} (t), t_{1}, t_{2} ⟩$ for $j \in {1, 2}$ , the two accumulation times, t₁ and t₂, and the currently attended item y_n. The decision maker can choose between four actions at any point in time. The first two are to choose one of the two items, which is expected to yield the corresponding reward, after which the trial ends. The third action is to accumulate evidence for some more time $δ t$ , which comes at cost $c δ t$ , and results in more momentary evidence and a corresponding updated posterior. The fourth is to switch attention to the other item $3 - y_{n}$ , which comes at cost $c_{s} > 0$ . As the optimal action is the one that maximizes the expected return, the value for each sufficient statistic is the maximum over the expected returns associated with each action. This leads to the recursive Bellman’s equation that relates values with different sufficient statistics (see Appendix 1 for details) and reveals the optimal action for each of these sufficient statistics. Due to symmetries in our task, it turns out these optimal actions only depend on the difference in posterior means $Δ$ , rather than each of the individual means (see Appendix 1). This allowed us to compute the value function and associated optimal policy in the lower-dimensional $(Δ, t_{1}, t_{2}, y)$ -space, an example of which is shown in (Figure 2).

The optimal policy was found numerically by backwards induction (Tajima et al., 2016; Brockwell and Kadane, 2003), which assumes that at a large enough $t = t_{1} + t_{2}$ , a decision is guaranteed and the expected return equals $Δ$ . We set this time point as $t = 6 s$ based on empirical observations. From this point, we move backwards in small time steps of 0.05 s and traverse different values of $Δ$ which was also discretized into steps of 0.05. Upon completing this exercise, we are left with a three-dimensional grid with the axes corresponding to t₁, t₂ and $Δ$ , where the value assigned to each point in space indicates the optimal decision to take for the given set of sufficient statistics. The boundaries between different optimal actions can be visualized as three-dimensional manifolds (Figure 2).

Model simulations

Using the optimal policy, we simulated decisions in a task analogous to the one humans performed in Krajbich et al., 2010. On each simulated trial, two items with values z₁ and z₂ are presented. The model attends to one item randomly ( $y \in [1, 2]$ ), then starts accumulating noisy evidence and adjusts its behavior across time according to the optimal policy. Since the human data had a total of 39 participants, we simulated the same number of participants ( $N = 39$ ) for the model, but with a larger number of trials. For each simulated participant, trials consisted of all pairwise combinations of values between 0 and 7, iterated 20 times. This yielded a total of 1280 trials per simulated participant.

When computing the optimal policy, there were several free parameters that determined the shape of the decision boundaries. Those parameters included the evidence noise term ( $σ_{x}^{2}$ ), spread of the prior distribution ( $σ_{z}^{2}$ ), cost of accumulating evidence ( $c [s^{- 1}]$ ), cost of switching attention (c_s), and the relative information gain for the attended vs. unattended items ( $κ$ ). In order to find a set of parameters that best mimics human behavior, we performed a random search over a large parameter space and simulated behavior using the randomly selected set of parameters (Bergstra and Bengio, 2012). We iterated this process for 2,000,000 sets of parameters and compared the generated behavior to that of humans (see Appendix 1). After this search process, the parameter set that best replicated human behavior consisted of $c_{s} = 0.0065$ , $c = 0.23$ , $σ_{x}^{2} = 27$ , $σ_{z}^{2} = 18$ , $κ = 0.004$ .

Statistical analysis

The relationship between task variables (e.g. difference in item value) and behavioral measurements (e.g. response time) were assessed by estimating the slope of the relationship for each participant. For instance, to investigate the association between response times and absolute value difference (Figure 3B), we fit a linear regression within each participant using the absolute value difference and response time for every trial. Statistical testing was performed using one-sample t-tests on the regression coefficients across participants. This procedure was used for statistical testing involving Figure 3B,C,E, and Figure 4B,C. To test for the effect of RT and value sum on choice bias after accounting for the other variable, we used a similar approach and used both RT and value sum as independent variables in the regression model and the choice bias coefficient as the dependent variable. To test for a significant peak effect for Figure 4E, we used the same procedure after subtracting 0.5 from the original $κ$ values and taking their absolute value. To compare performance between the optimal model and the aDDM (Figure 4D), we first selected the best-performing aDDM model, then performed an independent-samples t-test between the mean rewards from simulated participants from both models.

To quantify the degree of choice bias (Figure 4B,C), we computed a choice bias coefficient. For a given group of trials, we performed a logistic regression with fixation time difference ( $t_{1} - t_{2}$ ) as the independent variable and a binary-dependent variable indicating whether item 1 was chosen on each trial. After performing this regression within each participant’s data, we performed a t-test of the regression coefficients against zero. The the resulting t-statistic was used as the choice bias coefficient, as it quantified the extent to which fixations affected choice in the given subset of trials.

Data and code availability

The human behavioral data and code are available through an open source license at https://github.com/DrugowitschLab/Optimal-policy-attention-modulated-decisions (Jang, 2021; copy archived at https://archive.softwareheritage.org/swh:1:rev:db4a4481aa6522d990018a34c31683698da039cb/).

Acknowledgements

We thank Ian Krajbich for sharing the behavioral data, and members of the Drugowitsch lab, in particular Anna Kutschireiter and Emma Krause, for feedback on the manuscript. This work was supported by the National Institute of Mental Health (R01MH115554, JD) and the James S McDonnell Foundation (Scholar Award in Understanding Human Cognition, grant# 220020462, JD).

Appendix 1

Here, we describe in more detail the derivations of our results, and specifics of the simulations presented in the main text. Of note, we sometimes use $x | y \sim p (y)$ to specify the conditional density $p (x | y)$ . Furthermore, $𝒩 (μ, σ^{2})$ denotes a Gaussian with mean µ and variance $σ^{2}$ .

1 Task setup

1.1 Latent state prior

We assume two latent states z_j, $j \in {1, 2}$ , (here, the true item values) that are before each choice trial drawn from their Gaussian prior, $z_{j} \sim 𝒩 ({\bar{z}}_{j} σ_{z}^{2})$ , with mean ${\bar{z}}_{j}$ and variance $σ_{z}^{2}$ . Throughout the text, we will assume ${\bar{z}}_{1} = {\bar{z}}_{2}$ , to indicate that there is no a-priori preference of one item over the other.

1.2 Likelihood function of momentary evidence

The decision maker doesn’t observe the latent states, but instead, in each time step of size $δ t$ , observes noisy evidence about both z_j’s. Let us assume that, in the $n$ th such time step, the decision maker attends to item $y_{n} \in {1, 2}$ . Then, they simultaneously observe $δ x_{1}$ and $δ x_{2}$ , distributed as

δ x_{j, n} | y_{n}, z_{j} \sim 𝒩 (z_{j} δ t, {(\frac{1 - κ}{κ})}^{| j - y_{n} |} \frac{σ_{x}^{2}}{1 - κ} δ t),

(1)

where we have defined the attention modulation parameter $κ$ , bounded by $0 \leq κ \leq 1$ (we will usually assume $κ \leq \frac{1}{2}$ ), and the overall likelihood variance $σ_{x}^{2}$ . For the attended item $j = y_{n}$ , we have $| j - y_{n} | = 0$ , such that the the variance of the momentary evidence for this item is $σ_{x}^{2} δ t / (1 - κ)$ . For the unattended item, for which $| j - y_{n} | = 1$ , this variance is instead $σ_{x}^{2} δ t / κ$ . As long as $κ < \frac{1}{2}$ this leads to a larger variance for the unattended item than the attended item, making the momentary evidence more informative for the attended item. In particular if we quantify this information by the Fisher information in the momentary evidence $δ x_{j, n}$ about z_j, then we find this information to be $(1 - κ) σ_{x}^{- 2} δ t$ for the attended, and $κ σ_{x}^{- 2} δ t$ for the unattended item. The total Fisher information across both items is thus $σ_{x}^{- 2} δ t$ , independent of $κ$ . This shows that $σ_{x}^{2}$ controls the total information that the momentary evidence provides about the latent states, whereas $κ$ controls how much of this information is provided for the attended vs. the unattended item.

1.3 An alternative form for the likelihood

While the above form of the likelihood has a nice, intuitive parametrization, it is notationally cumbersome. Therefore, we will here introduce an alternative variance parametrization of this likelihood that simplifies the notation in the derivations that follow. We will use this parametrization for the rest of this Appendix.

This alternative parametrization assumes the variance of the momentary evidence of the attended item to be given by $σ^{2} δ t = σ_{x}^{2} / (1 - κ)$ , while that of the unattended item is given by $γ^{- 1} σ^{2} δ t = σ_{x}^{2} δ t / κ$ , where the new attention modulation parameter $γ$ is assumed bounded by $0 \leq γ \leq 1$ . Thus, the previous parameter pair ${σ_{x}^{2}, κ}$ is replaced by the new pair ${σ^{2}, γ}$ . A $γ < 1$ results in an increased variance for the unattended item, resulting in less information about the value of the unattended item. Overall, the momentary evidence likelihood is given with the alternative parametrization by

δ x_{j, n} | y_{n}, z_{j} \sim 𝒩 (z_{j} δ t, \frac{1}{γ^{| j - y_{n} |}} σ^{2} δ t),

(2)

This is the likelihood function that we will use for the rest of this Appendix. Any of the results can easily be mapped back to the original parametrization (as used in the main text) by

σ^{2} = \frac{σ_{x}^{2}}{1 - κ}, σ_{x}^{2} = (1 - κ) σ^{2},

(3)

γ = \frac{κ}{1 - κ}, κ = \frac{γ}{γ + 1} .

(4)

Note that the alternative parametrization does not preserve the separation between total information and balancing the information between the attended and unattended item. In particular, the total Fisher information is now given by $(γ + 1) σ^{- 2} δ t$ , which depends on both $γ$ and $σ^{2}$ .

Below we will derive the posterior z_j’s, given the stream of momentary evidences $[δ x_{1, 1}, δ x_{2, 1}], [δ x_{1, 2}, δ x_{2, 2}], \dots$ , and the attention sequence $y_{1}, y_{2}, \dots$ . The mean and variance of the posterior distributions represent the decision maker’s belief of the items’ true values given all available evidence.

1.4 Costs, rewards, and the decision maker’s overall aim

While the posterior estimates provide information about value, it does not tell the decision maker when to stop accumulating information, or when to switch their attention. To address these questions, we need to specify the costs and rewards associated with these behaviors. For value-based decisions, we assume that the reward for choosing item $j$ is the latent state z_j (i.e. the true value) associated with the item. Furthermore, we assume that accumulating evidence comes at cost $c$ per second, or cost $c δ t$ per time step. The decision maker can only ever attend to one item, and switching attention to the other item comes at cost c_s which may be composed of a pure attention switch cost, as well as a loss of time that might introduce an additional cost. As each attention switch introduces both costs, we only consider them in combination without loss of generality.

The overall aim of the decision maker is to maximize the total expected return, which consists of the expected value of the chosen item minus the total cost of accumulating evidence and attention switches. We address this maximization problem by finding the optimal policy that, based on the observed evidence, determines when to switch attention, when to accumulate more evidence, and when to commit to a choice. We initially focus on maximizing the expected return in a single, isolated choice, and will later show that this yields qualitatively similar policies as when embedding this choice into a longer sequence of comparable choices.

2 Bayes-optimal evidence accumulation

2.1 Deriving the posterior z₁ and z₂

To find the posterior over z₁ after having accumulated evidence $x_{1, 1 : N} \equiv x_{1, 1}, \dots, x_{1, N}$ for some fixed amount of time $t = N δ t$ while paying attention to items $y_{1 : N} \equiv y_{1}, \dots y_{N}$ , we employ Bayes’ rule,

\begin{array}{ll} p (z_{1} | δ x_{1, 1 : N}, y_{1 : N}) & \propto_{z_{1}} p (z_{1}) \prod_{n = 1}^{N} p (δ x_{1, n} | z_{1}, y_{n}) \\ = 𝒩 (\bar{z_{1}}, σ_{z}^{2}) \prod_{n = 1}^{N} 𝒩 (z δ t, \frac{σ^{2}}{γ^{| 1 - y_{n} |}} δ t) \\ \propto_{z_{1}} 𝒩 (\frac{{\bar{z}}_{1} σ^{2} σ_{z}^{- 2} + X_{1} (t)}{σ^{2} σ_{z}^{- 2} + t_{1} + γ t_{2}}, \frac{σ^{2}}{σ^{2} σ_{z}^{- 2} + t_{1} + γ t_{2}}), \end{array}

(5)

where we have defined $X_{1} (t) = \sum_{n = 1}^{N} γ^{| 1 - y_{n} |} δ x_{1, n}$ as the sum of all attention-weighted momentary evidence up to time t, and $t_{j} = t - δ t \sum_{n = 1}^{N} | j - y_{n} |$ as the total time that item $j$ has been attended. Note that, for time periods in which item 2 is attended to, (i.e., when $y_{n} = 2$ ), the momentary evidence is down-weighted by $γ$ . With $δ t \to 0$ , the process becomes continuous in time, such that $X_{1} (t)$ becomes the integrated momentary evidence, but the above posterior still holds.

Following a similar derivation, the posterior belief about z₂ results in

p (z_{2} | δ x_{2, 1 : N}, y_{1 : N}) = 𝒩 (\frac{{\bar{z}}_{2} σ^{2} σ_{z}^{- 2} + X_{2} (t)}{σ^{2} σ_{z}^{- 2} + γ t_{1} + t_{2}}, \frac{σ^{2}}{σ^{2} σ_{z}^{- 2} + γ t_{1} + t_{2}})

(6)

where $X_{2} (t) = \sum_{n = 1}^{N} γ^{| 2 - y_{n} |} δ x_{2, n}$ . As the decision maker acquires momentary evidence independently for both items, the two posteriors are independent of each other, that is $p (z_{1}, z_{2} | δ x_{1, 1 : N}, δ x_{2, 1 : N}, y_{1 : N}) = p (z_{1} | δ x_{1, 1 : N}, y_{1 : N}) p (z_{2} | δ x_{2, 1 : N}, y_{1 : N})$ .

2.2 The expected reward process

At each point in time, the decision maker must decide whether it’s worth accumulating more evidence versus choosing an item. To do so, they need to predict how the mean estimated reward for each option might evolve if they accumulated more evidence. In this section, we derive the stochastic process that describes this evolution for item 1. The same principles will apply for item 2.

Assume that having accumulated evidence until time $t = N δ t$ , the current expected reward for item 1 is given by ${\hat{r}}_{1} (t)$ , where ${\hat{r}}_{1} (t) = ⟨ z_{1} | δ x_{1, 1 : N}, y_{1 : N} ⟩$ is the mean of the above posterior, Equation (5). The decision maker’s prediction of how the expected reward might evolve after accumulating additional evidence for $δ t$ is found by the marginalization,

\begin{array}{ll} p ({\hat{r}}_{1} (t + δ t) | {\hat{r}}_{1} (t), t_{1}, t_{2}, y_{N + 1}) \\ = \iint p ({\hat{r}}_{1} (t + δ t) | {\hat{r}}_{1} (t), δ x_{1, N + 1}, t_{1}, t_{2}, y_{N + 1}) p (δ x_{1, N + 1} | z_{1}, y_{N + 1}) p (z_{1} | {\hat{r}}_{1} (t), t_{1}, t_{2}) d δ x_{1, N + 1} d z_{1} . \end{array}

(7)

As the last term in the above integral shows, $\hat{r} (t)$ , t₁ and t₂ fully determine the posterior z₁ at time $t$ . We can use this posterior to predict the value of the next momentary evidence $δ x_{1, N + 1} | z_{1}$ . This, in turn, allows us to predict ${\hat{r}}_{1} (t + δ t)$ . As all involved densities are either deterministic or Gaussian, the resulting posterior will be Gaussian as well. Thus, rather than performing the integrals explicitly, we will find the final posterior by tracking the involved means and variances, which in turn completely determine the posterior parameters.

We first marginalize over $δ x_{1, N + 1}$ , by expressing ${\hat{r}}_{1} (t + δ t)$ in terms of $\hat{r} (t)$ and $δ x_{1, N + 1}$ . To do so, we use Equation (5) to express ${\hat{r}}_{1} (t + δ t)$ by

{\hat{r}}_{1} (t + δ t) = \frac{{\bar{z}}_{1} σ^{2} σ_{z}^{- 2} + X_{1} (t) + γ^{| y_{N + 1} - 1 |} δ x_{1, N + 1}}{σ^{2} σ_{z}^{- 2} + t_{1} + γ t_{2} + γ^{| 1 - y_{N + 1} |} δ t},

(8)

where we have used $X_{1} (t + δ t) = X_{1} (t) + γ^{| y_{N + 1} - 1 |} δ x_{1, N + 1}$ .

Note that, for a given $δ x_{1, N + 1}$ , $\hat{r} (t + δ t)$ is uniquely determined by $\hat{r} (t)$ . $\hat{r} (t + δ t)$ becomes a random variable once we acknowledge that, for any z₁, $δ x_{1, N + 1}$ is given by Equation (2), which we can write as $δ x_{1, N + 1} = z_{1} δ t + \sqrt{σ^{2} γ^{- | 1 - y_{N + 1} |} δ t} η_{x}$ , where $η_{x} \sim 𝒩 (0, 1)$ . Substituting this expression into ${\hat{r}}_{1} (t + δ t)$ , and using Equation (5) to re-express $X_{1} (t)$ as $X_{1} (t) = {\hat{r}}_{1} (t) (σ^{2} σ_{z}^{- 2} + t_{1} + γ t_{2}) - {\bar{z}}_{1} σ^{2} σ_{z}^{- 2}$ , results in

{\hat{r}}_{1} (t + δ t) = \frac{{\hat{r}}_{1} (t) (σ^{2} σ_{z}^{- 2} + t_{1} + γ t_{2}) + γ^{| 1 - y_{N + 1} |} z_{1} δ t + \sqrt{σ^{2} γ^{| 1 - y_{N + 1} |} δ t} η_{x}}{σ^{2} σ_{z}^{- 2} + t_{1} + γ t_{2} + γ^{| 1 - y_{N + 1} |} δ t} .

(9)

The second marginalization over z₁ is found by noting the distribution of z₁ is given by Equation (5), which can be written as

z_{1} = {\hat{r}}_{1} (t) + \sqrt{\frac{σ^{2}}{σ^{2} σ_{z}^{- 2} + t_{1} + γ t_{2}}} η_{z},

(10)

with $η_{z} \sim 𝒩 (0, 1)$ . Substituting this z₁ into the above expression for $\hat{r} (t + δ t)$ results in

{\hat{r}}_{1} (t + δ t) = {\hat{r}}_{1} (t) + \frac{\sqrt{σ^{2} γ^{| 1 - y_{N + 1} |} δ t}}{σ^{2} σ_{z}^{- 2} + t_{1} + γ t_{2} + γ^{| 1 - y_{n + 1} |} δ t} η_{x},

(11)

where we have dropped the $η_{z}$ -dependent term which had a $δ t$ pre-factor, and thus vanishes with $δ t \to 0$ . Therefore, ${\hat{r}}_{1} (t)$ evolves as a martingale,

{\hat{r}}_{1} (t + δ t) | {\hat{r}}_{1} (t), t_{1}, t_{2}, y_{N + 1} \sim 𝒩 ({\hat{r}}_{1} (t), \frac{σ^{2} γ^{| 1 - y_{n + 1} |}}{{(σ^{2} σ_{z}^{- 2} + t_{1} + γ t_{2} + γ^{| 1 - y_{N + 1} |} δ t)}^{2}} δ t) .

(12)

Using the same approach, the expected future reward for item 2 is given by

{\hat{r}}_{2} (t + δ t) | {\hat{r}}_{2} (t), t_{1}, t_{2}, y_{N + 1} \sim 𝒩 ({\hat{r}}_{2} (t), \frac{σ^{2} γ^{| 2 - y_{N + 1} |}}{{(σ^{2} σ_{z}^{- 2} + γ t_{1} + t_{2} + γ^{| 2 - y_{n + 1} |} δ t)}^{2}} δ t) .

(13)

2.3 The expected reward difference process

In a later section, we will reduce the dimensionality of the optimal policy space by using the expected reward difference rather than each of the of the expected rewards separately. To do so, we define this difference by

Δ (t) = \frac{{\hat{r}}_{1} (t) - {\hat{r}}_{2} (t)}{2} .

(14)

As for ${\hat{r}}_{1} (t)$ and ${\hat{r}}_{2} (t)$ , we are interested in how $Δ (t)$ evolves over time.

To find $Δ (t + δ t) | Δ (t), t_{1}, t_{2}, y_{N + 1}$ we can use

p (Δ (t + δ t) | Δ (t), t_{1}, t_{2}, y_{N + 1}) = p (Δ (t + δ t) = \frac{{\hat{r}}_{1} (t + δ t) - {\hat{r}}_{2} (t + δ t)}{2} | Δ (t) = \frac{{\hat{r}}_{1} (t) - {\hat{r}}_{2} (t)}{2}, t_{1}, t_{2}, y_{N + 1}) .

(15)

As the decision maker receives independent momentary evidence for each item, ${\hat{r}}_{1} (t)$ and ${\hat{r}}_{2} (t)$ are independent when conditioned on t₁, t₂ and $y_{1 : N}$ . Thus, so are their time-evolutions, ${\hat{r}}_{1} (t + δ t) | {\hat{r}}_{1} (t), \dots$ and ${\hat{r}}_{2} (t + δ t) | {\hat{r}}_{2} (t), \dots$ . With this, we can show that

\begin{array}{ll} Δ (t + δ t) | Δ (t), t_{1}, t_{2}, y_{N + 1} \sim \\ 𝒩 (Δ (t), \frac{σ^{2} δ t}{4} (\frac{γ^{| 1 - y_{N + 1} |}}{{(σ^{2} σ_{z}^{- 2} + t_{1} + γ t_{2} + γ^{| 1 - y_{N + 1} |} δ t)}^{2}} + \frac{γ^{| 2 - y_{N + 1} |}}{{(σ^{2} σ_{z}^{- 2} + γ t_{1} + t_{2} + γ^{| 2 - y_{N + 1} |} δ t)}^{2}})) . \end{array}

(16)

Unsurprisingly, $Δ (t)$ is again a martingale.

3 Optimal decision policy

We find the optimal decision policy by dynamic programming (Bellman, 1952; Bertsekas, 1995). A central concept in dynamic programming is the value function $V (\cdot)$ , which, at any point in time during a decision, returns the expected return, which encompasses all expected rewards and costs from that point onwards into the future when following the optimal decision policy. Bellman’s equation links value functions across consecutive times, and allows finding this optimal decision policy recursively. In what follows, we first focus on Bellman’s equation for single, isolated choices. After that, we show how to extend the same approach to find the optimal policy for long sequences of consecutive choices.

3.1 Single, isolated choice

For a single, isolated choice, accumulating evidence comes at cost $c$ per second. Switching attention comes at cost c_s. The expected reward for choosing item $j$ is ${\hat{r}}_{j} (t)$ , and is given by the mean of Equations (5) and (6) for $j = 1$ and $j = 2$ , respectively.

To find the value function, let us assume that we have accumulated evidence for some time $t = t_{1} + t_{2}$ , expect rewards ${\hat{r}}_{1} (t)$ and ${\hat{r}}_{2} (t)$ , and are paying attention to item $y \in {1, 2}$ . These statistics fully describe the evidence accumulation state, and thus fully parameterize the value function $V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2})$ . Here, we use $y$ as a subscript rather than an argument to $V (\cdot)$ to indicate that $y$ can only take one of two values, $y \in {1, 2}$ . At this point, we can choose among four actions. We can either immediately choose item 1, immediately choose item 2, accumulate more evidence without switching attention, or switch attention to the other item, $3 - y$ . The expected return for choosing immediately is either ${\hat{r}}_{1} (t)$ or ${\hat{r}}_{2} (t)$ , depending on the choice. Accumulating more evidence for some time $δ t$ results in cost $c δ t$ , and changes in the expected rewards according to ${\hat{r}}_{j} (t + δ t) | {\hat{r}}_{j} (t), t_{1}, t_{2}, y$ , as given by Equations (12) and (13). Therefore, the expected return for accumulating more evidence is given by

- c δ t + ⟨ V_{y} ({\hat{r}}_{1} (t + δ t), {\hat{r}}_{2} (t + δ t), t_{1} + | 2 - y | δ t, t_{2} + | 1 - y | δ t) | {\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}, y ⟩,

(17)

where the expectation is over the time-evolution of ${\hat{r}}_{1}$ and ${\hat{r}}_{2}$ , and $t_{1} + | 2 - y | δ t$ and $t_{2} + | 1 - y | δ t$ ensures that only the t_y associated with the currently attended item is increased by $δ t$ . Lastly, switching attention comes at cost c_s, but does not otherwise impact reward expectations, such that the expected return associated with this action is

- c_{s} + V_{3 - y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}),

(18)

where the use of $V_{3 - y} (\cdot)$ implements that, after an attention switch, item $3 - y$ will be the attended item.

By the Bellman, 1952 optimality principle, the best action at any point in time is the one that maximizes the expected return. Combining the expected returns associated with each possible action results in Bellman’s equation

V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = max {\begin{array}{cc} {\hat{r}}_{1}, {\hat{r}}_{2}, \\ ⟨ V_{y} ({\hat{r}}_{1} (t + δ t), {\hat{r}}_{2} (t + δ t), t_{1} + | 2 - y | δ t, t_{2} + | 1 - y | δ t) | {\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}, y ⟩ - c δ t, \\ V_{3 - y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) - c_{s} \end{array}} .

(19)

Solving this equation yields the optimal policy for any combination of ${\hat{r}}_{1}$ , ${\hat{r}}_{2}$ , t₁, t₂ and $y$ by picking the action that maximizes the associated expected return, that is, the term that maximizes the left-hand side of the above equation. The optimal decision boundaries that separate the $({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}, y)$ -space into regions where different actions are optimal lie at manifolds in which two actions yield the same expected return. For example, the decision boundary at which it becomes best to choose item 1 after having accumulated more evidence is the manifold at which

\begin{array}{ll} V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = \\ {\hat{r}}_{1} = ⟨ V_{y} ({\hat{r}}_{1} (t + δ t), {\hat{r}}_{2} (t + δ t), t_{1} + | 2 - y | δ t, t_{2} + | 1 - y | δ t) | {\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}, y ⟩ - c δ t . \end{array}

(20)

In Section 6, we describe how we found these boundaries numerically.

Formulated so far, the value function is five-dimensional, with four continuous ( ${\hat{r}}_{1}$ , ${\hat{r}}_{2}$ , t₁, and t₂) and one discrete ( $y$ ) dimension. It turns out that it is possible to remove one of the dimensions without changing the associated policy by focusing on the expected reward difference $Δ (t)$ , Equation (14), rather than the individual expected rewards. To show this, we jump ahead and use the value function property $V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) + C = V_{y} ({\hat{r}}_{1} + C, {\hat{r}}_{2} + C, t_{1}, t_{2})$ for any scalar $C$ , that we will confirm in Section 5. Next, we define the value function on expected reward differences by

{\bar{V}}_{y} (Δ, t_{1}, t_{2}) = V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) - \frac{{\hat{r}}_{1} + {\hat{r}}_{2}}{2} = V_{y} (Δ, - Δ, t_{1}, t_{2}) .

(21)

Applying this mapping to Equation (19) leads to Bellman’s equation

{\bar{V}}_{y} (Δ, t_{1}, t_{2}) = max {\begin{array}{cc} Δ, - Δ, \\ ⟨ {\bar{V}}_{y} (Δ (t + δ t), t_{1} + | 2 - y | δ t, t_{2} + | 1 - y | δ t) | Δ, t_{1}, t_{2}, y ⟩ - c δ t, \\ {\bar{V}}_{3 - y} (Δ, t_{1}, t_{2}) - c_{s} \end{array}},

(22)

which is now defined over a four-dimensional rather than a five-dimensional space while yielding the same optimal policy. This also confirms that optimal decision-making doesn’t require tracking individual expected rewards, but only their difference.

3.2 Sequence of consecutive choices

So far, we have focused on the optimal policy for a single isolated choice. Let us now demonstrate that this policy does not qualitatively change if we move to a long sequence of consecutive choices. To do so, we assume that each choice is followed by an inter-trial interval t_i after which the latent z₁ and z₂ are re-drawn from the prior, and evidence accumulation starts anew. As the expected return considers all expected future rewards, it would grow without bounds for a possibly infinite sequence of choices. Thus, rather than using the value function, we move to using the average-adjusted value function, $\tilde{V}$ , which, for each passed time $δ t$ , subtracts $ρ δ t$ , where $ρ$ is the average reward rate (Tajima et al., 2016). This way, the value tells us if we are performing better or worse than on average, and is thus bounded.

Introducing the reward rate as an additional time cost requires the following changes. First, the average-adjusted expected return for immediate choices becomes ${\hat{r}}_{j} (t) - t_{i} ρ + {\tilde{V}}_{y} ({\bar{z}}_{1}, {\bar{z}}_{2}, 0, 0)$ , where $- t_{i} ρ$ accounts for the inter-trial interval, and ${\tilde{V}}_{y} ({\bar{z}}_{1}, {\bar{z}}_{2}, 0, 0)$ is the average-adjusted value at the beginning of the next choice, where ${\hat{r}}_{j} = {\bar{z}}_{j}$ , and $t_{1} = t_{2} = 0$ . Due to the symmetry, ${\tilde{V}}_{y} ({\bar{z}}_{1}, {\bar{z}}_{2}, 0, 0)$ will be the same for both $y = 1$ and $y = 2$ , such that we do not need to specify $y$ . Second, accumulating evidence for some duration $δ t$ now comes at cost $(c + ρ) δ t$ . The expected return for switching attention remains unchanged, as we assume attention switches to be instantaneous. If attention switches take time, we would need to additionally penalize this time by $ρ$ .

With these changes, Bellman’s equation becomes

{\tilde{V}}_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = max {\begin{array}{cc} {\hat{r}}_{1} - ρ t_{i} + {\tilde{V}}_{y} ({\bar{z}}_{1}, {\bar{z}}_{2}, 0, 0), {\hat{r}}_{2} - ρ t_{i} + {\tilde{V}}_{y} ({\bar{z}}_{1}, {\bar{z}}_{2}, 0, 0), \\ ⟨ {\tilde{V}}_{y} ({\hat{r}}_{1} (t + δ t), {\hat{r}}_{2} (t + δ t), t_{1} + | 2 - y | δ t, t_{2} + | 1 - y | δ t) | {\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}, y ⟩ - (c + ρ) δ t, \\ {\tilde{V}}_{3 - y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) - c_{s} \end{array}} .

(23)

The resulting average-adjusted value function is shift-invariant, that is, adding a scalar to this value function for all states does not change the underlying policy (Tajima et al., 2016). This property allows us to fix the average-adjusted value for one particular state, such that all other average-adjusted values are relative to this state. For mathematical convenience, we choose ${\tilde{V}}_{y} ({\bar{z}}_{1}, {\bar{z}}_{2}, 0, 0) = ρ t_{i}$ , resulting in the new Bellman’s equation

{\tilde{V}}_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = max {\begin{array}{cc} {\hat{r}}_{1}, {\hat{r}}_{2}, \\ ⟨ {\tilde{V}}_{y} ({\hat{r}}_{1} (t + δ t), {\hat{r}}_{2} (t + δ t), t_{1} + | 2 - y | δ t, t_{2} + | 1 - y | δ t) | {\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}, y ⟩ - (c + ρ) δ t, \\ {\tilde{V}}_{3 - y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) - c_{s} \end{array}} .

(24)

Comparing this to Bellman’s equation for single, isolated choices, Equation (19), reveals an increase in the accumulation cost from $c$ to $c + ρ$ . Therefore, we can find a set of task parameters for which the optimal policy for single, isolated choices will mimic that for a sequence of consecutive choices. For this reason, we will focus on single, isolate choices, as they will also capture all policy properties that we expect to see for sequences of consecutive choices.

3.3 Choosing the less desirable option

Recent work by Sepulveda et al., 2020 showed that when decision makers are instructed to choose the less desirable item in a similar value-based binary decision task, fixations bias choices for the lower-valued item. Here, we show that the optimal policy also makes a similar prediction. To set the goal to choosing the less desirable option, we simply flip the signs of the expected reward associated with choosing either item from ${\hat{r}}_{j}$ to $- {\hat{r}}_{j}$ in Equation (19),

V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = max {\begin{array}{cc} - {\hat{r}}_{1}, - {\hat{r}}_{2}, \\ ⟨ V_{y} ({\hat{r}}_{1} (t + δ t), {\hat{r}}_{2} (t + δ t), t_{1} + | 2 - y | δ t, t_{2} + | 1 - y | δ t) | {\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}, y ⟩ - c δ t, \\ V_{3 - y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) - c_{s} \end{array}} .

(25)

This sign switch makes the item with the higher value the less desirable one to choose. Otherwise, the same principles apply to computing the value function and optimal policy space.

4 Optimal decision policy for perceptual decisions

To apply the same principles to perceptual decision-making, we need to re-visit the interpretation of the latent states, z₁ and z₂. Those could, for example, be the brightness of two dots on a screen, and the decision maker needs to identify the brighter dot. Alternatively, they might reflect the length of two lines, and the decision maker needs to identify which of the two lines is longer. Either way, the reward is a function of z₁, z₂, and the decision maker’s choice. Therefore, the expected reward for choosing either option can be computed from the posterior $z$ ’s, Equations (5) and (6). Furthermore, these posteriors are fully determined by their means, ${\hat{r}}_{1}$ , ${\hat{r}}_{2}$ , and the attention times, t₁ and t₂. As a consequence, we can formulate the expected reward for choosing item $j$ by the expected reward function $R_{j} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2})$ .

What are the consequences for this change in expected reward for the optimal policy? If we assume the attention-modulated evidence accumulation process to remain unchanged, the only change is that the expected return for choosing item $j$ changes from ${\hat{r}}_{j}$ to $R_{j} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2})$ . Therefore, Bellman’s equations changes to

V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = max {\begin{array}{cc} R_{1} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}), R_{2} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}), \\ ⟨ V_{y} ({\hat{r}}_{1} (t + δ t), {\hat{r}}_{2} (t + δ t), t_{1} + | 2 - y | δ t, t_{2} + | 1 - y | δ t) | {\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}, y ⟩ - c δ t, \\ V_{3 - y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) - c_{s} \end{array}} .

(26)

The optimal policy follows from Bellman’s equation as before.

The above value function can only be turned into one over expected reward differences under certain regularities of R₁ and R₂, which we will not discuss further at this point. Furthermore, for the above example, we have assumed two sources of perceptual evidence that need to be compared. Alternative tasks (e.g. the random dot motion task) might provide a single source of evidence that needs to be categorized. In this case, the formulation changes slightly (see, for example, Drugowitsch et al., 2012), but the principles remain unchanged.

5 Properties of the optimal policy

Here, we will demonstrate some interesting properties of the optimal policy, and the associated value function and decision boundaries. To do so, we re-write the value function in its non-recursive form. To do so, let us first define the switch set $𝒯 = {T_{1}, \dots, T_{M}}$ , which determines the switch times from the current time $t$ onwards. Here, $t + T_{1}$ is the time of the first switch after time $t$ , $t + T_{1} + T_{2}$ is the second switch, and so on. A final decision is made at $t + \bar{T}$ , where $\bar{T} = \sum_{m = 1}^{M} T_{m}$ , after $M - 1$ switches with associated cost $(M - 1) c_{s}$ . As the optimal policy is the one that optimizes across choices and switch times, the associated value function can be written as

V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = max_{𝒯} ⟨ max {{\hat{r}}_{1} (t + \bar{T}), {\hat{r}}_{2} (t + \bar{T})} - c \bar{T} - (M - 1) c_{s} | {\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}, y ⟩,

(27)

where time expectation is over the time-evolution of ${\hat{r}}_{1} (t)$ and ${\hat{r}}_{2} (t)$ , that also depends on $𝒯$ . In what follows, we first derive the shift-invarance of this time-evolution, and then consider its consequences for the value function, as well as the decision boundaries.

5.1 Shift-invariance and symmetry of the expected reward process

Let us fix some $𝒯$ , some time $t$ , and assume that we are currently attending item 1, $y (t) = 1$ . Then, by Equation (12), ${\hat{r}}_{1} (t + \bar{T})$ can be written as

\begin{array}{ll} {\hat{r}}_{1} (t + \bar{T}) = {\hat{r}}_{1} (t) + \int_{0}^{T_{1}} \frac{σ}{σ^{2} σ_{z}^{- 2} + (t_{1} + s_{1}) + γ t_{2}} d B_{1, s_{1}} + \int_{0}^{T_{2}} \frac{σ \sqrt{γ}}{σ^{2} σ_{z}^{- 2} + (t_{1} + T_{1}) + γ (t_{2} + s_{2})} d B_{1, s_{2}} \\ + \int_{0}^{T_{3}} \frac{σ}{σ^{2} σ_{z}^{- 2} + (t_{1} + T_{1} + s_{3}) + γ (t_{2} + T_{2})} d B_{1, s_{3}} + \dots, \end{array}

(28)

where the $B_{1, s_{j}}$ ’s are white noise processes associated with item 1. This shows that, for any $𝒯$ , the change in ${\hat{r}}_{1}$ , that is, ${\hat{r}}_{1} (t + \bar{T}) - {\hat{r}}_{1} (t)$ , is independent of ${\hat{r}}_{1} (t)$ . Therefore, we can shift ${\hat{r}}_{1} (t)$ by any scalar $C$ , and cause an associated shift in ${\hat{r}}_{1} (t + \bar{T})$ , that is

p (\hat{r} (t + \bar{T}) = R + C | {\hat{r}}_{1} (t) = r + C, t_{1}, t_{2}, y) = p (\hat{r} (t + \bar{T}) = R | {\hat{r}}_{1} (t) = r, t_{1}, t_{2}, y),

(29)

As this holds for any choice of $𝒯$ , it holds for all $𝒯$ . A similar argument establishes this property for ${\hat{r}}_{2}$ .

The above decomposition of the time-evolution of ${\hat{r}}_{1}$ furthermore reveals a symmetry between ${\hat{r}}_{1} (t + \bar{T}) - {\hat{r}}_{1} (t)$ and ${\hat{r}}_{2} (t + \bar{T}) - {\hat{r}}_{2} (t)$ . In particular, the same decomposition shows that ${\hat{r}}_{1} (t + \bar{T}) - {\hat{r}}_{1} (t)$ equals ${\hat{r}}_{2} (t + \bar{T}) - {\hat{r}}_{2} (t)$ if we flip t₁, t₂ and $y (t)$ . Therefore,

p ({\hat{r}}_{1} (t + \bar{T}) = R | {\hat{r}}_{1} (t) = r, t_{1} = a, t_{2} = b, y = j) = p ({\hat{r}}_{2} (t + \bar{T}) = R | {\hat{r}}_{2} (t) = r, t_{1} = b, t_{2} = a, y = 3 - j) .

(30)

5.2 Shift-invariance of the value function

The shift-invariance of ${\hat{r}}_{1}$ and ${\hat{r}}_{2}$ implies a shift-invariance of the value function. To see this, fix some $𝒯$ and some final choice $j$ , in which case the value function according to Equation (27) becomes

V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = ⟨ {\hat{r}}_{j} (t + \bar{T}) | {\hat{r}}_{1}, {\hat{r}}_{2} ⟩ - c \bar{T} - (M - 1) c_{s},

(31)

where the expectation is implicitly conditional on t₁, t₂, $y$ and $𝒯$ . Due to the shift-invariance of the time-evolution of ${\hat{r}}_{1}$ and ${\hat{r}}_{2}$ , adding a scalar $C$ to both ${\hat{r}}_{1}$ and ${\hat{r}}_{2}$ increases the above expectation by the same amount, $⟨ {\hat{r}}_{j} (t + \bar{T}) | {\hat{r}}_{1}, {\hat{r}}_{2} ⟩ + C$ . As a consequence,

V_{y} ({\hat{r}}_{1} + C, {\hat{r}}_{2} + C, t_{1}, t_{2}) = V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) + C .

(32)

As this holds for any choice of $𝒯$ and $j$ , it also holds for the maximum over $𝒯$ and $j$ , and thus for the value function in general.

A similar argument shows that the value function is increasing in both ${\hat{r}}_{1}$ and ${\hat{r}}_{2}$ . To see this, fix $𝒯$ and $j$ and note that increasing either ${\hat{r}}_{1}$ or ${\hat{r}}_{2}$ causes the expectation in Equation (31) to either remain unchanged or to increase to $⟨ {\hat{r}}_{j} (t + \bar{T}) | {\hat{r}}_{1}, {\hat{r}}_{2} ⟩ + C$ . Therefore, for any non-negative $C$ ,

V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) \leq V_{y} ({\hat{r}}_{1} + C, {\hat{r}}_{2}, t_{1}, t_{2}) \leq V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) + C,

(33)

V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) \leq V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2} + C, t_{1}, t_{2}) \leq V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) + C .

(34)

This again holds for any choice of $𝒯$ and $j$ , such that it holds for the value function in general.

For the value function on expected reward differences, ${\bar{V}}_{y} (Δ, t_{1}, t_{2})$ , changing both ${\hat{r}}_{1}$ and ${\hat{r}}_{2}$ by the same amount leaves $Δ$ , and therefore the associated value ${\bar{V}}_{y} (Δ, t_{1}, t_{2})$ , unchanged. In contrast, increasing only ${\hat{r}}_{1}$ or ${\hat{r}}_{2}$ by $2 C$ increases or decreases $Δ$ by $C$ . Thus, we can use $V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = {\bar{V}}_{y} (Δ, t_{1}, t_{2}) + ({\hat{r}}_{1} + {\hat{r}}_{2}) / 2$ from Equation (21) and substitute it into the two above inequalities to find

{\bar{V}}_{y} (Δ, t_{1}, t_{2}) - C \leq {\bar{V}}_{y} (Δ \pm C, t_{1}, t_{2}) \leq {\bar{V}}_{y} (Δ, t_{1}, t_{2}) + C,

(35)

for some non-negative $C \geq 0$ . This shows that ${\bar{V}}_{y} (Δ, t_{1}, t_{2})$ changes sublinearly with $Δ$ . However, we cannot anymore guarantee an increase or decrease in ${\bar{V}}_{y} (\cdot)$ , as an increase in $Δ$ could arise from both an increase in ${\hat{r}}_{1}$ or a decrease in ${\hat{r}}_{2}$ .

5.3 Symmetry of the value function

The symmetry in time-evolution across ${\hat{r}}_{1}$ and ${\hat{r}}_{2}$ results in a symmetry in the value function. To show this, let us again fix $𝒯$ and $j$ , such that the value function is given by Equation (31). Then, by Equation (30), the expectation in the value function becomes $⟨ {\hat{r}}_{3 - j} (t + \bar{T}) | {\hat{r}}_{2}, {\hat{r}}_{1} ⟩$ if we flip t₁, t₂, and $j$ , while leaving the remaining terms of Equation (31) unchanged. Therefore,

V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = V_{3 - y} ({\hat{r}}_{2}, {\hat{r}}_{1}, t_{2}, t_{2}) .

(36)

For the value function on expected reward differences, a flip of ${\hat{r}}_{1}$ and ${\hat{r}}_{2}$ corresponds to a sign change of $Δ$ , such that we have

{\bar{V}}_{y} (Δ, t_{1}, t_{2}) = {\bar{V}}_{3 - y} (- Δ, t_{2}, t_{1}) .

(37)

Both cases show that we are not required to find the value function for both $y = 1$ and $y = 2$ separately, as knowing one reveals the other by the above symmetry.

5.4 Maximum $| V_{1} (\cdot) - V_{2} (\cdot) |$ difference

By Bellman’s equation, Equation (19), it is best to switch attention if the expected return of accumulating evidence equals that of switching attention, that is, if

\begin{array}{ll} V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = ⟨ V_{y} ({\hat{r}}_{1} (t + δ t), {\hat{r}}_{2} (t + δ t), t_{1} + | 2 - y | δ t, t_{2} + | 1 - y | δ t) | {\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}, y ⟩ - c δ t \\ = V_{3 - y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) - c_{s} . \end{array}

(38)

Before that, $V_{3 - y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) < V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) + c_{s}$ , as otherwise, an attention switch would have already occurred. When it does, we have $V_{3 - y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) + c_{s}$ . That is, the attention switch happens if the value of doing so exceeds that for accumulating evidence by the switch cost c_s. Therefore, the difference between the value functions V₁ and V₂ can never be larger than the switch cost, that is

| V_{1} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) - V_{2} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) | \leq c_{s} .

(39)

Once their difference equals the switch cost, a switch occurs. It is easy to see that the same property holds for the value function on expected reward differences, leading to

| {\bar{V}}_{1} (Δ, t_{1}, t_{2}) - {\bar{V}}_{2} (Δ, t_{1}, t_{2}) | \leq c_{s} .

(40)

5.5 The decision boundaries are parallel to the diagonal ${\hat{r}}_{1} = {\hat{r}}_{2}$

Following the optimal policy, the decision maker accumulates evidence until $V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = \max {{\hat{r}}_{1}, {\hat{r}}_{2}}$ . For all times before that, $V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) > \max {{\hat{r}}_{1}, {\hat{r}}_{2}}$ , as otherwise, a decision is made. Let us first find an expression for the decision boundaries, and then show that these boundaries are parallel to ${\hat{r}}_{1} = {\hat{r}}_{2}$ . To do so, we will in most of this section fix t₁, t₂ and $y$ , and drop them for notational convenience, that is $V ({\hat{r}}_{1}, {\hat{r}}_{2}) \equiv V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2})$ .

First, let us assume ${\hat{r}}_{1} > {\hat{r}}_{2}$ , such that $\max {{\hat{r}}_{1}, {\hat{r}}_{2}} = {\hat{r}}_{1}$ , and item 1 would be chosen if an immediate choice is required. Therefore $V ({\hat{r}}_{1}, {\hat{r}}_{2}) \geq {\hat{r}}_{1}$ always, and $V ({\hat{r}}_{1}, {\hat{r}}_{2}) = {\hat{r}}_{1}$ once a decision is made. For a fixed ${\hat{r}}_{1}$ , the value function is increasing in ${\hat{r}}_{2}$ , such that reducing ${\hat{r}}_{2}$ if $V ({\hat{r}}_{1}, {\hat{r}}_{2}) > {\hat{r}}_{1}$ will at some point lead to $V ({\hat{r}}_{1}, {\hat{r}}_{2}) = {\hat{r}}_{1}$ . The optimal decision boundary is the largest ${\hat{r}}_{2}$ for which this occurs. Expressed as a function of ${\hat{r}}_{1}$ , this boundary on ${\hat{r}}_{2}$ is thus given by

θ_{1 y} ({\hat{r}}_{1}, t_{1}, t_{2}) = \max {{\hat{r}}_{2} \leq {\hat{r}}_{1} : V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = {\hat{r}}_{1}}

(41)

A similar argument leads to the optimal decision boundary for item 2. In this case, we assume ${\hat{r}}_{2} > {\hat{r}}_{1}$ , such that $V ({\hat{r}}_{1}, {\hat{r}}_{2}) \geq {\hat{r}}_{2}$ always, and $V ({\hat{r}}_{1}, {\hat{r}}_{2}) = {\hat{r}}_{2}$ once a decision is made. The sublinear growth of the value function in both ${\hat{r}}_{1}$ and ${\hat{r}}_{2}$ implies that $V ({\hat{r}}_{1}, {\hat{r}}_{2})$ grows at most as fast as ${\hat{r}}_{2}$ , such that there will be some ${\hat{r}}_{2}$ at which $V ({\hat{r}}_{1}, {\hat{r}}_{2}) > {\hat{r}}_{2}$ turns into $V ({\hat{r}}_{1}, {\hat{r}}_{2}) = {\hat{r}}_{2}$ . The optimal decision boundary is the smallest ${\hat{r}}_{2}$ for which this occurs, that is

θ_{2 y} ({\hat{r}}_{1}, t_{1}, t_{2}) = \min {{\hat{r}}_{2} \geq {\hat{r}}_{1} : V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = {\hat{r}}_{2}}

(42)

Note that both boundaries are on ${\hat{r}}_{2}$ as a function of ${\hat{r}}_{1}$ , t₁, t₂, and $y$ .

To show that these boundaries are parallel to the diagonal, we will use the shift-invariance of the value function, leading, for some scalar $C$ , to

\begin{array}{ll} θ_{1 y} ({\hat{r}}_{1}, t_{1}, t_{2}) + C & = max {{\hat{r}}_{2} + C \leq {\hat{r}}_{1} + C : V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}) = {\hat{r}}_{1}} \\ = max {{\tilde{r}}_{2} \leq {\tilde{r}}_{1} : V_{y} ({\tilde{r}}_{1} - C, {\tilde{r}}_{2} - C, t_{1}, t_{2}) = {\tilde{r}}_{1} - C} \\ = max {{\tilde{r}}_{2} \leq {\tilde{r}}_{1} : V_{y} ({\tilde{r}}_{1}, {\tilde{r}}_{2}, t_{1}, t_{2}) = {\tilde{r}}_{1}} \\ = θ_{1 y} ({\tilde{r}}_{1}, t_{1}, t_{2}) \\ = θ_{1 y} ({\hat{r}}_{1} + C, t_{1}, t_{2}), \end{array}

(43)

where we have used ${\tilde{r}}_{j} = {\hat{r}}_{j} + C$ . This shows that increasing ${\hat{r}}_{1}$ by some scalar $C$ shifts the boundary on ${\hat{r}}_{2}$ by the same amount. Therefore, the decision boundary for choosing item 1 is parallel to ${\hat{r}}_{1} = {\hat{r}}_{2}$ .

An analogous argument for $θ_{2 y} (\cdot)$ results in

θ_{2 y} ({\hat{r}}_{1}, t_{1}, t_{2}) + C = θ_{2 y} ({\hat{r}}_{1} + C, t_{1}, t_{2}),

(44)

which showing that the same property holds for the decision boundary for choosing item 2. Overall, this confirms that the decision boundaries only depend on the expected reward difference (i.e., the direction orthogonal to ${\hat{r}}_{1} = {\hat{r}}_{2}$ ), confirming that it is sufficient to compute $\bar{V} (\cdot)$ instead of $V (\cdot)$ .

5.6 Impact of re-scaled costs, rewards, and standard deviations

To investigate the impact of re-scaling all reward and cost-dependent parameters, $c$ , c_s, $σ$ , and $σ_{z}$ , by a constant factor $α$ , we first show that this re-scaling causes an equal re-scaling of the reward expectation process. To do so, note that $σ \to α σ$ and $σ_{z} \to α σ_{z}$ causes the expected reward expectation decomposition, Equation (28) to yield

\begin{array}{ll} α {\hat{r}}_{1} (t + \bar{T}) = α {\hat{r}}_{1} (t) + \int_{0}^{T_{1}} \frac{α σ}{σ^{2} σ_{z}^{- 2} + (t_{1} + s_{1}) + γ t_{2}} d B_{1, s_{1}} + \int_{0}^{T_{2}} \frac{α σ \sqrt{γ}}{σ^{2} σ_{z}^{- 2} + (t_{1} + T_{1}) + γ (t_{2} + s_{2})} d B_{1, s_{2}} \\ + \int_{0}^{T_{3}} \frac{α σ}{σ^{2} σ_{z}^{- 2} + (t_{1} + T_{1} + s_{3}) + γ (t_{2} + T_{2})} d B_{1, s_{3}} + \dots . \end{array}

(45)

That is, the expected reward process now describes the evolution of a re-scaled version, ${\hat{r}}_{1} \to α {\hat{r}}_{1}$ , of the expected reward. Therefore, with slight abuse of notation, for a fixed switch set $τ$ and final choice $j$ ,

⟨ {\hat{r}}_{j} (t + T) | α {\hat{r}}_{1}, α {\hat{r}}_{2}; α σ, α σ_{z} ⟩ = α ⟨ {\hat{r}}_{j} (t + T) | {\hat{r}}_{1}, {\hat{r}}_{2}; σ, σ_{z} ⟩,

(46)

where we have made explicit the dependency on $σ$ and $σ_{z}$ .

To show the effect of this on the value function, keep again $τ$ and $j$ fixed, and use $c \to α c$ and $c_{s} \to α c_{s}$ , resulting in the value function

\begin{array}{ll} V_{y} (α {\hat{r}}_{1}, α {\hat{r}}_{2}, t_{1}, t_{2}; α c, α c_{s}, α σ, α σ_{z}) & = ⟨ {\hat{r}}_{j} | α {\hat{r}}_{1}, α {\hat{r}}_{2}; α σ, α σ_{z} ⟩ - α c \bar{T} - α (M - 1) c_{s} \\ = α (⟨ {\hat{r}}_{j} | {\hat{r}}_{1}, {\hat{r}}_{2}; σ, σ_{z} ⟩ - c \bar{T} - (M - 1) c_{s}), \end{array}

(47)

which establishes that

V_{y} (α {\hat{r}}_{1}, α {\hat{r}}_{2}, t_{1}, t_{2}; α c, α c_{s}, α σ, α σ_{z}) = α V_{y} ({\hat{r}}_{1}, {\hat{r}}_{2}, t_{1}, t_{2}; c, c_{s}, σ, σ_{z})

(48)

As this holds for all $τ$ and $j$ , it is true in general. Therefore, re-scaling all costs, rewards, and standard deviations of prior and likelihood results in equivalent re-scaling of the value function, and an analogous shift of switch and decision boundaries.

6 Simulation details

6.1 Computing the optimal policy

In Section 3, we described the Bellman equation (Equation (22)) which outputs the expected return given these four parameters: currently attended item ( $y$ ), reward difference ( $Δ$ ), expected return for accumulating more evidence, and expected return for switching attention. Note that the symmetry of the value function (Section 5) allows us to drop $- Δ$ from the original Equation (22). Solving this Bellman equation provides us with a four-dimensional ‘policy space’ which assigns the optimal action to take at any point in this space defined by the four parameters above.

The solution to the optimal policy can be found numerically by backwards induction (Tajima et al., 2016). To do so, first we assume some large $t = t_{1} + t_{2}$ , where a decision is guaranteed. In this case, $V_{y} (Δ, t_{1}, t_{2}) = \max {- Δ, Δ} = | Δ |$ for both $y = 1$ and $y = 2$ . We call this the base case. From this base case, we can move one time step backwards in t₁ ( $y = 1$ ):

{\bar{V}}_{1} (Δ, t_{1} - δ t, t_{2}) = max {\begin{array}{cc} Δ, \\ ⟨ {\bar{V}}_{1} (Δ, t_{1}, t_{2}) | Δ, t_{1}, t_{2} ⟩ - c δ t, \\ {\bar{V}}_{2} (Δ, t_{1} - δ t, t_{2}) - c_{s} \end{array}},

(49)

The second expression in the maximum can be evaluated, since we assume a decision is made at time t. But ${\bar{V}}_{2} (Δ, t_{1} - δ t, t_{2}) - c_{s}$ , which is the value function for switching attention, is unknown. This unknown value function is given by

{\bar{V}}_{2} (Δ, t_{1} - δ t, t_{2}) = max {\begin{array}{cc} Δ, \\ ⟨ {\bar{V}}_{2} (Δ, t_{1} - δ t, t_{2} + δ t) | Δ, t_{1}, t_{2} ⟩ - c δ t, \\ {\bar{V}}_{1} (Δ, t_{1} - δ t, t_{2}) - c_{s} \end{array}},

(50)

In this expression, the second term can again be found, but ${\bar{V}}_{1} (Δ, t_{1} - δ t, t_{2}) - c_{s}$ is unknown. Looking at the two expressions above, we see that under the parameters $(Δ, t_{1} - δ t, t_{2})$ , $V_{1} \geq V_{2} - c_{s}$ , and $V_{2} \geq V_{1} - c_{s}$ , which cannot both be true. Therefore, we first assume that V₁ is not determined by $V_{2} - c_{s}$ , removing the $V_{2} - c_{s}$ term from the maximum. This allows us to find ${\bar{V}}_{1} (Δ, t_{1} - δ t, t_{2})$ in Equation (49). Then, we compute Equation (50) including the $V_{1} - c_{s}$ term. If we find that $V_{2} = V_{1} - c_{s}$ , then $V_{1} \neq V_{2} - c_{s}$ , which means the $V_{2} - c_{s}$ term could not have mattered in Equation (49), and we are done. If not, we re-compute V₁ with the $V_{2} - c_{s}$ term included, and we are done. Therefore, we were able to compute V₁ and V₂ under the parameters $(Δ, t_{1} - δ t, t_{2})$ using information about ${\bar{V}}_{1} (Δ, t_{1}, t_{2})$ and ${\bar{V}}_{2} (Δ, t_{1} - δ t, t_{2} + δ t)$ .

Using the same approach, we can find $V_{1, 2} (Δ, t_{1}, t_{2} - δ t)$ based on ${\bar{V}}_{1} (Δ, t_{1} - δ t, t_{2} + δ t)$ and ${\bar{V}}_{2} (Δ, t_{1}, t_{2})$ . Thus, given that we know $V_{y} (Δ, t_{1}, t_{2})$ above a certain $t = t_{1} + t_{2}$ , we can move backwards to compute V₁ and V₂ for $(Δ, t_{1} - δ t, t_{2})$ , then $(Δ, t_{1} - 2 δ t, t_{2})$ , and so on, until $(Δ, 0, t_{2})$ for all relevant values of $Δ$ . Subsequently, we can do the same moving backwards in t₂, solving for $V_{y} (Δ, t_{1}, t_{2} - δ t)$ , $V_{y} (Δ, t_{1}, t_{2} - 2 δ t)$ , …, $V_{y} (Δ, t_{1}, 0)$ . Following this, we can continue with the same procedure from $V_{y} (Δ, t_{1} - δ t, t_{2} - δ t)$ , until we have found $V_{1, 2}$ for all combinations of t₁ and t₂.

In practice, the parameters of the optimal policy space were discretized to allow for tractable computation. We set the large time at which decisions are guaranteed at $t = 6 s$ , which we determined empirically. Time was discretized into steps of $δ t = 0.05 s$ . The item values, and their difference ( $Δ$ ) were also discretized into steps of 0.05.

Upon completing this exercise, we now have two 3-dimensional optimal policy spaces. The decision maker’s location in this policy space is determined by t₁, t₂, and $Δ$ . Each point in this space is assigned an optimal action to take (choose item, accumulate more evidence, switch attention) based on which expression was largest in the maximum of the respective Bellman equation. The decision maker moves between the two policy spaces depending on which item they are attending to ( $y \in [1, 2]$ ).

In order to find the three-dimensional boundaries that signify a change in optimal action to take, we took slices of the optimal policy space in planes of constant $Δ$ ’s. We found the boundary between different optimal policies within each of these slices. We in turn approximated the three-dimensional contour of the optimal policy boundaries by collating them along the different $Δ$ ’s.

6.2 Finding task parameters that best match human behavior

In computing the optimal policy, there were several free parameters that determined the shape of the policy boundaries, thereby affecting the behavior of the optimal model. These parameters included $σ^{2}$ , $σ_{z}^{2}$ , $c$ , c_s, and $γ$ . Our goal was to find a set of parameters that qualitatively mimic human behavior as best as possible. To do so, we performed a random search over the following parameter values: $c_{s} \in [0.001, 0.05]$ (steps size 0.001), $c \in [0.01, 0.4]$ (steps size 0.01), $σ^{2} \in [1, 100]$ (step size 1), $σ_{z}^{2} \in [1, 100]$ (step size 1), $γ \in [0.001, 0.01]$ (step size 0.001) (Bergstra and Bengio, 2012).

To find the best qualitative fit, we simulated behavior from a randomly selected set of parameter values (see next section for simulation procedure). From this simulated behavior, we evaluated the match between human and model behavior by applying the same procedure to each of Figure 3B,C,E. For each bin for each plot, we subtracted the mean values between the model and human data, then divided this difference by the standard deviation of the human data corresponding to that bin, essentially computing the effect size of the difference in means. We computed the sum of these effect sizes for every bin, which served as a metric for how qualitatively similar the curves were between the model and human data. We performed the same procedure for all three figures, and ranked the sum of the effect sizes for all simulations. We performed simulations for over 2,000,000 random sets of parameter values. The set of parameters for which our model best replicated human behavior according to the above criteria was $c_{s} = 0.0065$ , $c = 0.23$ , $σ^{2} = 27$ , $σ_{z}^{2} = 18$ , $γ = 0.004$ .

6.3 Simulating decisions with the optimal policy

The optimal policy allowed us to simulate decision making in a task analogous to the one humans performed in Krajbich et al., 2010. For a given set of parameters, we first computed the optimal policy. In a simulated trial, two items with values z₁ and z₂ are presented. At trial onset, the model attends to an item randomly ( $y \in [1, 2]$ ), and starts accumulating noisy evidence centered around the true values. At every time step ( $δ t = 0.05$ ), the model evaluates $Δ$ using the mean of the posteriors between the two items (see Equations (5) and (6)). Then, the model performs the optimal action associated with its location in the optimal policy space. If the model makes a decision, then the trial is over. If the model instead accumulates more evidence, then the above procedure is repeated for the next time step. If the model switches attention, it does not obtain further information about either item, but switches attention to the other item. Switching attention allows for more reliable evidence from the now-attended item, and also switches the optimal policy space to the appropriate one (see Figure 2).

To allow for a relatively fair comparison between the model and human data, we simulated the same number of subjects ( $N = 39$ ) for the model, but with a larger number of trials. For each simulated subject, trials were created such that all pairwise combinations of values between 0 and 7 were included, and this was iterated 20 times. This yielded a total of 1280 trials per subject.

6.4 Attention diffusion model

In order compare the decision performance of the optimal model to that of the original attentional drift diffusion model (aDDM) proposed by Krajbich et al., 2010, we needed to ensure that neither model had an advantage by receiving more information. We did so by making sure that the signal-to-noise ratios of evidence accumulation of both models were identical. In aDDM, the evidence accumulation evolved according to the following process, in steps of 0.05 s (assuming y = 1):

v_{t} = v_{t - 1} + d (z_{1} - γ_{k} z_{2}) + η_{t},

(51)

where v_t is the relative decision value that represents the subjective value difference between the two items at time $t$ , $d$ is a constant that controls the speed of integration (in $m s^{- 1}$ ), $γ_{k}$ controls the biasing effect of attention, and $η_{t} \sim 𝒩 (0, σ^{2})$ is a normally distributed random variable zero mean and variance $σ^{2}$ . Written differently, the difference in the attention-weighted momentary evidence between item 1 and item 2 can be expressed as

\begin{array}{ll} δ Δ = d (z_{1} - γ_{k} z_{2}) + η_{t} & \sim 𝒩 (d (z_{1} - γ_{k} z_{2}), σ^{2}) \\ \sim 𝒩 (k (z_{1} - γ_{k} z_{2}) δ t, σ_{k}^{2} δ t), \end{array}

(52)

where $d$ and $σ^{2}$ were replaced by $k δ t$ , and $σ_{k}^{2} δ t$ , respectively. Here, the variance term $σ_{k}^{2} δ t$ can be split into two parts, such that the $δ Δ$ term can be expressed as

δ Δ \sim 𝒩 (z_{1} k δ t, \frac{1}{2} σ_{k}^{2} δ t) - 𝒩 (γ_{k} z_{2} k δ t, \frac{1}{2} σ_{k}^{2} δ t) .

(53)

The signal-to-noise ratios (i.e. the ratio of mean over standard deviation) of the two terms in the above equation are $\frac{z_{1} k δ t}{\sqrt{\frac{δ t}{2}} σ_{k}}$ and $\frac{z_{2} k δ t}{\frac{1}{γ_{k}} σ_{k} \sqrt{\frac{δ t}{2}}}$ , respectively.

Continuing to assume $y = 1$ , in the Bayes-optimal model, evidence accumulation evolves according to

\begin{aligned} δ x_{1} & \sim 𝒩 (z_{1} δ t, σ_{b}^{2} δ t), \\ δ x_{2} & \sim 𝒩 (z_{2} δ t, γ_{b}^{- 1} σ_{b}^{2} δ t) . \end{aligned}

(54)

Therefore, the difference in the attention-weighted momentary evidence between item 1 and item 2 can be expressed as:

\begin{aligned} δ Δ & \sim 𝒩 (z_{1} δ t, σ_{b}^{2} δ t) - γ_{b} 𝒩 (z_{2} δ t, γ_{b}^{- 1} σ_{b}^{2} δ t) \\ \sim 𝒩 (z_{1} δ t, σ_{b}^{2} δ t) - 𝒩 (γ_{b} z_{2} δ t, γ_{b} σ_{b}^{2} δ t) . \end{aligned}

(55)

The signal-to-noise ratios of the two terms in the above equation are $\frac{z_{1} δ t}{\sqrt{δ t} σ_{b}}$ and $\frac{z_{2} δ t}{\frac{1}{\sqrt{γ_{b}}} σ_{b} \sqrt{δ t}}$ , respectively.

In order to match the signal-to-noise ratios of the two models, we set equal their corresponding expressions, to find the following relationship between the parameters of the two models:

\begin{aligned} k & = 1, \\ σ_{k}^{2} & = 2 σ_{b}^{2}, \\ γ_{k} & = \sqrt{γ_{b}} . \end{aligned}

(56)

Therefore, we simulated the aDDM with model parameters $γ_{k} = \sqrt{γ_{b}}$ and $σ_{k}^{2} = 2 σ_{b}^{2}$ .

In the original aDDM model, the model parameters were estimated by fitting the model behavior to human behavior after setting a decision threshold at ±1. Since we adjusted some of the aDDM parameters, we instead iterated through different decision thresholds (1 through 10, in increments of 1) and found the value that maximizes model performance. To keep it consistent with behavioral data, we generated 39 simulated participants that each completed 200 trials where the two item values were drawn from the prior distribution of the optimal policy model, $z_{j} \sim 𝒩 (\bar{z}, σ_{z}^{2})$ using both the optimal model and the aDDM model.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Jan Drugowitsch, Email: jan_drugowitsch@hms.harvard.edu.

Konstantinos Tsetsos, University Medical Center Hamburg-Eppendorf, Germany.

Joshua I Gold, University of Pennsylvania, United States.

Funding Information

This paper was supported by the following grants:

National Institute of Mental Health R01MH115554 to Jan Drugowitsch.
James S. McDonnell Foundation 220020462 to Jan Drugowitsch.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Formal analysis, Methodology, Writing - original draft, Writing - review and editing.

Methodology.

Conceptualization, Supervision, Funding acquisition, Methodology, Writing - original draft, Writing - review and editing.

Ethics

Human subjects: Human behavioral data were obtained from previously published work from the California Institute of Technology (Krajbich et al., 2010). Caltech's Human Subjects Internal Review Board approved the experiment. Written informed consent was obtained from all participants.

Additional files

Source data 1. Human behavioral data readme.

elife-63436-data1.zip^{(481B, zip)}

Source data 2. Human behavioral data.

elife-63436-data2.zip^{(128KB, zip)}

Transparent reporting form

elife-63436-transrepform.docx^{(178.4KB, docx)}

Data availability

The human behavioral data and code are available through an open source license archived at https://doi.org/10.5281/zenodo.4636831 copy archived at https://archive.softwareheritage.org/swh:1:rev:db4a4481aa6522d990018a34c31683698da039cb/.

References

Acerbi L, Vijayakumar S, Wolpert DM. On the origins of suboptimality in human probabilistic inference. PLOS Computational Biology. 2014;10:e1003661. doi: 10.1371/journal.pcbi.1003661. [DOI] [PMC free article] [PubMed] [Google Scholar]
Armel KC, Beaumel A, Rangel A. Biasing simple choices by manipulating relative visual attention. Judgment and Decision Making. 2008;3:396–403. [Google Scholar]
Averbeck BB, Latham PE, Pouget A. Neural correlations, population coding and computation. Nature Reviews Neuroscience. 2006;7:358–366. doi: 10.1038/nrn1888. [DOI] [PubMed] [Google Scholar]
Ba JL, Mnih V, Kavukcuoglu K. Multiple object recognition with visual attention. International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.2015. [Google Scholar]
Bahdanau D, Cho KH, Bengio Y. Neural machine translation by jointly learning to align and translate. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.2015. [Google Scholar]
Bellman R. On the theory of dynamic programming. PNAS. 1952;38:716–719. doi: 10.1073/pnas.38.8.716. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bergstra J, Bengio Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research. 2012;13:281–305. doi: 10.5555/2188385.2188395. [DOI] [Google Scholar]
Bertsekas DP. Athena scientific. In: Floudas C, Pardalos P, editors. Dynamic Programming and Optimal Control. Springer; 1995. pp. 8–36. [DOI] [Google Scholar]
Bogacz R, Brown E, Moehlis J, Holmes P, Cohen JD. The physics of optimal decision making: a formal analysis of models of performance in two-alternative forced-choice tasks. Psychological Review. 2006;113:700–765. doi: 10.1037/0033-295X.113.4.700. [DOI] [PubMed] [Google Scholar]
Bonferroni CE. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni Del R Istituto Superiore Di Scienze Economiche E Commerciali Di Firenze.1936. [Google Scholar]
Brockwell AE, Kadane JB. A gridding method for bayesian sequential decision problems. Journal of Computational and Graphical Statistics. 2003;12:566–584. doi: 10.1198/1061860032274. [DOI] [Google Scholar]
Buhusi CV, Meck WH. What Makes Us Tick? Functional and Neural Mechanisms of Interval Timing. Hachette; 2005. [DOI] [PubMed] [Google Scholar]
Callaway F, Rangel A, Griffiths TL. Fixation patterns in simple choice are consistent with optimal use of cognitive resources. PsyArXiv. 2020 doi: 10.31234/osf.io/57v6k. [DOI]
Cassey TC, Evens DR, Bogacz R, Marshall JA, Ludwig CJ. Adaptive sampling of information in perceptual decision-making. PLOS ONE. 2013;8:e78993. doi: 10.1371/journal.pone.0078993. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cavanagh JF, Wiecki TV, Kochar A, Frank MJ. Eye tracking and pupillometry are indicators of dissociable latent decision processes. Journal of Experimental Psychology: General. 2014;143:1476–1488. doi: 10.1037/a0035813. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chukoskie L, Snider J, Mozer MC, Krauzlis RJ, Sejnowski TJ. Learning where to look for a hidden target. PNAS. 2013;110 Suppl 2:10438–10445. doi: 10.1073/pnas.1301216110. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cohen MR, Maunsell JH. Attention improves performance primarily by reducing interneuronal correlations. Nature Neuroscience. 2009;12:1594–1600. doi: 10.1038/nn.2439. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cohen MR, Maunsell JH. A neuronal population measure of attention predicts behavioral performance on individual trials. Journal of Neuroscience. 2010;30:15241–15253. doi: 10.1523/JNEUROSCI.2171-10.2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
Corbetta M, Shulman GL. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience. 2002;3:201–215. doi: 10.1038/nrn755. [DOI] [PubMed] [Google Scholar]
Drugowitsch J, Moreno-Bote R, Churchland AK, Shadlen MN, Pouget A. The cost of accumulating evidence in perceptual decision making. Journal of Neuroscience. 2012;32:3612–3628. doi: 10.1523/JNEUROSCI.4010-11.2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
Drugowitsch J, Moreno-Bote R, Pouget A. Optimal decision-making with time-varying evidence reliability. Advances in Neural Information Processing Systems.2014. [Google Scholar]
Drugowitsch J, Wyart V, Devauchelle AD, Koechlin E. Computational precision of mental inference as critical source of human choice suboptimality. Neuron. 2016;92:1398–1411. doi: 10.1016/j.neuron.2016.11.005. [DOI] [PubMed] [Google Scholar]
Fudenberg D, Strack P, Strzalecki T. Speed, accuracy, and the optimal timing of choices. American Economic Review. 2018;108:3651–3684. doi: 10.1257/aer.20150742. [DOI] [Google Scholar]
Gehring J, Auli M, Grangier D, Yarats D, Dauphin YN. Convolutional sequence to sequence learning. 34th International Conference on Machine Learning ICML.2017. [Google Scholar]
Geisler WS, Cormack LK. Models of Overt Attention. In: Everling S, Liversedge S, Gilchrist I, editors. The Oxford Handbook of Eye Movements. Oxford University Press; 2012. pp. 1–2. [DOI] [Google Scholar]
Gluth S, Kern N, Kortmann M, Vitali CL. Value-based attention but not divisive normalization influences decisions with multiple alternatives. Nature Human Behaviour. 2020;4:634–645. doi: 10.1038/s41562-020-0822-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hayhoe M, Ballard D. Eye movements in natural behavior. Trends in Cognitive Sciences. 2005;9:188–194. doi: 10.1016/j.tics.2005.02.009. [DOI] [PubMed] [Google Scholar]
Hébert B, Woodford M. Rational Inattention When Decisions Take Time. Nber Working Paper Series; 2019. [Google Scholar]
Hoppe D, Rothkopf CA. Learning rational temporal eye movement strategies. PNAS. 2016;113:8332–8337. doi: 10.1073/pnas.1601305113. [DOI] [PMC free article] [PubMed] [Google Scholar]
Itti L, Koch C. Computational modelling of visual attention. Nature Reviews Neuroscience. 2001;2:194–203. doi: 10.1038/35058500. [DOI] [PubMed] [Google Scholar]
Jang AI. DrugowitschLab/Optimal-policy-attention-modulated-decisions: Code as used in manuscript. v1.0Zenodo. 2021 doi: 10.5281/zenodo.4636831. [DOI]
Ke TT, Shen ZJM, Villas-Boas JM. Search for information on multiple products. Management Science. 2016;62:e2316. doi: 10.1287/mnsc.2015.2316. [DOI] [Google Scholar]
Khaw MW, Glimcher PW, Louie K. Normalized value coding explains dynamic adaptation in the human valuation process. PNAS. 2017;114:12696–12701. doi: 10.1073/pnas.1715293114. [DOI] [PMC free article] [PubMed] [Google Scholar]
Krajbich I, Armel C, Rangel A. Visual fixations and the computation and comparison of value in simple choice. Nature Neuroscience. 2010;13:1292–1298. doi: 10.1038/nn.2635. [DOI] [PubMed] [Google Scholar]
Krajbich I, Rangel A. Multialternative drift-diffusion model predicts the relationship between visual fixations and choice in value-based decisions. PNAS. 2011;108:13852–13857. doi: 10.1073/pnas.1101328108. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kustov AA, Robinson DL. Shared neural control of attentional shifts and eye movements. Nature. 1996;384:74–77. doi: 10.1038/384074a0. [DOI] [PubMed] [Google Scholar]
Li SZ, Ma WJ. Cognitive computational neuroscience. Valuation as Inference: A New Model for the Effects of Fixation on Choice.2019. [Google Scholar]
Li Z, Ma W-J. An uncertainty-based model of the effects of fixation on choice. PsyArXiv. 2020 doi: 10.31234/osf.io/ajmwx. [DOI] [PMC free article] [PubMed]
Milosavljevic M, Malmaud J, Huth A, Koch C, Rangel A. The drift diffusion model can account for the accuracy and reaction time of Value-Based choices under high and low time pressure. SSRN Electronic Journal. 2010;11:1901533. doi: 10.2139/ssrn.1901533. [DOI] [Google Scholar]
Mitchell JF, Sundberg KA, Reynolds JH. Differential attention-dependent response modulation across cell classes in macaque visual area V4. Neuron. 2007;55:131–141. doi: 10.1016/j.neuron.2007.06.018. [DOI] [PubMed] [Google Scholar]
Mitchell JF, Sundberg KA, Reynolds JH. Spatial attention decorrelates intrinsic activity fluctuations in macaque area V4. Neuron. 2009;63:879–888. doi: 10.1016/j.neuron.2009.09.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mnih V, Heess N, Graves A, Kavukcuoglu K. Recurrent models of visual attention. Advances in Neural Information Processing Systems.2014. [Google Scholar]
Mohler CW, Wurtz RH. Organization of monkey superior colliculus: intermediate layer cells discharging before eye movements. Journal of Neurophysiology. 1976;39:722–744. doi: 10.1152/jn.1976.39.4.722. [DOI] [PubMed] [Google Scholar]
Ni AM, Ruff DA, Alberts JJ, Symmonds J, Cohen MR. Learning and attention reveal a general relationship between population activity and behavior. Science. 2018;359:463–465. doi: 10.1126/science.aao0284. [DOI] [PMC free article] [PubMed] [Google Scholar]
Posner MI. Orienting of attention. Quarterly Journal of Experimental Psychology. 1980;32:3–25. doi: 10.1080/00335558008248231. [DOI] [PubMed] [Google Scholar]
Rangel A, Hare T. Neural computations associated with goal-directed choice. Current Opinion in Neurobiology. 2010;20:262–270. doi: 10.1016/j.conb.2010.03.001. [DOI] [PubMed] [Google Scholar]
Ratcliff R, McKoon G. The diffusion decision model: theory and data for two-choice decision tasks. Neural Computation. 2008;20:873–922. doi: 10.1162/neco.2008.12-06-420. [DOI] [PMC free article] [PubMed] [Google Scholar]
Reynolds JH, Chelazzi L. Attentional modulation of visual processing. Annual Review of Neuroscience. 2004;27:611–647. doi: 10.1146/annurev.neuro.26.041002.131039. [DOI] [PubMed] [Google Scholar]
Ruff DA, Ni AM, Cohen MR. Cognition as a window into neuronal population space. Annual Review of Neuroscience. 2018;41:77–97. doi: 10.1146/annurev-neuro-080317-061936. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sepulveda P, Usher M, Davies N, Benson AA, Ortoleva P, De Martino B. Visual attention modulates the integration of goal-relevant evidence and not value. eLife. 2020;9:e60705. doi: 10.7554/eLife.60705. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shadlen MN, Shohamy D. Decision making and sequential sampling from memory. Neuron. 2016;90:927–939. doi: 10.1016/j.neuron.2016.04.036. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shenhav A, Dean Wolf CK, Karmarkar UR. The evil of banality: when choosing between the mundane feels like choosing between the worst. Journal of Experimental Psychology: General. 2018;147:1892–1904. doi: 10.1037/xge0000433. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shimojo S, Simion C, Shimojo E, Scheier C. Gaze Bias both reflects and influences preference. Nature Neuroscience. 2003;6:1317–1322. doi: 10.1038/nn1150. [DOI] [PubMed] [Google Scholar]
Smith SM, Krajbich I. Attention and choice across domains. Journal of Experimental Psychology: General. 2018;147:1810–1826. doi: 10.1037/xge0000482. [DOI] [PubMed] [Google Scholar]
Smith SM, Krajbich I. Gaze amplifies value in decision making. Psychological Science. 2019;30:116–128. doi: 10.1177/0956797618810521. [DOI] [PubMed] [Google Scholar]
Song M, Wang X, Zhang H, Li J. Proactive information sampling in Value-Based Decision-Making: deciding when and where to saccade. Frontiers in Human Neuroscience. 2019;13:35. doi: 10.3389/fnhum.2019.00035. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sorokin I, Seleznev A, Pavlov M, Fedorov A, Ignateva A. Deep attention recurrent Q-network. arXiv. 2015 https://arxiv.org/abs/1512.01693
Tajima S, Drugowitsch J, Pouget A. Optimal policy for value-based decision-making. Nature Communications. 2016;7:12400. doi: 10.1038/ncomms12400. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tajima S, Drugowitsch J, Patel N, Pouget A. Optimal policy for multi-alternative decisions. Nature Neuroscience. 2019;22:1503–1511. doi: 10.1038/s41593-019-0453-9. [DOI] [PubMed] [Google Scholar]
Tavares G, Perona P, Rangel A. The attentional drift diffusion model of simple perceptual Decision-Making. Frontiers in Neuroscience. 2017;11:468. doi: 10.3389/fnins.2017.00468. [DOI] [PMC free article] [PubMed] [Google Scholar]
Towal RB, Mormann M, Koch C. Simultaneous modeling of visual saliency and value computation improves predictions of economic choice. PNAS. 2013;110:E3858–E3867. doi: 10.1073/pnas.1304429110. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang L, Krauzlis RJ. Visual selective attention in mice. Current Biology. 2018;28:676–685. doi: 10.1016/j.cub.2018.01.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wittig JH, Jang AI, Cocjin JB, Inati SK, Zaghloul KA. Attention improves memory by suppressing spiking-neuron activity in the human anterior temporal lobe. Nature Neuroscience. 2018;21:808–810. doi: 10.1038/s41593-018-0148-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wurtz RH. Neuronal mechanisms of visual stability. Vision Research. 2008;48:2070–2089. doi: 10.1016/j.visres.2008.03.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yang SC, Lengyel M, Wolpert DM. Active sensing in the categorization of visual patterns. eLife. 2016;5:e12215. doi: 10.7554/eLife.12215. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yu AJ, Dayan P, Cohen JD. Dynamics of attentional selection under conflict: toward a rational bayesian account. Journal of Experimental Psychology. Human Perception and Performance. 2009;35:700–717. doi: 10.1037/a0013553. [DOI] [PMC free article] [PubMed] [Google Scholar]

eLife. doi: 10.7554/eLife.63436.sa1

Decision letter

Editor: Konstantinos Tsetsos¹

Reviewed by: Konstantinos Tsetsos²

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Although recent research has described the interplay between attention and decision-making processes, the merits of employing selective attention during choice tasks remain largely underexplored. This paper closes this gap by offering a normative framework that specifies how reward-maximizing agents should employ attention while making value-based decisions. This framework, asides its theoretical importance, makes contact with the existing empirical literature while also providing detailed behavioural predictions that can be tested in future experiments.

Decision letter after peer review:

Thank you for submitting your article "Optimal policy for attention-modulated decisions explains human fixation behavior" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Konstantinos Tsetsos as the Reviewing Editor and Reviewer #, and the evaluation has been overseen by Joshua Gold as the Senior Editor.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require a modest amount of additional new data, as they do with your paper, we are asking that the manuscript be revised to either limit claims to those supported by data in hand, or to explicitly state that the relevant conclusions require additional supporting data.

Our expectation is that the authors will eventually carry out the additional experiments and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.

Summary:

In this paper the authors derive a normative model of attentional switching during binary value-based decisions. The model the authors propose assumes that attentional allocation is under the control of the agent, who on each moment decides among halting deliberation and choosing an alternative, carrying on sampling from the currently attended alternative, switching attention to the other alternative. Under certain assumptions about deliberation and attentional allocation costs, the authors derive the optimal policy, showing also that this policy reproduces key aspects of human behaviour. The approach here stands in contrast with the majority of past research on this topic, which has assumed that attentional allocation is exogenous and that it influences- but is not influenced by- the ongoing choice process. All reviewers agreed that this is an interesting and timely article that takes a normative approach to the important issue of attentional allocation in decision-making tasks. However, the reviewers also agreed that there are several points that will need to be addressed in a revision.

Essential revisions

1. Link between the normative model and human data.

One of the main claims in the paper is that the normative model reproduces key aspects of human behaviour in a value-based binary choice task reported elsewhere (Krajbich et al., 2010). This claim needs to be backed up by additional analyses and clarifications. In particular:

a. In Figure 3A, the model predicts a linear choice curve (as opposed to a sigmoidal in the human data) while, also unlike in the data, the RT curves are concave. Are the shapes of these curves a general feature of the model or do they emerge for a certain parametrisation? Would for example the results look different with fixed bounds?

b. Similarly, in Figure 3D, the model shows that the last fixation influences probability of choice equally across value differences. In the human data this effect diminishes as value difference increases. Is this prediction or a feature of the model?

c. In Figure 3C the authors focus on the number of switches, which however are highly correlated with RTs. Can the authors show in addition the number of switches per unit of time (equivalent to probability of switching) in the human data and in their model?

d. In Figure 4B-C, RTs and "value sum" may covary, with high value sums presumably resulting in shorter RTs. Does the value sum effect persist in the model and in the data once the influence of RTs is accounted for?

2. Comparison between the normative and aDDM models.

The two models differ in the way they conceptualise attentional switching (exogenous vs. endogenous). Does this difference manifest itself in the predictions the two models make? And can the human data distinguish between the two models? The authors comment in the Discussion that their model is more constrained by normative considerations, and may thus fit the data worse than other ad-hoc models. Asides quantitative fitting, which the authors may elect to perform, we would like to see a qualitative comparison between the two models.

a. How does the probability of switching changes as a function of time (within trial) and as a function of absolute value difference and sum (across trials) in the two models and in the data?

b. How does the fixation duration changes as a function of time (within trial) and as a function of absolute value difference and sum (across trials) in the two models and in the data?

c. The model seems to predict a non-negligible number of single fixation trials. How does this align with the data and the aDDM predictions?

d. How do reaction time distributions look like under the normative model? Are these comparable to the RT distributions in the data?

e. Please also show the aDDM predictions together with the novel predictions made by the normative model in Figure 4C.

f. The comparison between the two models is currently done on the basis of mean reward. How do the predictions of the models compare to the mean reward accrued by human participants in Krajbich et al? Please also clarify in the main text that the models are not compared on the basis of goodness of fit. In particular, in the paragraph starting in line 260, terms such as "outperformed", "competitive performance", "comparison" are ambiguous.

g. Please explain why the model shows the pattern that is demonstrated in Figure 3F. Is this pattern also predicted by the aDDM?

3. Comprehensibility of the modelling.

The authors have done an admirable job of boiling down some heavy mathematics (in the SI) into just the key steps in the main text. A lot of this builds on prior work in Tajima et al. Still, the authors could do more to explain the math, which would make this paper more self-contained and really help the reader understand the core ideas/intuitions.

a. Equation 1 – it isn't obvious how the mean will behave over time or where this expressions comes from. It would be helpful to state that the z part is the prior, and that as t->infinity the σ terms become negligible and the expression converges to x/t, namely the true value. The σ-squared terms seem to come out of nowhere. In the methods the authors explain that this has to do with Fisher information, but most readers won't know what that is or why it is the appropriate thing to include here. Also, σ_x is defined later in the text, but should be defined up front here.

b. Between Equation 1 and 2: In the means for δ_x, one has a δ_t multiplying z and one doesn't. Why? Is this a typo? The δ_t should always be there, but again, it isn't obvious.

c. Equation 2: Why does the mean have a z1| at the beginning of it? Is that a typo?

d. Could the authors elaborate more on why/when the decision maker chooses to switch attention? They say that the decision at each time point only depends on the difference in posterior means but Figure 2c seems to indicate that if the difference in posterior means stays constant over a period of time, then the process enters the "switch" zone and shifts attention.

4. Model assumptions.

a. The model assumes that attention can take one of two discrete states. However, attention is often regarded as operating in a more graded fashion, which would necessitate parameter kappa in the model to be free to change within a trial. This will probably render the model intractable. However, a more viable and less radical assumption, would be to allow for a third "divided attention" mode in which the agent samples equally from both alternatives (divided attention mode). If this extension is technically challenging, one conceptual question that the authors can discuss in the paper, is whether the normative model would ever switch away from this divided attention mode.

b. The authors need to assume a certain prior, namely z_bar = 0, in order to always get a positive effect of attention. This seems like an important controversy in the model; it is a noticeably non-Bayesian feature of a Bayesian model. The authors try to explain this away by noting that the original rating scale included both negative and positive values. However, only positive items were included in the choice task, and there is a consistently positive effect of attention on choice for other tasks (see Cavanagh et al. 2014; Smith and Krajbich 2018; Smith and Krajbich 2019) with only positive outcomes. This needs to be more openly acknowledged and discussed.

c. The last paragraph of Discussion talks about a lack of benefit of focussed attention in the analysed task. Would focussing attention would become beneficial in decision tasks with more than 2 options? Although answering this question would be a separate paper, a few sentences on generalising this work to more than two options could be included in the Discussion.

5. Coverage of literature.

a. In the Introduction, the authors state that the "final choices are biased towards the item that they looked at longer, irrespective of its desirability". This is not quite true. The desirability does matter, as shown in Smith and Krajbich 2019, as well as Westbrook et al. 2020. Moreover, as the authors note in the discussion, Armel et al. 2008 show that attention has a reverse effect when the items are aversive. Please update the introduction accordingly.

b. The authors do not mention that there is some work that has argued for value attracting attention in multi-alternative choice. While that work does not go to the lengths that this paper does, it does make normative arguments for why this should occur, namely to eliminate non-contenders (Krajbich and Rangel 2011; Towal et al. 2013; Gluth et al. 2020). Finally, the authors might also want to mention Ke, Shen, Villas-Boas (2016), which also takes a normative approach to information search in consumer choice.

c. On page 2 the authors state that "no current normative framework incorporates control of attention as an intrinsic aspect of the decision-making process". This does not seem to be accurate given the study by of Cassey et al. (2013). The key difference is that in Cassey et al. the focus was placed on the fixed duration paradigm, while the present manuscript focuses on the free-response paradigm. Please clarify the link of the current study with Cassey et al.

d. In lines 318-321, the authors state that the model of Cassey et al. "could not predict when they [fixation switches] ought to occur". The model of Cassey et al. does predict when the optimal switching times are, but for the case of a fixed duration paradigm with \kappa = 0. In this case the optimal switch policy is much simpler (single or at most double switch at particular times) than in the free-response paradigm nicely analysed in the present manuscript.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Optimal policy for attention-modulated decisions explains human fixation behavior" for further consideration by eLife. Your revised article has been evaluated by Joshua Gold (Senior Editor) and a Reviewing Editor.

Summary:

The authors have done an excellent and thorough job in addressing most of the comments that were raised during the first review. Please find below a list of remaining issues.

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:

Essential revisions

1. In our previous points 2a-b the request was to show the switch rate and fixation duration as a function of time, value sum and value difference. We apologise if that was not clear previously but with the term "time" we referred to elapsed time within a trial rather than reaction times (RT). Because RT's will be influenced by various properties of the trial (e.g. trial difficulty) the analyses reported in the revision are not very easy to interpret. Additionally, we are puzzled by the fact that the aDDM model predicts non-flat switch rates and fixation durations as a function of the different quantities (Figure 4—figure supplement 1), given that switching in the aDDM version the authors used is random. The aim of these previous points was to (a) highlight the differences between random switching (aDDM) and deliberate switching (optimal model), and (b) understand how switching tendencies change as a function of trial-relevant quantities and time elapsed in the optimal model. We appreciate that such analyses might be difficult to perform given the interrelationship among time, value sum and absolute value difference. Below we offer a few suggestions:

– One possibility is to consider a "stimulus locked" approach, in which the switch rate and fixation duration is plotted as a function of time elapsed from the stimulus onset and up to "x" milliseconds before the response. Inevitably, later time points will more likely include certain trial types, e.g. only difficult trials (or trials with low value sum). The authors can consider using a "stratification" approach, subsampling the trials such that all time-points have comparable trial distributions (in terms of value difference and value sum). The influence of value difference and value sum can be examined using median splits based on these quantities.

– Specifically for the switch rate analysis, the authors could perform a logistic regression trying to predict at each point in time the probability of switching, also considering covariates such as value sum or absolute value difference.

– Since the last fixation can be cut short, we recommend excluding the last fixation from these analyses.

– We recommend that the scaling of the x and y axis in the data and in the models is the same to allow comparison in absolute terms.

2. Krajbich et al. 2010 used an aDDM implementation in which fixations were not random but instead sampled from the empirical switching times distributions (thus fixations depended on fixation number, trial difficulty etc). The authors currently do not acknowledge this previous implementation. The aDDM with random switching is a good baseline for the scope of this paper, but the version used by Krajbich et al. 2010 makes different predictions than the random aDDM; and this should be explicitly acknowledged. For instance, the aDDM where switching matches the empirical distributions, accounts for the fact that the first fixations were shorter than the rest. Discussing this non-random aDDM used by Krajbich et al. 2010, will fit easily in the current discussion, since the optimal model offers a rationale fort the empirical fixation patterns that the older work simply incorporated into the aDDM simulations.

3. The authors slightly mischaracterise Krajbich and Rangel 2011 by saying that "fixation patterns were assumed to be either independent of the decision-making strategy". That paper did condition fixation patterns on the values of the items. Here is a direct quote from the Discussion: "These patterns are interesting for several reasons. First, they show that the fixation process is not fully independent from the valuation process, and contains an element of choice that needs to be explained in further work."

4. When comparing the mean reward of the models, the authors simulate the aDDM not with the parameters from Krajbich et al. 2010, but with different parameters meant to "ensure a fair comparison" between the models. We believe that this approach sets the aDDM to a disadvantage. If the best-fitting parameters of the "optimal model" lead to a lower signal-to-noise ratio than the aDDM best-fitting parameters, that should be acknowledged and accepted as is. We recommend the authors state upfront that the best-fitting optimal model does not outperform the data or the aDDM. However, if you use a non-best-fitting aDDM, then the aDDM underperforms both.

eLife. 2021 Mar 26;10:e63436. doi: 10.7554/eLife.63436.sa2

Author response

Essential revisions

1. Link between the normative model and human data.

One of the main claims in the paper is that the normative model reproduces key aspects of human behaviour in a value-based binary choice task reported elsewhere (Krajbich et al., 2010). This claim needs to be backed up by additional analyses and clarifications. In particular:

a. In Figure 3A, the model predicts a linear choice curve (as opposed to a sigmoidal in the human data) while, also unlike in the data, the RT curves are concave. Are the shapes of these curves a general feature of the model or do they emerge for a certain parametrisation? Would for example the results look different with fixed bounds?

The model predicts linear choice curves in Figure 3A due to the difficulty of the task, which is set by the model parameters including the evidence noise term (𝜎^"). Therefore, if we set the evidence noise term (𝜎^") to a lower value, the model will exhibit sigmoidal choice curves because the decision becomes easier at extreme value differences. Below is the RT curve after decreasing the evidence noise term (𝜎^") from 27 to 5, which renders the decisions easier due to an overall more reliable evidence accumulation. This can be seen in Figure 3—figure supplement 1A.

b. Similarly, in Figure 3D, the model shows that the last fixation influences probability of choice equally across value differences. In the human data this effect diminishes as value difference increases. Is this prediction or a feature of the model?

Similar to above, for Figure 3D, decreasing the noise term causes the choice curves to reach an asymptote at less extreme value differences. This is reflected in the last fixation effect, which shows a diminishing effect as the value difference increases. In Figure 3—figure supplement 1B is the last fixation effect curve after decreasing the evidence noise term (𝜎^") from 27 to 5.

c. In Figure 3C the authors focus on the number of switches, which however are highly correlated with RTs. Can the authors show in addition the number of switches per unit of time (equivalent to probability of switching) in the human data and in their model?

We thank the reviewers for this suggestion, and agree that the rate of switching is an important measure to report. Human data showed no relationship between rate of switching and trial difficulty (Figure 3—figure supplement 1D; t(38) = -0.32, p = 0.75). Interestingly, we found that with the large number of simulated trials, our optimal model shows an increase in the rate of switching as the task difficulty decreases (Figure 3—figure supplement 1E; t(38) = 2.96, p = 0.0052). However, when using the same number of trials as human data, this relationship was not apparent in the model (Figure 3—figure supplement 1F; t(38) = 1.02, p = 0.31), suggesting the human data may be underpowered to show such a relationship. Furthermore, we find that an increase in switch rate with decreasing trial difficulty is not a general property of the optimal model, as a significant increase in the switch cost (𝐶_$ from 0.018 to 0.1) reduces the overall number of switches, and removes this effect (Figure 3—figure supplement 1G; t(38) = -0.50, p = 0.62), even with a large number of simulated trials.

We have added the following to the Results to briefly summarize the switch rate analysis:

“Since the number of switches is likely correlated with response time, we also looked at switch rate (number of switches divided by response time). Here, although human data showed no relationship between switch rate and trial difficulty, model behavior showed a positive relationship, suggesting an increased rate of switching for easier trials. However, this effect was absent when using the same number of trials as humans, and did not generalize across all model parameter values (Figure 3—figure supplement 2).”

d. In Figure 4B-C, RTs and "value sum" may covary, with high value sums presumably resulting in shorter RTs. Does the value sum effect persist in the model and in the data once the influence of RTs is accounted for?

To address this, we repeated the analysis for the effect of value sum on the choice bias coefficients using a regression model with both value sum and RT as independent variables. We fit a regression model for each participant, then performed a t-test of the regression coefficients across participants. For simulated data from the model, after accounting for the RT, value sum still had a significant effect on choice bias (t(38) = 7.88, p < 0.001). Conversely, in the regression model for RT, adding value sum as another independent variable still led to a significant effect of RT on the choice bias coefficients (t(38) = -5.73, p < 0.001).

For human data, value sum had a significant effect on choice bias coefficients after adding RT to the regression (t(38) = 2.91, p = 0.006). RT had a non-significant effect on choice bias after adding value sum to the regression, although it was trending in the expected direction (t(38) = -1.32, p = 0.20).

We added the following segment to the Results and Methods and Materials sections:

Results:

“Since response time may be influenced by the sum of the two item values and vice versa, we repeated the above analyses using a regression model that includes both value sum and response time as independent variables (see Methods and Materials). The results were largely consistent for both model (effect of RT on choice bias: t(38) = -5.73, p < 0.001, effect of value sum: t(38) = 7.88, p < 0.001) and human (effect of RT: t(38) = -1.32, p = 0.20, effect of value sum: t(38) = 2.91, p = 0.006) behavior.”

Methods and Materials:

“To test for the effect of RT and value sum on choice bias after accounting for the other variable, we used a similar approach and used both RT and value sum as independent variables in the regression model and the choice bias coefficient as the dependent variable.”

2. Comparison between the normative and aDDM models.

The two models differ in the way they conceptualise attentional switching (exogenous vs. endogenous). Does this difference manifest itself in the predictions the two models make? And can the human data distinguish between the two models? The authors comment in the Discussion that their model is more constrained by normative considerations, and may thus fit the data worse than other ad-hoc models. Asides quantitative fitting, which the authors may elect to perform, we would like to see a qualitative comparison between the two models.

Indeed, we had fairly limited qualitative analysis in how our model behavior differs from that of the aDDM, and – potentially – from human data. We have thus performed additional analyses that we describe below. Following this, we provide the text we added to the main text to describe these analyses.

a. How does the probability of switching changes as a function of time (within trial) and as a function of absolute value difference and sum (across trials) in the two models and in the data?

To test the relationship between switch rate and time, for each participant, we divided all trials into five equallysized bins based on RT. The plots show the average curves across participants, where vertical error bars indicate SEM for the relevant y-variable (e.g., switch rate), and horizontal error bars indicate the SEM of the bin means. A linear relationship between the x and y variables were performed by fitting a linear regression model within each participant, then performing a t-test of the regression coefficients across subjects against zero. Analogous binning and statistical procedures were used when dividing trials by value sum and value difference. For model simulations, we used the same trials and their corresponding item values as the human task.

In human data, the probability of switching decreases as a function of time (t(38) = -4.49, p < 0.001), while this relationship is neither apparent in the optimal model nor the aDDM (Author response image 1).

However, the shape of the optimal model curve suggested that trials with only a single fixation (switch rate = 0) may be distorting the relationship between switch rate and time. When only including trials where at least one switch occurred, both models predicted a decrease in switch rate over time, consistent with human data (optimal model: t(38) = -29.6, p < 0.001, aDDM: t(38) = -7.70, p < 0.001). This suggests that in both models, single fixation trials significantly affect the switch rate (Author response image 2).

The shape of the curve before and after removing single fixation trials suggests that the RT distribution in single fixation trials are more tightly distributed in the optimal model compared to the aDDM. Plotting the RT distribution of these trials confirmed this prediction.

Regarding the relationship between switch rate and value sum, human data showed no significant relationship (value sum, t(38) = -0.84, p = 0.40). However, both the optimal model and the aDDM showed a negative association, such that switch rate decreased as the value sum increased, suggesting that the model is less likely to switch attention within the same time frame for trials where higher value items are being compared (optimal model, t(38) = -4.11, p < 0.001; aDDM, t(38) = -2.09, p = 0.044) (Author response image 4).

Regarding the relationship between switch rate and absolute value difference (i.e., trial difficulty), human data again showed no significant relationship (t(38) = -0.67, p = 0.51). The optimal model also showed no significant relationship between switch rate and value difference (t(38) = -0.41, p = 0.68). However, the aDDM showed a positive association, suggesting that more switches occurred within the same time-frame for easier trials (t(38) = 4.62, p < 0.001) (Author response image 5).

To summarize, human data and both models show a decrease in switch rate as a function of time, suggesting that while the number of switches increases as a function of time, optimal behavior involves decreasing the likelihood of switching attention within the same time frame as the deliberation process takes longer. In contrast, we did not see a consistent relationship between switch rate and value sum or value difference for both human and model behavior. Of note, both the optimal model and aDDM show similar patterns when using a large number of simulated trials rather than the same trials used in human data, such that both models predict a negative association between switch rate and value sum, and a positive association between switch rate and absolute value difference. Therefore, we suspect that human data may still exhibit a similar relationship if more trials are performed. We report these findings in Figure 4—figure supplement 1A-D and its corresponding figure legend.

b. How does the fixation duration changes as a function of time (within trial) and as a function of absolute value difference and sum (across trials) in the two models and in the data?

Human data shows a positive association between fixation duration and RT (t(38) = 9.28, p < 0.001), a negative association between fixation duration and value sum (t(38) = -2.81, p = 0.0078), and a negative association between fixation duration and value difference (t(38) = -5.46, p < 0.001). Both the optimal model and the aDDM predicted similar patterns (optimal model, RT: t(38) = 85.6, p < 0.001 , value sum: t(38) = -4.19, p < 0.001, value diff: t(38) = -3.60, p < 0.001; aDDM, RT: t(38) = 13.65, p < 0.001, value sum: t(38) = -3.32, p = 0.002, value diff: t(38) = -6.44, p < 0.001) (Author response image 6).

c. The model seems to predict a non-negligible number of single fixation trials. How does this align with the data and the aDDM predictions?

While both the optimal model and aDDM over-estimated the number of single fixation trials compared to human data, the aDDM predicted significantly more than the optimal model (t(76) = 5.84, p < 0.001) (Author response image 7).

d. How do reaction time distributions look like under the normative model? Are these comparable to the RT distributions in the data?

Both models predict a RT curve that includes more sub-1s trials, consistent with the above results showing more single-fixation trials (Figure 4—figure supplement 3F).

e. Please also show the aDDM predictions together with the novel predictions made by the normative model in Figure 4C.

The aDDM also replicated the effects of RT and value sum on fixation bias (RT: t(38) = -48.6, p < 0.001; value sum: t(38) = 14.7, p < 0.001). This is not surprising given the initial assumptions made by the model wherein fixations boost the value of the item. Since increasing RT allows the model to spend similar amounts of time to each item, fixation bias will decrease. Also, since the impact of fixation is proportional to the value of the items, choosing between items with higher value will lead to a stronger fixation bias (Figure 4—figure supplement 3C and D).

f. The comparison between the two models is currently done on the basis of mean reward. How do the predictions of the models compare to the mean reward accrued by human participants in Krajbich et al? Please also clarify in the main text that the models are not compared on the basis of goodness of fit. In particular, in the paragraph starting in line 260, terms such as "outperformed", "competitive performance", "comparison" are ambiguous.

To test for this, we performed an independent samples t-test to the mean reward achieved by human participants versus the simulated participants of both models. We found there is no significant difference between the mean rewards of humans versus the optimal model (t(76) = 0.69, p = 0.49) and humans versus the aDDM (t(76) = -0.062, p = 0.95). Of note, for this analysis we used the aDDM setup used in the original paper by Krajbich et al., 2010 rather than the signal-to-noise-matched version we used to compare the mean reward between the optimal model and aDDM. Therefore, the mean reward of the optimal model was not greater than that of the aDDM in this scenario. To calculate mean reward, we used the same cost per unit time used for the optimal model (c = 0.23) (Figure 4—figure supplement 3B).

We also modified the ambiguous terms the reviewer mentioned:

“We also tested to which degree the optimal model yielded a higher mean reward than aDDM, which, despite its simpler structure, could nonetheless collect competitive amounts of reward. To ensure a fair comparison, we adjusted the aDDM model parameters (i.e., attentional value discounting and the noise variance) so that the momentary evidence provided to the two models has equivalent signal-to-noise ratios (see Supplementary file 1).”

g. Please explain why the model shows the pattern that is demonstrated in Figure 3F. Is this pattern also predicted by the aDDM?

The aDDM did not predict the same fixation pattern as the data and optimal model. This fixation pattern in the optimal model is well-preserved across different parameter values. We suspect this pattern arises due to the shape of the optimal decision boundaries, where the particle is more likely to hit the “switch” boundary in a shorter time for the first fixation, since the model likely prefers to sample from both items at least once. Consistent with this, Figure 2C shows that the “accumulate” space is larger for the second fixation compared to the first (Author response image 8).

We added the following to the Results:“Interestingly, the model also replicated a particular fixation pattern seen in humans, where a short first fixation is followed by a significantly longer second fixation, which is followed by a medium-length third fixation (Figure 3F). We suspect this pattern arises due to the shape of the optimal decision boundaries, where the particle is more likely to hit the “switch” boundary in a shorter time for the first fixation, likely reflecting the fact that the model prefers to sample from both items at least once. Consistent with this, Figure 2C shows that the “accumulate” space is larger for the second fixation compared to the first fixation. Of note, the attentional drift diffusion model (aDDM) that was initially proposed to explain the observed human data (Krajbich et al., 2010) did not show this fixation pattern Figure 4—figure supplement 2D.”

Overall, we added the following Figures / panels to describe the comparison to aDDM: Figure 3—figure supplement 2G, Figure 4—figure supplement 1, and Figure 4—figure supplement 2. Furthermore, we added the following to the main text:

“Next, we assessed how the behavioral predictions arising from the optimal model differed from those of the original attentional drift diffusion model (aDDM) proposed by Krajbich et al., (2010). Unlike our model, the aDDM follows from traditional diffusion models rather than Bayesian models. It assumes that inattention to an item diminishes its value magnitude rather than the noisiness of evidence accumulation. Despite this difference, the aDDM produced qualitatively similar behavioral predictions as the optimal model (Figure 3—figure supplement 2G, Figure 4—figure supplement 1), although the optimal model was able to better reproduce some of the fixation patterns seen in human behavior (Figure 4—figure supplement 2A,D).”

3. Comprehensibility of the modelling.

The authors have done an admirable job of boiling down some heavy mathematics (in the SI) into just the key steps in the main text. A lot of this builds on prior work in Tajima et al. Still, the authors could do more to explain the math, which would make this paper more self-contained and really help the reader understand the core ideas/intuitions.

a. Equation 1 – it isn't obvious how the mean will behave over time or where this expressions comes from. It would be helpful to state that the z part is the prior, and that as t->infinity the σ terms become negligible and the expression converges to x/t, namely the true value. The σ-squared terms seem to come out of nowhere. In the methods the authors explain that this has to do with Fisher information, but most readers won't know what that is or why it is the appropriate thing to include here. Also, sigma_x is defined later in the text, but should be defined up front here.

We agree that this expression might come as a surprise to some. We could have simplified it slightly, with the downside of a less direct relation to Eq. (2). As we felt that establishing this relationship is essential, we decided to keep Eq. (1) in its more complex form. To nonetheless make Eq. (1) it easier to digest, we have added additional information about how the prior and likelihood variances relate to their respective informativeness. Additionally, we have added additional details about the structure of the posterior mean and how it changes with increasing accumulation time t.

In particular, we added that “the smaller the prior variance (𝜎_%^"), the more information this prior provides about the true values”, and that “the evidence accumulation variance (𝜎_and^") controls how informative the momentary evidence is about the associated true value. A large 𝜎_and^" implies larger noise, and therefore less information provided by each of the momentary evidence samples.” We were also more explicit in describing how the posterior mean and variance evolves over time: “The mean of this posterior (i.e., the first fraction in brackets) is a weighted sum of the prior mean, 𝑧, and the accumulated evidence, 𝑥(𝑡). The weights are determined by accumulation time, 𝑡, and the variances of the prior, 𝜎_%^", and the momentary evidence, 𝜎_and^", which control their respective informativeness. Initially, 𝑡 = 0 and 𝑥(𝑡) = 0, such that the posterior mean equals that of the prior, 𝑧. Over time, with increasing 𝑡, the influence of 𝑥(𝑡) = 0 becomes dominant, and the mean approaches 𝑥(𝑡)/𝑡 (i.e., the average momentary evidence) for large 𝑡, at which point the influence of the prior becomes negligible. The posterior's variance (i.e., the second fraction in brackets) reflects the uncertainty in the decision maker's value inference. It initially equals the prior variance, 𝜎_%^", and drops towards zero once 𝑡 becomes large.”

We refrained from mentioning ‘Fisher information’ in the main text to avoid confusing readers that are not familiar with this concept. Instead, we kept discussion at a more informal level. Furthermore, we have added additional information about why we assess informativeness by Fisher information to Methods and Materials:

“We measure how informative a single momentary evidence sample is about the associated true value by computing the Fisher information it provides about this value. This Fisher information sums across independent pieces of information. This makes it an adequate measure for assessing the informativeness of momentary evidence, which we assume to be independent across time and items.”

b. Between Equation 1 and 2: In the means for delta_x, one has a delta_t multiplying z and one doesn't. Why? Is this a typo? The delta_t should always be there, but again, it isn't obvious.

c. Equation 2: Why does the mean have a z1| at the beginning of it? Is that a typo?

Both instances are indeed typos, and we thank the reviewers for pointing them out. We have corrected all instances of such typos from the main text and the Supplementary file 1.

d. Could the authors elaborate more on why/when the decision-maker chooses to switch attention? They say that the decision at each time point only depends on the difference in posterior means but Figure 2c seems to indicate that if the difference in posterior means stays constant over a period of time, then the process enters the "switch" zone and shifts attention.

Optimal decisions are determined as a function of not only the difference in posterior means (𝛥), but also the times attended to item 1 (𝑡₁) and item 2 (𝑡_"). This results in an optimal policy shape where the particle will hit the “switch” zone if the difference in expected rewards between the two items is too small to make an immediate decision, and it is deemed advantageous to collect more reward-related evidence of the currently unattended item. This prevents the model from deliberating for too long while attending to a single item. We added this clarification in Results under ‘Features of the Optimal Policy.’:

“In other words, the difference in expected rewards between the two items is too small to make an immediate decision, and it is deemed advantageous to collect more information about the currently unattended item.”

4. Model assumptions.

a. The model assumes that attention can take one of two discrete states. However, attention is often regarded as operating in a more graded fashion, which would necessitate parameter kappa in the model to be free to change within a trial. This will probably render the model intractable. However, a more viable and less radical assumption, would be to allow for a third "divided attention" mode in which the agent samples equally from both alternatives (divided attention mode). If this extension is technically challenging, one conceptual question that the authors can discuss in the paper, is whether the normative model would ever switch away from this divided attention mode.

We agree that a decision-making model that incorporates a continuously variable attention would be an interesting endeavor. As the reviewers suggest, we could address this by adding another state to the Bellman equation in which attention is perfectly divided. Since this will add another dimension to the value and policy space, we anticipate this will become intractable quickly. However, we believe there is sufficient literature on this topic to reasonably predict how such a model would behave.

Previous work by Fudenberg, Strack and Strzalecki (2018) discusses a modified drift diffusion model in which attention can vary continuously and gradually across two choice options. They show that, consistent with our results, the model with equally divided attention performs optimally. Drawing from this, we can confidently state that our optimal model would always engage in the divided attention mode. However, the authors also state that there may be instances within a decision when it would be optimal to pay unequal attention. In fact, if the normative decision maker has already paid more attention to one item over the other item, it may be optimal to switch attention and gain more information about the unattended item rather than to proceed in the divided attention mode.

To address this, we added the following to the Discussion:

“We show that narrowing the attentional bottleneck by setting κ to values closer to 0 or 1 does not boost performance of our decision-making model (Figure 4E). Instead, spreading a fixed cognitive reserve evenly between the attended and unattended items maximized performance. This is consistent with prior work that showed that a modified drift diffusion model with a continuously varying attention would perform optimally when attention is always equally divided (Fudenberg et al., 2018). However, this does not necessarily imply that equally divided attention always constitutes the normative behavior. If the decision maker has already paid more attention to one item over the other within a decision, it may be optimal to switch attention and gain more information about the unattended item rather than to proceed with equally divided attention.”

b. The authors need to assume a certain prior, namely z_bar = 0, in order to always get a positive effect of attention. This seems like an important controversy in the model; it is a noticeably non-Bayesian feature of a Bayesian model. The authors try to explain this away by noting that the original rating scale included both negative and positive values. However, only positive items were included in the choice task, and there is a consistently positive effect of attention on choice for other tasks (see Cavanagh et al. 2014; Smith and Krajbich 2018; Smith and Krajbich 2019) with only positive outcomes. This needs to be more openly acknowledged and discussed.

We agree with the reviewers that the exact features and role of the Bayesian prior remains a topic of discussion, and we acknowledge that while our formulation suggests a zero-mean prior, there is also evidence suggesting the prior should be centered on the choice set. We have modified the Discussion to be more transparent regarding this point, and added the citations suggested by the reviewers in support of a non-zeromean prior distribution.

“In our model, we assumed the decision maker's prior belief about the item values is centered at zero. In contrast, Callaway et al. (2020) chose a prior distribution based on the choice set, centered on the average value of only the tested items. While this is also a reasonable assumption (Shenhav et al., 2018), it likely contributed to their inability to demonstrate the choice bias. Under the assumption of our zero-mean prior, formulating the choice process through Bayesian inference revealed a simple and intuitive explanation for choice biases (Figure 4A) (see also Li and Ma (2020)). This explanation required the decision maker to a-priori believe the items' values to be lower than they actually are when choosing between appetitive options, consistent with evidence that item valuations vary inversely with the average value of recently observed items (Khaw et al., 2017). The zero-mean prior also predicts an opposite effect of the choice bias when deciding between aversive items, such that less-fixated items should become the preferred choice. This is exactly what has been observed in human decision-makers (Armel et al., 2008). We justified using a zero-mean bias because participants in the decision task were allowed to rate items as having both positive or negative valence (negative-valence items were excluded from the binary decision task). However, there is some evidence that humans also exhibit choice biases when only choosing between appetitive items (Cavanagh et al., 2014, Smith et al., 2018, Smith et al., 2019). Although our setup suggests a zero-mean prior is required to reproduce the choice bias, the exact features and role of the Bayesian prior in human decisions still remains an open question for future work.”

c. The last paragraph of Discussion talks about a lack of benefit of focussed attention in the analysed task. Would focussing attention would become beneficial in decision tasks with more than 2 options? Although answering this question would be a separate paper, a few sentences on generalising this work to more than two options could be included in the Discussion.

We agree it is interesting to discuss whether our findings would generalize to multi alternative choices. While we cannot definitively answer this question, we believe that under the same framework of our binary choice model, the same principles would apply such that divided attention across all items would lead to optimal behavior. Once the framework becomes more complex and involves features such as increased attention to items based on value or salience, this may lead to scenarios where focused attention may be beneficial. This is consistent with the idea that although divided attention maximizes reward on average, focusing attention to single items may be preferred if the decision maker has already done so for any other item(s) for heuristical reasons.

We have added the following text to the Discussion to highlight this point:

“An open question is whether our findings can be generalized to multi-alternative choice paradigms (Towal et al., 2013, Ke et al., 2016, Gluth et al., 2020, Tajima et al., 2019). While implementing the optimal policy for such choices may be analytically intractable, we can reasonably infer that a choice bias driven by a zero-mean prior would generalize to decisions involving more than two options. However, in a multi alternative choice paradigm where heuristics involving value and salience of items may influence attention allocation, it is less clear whether an equally divided attention among all options would still maximize reward. We hope this will motivate future studies that investigate the role of attention in more realistic decision scenarios.”

5. Coverage of literature.

a. In the Introduction, the authors state that the "final choices are biased towards the item that they looked at longer, irrespective of its desirability". This is not quite true. The desirability does matter, as shown in Smith and Krajbich 2019, as well as Westbrook et al. 2020. Moreover, as the authors note in the discussion, Armel et al. 2008 show that attention has a reverse effect when the items are aversive. Please update the introduction accordingly.

Thank you for this suggestion. We have now removed the phrase “irrespective of its desirability.”

b. The authors do not mention that there is some work that has argued for value attracting attention in multi-alternative choice. While that work does not go to the lengths that this paper does, it does make normative arguments for why this should occur, namely to eliminate non-contenders (Krajbich and Rangel 2011; Towal et al. 2013; Gluth et al. 2020). Finally, the authors might also want to mention Ke, Shen, Villas-Boas (2016), which also takes a normative approach to information search in consumer choice.

We agree that the effect of value and salience of the choice items on attention allocation is a relevant topic to discuss. We also agree that these processes are likely contributing to the fixation behavior of human participants, and elements that could be added to our normative formulation in future work. We have added the following text to note this, and added the citations, including Ke et al., (2016), suggested by the reviewer (see bolded text for added information).

“In previous work, fixation patterns were assumed to be either independent of the decision-making strategy (Krajbich et al., 2010, Krajbich et al., 2011) or generated by heuristics that relied on features such as the salience or value estimates of the choice options (Towal et al., 2013, Gluth et al., 2020). Other models generated fixations under the assumption that fixation time to different information sources should depend on the expected utility or informativeness of the choice items (Ke et al., 2016, Cassey et al., 2013, Song et al., 2019).”

“When designing our model, we took the simplest possible approach to introduce an attentional bottleneck into normative models of decision-making. When doing so, our aim was to provide a precise (i.e., without approximations), normative explanation for how fixation changes qualitatively interact with human decisions rather than quantitatively capture all details of human behavior, which is likely driven by additional heuristics and features beyond the scope of our model (Acerbi et al., 2014, Drugowitsch et al., 2016). For instance, it has been suggested that normative allocation of attention should also depend on the item values to eliminate non-contenders, which we did not incorporate as a part of our model (Towal et al., 2013, Gluth et al., 2020). As such, we expect other models using approximations to have a better quantitative fit to human data (Krajbich et al., 2010, Callaway et al., 2020).”

“An open question is whether our findings can be generalized to multi-alternative choice paradigms (Towal et al., 2013, Ke et al., 2016, Gluth et al., 2020, Tajima et al., 2019).”

c. On page 2 the authors state that "no current normative framework incorporates control of attention as an intrinsic aspect of the decision-making process". This does not seem to be accurate given the study by of Cassey et al. (2013). The key difference is that in Cassey et al. the focus was placed on the fixed duration paradigm, while the present manuscript focuses on the free-response paradigm. Please clarify the link of the current study with Cassey et al.

We agree with this point, and modified the statement to the following:

“While several prior studies have developed decision-making models that incorporate attention (Yu et al., 2009, Krajbich et al., 2010,Towal et al., 2013, Cassey et al., 2013, Gluth et al., 2020), our goal was to develop a normative framework that incorporates control of attention as an intrinsic aspect of the decision-making process in which the agent must efficiently gather information from all items while minimizing the deliberation time, akin to real life decisions. In doing so, we hoped to provide a computational rationale for why fixationdriven choice biases seen in human behavior may arise from an optimal decision strategy.”

d. In lines 318-321, the authors state that the model of Cassey et al. "could not predict when they [fixation switches] ought to occur". The model of Cassey et al. does predict when the optimal switching times are, but for the case of a fixed duration paradigm with \kappa = 0. In this case the optimal switch policy is much simpler (single or at most double switch at particular times) than in the free-response paradigm nicely analysed in the present manuscript.

We thank the reviewers for this clarification, and modified the statement to the following:

“Furthermore, since their decision task involved a fixed-duration, attention switches also occurred at fixed times rather than being dynamically adjusted across time, as in our case with a free-response paradigm.”

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Essential revisions

1. In our previous points 2a-b the request was to show the switch rate and fixation duration as a function of time, value sum and value difference. We apologise if that was not clear previously but with the term "time" we referred to elapsed time within a trial rather than reaction times (RT). Because RT's will be influenced by various properties of the trial (e.g. trial difficulty) the analyses reported in the revision are not very easy to interpret. Additionally, we are puzzled by the fact that the aDDM model predicts non-flat switch rates and fixation durations as a function of the different quantities (Figure 4—figure supplement 1), given that switching in the aDDM version the authors used is random. The aim of these previous points was to (a) highlight the differences between random switching (aDDM) and deliberate switching (optimal model), and (b) understand how switching tendencies change as a function of trial-relevant quantities and time elapsed in the optimal model. We appreciate that such analyses might be difficult to perform given the interrelationship among time, value sum and absolute value difference. Below we offer a few suggestions:

– One possibility is to consider a "stimulus locked" approach, in which the switch rate and fixation duration is plotted as a function of time elapsed from the stimulus onset and up to "x" milliseconds before the response. Inevitably, later time points will more likely include certain trial types, e.g. only difficult trials (or trials with low value sum). The authors can consider using a "stratification" approach, subsampling the trials such that all time-points have comparable trial distributions (in terms of value difference and value sum). The influence of value difference and value sum can be examined using median splits based on these quantities.

– Specifically for the switch rate analysis, the authors could perform a logistic regression trying to predict at each point in time the probability of switching, also considering covariates such as value sum or absolute value difference.

– Since the last fixation can be cut short, we recommend excluding the last fixation from these analyses.

– We recommend that the scaling of the x and y axis in the data and in the models is the same to allow comparison in absolute terms.

Thank you for the clarification, and we apologize for the misunderstanding.

We reanalyzed the data completely in order to address the discrepancies the editors have brought up. For the aDDM simulations in the previous version of the manuscript, we already used the empirical distribution of fixation durations as described in the original literature (Krajbich et al., 2010), but did not sample the first fixations separately. As we discuss below (point #2), we do so now. We also noted that in the human data, RT was not equal to the sum of all fixation durations since there were procedures to remove non-item fixations by the original authors. This likely added to some discrepancy between the aDDM simulations and human data, especially for the switch rate analysis where we divided the number of switches by the RT. For the following analyses, we have adjusted the behavioral data so that RT is now equal to the sum of all fixations.

First, we would like to address the reviewers’ point that the aDDM model should predict flat switch rates and fixation duration as a function of different quantities. As the reviewers correctly point out, the aDDM samples fixation durations from human data, separately for each absolute value difference between the two items. Therefore, the relationship between switch probability and other variables such as time, value difference, and value sum should be reasonably preserved between the model and human data. However, we find that the implementation of the aDDM results in some systematic discrepancies in the switching behavior. For instance, since the aDDM terminates the trial whenever a decision boundary is reached, long fixations are more likely to result in boundary crossings than short fixations, such that long fixations are more likely considered last fixations and thus excluded from analysis. This causes the model to feature lower mean middle fixation durations and a higher switch frequency when compared to human data.

In Author response image 9, we show how middle fixation duration and switch rate varies across trials based on the trial RT, absolute value difference, and value sum for humans and the aDDM simulations. For every trial, we computed the mean middle fixation duration and switch rate (total number of switches divided by RT), then grouped the trials based on RT (using 0.5s bins from 1-4s), absolute value difference (i.e., trial difficulty) and value sum. We then computed the mean fixation duration and switch rate across participants for each x-variable. We show these results mainly to demonstrate that, although the aDDM randomly samples non-first fixations from empirical middle fixation durations, trial variables such as RT and item values still influence the switching behavior when implemented in the aDDM framework.

The plots are shown in Author response image 9, along with analogous plots for the optimal model, for the new Figure 4—figure supplement 1. The purpose of this figure is to show a clearer version of our between-trial analyses of how switch rate and fixation duration are affected by value sum and value difference for the human and the two simulated behavioral data sets. We did not show the effect of RT on these quantities since, as the reviewers mention, this is difficult to interpret since RT correlates with other task variables such as difficulty. We also decided to use different y-axis scales, since using the same scale for all three datasets makes it difficult to appreciate the slope of certain plots.We next turned to the original intent of the review question, which is to show how switch probability and fixation duration varies across time within trials. To do so, we followed the reviewers’ recommended analyses. For a given participant, we aligned all trials by the stimulus onset, then counted the number of switches within each 200ms time bin. We then averaged the switch count within each time bin across trials to get the switch probability. Since RTs differ across trials, we only included time points up to when at least ⅓ of the total trials are included. This implies that switch probabilities at later times are averaged across fewer trials. We also removed the last fixations as suggested.

We found that switching behavior within trials is well-preserved between humans and the aDDM. Both humans and the aDDM show a peak in switch probability within 1s of stimulus onset, followed by a gradual decrease (Figure 4—figure supplement 2 column A). The optimal model, in comparison, exhibits more discrete time points where a majority of attention switches occur. This is not surprising given the shape of the optimal policy space, where the particle is guaranteed to hit either a switch or decision boundary within a fixed time period for each fixation. Note, however, that these simulations assumed that time is perceived and measured with millisecond precision by the decision-maker, while humans are known to feature noisy time estimates (Buhusi and Meck, 2005). Furthermore, we assumed the additive non-decision time to be noise-free. We therefore anticipate that, with noisy time perception, the shape of the curve will smoothen out and approach that of human behavior and the aDDM.

Next, we explored whether the relationship between switch probability and time is influenced by variables such as RT (Figure 4—figure supplement 2 column B), value sum (Figure 4—figure supplement 2 column C) and value difference (Figure 4—figure supplement 2 column D). To do so, for each participant, we split all trials into three equally sized bins based on the variable of interest, and then made the same plots as column A only including the first and last bin. At each time point, we performed a t-test across participants between the two bins, and marked any time point with a significant difference across bins (Bonferroni corrected) with an asterisk. Humans and aDDM simulations featured a higher switch probability across time for trials with longer RTs compared to trials with shorter RTs, likely reflecting the fact that trials with longer RTs are more difficult, resulting in more early attention switches. Consistent with this, both human data and aDDM simulations showed a slightly higher switch probability across time for trials with low value difference (i.e., difficult trials) compared to those with a high value difference (i.e., easy trials). Within the time periods in which switches occurred, the optimal model featured comparable patterns.

We used a similar method to investigate fixation duration over time. Whenever a switch occurred within a trial, we recorded the fixation duration until the next switch occurred. We only used middle fixations (excluding first and last fixations), similar to the analyses performed in Krajbich et al. (2010). We averaged the fixation durations at each time point across trials, dropping all time points that contain data from less than ⅓ of all trials. We then plotted the mean fixation duration at each time point across participants, shown in Author response image 10.

The results show that under the current parameterization, the optimal model features overall longer fixation durations than humans. Interestingly, these durations increased for both humans and the optimal model with time, suggesting that more time is allotted to each fixation as the trial becomes more difficult (human: t(38)=4.50, p < 0.001; optimal model: t(38)=46.4, p<0.001). This trend is not seen in the aDDM which draws all middle fixations randomly from the same empirical distribution (t(38)=-0.57, p=0.57). Therefore, any effect of time within a single trial on fixation duration would be eliminated.

2. Krajbich et al. 2010 used an aDDM implementation in which fixations were not random but instead sampled from the empirical switching times distributions (thus fixations depended on fixation number, trial difficulty etc). The authors currently do not acknowledge this previous implementation. The aDDM with random switching is a good baseline for the scope of this paper, but the version used by Krajbich et al. 2010 makes different predictions than the random aDDM; and this should be explicitly acknowledged. For instance, the aDDM where switching matches the empirical distributions, accounts for the fact that the first fixations were shorter than the rest. Discussing this non-random aDDM used by Krajbich et al. 2010, will fit easily in the current discussion, since the optimal model offers a rationale fort the empirical fixation patterns that the older work simply incorporated into the aDDM simulations.

We thank the reviewers for this suggestion, as it brought to attention an oversight in our aDDM simulations. In our previous analyses, we simulated the aDDM by sampling all fixation durations from middle fixations in human data (split by absolute value difference), but did not sample the first fixation separately from the other fixations. This led to the result on the previous Figure 4—figure supplement 2D, where the fixation duration of the first three fixations for the aDDM did not replicate the patterns seen in humans. We have now repeated the same analyses while separately sampling the first fixations which leads to an aDDM prediction that better mimics human fixation behavior. We added this new result to Figure 4—figure supplement 3E.

However, as the reviewers correctly point out, a strength of our model is that it provides a rationale for the fixation patterns seen in humans, whereas the aDDM simply sampled from the empirical distribution. We have now removed all discussions stating that the optimal model better-predicted this fixation pattern compared to the aDDM and clarified this point in the Results section as follows:

“Of note, the attentional drift diffusion model (aDDM) that was initially proposed to explain the observed human data (Krajbich et al., 2010) did not generate its own fixations, but rather used fixations sampled from the empirical distribution of human subjects. Furthermore, they were only able to achieve this fixation pattern by sampling the first fixation, which was generally shorter than the rest, separately from the remaining fixation durations.”

3. The authors slightly mischaracterise Krajbich and Rangel 2011 by saying that "fixation patterns were assumed to be either independent of the decision-making strategy". That paper did condition fixation patterns on the values of the items. Here is a direct quote from the Discussion: "These patterns are interesting for several reasons. First, they show that the fixation process is not fully independent from the valuation process, and contains an element of choice that needs to be explained in further work."

We thank the reviewers for pointing this out, and agree that the Krajbich et al. (2010) as well as the Krajbich and Rangel (2011) papers suggest that fixations are not completely independent of the decision-making strategy, but rather, that they are affected by decision-relevant variables such as trial difficulty or item value. We have modified our Discussion to reflect this:

“This is consistent with previous work that showed that fixation patterns were influenced by variables relevant for the decision, such as trial difficulty or the value of each choice item (Krajbich et al., 2010; Krajbich and Rangel, 2011). However, prior models of such decisions assumed an exogenous source of fixations (Krajbich et al., 2010; Krajbich and Rangel, 2011) or generated fixations using heuristics that relied on features such as the salience or value estimates of the choice options (Towal et al., 2013; Gluth et al., 2020).”

4. When comparing the mean reward of the models, the authors simulate the aDDM not with the parameters from Krajbich et al. 2010, but with different parameters meant to "ensure a fair comparison" between the models. We believe that this approach sets the aDDM to a disadvantage. If the best-fitting parameters of the "optimal model" lead to a lower signal-to-noise ratio than the aDDM best-fitting parameters, that should be acknowledged and accepted as is. We recommend the authors state upfront that the best-fitting optimal model does not outperform the data or the aDDM. However, if you use a non-best-fitting aDDM, then the aDDM underperforms both.

We understand the reviewers’ concern here, and believe that the method in which we perform the comparison depends on the question we are trying to answer. Our goal was to establish that, since our model provides the optimal solution to the decision problem under the current assumptions, it should outperform or at least match the performance of any alternative model. Therefore, it is an attempt to demonstrate optimality of our model, rather than necessarily make any conclusions about the performance of the aDDM or any other model we may use as a comparison. However, the editors correctly point out that another relevant question would be to compare the performance of the best-fitting versions of each model. To address both perspectives, we have now provided the results to both instances, and changed the text in the Results section as following:

“Given that our model provides the optimal solution to the decision problem under the current assumptions, it is expected to outperform, or at least match, the performance of alternative models. To ensure a fair comparison, we adjusted the aDDM model parameters (i.e., attentional value discounting and the noise variance) so that the momentary evidence provided to the two models has equivalent signal-to-noise ratios (see Appendix 1). Using the same parameters fit to human behavior without this adjustment in signal-to-noise ratio yielded a higher mean reward for the aDDM model (t(76) = -14.8, p < 0.001), since the aDDM receives more value information at each time point than the optimal model. The original aDDM model fixed the decision boundaries at $\pm1$ and subsequently fit model parameters to match behavioral data. Since we were interested in comparing mean reward, we simulated model behavior using incrementally increasing decision barrier heights, looking for the height that yields the maximum mean reward (Figure 4D). We found that even for the best-performing decision barrier height, the signal-to-noise ratio-matched aDDM model yielded a significantly lower mean reward compared to that of the optimal model (t(76) = 3.01, p = 0.0027).”

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Source data 1. Human behavioral data readme.

elife-63436-data1.zip^{(481B, zip)}

Source data 2. Human behavioral data.

elife-63436-data2.zip^{(128KB, zip)}

Transparent reporting form

elife-63436-transrepform.docx^{(178.4KB, docx)}

Data Availability Statement

[bib1] Acerbi L, Vijayakumar S, Wolpert DM. On the origins of suboptimality in human probabilistic inference. PLOS Computational Biology. 2014;10:e1003661. doi: 10.1371/journal.pcbi.1003661. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] Armel KC, Beaumel A, Rangel A. Biasing simple choices by manipulating relative visual attention. Judgment and Decision Making. 2008;3:396–403. [Google Scholar]

[bib3] Averbeck BB, Latham PE, Pouget A. Neural correlations, population coding and computation. Nature Reviews Neuroscience. 2006;7:358–366. doi: 10.1038/nrn1888. [DOI] [PubMed] [Google Scholar]

[bib4] Ba JL, Mnih V, Kavukcuoglu K. Multiple object recognition with visual attention. International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.2015. [Google Scholar]

[bib5] Bahdanau D, Cho KH, Bengio Y. Neural machine translation by jointly learning to align and translate. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.2015. [Google Scholar]

[bib6] Bellman R. On the theory of dynamic programming. PNAS. 1952;38:716–719. doi: 10.1073/pnas.38.8.716. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] Bergstra J, Bengio Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research. 2012;13:281–305. doi: 10.5555/2188385.2188395. [DOI] [Google Scholar]

[bib8] Bertsekas DP. Athena scientific. In: Floudas C, Pardalos P, editors. Dynamic Programming and Optimal Control. Springer; 1995. pp. 8–36. [DOI] [Google Scholar]

[bib9] Bogacz R, Brown E, Moehlis J, Holmes P, Cohen JD. The physics of optimal decision making: a formal analysis of models of performance in two-alternative forced-choice tasks. Psychological Review. 2006;113:700–765. doi: 10.1037/0033-295X.113.4.700. [DOI] [PubMed] [Google Scholar]

[bib10] Bonferroni CE. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni Del R Istituto Superiore Di Scienze Economiche E Commerciali Di Firenze.1936. [Google Scholar]

[bib11] Brockwell AE, Kadane JB. A gridding method for bayesian sequential decision problems. Journal of Computational and Graphical Statistics. 2003;12:566–584. doi: 10.1198/1061860032274. [DOI] [Google Scholar]

[bib12] Buhusi CV, Meck WH. What Makes Us Tick? Functional and Neural Mechanisms of Interval Timing. Hachette; 2005. [DOI] [PubMed] [Google Scholar]

[bib13] Callaway F, Rangel A, Griffiths TL. Fixation patterns in simple choice are consistent with optimal use of cognitive resources. PsyArXiv. 2020 doi: 10.31234/osf.io/57v6k. [DOI]

[bib14] Cassey TC, Evens DR, Bogacz R, Marshall JA, Ludwig CJ. Adaptive sampling of information in perceptual decision-making. PLOS ONE. 2013;8:e78993. doi: 10.1371/journal.pone.0078993. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] Cavanagh JF, Wiecki TV, Kochar A, Frank MJ. Eye tracking and pupillometry are indicators of dissociable latent decision processes. Journal of Experimental Psychology: General. 2014;143:1476–1488. doi: 10.1037/a0035813. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] Chukoskie L, Snider J, Mozer MC, Krauzlis RJ, Sejnowski TJ. Learning where to look for a hidden target. PNAS. 2013;110 Suppl 2:10438–10445. doi: 10.1073/pnas.1301216110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib17] Cohen MR, Maunsell JH. Attention improves performance primarily by reducing interneuronal correlations. Nature Neuroscience. 2009;12:1594–1600. doi: 10.1038/nn.2439. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib18] Cohen MR, Maunsell JH. A neuronal population measure of attention predicts behavioral performance on individual trials. Journal of Neuroscience. 2010;30:15241–15253. doi: 10.1523/JNEUROSCI.2171-10.2010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] Corbetta M, Shulman GL. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience. 2002;3:201–215. doi: 10.1038/nrn755. [DOI] [PubMed] [Google Scholar]

[bib20] Drugowitsch J, Moreno-Bote R, Churchland AK, Shadlen MN, Pouget A. The cost of accumulating evidence in perceptual decision making. Journal of Neuroscience. 2012;32:3612–3628. doi: 10.1523/JNEUROSCI.4010-11.2012. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib21] Drugowitsch J, Moreno-Bote R, Pouget A. Optimal decision-making with time-varying evidence reliability. Advances in Neural Information Processing Systems.2014. [Google Scholar]

[bib22] Drugowitsch J, Wyart V, Devauchelle AD, Koechlin E. Computational precision of mental inference as critical source of human choice suboptimality. Neuron. 2016;92:1398–1411. doi: 10.1016/j.neuron.2016.11.005. [DOI] [PubMed] [Google Scholar]

[bib23] Fudenberg D, Strack P, Strzalecki T. Speed, accuracy, and the optimal timing of choices. American Economic Review. 2018;108:3651–3684. doi: 10.1257/aer.20150742. [DOI] [Google Scholar]

[bib24] Gehring J, Auli M, Grangier D, Yarats D, Dauphin YN. Convolutional sequence to sequence learning. 34th International Conference on Machine Learning ICML.2017. [Google Scholar]

[bib25] Geisler WS, Cormack LK. Models of Overt Attention. In: Everling S, Liversedge S, Gilchrist I, editors. The Oxford Handbook of Eye Movements. Oxford University Press; 2012. pp. 1–2. [DOI] [Google Scholar]

[bib26] Gluth S, Kern N, Kortmann M, Vitali CL. Value-based attention but not divisive normalization influences decisions with multiple alternatives. Nature Human Behaviour. 2020;4:634–645. doi: 10.1038/s41562-020-0822-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib27] Hayhoe M, Ballard D. Eye movements in natural behavior. Trends in Cognitive Sciences. 2005;9:188–194. doi: 10.1016/j.tics.2005.02.009. [DOI] [PubMed] [Google Scholar]

[bib28] Hébert B, Woodford M. Rational Inattention When Decisions Take Time. Nber Working Paper Series; 2019. [Google Scholar]

[bib29] Hoppe D, Rothkopf CA. Learning rational temporal eye movement strategies. PNAS. 2016;113:8332–8337. doi: 10.1073/pnas.1601305113. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib30] Itti L, Koch C. Computational modelling of visual attention. Nature Reviews Neuroscience. 2001;2:194–203. doi: 10.1038/35058500. [DOI] [PubMed] [Google Scholar]

[bib31] Jang AI. DrugowitschLab/Optimal-policy-attention-modulated-decisions: Code as used in manuscript. v1.0Zenodo. 2021 doi: 10.5281/zenodo.4636831. [DOI]

[bib32] Ke TT, Shen ZJM, Villas-Boas JM. Search for information on multiple products. Management Science. 2016;62:e2316. doi: 10.1287/mnsc.2015.2316. [DOI] [Google Scholar]

[bib33] Khaw MW, Glimcher PW, Louie K. Normalized value coding explains dynamic adaptation in the human valuation process. PNAS. 2017;114:12696–12701. doi: 10.1073/pnas.1715293114. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib34] Krajbich I, Armel C, Rangel A. Visual fixations and the computation and comparison of value in simple choice. Nature Neuroscience. 2010;13:1292–1298. doi: 10.1038/nn.2635. [DOI] [PubMed] [Google Scholar]

[bib35] Krajbich I, Rangel A. Multialternative drift-diffusion model predicts the relationship between visual fixations and choice in value-based decisions. PNAS. 2011;108:13852–13857. doi: 10.1073/pnas.1101328108. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib36] Kustov AA, Robinson DL. Shared neural control of attentional shifts and eye movements. Nature. 1996;384:74–77. doi: 10.1038/384074a0. [DOI] [PubMed] [Google Scholar]

[bib37] Li SZ, Ma WJ. Cognitive computational neuroscience. Valuation as Inference: A New Model for the Effects of Fixation on Choice.2019. [Google Scholar]

[bib38] Li Z, Ma W-J. An uncertainty-based model of the effects of fixation on choice. PsyArXiv. 2020 doi: 10.31234/osf.io/ajmwx. [DOI] [PMC free article] [PubMed]

[bib39] Milosavljevic M, Malmaud J, Huth A, Koch C, Rangel A. The drift diffusion model can account for the accuracy and reaction time of Value-Based choices under high and low time pressure. SSRN Electronic Journal. 2010;11:1901533. doi: 10.2139/ssrn.1901533. [DOI] [Google Scholar]

[bib40] Mitchell JF, Sundberg KA, Reynolds JH. Differential attention-dependent response modulation across cell classes in macaque visual area V4. Neuron. 2007;55:131–141. doi: 10.1016/j.neuron.2007.06.018. [DOI] [PubMed] [Google Scholar]

[bib41] Mitchell JF, Sundberg KA, Reynolds JH. Spatial attention decorrelates intrinsic activity fluctuations in macaque area V4. Neuron. 2009;63:879–888. doi: 10.1016/j.neuron.2009.09.013. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib42] Mnih V, Heess N, Graves A, Kavukcuoglu K. Recurrent models of visual attention. Advances in Neural Information Processing Systems.2014. [Google Scholar]

[bib43] Mohler CW, Wurtz RH. Organization of monkey superior colliculus: intermediate layer cells discharging before eye movements. Journal of Neurophysiology. 1976;39:722–744. doi: 10.1152/jn.1976.39.4.722. [DOI] [PubMed] [Google Scholar]

[bib44] Ni AM, Ruff DA, Alberts JJ, Symmonds J, Cohen MR. Learning and attention reveal a general relationship between population activity and behavior. Science. 2018;359:463–465. doi: 10.1126/science.aao0284. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib45] Posner MI. Orienting of attention. Quarterly Journal of Experimental Psychology. 1980;32:3–25. doi: 10.1080/00335558008248231. [DOI] [PubMed] [Google Scholar]

[bib46] Rangel A, Hare T. Neural computations associated with goal-directed choice. Current Opinion in Neurobiology. 2010;20:262–270. doi: 10.1016/j.conb.2010.03.001. [DOI] [PubMed] [Google Scholar]

[bib47] Ratcliff R, McKoon G. The diffusion decision model: theory and data for two-choice decision tasks. Neural Computation. 2008;20:873–922. doi: 10.1162/neco.2008.12-06-420. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib48] Reynolds JH, Chelazzi L. Attentional modulation of visual processing. Annual Review of Neuroscience. 2004;27:611–647. doi: 10.1146/annurev.neuro.26.041002.131039. [DOI] [PubMed] [Google Scholar]

[bib49] Ruff DA, Ni AM, Cohen MR. Cognition as a window into neuronal population space. Annual Review of Neuroscience. 2018;41:77–97. doi: 10.1146/annurev-neuro-080317-061936. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib50] Sepulveda P, Usher M, Davies N, Benson AA, Ortoleva P, De Martino B. Visual attention modulates the integration of goal-relevant evidence and not value. eLife. 2020;9:e60705. doi: 10.7554/eLife.60705. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib51] Shadlen MN, Shohamy D. Decision making and sequential sampling from memory. Neuron. 2016;90:927–939. doi: 10.1016/j.neuron.2016.04.036. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib52] Shenhav A, Dean Wolf CK, Karmarkar UR. The evil of banality: when choosing between the mundane feels like choosing between the worst. Journal of Experimental Psychology: General. 2018;147:1892–1904. doi: 10.1037/xge0000433. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib53] Shimojo S, Simion C, Shimojo E, Scheier C. Gaze Bias both reflects and influences preference. Nature Neuroscience. 2003;6:1317–1322. doi: 10.1038/nn1150. [DOI] [PubMed] [Google Scholar]

[bib54] Smith SM, Krajbich I. Attention and choice across domains. Journal of Experimental Psychology: General. 2018;147:1810–1826. doi: 10.1037/xge0000482. [DOI] [PubMed] [Google Scholar]

[bib55] Smith SM, Krajbich I. Gaze amplifies value in decision making. Psychological Science. 2019;30:116–128. doi: 10.1177/0956797618810521. [DOI] [PubMed] [Google Scholar]

[bib56] Song M, Wang X, Zhang H, Li J. Proactive information sampling in Value-Based Decision-Making: deciding when and where to saccade. Frontiers in Human Neuroscience. 2019;13:35. doi: 10.3389/fnhum.2019.00035. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib57] Sorokin I, Seleznev A, Pavlov M, Fedorov A, Ignateva A. Deep attention recurrent Q-network. arXiv. 2015 https://arxiv.org/abs/1512.01693

[bib58] Tajima S, Drugowitsch J, Pouget A. Optimal policy for value-based decision-making. Nature Communications. 2016;7:12400. doi: 10.1038/ncomms12400. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib59] Tajima S, Drugowitsch J, Patel N, Pouget A. Optimal policy for multi-alternative decisions. Nature Neuroscience. 2019;22:1503–1511. doi: 10.1038/s41593-019-0453-9. [DOI] [PubMed] [Google Scholar]

[bib60] Tavares G, Perona P, Rangel A. The attentional drift diffusion model of simple perceptual Decision-Making. Frontiers in Neuroscience. 2017;11:468. doi: 10.3389/fnins.2017.00468. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib61] Towal RB, Mormann M, Koch C. Simultaneous modeling of visual saliency and value computation improves predictions of economic choice. PNAS. 2013;110:E3858–E3867. doi: 10.1073/pnas.1304429110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib62] Wang L, Krauzlis RJ. Visual selective attention in mice. Current Biology. 2018;28:676–685. doi: 10.1016/j.cub.2018.01.038. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib63] Wittig JH, Jang AI, Cocjin JB, Inati SK, Zaghloul KA. Attention improves memory by suppressing spiking-neuron activity in the human anterior temporal lobe. Nature Neuroscience. 2018;21:808–810. doi: 10.1038/s41593-018-0148-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib64] Wurtz RH. Neuronal mechanisms of visual stability. Vision Research. 2008;48:2070–2089. doi: 10.1016/j.visres.2008.03.021. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib65] Yang SC, Lengyel M, Wolpert DM. Active sensing in the categorization of visual patterns. eLife. 2016;5:e12215. doi: 10.7554/eLife.12215. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib66] Yu AJ, Dayan P, Cohen JD. Dynamics of attentional selection under conflict: toward a rational bayesian account. Journal of Experimental Psychology. Human Perception and Performance. 2009;35:700–717. doi: 10.1037/a0013553. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Optimal policy for attention-modulated decisions explains human fixation behavior

Anthony I Jang

Ravi Sharma

Jan Drugowitsch

Roles

Abstract

Introduction

Results

An attention-modulated decision-making model

Figure 1. Attention-modulated evidence accumulation.

Figure 2. Navigating the optimal policy space.

Figure 2—figure supplement 1. Changes in the optimal policy space and model behavior with adjustments in free model parameters.

Features of the optimal policy

The optimal policy replicates human behavior

Figure 3. Replication of human behavior by simulated optimal model behavior (Krajbich et al., 2010).

Figure 3—figure supplement 1. Parameter-dependence of psychometric/chronometric curves, and exploration of switch rate rather than switch number for the optimal model.

Figure 3—figure supplement 2. Replicating human perceptual decision-making behavior with the optimal model.

Figure 4. Behavioral predictions from Bayesian value estimation, and further properties of the optimal policy.

Figure 4—figure supplement 1. Effect of item values on attention switch rate and fixation duration across trials for the human data, optimal model, and aDDM.

Figure 4—figure supplement 2. Effect of passed time on switch probability and fixation duration within trials.

Figure 4—figure supplement 3. Additional analyses of fixation behavior and performance between human data, optimal model, and aDDM.

Optimal attention-modulated policy for perceptual decisions

Discussion

Materials and methods

Attention-modulated decision-making model

Model simulations

Statistical analysis

Data and code availability

Acknowledgements

Appendix 1

1 Task setup

1.1 Latent state prior

1.2 Likelihood function of momentary evidence

1.3 An alternative form for the likelihood

1.4 Costs, rewards, and the decision maker’s overall aim

2 Bayes-optimal evidence accumulation

2.1 Deriving the posterior z1 and z2

2.2 The expected reward process

2.3 The expected reward difference process

3 Optimal decision policy

3.1 Single, isolated choice

3.2 Sequence of consecutive choices

3.3 Choosing the less desirable option

4 Optimal decision policy for perceptual decisions

5 Properties of the optimal policy

5.1 Shift-invariance and symmetry of the expected reward process

5.2 Shift-invariance of the value function

5.3 Symmetry of the value function

5.4 Maximum |V1⁢(⋅)-V2⁢(⋅)| difference

5.5 The decision boundaries are parallel to the diagonal r^1=r^2

5.6 Impact of re-scaled costs, rewards, and standard deviations

6 Simulation details

6.1 Computing the optimal policy

6.2 Finding task parameters that best match human behavior

6.3 Simulating decisions with the optimal policy

6.4 Attention diffusion model

Funding Statement

Contributor Information

Funding Information

Additional information

Competing interests

Author contributions

Ethics

Additional files

Data availability

References

Decision letter

Roles

Author response

Author response image 1.

Author response image 2.

Author response image 3.

Author response image 4.

Author response image 5.

Author response image 6.

Author response image 7.

Author response image 8.

Author response image 9.

Author response image 10.

2.1 Deriving the posterior z₁ and z₂

5.4 Maximum $| V_{1} (\cdot) - V_{2} (\cdot) |$ difference

5.5 The decision boundaries are parallel to the diagonal ${\hat{r}}_{1} = {\hat{r}}_{2}$