Proc Natl Acad Sci USA. 2010 Mar 1;107(11):5232–5237. doi: 10.1073/pnas.0911972107

Optimal reward harvesting in complex perceptual environments

Vidhya Navalpakkam, Christof Koch, Antonio Rangel, Pietro Perona
PMCID: PMC2841865  PMID: 20194768

Abstract

The ability to choose rapidly among multiple targets embedded in a complex perceptual environment is key to survival. Targets may differ in their reward value as well as in their low-level perceptual properties (e.g., visual saliency). Previous studies investigated separately the impact of either value or saliency on choice; thus, it is not known how the brain combines these two variables during decision making. We addressed this question with three experiments in which human subjects attempted to maximize their monetary earnings by rapidly choosing items from a brief display. Each display contained several worthless items (distractors) as well as two targets, whose value and saliency were varied systematically. We compared the behavioral data with the predictions of three computational models assuming that (i) subjects seek the most valuable item in the display, (ii) subjects seek the most easily detectable item, and (iii) subjects behave as an ideal Bayesian observer who combines both factors to maximize the expected reward within each trial. Regardless of the type of motor response used to express the choices, we find that decisions are influenced by both value and feature-contrast in a way that is consistent with the ideal Bayesian observer, even when the targets’ feature-contrast is varied unpredictably between trials. This suggests that individuals are able to harvest rewards optimally and dynamically under time pressure while seeking multiple targets embedded in perceptual clutter.

Keywords: decision making, reward, visual saliency, search, multiple targets


Animals and humans often need to make rapid choices among multiple targets embedded in a noisy perceptual environment. Consider, for example, a predator deciding which of several prey to pursue. The more valuable targets might be perceptually less salient, and thus harder to find (e.g., camouflaged prey), while less valuable targets may be perceptually more salient and easier to find. To solve this task, the animal needs to combine the perceptual and value-related information while making choices. This raises a fundamental question: Are rapid choices in cluttered environments dominated by value information (e.g., biased toward seeking the more valuable items) or by perceptual information (e.g., biased toward seeking the more easily and quickly detectable items)?

Understanding how perceptual saliency and value information are combined to make decisions is important for several reasons. From a computational perspective, it is not known how the brain trades off saliency and value under time pressure, especially when they have opposing influences on the decision: Is it optimized for reward harvesting, or is it based on simpler principles of choosing the most valuable or the most easily detectable item? The brain's solution to this tradeoff is not obvious, because items that are more salient are easier to find (1, 2) and, other things being equal, have a higher probability of yielding a reward. From a behavioral perspective, it is not known whether humans take into account saliency-induced variations in probability when making decisions under uncertainty. From a vision science perspective, previous research on visual search has focused on searching for a single target amid clutter (2, 3), and it is not known what happens when subjects search for multiple targets that differ both in saliency and in value. In particular, although saliency is known to affect saccades in a fast, automatic, and bottom-up manner (4), it is not known whether value can have a similar fast effect.

Previous studies have examined the roles of visual saliency and economic or subjective value in isolation. For example, several studies showed that decisions are biased toward the item or location associated with a higher magnitude and probability of reward (5–8), but these studies did not manipulate visual saliency. On the other hand, studies on visual saliency did not manipulate the reward outcome associated with choosing an item. These studies showed that during free viewing of natural scenes and videotapes, saccades are automatically drawn to more salient image regions (e.g., locations with high feature-contrast in luminance, orientation, and motion) (9–13) and that salient targets are detected faster and better amid clutter (1, 2). Thus, whether and how the brain might combine information about visual saliency and value to form rapid decisions has not yet been investigated.

To study this question, we collected data from a number of human subjects who searched visually for two valuable targets amid clutter. The goal of our subjects was to maximize the reward earned by rapidly choosing items from a brief display (Fig. 1A). We systematically varied the relative value and feature-contrast of the targets (a measure of saliency based on the difference in features between the target and distractor) across blocks and studied how our subjects’ behavior changed as a consequence. We tested three possible models of our subjects’ behavior. The first model is motivated by the literature on visual search; it assumes that subjects will attempt to select the most salient or easily detectable target based on visual properties of the targets (1, 14). The second model is motivated by the literature on economics; it assumes that subjects attempt to select the most valuable target based on the economic properties of the targets (5). The third model assumes that the brain dynamically combines information about value and visual saliency to select the location that gives the maximum expected reward.

Fig. 1.

Basic experiment and competing theories. (A) Experiment 1. Stimuli in the display were either “targets” or “distractors.” Subjects earned a reward for fixating a target for at least 100 ms during the display period but did not earn a reward for fixating a distractor. There were two types of targets, horizontal bars (H) and vertical bars (V), and one of each was present in each display. Distractors were diagonal bars whose orientation was varied across blocks to manipulate the “feature-contrast” of the targets (orientation difference between the target and distractors). Subjects expressed their choice by fixating on the chosen item and were asked to try to maximize their total earnings. The experiment consisted of several blocks of 50 trials. Across blocks, we varied the value and feature-contrast of the targets. At the beginning of each block, subjects were informed about the value of targets H and V (e.g., value of H is 20 points, value of V is 10 points), and they received training. (B) We compared the performance of three different computational models. In M1, fixations are deployed to the location that is most likely to contain the target with the maximum feature-contrast for the block. In M2, fixations are deployed to the location that is most likely to contain the target associated with the highest value for the block. In M3, fixations are deployed to the location associated with the maximum expected reward for the trial, as predicted by an ideal Bayesian observer model.

Note that it is not possible to choose a priori, or based on existing data, one of the three models. Model 1 is motivated by the fact that the brain might not have sufficient time to incorporate reward considerations during rapid decisions, which would imply that rapid choices would be driven mostly by the perceptual features of the display. Model 2 is motivated by the fact that because the organism cares only about maximizing the amount of rewards harvested, the brain might have implemented a computational shortcut of always attempting to select the highest value stimuli (e.g., always seek the most valuable prey). Model 3 is the optimal solution to the computational problem faced by the organism, and thus is of particular interest.

Results

Decision Models.

Our models are represented schematically in Fig. 1B. All three models first estimate the location of both targets. This estimate is probabilistic, and it is carried out using the optimal Bayesian estimator. All models assume that noisy estimates of the stimulus feature (e.g., orientation, brightness) are computed at each location $x$. We hypothesize a diverse population of orientation-tuned mechanisms at each location whose output may be converted into a noisy estimate of the stimulus orientation, wherein the noise is approximately Gaussian (15). Formally, let $T_x$ be the stimulus, $\theta(T_x)$ the stimulus feature, and $a_x$ the estimate of the stimulus feature at location $x$. Thus, $a_x \mid T_x \sim G(a_x; \mu = \theta(T_x), \sigma)$, where $\sigma$ is the noise parameter and $G(\cdot; \mu, \sigma)$ denotes the univariate Gaussian probability density function with mean $\mu$ and SD $\sigma$. All three models involve a single free parameter, $\sigma$. We denote the resulting vector of estimates at the eight locations in the display as $\mathbf{a} = (a_1, \ldots, a_8)$.
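As a concrete illustration of this observation model, the following minimal Python sketch (ours; the paper publishes no code) draws the noisy estimates $a_x$ for a single display of experiment 1. The names are illustrative, and the wrapping into the 180° orientation space anticipates the wrapped normal distribution used in the Methods:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def noisy_estimate(theta_true, sigma, rng):
    """Draw a_x ~ G(a_x; mu = theta(T_x), sigma): the true orientation
    corrupted by Gaussian noise, wrapped into the 180-degree space of
    bar orientations."""
    return (theta_true + rng.normal(0.0, sigma)) % 180.0

# One display: targets H (0 deg) and V (90 deg) among six identical
# distractors (here theta_D = 30 deg), in shuffled positions.
thetas = np.array([0.0, 90.0] + [30.0] * 6)
rng.shuffle(thetas)
a = np.array([noisy_estimate(t, sigma=20.0, rng=rng) for t in thetas])
print(a)  # the vector of estimates a = (a_1, ..., a_8)
```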

All models require the computation of posterior probabilities. The posterior probability of stimulus $T_x$ occurring at location $x$ is denoted by $P(T_x \mid \mathbf{a})$ and is computed from the likelihood $P(\mathbf{a} \mid T_x)$ and a prior probability $P(T_x)$ using Bayes' theorem. If the display consists of $n$ stimuli (two targets, H and V, as well as $n - 2$ distractors, D), and if the probability that any stimulus occupies any position is the same, then we have that

$$P(T_x \mid \mathbf{a}) = \frac{P(\mathbf{a} \mid T_x)\,P(T_x)}{\sum_{T_x'} P(\mathbf{a} \mid T_x')\,P(T_x')} \qquad [1]$$

$$P(T_x = H) = P(T_x = V) = \frac{1}{n}, \qquad P(T_x = D) = \frac{n-2}{n} \qquad [2]$$

The likelihood term can be further expanded to make use of the fact that exactly one H and one V appear in the display. Thus, the occurrence of H at location $x$ implies that V must occur at some other location $y \neq x$ and the distractors must appear at the other locations $z \neq x, y$. It follows that

$$P(\mathbf{a} \mid T_x = H) = P(a_x \mid H)\,\frac{1}{n-1} \sum_{y \neq x} P(a_y \mid V) \prod_{z \neq x,y} P(a_z \mid D) \qquad [3]$$

$$P(\mathbf{a} \mid T_x = V) = P(a_x \mid V)\,\frac{1}{n-1} \sum_{y \neq x} P(a_y \mid H) \prod_{z \neq x,y} P(a_z \mid D) \qquad [4]$$

$$P(\mathbf{a} \mid T_x = D) = P(a_x \mid D)\,\frac{1}{(n-1)(n-2)} \sum_{y \neq x} \sum_{z \neq x,y} P(a_y \mid H)\,P(a_z \mid V) \prod_{w \neq x,y,z} P(a_w \mid D) \qquad [5]$$

A detailed derivation of these equations is presented in SI Text. The terms on the left denote the global likelihood of an item's presence (target/distractor) at location $x$ based on the sensory observations at all locations. In contrast, the terms on the right refer to the local likelihoods of an item's presence at a location based on the sensory observation at that single location, $P(a_x \mid T_x)$. Substituting Eqs. 3–5 into Eq. 1, we obtain the posterior probability of each object's presence at each location.
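The computation in Eqs. 1–5 can be sketched in a few lines of Python. This is our illustrative reading of the equations, not the authors' code; names such as `posteriors` and the wrapped-normal approximation (which follows the Methods' treatment of orientation as circular) are ours:

```python
import numpy as np

def wrapped_normal_pdf(a, mu, sigma, period=180.0, n_wraps=3):
    """Approximate wrapped-normal density on circular orientation space
    by summing a few wraps of a Gaussian."""
    k = np.arange(-n_wraps, n_wraps + 1)
    z = (a - mu + k * period) ** 2
    return np.exp(-z / (2.0 * sigma**2)).sum() / (np.sqrt(2.0 * np.pi) * sigma)

def posteriors(a, theta, sigma):
    """P(T_x = H/V/D | a) for every location x (Eqs. 1-5).

    a     : noisy orientation estimates, one per location
    theta : dict mapping 'H', 'V', 'D' to true orientations
    Returns an (n, 3) array with columns for H, V, D.
    """
    n = len(a)
    # Local likelihoods P(a_x | T_x) -- the right-hand terms of Eqs. 3-5.
    L = {t: np.array([wrapped_normal_pdf(ax, theta[t], sigma) for ax in a])
         for t in "HVD"}
    lik = np.zeros((n, 3))
    for x in range(n):
        others = [y for y in range(n) if y != x]
        # Eq. 3: H at x, V at some y != x, distractors elsewhere.
        lik[x, 0] = L["H"][x] / (n - 1) * sum(
            L["V"][y] * np.prod([L["D"][z] for z in others if z != y])
            for y in others)
        # Eq. 4: same, with the roles of H and V exchanged.
        lik[x, 1] = L["V"][x] / (n - 1) * sum(
            L["H"][y] * np.prod([L["D"][z] for z in others if z != y])
            for y in others)
        # Eq. 5: D at x, H and V at two distinct other locations.
        lik[x, 2] = L["D"][x] / ((n - 1) * (n - 2)) * sum(
            L["H"][y] * L["V"][z]
            * np.prod([L["D"][w] for w in others if w not in (y, z)])
            for y in others for z in others if z != y)
    prior = np.array([1.0, 1.0, n - 2.0]) / n        # Eq. 2
    post = lik * prior                               # Bayes numerator (Eq. 1)
    return post / post.sum(axis=1, keepdims=True)    # normalize per location
```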

The first model (M1) assumes that the decision is dominated by visual properties like feature-contrast (16, 17). In this model, subjects use their prior knowledge of which of the two targets (H or V) is more salient (learned during the 10 training trials preceding each block) and then search for that target when the display appears. Suppose, for instance, that the more salient target is V; according to this model, subjects will choose the location $x$ where $P(T_x = V \mid \mathbf{a})$ is maximal.

The second model (M2) assumes that the decision is dominated by economic properties of the targets (5), such as value. Here, subjects determine in advance (from the 10 training trials) which of the two targets (H or V) has the higher payoff and then search for that target when the display appears. For example, consider the condition in Fig. 1A in which fixating on the horizontal bar pays 20 points, fixating on the vertical bar pays 10 points, and fixating on the distractors pays nothing. In this case, the model assumes that subjects will search for the horizontal bar and will find it with a higher probability if it is more salient and with a lower probability otherwise. More formally, if the most valuable target is H, then subjects, according to this model, will choose the location $x$ where $P(T_x = H \mid \mathbf{a})$ is maximal.

The third model (M3) has not been previously considered in the search literature. It assumes that the subject computes the expected reward associated with choosing each location, using optimal Bayesian inference, and then chooses the location with the highest expected reward. We refer to this model as the reward maximizer because it optimizes the expected reward trial-by-trial, given the noise in the system. Note that, unlike M1 and M2, M3 predicts that subjects will not search for a fixed target (e.g., horizontal); instead, they will select dynamically, image-by-image, the location of maximum expected reward. More formally, the model assumes that subjects compute the expected reward $E[R_x]$ at every location $x$ and then choose the location associated with the highest expected reward. The expected reward at a location is given by

$$E[R_x] = \sum_{i \in \{H, V, D\}} v_i\,P(T_x = i \mid \mathbf{a}) \qquad [6]$$

where $v_i$ denotes the value associated with item $i$ (with $v_D = 0$ for distractors) and $P(T_x = i \mid \mathbf{a})$ is computed as in Eq. 1.
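Given these posteriors, the three decision rules differ only in the quantity they maximize. The following minimal sketch builds on the hypothetical `posteriors` helper above; the function and argument names are ours, not the authors':

```python
import numpy as np

def choose_location(a, theta, sigma, v_H, v_V, model):
    """Chosen location index under M1, M2, or M3 (a sketch)."""
    post = posteriors(a, theta, sigma)  # columns: H, V, D
    if model == "M1":
        # Seek the target with the higher feature-contrast for the block
        # (a tie would be broken at random, per the ambiguous conditions).
        c_H, c_V = abs(theta["H"] - theta["D"]), abs(theta["V"] - theta["D"])
        target = 0 if c_H >= c_V else 1
        return int(np.argmax(post[:, target]))
    if model == "M2":
        # Seek the target with the higher value for the block.
        target = 0 if v_H >= v_V else 1
        return int(np.argmax(post[:, target]))
    # M3: ideal observer -- choose the location of maximum expected
    # reward (Eq. 6); distractors are worthless, so v_D drops out.
    expected_reward = v_H * post[:, 0] + v_V * post[:, 1]
    return int(np.argmax(expected_reward))
```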

Experiment 1.

This experiment investigates how the brain combines feature-contrast and value information about objects when subjects express their choice by moving their eyes. The feature-contrast and value of the targets were manipulated independently. As illustrated in Fig. 1A and described in Methods, the subjects’ task was to harvest the maximum possible amount of monetary rewards by fixating items in the display.

Fig. 2 A–D displays the performance data for one subject across different relative value conditions (Figs. S1–S5 provide the data from the remaining subjects). The figure illustrates two patterns that were present in all the subjects. First, the probability of fixating on H increases with its relative feature-contrast for all values of the targets. Second, the probability of fixating on H also increases with its relative value.

Fig. 2.

Results for experiment 1. (A–D) Each panel shows the data for subject 1 in a different value condition as a function of the ratio of feature-contrast of the targets. Each dot refers to the frequency of fixations on target H under the particular value and feature-contrast parameters; the error bar denotes the SE. Each panel also shows the predictions made by the three models (model parameters were estimated using data in C and were used to generate predictions in A, B, and D) and the χ2 goodness-of-fit statistic for each of the models. In all cases, the ideal observer model (M3) accounts best for the data. (E) Plot of each model's predictions for all 28 experimental conditions vs. a subject's performance. Each dot represents the frequency of fixations on target H in a different value and feature-contrast condition, as observed in the subject's data and as predicted by the model. (F) Bayesian model comparison. In all six subjects, M1 (light gray) and M2 (dark gray) have a lower likelihood of generating the data than the ideal observer model (M3).

For each subject and model, we determined the maximum-likelihood estimate of the internal noise parameter (σ) from the data in the high-value condition (in which target H is twice as valuable as V; Fig. 2C) and used it to predict behavior in three other value conditions (Fig. 2 A, B, and D and Methods provide details). Fig. 2 A–D compares the predictions of the models with subjects’ data. Note a few things about the predictions made by the models. First, all of them predict that the frequency of fixations to H should increase with its feature-contrast. Second, model M1 predicts that fixations are independent of the values (M1 curves are identical across Fig. 2 A–D), although the data appear to shift gradually from right (Fig. 2A) to left (Fig. 2D). Third, model M2 predicts that the less valuable target will almost never be fixated, even if it has higher feature-contrast; this prediction is clearly violated (Fig. 2A, notice that the M2 curve is constant at zero over the whole feature-contrast range). Fourth, model M3 not only performs better than models M1 and M2, but, more importantly, it predicts the data well, as shown by a quantitative comparison based on a χ2 measure of goodness of fit as well as a Bayesian model comparison (18) of the log likelihood of the data given each model (Fig. 2F). This last point is further explored in Fig. 2E, which shows that subjects’ fixations correlate well with the predictions of M3 (R2 = 0.97). We can conclude that subjects’ behavior is highly consistent with the reward maximizer model M3 and that both value and feature-contrast affect the saccadic decision.

Experiment 2.

To test whether the previous findings are robust to the presence of other types of low-level features, we repeated experiment 1 using brightness, rather than orientation, as the feature differentiating the targets from the distractors (Fig. 3 and Methods provide details). We obtained data in 21 conditions (7 feature-contrast × 3 value conditions). Fig. 3 D–F displays the data for one of the subjects (Figs. S6 and S7 provide the data for the remaining subjects). Note that the key findings from the first experiment also hold here: First, the saccadic decision is affected by both value and feature-contrast, and, second, M3 (the reward maximizer) provides a quantitative account of the subjects’ data.

Fig. 3.

Results for experiment 2. (A) Task is identical to experiment 1 except that the feature-contrast variable that was manipulated was the brightness of the targets relative to the distractors. (B) Correlations. (C) Bayesian model comparison. In all three subjects, M3 had a higher likelihood of explaining the data than M1 or M2. (D–F) Data and model fits for subject 1.

Experiment 3.

The findings from experiments 1 and 2 show that subjects can optimize the reward when the targets’ values and feature-contrast are fixed within a block. One hypothesis for why subjects perform so well is that they may learn the optimal strategy during the training period (10 trials preceding each experimental block) and then deploy this learned strategy in the remaining trials in that block. This raises an important question: Can subjects optimize reward dynamically in the absence of training or learning? To test this, we designed a third experiment in which we varied the targets’ feature-contrast unpredictably between trials as opposed to the blocked condition in experiments 1 and 2.

Another question is whether the findings of experiments 1 and 2 reflect properties specific to the saccadic system or a general property of decision making, regardless of the type of motor response (e.g., key press, saccade, verbal report) used to express the choices. To test this, we had subjects report their decision through a key press in half of the trials and through a saccade in the other half. We made one key modification to the display-response paradigm: Subjects maintained central fixation while viewing the display for 300 ms. The display was followed by a mask containing a masking stimulus at each location as well as a number from 1 to 8 identifying each location uniquely. Subjects were instructed to report their decision as soon as possible by saccading to the chosen location or by typing a number from 1 to 8 on a computer keyboard to indicate the chosen location (maximum response time was 2 s).

The results across nine conditions (3 value × 3 feature-contrast conditions) are shown in Fig. 4. Fig. 4 B and C shows that decisions, whether expressed through a key press with a finger or through a saccadic eye movement, are influenced by both value and feature-contrast. Comparing Fig. 4 B and C does not reveal any significant difference (pairwise t tests, 0.05 significance level) between decisions expressed through a key press and those expressed through a saccade. The surprising finding is that despite trial-by-trial variations in feature-contrast, subjects’ performance is still optimal, both for decisions expressed through a key press (Fig. 4 D and E) and for decisions expressed through a saccade (Fig. 4 F and G). This rules out constant decision strategies that might result from ample practice or overtraining and shows that subjects optimize reward on a trial-by-trial basis, flexibly choosing different targets between trials rather than statically choosing the same target on all trials in a block. Thus, Fig. 4 B–G shows clearly that decisions are influenced optimally by both value and feature-contrast, regardless of the type of motor response (key press vs. saccade).

Fig. 4.

Results for experiment 3. (A) Task is similar to that in experiment 1 except that subjects maintained central fixation for 300 ms while viewing the display, following which they expressed their decision either through a key press (in half of the trials) or a saccade (in the other half) to indicate the chosen location. In addition, the targets’ feature-contrast was varied unpredictably between trials, as opposed to experiments 1 and 2, where it was fixed within the block. (B and C) Panels show how the decisions made by the average subject vary with the targets’ feature-contrast and values when choices were reported through a key press (B) or through a saccade (C). Error bars reflect the SEM detection rate across subjects. (D and E) Model predictions vs. data for decisions expressed through a key press. Comparisons between the three models show that the predictions of the ideal observer model (M3) provide the best fit and correlate best with the data. (F and G) Models are similar to those in D and E except that they represent decisions expressed through a saccade.

Discussion

Although objects in the real world usually differ in both visual saliency and economic value, previous studies on saccades and decisions among multiple objects have mostly examined the effect of saliency (9–13) and value (5–8) in isolation. For example, Platt and Glimcher (5) showed that saccadic decisions are biased toward the item associated with the higher expected reward, but they did not vary the feature-contrast or saliency of items. Similarly, Berg et al. (13) studied how visual saliency affects saccades but did not consider the role of stimulus value or reward outcome associated with the saccade. In contrast, we explored the computational mechanisms by which the brain combines visual saliency and value information to make rapid decisions. Our experimental results show that decisions are affected by both variables in a manner that leads to the maximization of the expected reward within each trial and are consistent with the predictions of an ideal Bayesian observer. We found that such reward maximization behavior is robust across multiple visual features, such as orientation and intensity, and does not depend on the type of motor response used to express the choices (key press vs. saccade). Furthermore, behavior is near-optimal even when the targets’ feature-contrast is varied unpredictably between trials, which suggests that subjects deploy a dynamic decision strategy.

An important open question is where in the brain information about value and feature-contrast is combined. One hypothesis is that value integration occurs as early as V1, possibly in the form of a greater top-down attentional bias on the more valuable target (19, 20). A second hypothesis is that value integration occurs later, in the lateral intraparietal area (LIP), an area that has been shown to integrate multiple sources of information, such as attention (21, 22), saliency (23), expected reward (5, 24, 25), and saccade selection (26). Given our results, it would be worth investigating whether the LIP encodes a reward-modulated saliency map that implements the computations of the Bayesian observer.

Our findings add to the growing literature on the Bayesian optimality of the motor and perceptual systems (27, 28). In our study, both the feature-contrast and the value of the stimuli were manipulated simultaneously. We found that learning was fast in experiments 1 and 2; after a change in either the feature-contrast or the value of the stimuli, subjects learned within 10 training trials to deploy the optimal strategy of saccading to the location offering the maximum expected reward. Other studies on rapid motor tasks and target detection tasks have also reported very fast learning (29, 30). These findings suggest that humans are capable of performing rapid perceptual and motor decisions optimally in both reaching tasks and saccadic/key press tasks.

The results of the experiments are also consistent with the existence of a general, motor-independent, decision-making process in which the decision is formed first and is later expressed through any motor response (e.g., saccade, key press). Future studies are required to investigate whether reward maximization behavior applies only to the final decision made to acquire explicit rewards or whether it extends to intermediate actions like saccades that are used to gather visual information about the display in the absence of direct reward outcomes. This study focused on feature-contrast of simple stimuli (oriented bars, disks of varying brightness). Additional studies are required to test whether the current findings extend to more complex stimuli (e.g., faces, cars) embedded in natural scenes (14).

Finally, our findings have implications for visual search tasks that involve multiple targets. Unlike most previous studies focusing on the search for a single target (3) and on the role of visual properties of the display [e.g., number of items in the display (1, 31), distractor heterogeneity (32), target-distractor similarity (32)], we ask how value and visual information combine to influence search in the presence of multiple valuable targets. Our findings show that instead of searching for a single target that is most valuable or most salient, humans go to the location of the maximum expected reward per trial, and thus perform optimal reward harvesting.

Methods

Experiment 1.

Subjects.

Six subjects (one author, five naive California Institute of Technology students) participated in the experiment after providing informed consent.

Task.

On every trial, subjects saw a display consisting of eight oriented bars (Fig. 1A). The display included two valuable “targets”—a horizontal bar H (orientation, θH = 0°; value, vH) and a vertical bar V (orientation, θV = 90°; value, vV)—embedded among six identical worthless “distractors” D (orientation, θD; value, vD = 0). The experiment was divided into blocks of 50 trials during which the value and orientation parameters were kept constant. At the beginning of each block, subjects saw a screen with a picture of each target and its value. When they were ready, they pressed a key to start the block. They received 10 practice trials. To earn the reward associated with a target, subjects had to find it in the display and fixate it for at least 100 ms. Subjects were instructed to execute the fixations that maximized the reward earned during the 500 ms of stimulus presentation. Subjects were allowed to execute multiple fixations in a trial. However, because they mostly fixated on one stimulus (Fig. S8), we analyze only the location of the first fixation in the rest of the paper.

The locations of the targets and the distractors were randomized across eight fixed positions equally spaced on a circle at 7° eccentricity, with the only constraint being that the two targets could not appear next to each other. The stimuli were 0.3° × 1.8° in size, and the interitem spacing was 5.4°. Subjects viewed the display on a 21-inch cathode ray tube (CRT) monitor (28° × 21°) from a distance of 85 cm. We used an Eyelink 1000 eye tracker (SR Research) to record subjects’ eye movements (approximate accuracy of 0.5°). We calibrated the eye tracker (nine-point calibration) at the beginning of each session and whenever the subject moved between blocks.

We denote the feature-contrast of target H (V) as cH = |θH − θD| (cV = |θV − θD|), which measures the orientation difference between the target and the distractors. The experiment consisted of seven feature-contrast conditions. In each condition, we set a different level of target feature-contrast by changing the distractor orientation: θD ∈ {5°, 15°, 30°, 45°, 60°, 75°, 85°}. For example, the feature-contrast of the horizontal target H is low when the distractors are nearly horizontal (θD = 5°), and it increases as the distractors become steeper, reaching its maximum when the distractors are steepest (θD = 85°). The opposite is true for the feature-contrast of the vertical target V. Intuitively, targets are difficult to find when they have low feature-contrast and easy to find when they have high feature-contrast.

To control for potential effects of arousal, the sum of the values of the two targets was constant across all experimental conditions (vH + vV = 30 points). We varied the ratio of the targets’ values, vH/vV ∈ {0.5, 1, 2, 4}, across macroblocks, and each macroblock consisted of seven blocks in which we varied the targets’ feature-contrast. Thus, we had a total of 28 experimental conditions (7 feature-contrast × 4 value conditions). The order of the macroblocks, and of the blocks within each macroblock, was randomized for each subject.

Subjects were initially trained in the condition with equal feature-contrast (θD = 45°; hence, cH = cV), with one of the targets twice as valuable as the other (e.g., vH/vV = 2), until that target was fixated in at least 60% of trials. This procedure was repeated for each target. The experiment began after this initial training. Before the start of the equally valued condition (vH = vV) in each of the seven feature-contrast conditions, to undo any bias from previous conditions, subjects received 10 training trials in which the value of target H alternated between twice and half that of target V. Before the start of each of the remaining blocks, subjects received 10 training trials in that block’s condition.

Psychometric Curves.

We analyzed the pattern of fixations by measuring the fraction of first fixations on target H as a function of its feature-contrast and value. Psychometric curves for targets H and V and for distractor D are shown in Fig. S9.

Model Fits.

Models M1 through M3 (described in Results) are characterized by a single parameter: the noise in the sensory representation, σ. We simulated each model in the 28 experimental conditions (i ∈ {1, 2, …, 28}) under different noise levels (σ ∈ {10, 11, …, 60}°; we used a wrapped normal distribution for the orientation representation because orientation lives in a circular space). For each model M, condition i, and noise level σ, we simulated 10,000 trials and determined the model’s predicted fraction of fixations on H, $p_i$. Let the subject’s data in condition i be denoted by $D_i$, comprising $n_i$ fixations on H in a total of N trials; $n_i$ is then a binomial random variable, so the probability of the subject’s data may be expressed as follows:

$$P(D_i \mid \sigma, M) = C(N, n_i)\,p_i^{\,n_i}\,(1 - p_i)^{N - n_i} \qquad [7]$$

where $C(N, n_i)$ is the number of ways of choosing $n_i$ out of N trials. We determined the estimate of σ that maximized the likelihood $P(D_i \mid \sigma, M)$ of a subset of the subject’s data (a different estimate for each subject) under the high-value condition (i.e., the seven data points in Fig. 2C). We then used this maximum-likelihood estimate of σ to predict performance in the remaining experimental conditions. We determined the χ2 goodness-of-fit statistic by comparing each subject’s initial fixations with those predicted by the model. Results are depicted in Fig. 2 A–D.
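As a sketch of this fitting procedure (our reconstruction, not the authors’ code), the `predict` argument below stands for the Monte Carlo step that simulates a model for 10,000 trials in a condition and returns the predicted fraction $p_i$:

```python
from math import comb, log

def log_lik_condition(n_i, N, p_i):
    """log P(D_i | sigma, M): binomial likelihood of n_i first fixations
    on H out of N trials when the model predicts fraction p_i (Eq. 7)."""
    p_i = min(max(p_i, 1e-9), 1.0 - 1e-9)  # guard the logarithms
    return log(comb(N, n_i)) + n_i * log(p_i) + (N - n_i) * log(1.0 - p_i)

def fit_sigma(data, predict, sigmas=range(10, 61)):
    """Grid-search MLE of the noise parameter sigma over the fitted
    subset of conditions (the high-value block in experiment 1).

    data    : iterable of (condition, n_i, N) tuples
    predict : function (condition, sigma) -> predicted fraction p_i
    """
    data = list(data)
    def total_log_lik(sigma):
        return sum(log_lik_condition(n_i, N, predict(cond, sigma))
                   for cond, n_i, N in data)
    return max(sigmas, key=total_log_lik)
```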

Although the models were fitted to the data in Fig. 2C, the best fits of M1 and M2 are poor in comparison to M3. M2 always looks for the more valuable target H, regardless of its feature-contrast; thus, it overpredicts the percentage of fixations when H has very low feature-contrast. M1 always looks for the more salient target and hence underpredicts the percentage of fixations when H is less salient but more valuable. M3, in contrast, combines value and feature-contrast information and flexibly chooses target H or V, whichever has the higher expected reward, and thus predicts the data well.

In ambiguous conditions when both targets are equally valuable (or equally salient), M2 (or M1) chooses either target with equal probability. Thus, in Fig. 2B, when both targets are equally valuable, M2 looks for target H on 50% of trials and for target V on the remaining trials; hence, it predicts fewer fixations on H than seen in the data.

Bayesian Model Comparison.

We compared the performance of models M1 through M3 by comparing their relative likelihood of generating the data, given by

$$P(D \mid M) = \sum_{\sigma} P(D \mid \sigma, M)\,P(\sigma),$$

where $P(\sigma)$ is a uniform distribution over the range $\{10, 11, \ldots, 60\}°$ and $P(D \mid \sigma, M)$ is computed using Eq. 7. Fig. 2F plots the log-likelihood ratio, $\log\left[P(D \mid M)/P(D \mid M_3)\right]$.
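In code, this marginalization is a sum over the σ grid under the uniform prior, computed stably in log space; a sketch reusing the hypothetical `log_lik_condition` helper above:

```python
import numpy as np

def log_marginal_lik(data, predict, sigmas=range(10, 61)):
    """log P(D | M): data likelihood averaged over the uniform prior
    P(sigma) on the grid of noise levels."""
    sigmas = list(sigmas)
    log_liks = np.array([
        sum(log_lik_condition(n_i, N, predict(cond, s))
            for cond, n_i, N in data)
        for s in sigmas
    ])
    m = log_liks.max()  # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_liks - m).sum()) - np.log(len(sigmas))

# A Fig. 2F-style comparison would plot, for M1 and M2,
# log_marginal_lik(data, predict_M) - log_marginal_lik(data, predict_M3).
```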

Experiment 2.

Subjects.

Three naive subjects (all of whom also performed experiment 1) participated in the experiment after providing informed consent.

Task.

Because the experiment is similar to the previous one, only the key differences are discussed. The objects were circular disks (1.8° diameter) at an eccentricity of 7° whose intensity varied from low to high. The display consisted of two targets—H (high intensity, IH = 80 cd/m2) and L (low intensity, IL = 26 cd/m2)—embedded among six identically bright distractor disks (D) on a dark background (0.3 cd/m2). We systematically varied the brightness of the distractors within the interval bounded below by IL and above by IH. We drew 13 samples from this interval, equally spaced on a log scale; we used the extreme samples 1 and 13 as targets L and H and the intermediate samples {3, 5, 6, 7, 8, 9, 11} as the distractors. Across seven blocks, we varied the intensity of the distractors, and within each subblock, we varied the values of the targets, vH/vL ∈ {1, 2, 4}, yielding a total of 7 × 3 = 21 experimental conditions of 50 trials each.
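For concreteness, the intensity ladder can be generated as follows (a sketch; the text specifies only the two endpoint intensities and the log-spacing rule, so the intermediate values are implied rather than reported):

```python
import numpy as np

# 13 intensities equally spaced on a log scale between the low- and
# high-intensity targets (in cd/m^2); array indices below are 0-based.
I_L, I_H = 26.0, 80.0
samples = np.logspace(np.log10(I_L), np.log10(I_H), num=13)
target_L, target_H = samples[0], samples[12]    # samples 1 and 13
distractors = samples[[2, 4, 5, 6, 7, 8, 10]]   # samples 3,5,6,7,8,9,11
```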

Data Analysis and Model Fits.

Data analysis and model fits were identical to those in experiment 1.

Experiment 3.

Subjects.

Five naive subjects and one author participated in the experiment after providing informed consent.

Task.

This experiment is similar to experiment 1 except for four key differences. First, subjects viewed the display for 300 ms while maintaining central fixation. Second, the display was followed by a mask (formed by superimposing the horizontal, vertical, and distractor stimuli at each location). Third, as soon as the mask appeared, subjects were instructed to choose one of the stimuli and to indicate their choice through either a key press (i.e., a number from 1 to 8 corresponding to the eight locations in the display) or a saccade to the chosen location. Fourth, we varied the targets’ feature-contrast unpredictably between trials, as opposed to the blocked condition in experiments 1 and 2. For each type of motor response (key press or saccade), we systematically varied the values of the targets across three blocks (vH/vV ∈ {0.5, 1, 2}) of 150 trials each. The order of the blocks was randomized across subjects. Within each block, we varied the targets’ orientation-contrast by changing the distractor orientation randomly between trials, θD ∈ {30, 45, 60}°. The type of motor response used to indicate choices was kept constant within blocks.

Data Analysis and Model Fits.

We averaged the data across all six subjects and analyzed them in the same manner as in experiment 1. There is, however, one important difference between the ideal observer models for experiments 1 and 3. In experiment 1, only a single type of distractor, D, occurred within a block (i.e., the distractor orientation was fixed); hence, the term in Eqs. 3 and 4 denoting the joint likelihood of the distractors at locations $1, \ldots, Z$ was $P(a_1, \ldots, a_Z \mid D) = \prod_{z=1}^{Z} P(a_z \mid D)$. However, in experiment 3, the distractor orientation varied randomly between trials; thus, the distractors on a given trial were identical but could be any of three types: $D_1$, $D_2$, or $D_3$. Accordingly, this term was updated to $P(a_1, \ldots, a_Z \mid D) = \frac{1}{3} \sum_{j=1}^{3} \prod_{z=1}^{Z} P(a_z \mid D_j)$, assuming that each type of distractor occurs with one-third probability.
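In terms of the hypothetical posterior sketch given earlier, this amounts to replacing the single product over distractor locations with an average of such products over the three types:

```python
import numpy as np

def distractor_joint_lik(local_lik):
    """Joint likelihood of the distractor locations when the trial's
    (shared) distractor type is unknown.

    local_lik : array of shape (3, Z) with local_lik[j, z] = P(a_z | D_j)
    Returns (1/3) * sum_j prod_z P(a_z | D_j), a uniform mixture over
    the three equiprobable distractor types.
    """
    return np.prod(local_lik, axis=1).mean()
```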

Supplementary Material

Supporting Information

Acknowledgments

We thank Mike Landy, the two anonymous reviewers, and the editor for their valuable comments on the manuscript. This work was supported by grants from the National Geospatial-Intelligence Agency, the Office of Naval Research, the National Science Foundation, and the National Institutes of Health. The funding agencies had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0911972107/DCSupplemental.

References

1. Treisman A, Gelade G. A feature integration theory of attention. Cognit Psychol. 1980;12:97–136. doi: 10.1016/0010-0285(80)90005-5.
2. Wolfe JM. Visual search. Attention. 1998;1:13–73.
3. Najemnik J, Geisler WS. Optimal eye movement strategies in visual search. Nature. 2005;434:387–391. doi: 10.1038/nature03390.
4. Theeuwes J, Kramer AF, Hahn S, Irwin DE, Zelinsky GJ. Influence of attentional capture on oculomotor control. J Exp Psychol Hum Percept Perform. 1999;25:1595–1608. doi: 10.1037//0096-1523.25.6.1595.
5. Platt ML, Glimcher PW. Neural correlates of decision variables in the parietal cortex. Nature. 1999;400:233–238. doi: 10.1038/22268.
6. Bendiksby MS, Platt ML. Neural correlates of reward and attention in macaque area LIP. Neuropsychologia. 2006;44:2411–2420. doi: 10.1016/j.neuropsychologia.2006.04.011.
7. Milstein DM, Dorris MC. The influence of expected value on saccadic preparation. J Neurosci. 2007;27:4810–4818. doi: 10.1523/JNEUROSCI.0577-07.2007.
8. Liston DB, Stone LS. Effects of prior information and reward on oculomotor and perceptual choices. J Neurosci. 2008;28:13866–13875. doi: 10.1523/JNEUROSCI.3120-08.2008.
9. Reinagel P, Zador A. Natural scene statistics at the centre of gaze. Network: Computation in Neural Systems. 1999;10:341–350.
10. Krieger G, Rentschler I, Hauske G, Schill K, Zetzsche C. Object and scene analysis by saccadic eye-movements: An investigation with higher-order statistics. Spat Vis. 2000;13:201–214. doi: 10.1163/156856800741216.
11. Parkhurst DJ, Niebur E. Scene content selected by active vision. Spat Vis. 2003;16:125–154. doi: 10.1163/15685680360511645.
12. Tatler BW, Baddeley RJ, Gilchrist ID. Visual correlates of fixation selection: Effects of scale and time. Vision Res. 2005;45:643–659. doi: 10.1016/j.visres.2004.09.017.
13. Berg DJ, Boehnke SE, Marino RA, Munoz DP, Itti L. Free viewing of dynamic stimuli by humans and monkeys. J Vis. 2009;9:1–15. doi: 10.1167/9.5.19.
14. Itti L, Koch C. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Res. 2000;40:1489–1506. doi: 10.1016/s0042-6989(99)00163-7.
15. Ma WJ, Beck JM, Latham PE, Pouget A. Bayesian inference with probabilistic population codes. Nat Neurosci. 2006;9:1432–1438. doi: 10.1038/nn1790.
16. Itti L. Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes. Visual Cognition. 2005;12:1093–1123.
17. Parkhurst D, Law K, Niebur E. Modeling the role of salience in the allocation of overt visual attention. Vision Res. 2002;42:107–123. doi: 10.1016/s0042-6989(01)00250-4.
18. MacKay DJC. Information Theory, Inference and Learning Algorithms. Cambridge Univ Press; 2003.
19. Luck SJ, Chelazzi L, Hillyard SA, Desimone R. Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. J Neurophysiol. 1997;77:24–42. doi: 10.1152/jn.1997.77.1.24.
20. Maunsell JHR. Neuronal representations of cognitive state: Reward or attention? Trends Cogn Sci. 2004;8:261–265. doi: 10.1016/j.tics.2004.04.003.
21. Colby CL, Duhamel JR, Goldberg ME. Visual, presaccadic, and cognitive activation of single neurons in monkey lateral intraparietal area. J Neurophysiol. 1996;76:2841–2852. doi: 10.1152/jn.1996.76.5.2841.
22. Bisley JW, Goldberg ME. The role of the parietal cortex in the neural processing of saccadic eye movements. Adv Neurol. 2003;93:141–157.
23. Gottlieb JP, Kusunoki M, Goldberg ME. The representation of visual salience in monkey parietal cortex. Nature. 1998;391:481–484. doi: 10.1038/35135.
24. Sugrue LP, Corrado GS, Newsome WT. Matching behavior and the representation of value in the parietal cortex. Science. 2004;304:1782–1787. doi: 10.1126/science.1094765.
25. Dorris MC, Glimcher PW. Activity in posterior parietal cortex is correlated with the relative subjective desirability of action. Neuron. 2004;44:365–378. doi: 10.1016/j.neuron.2004.09.009.
26. Hanks TD, Ditterich J, Shadlen MN. Microstimulation of macaque area LIP affects decision-making in a motion discrimination task. Nat Neurosci. 2006;9:682–689. doi: 10.1038/nn1683.
27. Knill DC, Pouget A. The Bayesian brain: The role of uncertainty in neural coding and computation. Trends Neurosci. 2004;27:712–719. doi: 10.1016/j.tins.2004.10.007.
28. Ma WJ, Pouget A. Linking neurons to behavior in multisensory perception: A computational review. Brain Res. 2008;1242:4–12. doi: 10.1016/j.brainres.2008.04.082.
29. Trommershauser J, Landy MS, Maloney LT. Humans rapidly estimate expected gain in movement planning. Psychol Sci. 2006;17:981–988. doi: 10.1111/j.1467-9280.2006.01816.x.
30. Navalpakkam V, Koch C, Perona P. Homo economicus in visual search. J Vis. 2009;9:1–16. doi: 10.1167/9.1.31.
31. Estes WK, Taylor HA. A detection method and probabilistic models for assessing information processing from brief visual displays. Proc Natl Acad Sci USA. 1964;52:446–454. doi: 10.1073/pnas.52.2.446.
32. Duncan J, Humphreys GW. Visual search and stimulus similarity. Psychol Rev. 1989;96:433–458. doi: 10.1037/0033-295x.96.3.433.
