PLOS ONE. 2013 Jan 22; 8(1): e53344. doi: 10.1371/journal.pone.0053344

Reward Optimization in the Primate Brain: A Probabilistic Model of Decision Making under Uncertainty

Yanping Huang 1, Rajesh P N Rao 1,*
Editor: Floris P de Lange

Abstract

A key problem in neuroscience is understanding how the brain makes decisions under uncertainty. Important insights have been gained using tasks such as the random dots motion discrimination task in which the subject makes decisions based on noisy stimuli. A descriptive model known as the drift diffusion model has previously been used to explain psychometric and reaction time data from such tasks but to fully explain the data, one is forced to make ad-hoc assumptions such as a time-dependent collapsing decision boundary. We show that such assumptions are unnecessary when decision making is viewed within the framework of partially observable Markov decision processes (POMDPs). We propose an alternative model for decision making based on POMDPs. We show that the motion discrimination task reduces to the problems of (1) computing beliefs (posterior distributions) over the unknown direction and motion strength from noisy observations in a Bayesian manner, and (2) selecting actions based on these beliefs to maximize the expected sum of future rewards. The resulting optimal policy (belief-to-action mapping) is shown to be equivalent to a collapsing decision threshold that governs the switch from evidence accumulation to a discrimination decision. We show that the model accounts for both accuracy and reaction time as a function of stimulus strength as well as different speed-accuracy conditions in the random dots task.

Introduction

Animals are constantly confronted with the problem of making decisions given noisy sensory measurements and incomplete knowledge of their environment. Making decisions under such circumstances is difficult because it requires (1) inferring hidden states in the environment that are generating the noisy sensory observations, and (2) determining if one decision (or action) is better than another based on uncertain and delayed reinforcement. Experimental and theoretical studies [1]–[6] have suggested that the brain may implement an approximate form of Bayesian inference for solving the hidden state problem. However, these studies typically do not address the question of how probabilistic representations of hidden state are employed in action selection based on reinforcement. Daw, Dayan and their colleagues [7], [8] explored the suitability of decision theoretic and reinforcement learning models in understanding several well-known neurobiological experiments. Bogacz and colleagues proposed a model that combines a traditional decision making model with reinforcement learning [9] (see also [10]). Rao [11] proposed a neural model for decision making based on the framework of partially observable Markov decision processes (POMDPs) [12]; the model focused on network implementation and learning but assumed a deadline to explain the collapsing decision threshold. Drugowitsch et al. [13] sought to explain the collapsing decision threshold by combining a traditional drift diffusion model with reward rate maximization. Other recent studies have used the general framework of POMDPs to explain experimental data in decision making tasks such as those involving a stop-signal [14], [15] and different types of prior knowledge [16].

In this paper, we derive from first principles a POMDP model for the well-known random dots motion discrimination task [17]. We show that the task reduces to the problems of (1) computing beta-distributed beliefs over the unknown direction and motion strength from noisy observations, and (2) selecting actions based on these beliefs in order to maximize the expected sum of future rewards. Without making ad-hoc assumptions such as a hypothetical deadline, a collapsing decision threshold emerges naturally via expected reward maximization. We present results comparing the model's predictions to experimental data and show that the model can explain both reaction time and accuracy as a function of stimulus strength as well as different speed-accuracy conditions.

Methods

POMDP framework

We model the random dots motion discrimination task as a POMDP. The POMDP framework assumes that at any particular time step, the environment is in a particular hidden state, $x$, that is not directly accessible to the animal. This hidden state can, however, be inferred by making a sequence of sensory measurements. At each time step $t$, the animal receives a sensory measurement (observation), $e_t$, from the environment, which is determined by an emission probability distribution $P(e_t \mid x)$. Since the hidden state $x$ is unknown, the animal must maintain a belief (posterior probability distribution) over the set of possible states given the sensory observations seen so far: $b_t(x) = P(x \mid e_{1:t})$, where $e_{1:t} = \{e_1, \ldots, e_t\}$ represents the sequence of observations that the animal has accumulated so far. At each time step, an action (decision) $a_t$ made by the animal can affect the environment by changing the current state to another according to a transition probability distribution $P(x' \mid x, a_t)$, where $x$ is the current state and $x'$ is a new state. The animal then receives a reward $R(x, a_t)$ from the environment, depending on the current state and the action taken. During training, the animal learns a policy, $\pi$, which indicates which action $a_t = \pi(b_t)$ to perform for each belief state $b_t$. We make two main assumptions in the POMDP model. First, the animal uses Bayes' rule to update its belief about the hidden state after each new observation $e_t$: $b_t(x) \propto P(e_t \mid x)\, b_{t-1}(x)$. Second, the animal is trained to follow an optimal policy $\pi^*$ that maximizes the animal's expected total future reward in the task. Figure 1 illustrates the decision making process using the POMDP framework. Note that in the decision making tasks that we model in this paper, the hidden state $x$ is fixed by the experimenters within a trial, and thus there is no transition distribution to include in the belief update equation. In general, the hidden state in a POMDP model follows a Markov chain, making the observations $e_{1:t}$ temporally correlated.
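To make these two assumptions concrete, the following minimal Python sketch performs the Bayes-rule belief update for a toy two-state problem (the emission matrix, observation sequence, and state count here are illustrative stand-ins, not values from the model below):

    import numpy as np

    # Toy 2-state POMDP with a fixed hidden state (no state transitions),
    # as in the tasks modeled in this paper. emission[x][e] = P(e | x);
    # the values are illustrative only.
    emission = np.array([[0.7, 0.3],    # P(e | x = 0)
                         [0.2, 0.8]])   # P(e | x = 1)

    def update_belief(belief, e):
        """One Bayes-rule update: b_t(x) is proportional to P(e_t | x) b_{t-1}(x)."""
        posterior = emission[:, e] * belief
        return posterior / posterior.sum()

    belief = np.array([0.5, 0.5])       # uniform prior belief b_0
    for e in [1, 1, 0, 1]:              # a made-up observation sequence
        belief = update_belief(belief, e)
    print(belief)                       # posterior over the hidden states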

Figure 1. POMDP Framework for Decision Making.


Left: The graphical model representing the probabilistic relationships between the random variables $x$, $e_t$, $b_t$ and $a_t$. In the POMDP model, the hidden state $x$ corresponds to coherence $c$ and direction $d$ jointly. The observation $e_t$ corresponds to the MT response. The relations between these variables are summarized in Table 1. Right: In order to solve a POMDP problem, the animal maintains a belief $b_t(x)$, which is a posterior probability distribution over hidden states $x$ of the world given observations $e_{1:t}$. At the current belief state $b_t$, an action is selected according to the learned policy $\pi$, which maps belief states to actions.

Random dots task as a POMDP

We now describe how the general framework of POMDPs can be applied to the random dots motion discrimination task as shown in Figure 1. In each trial, the experimenter chooses a fixed direction $d \in \{-1, +1\}$, corresponding to leftward and rightward motion respectively, and a stimulus strength (motion coherence) $c \in [0, 1]$, where $c = 0$ corresponds to completely random motion and $c = 1$ corresponds to $100\%$ coherent motion (i.e., all dots moving in the same direction). Intermediate values of $c$ represent a corresponding fraction of dots moving in the coherent direction (e.g., $c = 0.5$ represents $50\%$ coherent motion). The animal is shown a movie of randomly moving dots, a fraction $c$ of which are moving in the same direction $d$.

In a given trial, neither the direction $d$ nor the coherence $c$ is known to the animal. We therefore regard the pair $(c, d)$ as the joint hidden environment state $x$ in the POMDP model. Neurophysiological evidence suggests that information regarding random dot motion is received from neurons in cortical area MT [18]–[21]. Therefore, following previous models (e.g., [22]–[24]), we define the observation model $P(e_t \mid x)$ in the POMDP as a function of the responses of MT neurons. Let the firing rates of MT neurons preferring the rightward and leftward directions be $f_R$ and $f_L$ respectively. We can define:

$f_R = f_0\,(1 + g_P\, c), \qquad f_L = f_0\,(1 - g_N\, c) \qquad (d = +1)$  (1)

where $f_0$ spikes/second is the average spike rate for a $0\%$ coherent motion stimulus, and $g_P$ and $g_N$ are the “drive” in the preferred and null directions respectively (for leftward motion, $d = -1$, the roles of $g_P$ and $g_N$ in equation 1 are swapped). These constants ($f_0$, $g_P$ and $g_N$) are based on fits to experimental data as reported in [23], [25]. Let $\Delta_t$ be the elapsed time between time steps $t$ and $t+1$. Then the number of spikes $N_R$ (respectively $N_L$) emitted by MT neurons within $\Delta_t$ follows a Poisson distribution:

$P(N_i = n \mid \Delta_t) = \dfrac{(f_i \Delta_t)^n}{n!}\, e^{-f_i \Delta_t}, \qquad i \in \{R, L\}$  (2)

We define the observation $e_t$ at time $t$ as the spike count from MT neurons preferring rightward motion, given the total spike count from rightward- and leftward-preferring neurons; i.e., the observation is a conditional random variable $e_t = (N_R \mid N_R + N_L = N)$. Then $e_t$ follows a stationary binomial distribution $B(N, k)$, with $k$ defined below. Note that the duration of each POMDP time step need not be fixed, and we can therefore adjust $\Delta_t$ such that $N_R + N_L = N$ for some fixed $N$, i.e., the animal updates the posterior distribution over the hidden state each time it receives $N$ spikes from the MT population. The resulting $\Delta_t$ is random (the interval between successive spikes is exponentially distributed), and the standard deviation of $\Delta_t$ relative to its mean approaches zero as $N$ increases. When $N = 1$, $e_t$ becomes an indicator random variable representing whether a spike was emitted by a rightward-motion-preferring neuron or not.
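This observation model is easy to simulate. The sketch below (Python; the values of $f_0$, $g_P$ and $g_N$ are placeholders, not the fitted constants from [23], [25]) draws the binomial observation $e_t$ implied by the MT firing rates of equation 1:

    import numpy as np

    rng = np.random.default_rng(0)

    f0, gP, gN = 20.0, 0.4, 0.2   # illustrative constants only

    def mt_rates(c, d):
        """Firing rates of the right/left-preferring MT pools (equation 1)."""
        if d == +1:     # rightward motion
            return f0 * (1 + gP * c), f0 * (1 - gN * c)
        return f0 * (1 - gN * c), f0 * (1 + gP * c)   # leftward: roles swap

    def observe(c, d, N=1):
        """Draw one observation e_t: how many of the next N MT spikes come
        from the rightward-preferring pool. Conditioned on the total count
        N, e_t ~ Binomial(N, k) with k = f_R / (f_R + f_L)."""
        fR, fL = mt_rates(c, d)
        return rng.binomial(N, fR / (fR + fL))

    print([observe(c=0.128, d=+1) for _ in range(10)])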

It can be shown [26] that $e_t$ follows a binomial distribution $B(N, k)$ with

$k = \dfrac{f_R}{f_R + f_L}$  (3)

$k$ represents the probability that an MT neuron favoring rightward movement will spike given that there is a spike in the MT population. Since $k$ is a joint function of $c$ and $d$, we can equivalently regard it as the hidden state of our POMDP model: $k > 0.5$ indicates the rightward direction ($d = +1$) while $k < 0.5$ indicates the opposite direction ($d = -1$). The coherence $c = 0$ corresponds to $k = 0.5$, while $c = 1$ corresponds to the two extreme values $k_{\min}$ or $k_{\max}$ for direction $d$ being left or right respectively. Note that both direction $d$ and coherence $c$ are unknown to the animal in the experiments, but they are held constant within a trial.

Bayesian inference of hidden state

Given the framework above, the task of deciding the direction of motion of the coherently moving dots is equivalent to the task of deciding whether $k > 0.5$ or not, and deciding when to make such a decision. The POMDP model makes decisions based on the “belief” state $b_t(k)$, which is the posterior probability distribution over $k$ given a sequence of observations $e_{1:t}$:

$b_t(k) = P(k \mid e_{1:t}) = \dfrac{P(e_t \mid k)\, P(k \mid e_{1:t-1})}{P(e_t \mid e_{1:t-1})}$  (4)

where $P(e_t \mid k) = \binom{N}{e_t} k^{e_t} (1-k)^{N - e_t}$ is the binomial emission probability, $P(k \mid e_{1:0}) = P(k)$ is the prior, and $P(e_t \mid e_{1:t-1})$ is a normalizing constant. To facilitate the analysis, we represent the prior probability $P(k)$ as a beta distribution with parameters $\alpha_0$ and $\beta_0$. Note that the beta distribution is quite flexible: for example, a uniform prior can be obtained using $\alpha_0 = \beta_0 = 1$. Without loss of generality, we will fix $\alpha_0 = \beta_0 = 1$ throughout this paper. The posterior distribution can now be written as:

$b_t(k) \propto k^{\alpha_t - 1} (1-k)^{\beta_t - 1}, \qquad \alpha_t = \alpha_0 + n_R(t), \quad \beta_t = \beta_0 + n_L(t)$  (5)

The belief state $b_t$ at time step $t$ thus follows a beta distribution with the two parameters $\alpha_t$ and $\beta_t$ defined above. Consequently, the posterior probability distribution over $k$ depends only on the cumulative numbers of spikes $n_R(t)$ and $n_L(t)$ from neurons preferring rightward and leftward motion respectively. These in turn determine $\hat{k}_t$ and $t$, where

$\hat{k}_t = \dfrac{\alpha_t}{\alpha_t + \beta_t}$  (6)

is the point estimator of $k$, and $n_R(t) + n_L(t) = N t$. The animal only needs to keep track of $\alpha_t$ and $\beta_t$ in order to encode the belief state $b_t$. After marginalizing over coherence $c$, we have the posterior probability over direction $d$:

$P(d = +1 \mid e_{1:t}) = P(k > 0.5 \mid e_{1:t}) = 1 - I_{0.5}(\alpha_t, \beta_t)$  (7)
$P(d = -1 \mid e_{1:t}) = P(k \le 0.5 \mid e_{1:t}) = I_{0.5}(\alpha_t, \beta_t)$  (8)

where $I_x(\alpha, \beta)$ is the regularized incomplete beta function.
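The conjugate update in equations 5 through 8 is a few lines of code. This sketch (Python; the variable names and observation sequence are my own illustrative choices) tracks $(\alpha_t, \beta_t)$ and reads out the posterior probability of rightward motion with the regularized incomplete beta function:

    from scipy.special import betainc   # regularized incomplete beta I_x(a, b)

    alpha, beta = 1.0, 1.0              # uniform Beta(1, 1) prior on k

    def update(alpha, beta, e, N=1):
        """Beta-binomial conjugate update (equation 5): e of the N spikes
        in this step came from the rightward-preferring pool."""
        return alpha + e, beta + (N - e)

    for e in [1, 1, 0, 1, 1]:           # made-up spike-by-spike observations
        alpha, beta = update(alpha, beta, e)

    k_hat = alpha / (alpha + beta)              # point estimator (equation 6)
    p_right = 1.0 - betainc(alpha, beta, 0.5)   # P(d = +1) (equation 7)
    print(k_hat, p_right)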

Actions, rewards, and value function

The animal updates its belief after receiving the current observation $e_t$, and chooses one of three actions (decisions) $a_t \in \{A_R, A_L, A_S\}$, denoting rightward eye movement, leftward eye movement, and sampling (i.e., waiting for one more observation) respectively. The model assumes the animal receives rewards as follows (rewards are modeled using real numbers). When the animal makes a correct choice, i.e., a rightward eye movement $A_R$ when $k > 0.5$ ($d = +1$) or a leftward eye movement $A_L$ when $k < 0.5$ ($d = -1$), the animal receives a positive reward $R_+$. The animal receives a negative reward (i.e., a penalty) $R_-$, or nothing, when an incorrect action is chosen. We further assume that the animal is motivated by hunger or thirst to make a decision as quickly as possible. This is modeled using a unit penalty $R_S$ for each observation the animal makes, representing the cost the animal needs to pay when choosing the sampling action $A_S$.

Recall that a belief state $b_t$ is determined by the parameters $(\alpha_t, \beta_t)$. The goal of the animal is to find an optimal “policy” $\pi^*$ that maximizes the “value” function $V(b_t)$, defined as the expected sum of future rewards given the current belief state:

$V^{\pi}(b_t) = E\!\left[\sum_{\tau \ge t} \rho(b_\tau, \pi(b_\tau)) \mid b_t\right]$  (9)

where the expectation is taken with respect to all future belief states $b_{t+1}, b_{t+2}, \ldots$ The reward term $\rho(b_t, a_t)$ above is the expected reward for the given belief state and action:

$\rho(b_t, A_S) = R_S$  (10)
$\rho(b_t, A_R) = R_+\, P(k > 0.5 \mid b_t) + R_-\, P(k \le 0.5 \mid b_t), \qquad \rho(b_t, A_L) = R_+\, P(k \le 0.5 \mid b_t) + R_-\, P(k > 0.5 \mid b_t)$

The above equations can be interpreted as follows. When $A_S$ is selected, the animal receives $N$ more spikes (one more observation) at a cost of $R_S$. When $A_R$ is selected, the expected reward $\rho(b_t, A_R)$ depends on the probability density function of the hidden parameter $k$ given belief state $b_t$. With probability $P(k \le 0.5 \mid b_t) = I_{0.5}(\alpha_t, \beta_t)$, the true parameter $k$ is less than $0.5$, making $A_R$ an incorrect decision with penalty $R_-$; with probability $1 - I_{0.5}(\alpha_t, \beta_t)$, action $A_R$ is correct, earning the reward $R_+$.
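In code, the expected immediate reward of equation 10 follows directly from the beta belief. A sketch (Python; the reward values are placeholders, not the paper's fitted parameters):

    from scipy.special import betainc

    R_PLUS, R_MINUS, R_S = 10.0, 0.0, -1.0   # illustrative reward values

    def expected_reward(alpha, beta, action):
        """rho(b, a) of equation 10 for belief Beta(alpha, beta)."""
        p_left = betainc(alpha, beta, 0.5)    # P(k <= 0.5 | b)
        if action == "sample":
            return R_S
        if action == "right":
            return R_PLUS * (1 - p_left) + R_MINUS * p_left
        if action == "left":
            return R_PLUS * p_left + R_MINUS * (1 - p_left)

    print(expected_reward(5, 2, "right"))     # confident-rightward belief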

Finding the optimal policy

A policy $\pi(b_t)$ defines a mapping from a belief state to one of the available actions $a_t$. A method for learning a POMDP policy by trial and error using temporal difference (TD) learning was suggested in [11]. Here, we derive a policy from first principles and compare the result with behavioral data.

One standard way [12] to solve a POMDP is to first convert it into a Markov decision process (MDP) over belief states, and then apply standard dynamic programming techniques such as value iteration [27] to compute the value function in equation 9. For the corresponding belief MDP, we need to define the transition probabilities $T(b_{t+1} \mid b_t, a_t)$. When $a_t = A_S$, the belief state can be updated using the previous belief state and the current observation based on Bayes' rule:

$T(b_{t+1} \mid b_t, A_S) = \sum_{e=0}^{N} P(e \mid b_t)\, \delta(b_{t+1}, b_t^{\,e})$  (11)

for all beliefs $b_{t+1}$, where $b_t^{\,e}$ denotes the belief obtained from $b_t$ by Bayes' rule after observing $e$ (i.e., $(\alpha_t, \beta_t) \rightarrow (\alpha_t + e, \beta_t + N - e)$). In the above equation, $\delta(\cdot, \cdot)$ is the Kronecker delta, and $P(e \mid b_t)$ is the expected value of the likelihood function $P(e \mid k)$ over the posterior distribution $b_t(k)$:

$P(e \mid b_t) = \int_0^1 P(e \mid k)\, b_t(k)\, dk$  (12)

which is a stationary (beta-binomial) distribution independent of the time $t$. When the selected action is $A_R$ or $A_L$, the animal stops sampling and makes an eye movement. To account for such cases, we include an additional state $b_{end}$, representing a terminal state, with zero reward ($\rho(b_{end}, a) = 0$) and absorbing behavior, $T(b_{end} \mid b_{end}, a) = 1$ for all actions $a$. Formally, the transition probabilities with respect to the absorbing (termination) state are defined as $T(b_{end} \mid b_t, A_R) = T(b_{end} \mid b_t, A_L) = 1$ for all $b_t$, indicating the end of a trial.

Given the time-independent belief state transitions $T$, the optimal value $V^*$ and policy $\pi^*$ can be obtained by solving Bellman's equation:

$V^*(b_t) = \max_{a_t} \left[\rho(b_t, a_t) + \sum_{b_{t+1}} T(b_{t+1} \mid b_t, a_t)\, V^*(b_{t+1})\right]$
$\pi^*(b_t) = \arg\max_{a_t} \left[\rho(b_t, a_t) + \sum_{b_{t+1}} T(b_{t+1} \mid b_t, a_t)\, V^*(b_{t+1})\right]$  (13)

Before we proceed to results from the model, we note that the one-step belief transition probability matrix $T$ with $N = N_0$ can be shown to be mathematically equivalent to the $N_0$-step transition matrix with $N = 1$; the solution to Bellman's equation 13 is therefore independent of $N$. Unless otherwise mentioned, the results below are based on the most general scenario in which the animal selects an action whenever a new spike is received, i.e., $N = 1$.
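Because the reachable beliefs after $t$ steps (with $N = 1$) are exactly the pairs $(\alpha, \beta) = (1 + n_R,\, 1 + t - n_R)$, the belief MDP is a discrete lattice and equation 13 can be solved by backward induction. A sketch (Python; the reward values and the finite horizon used to truncate the recursion are illustrative choices, not the paper's fitted parameters):

    import numpy as np
    from scipy.special import betainc

    R_PLUS, R_MINUS, R_S = 10.0, 0.0, -1.0   # illustrative rewards
    T_MAX = 200                              # horizon truncating equation 13

    def stop_value(a, b):
        """Best expected reward for stopping now (A_R or A_L), equation 10."""
        p_left = betainc(a, b, 0.5)
        return max(R_PLUS * (1 - p_left) + R_MINUS * p_left,
                   R_PLUS * p_left + R_MINUS * (1 - p_left))

    # V[t][nR] = optimal value of belief (alpha, beta) = (1 + nR, 1 + t - nR).
    V = [np.zeros(t + 1) for t in range(T_MAX + 1)]
    V[T_MAX] = np.array([stop_value(1 + nR, 1 + T_MAX - nR)
                         for nR in range(T_MAX + 1)])

    for t in range(T_MAX - 1, -1, -1):       # backward induction
        for nR in range(t + 1):
            a, b = 1 + nR, 1 + t - nR
            p_spike_R = a / (a + b)          # beta-binomial predictive, eq. 12
            v_sample = R_S + p_spike_R * V[t + 1][nR + 1] \
                           + (1 - p_spike_R) * V[t + 1][nR]
            V[t][nR] = max(stop_value(a, b), v_sample)

    print(V[0][0])   # value of the initial uniform belief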

We summarize the model variables and their statistical relationships in Table 1.

Table 1. Summary of model variables and parameters.

POMDP Variables: Descriptions
$x$: The hidden variable of the POMDP, $x = k(c, d)$ (see equation 3). In the random dots task, $x$ is a constant over time.
$c$: The coherence (motion strength) of the random dots task, $c \in [0, 1]$. $c$ is fixed within a trial.
$d$: The underlying direction of the random dots task, $d \in \{-1, +1\}$. $d$ is fixed within a trial.
$f_R, f_L$: The average spike rates of MT neurons preferring the rightward or leftward direction, respectively, as functions of both coherence $c$ and direction $d$, as described in equation 1.
$N_R, N_L$: The numbers of spikes emitted by MT neurons preferring the rightward or leftward direction, respectively, during one POMDP step. $N_R$ ($N_L$) follows a Poisson distribution with mean $f_R \Delta_t$ ($f_L \Delta_t$).
$N$: The total number of spikes emitted by MT neurons during one POMDP step, $N = N_R + N_L$.
$e_t$: The noisy observation at time step $t$, a conditional random variable $e_t = (N_R \mid N)$ following a binomial distribution $B(N, k)$. Note that the observations $e_t$ are conditionally independent of each other given the hidden variable $x$.
$b_t$: The belief (posterior distribution) $b_t(k) = P(k \mid e_{1:t})$. With a beta-distributed initial belief $b_0 = Beta(\alpha_0, \beta_0)$, $b_t$ is also beta distributed due to the binomially distributed emission probability $P(e_t \mid k)$. Without loss of generality, $\alpha_0 = \beta_0 = 1$ throughout the paper.
$a_t$: The action chosen by the animal at time $t$, $a_t \in \{A_R, A_L, A_S\}$.
Model Parameters
$R_S$: A negative reward associated with the cost of an observation.
$R_+$: A positive reward associated with a correct eye movement.
$R_-$: A negative reward associated with an incorrect eye movement.
$\tau$: The duration of a single observation, i.e., the real elapsed time per POMDP step. Used only to translate the number of POMDP time steps into real elapsed time when comparing with experimental data.
$T_{nd}$: The non-decision residual time. Both $\tau$ and $T_{nd}$ are obtained from a linear regression used to compare model predictions (in units of POMDP steps) with animals' response times (in units of seconds), independently of the POMDP model.

Results

Optimal value function and policy

Figure 2 (a) shows the optimal value function computed by applying value iteration [27] to the POMDP defined in the Methods section, for a fixed setting of the reward parameters $(R_+, R_-, R_S)$. The $x$-axis of Figure 2 (a) represents the total number of observations encountered thus far, which equals the elapsed time $t$ (in POMDP steps) in the trial. The $y$-axis represents the ratio $\hat{k}_t = \alpha_t / (\alpha_t + \beta_t)$, which is the estimator of the hidden parameter $k$. In general, the model predicts a high value when $\hat{k}_t$ is close to $0$ or $1$, or equivalently, when the estimated coherence is close to $100\%$. This is because at these two extremes, selecting the appropriate action has a high probability of receiving the large positive reward $R_+$. On the other hand, for $\hat{k}_t$ near $0.5$ (estimated coherence near $0$), choosing $A_R$ or $A_L$ in these states has a high chance of resulting in an incorrect decision and a large negative reward $R_-$ (see [11] for a similar result using a different model and under the assumption of a deadline). Thus, belief states with $\hat{k}_t \approx 0.5$ have a much lower value compared to belief states with $\hat{k}_t$ near $0$ or $1$.

Figure 2. Optimal Value and Policy for the Random Dots Task.


(a) Optimal value as a joint function of $\hat{k}_t$ and the number of POMDP steps $t$. (b) Optimal policy as a function of $\hat{k}_t$ and the number of POMDP steps $t$. The boundaries $\theta_R(t)$ and $\theta_L(t)$ divide the belief space into three areas: $B_R$ (red), $B_L$ (green), and $B_S$ (blue), each of which represents the belief states whose optimal actions are $A_R$, $A_L$ and $A_S$ respectively. Model parameters: fixed values of $R_+$, $R_-$, and $R_S$ (see text). (c) Left: The rightward decision boundary $\theta_R(t)$ for different values of the reward ratio $\lambda$ (defined in the text). Right: The half time $t_{1/2}$ of $\theta_R$ for different values of $\lambda$, where $t_{1/2}$ is the half time parameter of the hyperbolic fit in equation 15.

Figure 2 (b) shows the corresponding optimal policy $\pi^*$ as a joint function of $\hat{k}_t$ and $t$. The optimal policy partitions the belief space into three regions: $B_R$, $B_L$, and $B_S$, representing the sets of belief states preferring actions $A_R$, $A_L$ and $A_S$ respectively. Let $B_a(t)$ be the set of belief states preferring action $a$ after $t$ observations, for $a \in \{A_R, A_L, A_S\}$. Early in a trial, when $t$ is small, the model selects the sampling action $A_S$ regardless of the value of $\hat{k}_t$. This is because for small $t$, the variance of the point estimator $\hat{k}_t$ is high. For example, even when $\hat{k}_t$ is large at $t = 1$ (a single spike from the rightward-preferring population), the probability that the true $k$ is below $0.5$ is still high. The sampling action $A_S$ is required to reduce this variance by accruing more evidence. As $t$ becomes larger, the variance of $\hat{k}_t$ decreases, and the deviation between $\hat{k}_t$ and the true value of $k$ diminishes by the law of large numbers. Consequently, the animal will pick action $A_R$ even when $\hat{k}_t$ is only slightly above $0.5$. This gradual decrease over time in the threshold for choosing the overt actions $A_R$ or $A_L$ has been called a “collapsing bound” in the decision making literature [28]–[30].

The optimal policy $\pi^*$ is entirely determined by the three reward parameters $(R_S, R_+, R_-)$. At a given belief state, $\pi^*$ picks the one of the three available actions that leads to the largest expected future reward. Thus, the choice is determined by the relative, not the absolute, value of the expected future reward for the different actions. From equation 10, we have

$\rho(b_t, A_R) - \rho(b_t, A_L) = (R_+ - R_-)\left[2\, P(k > 0.5 \mid b_t) - 1\right]$  (14)

If we regard the sampling penalty $|R_S|$ as specifying the unit of reward, the optimal policy $\pi^*$ is determined by the ratio $\lambda = (R_+ - R_-)/|R_S|$ alone: since every trial ends with exactly one overt action, adding a constant to both $R_+$ and $R_-$ shifts the value of all actions equally and leaves the policy unchanged. Figure 2 (c) shows the relationship between $\lambda$ and the optimal policy $\pi^*$ by showing the rightward decision boundaries $\theta_R(t)$ for different values of $\lambda$. As $\lambda$ increases (e.g., by making the sampling cost $|R_S|$ smaller), the boundary $\theta_R$ gradually moves towards the upper right corner, giving the animal more time to make decisions, which results in more accurate decisions. To better understand this relationship, we fit the decision boundary to a hyperbolic function:

$\theta_R(t) = \theta_\infty + \dfrac{\theta_0 - \theta_\infty}{1 + t / t_{1/2}}$  (15)

where $t_{1/2}$ is the half time of the collapse. We find that $t_{1/2}$ exhibits nearly logarithmic growth with $\lambda$. Interestingly, a collapsing bound is obtained even for extremely small sampling costs $|R_S|$, because the goal is reward maximization across trials: it is better to terminate a trial and accrue reward in future trials than to continue sampling noisy (possibly $0\%$ coherent) stimuli.
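Extracting the half time from a computed boundary is a one-line curve fit. A sketch (Python; theta_of_t stands for a boundary read off the value-iteration policy above, replaced here by synthetic stand-in data, and the functional form mirrors equation 15 under my parameterization):

    import numpy as np
    from scipy.optimize import curve_fit

    def hyperbolic(t, theta0, theta_inf, t_half):
        """Collapsing-boundary form of equation 15."""
        return theta_inf + (theta0 - theta_inf) / (1.0 + t / t_half)

    # theta_of_t: boundary values theta_R(t); synthetic data for illustration.
    t = np.arange(1, 200, dtype=float)
    theta_of_t = hyperbolic(t, 0.95, 0.55, 30.0) + 0.005 * np.random.randn(t.size)

    (theta0, theta_inf, t_half), _ = curve_fit(hyperbolic, t, theta_of_t,
                                               p0=(1.0, 0.5, 20.0))
    print(t_half)   # the fitted half time t_1/2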

Model predictions: psychometric function and reaction time

We compare predictions of the model based on the learned policy $\pi^*$ with experimental data from the reaction time version (rather than the fixed duration version) of the motion discrimination task [31]. As illustrated in Figure 3, the model assumes that motion information regarding the random dots on the screen is processed by MT neurons. These neurons provide the observations $e_t$ (spike counts $N_R$ and $N_L$) to right- and left-direction coding LIP neurons, which maintain the belief state $b_t$. Actions are selected based on the optimal policy $\pi^*$. If $b_t \in B_R$ or $b_t \in B_L$, the animal makes a rightward or leftward decision respectively and terminates the trial. When $b_t \in B_S$, the animal chooses the sampling action and obtains a new observation $e_{t+1}$.

Figure 3. Relationship between Model and Neural Activity.


The input to the model is a random dots motion sequence. Neurons in MT with tuning curves $f_R$ and $f_L$ emit $N_R$ and $N_L$ spikes at time step $t$, which constitute the observation $e_t$ in the POMDP model. The animal maintains the belief state $b_t$ by computing $P(k \mid e_{1:t})$ ($b_t$ can be parameterized by $\alpha_t$ and $\beta_t$; see text). The optimal policy is implemented by selecting a rightward eye movement $A_R$ when $b_t \in B_R$, or equivalently, when $\hat{k}_t \ge \theta_R(t)$ (and likewise for a leftward eye movement $A_L$).

The performance on the task using the optimal policy $\pi^*$ can be measured in terms of both the accuracy of direction discrimination (the so-called psychometric function) and the reaction time required to reach a decision (the chronometric function). In this section, we derive the expected accuracy and reaction time as functions of stimulus coherence $c$, and compare them to the psychometric and chronometric functions of a monkey performing the same task [31].

The sequence of random variables $b_0, b_1, b_2, \ldots$ forms a (non-stationary) Markov chain with transition probabilities determined by equation 11. Let $P_S(b_t)$ be the joint probability that the belief state is $b_t$ and the animal has kept selecting $A_S$ until time step $t$:

$P_S(b_t) = P(b_t,\; a_\tau = A_S \;\text{for all}\; \tau < t)$  (16)

At $t = 0$, the animal will select $A_S$ regardless of $\hat{k}_t$ under $\pi^*$, making $P_S(b_0) = 1$. For $t > 0$, $P_S$ can be expressed recursively as:

$P_S(b_{t+1}) = \sum_{b_t \in B_S(t)} T(b_{t+1} \mid b_t, A_S)\, P_S(b_t)$  (17)

Let $g_R(t)$ and $g_L(t)$ be the joint probability mass functions that the animal makes a right or left choice at time $t$, respectively. These correspond to the probability that the point estimator $\hat{k}_t$ crosses the boundary $\theta_R(t)$ or $\theta_L(t)$ for the first time at time $t$:

$g_R(t) = \sum_{b_t \in B_R(t)} P_S(b_t)$  (18)
$g_L(t) = \sum_{b_t \in B_L(t)} P_S(b_t)$  (19)

The probabilities of making a rightward or leftward eye movement are the marginal probabilities summed over all possible crossing times: $P_R = \sum_t g_R(t)$ and $P_L = \sum_t g_L(t)$. When the underlying motion direction is rightward, $P_R$ represents the accuracy of motion discrimination and $P_L$ represents the error rate. The mean reaction times for correct and error choices are the expected crossing times under the conditional probabilities that the animal makes decision $A_R$ or $A_L$ respectively at time $t$:

$\overline{T}_R = \frac{1}{P_R} \sum_t t\, g_R(t)$  (20)
$\overline{T}_L = \frac{1}{P_L} \sum_t t\, g_L(t)$  (21)
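Equations 16 through 21 can be evaluated by propagating probability mass over the lattice of beliefs under the true stimulus. A sketch (Python; in_B_R and in_B_L stand for the stopping regions read off the optimal policy computed earlier, and k_true is the true value of the hidden parameter for the simulated condition):

    import numpy as np

    def choice_stats(k_true, in_B_R, in_B_L, T_MAX=500):
        """Accuracy and mean reaction times via the first-passage
        recursion of equations 16-21. in_B_R(t, nR) and in_B_L(t, nR)
        report whether the belief (1 + nR, 1 + t - nR) stops right/left;
        spikes arrive one at a time (N = 1) with P(rightward) = k_true."""
        P_S = {(0, 0): 1.0}                    # eq. 16: still sampling at t = 0
        g_R, g_L = np.zeros(T_MAX), np.zeros(T_MAX)
        for t in range(T_MAX - 1):
            nxt = {}
            for (tt, nR), p in P_S.items():
                if in_B_R(tt, nR):             # first crossing of theta_R (eq. 18)
                    g_R[tt] += p
                elif in_B_L(tt, nR):           # first crossing of theta_L (eq. 19)
                    g_L[tt] += p
                else:                          # keep sampling (eq. 17)
                    nxt[(tt + 1, nR + 1)] = nxt.get((tt + 1, nR + 1), 0.0) + p * k_true
                    nxt[(tt + 1, nR)] = nxt.get((tt + 1, nR), 0.0) + p * (1.0 - k_true)
            P_S = nxt
        P_R, P_L = g_R.sum(), g_L.sum()        # choice probabilities
        ts = np.arange(T_MAX)
        T_R = (ts * g_R).sum() / max(P_R, 1e-12)   # eq. 20
        T_L = (ts * g_L).sum() / max(P_L, 1e-12)   # eq. 21
        return P_R, T_R, T_L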

The left panel of Figure 4 shows performance accuracy as a function of motion strength $c$ for the model (solid curve) and a monkey (black dots). The model parameters are the same as those in Figure 2; the best-fitting reward ratio $\lambda$ was obtained using a binary search with a fixed minimum step size.

Figure 4. Comparison of Performance of the Model and Monkey.


Black dots with error bars represent a monkey's decision accuracy and reaction time for correct trials. Blue solid curves are model predictions ($P_R$ and $\overline{T}_R$ in the text) for the parameter values given in the text. Monkey data from [31].

The right panel of Figure 4 shows, for the same model parameters, the predicted mean reaction time $\overline{T}_R$ for correct choices as a function of coherence $c$ (and fixed rightward direction) for the model (solid curve) and the monkey (black dots). Note that $\overline{T}_R$ represents the expected number of POMDP time steps before a rightward eye movement $A_R$. It follows from the Poisson spiking process that the duration of each POMDP time step follows an exponential distribution with expectation proportional to $1/(f_R + f_L)$. In order to make a direct comparison to the monkey reaction times, which are in units of real time, a linear regression was used to determine the duration $\tau$ of a single observation and the non-decision residual time $T_{nd}$:

$RT(c) = \tau\, \overline{T}_R(c) + T_{nd}$  (22)

Note that the reaction time in a trial is the sum of the decision time and non-decision delays whose properties are not well understood. The offset $T_{nd}$ represents the non-decision residual time. We used the experimental mean reaction times reported in [31] across motion coherences to compute the two coefficients $\tau$ and $T_{nd}$. The resulting offset $T_{nd}$ is comparable to the average non-decision time reported in the literature [23], [32].
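The conversion from POMDP steps to milliseconds is an ordinary least-squares fit. A sketch (Python; steps holds model-predicted mean decision steps and rt_ms the corresponding measured mean reaction times, both synthetic stand-in values):

    import numpy as np

    # Model-predicted mean decision steps per coherence, and measured mean
    # reaction times in ms (synthetic stand-in values for illustration).
    steps = np.array([120.0, 95.0, 70.0, 45.0, 30.0, 22.0])
    rt_ms = np.array([830.0, 760.0, 660.0, 560.0, 480.0, 460.0])

    # Equation 22: RT = tau * T_R + T_nd, fit by linear regression.
    tau, T_nd = np.polyfit(steps, rt_ms, deg=1)
    print(tau, T_nd)   # ms per POMDP step, and non-decision residual time (ms)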

There is essentially one free parameter in our model for fitting the experimental accuracy data, namely the reward ratio $\lambda$. The other two parameters, $\tau$ and $T_{nd}$, are independent of the POMDP model and are used only to translate POMDP time steps into real elapsed time. The reward ratio has a direct physical interpretation and can be easily manipulated by experimenters: for example, changing the amount of reward for correct/incorrect choices, or giving subjects different speed instructions, will effectively change $\lambda$. In Figure 5 (a), we show performance accuracies $P_R$ and predicted mean reaction times $\overline{T}_R$ for different values of $\lambda$. With $R_+$ and $R_-$ fixed, decreasing the sampling cost $|R_S|$ makes observations more affordable and allows subjects to accumulate more evidence, which in turn leads to longer decision times and higher accuracy. Our model thus provides a quantitative framework for predicting the effects of reward parameters on the accuracy and speed of decision making. To test our theory, we compare the model predictions with experimental data from a human subject, reported by Hanks et al. [33], under different speed-accuracy regimes. In their experiments, human subjects were instructed to perform the random dots task under different speed-accuracy conditions. The red crosses in Figure 5 (b) represent the response times and accuracy of a human subject in the direction discrimination task when instructed to perform the task more carefully at a slower speed, while the black dots represent performance under normal speed conditions. The slower speed instruction encourages subjects to accumulate more observations before making the final decision; in the model, this amounts to reducing the magnitude of the negative cost $R_S$ associated with each sample. Indeed, this tradeoff between speed and accuracy was consistent with the predicted effects of changing the reward ratio. We first fit the model parameters ($\lambda$, $\tau$ in ms/step, and $T_{nd}$ in ms) to the experimental data under normal speed conditions (Figure 5 (b), black solid curves). The red dashed lines shown in Figure 5 (b) are model fits to the data under the slower speed instruction. There is just one degree of freedom in this fit, as all model parameters except the reward ratio were fixed to the values used to fit the data in the normal speed regime.

Figure 5. Effect of the reward ratio $\lambda$ on the speed-accuracy tradeoff.


(a) Model predictions of the psychometric and chronometric functions for different values of $\lambda$. (b) Comparison of model predictions and experimental data for different speed-accuracy regimes. The black dots represent the response times and accuracy of a human subject in the direction discrimination task under normal speed conditions, while the red crosses represent data obtained under a slower speed instruction. The model predictions are plotted as black solid curves (normal speed) and red dashed lines (slow speed), respectively; the two fits differ only in the value of the reward ratio $\lambda$. The per-step duration $\tau$ and non-decision residual time $T_{nd}$ are fixed to be the same for both conditions. Human data are from subject LH in [33].

Neural response during direction discrimination task

From Figure 2 (b), it is clear that for the random dots task, the animal does not need to store the whole two-dimensional optimal policy but only the two one-dimensional decision boundaries $\theta_R(t)$ and $\theta_L(t)$. This naturally suggests a neural mechanism for decision making similar to that in drift diffusion models: LIP neurons compute the belief state from MT responses and employ divisive normalization to maintain the point estimate $\hat{k}_t = \alpha_t/(\alpha_t + \beta_t)$. We now explore the hypothesis that the response of LIP neurons represents the difference between $\hat{k}_t$ and the optimal decision threshold $\theta_R(t)$. In this model, a rightward eye movement is initiated only when the difference $\hat{k}_t - \theta_R(t)$ reaches a fixed bound (in this case, $0$). Therefore, we modeled the firing rates $\nu_R(t)$ in the lateral intraparietal area (LIP) as:

$\nu_R(t) = \nu_0 + G\left[\hat{k}_t - \theta_R(t)\right] + B$  (23)

where $\nu_0$ is the spontaneous firing rate of LIP neurons and $G$ is a gain factor. Since $\hat{k}_t - \theta_R(t) \le 0$ before the decision, the constant $B$ is added to keep $\nu_R(t)$ non-negative; $B$ represents the termination bound, with its value (in spikes s$^{-1}$) taken from [30]. The firing rate $\nu_L(t)$ is defined similarly.
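A minimal sketch of this readout (Python; the spontaneous rate, gain and bound values are illustrative stand-ins, as are the example inputs):

    def lip_rate(k_hat, theta_R_t, nu0=10.0, G=60.0, B=69.0):
        """Model LIP firing rate (equation 23): spontaneous rate nu0 plus a
        gain-scaled distance of the point estimate from the collapsing
        boundary, offset by the termination bound B. A rightward saccade is
        triggered when k_hat - theta_R_t reaches 0, i.e. when the rate
        reaches nu0 + B. All parameter values here are illustrative."""
        return nu0 + G * (k_hat - theta_R_t) + B

    print(lip_rate(k_hat=0.62, theta_R_t=0.75))   # below the bound nu0 + B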

The above model makes two testable predictions about neural responses in LIP. The first is that the neural response to $0\%$ coherent motion (the so-called “urgency” signal [30], [34]) encodes the decision boundary $\theta_R(t)$ (or $\theta_L(t)$ for leftward-preferring LIP neurons). In Figure 6 (a), we plot the model response to $0\%$ coherent motion, along with a fit to a hyperbolic function $u(t) = u_{\max}\, t/(t + \tau_{1/2})$, the same function that Churchland et al. [30] used to parametrize the experimentally observed “urgency signal.” The parameter $\tau_{1/2}$ is the time taken to reach half of the maximum. The estimate of $\tau_{1/2}$ for the model from Figure 6 (a) is consistent with the value estimated from neural data [30].

Figure 6. Comparison of Model and Neural Responses.


(a) The model response to $0\%$ coherence motion is shown in red. The blue curve depicts a fit using a hyperbolic function $u(t) = u_{\max}\, t/(t + \tau_{1/2})$; the fitted $\tau_{1/2}$ is comparable to the value estimated from neural data [30]. (b) The initial period of decision time was used to compute the buildup rate from the model response, following the procedure in [30]. The red points show model buildup rates estimated for each coherence value. The effect of a unit change in coherence on the buildup rate can be estimated from the slope of the blue fitted line; this value (in spikes s$^{-2}$ per $\%$ coherence) is similar to the corresponding value estimated from neural data [30].

The second prediction concerns the buildup rate (in units of spikes s$^{-2}$ per $\%$ coherence) of LIP firing rates. The buildup rate at each motion strength is calculated from the slope of a line fit to the model LIP firing rate during the initial period of decision time. As shown in Figure 6 (b), buildup rates scale approximately linearly as a function of motion coherence. The effect of a unit change in coherence on the buildup rate, estimated from the slope of the fitted line, is similar to what has been reported in the literature [30].
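The buildup-rate analysis amounts to two line fits. A sketch (Python; rates maps each coherence to a model LIP rate trace sampled every millisecond, and the window length is a placeholder for the early decision epoch used in [30]):

    import numpy as np

    def buildup_rates(rates, window_ms=200):
        """Slope of each early LIP rate trace (spikes/s per second), then
        the slope of those buildup rates against % coherence."""
        t = np.arange(window_ms) / 1000.0                 # seconds
        coh = np.array(sorted(rates))                     # coherences, in [0, 1]
        slopes = np.array([np.polyfit(t, np.asarray(rates[c])[:window_ms], 1)[0]
                           for c in coh])
        per_coh = np.polyfit(100.0 * coh, slopes, 1)[0]   # spikes s^-2 per %coh
        return slopes, per_coh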

Discussion

The random dots motion discrimination task has provided a wealth of information regarding decision making in the primate brain. Much of this data has previously been modeled using the drift diffusion model [35], [36], but to fully account for the experimental data, one sometimes has to make ad-hoc assumptions. This paper introduces an alternative model for explaining the monkey's behavior based on the framework of partially observable Markov decision processes (POMDPs).

We believe that the POMDP model provides a more versatile framework for decision making than the drift diffusion model, which can be viewed as a special case of sequential statistical hypothesis testing (SSHT) [37]. Sequential statistical hypothesis testing assumes that the stimuli (observations) are independent and identically distributed, whereas the POMDP model allows observations to be temporally correlated. The observations in the POMDP are conditionally independent given the hidden state $x$, which evolves according to a Markov chain. Thus, the POMDP framework for decision making [11], [14], [16], [38], [39] can be regarded as a strictly more general model than the SSHT models. We intend to explore the applicability of our POMDP model to time-dependent stimuli, such as temporally dynamic attention [40] and temporally blurred stimulus representations [41], in future studies.

Another advantage of the POMDP model is that its parameters have direct physical interpretations and can be easily manipulated by the experimenter. Our analysis shows that the optimal policy is fully determined by the reward parameters $(R_S, R_+, R_-)$. Thus, the model psychometric and chronometric functions, which are derived from the optimal policy, are also fully determined by these parameters. Experimenters can control the reward parameters by changing the amount of reward for correct/incorrect choices, or by giving subjects different speed instructions. This allows our model to make testable predictions, as demonstrated by the effects of changes in the reward ratio on the speed-accuracy trade-off. It should be noted that these reward parameters can be subjective and may vary from individual to individual. For example, $R_+$ can be directly related to the external food or juice reward provided by the experimenter, while $R_S$ may be linked to internal factors such as degree of hunger or thirst, drive, and motivation. The precise relationship between these reward parameters and the external reward/risk controlled by the experimenter remains unknown. Our model thus provides a quantitative framework for studying this relationship between internal reward mechanisms and external physical reward.

The proposed model demonstrates how the monkey's choices in the random dots task can be interpreted as being optimal under the hypothesis of reward maximization. The reward maximization hypothesis has previously been used to explain behavioral data from conditioning experiments [8] and dopaminergic responses under the framework of temporal difference (TD) learning [42]. Our model extends these results to the more general problem of decision making under uncertainty. The model predicts psychometric and chronometric functions that are quantitatively close to those observed in monkeys and humans solving the random dots task.

We showed through analytical derivations and numerical simulation that the optimal threshold for selecting overt actions is a declining function of time. Such a collapsing decision bound has previously been obtained for decision making under a deadline [11], [29]. It has also been proposed as an ad-hoc mechanism in drift diffusion models [28], [30], [43] for explaining finite response time at zero percent coherence. Our results demonstrate that a collapsing bound emerges naturally as a consequence of reward maximization. Additionally, the POMDP model readily generalizes to the case of decision making with arbitrary numbers of states and actions, as well as time-varying state.

Instead of traditional dynamic programming techniques, the optimal policy $\pi^*$ and value $V^*$ can also be learned via Monte Carlo approximation-based methods such as temporal difference (TD) learning [27]. There is much evidence suggesting that the firing rates of midbrain dopaminergic neurons might represent the reward prediction error in TD learning. Thus, the learning of value and policy in the current model could potentially be implemented in a manner similar to previous TD learning models of the basal ganglia [8], [9], [11], [42].

Acknowledgments

The authors would like to thank Timothy Hanks, Roozbeh Kiani, Luke Zettlemoyer, Abram Friesen, Adrienne Fairhall and Mike Shadlen for helpful comments.

Funding Statement

This project was supported by the National Science Foundation (NSF) Center for Sensorimotor Neural Engineering (EEC-1028725), NSF grant 0930908, Army Research Office (ARO) award W911NF-11-1-0307, and Office of Naval Research (ONR) grant N000140910097. YH is a Howard Hughes Medical Institute International Student Research fellow. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Knill D, Richards W (1996) Perception as Bayesian Inference. Cambridge: Cambridge University Press.
  • 2. Zemel RS, Dayan P, Pouget A (1998) Probabilistic interpretation of population codes. Neural Computation 10.
  • 3. Rao RPN, Olshausen BA, Lewicki MS (2002) Probabilistic Models of the Brain: Perception and Neural Function. Cambridge, MA: MIT Press.
  • 4. Rao RPN (2004) Bayesian computation in recurrent neural circuits. Neural Computation 16: 1–38.
  • 5. Ma WJ, Beck JM, Latham PE, Pouget A (2006) Bayesian inference with probabilistic population codes. Nature Neuroscience 9: 1432–1438.
  • 6. Doya K, Ishii S, Pouget A, Rao RPN (2007) Bayesian Brain: Probabilistic Approaches to Neural Coding. Cambridge, MA: MIT Press.
  • 7. Daw ND, Courville AC, Touretzky D (2006) Representation and timing in theories of the dopamine system. Neural Computation 18: 1637–1677.
  • 8. Dayan P, Daw ND (2008) Decision theory, reinforcement learning, and the brain. Cognitive, Affective and Behavioral Neuroscience 8: 429–453.
  • 9. Bogacz R, Larsen T (2011) Integration of reinforcement learning and optimal decision making theories of the basal ganglia. Neural Computation 23: 817–851.
  • 10. Law CT, Gold JI (2009) Reinforcement learning can account for associative and perceptual learning on a visual-decision task. Nat Neurosci 12: 655–663.
  • 11. Rao RPN (2010) Decision making under uncertainty: A neural model based on POMDPs. Frontiers in Computational Neuroscience 4.
  • 12. Kaelbling LP, Littman ML, Cassandra AR (1998) Planning and acting in partially observable stochastic domains. Artificial Intelligence 101: 99–134.
  • 13. Drugowitsch J, Moreno-Bote R, Churchland AK, Shadlen MN, Pouget A (2012) The cost of accumulating evidence in perceptual decision making. J Neurosci 32: 3612–3628.
  • 14. Shenoy P, Rao RPN, Yu AJ (2010) A rational decision-making framework for inhibitory control. Advances in Neural Information Processing Systems (NIPS) 23. Available: http://www.cogsci.ucsd.edu/~ajyu/Papers/nips10.pdf. Accessed 2012 Dec 24.
  • 15. Shenoy P, Yu AJ (2012) Rational impatience in perceptual decision-making: a Bayesian account of the discrepancy between two-alternative forced choice and go/nogo behavior. Advances in Neural Information Processing Systems (NIPS) 25. Cambridge, MA: MIT Press.
  • 16. Huang Y, Friesen AL, Hanks TD, Shadlen MN, Rao RPN (2012) How prior probability influences decision making: A unifying probabilistic model. Advances in Neural Information Processing Systems (NIPS) 25. Cambridge, MA: MIT Press.
  • 17. Shadlen MN, Newsome WT (2001) Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology 86.
  • 18. Newsome WT, Britten KH, Movshon JA (1989) Neuronal correlates of a perceptual decision. Nature 341: 52–54.
  • 19. Salzman CD, Britten KH, Newsome WT (1990) Cortical microstimulation influences perceptual judgements of motion direction. Nature 346: 174–177.
  • 20. Britten KH, Shadlen MN, Newsome WT, Movshon JA (1992) The analysis of visual motion: a comparison of neuronal and psychophysical performance. J Neurosci 12: 4745–4765.
  • 21. Shadlen MN, Newsome WT (1996) Motion perception: seeing and deciding. Proc Natl Acad Sci 93: 628–633.
  • 22. Wang XJ (2002) Probabilistic decision making by slow reverberation in cortical circuits. Neuron 36: 955–968.
  • 23. Mazurek ME, Roitman JD, Ditterich J, Shadlen MN (2003) A role for neural integrators in perceptual decision-making. Cerebral Cortex 13: 1257–1269.
  • 24. Beck JM, Ma W, Kiani R, Hanks TD, Churchland AK, et al. (2008) Probabilistic population codes for Bayesian decision making. Neuron 60: 1142–1145.
  • 25. Britten KH, Shadlen MN, Newsome WT, Movshon JA (1993) Responses of neurons in macaque MT to stochastic motion signals. Vis Neurosci 10(6): 1157–1169.
  • 26. Casella G, Berger R (2001) Statistical Inference, 2nd edition. Pacific Grove, CA: Duxbury Press.
  • 27. Sutton RS, Barto AG (1998) Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
  • 28. Latham PE, Roudi Y, Ahmadi M, Pouget A (2007) Deciding when to decide. Soc Neurosci Abstracts 740.
  • 29. Frazier P, Yu A (2008) Sequential hypothesis testing under stochastic deadlines. Advances in Neural Information Processing Systems 20: 465–472.
  • 30. Churchland AK, Kiani R, Shadlen MN (2008) Decision-making with multiple alternatives. Nat Neurosci 11: 693–702.
  • 31. Roitman JD, Shadlen MN (2002) Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. Journal of Neuroscience 22(21): 9475–9489.
  • 32. Luce RD (1986) Response Times: Their Role in Inferring Elementary Mental Organization. Oxford: Oxford University Press.
  • 33. Hanks TD, Mazurek ME, Kiani R, Hopp E, Shadlen MN (2011) Elapsed decision time affects the weighting of prior probability in a perceptual decision task. Journal of Neuroscience 31: 6339–6352.
  • 34. Cisek P, Puskas G, El-Murr S (2009) Decisions in changing conditions: The urgency-gating model. Journal of Neuroscience 29: 11560–11571.
  • 35. Palmer J, Huk AC, Shadlen MN (2005) The effects of stimulus strength on the speed and accuracy of a perceptual decision. Journal of Vision 5: 376–404.
  • 36. Bogacz R, Brown E, Moehlis J, Hu P, Holmes P, et al. (2006) The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced choice tasks. Psychological Review 113: 700–765.
  • 37. Lai TL (1988) Nearly optimal sequential tests of composite hypotheses. The Annals of Statistics 16(2): 856–886.
  • 38. Frazier PL, Yu AJ (2007) Sequential hypothesis testing under stochastic deadlines. In: Advances in Neural Information Processing Systems 20. Cambridge, MA: MIT Press.
  • 39. Yu A, Cohen J (2008) Sequential effects: Superstition or rational behavior? In: Advances in Neural Information Processing Systems 21: 1873–1880.
  • 40. Ghose GM, Maunsell JHR (2002) Attentional modulation in visual cortex depends on task timing. Nature 419(6907): 616–620.
  • 41. Ludwig CJH (2009) Temporal integration of sensory evidence for saccade target selection. Vision Research 49: 2764–2773.
  • 42. Schultz W, Dayan P, Montague PR (1997) A neural substrate of prediction and reward. Science 275: 1593–1599.
  • 43. Ditterich J (2006) Stochastic models and decisions about motion direction: Behavior and physiology. Neural Networks 19: 981–1012.
