eLife. 2020 Apr 15;9:e49834. doi: 10.7554/eLife.49834

Reinforcement biases subsequent perceptual decisions when confidence is low, a widespread behavioral phenomenon

Armin Lak 1,2,, Emily Hueske 3,4,5, Junya Hirokawa 6,7, Paul Masset 3,6,8, Torben Ott 6,9, Anne E Urai 6,10, Tobias H Donner 10, Matteo Carandini 2, Susumu Tonegawa 4,11, Naoshige Uchida 3, Adam Kepecs 6,9,
Editors: Emilio Salinas12, Michael J Frank13
PMCID: PMC7213979  PMID: 32286227

Abstract

Learning from successes and failures often improves the quality of subsequent decisions. Past outcomes, however, should not influence purely perceptual decisions after task acquisition is complete since these are designed so that only sensory evidence determines the correct choice. Yet, numerous studies report that outcomes can bias perceptual decisions, causing spurious changes in choice behavior without improving accuracy. Here we show that the effects of reward on perceptual decisions are principled: past rewards bias future choices specifically when previous choice was difficult and hence decision confidence was low. We identified this phenomenon in six datasets from four laboratories, across mice, rats, and humans, and sensory modalities from olfaction and audition to vision. We show that this choice-updating strategy can be explained by reinforcement learning models incorporating statistical decision confidence into their teaching signals. Thus, reinforcement learning mechanisms are continually engaged to produce systematic adjustments of choices even in well-learned perceptual decisions in order to optimize behavior in an uncertain world.

Research organism: Human, Mouse, Rat

Introduction

Learning from the outcomes of decisions can improve subsequent decisions and yield greater success. For instance, to find the best meal on a busy street where restaurants often change menus, one needs to sample food frequently and learn. Humans and other animals efficiently learn from past rewards and choose actions that have recently led to the best rewards (Daw et al., 2006; Lee et al., 2012; Samejima et al., 2005; Tai et al., 2012). In addition to evaluating past rewards, decision making often requires consideration of present perceptual signals; the restaurants’ signs along the busy street might be too far away and too faded to be trusted. Therefore, good decisions ought to take into account both current sensory evidence and the prior history of successes and failures.

Decisions guided by the history of rewards can be studied in a reinforcement learning framework (Sutton and Barto, 1998). Perceptual decisions, on the other hand, have been classically conceptualized within a statistical, psychometric framework (Green and Swets, 1966). Although statistical decision theory and reinforcement learning provide two largely distinct frameworks for studying decisions, we are often challenged by both limits in our perception as well as limits in learning from past rewards. For sensory decisions, classical psychometric analysis estimates three fundamental variables that determine the quality of choices: the bias, lapse rate and sensitivity (Green and Swets, 1966; Wichmann and Hill, 2001). When bias and lapse rates are negligible and sensitivity has reached its maximum over time, then fluctuations in decisions are solely attributed to the noise in the perceptual processing. Under these assumptions, incorrect decisions are caused by perceptual noise creating imperfect percepts. Here, we show a systematic deviation from the assumption of no learning during well-trained perceptual decisions: past rewards bias perceptual choices specifically when the previous stimulus was difficult to judge, and the confidence in obtaining the reward was low.

In laboratory perceptual decision-making paradigms, there is typically no overt learning after task acquisition is complete. Nevertheless, several studies have shown that past rewards, actions, and stimuli can appreciably influence subsequent perceptual choices (Abrahamyan et al., 2016; Akaishi et al., 2014; Akrami et al., 2018; Braun et al., 2018; Busse et al., 2011; Cho et al., 2002; Fan et al., 2018; Fischer and Whitney, 2014; Fritsche et al., 2017; Fründ et al., 2014; Gold et al., 2008; Hwang et al., 2017; Lueckmann et al., 2018; Luu and Stocker, 2018; Marcos et al., 2013; Tsunada et al., 2019; Urai et al., 2017). Some of these observations support the view that simple forms of reward-based learning are at work during asymptotic perceptual performance. For instance, subjects might repeat the previously rewarded choice or avoid it after an unsuccessful trial (Abrahamyan et al., 2016; Busse et al., 2011; Tsunada et al., 2019; Urai et al., 2017). However, these types of choice biases appear suboptimal and might reflect simple heuristics. Thus, the extent to which choice biases in perceptual decisions can be expected from normative considerations in reinforcement learning has been unclear. Perhaps the most prominent prediction of reinforcement learning under perceptual uncertainty is that the strength of sensory evidence (i.e. confidence in the accuracy of a decision) should modulate how much is learned from the outcome of a decision (Lak et al., 2017; Lak et al., 2019). Outcomes of easy decisions are highly predictable, and thus there is little to be learned from such decisions. In contrast, outcomes of difficult, low-confidence decisions provide the greatest opportunity to learn and adjust subsequent decisions (Lak et al., 2017; Lak et al., 2019). These considerations lead to the hypothesis that decision confidence regulates trial-by-trial biases in perceptual choices.

Here, we demonstrate that well-trained perceptual decisions can be systematically biased based on previous outcomes in addition to current sensory evidence. We show that these outcome-dependent biases depend on the strength of past sensory evidence, suggesting that they are consequences of confidence-guided updating of choice strategy. We demonstrate that this form of choice updating is a widespread behavioral phenomenon that can be observed across various perceptual decision-making paradigms in mice, rats and humans. This trial-to-trial choice bias was also present in different sensory modalities and transferred across modalities in an interleaved auditory/olfactory choice task. To explain these observations, we present a class of reinforcement learning models and Bayesian classifiers that adjust learning based on the statistical confidence in the accuracy of previous decisions.

Results

Perceptual decisions are systematically updated by past rewards and past sensory stimuli

To investigate how the history of rewards and stimuli influences subsequent perceptual decisions, we began with an olfactory decision task (Figure 1a). Rats were trained on a two-alternative choice olfactory decision task (Uchida and Mainen, 2003). Two primary odors were associated with rewards at left and right choice ports, and mixtures (morphs) of these odors were rewarded according to a categorical boundary (50:50 mixture; Figure 1a). To manipulate perceptual uncertainty, we varied the odor mixtures, that is, the ratio of odors A and B, in a trial-by-trial manner, testing mixtures 100:0, 80:20, 65:35, 55:45, 45:55, 35:65, 20:80 and 0:100. Rats showed near-perfect performance for easy mixtures and made errors more frequently for difficult mixtures (Figure 1a, bottom panel).

Figure 1. Rats update their trial-by-trial perceptual choice strategy in a stimulus-dependent manner.

(a) Top: Schematic of a 2AFC olfactory decision-making task for rats. Bottom: Average performance of an example rat. (b) Following learning, the psychometric curves showed minimal fluctuations across test sessions. Bias, sensitivity and lapse were measured for each test session. (c) After successful completion of a trial, rats tended to shift their choice toward the previously rewarded side. Left and right panels illustrate an example animal and the population average. (d) Schematic of the analysis procedure for computing conditional psychometric curves and updating plots. Left: Black curve shows the overall psychometric curve and the green curve shows the curve computed only after trials with 48% odor A (i.e. conditional on the stimulus (48% A) in the previous trial). Middle: Each point in the heatmap indicates the vertical difference between data points of the conditional psychometric curve and the overall psychometric curve. Red and purple boxes indicate data points which are averaged to compute the data points shown in the rightmost plot. Right: Updating averaged across current easy trials (in this case the easiest two stimulus levels) and current difficult trials. (e) Performance of the example rat (left) and population (right) computed separately based on the quality of the olfactory stimulus (mixtures shown as colors from blue to green) in the previously rewarded trial. After successful completion of a trial, rats tended to shift their choices towards the previously rewarded side, but only when the previous trial was difficult. (f) Choice updating, that is the size of the shift of the psychometric curve relative to the average psychometric curve, as a function of sensory evidence in the previously rewarded trial and in the current trial. Positive numbers refer to a bias towards choice A and negative numbers refer to a bias toward the alternative choice. The left and right plots refer to the example rat and population, respectively. (g) Choice updating as a function of previous stimulus, separated for current easy (square) and difficult (circle) trials. These plots represent averages across the graphs presented in f.


Figure 1—figure supplement 1. Left: Performance of population of rats (n=16) computed from trials in which the previous stimulus was difficult (45% odor A, 55% odor B), separated based on whether the previous choice was rewarded (correct) or unrewarded (error).


Right: Performance of population of rats (n=16) computed from trials in which the previous stimulus was easy (20% odor A, 80% odor B), separated based on whether the previous choice was rewarded (correct) or unrewarded (error).

After task learning, rats showed stable behavior across testing sessions (Figure 1b). To quantify behavioral performance and stability across sessions, we fitted choice behavior with psychometric functions that included parameters for sensitivity (reflecting perceptual noise), bias (the tendency to take one specific action) and lapse rate (stimulus-independent occasional errors possibly reflecting attentional or learning deficits) (Figure 1a, b). We found that in well-trained rats, the bias was near zero (2 ± 4.6% odor A, mean ± SD; p=0.08, signed rank test). Likewise, lapse rates were low (3 ± 4%), indicating that for easy stimuli rats’ performance was near perfect and not substantially degraded by attention or incomplete learning. Lapse rate, sensitivity and bias remained stable over sessions, indicating that rats reached asymptotic performance (Figure 1b, 14/16 rats, p>0.1, linear regression).
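For illustration, the sketch below shows one standard way to fit such a lapse-augmented psychometric function by maximum likelihood. It is a minimal Python example with a cumulative-Gaussian form and parameter names chosen by us; it is not the fitting code used for the data.

```python
# Illustrative sketch (not the authors' fitting code): a lapse-augmented
# cumulative-Gaussian psychometric function with bias, sensitivity and lapse,
# fit by maximum likelihood to binary choices.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def psychometric(stim, bias, sensitivity, lapse):
    """P(choose A) as a function of % odor A (stim in [0, 100])."""
    return lapse / 2 + (1 - lapse) * norm.cdf(sensitivity * (stim - 50 - bias))

def fit_psychometric(stim, choice_A):
    """Fit bias, sensitivity and lapse by maximizing the Bernoulli likelihood."""
    def nll(params):
        bias, sensitivity, lapse = params
        p = np.clip(psychometric(stim, bias, sensitivity, lapse), 1e-6, 1 - 1e-6)
        return -np.sum(choice_A * np.log(p) + (1 - choice_A) * np.log(1 - p))
    res = minimize(nll, x0=[0.0, 0.1, 0.02],
                   bounds=[(-50, 50), (1e-3, 5), (0, 0.5)])
    return dict(zip(["bias", "sensitivity", "lapse"], res.x))

# Example with synthetic choices at the mixture levels used in the task
rng = np.random.default_rng(0)
stim = rng.choice([0, 20, 35, 45, 55, 65, 80, 100], size=2000)
choice_A = rng.random(2000) < psychometric(stim, bias=2, sensitivity=0.15, lapse=0.03)
print(fit_psychometric(stim, choice_A))
```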

Despite the stable, asymptotic performance for easy decisions, previous trials had substantial effects on subsequent choices (Figure 1c). To assess the effects of reinforcement on perceptual decisions, we calculated conditional psychometric functions. We first considered the effect of the previous choice (Figure 1c). Psychometric functions were systematically biased by previously rewarded choices: after correct leftward choices, rats tended to choose left again; conversely, following correct rightward choices, animals made rightward choices more often (F = 29.8, p=0.001, 2-way ANOVA).

The effects of the previous decision on subsequent choices also depended on the difficulty of the previous choice (Figure 1d-g). We computed psychometric functions after correct (and hence rewarded) trials separately for different stimuli of the previous trial (Figure 1d). The resulting psychometric functions were systematically biased towards the recently rewarded side for difficult decisions (Figure 1e). Rats tended to repeat their previous choices particularly when they had succeeded in correctly categorizing a challenging odor mixture and earning reward (Figure 1e). We quantified the magnitude of this choice bias for each pair of current and previous stimuli (Figure 1d, see Materials and methods). To do so, we subtracted the average psychometric curve (computed from all trials) from each psychometric curve computed conditional on the specific previous correct stimulus, and plotted the size and sign of this difference (positive: bias to choose A; negative: bias to choose B) (Figure 1d,f). To summarize choice biases, we then averaged these differences across trials in which the current choice was easy or difficult (Figure 1d,g). The magnitude of this choice bias was proportional to the difficulty of the previous decision, in addition to the difficulty of the current decision (Figure 1f). Updating was minimal when the current stimulus was easy, regardless of the difficulty of the previous decision (Figure 1g, squares). This is because the data points overlap when the current stimulus is easy, and hence the distance between them is close to zero (Figure 1d,e). When the current stimulus was difficult, updating was also minimal after correct easy choices, whereas it was strong following correct difficult choices (Figure 1g, circles, p=0.0002, rank sum test). Thus, when the current sensory evidence was strong, it determined the choice, without detectable effects of the previous trial (Figure 1g, squares). However, when the sensory evidence in the current trial was weak, the previous reward influenced choices only if the reward had been earned in a difficult trial (Figure 1g, circles). The difficulty of the previous decision influenced neither the slope (sensitivity) nor the lapse rate of psychometric curves in the next trial (p>0.1, rank sum test). Additionally, plotting the psychometric curves conditional on the specific stimulus in the previous trial but separated according to the outcome of the previous trial (correct vs error) further illustrated that past outcomes influence subsequent choices particularly when the previous choice was difficult (Figure 1—figure supplement 1). Together, these observations indicate that the effects of past rewards on perceptual choices depend on the difficulty of the previous perceptual judgment.
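As a concrete illustration of this procedure, the following sketch (our variable and function names, not the authors' analysis code) computes, for each stimulus of the previous rewarded trial, the conditional psychometric curve on the following trial and subtracts the overall psychometric curve, yielding an updating matrix analogous to Figure 1f.

```python
import numpy as np

def updating_matrix(prev_stim, prev_correct, cur_stim, cur_choice_A, stim_levels):
    """Entry [i, j] is P(choose A | current stim j, previous trial had stim i and
    was rewarded) minus the unconditional P(choose A | current stim j).
    All inputs are arrays aligned over consecutive trial pairs (previous, current)."""
    overall = np.array([cur_choice_A[cur_stim == s].mean() for s in stim_levels])
    mat = np.full((len(stim_levels), len(stim_levels)), np.nan)
    for i, ps in enumerate(stim_levels):
        rewarded = (prev_stim == ps) & prev_correct      # previous trial: stimulus ps, rewarded
        for j, cs in enumerate(stim_levels):
            sel = rewarded & (cur_stim == cs)
            if sel.any():
                mat[i, j] = cur_choice_A[sel].mean() - overall[j]
    return mat   # positive: bias toward choice A; negative: bias toward choice B

# Typical usage with trial-wise arrays stim, correct, choice_A (trial t vs trial t+1):
# mat = updating_matrix(stim[:-1], correct[:-1], stim[1:], choice_A[1:], np.unique(stim))
```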

Choice updating is not due to slow drifts in choice side bias

The results so far demonstrate that previous rewards influence subsequent perceptual decisions, especially when the previous decision was difficult. One possibility is that these behavioral effects arise as a byproduct of slow fluctuations in side bias, which cause correlations across consecutive trials and hence systematic shifts in choices. This scenario can be illustrated within a signal detection theory (SDT) framework. In SDT, the perceived stimulus is compared to a decision boundary and produces a correct choice when the stimulus falls on the appropriate side of the boundary (Figure 2a). When the decision boundary is fixed, there is no apparent updating, as expected (Figure 2—figure supplement 1a, b). However, simulating a slowly drifting decision boundary reveals systematic effects of previous trials on subsequent choices, because the drifting boundary induces correlations across trials, producing apparent choice updating based on the previous trials (Figure 2b; Figure 2—figure supplement 1c, d). For instance, if the decision boundary slowly drifts to the left side, then rightward choices will be more frequent and occur in succession, producing a shift in the psychometric curve (Figure 2b). Importantly, this effect is independent of decision outcomes and is observed in a sequence of trials, both before and after a rewarded trial (Figure 2b). An alternative, more intriguing, scenario for explaining our results is an active learning process that produces trial-by-trial adjustment of the decision boundary. If the decision boundary is adjusted in a trial-by-trial manner according to the outcome of the previous trial, psychometric shifts will be observed in the next trial contingent on the past reward, but not in the preceding trial (Figure 2c). It is thus critical to remove slow fluctuations of side bias before concluding that psychometric shifts are signatures of an active learning process.
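The drifting-boundary scenario can be made concrete with a few lines of simulation. The sketch below (our parameter choices, not those used for Figure 2—figure supplement 1) draws noisy percepts around the true stimulus and lets the category boundary follow a slow random walk; conditioning psychometric curves on the previous trial then shows apparent updating even though no outcome-driven learning takes place.

```python
# Sketch of a signal-detection simulation with a slowly drifting decision boundary
# (our parameter choices). The drift is outcome-independent, yet it correlates
# choices across neighboring trials and mimics choice updating.
import numpy as np

rng = np.random.default_rng(4)
n_trials, sigma, drift_sd = 20000, 0.15, 0.01
stimuli = np.array([0.0, 0.2, 0.35, 0.45, 0.55, 0.65, 0.8, 1.0])   # fraction of odor A

boundary = 0.5
stim = np.empty(n_trials)
choice_A = np.empty(n_trials, bool)
for t in range(n_trials):
    boundary += drift_sd * rng.normal()              # slow, outcome-independent drift
    stim[t] = rng.choice(stimuli)
    percept = stim[t] + sigma * rng.normal()         # noisy percept of the mixture
    choice_A[t] = percept > boundary                 # choose A if percept exceeds the boundary
correct = choice_A == (stim > 0.5)
# Conditioning psychometric curves on the previous trial now shows apparent
# 'updating', which the t-1 vs t+1 subtraction described below removes.
```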

Figure 2. Choice updating is not due to slow and nonspecific drift in response bias.

(a) Signal detection theory-inspired schematic of task performance. The psychometric curve illustrates the average choice behavior. (b) Slow, non-specific drift in choice bias, visualized here as drift in the decision boundary, can lead to shifts in psychometric curves that persist for several trials and are not specific to the stimulus and outcome of the previous trial. This global bias effect is cancelled when subtracting the psychometric curve of trial t-1 (orange) from that of trial t+1 (brown). (c) Trial-by-trial updating of the decision boundary shifts psychometric curves depending on the outcome and perceptual difficulty of the preceding trial. Subtracting psychometric curves does not cancel this effect. (d) Choice bias of the example rat following a rewarded trial. (e) Similar to d but for the population. (f) Choice bias of the example rat in the trial preceding the current trial, reflecting the global non-specific bias visualized in b. (g) Similar to f but for the population. (h) Subtracting the choice bias in trial t-1 from that in trial t+1 reveals the trial-by-trial choice updating in the example rat. (i) Similar to h but for the population. See Figure 2—figure supplement 1 for details of the normalization procedure.


Figure 2—figure supplement 1. Isolation and correction of slowly drifting non-specific choice bias.


(a,b) A simple signal detection theory-based simulation with a fixed decision boundary. In this model, stimuli are drawn from a normal distribution and compared to a fixed decision boundary (50%) to compute the choice. This model generates psychometric curves that do not depend on the previous trial (left panel in a) and hence no updating is observed (middle and right panels in a). Our normalization (explained in e) does not influence updating in this model, as shown in b. (c,d) A signal detection theory-based simulation using a slowly drifting decision boundary. Psychometric curves appear to depend on the previous trial (left panel in c), resulting in an apparent updating effect (middle and right panels in c). However, this effect is removed after applying our normalization, as shown in d. (e) The normalization procedure for isolating trial-by-trial updating. The upper row middle panel shows the performance for two stimulus levels (48 and 52%) which were both rewarded, hence the delta function. The upper row left panel shows the psychometric curves separately for trials followed by 48% or 52% stimuli. Any separation between these curves indicates a side bias that extends beyond a single trial. The upper right panel shows psychometric curves computed separately based on whether the stimulus in trial t was 48 or 52%. The full conditional psychometric curves in trials t-1 and t and in trials t and t+1 were used to compute the heatmaps (middle row). The heatmap of t-1 was subtracted from the heatmap of t+1 to compute the normalized trial-by-trial updating (bottom row).

We next asked to what extent the trial history effects we observed reflect correlations across consecutive trials due to a slowly fluctuating bias. To do so, we devised a model-independent analysis to identify and remove slow fluctuations of side bias (Figure 2—figure supplement 1). While it is possible to formulate model-based analyses to correct for slow biases, there are numerous possibilities that could produce similar fluctuations. Therefore, we sought a model-independent technique, reasoning that slow fluctuations are, by definition, slower than one trial, and hence should have largely similar impact across adjacent trials. Specifically, slow fluctuations will produce similar biases one trial before and one trial after a given decision outcome. This assumption leads to a simple strategy to correct for possible slow drifts and isolate psychometric curve shifts due to active processes: subtracting the psychometric shift between trials t and t-1 from that between trials t+1 and t removes the effect of slow response bias, that is, slow boundary drift. Importantly, applying this normalization to the SDT model with a drifting decision boundary removes the apparent but artefactual dependence of decisions on previous trials (Figure 2—figure supplement 1d). This subtraction technique thus provides an estimate of how the current trial influences choices in the next trial. Another intuition for this analysis is that future rewards cannot influence past choices, and therefore any systematic dependence of psychometric curves on subsequent trials cannot reflect causal mechanisms and needs to be adjusted for. We thus define ‘choice updating’ as a trial-by-trial bias beyond slowly fluctuating and non-specific side biases.
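The subtraction can be written compactly as follows (a minimal sketch with our own helper names): the conditional bias measured one trial before a rewarded trial captures only the slow drift, so subtracting it from the bias measured one trial after isolates the trial-by-trial component.

```python
import numpy as np

def cond_curve(sel, stim, choice_A, stim_levels):
    """Psychometric curve (P(choose A) per stimulus level) restricted to trials where sel is True."""
    return np.array([choice_A[sel & (stim == s)].mean() if (sel & (stim == s)).any() else np.nan
                     for s in stim_levels])

def normalized_shift(stim, correct, choice_A, stim_levels, prev_level):
    """Shift of the next-trial curve minus shift of the previous-trial curve, both
    conditioned on a rewarded trial with stimulus prev_level; slow drift cancels."""
    overall = cond_curve(np.ones(len(stim), bool), stim, choice_A, stim_levels)
    rewarded = (stim == prev_level) & correct
    post = cond_curve(rewarded[:-1], stim[1:], choice_A[1:], stim_levels) - overall
    pre = cond_curve(rewarded[1:], stim[:-1], choice_A[:-1], stim_levels) - overall
    return post - pre
```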

We found that the difficulty of previous choices had a strong effect on subsequent choices in rat olfactory discriminations even after correcting for slow fluctuations in the choice bias (Figure 2d–i). We computed the psychometric curves conditional on stimuli and outcomes, for both the next trial (Figure 2d,e) and also the previous trial (Figure 2f,g). Choice biases tended to be larger when considering the next trial compared to the previous trial. The difference between these provides an estimate of choice updating that is due to the most recent reward and without slow and non-outcome specific fluctuations in side bias (Figure 2h,i). The choice updating effect remained statistically significant even after this correction (p=0.003, rank sum test). These results rule out the possibility that psychometric shifts are only due to slow drift in side bias and indicate that reward received in the past trial influences subsequent perceptual decisions specifically if the sensory evidence in the previous trial was uncertain and difficult to judge.

Belief-based reinforcement learning models account for choice updating

We next considered what types of reinforcement learning processes could account for the observed choice updating effects. Reinforcement learning models have long been used to study how choices are influenced by past decisions and rewards (Daw and Doya, 2006; Sutton and Barto, 1998). A key distinction among RL model variants is whether and how they treat ambiguous sensory signals during state inference and prediction error computation.

We show that a reinforcement learning model with a belief state representing ambiguous perceptual stimuli accounts for choice updating (Figure 3a,b). A reinforcement learning model for our behavioral task must consider the inherent perceptual ambiguity of sensory decisions in addition to tracking reward outcomes. The normative way to cope with such ambiguity about state representations is to introduce a partially observable Markov decision process (POMDP) framework for the temporal difference RL (TDRL) algorithm (Dayan and Daw, 2008; Lak et al., 2017; Lak et al., 2019; Rao, 2010). POMDPs capture the intuitive notion that when perception is ambiguous, the model needs to estimate the current perceptual experience as a ‘belief state’, which expresses state uncertainty as a probability distribution over states.

Figure 3. Belief-based reinforcement learning model accounts for choice updating.

(a) Left: schematic of the temporal difference reinforcement learning (TDRL) model that includes a belief state reflecting perceptual decision confidence. Right: predicted values and reward prediction errors of the model. After receiving a reward, reward prediction errors depend on the difficulty of the choice and are largest after a hard decision. Reward prediction errors of this model are sufficient to replicate our observed choice updating effect. (b) Choice updating of the model shown in a. This effect can be observed even after correcting for non-specific drifts in the choice bias (right panel). The model in all panels had σ² = 0.2 and α = 0.5. (c) A TDRL model that follows a Markov decision process (MDP) and does not incorporate decision confidence into prediction error computation produces choice updating that is largely independent of the difficulty of the previous decision. (d) An MDP TDRL model that includes slow non-specific drift in choice bias fails to produce true choice updating. The normalization removes the effect of drift in the choice bias, but leaves the difficulty-independent effect of past reward. (e) An MDP TDRL model that includes a win-stay-lose-switch strategy fails to produce true choice updating. For this simulation, the win-stay-lose-switch strategy is applied to 10% of randomly-selected trials. See Figure 3—figure supplement 1 and the Materials and methods for further details of the models.


Figure 3—figure supplement 1. Further characteristics of the confidence-dependent TDRL model and the MDP TDRL model.


(a) A confidence-dependent TDRL model that uses a softmax for computing choices produces confidence-dependent updating similar to the model run that uses argmax for choice computation. (b) Confidence-dependent choice updating is stronger after two rewarded difficult trials (left), consistent with the model predictions (right). The left panel shows the absolute size of choice updating computed after one rewarded difficult choice (black) and after two rewarded difficult choices to the same choice side (light red) (n=16 rats). The right panel shows the size of updating after one and after two rewarded difficult choices in the model. (c) The stored values of actions converge to different quantities in the confidence-dependent model and the MDP TDRL model. The stored value of left actions, averaged over 1000 model runs, is shown (the results would be the same for right actions). In both models, the size of the delivered reward in correct trials was 1. (d) The difference in the prediction errors of the confidence-dependent model and the MDP TDRL model. The prediction errors in the confidence-dependent model result in choice updating in the next trial.

Previously, we showed that such a model is analogous to a TDRL model that uses statistical decision confidence, the conditional probability of obtaining reward given the choice, to scale prediction errors (Lak et al., 2017). We reasoned that this model could also account for how the uncertainty of past perceptual decisions influences learning and the updating of subsequent perceptual choices. In the model, rewards received after difficult, low-confidence choices lead to large reward prediction errors and hence strong updating of decision values in the next trial (Figure 3a, b). The belief state in the model reflects the subject’s internal representation of a stimulus, which in the case of our 2AFC task is the probability that the stimulus belongs to the left or right category ($p_L$ and $p_R$). To estimate these probabilities, the model assumes that the internal estimate of the stimulus, $\hat{s}$, is normally distributed with constant variance around the true stimulus: $p(\hat{s} \mid s) = \mathcal{N}(\hat{s}; s, \sigma^2)$. The estimate $\hat{s}$ parameterizes a belief distribution over all possible values of $s$ that are consistent with the sensory evidence, given by Bayes’ rule:

$$p(s \mid \hat{s}) = \frac{p(\hat{s} \mid s)\, p(s)}{p(\hat{s})}$$

Assuming that the prior belief about $s$ is uniform, the optimal belief is also Gaussian, with the same variance as the sensory noise distribution and mean given by $\hat{s}$: $p(s \mid \hat{s}) = \mathcal{N}(s; \hat{s}, \sigma^2)$.

From this, the agent computes a belief, that is, the probability that the stimulus belongs to the right-side category, $p_R = p(s > 0 \mid \hat{s})$, according to:

$$p_R(\hat{s}) = \int_{0}^{\infty} p(s \mid \hat{s})\, ds$$

where $p_R$ represents the trial-by-trial probability of the stimulus belonging to the right side and $p_L = 1 - p_R$ similarly represents the probability of it belonging to the left. Multiplying these probabilities by the learned action values of left and right, $V_L$ and $V_R$, gives the expected values of the left and right choices: $Q_L = p_L V_L$ and $Q_R = p_R V_R$. The higher of these two determines the choice $C$ (either L or R), its associated confidence $p_C$, and its predicted value $Q_C = p_C V_C$. Note that although the choice computation is deterministic, the same stimulus can produce left or right choices because of fluctuations in the percept, that is, randomized trial-to-trial variation around the stimulus identity (Figure 3—figure supplement 1a). Following the choice outcome, the model learns by updating the value of the chosen action, $V_C \leftarrow V_C + \alpha \delta$, where $\alpha$ is the learning rate and $\delta = r - Q_C$ is the reward prediction error. Thus, in this model the prediction error computation has access to the belief state used for computing the choice, and hence the reward prediction error is scaled by the confidence in obtaining the reward (Lak et al., 2017; Lak et al., 2019). The largest positive prediction error occurs when receiving a reward after a difficult, low-confidence choice, whereas receiving a reward after an easy choice results in a small prediction error (Figure 3a, right). After training, the choices produced by this model exhibit confidence-guided updating (Figure 3b, left), similar to what we observed in the choices of rats. As in the data, the choice updating in this model persisted after accounting for possible slow fluctuations in the choice bias (Figure 3b, right). Note that RL models can produce correlations in choices across trials due to the correlation of stored values across trials, and hence it is important to evaluate the size of the updating effect after the normalization. An additional prediction of our model is that the updating effect should be slightly stronger when considering trials preceded by two (rather than one) rewarded difficult choices in the same direction (Figure 3—figure supplement 1b), which we also observed in rats’ choices (Figure 3—figure supplement 1b).
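The following minimal simulation sketches the agent described above (our implementation, using σ² = 0.2 and α = 0.5 as quoted for Figure 3; it is not the authors' simulation code).

```python
# Minimal sketch of the belief-state (confidence-dependent) TDRL agent described above.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma, alpha = np.sqrt(0.2), 0.5           # sensory noise (sigma^2 = 0.2) and learning rate
V = {"L": 0.5, "R": 0.5}                   # learned action values
stimuli = np.linspace(-0.5, 0.5, 8)        # signed stimulus (<0: category L, >0: category R)

for t in range(20000):
    s = rng.choice(stimuli)
    s_hat = s + sigma * rng.normal()                 # noisy internal estimate of the stimulus
    pR = 1.0 - norm.cdf(0.0, loc=s_hat, scale=sigma) # belief that the stimulus is category R
    pL = 1.0 - pR
    QL, QR = pL * V["L"], pR * V["R"]                # expected value of each choice
    choice = "R" if QR > QL else "L"
    p_c = pR if choice == "R" else pL                # decision confidence
    Q_c = p_c * V[choice]                            # confidence-weighted value prediction
    r = 1.0 if (choice == "R") == (s > 0) else 0.0   # reward only for the correct category
    delta = r - Q_c                                  # confidence-scaled reward prediction error
    V[choice] += alpha * delta                       # rewarded low-confidence (difficult) choices
                                                     # produce the largest value updates
```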

We next considered whether classical TDRL models that follow a Markov decision process (MDP) could also produce confidence-guided updating. Such a model is largely similar to the model described above, with one fundamental difference: the computation of the prediction error does not have access to the belief state. In other words, prediction errors are computed by comparing the outcome with the average value of the chosen action, without consideration of the belief in the accuracy of that action. In the model variant with two states (L and R), after learning, $V_L$ and $V_R$ reflect the average reward expectations for each choice and prediction errors are computed by comparing the outcome with this average expectation. Decisions made by this model show substantial effects of past reward (Figure 3c): after receiving a reward, the model has a tendency to make the same decision. However, compared to the belief-based model, this bias shows little dependence on the difficulty of the previous choice, and this dependence is absent after applying our normalization (Figure 3c, left). An extended version of this model that represents stimuli across multiple states also cannot reproduce confidence-dependent updating (see Materials and methods). Thus, MDP TDRL models (i.e. TDRL without a stimulus belief state) do not exhibit confidence-guided choice updating. We also considered whether an MDP TDRL model with a slowly fluctuating response side bias could show trial-by-trial choice updating (Figure 3d). A modified MDP TDRL model that includes a slowly drifting side bias term showed a substantial effect of past reward on choices, which was mildly dependent on the difficulty of the previous sensory judgement (Figure 3d, left). This dependence, however, vanished after normalizing the choices to account for the slowly drifting side bias (Figure 3d, right). This shows that a slowly fluctuating side bias in TDRL models without a belief state does not result in confidence-guided choice updating, despite apparent trial-to-trial fluctuations. The results also further confirm the effectiveness of our normalization procedure (i.e. subtracting the bias in the previous trial from that in the next trial) in isolating trial-by-trial confidence-guided choice updating. We also ruled out the possibility that choice updating reflected an elementary win-stay strategy (Figure 3e). The decisions of an MDP TDRL model that was modified to include a win-stay strategy (repeating the previously rewarded choice, with p=10% in this example) show strong dependence on past rewards. The normalization removes the dependence of updating size on choice difficulty that is due to correlations across choices. However, it does not remove the signatures of win-stay/lose-switch behavior (Figure 3e). Thus, in this model variant the effect of past rewards is independent of the difficulty of the previous choice, differing from the rat data in which the size of the bias induced by the previous reward is proportional to the difficulty of the previous decision.
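For comparison, the corresponding MDP-style update can be sketched in a few lines (our notation): the prediction error compares the outcome with the average value of the chosen action and is therefore insensitive to how difficult the preceding discrimination was.

```python
def mdp_td_update(V, choice, r, alpha=0.5):
    """One MDP TDRL update: V maps 'L'/'R' to average action values.
    The prediction error ignores the belief state (no confidence scaling)."""
    delta = r - V[choice]          # compare outcome with the average action value
    V[choice] += alpha * delta     # same-sized update for easy and difficult choices
    return delta
```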

The predictions of the confidence-dependent TDRL and MDP TDRL models differ in two principal ways. First, the learned values of actions ($V_L$ and $V_R$) converge to different values over learning (Figure 3—figure supplement 1c). In the confidence-dependent TDRL, $V_L$ and $V_R$ both converge to just below the true size of the reward. However, in the MDP TDRL model they converge to the average choice accuracy (average reward harvest), which is lower than the true reward size (Figure 3—figure supplement 1c). This difference emerges because in the belief-based model values are updated using prediction errors scaled by confidence. Confidence is relatively low on error trials, leading to small adjustments (reductions) of values after those trials, and hence the convergence of values to just below the true reward size (i.e. the reward value when the reward is given). This difference is important for understanding updating, in particular in the case of very easy correct trials. In the confidence-dependent model, the high confidence associated with a correct easy choice, together with the higher stored values, produces Q values similar to the true reward size, and hence near-zero prediction errors when receiving the reward. In the MDP model, however, outcomes are compared with the relatively low stored values (low compared to the reward value), and hence reward prediction errors persist even for very easy correct choices. The second major difference between the two models is how choice difficulty determines reward prediction errors and hence updating. While in the confidence-dependent model prediction errors depend on choice difficulty (Figure 3a; Figure 3—figure supplement 1d), in the MDP TDRL model the prediction errors do not reflect choice difficulty (Figure 3—figure supplement 1d, see Materials and methods). The prediction errors of the belief-based model thus result in graded levels of updating in the subsequent trial depending on the difficulty of the previous choice.

On-line learning in margin-based classifiers explains choice updating

We next considered whether another class of models, based on classifiers, could also explain choice updating. Perceptual decision-making processes produce discrete choices from ambiguous sensory evidence. This process can be modelled with classifiers that learn a boundary in the space of sensory evidence. For instance, in our olfactory decision task the evidence lies in a two-dimensional space spanned by the odor components A and B (Figure 4a). The hyperplane separating the choice options creates a decision boundary that determines whether each odor mixture is classified as A or B, and this boundary can be learned over trials. To apply these ideas to our decision tasks, we examine a probabilistic interpretation of Support Vector Machine (SVM) classifiers, a powerful technique from machine learning (see Materials and methods). SVMs learn classification hyperplanes that produce decision boundaries that are maximally far away from any data sample. Applying this framework to our particular decision problem requires an on-line implementation of a probabilistic interpretation of SVMs (Sollich, 2002). In SVM terminology, the distance of a sample x from the category boundary or hyperplane is called the ‘margin’ of the data point (orange arrow in Figure 4a). The size of the margin for a stimulus is proportional to the likelihood of that data point belonging to a class given the current classifier. After appropriate normalization, this model yields an estimate of the classification success for a given decision. Indeed, the Bayesian posterior of classification success is a normative definition of confidence (Hangya et al., 2016; Pouget et al., 2016).
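The sketch below conveys the idea in a deliberately simplified form; it is a perceptron-style stand-in with a sigmoid confidence read-out, not the Bayesian on-line SVM of Sollich (2002) used in the study. The boundary is moved after rewarded (correct) trials in proportion to one minus the estimated classification success, so confident easy classifications barely move it while rewarded low-confidence classifications move it the most.

```python
# Simplified illustrative stand-in for a confidence-weighted on-line classifier
# (not the exact Bayesian on-line SVM): boundary updates are scaled by the
# estimated probability of misclassification (1 - confidence).
import numpy as np

rng = np.random.default_rng(2)
w = np.array([1e-3, -1e-3])            # boundary normal in (odor A, odor B) space
eta = 0.5                              # learning rate

def p_correct(x, w):
    """Crude estimate of the probability that the current classification is correct,
    a sigmoid of the unsigned margin |w.x|."""
    return 1.0 / (1.0 + np.exp(-abs(w @ x)))

for t in range(5000):
    frac_A = rng.choice([0.0, 0.2, 0.35, 0.45, 0.55, 0.65, 0.8, 1.0])
    x = np.array([frac_A, 1.0 - frac_A]) + 0.05 * rng.normal(size=2)  # noisy odor sample
    choose_A = (w @ x) > 0
    correct = choose_A == (frac_A > 0.5)
    if correct:                                     # rewarded trial: shift the boundary
        y = 1.0 if choose_A else -1.0
        w += eta * (1.0 - p_correct(x, w)) * y * x  # update scaled by 1 - confidence
```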

Figure 4. An on-line statistical classifier accounts for choice updating.


(a) Schematic of a classifier using a Support Vector Machine for learning to categorize odor samples. The dashed line shows one possible hyperplane for classification and the shaded area around the dashed line indicates the margin. The orange arrow indicates the distance between one data point and the classification hyperplane, that is, the margin for that data point given the hyperplane. Each circle is one odor sample in one trial. (b) Average estimates of the margins of the classifier. (c) The size of the shift in classification as a function of previous and current stimulus. (d) Choice updating as a function of previous odor, separated for current easy and hard choices.

Simulation of this on-line, Bayesian SVM produces choice updating similar to that from the RL model (Figure 4b, c). Thus, statistical classifiers in which the decision boundary is continuously updated in proportion to the estimated classification success (i.e. decision confidence) provide another account of choice updating. The core computational feature common to the reinforcement learning and classifier models is that statistical decision confidence contributes to the trial-by-trial adjustment of choice bias.

In the following sections, we examine whether confidence-guided choice updating is observed in sensory modalities other than olfaction, and in species other than rats.

Confidence-guided choice updating in rat auditory decisions

We next investigated decisions of rats performing a two-alternative auditory decision task (Figure 5). Rats were trained to report which of two binaurally delivered auditory click trains had a greater number of clicks (Sanders et al., 2016; Sanders and Kepecs, 2012; Brunton et al., 2013; Figure 5a). The click trains were presented for 250 ms and generated using a Poisson process with the sum of the two rates held constant across trials. To control decision difficulty, the ratio of click rates for the two sides was randomly varied from trial to trial. The strength of evidence on a given trial was computed from the number of clicks, that is, the difference in the number of clicks between left and right divided by the total number of clicks. Rats showed steep psychometric curves (slope: 0.37 ± 0.03, mean ± SD) with minor overall bias (-1 ± 4.9% sound A) and near-zero lapse for easy stimuli (0.05 ± 0.09%). We found that, similar to the olfactory task, difficult choices were biased in proportion to the difficulty of the previous choice (Figure 5b-d, p=0.007, rank sum test).
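For concreteness, the sketch below generates one trial of this stimulus and computes the evidence strength as defined above; the particular rates and the function name are our illustrative choices, not necessarily those used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(3)
total_rate, duration = 100.0, 0.25         # summed click rate (clicks/s) and 250 ms duration

def click_trial(ratio_left):
    """Draw Poisson click counts for the two sides, keeping the summed rate fixed,
    and return the signed evidence strength (left - right) / (left + right)."""
    n_left = rng.poisson(total_rate * ratio_left * duration)
    n_right = rng.poisson(total_rate * (1.0 - ratio_left) * duration)
    evidence = (n_left - n_right) / max(n_left + n_right, 1)
    return n_left, n_right, evidence

n_left, n_right, evidence = click_trial(0.65)   # a moderately easy 'left' trial
```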

Figure 5. Rats update their trial-by-trial auditory choices in a confidence-dependent fashion.


(a) Schematic of a 2AFC auditory decision-making task for rats. (b) Performance of an example rat computed separately based on the quality of the auditory stimulus (shown as colors from blue to green) in the previously rewarded trial. (c) Choice updating as a function of sensory evidence in the previous and current trial in the population of rats (n = 5). (d) Choice updating as a function of previous stimulus, separated for current easy (square) and difficult (circle) trials, averaged across rats.

Confidence-guided choice updating in mouse auditory decisions

We next considered decisions of mice performing an auditory decision task (Figure 6). Mice were trained on a two-alternative auditory tone decision task. The auditory stimuli in different trials were presented as percentages of a high- and a low-frequency complex tone, morph A and morph B (Figure 6a). To manipulate perceptual uncertainty, we varied the amplitude ratio of the two spectral peaks in a trial-by-trial manner. Mice showed steep psychometric curves (slope: 0.25 ± 0.06, mean ± SD) with minor overall bias (-1 ± 6.9% sound A) and negligible lapse for easy stimuli (3 ± 2%). The choices showed significant dependence on the difficulty of the previous choice (Figure 6b-d, p=0.01, rank sum test).

Figure 6. Mice update their trial-by-trial auditory choices in a confidence-dependent fashion.


(a) Schematic of a 2AFC auditory decision making task for mice. (b) Performance of an example mouse computed separately based on the quality of auditory stimulus (shown as colors from blue to green) in the previously rewarded trial. (c) Choice updating as a function of sensory evidence in the previous and current trial in the population of mice (n = 6). (d) Choice updating as a function of previous stimulus separated for current easy (square) and difficult (circle) trials, averaged across mice.

Confidence-guided choice updating in mouse visual decisions

We next considered mice trained to perform visual decisions (Figure 7; Burgess et al., 2017). Head-fixed mice were trained to report the position of a grating on the monitor by turning a steering wheel placed under their front paws (Figure 7a). If the mouse turned the wheel such that the stimulus reached the center of the screen, the animal received water. If instead the mouse moved the stimulus by the same distance in the opposite direction, this incorrect decision was penalized with a timeout of 2 s (Burgess et al., 2017). We varied task difficulty by varying the contrast of the stimulus across trials. Mice showed steep psychometric curves (slope: 0.260±11, mean ± SD) with minor overall bias (1 ± 9.1% stimulus contrast) and negligible lapse for easy stimuli (5 ± 4.9%). The decisions showed significant dependence on the difficulty of the previous choice (Figure 7b-d, p=0.001, rank sum test), consistent with similar results in a different version of this task that also included manipulation of reward size (Lak et al., 2019).

Figure 7. Mice update their trial-by-trial visual choices in a confidence-dependent fashion.


(a) Schematic of a 2AFC visual decision making task for mice. (b) Performance of an example mouse computed separately based on the quality of visual stimulus (shown as colors from blue to green) in the previously rewarded trial. (c) Choice updating as a function of sensory evidence, that is the contrast of stimulus, in the previous and current trial in the population of mice (n = 12). (d) Choice updating as a function of previous stimulus separated for current easy (square) and difficult (circle) trials, averaged across mice.

Confidence-guided choice updating in human visual decisions

Next, we asked whether the confidence-guided choice strategy was specific to rodents or could also be observed in humans (Urai et al., 2017). Human observers performing a visual decision task updated their choices in a confidence-dependent manner (Figure 8). Observers performed a two-interval forced choice (2IFC) motion coherence discrimination task (Figure 8a). They judged the difference in motion coherence between two successively presented random dot kinematograms: a constant reference stimulus (70% motion coherence) and a test stimulus with varying motion coherence across trials (Urai et al., 2017). Observers performed the task well (slope: 0.2 ± 0.01, bias: 0 ± 3% stimulus coherence, lapse: 1 ± 2%), and their choices showed significant dependence on the difficulty of the previous choice (Figure 8b-d, p<0.05, rank sum test).

Figure 8. Humans update their trial-by-trial visual choices in a confidence-dependent fashion.


(a) Schematic of a 2IFC visual decision making task in human subjects. (b) Performance of an example subject computed separately based on the quality of visual stimulus (shown as colors from blue to green) in the previously rewarded trial. (c) Choice updating as a function of sensory evidence, that is the difference in coherence of moving dots between two intervals, in the previous and current trial, averaged across subjects (n = 23). (d) Choice updating as a function of previous stimulus strength, separated for current easy (square) and difficult (circle) trials, averaged across subjects.

Confidence-guided choice updating transfers across sensory modalities

We found that rats exhibit confidence-dependent choice updating even when the sensory modality of the current decision differs from the modality of the previous decision (Figure 9). We trained rats in a dual-modality 2AFC task with randomly interleaved trials of auditory and olfactory decisions (Figure 9a-b). Rats performed the task well (slope: 0.35 ± 0.03, bias: -2 ± 6%, lapse: 1 ± 0.4%), and their choices showed significant dependence on the difficulty of the past choice; moreover, this dependence transferred across the sensory modalities. Rats updated their olfactory decisions after difficult auditory decisions (Figure 9c-e, p=0.008, rank sum test), and similarly, they updated their auditory choices after difficult olfactory decisions (Figure 9f, h, p=0.01, rank sum test). The updating also occurred across trials in which the modality of the choice did not change (i.e. in consecutive auditory choices and consecutive olfactory choices). Choice updating appeared largest in two consecutive olfactory decisions, yet updating was present in consecutive auditory choices as well as in consecutive trials with a modality switch (Figure 9—figure supplement 1). These results show that the updating effect depends on the choice outcome, rather than the identity, that is the modality, of the sensory stimulus. This observation further indicates that choice updating mainly occurs in the space of action values, similar to our RL model.

Figure 9. Confidence-dependent choice updating transfers across sensory modalities.

(a-b) Schematic of a 2AFC task in which rats performed either olfactory (a) or auditory (b) decisions in randomly interleaved trials. (c) Performance of an example rat computed for olfactory trials separately based on the quality of the auditory stimulus (shown as colors from blue to green) in the previously rewarded trial. (d) Choice updating as a function of sensory evidence (auditory stimulus) in the previous trial and odor mixture in the current trial, averaged across subjects (n = 6). (e) Choice updating as a function of the previous auditory stimulus, separated for current odor-guided easy (square) and difficult (circle) trials, averaged across subjects. (f-h) Similar to c-e but for trials in which the current stimulus was auditory and the previous trial was based on an olfactory stimulus.


Figure 9—figure supplement 1. Choice-updating in rats performing a task in which the modality of sensory stimulus in different trials is either auditory or olfactory.


See Figure 10 for the definition of Updating Index.

Diversity of confidence-guided choice updating across individuals

Having observed confidence-guided choice updating across various data sets, we next quantified this behavioral effect in each individual and observed a weak but significant negative correlation between the strength of choice updating and the psychometric lapse rate (Figure 10). To quantify the choice updating effect for each individual, we performed linear regressions on the updating data (Figure 10a inset) and computed the updating index as the difference in the slope of the fits for current easy and current difficult trials. A large fraction of individuals in each data set showed substantial choice updating consistent with the predictions of the model (positive numbers in Figure 10a). Nevertheless, we also observed individuals with negligible updating, and even, in rare instances, a choice bias in the direction opposite to the model’s prediction (negative numbers in Figure 10a). Quantifying individual behavior enabled us to ask whether this observed diversity could be explained by variations in the quality of perceptual processing (psychometric slope and lapse rate; Figure 10b,c). The slope and updating did not exhibit a significant relationship (p=0.21), but the lapse rate and updating showed a weak but significant negative correlation (p=0.03). In other words, choice updating was strongest among individuals with lower lapse rates (Figure 10c). These results suggest that choice updating of the form we observed is strongest when subjects are well trained in the perceptual task, with a stable psychometric slope and minimal lapse rate.
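A minimal version of this quantification is sketched below (our implementation; the sign convention and fitting routine are illustrative): a line is fit to updating as a function of the previous stimulus, separately for current easy and current difficult trials, and the updating index is the difference between the two slopes.

```python
import numpy as np

def updating_index(prev_stim, updating_easy, updating_difficult):
    """prev_stim: signed evidence of the previous rewarded trial (one value per level);
    updating_easy / updating_difficult: mean choice updating at each level, computed
    for trials in which the *current* stimulus was easy or difficult (cf. Figure 1g)."""
    slope_easy = np.polyfit(prev_stim, updating_easy, 1)[0]
    slope_difficult = np.polyfit(prev_stim, updating_difficult, 1)[0]
    return slope_difficult - slope_easy   # larger values: stronger confidence-guided updating
```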

Figure 10. Confidence-guided choice updating is strongest in individuals with well-defined psychometric behavior.


(a) The strength of choice updating among individuals. The vertical lines show the mean. Inset: schematic illustrating the calculation of the updating index, defined as the difference in the slope of the lines fitted to the data. (b) Scatter plot of choice updating as a function of the slope of the psychometric curve. Each circle is one individual. Dashed lines illustrate a linear fit on each data set, and the gray solid line shows a linear fit on all subjects. (c) Scatter plot of choice updating as a function of the lapse rate of the fitted psychometric curve.

Different strategies for choice updating after error trials due to different noise sources

Lastly, we asked whether subjects show choice updating following error trials. We observed substantial diversity in choice bias after error trials, both across individuals and across data sets. For example, two rats that performed olfactory decisions with similar choice updating after correct choices (Figure 11a,b, top panels) showed divergent patterns of updating after incorrect choices (Figure 11a,b, bottom panels). Interestingly, reinforcement learning models with different parameter settings also produced diverse choice bias patterns after error trials, depending on the dominant source of noise, that is, whether errors were produced by sensory noise (external) or by value fluctuations (internal). When sensory noise is high and leads to errors, these errors cannot be systematically corrected, and hence there is little or no net updating effect (Figure 11c, bottom panel). On the other hand, when the internal noise is high, for example a high learning rate producing over-correction, systematic post-error updating is observed (Figure 11d, bottom panel). Although we found individual subjects matching these specific patterns of post-error updating (Figure 11a,b), the diversity across individuals and populations, as well as the low number of error trials, precludes a more in-depth analysis. Note that for post-correct updating our RL model makes qualitatively similar predictions about choice updating, independent of whether the dominant source of noise is external or internal (Figure 11c,d, top panels).

Figure 11. Diverse learning effects after error trials.


(a) Choice updating after correct trials (top) and after error trials (bottom) in one example rat. (b) Similar to a for another example rat. (c) Choice updating of the TDRL model run with large sensory noise (σ² = 0.5). This model exhibits choice updating qualitatively similar to the rat shown in a. (d) Choice updating in the TDRL model with large internal noise (α = 0.8). This model run exhibits choice updating similar to the rat shown in b.

Discussion

Our central observation is that even well-trained, well-performed perceptual decisions can be informed by past sensory evidence and outcomes. When the sensory evidence is strong, it determines choices, as expected, and past outcomes have little influence. When sensory evidence is weak, however, choices are influenced by trial history. We show that this effect of past rewards can depend on the difficulty of past sensory decisions: subjects repeat the previously rewarded choice mainly when the past sensory decision was difficult. This confidence-guided choice updating occurs in a trial-by-trial manner and is not due to slow drifts in choice strategy. We demonstrate that these history-dependent choice biases can be explained by reinforcement learning models that treat sensory ambiguity as a belief state computation, and hence produce confidence-scaled reward prediction error signals. We show that this form of choice updating is a robust and widespread behavioral phenomenon observed across various perceptual decision-making paradigms in mice, rats and humans. Notably, we found evidence for the same reinforcement learning process across data sets despite substantial variation in experimental setups and other conditions across the experiments examined.

The influence of past trials on perceptual decisions does not necessarily reflect active, trial-by-trial learning. In fact, our simulations illustrate that slow and non-specific drifts in the decision boundary result in correlations between consecutive choices, which can produce psychometric shifts similar to choice bias updating (Figures 2 and 3). To correct for this and isolate reinforcement learning-based choice updating, we used a simple procedure to compute choice updating relative to slow fluctuations in the choice bias. We show that the trial history effects that persist after this correction reflect confidence-guided reinforcement learning processes. These analyses also indicate that trial history effects and serial choice biases should be interpreted with care, because correlations across choices at various time scales can produce apparent updating of perceptual choices. A similar confound has been previously reported in post-error slowing analyses, and similar normalization procedures were used to correct for it (Dutilh et al., 2012; Purcell and Kiani, 2016).

Confidence in the correctness of a choice determines the degree to which the reward can be expected. Hence, decision confidence informs how much the decision maker should learn from the decision outcome, as suggested by RL models that incorporate belief states representing confidence (Lak et al., 2017; Lak et al., 2019). Rewards received after decisions with high confidence are expected and hence there is not much to learn from them. In contrast, rewards received after decisions with lower confidence are relatively unexpected and could provide an opportunity to learn (Figure 3). Our results reveal that rodents and humans exhibit this form of learning (Figure 10). These results were robust across various datasets, enabling us to isolate an elementary cognitive computation driving choice biases.

Trial-by-trial transfer of choice updating across sensory modalities provides some evidence that this form of learning is driven by comparing the decision outcome with a confidence-dependent expectation and performing updating in the space of action values. However, updating across trials with different modalities was relatively weaker than across trials within the same sensory modality. This latter observation might indicate that in trial-by-trial learning animals follow a mixture of two strategies: one that updates values in the space of actions, and one that keeps track of stimulus identity and statistics across trials for such learning. The trade-off between these model-free and model-based forms of trial-by-trial learning during perceptual decisions remains to be explored in future studies.

What are the neuronal substrates of confidence-dependent choice updating? Several lines of evidence indicate that the dopaminergic system is centrally involved in this phenomenon. First, dopamine neuron responses during perceptual decisions quantitatively match confidence-dependent prediction errors in both monkeys and mice (Lak et al., 2017; Lak et al., 2019). Second, dopamine responses predict the magnitude of the psychometric choice bias in the subsequent trial (Lak et al., 2019). Third, optogenetic manipulation of dopamine neurons biases psychometric curves in a trial-by-trial fashion (Lak et al., 2019). In addition, frontal cortical regions, including medial prefrontal cortex (Lak et al., 2019) and orbitofrontal cortex (Hirokawa et al., 2019), are also likely to contribute to confidence-guided choice updating strategies.

Rewards induce choice biases in perceptual decisions

There is mounting evidence that in perceptual decision-making tasks, even though reward is contingent only on an accurate judgment about the current sensory stimulus, choices can be influenced by previous trials, and this holds across species (Abrahamyan et al., 2016; Akaishi et al., 2014; Akrami et al., 2018; Braun et al., 2018; Busse et al., 2011; Cho et al., 2002; Fan et al., 2018; Fischer and Whitney, 2014; Fritsche et al., 2017; Fründ et al., 2014; Gold et al., 2008; Hwang et al., 2017; Lueckmann et al., 2018; Luu and Stocker, 2018; Marcos et al., 2013; Tsunada et al., 2019; Urai et al., 2017). Several such studies have shown that subjects might repeat the previously rewarded choice or avoid it after an unsuccessful trial, suggesting that basic forms of reward-based learning are at work even at asymptotic, steady-state perceptual performance. Given that these trial-history effects diminish the overall reward return, and are hence suboptimal, the question is why they persist even after subjects are well trained in the task. Our results revealed a similar phenomenon across various data sets and species: a choice bias in perceptual decisions following rewarded decisions that were difficult, consistent with recent reports (Mendonça et al., 2018; Lak et al., 2019). We show that these behavioral effects are normatively expected from various models that consider the uncertainty of stimulus states inherent in perceptual decisions. It is worth noting that the confidence-gauged learning described here requires observing the trial feedback, and it might thus differ from sequential choice effects in the absence of trial feedback (Braun et al., 2018; Glaze et al., 2015). Moreover, confidence-dependent learning differs from trial-history effects for highly discriminable stimuli, that is, the ‘priming of pop-out’ effect (Maljkovic and Nakayama, 1994). These observations suggest that there are various types of selection-history mechanisms operating in the brain with distinct constraints and properties.

Computational mechanisms of confidence-driven choice bias

What classes of models can account for our behavioral observations? It is clear that a purely sensory-based model or a purely reward-based model cannot account for our data. One approach is to start with a reinforcement learning model and add a belief state to account for stimulus-induced uncertainty (Lak et al., 2017; Lak et al., 2019). We show that this class of models provides a teaching signal reflecting past confidence and accounts for the observed choice bias strategy. Alternatively, we also considered a statistical classifier model that adjusts its decision boundary on-line in proportion to estimated classification success (Sollich, 2002). We show that this Bayesian on-line support vector machine also accounts for the observed choice bias strategy. Similarly, Bayesian learning in drift-diffusion models of decision making makes comparable predictions about confidence-dependent choice biases (Drugowitsch et al., 2019). Thus, either a sensory-based classification model modified to produce statistically optimal adjustments based on on-line feedback, or a reward-based model modified to account for the ambiguity in stimulus states, produces broadly similar confidence-dependent choice biases. These models share one main computation: they adjust the degree of learning based on the statistical confidence in the accuracy of previous decisions. Therefore, the key features of our data can be accounted for by either reinforcement learning mechanisms or on-line statistical classifiers.

These model classes can be distinguished chiefly on the basis of their respective decision variables: RL models update values, while classifiers update boundaries in sensory coordinates. Using a mixed sensory modality decision task, it is possible to test whether updating is based on action values or stimulus variables. Action value updating, predicted by RL models, leads to the transfer of choice bias across decisions with mixed sensory modalities. Category boundary updating in sensory coordinates, predicted by classifier models, leads to updating solely within the same sensory modality. We found that choice bias transferred across sensory modalities, suggesting that updating occurs in the space of action values. However, across-modality choice bias updating was weaker than within-modality updating, pointing to the possibility that animals update choices based on both stimulus statistics and action values.

All correct trials are alike; each incorrect trial is incorrect in its own way

When learning from outcomes, it is natural to consider not only correct choices but also what happens after incorrect choices. Surprisingly, we found that post-error behavioral effects were highly variable across subjects and datasets, unlike the post-correct choice updating we observed.

To paraphrase Leo Tolstoy's famous opening sentence of the novel Anna Karenina: all correct trials are alike; each incorrect trial is incorrect in its own way. Correct perceptual performance requires appropriate processing and evaluation of the stimulus. In contrast, there are many processes that can lead to incorrect performance without consideration of the stimulus, from inattention to lack of motivation to exploration. Indeed, in our behavioral data, the post-error behavioral effects were diverse, often even within the same dataset.

We examined the RL model to gain insights into the possible origins of this post-error diversity. Note that the model's qualitative predictions about post-correct trials are largely independent of model parameters. In contrast, the model's predictions for post-error trials depend on parameter settings, in part on the balance between the sources of decision noise. When stimulus noise was dominant, there was little post-error updating. However, when the model's internal noise was high (e.g. a large learning rate), it exhibited post-error updating. We found that individual subjects matched these specific patterns, but the diversity in the data was greater, as expected, since the modeling framework does not take into account many relevant sources of decision noise, such as attentional lapses, lack of motivation, or exploration, all of which can lead to errors.

Identifying the origin of decision noise is critical for the appropriate interpretation of any psychometric decision task. In the signal detection theory-based psychometric framework, negligible lapse rates and asymptotic psychometric slopes support the interpretation that fluctuations in decisions are solely due to noise in perceptual processing. However, the ongoing learning mechanisms described here contribute to apparent fluctuations in decisions, indicating that the contribution of sensory processing to decision noise may often be lower than previously thought (Zariwala et al., 2013). This ongoing learning might eventually disappear after extended behavioral training, yet we still observed signatures of this learning process in many subjects after 3–4 months of almost daily training.

Consilience of perceptual and reward-guided decisions

Choice biases in perceptual decisions are typically considered maladaptive and suboptimal because, in laboratory experiments, trials are often designed to be independent of each other and to isolate the perceptual process under study (Britten et al., 1992; Hernández et al., 1997). Indeed, perceptual choice biases that are entirely stimulus-independent are suboptimal. However, the confidence-guided choice bias we examined could reflect an underlying optimal choice strategy from several perspectives. First, this strategy is optimal when considering that the world can change, and hence the precise decision category boundary may not be stable across trials. In this situation, the outcome of decisions close to the category boundary provides the most informative feedback as to where the category boundary should be set. Second, confidence-guided choice updating is optimal when considering that in natural environments external events can be temporally correlated. In such environments, when the evidence in favor of the choice options is limited and hence there is uncertainty in decisions, it is beneficial to consider prior beliefs and the temporal statistics of events and adjust choices accordingly (Yu and Cohen, 2008). Thus, despite being suboptimal from the experimenter's perspective, confidence-guided learning can be optimal in dynamic, real-world situations, revealing that perceptual and reinforcement learning processes jointly contribute to many previously studied decision paradigms.

Materials and methods

Data analysis and psychometric fitting

For the psychometric analysis, we calculated the percentage of choices as a function of the sensory stimulus. We fitted these data with the psychometric function ψ(s) = λ + (1 − 2λ) F(s; μ, σ), where F is a cumulative Gaussian. The parameter μ represents the mean of the Gaussian and defines the side bias. The parameter σ determines the slope of the fitted curve. The parameter λ represents the lapse rate of the curve. We fitted this function via maximum likelihood estimation (Wichmann and Hill, 2001).
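
For illustration, a minimal Python sketch of this maximum-likelihood fit is given below. This is not the code used in the study; the function names, starting values, and parameter bounds are our own assumptions.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def psychometric(s, mu, sigma, lam):
    """psi(s) = lambda + (1 - 2*lambda) * F(s; mu, sigma), with F a cumulative Gaussian."""
    return lam + (1 - 2 * lam) * norm.cdf(s, loc=mu, scale=sigma)

def fit_psychometric(stimuli, chose_right):
    """Maximum-likelihood estimates of (mu, sigma, lambda) from binary choices."""
    stimuli = np.asarray(stimuli, dtype=float)
    chose_right = np.asarray(chose_right, dtype=float)

    def neg_log_likelihood(params):
        mu, sigma, lam = params
        p = np.clip(psychometric(stimuli, mu, sigma, lam), 1e-9, 1 - 1e-9)
        return -np.sum(chose_right * np.log(p) + (1 - chose_right) * np.log(1 - p))

    fit = minimize(neg_log_likelihood, x0=[0.0, 1.0, 0.02],
                   bounds=[(None, None), (1e-3, None), (0.0, 0.5)])
    return fit.x  # mu (side bias), sigma (slope), lambda (lapse rate)
```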

We used the psychometric fits to evaluate whether performance differed substantially across days of testing in each subject. In addition, we computed conditional psychometric curves from a subset of trials, that is, those preceded by a specific stimulus level, action direction, and outcome in the previous trial. We used the same fitting procedure as above for these conditional curves.

Subtracting the average performance (for each stimulus level) from performance in the conditional curves at the same stimulus level provided an estimate of choice updating, that is, the size of the side bias for each stimulus (Figure 2). The size of these side biases was plotted in the heatmaps for each dataset.

To isolate trial-by-trial updating independent of possible slow fluctuations in choice bias, we estimated the slow side bias and subtracted it from the updating heatmaps (Figure 2, Figure 2—figure supplement 1e, Figure 3). This procedure involved computing conditional psychometric curves after a subset of trials (with a specific stimulus, action, and outcome) as well as computing the conditional curves prior to these trials, plotting heatmaps for both sets of curves, and subtracting the latter heatmap from the former (Figure 2—figure supplement 1e, Figure 3).
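
The sketch below illustrates one way to compute these conditional shifts and the slow-bias-corrected updating matrix from a trial table. It is a schematic reimplementation under assumed column names ('stimulus', 'chose_right', 'rewarded'), not the analysis code used for the figures.

```python
import numpy as np
import pandas as pd

def percent_right(df):
    """Percentage of rightward choices for each stimulus level."""
    return df.groupby('stimulus')['chose_right'].mean() * 100

def updating_matrix(trials):
    """Choice updating conditioned on the previous trial, corrected for slow bias.

    trials: DataFrame in trial order with columns 'stimulus', 'chose_right',
    'rewarded'. For each previous-trial condition (stimulus, choice, outcome),
    the shift of the conditional choice data from the session average is
    computed, and the same quantity computed on trials preceding the
    conditioning trials is subtracted (the slow-bias correction)."""
    avg = percent_right(trials)
    prev = trials.shift(1)    # previous-trial attributes aligned to each trial
    nxt = trials.shift(-1)    # next-trial attributes aligned to each trial
    out = {}
    conditions = trials[['stimulus', 'chose_right', 'rewarded']].drop_duplicates()
    for _, cond in conditions.iterrows():
        after = (prev[cond.index] == cond).all(axis=1)    # trials following the condition
        before = (nxt[cond.index] == cond).all(axis=1)    # trials preceding the condition
        out[tuple(cond)] = (percent_right(trials[after]) - avg) - (percent_right(trials[before]) - avg)
    return pd.DataFrame(out)
```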

Behavioral models

TDRL model with stimulus belief state

In order to examine the nature of choice updating during perceptual decision making, we adopted a reinforcement learning model that accommodates trial-by-trial estimates of perceptual uncertainty (Lak et al., 2017; Lak et al., 2019). In all our tasks, the subject can select one of two responses (often left vs right) to indicate its judgement about the stimulus (i.e. whether it belongs to category A or B, or left or right for simplicity). The state of the trial (left or right) is only partially observable, and knowledge of it depends on the quality of the sensory evidence.

In keeping with standard psychophysical treatments of sensory noise, the model assumes that the internal estimate of the stimulus, ŝ, is normally distributed with constant variance around the true stimulus contrast: p(ŝ|s) = N(ŝ; s, σ²). In the Bayesian view, the observer's belief about the stimulus s is not limited to a single estimated value ŝ. Instead, ŝ parameterizes a belief distribution over all possible values of s that are consistent with the sensory evidence. The optimal form for this belief distribution is given by Bayes' rule:

p(s|ŝ) = p(ŝ|s) p(s) / p(ŝ)

We assume that the prior belief about s is uniform, which implies that this optimal belief will also be Gaussian, with the same variance as the sensory noise distribution and mean given by ŝ: p(s|ŝ) = N(s; ŝ, σ²). From this, the agent computes a belief, that is, the probability that the stimulus was indeed on the right side of the monitor, pR = p(s > 0 | ŝ), according to:

pR(ŝ) = ∫₀^∞ p(s|ŝ) ds

where pR represents the trial-by-trial probability of the stimulus being on the right side, and pL = 1 − pR similarly represents the probability of it being on the left.

The expected values of the two choices L and R are computed as QL = pL·VL and QR = pR·VR, where VL and VR represent the stored values of the L and R actions. Over learning, these values converge to a quantity just below the true reward size available for correct choices (Figure 3—figure supplement 1c). To choose between the two options, we used an argmax rule, which deterministically selects the action with the higher expected value. Using other decision functions, such as softmax, did not substantially change our results (Figure 3—figure supplement 1a). The outcome of this process is thus the choice C (L or R), its associated confidence pC, and its predicted value QC.

QC = QL if choice = L;  QC = QR if choice = R

When the trial begins, the expected reward prior to any information about the stimulus is Vtrial onset = (VL + VR)/2. Upon observing the stimulus and making a choice, the prediction error signal is QC − Vtrial onset. After receiving the reward, r, the reward prediction error is δ = r − QC.

Given this prediction error, the value of the chosen action will be updated according to:

VC ← VC + α·δ, where α is the learning rate. For simplicity, the model does not include temporal discounting. Parameter values used in Figure 3 are σ² = 0.2 and α = 0.5. Each agent received 500 trials per stimulus level (presented to the model in random order), and plots reflect averages across 1000 agents.
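
A compact Python simulation of this belief-state TDRL agent is sketched below. The reward size, value initialization, and stimulus sign convention (negative values meaning the correct response is Left) are our assumptions for illustration; only σ², α, the argmax rule, and the trial counts follow the description above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulate_belief_tdrl(stim_levels, n_per_stim=500, sigma2=0.2, alpha=0.5, reward=1.0):
    """TDRL agent whose teaching signal incorporates the stimulus belief state.
    Negative stimulus values mean the correct response is Left, positive mean Right."""
    sigma = np.sqrt(sigma2)
    V = {'L': 0.5, 'R': 0.5}                      # stored action values (initialization assumed)
    stims = rng.permutation(np.repeat(stim_levels, n_per_stim))
    trials = []
    for s in stims:
        s_hat = rng.normal(s, sigma)              # noisy internal estimate of the stimulus
        pR = norm.cdf(s_hat / sigma)              # belief that the stimulus is on the right
        Q = {'L': (1 - pR) * V['L'], 'R': pR * V['R']}
        choice = 'R' if Q['R'] > Q['L'] else 'L'  # argmax decision rule
        r = reward if (choice == 'R') == (s > 0) else 0.0
        delta = r - Q[choice]                     # confidence-dependent prediction error
        V[choice] += alpha * delta
        trials.append((s, choice, r, delta))
    return trials

# Example with hypothetical stimulus levels; averaging many such agents gives the
# pattern described in the text.
# trials = simulate_belief_tdrl([-0.4, -0.2, -0.05, 0.05, 0.2, 0.4])
```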

TDRL models without stimulus belief state

The TDRL model without the belief state was largely similar to the model described above, with one fundamental difference: the computation of the prediction error did not have access to the belief state used for choice computation. In other words, prediction errors were computed by comparing the outcome with the average value of the chosen action, without consideration of the belief in the accuracy of that action.

As in the model above, the expected values of the two choices L and R are computed as QL = pL·VL and QR = pR·VR, where VL and VR represent the stored values of the L and R actions. Over learning, these values converge to the average value of reward received in past trials, that is, the average performance in the task (Figure 3—figure supplement 1c). To choose between the two options, we used an argmax rule, which deterministically selects the action with the higher expected value. The outcome of this process is thus the choice (L or R), its associated confidence pC, and its predicted value QC.

QC = QL if choice = L;  QC = QR if choice = R

After receiving the reward, r, the reward prediction error is δ = r − VC, where C is the chosen action, as before. Given this prediction error, the value of the chosen action is updated according to:

VC ← VC + α·δ

Because VC does not reflect decision confidence, the prediction errors only reflect the presence or absence of reward, and are not modulated by decision confidence (Figure 3—figure supplement 1d). Thus, they drive learning based on the past outcome but not the past decision confidence. Note that prior to normalization, there is a small apparent tendency for this model to exhibit updating that depends on the difficulty of the previous choice (Figure 3c, left). This reflects the correlation of stored values across trials and transient regimes in these stored values, which make it more probable to observe two consecutive correct same-side difficult choices. However, after normalization, the updating does not show any effect of previous difficulty, and merely reflects the presence or absence of reward in the previous trial (Figure 3c, right).
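
For comparison with the belief-state sketch above, the only change needed to obtain this model is in the prediction error (a schematic fragment, reusing the variable names assumed in the earlier sketch):

```python
# Schematic fragment: without the belief state, the prediction error compares the
# outcome to the stored value of the chosen action, not to the confidence-weighted
# expected value Q[choice] used in the sketch above.
delta = r - V[choice]        # reflects only the presence or absence of reward
V[choice] += alpha * delta
```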

An extended version of this model includes multiple states, that is, one state storing the average value of each stimulus level. In this model, the reward prediction error can be computed in two ways: without or with access to the inferred state. The first scenario (without access) makes prediction errors independent of past difficulty. The second scenario has access to the inferred state and compares the reward with the value of that state to produce confidence-dependent prediction errors. However, since the updates only affect the current state, no learning is expected for different (nearby) stimuli.

Modifications to include slowly drifting choice bias

We modified the TDRL model without the belief state to include a slowly drifting side bias (Figure 3d). The bias over trials was defined as the moving average (across 50 trials) of sin(t) + A + B, where A is temporal noise over trials (drawn from a normal distribution, N(0,1)) and B is amplitude noise (drawn from a normal distribution, N(0,1)). On each trial, the bias (a negative or positive number) was added to QL, and the choice was made by comparing QL and QR, as described before. This induced drift causes a strong correlation across trials that influences choices. However, the normalization removes this correlation, and the model shows an effect of past reward that is independent of the difficulty of the past stimulus.

Modifications to include a win-stay-lose-switch strategy

We incorporated a simple probabilistic win-stay-lose-switch strategy into the TDRL model without the stimulus belief state. To do so, in 10% of randomly chosen trials, the choice was determined by the choice and outcome of the previous trial, rather than by comparing QL and QR. In these trials, the previous choice was repeated when the previous trial was rewarded (win-stay), or the alternative choice was selected when the previous trial was not rewarded (lose-switch).
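
A schematic fragment of this overlay is shown below, again reusing the hypothetical variable names from the earlier sketches; prev_choice and prev_rewarded are assumed to hold the previous trial's choice and outcome.

```python
# Probabilistic win-stay-lose-switch overlay on 10% of trials (a sketch).
if prev_choice is not None and rng.random() < 0.10:
    if prev_rewarded:
        choice = prev_choice                              # win-stay
    else:
        choice = 'L' if prev_choice == 'R' else 'R'       # lose-switch
else:
    choice = 'R' if Q['R'] > Q['L'] else 'L'              # usual value comparison
```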

On-line Bayesian support vector machine (SVM)

The SVM algorithm finds the decision hyperplane that maximizes the margin between the data points belonging to two classes. The margin refers to the space around the classification boundary in which there is no data point, that is, the largest minimal distance from any of the data points (Figure 4a). The data points falling on the margin are referred to as support vectors.

Following the conventional form of a linear SVM, stimuli are represented as xᵢ, each encoding the two odor components A and B, with labels yᵢ ∈ {−1, 1} corresponding to the two classes (dominant odor A or B in the stimulus). The algorithm finds the hyperplane in this feature space that separates the stimuli. Assuming that classification is determined by the decision function f(x) = sign(wᵀφ(x) + b), the classification hyperplane is w·x + b = 0, where w is a weight vector and b is an offset. Thus, a data point x is assigned to the first class if f(x) equals +1 or to the second class if f(x) equals −1. In the simplest form (linear separability and hard margin), it is possible to select two parallel hyperplanes that separate the two classes of data; the distance between these hyperplanes is 2/‖w‖, and hence maximizing this distance requires minimizing ‖w‖. In a more general form, the weights w can be found by minimizing the following objective subject to the SVM constraints (i.e. no data point within the margin):

(λ/2)‖w‖² + Σᵢ max(0, 1 − yᵢ wᵀxᵢ)

where λ is the optimization hyperparameter that determines the trade-off between increasing the margin size and ensuring that the xᵢ lie on the correct side of the margin. The equation above thus balances classification errors against the degree of separability.

In the on-line, active form, the weights can be adjusted iteratively. In this form, w is updated according to

w ← w − α(λw − y·x)   if y·wᵀx < 1
w ← w − αλw   if y·wᵀx ≥ 1

where α is a learning rate. In Bayesian SVMs, the size of the margin for a data point is proportional to the likelihood of that point belonging to a class given the classifier. Thus, the posterior class probabilities, P(yᵢ = +1 | xᵢ, D), can be obtained by integrating over the posterior distribution of w and b, after appropriate normalization. This quantity is proportional to statistical decision confidence, that is, an estimate of confidence about which odor mixture component predominates (Figure 4b). In the simulations presented in Figure 4, we used α = 0.15 as the learning rate for updating the weight vector and σ² = 0.25 for the underlying noise level of the stimuli to be classified. Each agent received 500 trials per stimulus level (presented to the model in random order), and plots reflect averages across 1000 agents.
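
The fragment below sketches the on-line weight update in Python (a stochastic subgradient step of the objective above). The regularization value lam and the use of a sigmoid of the signed margin as a point-estimate stand-in for the posterior class probability are our assumptions; the full Bayesian treatment integrates over the posterior of w and b as described above.

```python
import numpy as np

def online_svm(stimuli, labels, alpha=0.15, lam=0.01):
    """On-line linear SVM: one stochastic subgradient step per trial.
    stimuli: (n_trials, n_features) array; labels: entries in {-1, +1}."""
    w = np.zeros(stimuli.shape[1])
    confidences = []
    for x, y in zip(stimuli, labels):
        if y * w.dot(x) < 1:                    # point inside the margin: hinge term is active
            w = w - alpha * (lam * w - y * x)
        else:                                   # only the regularization term contributes
            w = w - alpha * lam * w
        # crude point-estimate proxy for P(y = +1 | x): sigmoid of the signed margin
        confidences.append(1.0 / (1.0 + np.exp(-w.dot(x))))
    return w, confidences
```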

Behavioral experiments

The experimental procedures were approved by institutional committees at Cold Spring Harbor Laboratory (for experiments on rats) and MIT and Harvard University (for mouse auditory experiments), and were in accordance with National Institutes of Health standards (project ID: 18-14-11-08-1). Experiments on mouse visual decisions were approved by the Home Office of the United Kingdom (license 70/8021). Experiments in humans were approved by the ethics committee at the University of Amsterdam (project ID: 2014-BC-3376).

Rat olfactory experiment

The apparatus and task have been described previously (Hirokawa et al., 2019; Kepecs et al., 2008; Uchida and Mainen, 2003). The apparatus was controlled using PulsePal and Bpod (Sanders and Kepecs, 2014) or with the BControl system (https://brodylabwiki.princeton.edu/bcontrol/index.php?title=Main_Page). Rats self-initiated each trial by introducing their snout into the central port, where odor was delivered. After a variable delay, drawn from a uniform distribution of 0.2–0.5 s, a binary mixture of two pure odorants, S(+)-2-octanol and R(−)-2-octanol, was delivered at one of six concentration ratios (80:20, 65:35, 55:45, 45:55, 35:65, 20:80) in pseudorandom order within a session. In some animals we also used ratios of 100:0 and 0:100. After a variable odor sampling time of up to 0.7 s, rats responded by withdrawing from the central port and moving to the left or right choice port. Choices were rewarded with 0.025 ml of water, delivered from a stainless steel tube inside the choice port, according to the dominant component of the mixture, that is, at the left port for mixtures with A/B > 50/50 and at the right port for A/B < 50/50. Task training for these data, and for the other datasets presented here, focused on achieving high accuracy and did not specifically promote the confidence-dependent choice bias we observed.

Rat auditory experiment

Rats self-initiated each trial by entering the central stimulus port. After a random delay of 0.2–0.4 s, auditory stimuli were presented. Rats had to determine the side with the higher number of clicks in binaural streams of clicks (Brunton et al., 2013; Sanders et al., 2016; Sanders and Kepecs, 2012). Auditory stimuli were Poisson-distributed click trains played binaurally from two speakers placed outside the behavioral box for a fixed duration of 0.25 s. For each rat, we chose a maximum click rate according to the performance of the animal, typically 50 clicks/s. This maximum click rate was fixed for each animal. For each trial, we randomly chose a difference in click rate between left and right from a uniform distribution between 0 and the maximum click rate. The sum of the left and right click rates was kept constant at the maximum click rate. Rats indicated their choice by exiting the stimulus port and entering one of two choice ports (left or right), with a maximum response time of 3 s after leaving the stimulus port. Choices were rewarded according to the side with the higher number of clicks between the left and right click trains. Exiting the stimulus port during the pre-stimulus delay or during the stimulus period (first 0.25 s) was followed by white noise and a time-out of 3–7 s.

Rat randomly interleaved auditory-olfactory experiment

Rats self-initiated each trial by entering the central stimulus port. After a random delay of 0.2–0.4 s, either an olfactory or an auditory stimulus was presented (randomly interleaved). For olfactory stimuli, rats had to determine the dominant odor of a mixture of the pure odorants S(+)-2-octanol and R(−)-2-octanol. Odor stimuli were delivered for at least 0.35 s or until the rat left the center port (max. 3 s). Odor mixtures were fixed at seven concentration ratios, which we adjusted to match performance levels for each mixture ratio across animals, as described above for the rat olfactory experiment. One of the stimuli was a 50/50 ratio stimulus, for which the correct side was randomly assigned; those trials were removed from the analysis. After a variable odor sampling time, rats exited the stimulus port, which terminated odor delivery, and indicated their choice by entering one of two choice ports (left or right), with a maximum response time of 3 s after leaving the stimulus port. Choices were rewarded according to the dominant odor component in the mixture. For auditory stimuli, rats had to determine the side with the higher number of clicks in binaural streams of clicks. Auditory stimuli were Poisson-distributed click trains played binaurally from two speakers placed outside the behavioral box for a fixed duration of 0.25 s, as described above for the rat auditory experiment. Exiting the stimulus port during the pre-stimulus delay or during the stimulus period (first 0.35 s for olfactory trials) was followed by white noise and a time-out of 3–7 s. Choices were rewarded according to the side with the higher number of clicks between the left and right click trains. Reward timing was sampled from a truncated exponential distribution: minimum reward delay 0.6 s, maximum delay 8 s, and decay time constant 1.5 s.

Mouse auditory experiment

Mice were water restricted for a week (administered 1.2 ml of water in a single session per day), handled for 2 days, and then gradually shaped for 5–7 sessions to the contingencies of a two-alternative choice paradigm; they were subsequently trained to discriminate two complex stimuli before 6 morphs of those stimuli were introduced. Trials were self-initiated upon the breaking of an infrared beam by a nose poke into the center port of three adjacent ports. Once a mouse remained in the center port for over 0.2 s, it was presented with one of two complex tones, following a 0.2–0.5 s delay (uniformly distributed). Auditory cues were presented until the mouse exited the center port. If a mouse entered the correct side port within 4 s, a 4 µl water drop was delivered from gravity-fed reservoirs regulated by solenoid valves. Trials in which mice did not remain in the center port long enough to elicit a cue were not considered valid trials and are not represented in our analyses. During the training period only, error trials were followed by a progressively increasing 3–10 s time-out in order to prevent rapid guessing. During initial training of the task, two complex tones were used. These comprised three tones centered on 3 kHz and three tones centered on 7.5 kHz, with all components sharing a base frequency of 1.5 kHz. These two training tones are described as 0 and 100 (%A) stimuli, respectively. In the perceptual decision-making task, each of the 36 complex tones varied in the balance of the 6 components of tones A and B. To vary discrimination difficulty, we varied the amplitude ratio of the two spectral peaks (3 kHz and 7.5 kHz). Morph tones comprised six sets of auditory stimuli, described here as percentages of a high- and low-frequency complex tone (morph A and morph B, respectively). Each of the six sets comprised six similar stimuli, with percentages in terms of morph A of 5–10, 25–30, 35–40, 60–65, 70–75, and 90–95. We thereby challenged mice with a variety of 36 stimuli, and were able to pool members of each stimulus set for the analysis. Stimuli were delivered through generic electromagnetic dynamic speakers located on each side of the behavior chamber.

Mouse visual experiment

Mice were trained in a two-alternative forced choice visual detection task (Burgess et al., 2017). After the mouse kept the wheel still for at least 0.5 s, a sinusoidal grating stimulus of varying contrast appeared on either the left or right monitor, together with a brief tone (0.1 s, 12 kHz) indicating that the trial had started. The mouse could immediately report its decision by turning the wheel located underneath its forepaws. Wheel movements drove the stimulus on the monitor, and a reward was delivered if the stimulus reached the center of the middle monitor (a successful trial), whereas a 2 s white noise sound was played if the stimulus reached the center of either the left or right monitor (an error trial). The inter-trial interval was set to 3 s. As previously reported, well-trained mice often reported their decisions using fast, stereotypical wheel movements (Burgess et al., 2017). After 2–3 weeks of training, the task typically included 6 or 7 levels of contrast (three on the left, three on the right), which were presented in random order across trials with equal probability.

Human visual experiment

The experiments are described in detail in Urai et al. (2017). Observers performed a two-interval forced choice motion coherence discrimination task at constant luminance. Specifically, observers judged the difference in motion coherence between two successively presented random dot kinematograms (RDKs): a constant reference stimulus (70% motion coherence) and a test stimulus (varying motion coherence levels specified below). A beep indicated the onset of each (test and reference) stimulus. The intervals before, in between, and after (until the inter-trial interval) these two task-relevant stimuli had variable duration and contained randomly moving dots. After offset of the test stimulus, observers had 3 s to report their judgment (button press with left or right index finger, counterbalanced across observers). After a variable interval (1.5–2.5 s), a feedback tone was played. Dot motion was stopped 2–2.5 s after feedback, with stationary dots indicating the inter-trial interval, during which observers were allowed to blink their eyes. Observers self-initiated the next trial by button press. The difference between motion coherence of test and reference was taken from three sets: easy (2.5, 5, 10, 20, 30), medium (1.25, 2.5, 5, 10, 30) and hard (0.625, 1.25, 2.5, 5, 20). All observers started with the easy set and were switched to the medium set when their psychophysical thresholds (70% accuracy) dropped below 15%, and to the hard set when thresholds dropped below 10%, in a given session. Motion coherence differences were randomly shuffled within each block.

Acknowledgements

This work was supported by the Wellcome Trust (grants 106101 and 213465 to AL). NU was funded by NIH R01MH110404 and a Mind Brain Behavior (MBB) Faculty Award. AK was funded by NIH R01MH097061 and R01DA038209. MC was funded by the Wellcome Trust (205093). AEU was funded by a German Academic Exchange Service (DAAD) doctoral scholarship. THD was funded by German Research Foundation (DFG) grants DO 1240/2-1 and DO 1240/3-1. ST and EH were supported by RIKEN-CBS, the JPB Foundation, and HHMI.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Armin Lak, Email: armin.lak@dpag.ox.ac.uk.

Adam Kepecs, Email: akepecs@wustl.edu.

Emilio Salinas, Wake Forest School of Medicine, United States.

Michael J Frank, Brown University, United States.

Funding Information

This paper was supported by the following grants:

  • Wellcome 106101 to Armin Lak.

  • National Institutes of Health R01MH110404 to Naoshige Uchida.

  • Wellcome 205093 to Matteo Carandini.

  • Deutsche Forschungsgemeinschaft DO 1240/2-1 to Tobias H Donner.

  • RIKEN to Emily Hueske, Susumu Tonegawa.

  • JPB Foundation to Emily Hueske, Susumu Tonegawa.

  • Howard Hughes Medical Institute to Emily Hueske, Susumu Tonegawa.

  • German Academic Exchange Service to Anne E Urai.

  • National Institutes of Health R01DA038209 to Adam Kepecs.

  • Harvard University Mind Brain Behavior (MBB) Faculty Award to Naoshige Uchida.

  • Deutsche Forschungsgemeinschaft DO 1240/3-1 to Tobias H Donner.

  • Wellcome 213465 to Armin Lak.

  • National Institutes of Health R01MH097061 to Adam Kepecs.

Additional information

Competing interests

Reviewing editor, eLife.

Reviewing editor, eLife.

No competing interests declared.

Author contributions

Conceptualization, Data curation, Software, Formal analysis, Resources, Funding acquisition, Investigation, Visualization, Methodology.

Data curation, Investigation, Methodology.

Data curation, Investigation, Methodology.

Data curation, Investigation, Methodology.

Data curation, Investigation, Methodology.

Data curation, Investigation, Methodology.

Resources, Supervision, Funding acquisition.

Resources, Supervision, Funding acquisition.

Resources, Supervision, Funding acquisition.

Conceptualization, Resources, Supervision, Funding acquisition.

Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition.

Ethics

Human subjects: The ethics committee at the University of Amsterdam approved the study (project ID: 2014-BC-3376), and all observers gave their informed consent.

Animal experimentation: The experimental procedures were approved by institutional committees at Cold Spring Harbor Laboratory (for experiments on rats) and MIT and Harvard University (for mouse auditory experiments), and were in accordance with National Institutes of Health standards (project ID: 18-14-11-08-1). Experiments on mouse visual decisions were approved by the Home Office of the United Kingdom (license 70/8021). Experiments in humans were approved by the ethics committee at the University of Amsterdam (project ID: 2014-BC-3376).

Additional files

Transparent reporting form

Data availability

The human dataset used in this study is available at https://doi.org/10.6084/m9.figshare.4300043.

The following previously published dataset was used:

Urai AE, Braun A, Donner THD. 2018. Pupil-linked arousal is driven by decision uncertainty and alters serial choice bias. Figshare.

References

  1. Abrahamyan A, Silva LL, Dakin SC, Carandini M, Gardner JL. Adaptable history biases in human perceptual decisions. PNAS. 2016;113:E3548–E3557. doi: 10.1073/pnas.1518786113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Akaishi R, Umeda K, Nagase A, Sakai K. Autonomous mechanism of internal choice estimate underlies decision inertia. Neuron. 2014;81:195–206. doi: 10.1016/j.neuron.2013.10.018. [DOI] [PubMed] [Google Scholar]
  3. Akrami A, Kopec CD, Diamond ME, Brody CD. Posterior parietal cortex represents sensory history and mediates its effects on behaviour. Nature. 2018;554:368–372. doi: 10.1038/nature25510. [DOI] [PubMed] [Google Scholar]
  4. Braun A, Urai AE, Donner TH. Adaptive history biases result from confidence-weighted accumulation of past choices. The Journal of Neuroscience. 2018;38:2418–2429. doi: 10.1523/JNEUROSCI.2189-17.2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Britten KH, Shadlen MN, Newsome WT, Movshon JA. The analysis of visual motion: a comparison of neuronal and psychophysical performance. The Journal of Neuroscience. 1992;12:4745–4765. doi: 10.1523/JNEUROSCI.12-12-04745.1992. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Brunton BW, Botvinick MM, Brody CD. Rats and humans can optimally accumulate evidence for decision-making. Science. 2013;340:95–98. doi: 10.1126/science.1233912. [DOI] [PubMed] [Google Scholar]
  7. Burgess CP, Lak A, Steinmetz NA, Zatka-Haas P, Bai Reddy C, Jacobs EAK, Linden JF, Paton JJ, Ranson A, Schröder S, Soares S, Wells MJ, Wool LE, Harris KD, Carandini M. High-yield methods for accurate two-alternative visual psychophysics in head-fixed mice. Cell Reports. 2017;20:2513–2524. doi: 10.1016/j.celrep.2017.08.047. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Busse L, Ayaz A, Dhruv NT, Katzner S, Saleem AB, Schölvinck ML, Zaharia AD, Carandini M. The detection of visual contrast in the behaving mouse. Journal of Neuroscience. 2011;31:11351–11361. doi: 10.1523/JNEUROSCI.6689-10.2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Cho RY, Nystrom LE, Brown ET, Jones AD, Braver TS, Holmes PJ, Cohen JD. Mechanisms underlying dependencies of performance on stimulus history in a two-alternative forced-choice task. Cognitive, Affective, & Behavioral Neuroscience. 2002;2:283–299. doi: 10.3758/CABN.2.4.283. [DOI] [PubMed] [Google Scholar]
  10. Daw ND, O'Doherty JP, Dayan P, Seymour B, Dolan RJ. Cortical substrates for exploratory decisions in humans. Nature. 2006;441:876–879. doi: 10.1038/nature04766. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Daw ND, Doya K. The computational neurobiology of learning and reward. Current Opinion in Neurobiology. 2006;16:199–204. doi: 10.1016/j.conb.2006.03.006. [DOI] [PubMed] [Google Scholar]
  12. Dayan P, Daw ND. Decision theory, reinforcement learning, and the brain. Cognitive, Affective, & Behavioral Neuroscience. 2008;8:429–453. doi: 10.3758/CABN.8.4.429. [DOI] [PubMed] [Google Scholar]
  13. Drugowitsch J, Mendonça AG, Mainen ZF, Pouget A. Learning optimal decisions with confidence. PNAS. 2019;116:24872–24880. doi: 10.1073/pnas.1906787116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Dutilh G, van Ravenzwaaij D, Nieuwenhuis S, van der Maas HLJ, Forstmann BU, Wagenmakers E-J. How to measure post-error slowing: a confound and a simple solution. Journal of Mathematical Psychology. 2012;56:208–216. doi: 10.1016/j.jmp.2012.04.001. [DOI] [Google Scholar]
  15. Fan Y, Gold JI, Ding L. Ongoing, rational calibration of reward-driven perceptual biases. eLife. 2018;7:e36018. doi: 10.7554/eLife.36018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Fischer J, Whitney D. Serial dependence in visual perception. Nature Neuroscience. 2014;17:738–743. doi: 10.1038/nn.3689. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Fritsche M, Mostert P, de Lange FP. Opposite effects of recent history on perception and decision. Current Biology. 2017;27:590–595. doi: 10.1016/j.cub.2017.01.006. [DOI] [PubMed] [Google Scholar]
  18. Fründ I, Wichmann FA, Macke JH. Quantifying the effect of intertrial dependence on perceptual decisions. Journal of Vision. 2014;14:9. doi: 10.1167/14.7.9. [DOI] [PubMed] [Google Scholar]
  19. Glaze CM, Kable JW, Gold JI. Normative evidence accumulation in unpredictable environments. eLife. 2015;4:e08825. doi: 10.7554/eLife.08825. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Gold JI, Law CT, Connolly P, Bennur S. The relative influences of priors and sensory evidence on an oculomotor decision variable during perceptual learning. Journal of Neurophysiology. 2008;100:2653–2668. doi: 10.1152/jn.90629.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Green DM, Swets JA. Signal Detection Theory and Psychophysics. John Wiley and Sons; 1966. [Google Scholar]
  22. Hangya B, Sanders JI, Kepecs A. A mathematical framework for statistical decision confidence. Neural Computation. 2016;28:1840–1858. doi: 10.1162/NECO_a_00864. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Hernández A, Salinas E, García R, Romo R. Discrimination in the sense of flutter: new psychophysical measurements in monkeys. The Journal of Neuroscience. 1997;17:6391–6400. doi: 10.1523/JNEUROSCI.17-16-06391.1997. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Hirokawa J, Vaughan A, Masset P, Ott T, Kepecs A. Frontal cortex neuron types categorically encode single decision variables. Nature. 2019;576:446–451. doi: 10.1038/s41586-019-1816-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Hwang EJ, Dahlen JE, Mukundan M, Komiyama T. History-based action selection bias in posterior parietal cortex. Nature Communications. 2017;8:1242. doi: 10.1038/s41467-017-01356-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Kepecs A, Uchida N, Zariwala HA, Mainen ZF. Neural correlates, computation and behavioural impact of decision confidence. Nature. 2008;455:227–231. doi: 10.1038/nature07200. [DOI] [PubMed] [Google Scholar]
  27. Lak A, Nomoto K, Keramati M, Sakagami M, Kepecs A. Midbrain dopamine neurons signal belief in choice accuracy during a perceptual decision. Current Biology. 2017;27:821–832. doi: 10.1016/j.cub.2017.02.026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Lak A, Okun M, Moss MM, Gurnani H, Farrell K, Wells MJ, Reddy CB, Kepecs A, Harris KD, Carandini M. Dopaminergic and prefrontal basis of learning from sensory confidence and reward value. Neuron. 2019;105:700–711. doi: 10.1016/j.neuron.2019.11.018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Lee D, Seo H, Jung MW. Neural basis of reinforcement learning and decision making. Annual Review of Neuroscience. 2012;35:287–308. doi: 10.1146/annurev-neuro-062111-150512. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Lueckmann JM, Macke JH, Nienborg H. Can serial dependencies in choices and neural activity explain choice probabilities? The Journal of Neuroscience. 2018;38:3495–3506. doi: 10.1523/JNEUROSCI.2225-17.2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Luu L, Stocker AA. Post-decision biases reveal a self-consistency principle in perceptual inference. eLife. 2018;7:e33334. doi: 10.7554/eLife.33334. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Maljkovic V, Nakayama K. Priming of pop-out: I. Role of features. Memory & Cognition. 1994;22:657–672. doi: 10.3758/BF03209251. [DOI] [PubMed] [Google Scholar]
  33. Marcos E, Pani P, Brunamonti E, Deco G, Ferraina S, Verschure P. Neural variability in premotor cortex is modulated by trial history and predicts behavioral performance. Neuron. 2013;78:249–255. doi: 10.1016/j.neuron.2013.02.006. [DOI] [PubMed] [Google Scholar]
  34. Mendonça AG, Drugowitsch J, Vicente MI, DeWitt E, Pouget A, Mainen ZF. The impact of learning on perceptual decisions and its implication for speed-accuracy tradeoffs. bioRxiv. 2018 doi: 10.1101/501858. [DOI] [PMC free article] [PubMed]
  35. Pouget A, Drugowitsch J, Kepecs A. Confidence and certainty: distinct probabilistic quantities for different goals. Nature Neuroscience. 2016;19:366–374. doi: 10.1038/nn.4240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Purcell BA, Kiani R. Neural mechanisms of post-error adjustments of decision policy in parietal cortex. Neuron. 2016;89:658–671. doi: 10.1016/j.neuron.2015.12.027. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Rao RP. Decision making under uncertainty: a neural model based on partially observable markov decision processes. Frontiers in Computational Neuroscience. 2010;4:146. doi: 10.3389/fncom.2010.00146. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Samejima K, Ueda Y, Doya K, Kimura M. Representation of action-specific reward values in the striatum. Science. 2005;310:1337–1340. doi: 10.1126/science.1115270. [DOI] [PubMed] [Google Scholar]
  39. Sanders JI, Hangya B, Kepecs A. Signatures of a statistical computation in the human sense of confidence. Neuron. 2016;90:499–506. doi: 10.1016/j.neuron.2016.03.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Sanders JI, Kepecs A. Choice ball: a response interface for two-choice psychometric discrimination in head-fixed mice. Journal of Neurophysiology. 2012;108:3416–3423. doi: 10.1152/jn.00669.2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Sanders JI, Kepecs A. A low-cost programmable pulse generator for physiology and behavior. Frontiers in Neuroengineering. 2014;7:43. doi: 10.3389/fneng.2014.00043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Sollich P. Bayesian methods for support vector machines: evidence and predictive class probabilities. Machine Learning. 2002;46:21–52. doi: 10.1023/A:1012489924661. [DOI] [Google Scholar]
  43. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. MIT press; 1998. [Google Scholar]
  44. Tai LH, Lee AM, Benavidez N, Bonci A, Wilbrecht L. Transient stimulation of distinct subpopulations of striatal neurons mimics changes in action value. Nature Neuroscience. 2012;15:1281–1289. doi: 10.1038/nn.3188. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Tsunada J, Cohen Y, Gold JI. Post-decision processing in primate prefrontal cortex influences subsequent choices on an auditory decision-making task. eLife. 2019;8:e46770. doi: 10.7554/eLife.46770. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Uchida N, Mainen ZF. Speed and accuracy of olfactory discrimination in the rat. Nature Neuroscience. 2003;6:1224–1229. doi: 10.1038/nn1142. [DOI] [PubMed] [Google Scholar]
  47. Urai AE, Braun A, Donner TH. Pupil-linked arousal is driven by decision uncertainty and alters serial choice bias. Nature Communications. 2017;8:e14637. doi: 10.1038/ncomms14637. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Wichmann FA, Hill NJ. The psychometric function: I. fitting, sampling, and goodness of fit. Perception & Psychophysics. 2001;63:1293–1313. doi: 10.3758/BF03194544. [DOI] [PubMed] [Google Scholar]
  49. Yu AJ, Cohen JD. Sequential effects: superstition or rational behavior? Advances in Neural Information Processing Systems; 2008. pp. 1873–1880. [PMC free article] [PubMed] [Google Scholar]
  50. Zariwala HA, Kepecs A, Uchida N, Hirokawa J, Mainen ZF. The limits of deliberation in a perceptual decision task. Neuron. 2013;78:339–351. doi: 10.1016/j.neuron.2013.02.010. [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision letter

Editor: Emilio Salinas1
Reviewed by: Emilio Salinas2, Carlos D Brody3, Long Ding4

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This study investigates how past choice history influences decisions that, in principle, should be guided just by currently available sensory information. It combines two classic approaches, perceptual decision making and reinforcement learning, to show that the success of prior choices and the quality of the perceptual information guiding them are automatically tracked, and that both factors are used to bias upcoming choices that are difficult, i.e., those for which the quality of the sensory information is poor and the subject is largely guessing. The effect is robust across sensory modalities, species, task details, and labs. It explains an important source of variance in behavior.

Decision letter after peer review:

Thank you for sending your article entitled "Confidence-guided updating of choice bias during perceptual decisions is a widespread behavioral phenomenon" for peer review at eLife. Your article has been evaluated by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Michael Frank as the Senior Editor.

All reviewers found the work interesting and appealing. However, some potentially serious issues were identified that could undermine the main conclusions. The full set of recommendations is below. Please focus your revisions on points 1, 2, and 4, which are most critical.

Essential revisions:

1) The first major concern is in regard to a claim made throughout the paper, namely that how easy or hard the current stimulus is affects the strength of the modulation. This is first presented in Figures 1F and 1G, and then repeated in almost all the figures. However, the measure presented in figures of the style of Figure 1F and summarized in Figure 1G is the % change in responses (given a particular current and previous stimulus, and relative to the overall average for that stimulus). This measure suffers from a large problem that, unfortunately, appears to not be addressed (apologies if it was missed on reading): there is a ceiling effect, in the sense that when a choice is easy (near 100%) it cannot increase further, even if there were strong underlying plasticity. The ceiling applies preferentially to easy stimuli, and therefore will manifest as a difference between easy and hard stimuli. It is currently impossible to tell whether the data shown in support of the claim of "current stimulus strength modulates effect" is merely an artifact resulting from this ceiling issue, or an actual finding. It is very important that the authors clarify this issue.

2a) Another key claim is that only the model with a belief state could account for the results. But the data are not convincing. Let's start with a TD model, without learning of actions, merely learning the values VR and VL, and let's suppose the value of VR is modified only after R choices (and similarly for VL and L choices). Under these conditions (and assuming R-L symmetry) VR and VL will converge to the overall % correct. This will be more than 50%; for a typical smooth psychometric curve it might be 75% or 80%. Now move to a TDRL model that also learns actions: because there are more errors for hard stimuli, the error signal (VC – r) will, on average, be greater for hard stimuli than for easy stimuli. And therefore, in this situation, there will on average be greater plasticity after hard stimuli than after easy stimuli, entirely without a belief state. Could this not account for the authors' experimental data?

The situation and conclusion sketched here could be wrong. But readers are likely to think about it and wonder whether it undermines the authors' conclusions. Thus, explicitly addressing this argument, and whether it is qualitatively or quantitatively incorrect, would strengthen the manuscript.

2b) A related concern is that, if we understand correctly, the model produces only step-function psychometric curves: for a given stimulus s, the response is deterministic, and when VR = VL, the model would produce 100% correct behavior. This might seem unimportant, since the details of the shapes of psychometric curves are not the focus of the manuscript, and replacing the distribution-based model with one that uses noisy samples might seem a trivial change. However, distinguishing between hard and easy trials is central to the arguments in the manuscript, and the corresponding difference in error rates might easily become important (as in the situation sketched in point 2a above). So a model that produces smooth psychometric functions (as opposed to step-shaped) might be important after all (and at least cosmetically would be an obviously better match to the data).

2c) The authors posit that the subjects use an internal measure of decision confidence to update their decision policy. In support of this claim, they show that a qualitatively similar modulation can be produced by a temporal difference reinforcement learning (TDRL) model with prediction errors based on perceived stimulus strength. In this model, the prediction errors are used to update the stored value of each choice. We found this model design counterintuitive. It seems that an equivalent model could be constructed in which prior stimulus probabilities (pR and pL) are updated instead. This would be more consistent with the fact that the animal always receives the same amount of reward on rewarded trials, but may have uncertainty about exactly where to set the decision boundary leading to trial-to-trial updating. It would be helpful if the authors reformulated their model, or explained why reward values are being updated rather than reward probabilities.

2d) The Bayesian model that updates stimulus statistics seems to ignore which choice was made in the previous trial or whether it was correct. That is, there is a built-in handicap of no feedback, compared to the confidence-based models. How much does this "handicap" contribute to the poor match between model predictions and data in Figure 9A?

2e) Model predictions should be comparable to real data. In Figure 2, the plots for rats' average performance show three levels of previous stim (%A) for each direction, more or less evenly distributed. In Figure 3, however, previous stim (%A) are 20, 50 and another value very close to 50. Does the model performance depend on the stimulus strength used? The same stimulus strengths used in behavioral testing should be used for model predictions.

2f) Parameter values used to generate Figure 3 should be reported.

3) More detail and clarity are needed regarding the description of the basic phenomenon (e.g., sorting and analysis of the shown quantities). The Materials and methods section should include a brief section on how the psychophysical results were generated. That could include all the details of what was conditioned on what, as well as how the trials were divided into "Hard" and "Easy".

Many of the key figures depict "updating %" and "updating index." These terms should also be mathematically defined in the Materials and methods.

The equation used to fit the bias, lapse, sensitivity of psychometric curves should also be presented in the Materials and methods. These parameters are said to be "stable." What is the criterion for determining stability?

Also, the trial nomenclature and labels used in Figure 2 (e.g., "Next" and "Previous") were confusing.

Do subjects also show adjustments of sensitivity after correct trials? Would a change in sensitivity contribute to the observed confidence dependence (e.g., if the psychometric curves are shallower, the difference in choice might appear larger for difficult trials)?

4) What happens after errors? The results demonstrate an effect akin to a win-stay strategy but limited to "guesses" only. Is there also a trend toward the corresponding lose-switch strategy? That is, when the choice in trial n-1 is difficult and not rewarded, is the subject more likely to choose the alternative option in trial n (when such a decision is also difficult)? There is no obvious reason to expect that confidence-guided choice updating would not also happen after errors, but in any case, error trials should not simply be put aside. The authors should present an analysis of error trials and, if they do not see the effect predicted by the model, should propose an explanation for the inconsistency. Other work from the same first author (Lak et al., 2017) presented an alternative TDRL model without the "belief state," which seemed to have qualitatively similar results for correct trials but divergent results for errors. It seems that error trials are needed again in order to rule out this other model.

5) Related to this: "The choice updating remained statistically significant even after this correction." True, but the effect did seem to get a bit weaker. I imagine this is because the history effects are not limited to one trial in the past, but possibly more. If so, you would expect the effect to become stronger when the current difficult choice is preceded by two rewarded guesses made in the same direction. Whether the trend is weak or not, it could be compared to the model's prediction.

6) The transfer of choice updating across modalities is interesting. Comparing Figure 2, 4C, and 8D, it seems that the updating is larger on pure olfactory tasks and about the same for the pure auditory or mixed tasks (this is more obvious in Figure 10). I assume the rats have different sensitivity to olfactory and auditory stimuli. Then the difference seems to contradict the statement that "updating is guided by outcome expectations, rather than stimulus statistics". How much updating was there for trials in the mixed modality task but without modality switches?

7) Figure 10B and C should show scatterplots separately for each task. The authors' hypothesis is that updating is independent of stimulus statistics. If this is true, it makes sense to pool data across tasks. However, if the alternative is true, that is, if updating depends on certain properties of the sensory stimulus, there could be different relationships between updating and slope/lapse, which could be obscured by pooling across tasks. A more direct test would be to fit the model to one task and predict results for the other tasks.

8) In Introduction and Discussion, the authors seem to suggest that this phenomenon persists after training is completed or after the subjects performed the task for extended periods and thus may reflect some optimal strategy. Was the training specifically targeting the suboptimal bias? Or is it possible that the subjects just settled on a suboptimal strategy that satisfies the training requirements? It might be useful to clarify what criteria were used to deem these subjects "well-learned".

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your article "Confidence-guided updating of choice bias during perceptual decisions is a widespread behavioral phenomenon" for consideration by eLife. Your revised article has been reviewed by three peer reviewers, including Emilio Salinas as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Michael Frank as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Carlos D Brody (Reviewer #2); Long Ding (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This manuscript presents interesting, important data characterizing how past choice history influences decisions that, in principle, should be guided just by currently available sensory information. The results show that the success of prior choices and the quality of the perceptual information guiding them are automatically tracked, and that this information is used to bias upcoming choices that are difficult, i.e., for which the quality of the sensory information is poor and the subject is largely guessing. The work is significant because the phenomenon seems robust across sensory modalities, species, task details, and labs, and because the modeling results provide insight into the underlying associative mechanisms responsible for the biasing effects.

Essential revisions:

The paper is significantly improved, and the reviewers are now convinced that it contains important work that should be published. Nevertheless, the presentation of the data is still confusing, and some further clarifications would substantially strengthen the manuscript.

1) Explaining more in detail the fundamental difference between the standard TDRL model and the belief model would be important. Specifically, the updating that they generate during easy trials, and how that updating depends on reward, could be the critical reason why the standard TDRL model fails.

2) The control analysis that corrects for potential slow drifts in the internal categorical boundary may also be implicitly addressing a separate issue (whether the shifts in the psychometric curve depend on the previous trial outcome, i.e., rewarded vs. not rewarded). Making this distinction explicit would be helpful.

3) The zig-zag patterns of the main data plots can be confusing because sometimes it is unclear what matters, the slope, the discontinuity, or both. Some suggestions for modifying the plots are provided below. That and/or additional text to guide the reader to the relevant features of the data would be helpful.

4) Figure 4 is a very nice addition to the manuscript. However, the results do differ somewhat from the data and from those of the belief model. Does this simply reflect a fundamental, qualitative difference between the classifier and the other models? It would be helpful to clarify whether the classifier has parameters that can be adjusted.

Details about each of these points are provided below.

Reviewer #2:

1) Figure 3 TDRL model: In the previous round of review, we (the reviewers) presented an argument as to why the TDRL model would lead to updating that depended on previous trial difficulty. The authors replied that this model results "in updating which is independent of previous difficulty." I believe that response is simply not true: in Figure 3C left, one can clearly see a non-zero slope to the updating% on hard trials. (And Figure 3C right worries me no end: since in the TDRL model there are no slow side biases independent of outcome, I am unsettled by the correction for such biases changing the results of the model.) [Note that doing the zig-zag plots in the style suggested below for Figure 1G would focus the eye on that slope much better.] What am I supposed to focus on? That the slope is less than in the belief model? That the belief model reaches zero updating for the easiest previous trials?

I spent a lot of time trying to think this one through – if the TDRL model could indeed be described as accounting for the data, that would be a pretty bad hit at the heart of the manuscript – and eventually hit upon something that I think could help the authors. Assuming I'm not getting this wrong, perhaps the authors had already thought of this, but either way, the suggestion is that it might be clarifying to have it in the manuscript. Here's the idea: in the TDRL model, VL and VR converge onto the average reward given when the subject chooses those ports, where the average is over both correct and incorrect trials. In other words, VL and VR are the value of the port, averaged over all trials. But in the belief model, VL and VR converge onto the value of the reward when a reward is given. Not an average over all trials, but conditioned on the reward having been given. Thus if the reward r = 1, in the belief model, VL and VR will converge on 1. And that means that for very easy trials, the reward prediction error will be zero, and the updating will be zero. In contrast, in TDRL they will converge on something like 0.8; and thus even on very easy correct trials there will be a non-zero RPE and non-zero updating. I may have gotten this wrong, but if it is correct, there are two interesting things here: (a) the difference on what VL and VR converge to in the two cases, in the sense of one being reward averaged over errors and corrects, the other being average reward conditioned on trials being correct; (b) I believe the real difference between TDRL and belief is not that TDRL has a zero slope for updating versus previous stimulus (the non-zero slope is right there in Figure 3C). It is that TDRL will never have zero updating for the easiest stimuli, whereas the belief model will.

While on this topic: don't Figure 9E and 9H look more like TDRL in Figure 3 than like the belief model? Why are they being interpreted as supporting the belief model?

2) Subsection “Choice updating is not due to slow drift in choice side bias”: This confused me the second, the third, and the fourth time I read it. Eventually I realized that there may be two issues being treated simultaneously here. I think things would be a lot clearer if you separated them. Issue (a) is "are shifts in the psychometric curve contingent on the previous trial's outcome?" Issue (b) is "are there slow drifts in the decision boundary that would induce correlations across trials that would make one trial appear to depend on the previous one"? The thing that confused me is that to solve issue (a), the obvious and easiest thing is to compare two psychometric curves, both conditioned on a previous stimulus p, according to the previous trial's outcome, i.e., whether the previous trial was rewarded or not. That would be a really easy plot to make, understand, and interpret: if previous trial's outcome matters, it will be obvious. Why not add it to the paper?

Issue (b) is also interesting. The approach in Figure 2 addresses issue (b). If this section and figure were described as focused on issue (b), it would be a lot easier to understand.

3) Figure 3 model: the full model really needs to be fully explained, in the main text. Please use equations. In particular, while the sentence “Note that although the choice computation is deterministic, the same stimulus can produce left or right choices caused by fluctuations in the percept due to randomized trial-to-trial variation around the stimulus identity (Figure 3—figure supplement 1)” is a welcome addition, it is not enough. Please specify, in the main text, how pR and pL are computed. Note that the integral in subsection “TDRL model with stimulus belief state” needs to specify what you're integrating with respect to (it needs a "ds"), this would make it clear that pR is a function of ŝ, which is itself a random variable drawn anew on each trial. I suggest that you make this explicit in the equation, by writing the left-hand-side as pR(ŝ). (Note that I'm suggesting you bring this integral and some of the description into the main text.)

4) In the previous round, we requested substantial clarifications for panels of the type of Figure 1F and 1G. Even with the clarifications provided, I still find these panels hard to read.

– Figure 1F: this should be explained in a way that readers can understand without having to trawl through the Materials and methods. Here's my current understanding: (a) you plot the average psychometric curve; (b) you plot the psychometric curve conditioned on a particular previous trial stimulus p; (c) for each current stimulus c, you compute the vertical distance between those two curves, and that is what you call "updating".

Why not show this graphically directly, to make it easy for readers to understand? That is, something along the lines of: add a panel to Figure 1 where you show the average curve and the curve conditioned on p, add arrows pointing to the vertical differences between those two curves, and add an arrow from there to Figure 1F to indicate that these two particular curves and the vertical shift between them are what become column p in Figure 1F.

Among other things, this would make it obvious why, if the psychometric curves asymptote at 0% and 100%, updating is necessarily going to be small for easy current stimuli. Which is why I don't like the plot of Figure 1F so much: the eye gets drawn to the dominant pattern, which is that the top and bottom row are lighter than the middle rows. But that's the unimportant part, that's the "expected" part, as the authors now write. The important part is happening in the middle rows. Could the authors think of a display format that focuses the eye on that, on the middle rows, instead of the already-expected parts?

– Figure 1G: The zig-zag pattern confused me no end. What is it that I'm supposed to focus on here? The difference between easy and hard? The fact that the pattern is antisymmetrical? The slope on the hard stimuli?

It eventually dawned on me that the "A" response and "B" response are of course anti-symmetrical with each other. For a model which has no intrinsic side bias, this has to be true. And there appears to be no systematic, overall, side bias across the experimental rat data in the paper. So the zig and the zag are actually redundant with each other.

I would therefore suggest the following: collapse the two with each other (the zig and the zag), which gives you better statistics, and in addition focuses the eye on the important parts, not the antisymmetry. In other words, instead of plotting as a function of previous odor A% , plot (% updating towards correct side) as a function of |A% – 50|. By halving the x and y axes, that would also allow you to zoom in by 2x, so readers can see the data better. (An added suggestion would be to plot the easy trials in a light grey, to emphasize that it's in the hard trials that the action is.) And then you'd have a plot that, at a single glance, tells the reader "there is bigger updating for harder stimuli".

Reviewer #3:

The authors addressed my earlier concerns. But their new data raised a new concern:

I think it is good of the authors to try another class of models to explain the data. However, the predictions of the statistical classifier in Figure 4D differ from experimental data in Figure 1G in three aspects: (1) there appears to be a strong dependence on previous stimulus strength for easy choices; (2) perhaps as a consequence, there is a large jump in Updating% going from green to blue at %A = 50; and (3) the range of updating% is only about half of the experimental data. It is hard to judge if these differences represent fundamental deficits of the model or just wrong parameterization. Because this model is not as intuitive as the RL model in Figure 3B, it would be helpful if the authors can expand this section (or add a supplemental figure) to give the readers a sense of how varying each model parameter changes the predictions.

eLife. 2020 Apr 15;9:e49834. doi: 10.7554/eLife.49834.sa2

Author response


Essential revisions:

1) The first major concern is in regard to a claim made throughout the paper, namely that how easy or hard the current stimulus is affects the strength of the modulation. This is first presented in Figures 1F and 1G, and then repeated in almost all the figures. However, the measure presented in figures of the style of Figure 1F and summarized in Figure 1G is the% change in responses (given a particular current and previous stimulus, and relative to the overall average for that stimulus). This measure suffers from a large problem that, unfortunately, appears to not be addressed (apologies if it was missed on reading): there is a ceiling effect, in the sense that when a choice is easy (near 100%) it cannot increase further, even if there were strong underlying plasticity. The ceiling applies preferentially to easy stimuli, and therefore will manifest as a difference between easy and hard stimuli. It is currently impossible to tell whether the data shown in support of the claim of "current stimulus strength modulates effect" is merely an artifact resulting from this ceiling issue, or an actual finding. It is very important that the authors clarify this issue.

We agree with the reviewers that our core finding is the dependence of choices on the difficulty of previous trial. Indeed, this can be observed only when the “current” stimulus is difficult for the reason described by the reviewer. Psychometric curves have this property that bias appears as a shift of the curve with minimal effects for easy trials. In our revised manuscript we have made this point clear. We have clarified that when the current sensory evidence is strong it determines the choice independent of past trials. This is as expected and has been shown previously. However, when the current sensory evidence is weak, choices depend on the difficulty of previous stimuli (paragraph four in subsection “Perceptual decisions are systematically updated by past rewards and past sensory stimuli” and Discussion paragraph one).

Note that the absence of the effect of past trials when the “current” sensory evidence is strong is fully captured by our models. In the RL model, choices are computed by comparing the expected value of the L and R actions (QL and QR). Because QL = PL × VL and QR = PR × VR (the product of sensory confidence and learned reward), when the current sensory evidence in favor of the choice is strong (large P), the choice is largely determined by P and the contribution of past trials (stored in V) is minimal; hence there is no bias from past trials when current sensory evidence is strong.
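
As an illustration of this argument, the following minimal sketch (Python, with hypothetical noise and value parameters, not the fitted model) implements the choice rule described above: a noisy percept yields a belief P about the stimulus category, and the choice compares Q = P × V for the two actions. With a strong percept the belief term dominates, so an asymmetry in the learned values barely changes the choice; with an ambiguous percept the same asymmetry produces a clear bias.

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
sigma = 0.12   # assumed perceptual noise (fraction of the stimulus range)

def norm_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def choose(stimulus, V_L, V_R):
    """One trial: stimulus is the fraction of odor A; the category boundary is 0.5."""
    percept = stimulus + rng.normal(0.0, sigma)   # noisy internal percept
    p_R = norm_cdf((percept - 0.5) / sigma)       # belief that the category is R
    p_L = 1.0 - p_R
    Q_L, Q_R = p_L * V_L, p_R * V_R               # expected value of each action
    return 'R' if Q_R > Q_L else 'L'

def frac_R(stimulus, V_L, V_R, n=20000):
    return np.mean([choose(stimulus, V_L, V_R) == 'R' for _ in range(n)])

# Easy stimulus: the percept dominates and the value asymmetry barely matters.
print(frac_R(0.95, V_L=0.9, V_R=0.7))   # ~1.0
# Ambiguous stimulus: the same value asymmetry now biases the choice
# (roughly 0.57 of choices are 'R' when V_L = V_R; lower here, higher if reversed).
print(frac_R(0.52, V_L=0.9, V_R=0.7))
print(frac_R(0.52, V_L=0.7, V_R=0.9))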

2a) Another key claim is that only the model with a belief state could account for the results. But the data are not convincing. Let's start with a TD model, without learning of actions, merely learning the values VR and VL, and let's suppose the value of VR is modified only after R choices (and similarly for VL and L choices). Under these conditions (and assuming R-L symmetry) VR and VL will converge to the overall% correct. This will be more than 50%; for a typical smooth psychometric curve it might be 75% or 80%. Now move to a TDRL model that also learns actions: because there are more errors for hard stimuli, the error signal (VC – r) will, on average, be greater for hard stimuli than for easy stimuli. And therefore, in this situation, there will on average be greater plasticity after hard stimuli than after easy stimuli, entirely without a belief state. Could this not account for the authors' experimental data?

The situation and conclusion sketched here could be wrong. But readers are likely to think about it and wonder whether it undermines the authors' conclusions. Thus, explicitly addressing this argument, and whether it is qualitatively or quantitatively incorrect, would strengthen the manuscript.

In our revision we attempted to clarify this issue. We explain why models of the type described by the reviewers cannot fully capture the behavioral effect.

Although we agree that this argument appears intuitively compelling, in fact the suggested model does not account for the effect that we observed. Figure 3C shows the model that the reviewers described: a model that computes prediction errors by comparing the outcome with the average expectation (i.e. VL and VR converged over trials). These prediction errors are independent of the difficulty of the previous decision, resulting in updating that is independent of previous difficulty. In this model with two states (L and R), the value functions (VL and VR) indeed converge over trials to reflect average performance (say 80%): VL = 80% and VR = 80%. For the choice computation one can take two approaches: (1) base it entirely on the sensory evidence, i.e. compare PL and PR, which results in choices that are independent of past trials and hence cannot account for our observations; or (2) compare Q values, QL = PL × VL and QR = PR × VR, which allows past trials to influence the current choice. After the choice, the prediction error is the difference between the reward r and the average performance (converged over trials). Consider two different situations: the previous trial was a leftward rewarded trial, and it was either easy or difficult. In both cases the prediction errors (r – VL) are similar, and hence they update VL to a similar degree. The resulting choice bias in the next trial is thus largely independent of the past sensory difficulty (Figure 3C). The model with two states produces confidence-dependent learning only if it includes the belief about which state is occupied in the current trial (i.e. the main model shown in Figure 3A, B).
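
To make the contrast concrete, here is a small simulation (Python, hypothetical parameters; a sketch rather than the authors' exact implementation) that updates VL and VR with the two prediction-error rules discussed above. In the plain two-state TDRL rule the error is r − V(chosen), whereas in the belief-based rule it is r − P(chosen) × V(chosen); only the latter produces post-correct teaching signals that grow as the previous trial gets harder and shrink toward zero for the easiest trials.

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
sigma, alpha, n_trials = 0.12, 0.1, 50000
norm_cdf = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))

V = {'tdrl': [0.5, 0.5], 'belief': [0.5, 0.5]}   # [V_L, V_R] for each rule
deltas = {(rule, d): [] for rule in ('tdrl', 'belief') for d in ('easy', 'hard')}

for _ in range(n_trials):
    s = rng.choice([0.05, 0.35, 0.45, 0.55, 0.65, 0.95])   # fraction of odor A
    percept = s + rng.normal(0.0, sigma)
    p_R = norm_cdf((percept - 0.5) / sigma)                # belief that category is R
    p = [1.0 - p_R, p_R]
    for rule in ('tdrl', 'belief'):
        c = int(p[1] * V[rule][1] > p[0] * V[rule][0])     # pick side with larger Q
        r = float(c == (s > 0.5))                          # reward 1 if correct
        expected = V[rule][c] if rule == 'tdrl' else p[c] * V[rule][c]
        delta = r - expected                               # prediction error
        V[rule][c] += alpha * delta                        # update chosen side only
        if r == 1.0:                                       # log post-correct updates
            deltas[(rule, 'easy' if abs(s - 0.5) > 0.3 else 'hard')].append(delta)

print({k: round(float(np.mean(v)), 3) for k, v in deltas.items()})
print({k: [round(x, 2) for x in v] for k, v in V.items()})
# Expected pattern: TDRL values settle near the average accuracy and its
# post-correct errors are similar for easy and hard trials; belief-based values
# settle near the reward size, with near-zero errors after easy correct trials
# and larger errors after hard ones.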

We can also consider alternative models that produce different average expectations for different stimuli by representing stimuli across multiple states (rather than the two-state version considered above). Such a model learns the average value for each stimulus, i.e. the state values converge to reflect the average value of each level of sensory evidence. In this model, on each trial the model infers a single state, based on the internal sensory percept, and assigns the stimulus to it. The prediction error could be computed in two ways: either without access to this inferred state or with access to it. The first scenario makes prediction errors independent of past difficulty. The second scenario compares the reward with the value of the inferred state and thus produces confidence-dependent prediction errors. However, since the updates only affect the current state, no bias is expected for different (nearby) stimuli.

Note that one of our observations, the transfer of updating across sensory modalities, also constrains our models. Choice updating is transferred between modalities and hence models that include multiple states to store the value of each stimulus cannot account for this behavioral observation. Conversely, the model with two states stores past values into the left and right action value and can account for updating across modalities. Based on this observation we suggest that the reduced model with two states is more appropriate for our findings.

We apologize that our description was not sufficiently clear: it neither clearly conveyed the model illustrated in Figure 3C nor spelled out the necessity of the belief state. In the revised manuscript we have substantially expanded on this issue in the text (subsections “Belief-based reinforcement learning models account for choice updating” and “TDRL models without stimulus belief state”).

2b) A related concern is that, if we understand correctly, the model produces only step-function psychometric curves: for a given stimulus s, the response is deterministic, and when VR = VL, the model would produce 100% correct behavior. This might seem unimportant, since the details of the shapes of psychometric curves are not the focus of the manuscript, and replacing the distribution-based model with one that uses noisy samples might seem a trivial change. However, distinguishing between hard and easy trials is central to the arguments in the manuscript, and the corresponding difference in error rates might easily become important (as in the situation sketched in point (2a) above). So a model that produces smooth psychometric functions (as opposed to step-shaped) might be important after all (and at least cosmetically would be an obviously better match to the data).

We have clarified that due to sensory noise even a deterministic choice rule will produce smooth psychometric curves (subsection “Belief-based reinforcement learning models account for choice updating” paragraph three). We have also clarified that using a softmax rule would not substantially change our results. To illustrate this point, we also added a supplemental figure (Figure 3—figure supplement 1).

The choice computation is entirely deterministic. However, since the percept is drawn from a normal distribution in each trial, the same stimulus can produce left or right choices across trials. For instance, an external stimulus (s = 55%) could result in an internal percept (ŝ = 40%) in one trial and 57% in the next trial leading to left and right choices respectively (assuming that 50% is the decision boundary). Thus, when averaging across trials, the psychometric curve will be graded and not step-like. We now show that using a non-deterministic, e.g. softmax, rule for choice computation does not influence our core findings (Figure 3—figure supplement 1).
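
As a quick illustration of this point, the short simulation below (Python, with an assumed noise width rather than a fitted one) applies a hard, deterministic threshold to a noisy percept on every trial; averaging over trials nevertheless yields a graded psychometric curve.

import numpy as np

rng = np.random.default_rng(2)
sigma = 0.08                               # assumed width of the percept noise

for s in np.linspace(0.30, 0.70, 9):       # true stimulus, fraction of odor A
    percepts = s + rng.normal(0.0, sigma, size=10000)
    p_right = np.mean(percepts > 0.5)      # deterministic boundary at 50%
    print(f"stimulus {s:.2f} -> P(choose right) = {p_right:.2f}")
# The printed values rise smoothly from ~0 to ~1, even though every single
# trial uses a hard threshold on the noisy percept.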

2c) The authors posit that the subjects use an internal measure of decision confidence to update their decision policy. In support of this claim, they show that a qualitatively similar modulation can be produced by a temporal difference reinforcement learning (TDRL) model with prediction errors based on perceived stimulus strength. In this model, the prediction errors are used to update the stored value of each choice. We found this model design counterintuitive. It seems that an equivalent model could be constructed in which prior stimulus probabilities (pR and pL) are updated instead. This would be more consistent with the fact that the animal always receives the same amount of reward on rewarded trials, but may have uncertainty about exactly where to set the decision boundary leading to trial-to-trial updating. It would be helpful if the authors reformulated their model, or explained why reward values are being updated rather than reward probabilities.

The class of models which update the probabilities or the position of the decision boundary indeed accounts for our data. In fact, we considered including such models in our initial submission but decided to leave it out to avoid having too many models. We agree that it is valuable to reformulate our findings using such models, and we have added one such model, a statistical classifier, to the revised manuscript (new Figure 4 and accompanying text in subsection “On-line learning in margin-based classifiers explains choice updating”).

Note that we did not claim that the model presented is the only one that accounts for our observations. Instead we conclude that several classes of model could account for our data, as long as they adjust their choice strategy based on the confidence in receiving reward in the preceding trial. However, there is one observation that favors our RL model (which updates action value) over models that update sensory probabilities. We observed that learning transfers across trials even when the evidence comes from different sensory modalities.

This question taps into a fundamental and important issue: which classes of models can account for our behavioral observations? It is clear that a purely sensory-based model or a purely reward-based model cannot account for the data. Therefore, we can start with a sensory-based model and modify it so that the decision boundary is adjusted according to past trials (the reviewers’ suggestion), or start with a reward-based model and modify it so that it considers sensory uncertainty (i.e. the teaching signal reflects past confidence, as in our RL model). While various classes of models account for our data, they share one main computation: they adjust the degree of learning based on the statistical confidence in the accuracy of previous decisions. Bayesian learning in drift-diffusion models of decision making also makes similar predictions about confidence-dependent choice biases (Drugowitsch et al., 2019).

In our revised manuscript we have added an on-line Bayesian support vector machine model as a main figure (Figure 4) to show another class of models that could account for our data. This class of model is analogous to the ones suggested by the reviewers: it updates the decision boundary. We would like to keep our RL model as the main model as it might be easier for readers to understand the nature of learning in this class of models, and it also accounts for cross-modality effect in a simple manner.

We have also included a paragraph in the Discussion to clarify that a large class of models could account for our observations, as long as they include a trial-by-trial adjustment of choice strategy that scales with the confidence in obtaining the reward (subsection “Rewards induce choices bias in perceptual decisions”).
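
For intuition about the boundary-updating class of models mentioned above, the sketch below (Python) implements a generic online margin-based boundary update, in the spirit of a passive-aggressive rule; it is not the Bayesian support vector machine added as Figure 4, and all parameter values are made up. After a rewarded choice, the category boundary moves only when the margin of that choice was small, so low-confidence correct decisions shift the boundary toward repeating the rewarded choice while high-confidence ones leave it untouched.

import numpy as np

rng = np.random.default_rng(3)
sigma = 0.10      # assumed percept noise
eta   = 0.05      # assumed learning rate for the boundary
m0    = 0.15      # margin below which a rewarded trial still triggers an update
b     = 0.50      # decision boundary on the percept axis

shift_after = {'easy': [], 'hard': []}
for _ in range(20000):
    s = rng.choice([0.05, 0.40, 0.45, 0.55, 0.60, 0.95])   # fraction of odor A
    x = s + rng.normal(0.0, sigma)          # noisy percept
    choice_R = x > b                        # deterministic threshold rule
    old_b = b
    if choice_R == (s > 0.5):               # rewarded (correct) trial
        y = 1.0 if choice_R else -1.0
        margin = y * (x - b)                # distance of percept from the boundary
        if margin < m0:                     # low-confidence (small-margin) trial
            b -= eta * y * (m0 - margin)    # enlarge the margin, i.e. shift the
                                            # boundary toward repeating this choice
        shift_after['easy' if abs(s - 0.5) > 0.3 else 'hard'].append(abs(b - old_b))

print({k: round(float(np.mean(v)), 4) for k, v in shift_after.items()})
# Boundary shifts are essentially zero after easy rewarded trials and sizeable
# after hard ones, reproducing confidence-dependent choice updating.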

2d) The Bayesian model that updates stimulus statistics seems to ignore which choice was made in the previous trial or whether it was correct. That is, there is a built-in handicap of no feedback, compared to the confidence-based models. How much does this "handicap" contribute to the poor match between model predictions and data in Figure 9A?

We agree that this specific model ignores trial choice and reward and only updates stimulus statistics. We have added a new Bayesian model which includes these and does account for the updating effect (Figure 4). As such, in our revised manuscript we no longer present the model that reviewers questioned, and have removed the corresponding figure from the manuscript.

2e) Model predictions should be comparable to real data. In Figure 2, the plots for rats' average performance show three levels of previous stim (%A) for each direction, more or less evenly distributed. In Figure 3, however, previous stim (%A) are 20, 50 and another value very close to 50. Does the model performance depend on the stimulus strength used? The same stimulus strengths used in behavioral testing should be used for model predictions.

Model predictions do not depend on the stimulus levels used in the simulation. In our revised manuscript, we have plotted model predictions using 10 evenly distributed stimulus levels (Figure 3). In general, we note that the exact stimulus values in the model or in the behavioral data do not matter as long as they produce psychometric curves with graded levels of choice accuracy.

2f) Parameter values used to generate Figure 3 should be reported.

Parameter values are now reported in the text as well as the caption to Figure 3 and Figure 11. Thank you.

3) More detail and clarity are needed regarding the description of the basic phenomenon (e.g., sorting and analysis of the shown quantities). The Materials and methods section should include a brief section on how the psychophysical results were generated. That could include all the details of what was conditioned on what, as well as how the trials were divided into "Hard" and "Easy".

Many of the key figures depict "updating %" and "updating index." These terms should also be mathematically defined in the Materials and methods.

The equation used to fit the bias, lapse, sensitivity of psychometric curves should also be presented in the Materials and methods. These parameters are said to be "stable." What is the criterion for determining stability?

Also, the trial nomenclature and labels used in Figure 2 (e.g., "Next" and "Previous") were confusing.

We have included this information as requested. We have devoted a subsection in the Materials and methods to describe how psychometric plots are generated. We also defined the calculation of the updating index and included equations for psychometric fitting. These are included in subsection “Data analysis and psychometric fitting”. We also included a new figure (Figure 2—figure supplement 1) to visualize our psychometric measures and normalizations.

Do subjects also show adjustments of sensitivity after correct trials? Would a change in sensitivity contribute to the observed confidence dependence (e.g., if the psychometric curves are shallower, the difference in choice might appear larger for difficult trials)?

We have included an analysis on the effect of past sensory difficulty on the psychometric sensitivity (slope) in paragraph four of subsection “Perceptual decisions are systematically updated by past rewards and past sensory stimuli”. We did not observe any significant effect of past confidence on the psychometric slopes.

4) What happens after errors? The results demonstrate an effect akin to a win-stay strategy but limited to "guesses" only. Is there also a trend toward the corresponding lose-switch strategy? That is, when the choice in trial n-1 is difficult and not rewarded, is the subject more likely to choose the alternative option in trial n (when such decision is also difficult)? There is no obvious reason to expect that confidence-guided choice updating would not also happen after errors, but in any case, error trials should not simply be put aside. The authors should present an analysis of error trials and if they do not see the effect predicted by the model, should propose an explanation for the inconsistency. Other work from the same first author (Lak et al., 2017), presented an alternative TDRL model without the "belief state," which seemed to have qualitatively similar results for correct trials but divergent results for errors. It seems that error trials are needed again in order to rule out this other model.

In the revised manuscript we dedicate a small section (“Different strategies for choice updating after error trials due to different noise sources”) and a new figure (Figure 11) to the updating effect after error trials. Note that error trials are few compared to correct trials, making it hard to draw firm conclusions.

The challenge is the diversity of such post-error effects. To paraphrase Leo Tolstoy’s famous opening sentence of the novel Anna Karenina: all correct trials are alike; each incorrect trial is incorrect in its own way. Correct perceptual performance requires appropriate processing and evaluation of the stimulus. In contrast there are many processes that can lead to incorrect performance without consideration of the stimulus, from inattention to lack of motivation to exploration. Indeed, in our behavioral data, the post-error behavioral effects are diverse, usually even within the same data set.

Note that the model’s predictions about post-correct trials are largely independent of the model parameters (i.e. learning rate, sensory noise, possible memory/decision noise). In contrast, the model’s predictions for post-error trials depend on which parameters are the dominant source of noise, i.e. whether errors are related to perception or are internal and due to the memory of values. Briefly, for parameter settings in which choice noise is dominated by perception, errors cannot be systematically corrected and hence there is no net updating (Figure 11). Alternatively, when there is internal noise, such as that introduced by a high learning rate, systematic updating patterns are observed (Figure 11).

We find that some individual subjects match these specific patterns (Figure 11), but, as expected, the diversity of post-error updating goes beyond these two patterns. We believe this is a consequence of the modeling framework, which does not take into account many processes that could lead to errors, such as attentional lapses, lack of motivation or exploration. These are also briefly discussed in subsection “All correct trials are alike; each incorrect trial is incorrect in its own way”.

5) Related to this: "The choice updating remained statistically significant even after this correction." True, but the effect did seem to get a bit weaker. I imagine this is because the history effects are not limited to one trial in the past, but possibly more. If so, you would expect the effect to become stronger when the current difficult choice is preceded by two rewarded guesses made in the same direction. Whether the trend is weak or not, it could be compared to the model's prediction.

We thank the reviewers for this suggestion. We have performed this analysis and compared with the model prediction. Figure 3—figure supplement 2 shows this analysis. As the reviewers guessed, the effect is slightly stronger after 2 rewarded low-confidence trials in the same direction, consistent with the model. This has been added to paragraph three of subsection “Belief-based reinforcement learning models account for choice updating”.

6) The transfer of choice updating across modalities is interesting. Comparing Figure 2, 4C, and 8D, it seems that the updating is larger on pure olfactory tasks and about the same for the pure auditory or mixed tasks (this is more obvious in Figure 10). I assume the rats have different sensitivity to olfactory and auditory stimuli. Then the difference seems to contradict the statement that "updating is guided by outcome expectations, rather than stimulus statistics". How much updating was there for trials in the mixed modality task but without modality switches?

We thank the reviewer for this question; indeed, the data suggest that there is outcome-based updating in the space of action values, but also that this is not the entire story. In our revision we have included the updating effect for trials without a modality switch in the task with mixed modalities (Figure 9—figure supplement 1). Updating appears largest in consecutive olfactory trials, but it is present both within and across modality switches.

We have added the following paragraph to the Discussion to discuss this point:

“Trial-by-trial transfer of choice updating across sensory modalities provides some evidence that this form of learning is driven by comparing the decision outcome with a confidence-dependent expectation and performing updating in the space of action values. Updating across trials of different modalities appeared relatively smaller than across trials without a switch in sensory modality. This observation might point to the fact that in trial-by-trial learning animals follow a mixture of two strategies: one that updates values in the space of actions, and one that keeps track of stimulus identity and statistics across trials for such learning. The trade-off between these model-free and model-based forms of trial-by-trial learning during perceptual decisions remains to be explored in future studies.”

7) Figure 10B and C should show scatterplots separately for each task. The authors' hypothesis is that updating is independent of stimulus statistics. If this is true, it makes sense to pool data across tasks. However, if the alternative is true and updating depends on certain properties of the sensory stimulus, there could be different relationships between updating and slope/lapse in each task, which would be obscured by pooling across tasks. A more direct test would be to fit the model to one task and predict results for the other tasks.

In our revised manuscript we have separately plotted the relationship between the size of confidence-dependent updating and psychometric slope/lapse for each dataset, as suggested by reviewers (new Figure 10).

We would like to clarify that our aim in this analysis is to test if the updating is stronger/weaker in individuals with better psychometric curves (i.e. higher slope and lower lapse). We find a mild inverse relationship between lapse rate and the degree of updating. This suggests that updating is not due to a lack of understanding of the task. The aim of this analysis was not to test the relation between stimulus statistics and updating.

8) In Introduction and Discussion, the authors seem to suggest that this phenomenon persists after training is completed or after the subjects performed the task for extended periods and thus may reflect some optimal strategy. Was the training specifically targeting the suboptimal bias? Or is it possible that the subjects just settled on a suboptimal strategy that satisfies the training requirements? It might be useful to clarify what criteria were used to deem these subjects "well-learned".

The training did not target any specific bias: both animals and human observers were presented with random sequences of trials and were rewarded for correct choices. The goal of training in each dataset was to achieve high-quality and stable psychometric performance. We consider the subjects to be well-trained because the psychometric fitting parameters reported for each dataset show asymptotic and stable behavior (i.e. sharp psychometric slope, minimal overall bias and near-zero lapse). These have been reported for each dataset tested. We have also clarified in the Materials and methods that the training did not target or promote confidence-dependent bias (subsection “Rat olfactory experiment”).

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Essential revisions:

The paper is significantly improved, and the reviewers are now convinced that it contains important work that should be published. Nevertheless, the presentation of the data is still confusing, and some further clarifications would substantially strengthen the manuscript.

1) Explaining more in detail the fundamental difference between the standard TDRL model and the belief model would be important. Specifically, the updating that they generate during easy trials, and how that updating depends on reward, could be the critical reason why the standard TDRL model fails.

We have expanded on this issue and clarified the differences between the models by adding new figures and text (see below for the detailed answer).

2) The control analysis that corrects for potential slow drifts in the internal categorical boundary may also be implicitly addressing a separate issue (whether the shifts in the psychometric curve depend on the previous trial outcome, i.e., rewarded vs. not rewarded). Making this distinction explicit would be helpful.

We have addressed this and followed the reviewer’s advice to clearly dissociate these two issues. We have added new analyses, figures and text (see below for the detailed answer).

3) The zig-zag patterns of the main data plots can be confusing because sometimes it is unclear what matters, the slope, the discontinuity, or both. Some suggestions for modifying the plots are provided below. That and/or additional text to guide the reader to the relevant features of the data would be helpful.

We appreciate the suggestions for modifying these plots. We have now added a new figure panel to explain the calculations underlying the plot, have changed the colors in the plots for clarification, and have included additional text to clarify the analyses (see below for the detailed answer).

4) Figure 4 is a very nice addition to the manuscript. However, the results do differ somewhat from the data and from those of the belief model. Does this simply reflect a fundamental, qualitative difference between the classifier and the other models? It would be helpful to clarify whether the classifier has parameters that can be adjusted.

The difference simply reflects model parameters. We have now corrected this and expanded on this section for further clarification (see below for the detailed answer).

Details about each of these points are provided below.

Reviewer #2:

1) Figure 3 TDRL model: In the previous round of review, we (the reviewers) presented an argument as to why the TDRL model would lead to updating that depended on previous trial difficulty. The authors replied that this model results "in updating which is independent of previous difficulty." (Response to reviewers.) I believe that response is simply not true: in Figure 3C left, one can clearly see a non-zero slope to the updating% on hard trials. (And Figure 3C right worries me no end: since in the TDRL model there are no slow side biases independent of outcome, I am unsettled by the correction for such biases changing the results of the model.) [Note that doing the zig-zag plots in the style suggested below for Figure 1G would focus the eye on that slope much better.] What am I supposed to focus on? That the slope is less than in the belief model? That the belief model reaches zero updating for the easiest previous trials?

I spent a lot of time trying to think this one through, if the TDRL model could indeed be described as accounting for the data, that would be a pretty bad hit at the heart of the manuscript, and eventually hit upon something that I think could help the authors. Assuming I'm not getting this wrong, perhaps the authors had already thought of this, but either way, the suggestion is that it might be clarifying to have it in the manuscript. Here's the idea: in the TDRL model, VL and VR converge onto the average reward given when the subject chooses those ports, where the average is over both correct and incorrect trials. In other words, VL and VR are the value of the port, averaged over all trials. But in the belief model, VL and VR converge onto the value of the reward when a reward is given. Not an average over all trials, but conditioned on the reward having been given. Thus if the reward r = 1, in the belief model, VL and VR will converge on 1. And that means that for very easy trials, the reward prediction error will be zero, and the updating will be zero. In contrast, in TDRL they will converge on something like 0.8; and thus even on very easy correct trials there will be a non-zero RPE and non-zero updating. I may have gotten this wrong, but if it is correct, there are two interesting things here: (a) the difference on what VL and VR converge to in the two cases, in the sense of one being reward averaged over errors and corrects, the other being average reward conditioned on trials being correct; (b) I believe the real difference between TDRL and belief is not that TDRL has a zero slope for updating versus previous stimulus (the non-zero slope is right there in Figure 3C). It is that TDRL will never have zero updating for the easiest stimuli, whereas the belief model will.

While on this topic: don't Figure 9E and 9H look more like TDRL in Figure 3 than like the belief model? Why are they being interpreted as supporting the belief model?

We appreciate the reviewer’s suggestions and have followed the advice. In short, we agree with the reviewer’s insights. There are two differences between the belief-based model and the TDRL model. First, the values converge through learning to two different quantities: in the belief-based model they converge to just below the true size of the reward (i.e. the reward value, conditioned on the reward being given), whereas in the TDRL model they converge to the average performance (average reward over trials). This has implications for the size of the prediction error for the easiest correct choices, as the reviewer described. The second difference is that in the belief-based model the size of the prediction errors scales with the difficulty of the previous choice, whereas in the TDRL model the prediction errors show little scaling with choice difficulty (see below for further details). As such, regardless of the magnitude of the converged values, the teaching signals in the two models differ, and the teaching signals in the belief-based model cause different levels of learning in the next trial. We have expanded on this and spelled out these differences in the final paragraph of subsection “Belief-based reinforcement learning models account for choice updating”. We have also added new figure panels (Figure 3—figure supplement 1C,D) to further describe these differences.

The two effects, little to no updating for perfect accuracy (typically the easiest stimuli) and increasing updating for lower-accuracy choices, are both consequences of the model, so we are not trying to focus on one over the other. Note that in our model zero updating would be expected only for maximum-confidence choices with zero lapse rates; the easiest stimuli differ across datasets and often produce less than 100% accuracy, hence non-zero updating is expected. On the other hand, we agree that zero updating for zero-lapse-rate trials is a central point that likely hid the role of reinforcement learning in perceptual decisions.

Regarding the weak effect of previous trial difficulty in the TDRL model (Figure 3C) that the reviewer pointed out: our point is not only that this effect is weak compared to the main model. Rather, this effect vanishes after our normalization (Figure 3C, left). Consequently, after applying the normalization, the only model that still shows updating like our subjects is the belief-based model (Figure 3B, left). Also note that in all TDRL models there is some correlation across trials, because the stored values are by definition correlated across trials (similar to a drifting boundary in the signal detection model; see below). Hence it is important to evaluate the models after the normalization (right-hand panels of Figure 3). These issues have been clarified in subsection “Belief-based reinforcement learning models account for choice updating”.

How could a correlation across trials cause small difficulty-dependent effects on the updating in the TDRL model before applying the normalization (Figure 3C, left)? Let us first consider the prediction errors in this model after a correct choice. They are largely independent of the choice difficulty. In fact, if anything, they are even marginally smaller when a reward follows a difficult choice (Figure 3—figure supplement 1D), i.e. opposite to the belief-based model. This pattern in the prediction errors is due to the correlation of values across trials and to how stored values and current sensory evidence interact in computing the choice (Q = V × P). The model has a better chance of getting a difficult stimulus correct if it happens to be transiently in a regime in which the stored value of the correct side (i.e. the stimulus side) is higher than the stored value of the other side (hence a difference in Q values). In this situation, at the time of the reward prediction error computation, the reward is compared with the relatively high stored value of the chosen side, resulting in a marginally smaller positive prediction error (Figure 3—figure supplement 1D). Consequently, after this trial the value difference between L and R still persists. Now consider a subsequent trial that happens to be difficult and correct (with a similar stimulus/choice side to the previous trial). The model comes to this trial with the persisting value difference described above, and again has a good chance of getting the choice correct. This bias appears as an apparent updating. These value-difference-induced effects are more likely to occur for difficult stimuli (for easy stimuli, choices are dominated by the sensory stimulus), resulting in slightly larger updating for difficult stimuli and hence the weak effect in Figure 3C, left. These and other similar considerations highlight the fact that the models should be compared and evaluated after normalizations to minimize correlations across trials. These have been briefly described in subsections “Belief-based reinforcement learning models account for choice updating” and “TDRL models without stimulus belief state”.

Finally, the reviewer is correct that in different datasets we are seeing various levels of choice updating and those in Figure 9 show weaker effects compared to some of the other datasets. The effect for each individual subject has been quantified in Figure 10 indicating the strength of the effect across subjects. Despite the diversity, many subjects show a substantial level of choice updating even in the data of Figure 9. In addition, the dataset presented in Figure 9 is particularly interesting, because we show that while there is some level of choice updating that is transferred across sensory modalities, the strength of choice updating within each modality is somewhat stronger (Figure 9—figure supplement 1). We speculate that this observation might point to the fact that in trial-by-trial learning animals follow a mixture of two strategies: one which updates values in the space of actions, and one that keeps track of stimulus identity and statistics across trials for such learning. This was mentioned in the text.

2) Subsection “Choice updating is not due to slow drift in choice side bias”: This confused me the second, the third, and the fourth time I read it. Eventually I realized that there may be two issues being treated simultaneously here. I think things would be a lot clearer if you separated them. Issue (a) is "are shifts in the psychometric curve contingent on the previous trial's outcome?" Issue (b) is "are there slow drifts in the decision boundary that would induce correlations across trials that would make one trial appear to depend on the previous one"? The thing that confused me is that to solve issue (a), the obvious and easiest thing is to compare two psychometric curves, both conditioned on a previous stimulus p, according to the previous trial's outcome, i.e., whether the previous trial was rewarded or not. That would be a really easy plot to make, understand, and interpret: if previous trial's outcome matters, it will be obvious. Why not add it to the paper?

Issue (b) is also interesting. The approach in Figure 2 addresses issue (b). If this section and figure were described as focused on issue (b), it would be a lot easier to understand.

We have followed the reviewer’s advice. Addressing issue (a), we have added the example figure that the reviewer suggested (the psychometric curve conditioned on one previous stimulus, separated by previous outcome; Figure 1—figure supplement 1) and have added related text to the main text in subsection “Perceptual decisions are systematically updated by past rewards and past sensory stimuli”. This, together with the extensive analyses presented in Figure 1 (looking at post-correct trials only), demonstrates our main point: the effect of the previous reward on the next choice depends on the previous choice difficulty.

Addressing issue (b), we have extensively expanded the manuscript to explain the importance of considering correlations across trials. We agree that the normalization may not be entirely intuitive. Therefore, we include a new simulation to illustrate how a slow boundary shift alone can produce apparent trial-to-trial updating that is captured by the normalization method (Figure 2—figure supplement 1). We first show that a signal detection model simulation with a fixed boundary does not show any effect of previous trials, as expected (Figure 2—figure supplement 1A,B). We then show that using a drifting boundary in the simulation results in an apparent updating in the data (Figure 2—figure supplement 1C). Lastly, we show that our normalization method can correct for this (Figure 2—figure supplement 1D). We hope that this clarifies how the normalization works. We have also added text to explain this point (subsection “Choice updating is not due to slow drifts in choice side bias”).
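
In outline, the logic of that supplementary simulation can be reproduced with a few lines of code. The sketch below (Python, with made-up drift and noise parameters, not the authors' exact simulation) models a signal-detection observer whose category boundary performs a slow, outcome-independent random walk; although the observer never learns from feedback, consecutive ambiguous trials share the same transient bias, which shows up as apparent trial-to-trial updating unless it is normalized out.

import numpy as np

rng = np.random.default_rng(4)
sigma_percept = 0.10    # perceptual noise (assumed)
sigma_drift   = 0.02    # per-trial random walk of the boundary (assumed)

b = 0.5
stimuli, choices = [], []
for _ in range(100000):
    b = np.clip(b + rng.normal(0.0, sigma_drift), 0.3, 0.7)   # slow, bounded drift
    s = rng.choice([0.05, 0.45, 0.55, 0.95])                  # fraction of odor A
    x = s + rng.normal(0.0, sigma_percept)                    # noisy percept
    stimuli.append(s)
    choices.append(int(x > b))                                # fixed detection rule
stimuli, choices = np.array(stimuli), np.array(choices)

# Apparent "updating": on ambiguous current trials, the choice depends on the
# previous choice purely because both trials share the slowly drifting boundary.
hard_now = np.isin(stimuli[1:], [0.45, 0.55])
prev_R = choices[:-1] == 1
print("P(right | hard, previous R) =", round(choices[1:][hard_now & prev_R].mean(), 2))
print("P(right | hard, previous L) =", round(choices[1:][hard_now & ~prev_R].mean(), 2))
# The two probabilities differ even though no outcome-based learning occurs,
# which is why the session-level normalization described above is needed.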

3) Figure 3 model: the full model really needs to be fully explained, in the main text. Please use equations. In particular, while the sentence “Note that although the choice computation is deterministic, the same stimulus can produce left or right choices caused by fluctuations in the percept due to randomized trial-to-trial variation around the stimulus identity (Figure 3—figure supplement 1)” is a welcome addition, it is not enough. Please specify, in the main text, how pR and pL are computed. Note that the integral in subsection “TDRL model with stimulus belief state” needs to specify what you're integrating with respect to (it needs a "ds"), this would make it clear that pR is a function of ŝ, which is itself a random variable drawn anew on each trial. I suggest that you make this explicit in the equation, by writing the left-hand-side as pR(ŝ). (Note that I'm suggesting you bring this integral and some of the description into the main text.)

We have added more details, including equations about the main model in the text, as suggested, and have clarified the computation of pR and pL in the main text (see subsection “Belief-based reinforcement learning models account for choice updating”). We have also corrected the typo in the integral equation. Thank you.

4) In the previous round, we requested substantial clarifications for panels of the type of Figure 1F and 1G. Even with the clarifications provided, I still find these panels hard to read.

– Figure 1F: this should be explained in a way that readers can understand without having to trawl through the Materials and methods. Here's my current understanding: (a) you plot the average psychometric curve; (b) you plot the psychometric curve conditioned on a particular previous trial stimulus p; (c) for each current stimulus c, you compute the vertical distance between those two curves, and that is what you call "updating".

Why not show this graphically directly, to make it easy for readers to understand? That is, something along the lines of: add a panel to Figure 1 where you show the average curve and the curve conditioned on p, add arrows pointing to the vertical differences between those two curves, and add an arrow from there to panel f to indicate that these two particular curves and the vertical shift between them are what become column p in the panel in Figure 1F.

Among other things, this would make it obvious why, if the psychometric curves asymptote at 0% and 100%, updating is necessarily going to be small for easy current stimuli. Which is why I don't like the plot of Figure 1F so much: the eye gets drawn to the dominant pattern, which is that the top and bottom row are lighter than the middle rows. But that's the unimportant part, that's the "expected" part, as the authors now write. The important part is happening in the middle rows. Could the authors think of a display format that focuses the eye on that, on the middle rows, instead of the already-expected parts?

The reviewer’s understanding about the underlying calculations is completely correct. We have followed reviewer’s advice and added a graphical description of these calculations (Figure 1D) and we have also included additional sentences in the main text for further clarification (subsection “Perceptual decisions are systematically updated by past rewards and past sensory stimuli”).

– Figure 1G: The zig-zag pattern confused me no end. What is it that I'm supposed to focus on here? The difference between easy and hard? The fact that the pattern is antisymmetrical? The slope on the hard stimuli?

It eventually dawned on me that the "A" response and "B" response are of course anti-symmetrical with each other. For a model which has no intrinsic side bias, this has to be true. And there appears to be no systematic, overall, side bias across the experimental rat data in the paper. So the zig and the zag are actually redundant with each other.

I would therefore suggest the following: collapse the two with each other (the zig and the zag), which gives you better statistics, and in addition focuses the eye on the important parts, not the antisymmetry. In other words, instead of plotting as a function of previous odor A% , plot (% updating towards correct side) as a function of |A% – 50|. By halving the x and y axes, that would also allow you to zoom in by 2x, so readers can see the data better. (An added suggestion would be to plot the easy trials in a light grey, to emphasize that it's in the hard trials that the action is.) And then you'd have a plot that, at a single glance, tells the reader "there is bigger updating for harder stimuli". Right now the zig-zag makes me squint 40 times at it before I get it.

We appreciate the reviewer’s suggestion to fold the zig-zag plots across the stimulus axis. However, we think that the unfolded zig-zag plot represents a more appropriate signature of the updating effect. As the reviewer pointed out, among other things, it reveals potential side biases in the data. Moreover, it shows the asymmetry of the effect, i.e. a strong tendency to repeat the left or right action for nearby stimuli (in the middle of the x-axis), which would be obscured after folding. We have, however, accommodated part of the reviewer’s suggestion and changed the color of easy trials to grey and of difficult trials to black so that attention is focused on the difficult trials. Once again, we are most grateful for all the helpful advice. Also, please note that the subject-by-subject analyses we presented (Figure 10) are similar to the reviewer’s suggestion, as in that figure we fold the zig-zag plots and use them for statistics.

Reviewer #3:

The authors addressed my earlier concerns. But their new data raised a new concern:

I think it is good of the authors to try another class of models to explain the data. However, the predictions of the statistical classifier in Figure 4D differ from experimental data in Figure 1G in three aspects: (1) there appears to be a strong dependence on previous stimulus strength for easy choices; (2) perhaps as a consequence, there is a large jump in Updating% going from green to blue at %A = 50; and (3) the range of updating% is only about half of the experimental data. It is hard to judge if these differences represent fundamental deficits of the model or just wrong parameterization. Because this model is not as intuitive as the RL model in Figure 3B, it would be helpful if the authors can expand this section (or add a supplemental figure) to give the readers a sense of how varying each model parameter changes the predictions.

We thank the reviewer for pointing out this apparent inconsistency and apologize for our insufficient explanation. The differences arose from a few model parameters, i.e. the rate at which the weight vector is updated and the level of sensory noise in the stimuli. For these parameters, we previously used the quantities from the main model. However, it is straightforward to achieve updating results much more similar to the behavioral data by adjusting these parameters. In our revised manuscript, we have corrected the figure to show the results of our latest simulation. We have included the values of the parameters used and also expanded the model description in the Materials and methods section.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Citations

    1. Urai AE, Braun A, Donner THD. 2018. Pupil-linked arousal is driven by decision uncertainty and alters serial choice bias. Figshare.

    Supplementary Materials

    Transparent reporting form

    Data Availability Statement

    The human dataset used in this study is available at https://doi.org/10.6084/m9.figshare.4300043.

    The following previously published dataset was used:

    Urai AE, Braun A, Donner THD. 2018. Pupil-linked arousal is driven by decision uncertainty and alters serial choice bias. Figshare.

