eLife. 2023 Feb 14;12:e64978. doi: 10.7554/eLife.64978

Strategically managing learning during perceptual decision making

Javier Masís 1,2, Travis Chapman 2, Juliana Y Rhee 1,2, David D Cox 1,2, Andrew M Saxe 3
Editors: Valentin Wyart4, Michael J Frank5
PMCID: PMC9928425  PMID: 36786427

Abstract

Making optimal decisions in the face of noise requires balancing short-term speed and accuracy. But a theory of optimality should account for the fact that short-term speed can influence long-term accuracy through learning. Here, we demonstrate that long-term learning is an important dynamical dimension of the speed-accuracy trade-off. We study learning trajectories in rats and formally characterize these dynamics in a theory expressed as both a recurrent neural network and an analytical extension of the drift-diffusion model that learns over time. The model reveals that choosing suboptimal response times to learn faster sacrifices immediate reward, but can lead to greater total reward. We empirically verify predictions of the theory, including a relationship between stimulus exposure and learning speed, and a modulation of reaction time by future learning prospects. We find that rats’ strategies approximately maximize total reward over the full learning epoch, suggesting cognitive control over the learning process.

Research organism: Rat

Introduction

Optimal behavior in decision making is frequently defined as maximization of reward over time (Gold and Shadlen, 2002), and this requires balancing the speed and accuracy of one’s choices (Bogacz et al., 2006). For example, imagine you are given a multiple-choice quiz on an esoteric topic with which you are familiar, such as behavioral neuroscience or cognitive psychology, and rewarded for every correct answer. In balancing speed and accuracy, you should spend some time on each question to ensure you get it right. Now imagine that you are given a different quiz on an esoteric topic with which you are not familiar, such as low Reynolds number hydrodynamics or underwater basket weaving. In balancing speed and accuracy, you should now guess on as many questions as you can as quickly as you can in order to maximize reward. The ideal balance of speed and accuracy differs considerably in the cases of high and low competence. However, there is an important additional dynamical aspect to consider: competence can change as a function of experience through learning. For instance, taking the hydrodynamics quiz enough times, you might start to get the hang of it, by going slow enough that you can remember questions and their associated answers, rather than guessing as quickly as you can. Given these almost opposing normative strategies for high and low competence, how does one effectively move from low competence to high competence? In other words, how does an agent strategically manage decision making in light of learning?

In this study, we formalize this problem in the context of a two-choice perceptual decision making task in rodents and simulated agents. Perceptual decisions, in particular two-choice decisions, allow us to leverage one of the most prolific decision making models, the drift-diffusion model (DDM) (Ratcliff, 1978) (and the considerable analytical dissections of it; Bogacz et al., 2006), and one of the most prolific paradigms captured by it, the speed-accuracy trade-off (SAT), as a measurement of optimal behavior (i.e. maximization of reward per unit time) (Woodworth, 1899; Henmon, 1911; Garrett, 1922; Pew, 1969; Pachella, 1974; Wickelgren, 1977; Ruthruff, 1996; Ratcliff and Rouder, 1998; Gold and Shadlen, 2002; Bogacz et al., 2006; Bogacz et al., 2010; Heitz and Schall, 2012; Heitz, 2014; Rahnev and Denison, 2018).

Studies of the SAT have focused on how the brain may solve it (Gold and Shadlen, 2002; Roitman and Shadlen, 2002), what the optimal solution is (Bogacz et al., 2006), and whether agents can indeed manage it (Simen et al., 2006; Balci et al., 2011a; Simen et al., 2009; Bogacz et al., 2010; Drugowitsch et al., 2014; Drugowitsch et al., 2015; Manohar et al., 2015). Though most work in this area has taken place in humans and non-human primates, several studies have established the presence of a SAT in rodents (Uchida and Mainen, 2003; Abraham et al., 2004; Rinberg et al., 2006; Reinagel, 2013a; Reinagel, 2013b; Kurylo et al., 2020). The broad conclusion of much of this literature is that after extensive training, many subjects come close to optimal performance (Simen et al., 2009; Bogacz et al., 2010; Balci et al., 2011b; Zacksenhouse et al., 2010; Starns and Ratcliff, 2010; Holmes and Cohen, 2014; Drugowitsch et al., 2014; Drugowitsch et al., 2015). When faced with deviations from optimality, several hypotheses have been proposed, including error avoidance, poor internal estimates of time, and a minimization of the cognitive cost associated with an optimal strategy (Maddox and Bohil, 1998; Bogacz et al., 2006; Zacksenhouse et al., 2010).

Past studies have shown how agents behave after reaching steady-state performance (Simen et al., 2009; Starns and Ratcliff, 2010; Bogacz et al., 2010; Zacksenhouse et al., 2010; Balci et al., 2011b; Balci et al., 2011a; Drugowitsch et al., 2014; Drugowitsch et al., 2015), but relatively less attention has been paid to how agents learn to approach near-optimal behavior (but see Law and Gold, 2009; Balci et al., 2011b; Drugowitsch et al., 2019). While maximizing instantaneous reward rate is a sensible goal when the task is fully mastered, it is less clear that this objective is appropriate during learning.

Here, we set out to understand how agents manage the SAT during learning by studying the learning trajectory of rats and simulated agents in a free-response two-alternative forced-choice visual object recognition task (Zoccolan et al., 2009). Rats near-optimally maximized instantaneous reward rate (iRR) at the end of learning but chose response times that were too slow to be iRR-optimal early in learning. To understand the rats’ learning trajectory, we examined learning trajectories in a recurrent neural network (RNN) trained on the same task. We derive a reduction of this RNN to a learning drift-diffusion model (LDDM) with time-varying parameters that describes the network’s average learning dynamics. Mathematical analysis of this model reveals a dilemma: at the beginning of learning when error rates are high, iRR is maximized by fast responses (Bogacz et al., 2006). However, fast responses mean minimal stimulus exposure, little opportunity for perceptual processing, and consequently slow learning. Because of this learning speed/iRR (LS/iRR) trade-off, slow responses early in learning can yield greater total reward over engagement with the task, suggesting a normative basis for the rats’ behavior. We then experimentally tested and confirmed several model predictions by evaluating whether response time and learning speed are causally related, and whether rats choose their response times so as to take advantage of learning opportunities. Our results suggest that rats exhibit cognitive control of the learning process, adapting their behavior to approximately accrue maximal total reward across the entire learning trajectory, and indicate that a policy that prioritizes learning in perceptual tasks may be advantageous from a total reward perspective.

Results

Trained rats solve the SAT

We trained n=26 rats on a visual object recognition two-alternative forced-choice task (see Methods) (Zoccolan et al., 2009). The rats began a trial by licking the central of three capacitive lick ports, at which time a static visual object that varied in size and rotation from one of two categories appeared on a screen. After evaluating the stimulus, the rats licked the right or left lick port. When correct, they received a water reward, and when incorrect, a timeout period (Figure 1a, Figure 1—figure supplement 1). Because this was a free-response task, rats were also able to initiate a trial and not make a response, but these ignored trials made up a small fraction of all trials and were not considered during our analysis (Figure 1—figure supplement 2).

Figure 1. Trained rats solve the speed-accuracy trade-off.

(a) Rat initiates trial by licking center port, one of two visual stimuli appears on the screen, rat chooses the correct left/right response port for that stimulus and receives a water reward. (b) Speed-accuracy space: a decision making agent’s ER and mean normalized DT (a normalization of DT based on the average timing between one trial and the next, see Methods). Assuming a simple drift-diffusion process, agents that maximize iRR (see Methods) must lie on an optimal performance curve (OPC, black trace) (Bogacz et al., 2006). Points on the OPC relate error rate to mean normalized decision time, where the normalization takes account of task timing parameters (e.g. average response-to-stimulus interval). For a given SNR, an agent’s performance must lie on a performance frontier swept out by the set of possible threshold-to-drift ratios and their corresponding error rates and mean normalized decision times. The intersection point between the performance frontier and the OPC is the error rate and mean normalized decision time combination that maximizes iRR for that SNR. Any other point along the performance frontier, whether above or below the OPC, will achieve a suboptimal iRR. Overall, iRR increases toward the bottom left, with maximal instantaneous reward rate at error rate = 0.0 and mean normalized decision time = 0.0. (c) Mean performance across 10 sessions for trained rats (n=26) at asymptotic performance plotted in speed-accuracy space. Each cross is a different rat. Color indicates fraction of maximum instantaneous reward rate (iRR) as determined by each rat’s performance frontier. Errors are bootstrapped SEMs. (d) Violin plots depicting fraction of maximum iRR, a quantification of distance to the OPC, for the same rats and sessions as c. Fraction of maximum iRR is a comparison of an agent’s current iRR with its optimal iRR given its inferred SNR. Approximately 15 of 26 rats (∼60%) attain greater than a 99% fraction of maximum iRR for their individual inferred SNRs. * denotes p < 0.05, one-tailed Wilcoxon signed-rank test for mean > 0.99.

Figure 1.

Figure 1—figure supplement 1. Task schematic for error trials.

Figure 1—figure supplement 1.

Error trial: rat chooses incorrect left/right response port and incurs a timeout period.
Figure 1—figure supplement 2. Fraction of ignored trials during learning.

Figure 1—figure supplement 2.

(a) Schematic of an ignore trial: rat does not choose a left/right response port and receives no feedback. (b) Fraction of trials ignored (ignored trials/(correct + incorrect + ignored trials)) during learning for animals encountering the task for the first time (stimulus pair 1). (c) Fraction of trials ignored for animals learning stimulus pair 2 after training on stimulus pair 1.
Figure 1—figure supplement 3. Drift-diffusion model data fits.

Figure 1—figure supplement 3.

(a) The accuracy and reaction time data from 26 trained rats were fit to a simple drift-diffusion model using the hierarchical Bayesian estimation of the drift-diffusion model (HDDM) package (Wiecki et al., 2013). (b) Estimated posterior distributions of parameter values across all animals.
Figure 1—figure supplement 4. Estimating T0.

Figure 1—figure supplement 4.

(a) Linear and quadratic extrapolations to accuracy as a function of reaction time. The T0 estimate is when each extrapolation intersects chance accuracy (0.5). (b) Mean accuracy for trials with reaction times 350–375 ms for n=26 rats. (c) Minimum motor time estimated by looking at first peak of time between licks to/from center port for n=11 rats. (d) Cartoon of stimulus onset latency across visual areas from Vermaercke et al., 2014 to estimate minimum visual processing time. (e) Diagram of T0 estimates, with an upper limit (minimum reaction time) and lower limit (minimum motor time + minimum visual processing time). (f) Mean learning trajectory for n=26 rats with various T0 estimates. (g) Subjects (n=26) in speed-accuracy space with various T0 estimates.
Figure 1—figure supplement 5. Analysis of voluntary intertrial intervals (ITIs).

Figure 1—figure supplement 5.

(a) Histogram of voluntary ITIs (time in addition to mandatory experimentally determined Derr and Dcorr) for n=26 rats across 10 sessions for previous correct (blue) and previous error (red) ITIs. Voluntary ITIs are spaced every 500 ms because of violations to the ‘cannot lick’ period. Inset: proportion of voluntary ITIs below 500, 1000, and 2000 ms boundaries. (b) Median voluntary ITIs up to 500, 1000, and 2000 ms boundaries. (c) Overlay of voluntary ITIs spaced 500 ms apart after previous correct trials. (d) Overlay of voluntary ITIs spaced 500 ms apart after previous error trials.
Figure 1—figure supplement 6. Mandatory post-error (D~err) and post-correct (D~corr) response-to-stimulus interval times.

Figure 1—figure supplement 6.

(a) Diagram of intertrial interval (ITI) after previous error trial. All times (punishment stimulus, enforced intertrial interval, cannot lick reward ports, and pre-stimulus time) were verified based on timestamps on experimental file logs. After the punishment stimulus and enforced intertrial interval, there is a 300 ms period where rats cannot lick the reward ports. If violated, 500 ms are added to the intertrial interval followed by another 300 ms ‘cannot lick’ period. In addition to this restriction, rats may take as much voluntary time between trials as they wish. Any violation of the ‘cannot lick’ period is counted as voluntary time, and only the minimum mandatory time of 3136 ms is counted for D~err. (b) Diagram of ITI after previous correct trial. All times (dispense water reward, collect water reward, enforced intertrial interval, cannot lick reward ports, pre-stimulus time) were verified based on timestamps on experimental file logs. The same ‘cannot lick’ period is present as in a. Any violation of the ‘cannot lick’ period is counted as voluntary time, and only the minimum mandatory time of 6370 ms is counted for D~corr.
Figure 1—figure supplement 7. Reward rate sensitivity to T0 and voluntary intertrial interval (ITI).

Figure 1—figure supplement 7.

(a) Fraction of maximum instantaneous reward rate across n=26 rats over 10 sessions at asymptotic performance over possible voluntary ITI values of 0–1000 ms and over the minimum and maximum estimated T0 values. (b) Fraction of maximum instantaneous reward rate across n=26 rats over 10 sessions at asymptotic performance over possible T0 values from 160 to 350 ms (min to max estimated T0 values) and over the median voluntary ITIs with 500, 1000, and 2000 ms boundaries. (c) Fraction of maximum instantaneous reward rate across n=26 rats as a function of normalized training time during learning period and possible voluntary ITIs from 0 to 2000 ms calculated with the T0 minimum of 160 ms. The gray curves represent a weighted average over previous correct/error median voluntary ITIs over normalized training time. Contours with different fractions of maximum instantaneous reward rate in pink. (d) Same as in c but calculated with T0 maximum of 350 ms.

We examined the relationship between error rate (ER) and reaction time (RT) during asymptotic performance using the DDM (Figure 1—figure supplement 3). In the DDM, perceptual information is integrated through time until the level of evidence for one alternative reaches a threshold. The SAT is controlled by the subject’s choice of threshold, and is solved when a subject’s performance lies on an optimal performance curve (OPC; Figure 1b; Bogacz et al., 2006). The OPC defines the mean normalized decision time (DT) and ER combination for which an agent will collect maximal iRR (see Methods). At any given time, an agent will have some perceptual sensitivity (signal-to-noise ratio [SNR]) which reflects how much information about the stimulus arrives per unit time. Given this SNR, an agent’s position in speed-accuracy space (the space relating ER and DT) is constrained to lie on a performance frontier traced out by different thresholds (Figure 1b). Using a low threshold yields fast but error-prone responses, while using a high threshold yields slow but accurate responses. An agent only maximizes iRR when it chooses the ER and DT combination on its performance frontier that intersects the OPC. After learning the task to criterion, over half the subjects collected over 99% of their total possible reward, based on inferred SNRs assuming a DDM (Figure 1c and d).
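To make the geometry of Figure 1b concrete, the following is a minimal sketch that sweeps the decision threshold for one fixed, hypothetical SNR, traces out the corresponding performance frontier, and picks the point on it that maximizes iRR. It uses the standard closed-form DDM expressions for ER and DT (these reappear as Equations 7 and 8 below) and a simplified iRR of the form (1 − ER)/(DT + Dtot); the SNR, the timing value, and the exact iRR and normalization definitions in our Methods may differ.

```python
import numpy as np

def er_dt(snr, z_bar):
    """Standard closed-form DDM error rate and mean decision time."""
    er = 1.0 / (1.0 + np.exp(2.0 * z_bar * snr))
    dt = z_bar * np.tanh(z_bar * snr)
    return er, dt

snr, D_tot = 2.0, 5.0                  # hypothetical SNR (1/s) and non-decision time per trial (s)
z_grid = np.linspace(0.01, 4.0, 400)   # candidate normalized thresholds
er, dt = er_dt(snr, z_grid)            # the performance frontier for this SNR
irr = (1.0 - er) / (dt + D_tot)        # simplified instantaneous reward rate
best = int(np.argmax(irr))
print(f"iRR-maximizing point: ER = {er[best]:.3f}, "
      f"mean normalized DT = {dt[best] / D_tot:.3f}")
```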

Calculating mean normalized DT for comparison with the OPC requires knowing two quantities: DT and the average non-decision time per error trial Derr. The average non-decision time Derr=T0+D~err contains the motor and initial perceptual processing components of RT, denoted T0, and the post-response timeout on error trials, D~err. Mean normalized DT is then the ratio DT/Derr. In order to determine each subject’s DT, we estimated T0 through a variety of methods, opting for a biological estimate (measured lickport latency response times and published visual processing latencies; Figure 1—figure supplement 4). To ensure that our results did not depend on our choice of T0, we ran a sensitivity analysis on a wide range of possible values of T0 (Figure 1—figure supplement 4f). We then had to determine D~err, which can contain mandatory and voluntary intertrial intervals. We found that the rats generally kept voluntary intertrial intervals to a minimum, and we interpreted longer intervals as effectively ‘exiting’ the DDM framework (Figure 1—figure supplement 5). As such, we defined D~err to only contain mandatory intertrial intervals (see Methods, Figure 1—figure supplement 6). To ensure that our results did not depend on either choice, we ran a sensitivity analysis on the combined effects of T0 and a D~err containing voluntary intertrial intervals on RR (Figure 1—figure supplement 7). A full discussion of how these parameters were determined is included in the Methods.
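As a small worked example of this normalization (the 3136 ms mandatory post-error interval comes from Figure 1—figure supplement 6; the reaction time and T0 values below are purely illustrative, not measured values):

```python
RT          = 1.200   # hypothetical mean reaction time (s)
T0          = 0.350   # upper-limit non-decision time estimate (s)
D_err_tilde = 3.136   # mandatory post-error interval (s), Figure 1-figure supplement 6

DT   = RT - T0              # mean decision time
Derr = T0 + D_err_tilde     # average non-decision time per error trial
print(DT / Derr)            # mean normalized DT, the quantity compared against the OPC
```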

Across a population, a uniform stimulus difficulty will reveal different SNRs because the internal perceptual processing ability in every subject will be different. Thus, although we did not explicitly vary stimulus difficulty (Simen et al., 2009; Bogacz et al., 2010; Zacksenhouse et al., 2010; Balci et al., 2011b), as a population, animals clustered along the OPC across a range of ERs (Figure 1d), supporting the assertion that well-trained rats achieve a near maximal iRR in this perceptual task. We note that subjects did not span the entire range of possible ERs, and that the differences in optimal DTs dictated by the OPC for the ERs we did observe are not large. It remains unclear whether our subjects would be optimal over a wider range of task parameters. Notwithstanding, previous work with a similar task found that rats did increase DTs in response to increased penalty times, indicating a sensitivity to these parameters (Reinagel, 2013a). Thus, for our perceptual task and its parameters, trained rats approximately solve the SAT.

Rats do not maximize instantaneous reward rate during learning

Knowing that rats harvested reward near-optimally after learning, we next asked whether rats harvested instantaneous reward near-optimally during learning as well. If rats optimized iRR throughout learning, their trajectories in speed-accuracy space should always track the OPC.

During learning, a representative individual (n=1) started with long RTs that decreased as accuracy increased across training time (Figure 2a). Transforming this trajectory to speed-accuracy space revealed that throughout learning the individual did not follow the OPC (Figure 2b). Early in learning, the individual started with a much higher DT than optimal, but as learning progressed it approached the OPC. The maximum iRR opportunity cost is the fraction of maximum possible iRR relinquished for a choice of threshold (and average DT) (see Methods). We found that this individual gave up over 20% of possible iRR at the beginning of learning but harvested reward near-optimally at asymptotic performance (Figure 2c). These trends held when the learning trajectories of n=26 individuals were averaged (Figure 2d–f). To ensure that our particular training regime (which involved changes in stimulus size and rotation) was not responsible for these trends, we trained a separate cohort (n=8) with a simplified regime that did not involve any changes to the stimuli and we did not observe any meaningful differences (Figure 2—figure supplement 1, see Methods). These results show that rats do not greedily maximize iRR throughout learning and lead to the question: if rats maximize iRR at the end of learning, what principle governs their strategy at the beginning of learning?

Figure 2. Rats do not greedily maximize instantaneous reward rate during learning.

(a) Reaction time (blue) and error rate (pink) for an example subject (rat AL14) across 23 sessions. (b) Learning trajectory of individual subject (rat AL14) in speed-accuracy space. Color map indicates training time. Optimal performance curve (OPC) in blue. (c) Maximum iRR opportunity cost (see Methods) for individual subject (rat AL14). (d) Mean reaction time (blue) and error rate (pink) for n=26 rats during learning. Sessions across subjects were transformed into normalized sessions, averaged and binned to show learning across 10 bins. Normalized training time allows averaging across subjects with different learning rates (see Methods). (e) Learning trajectory of n=26 rats in speed-accuracy space. Color map and OPC as in b. (f) Maximum iRR opportunity cost of rats in d throughout learning. Errors reflect within-subject session SEMs for a and b and across-subject session SEMs for d, e, and f.

Figure 2.

Figure 2—figure supplement 1. Comparison of training regimes.

Figure 2—figure supplement 1.

(a) ‘Canonical only’: rats trained to asymptotic performance with only front-view image of each of the two stimuli. ‘Size and rotation’: rats first shown front-view image of stimuli. After reaching criterion (accuracy=0.7), size staircased. Following criterion, rotation staircased. Upon criterion, stimuli randomly drawn across size and rotation. (b) Learning trajectory in speed-accuracy space over normalized training time for rats trained with the ‘size and rotation’ (left panel) and the ‘canonical only’ training regimes (right panel). (c) Average location in speed-accuracy space for 10 sessions after asymptotic performance for individual rats in both training regimes, as in b. (d) Mean accuracy over learning (left panel) and for 5 sessions after asymptotic performance (right panel) for rats trained with the ‘size and rotation’ (n=26) and the ‘canonical only’ (n=8) training regimes. (e) Mean reaction time. (f) Mean fraction max iRR. (g) Mean total trials per session. (h) Mean voluntary intertrial interval up to 500 ms after error trials. (i) Mean fraction ignored trials. All errors are SEM. Significance in right panels of d–i determined by Wilcoxon rank-sum test with p<0.05.

Learning DDM

To theoretically understand the effect of different learning strategies, we developed a simple linear RNN formalism for our task. This framework enables investigation of how long-term perceptual learning across many trials is influenced by the choice of decision time on individual trials (Figure 3). We first describe this neural network formalism, before showing how it can be analytically reduced to a classic DDM with time-dependent parameters that evolve over the course of learning.

Figure 3. Recurrent neural network and learning drift diffusion model (DDM).

(a) Roll out in time of recurrent neural network (RNN) for one trial. (b) The decision variable for the recurrent neural network (dark gray), and other trajectories of the equivalent DDM for different diffusion noise samples (light gray). (c, d, e) Changes in ER, DT, and iRR over a long period of task engagement in the RNN (light gray, pixel simulation individual traces; black, pixel simulation mean; pink, Gaussian simulation mean) compared to the theoretical predictions from the learning DDM (blue). (f) Visualization of traces in c and d in speed-accuracy space along with the optimal performance curve (OPC) in green. The threshold policy was set to be iRR-sensitive for c–f.

Figure 3.

Figure 3—figure supplement 1. Analytical reduction of linear drift-diffusion model (LDDM) matches error-corrective learning neural network dynamics during learning.

Figure 3—figure supplement 1.

(a) The recurrent linear neural network can be analytically reduced. In the reduction, the decision variable draws an observation from one of two randomly chosen Gaussian ‘stimuli’. The observations are scaled by a perceptual weight. After the addition of some irreducible noise, the value of the decision variable at the previous time step is added to the current time step. A trial ends once the decision variable hits a predetermined threshold. The dynamics of the perceptual weight capture the mean effect of gradient descent learning in the recurrent linear neural network. (b) Weight w of neural network across task engagement time for multiple simulations of the network (gray), the mean of the simulations (black), and the analytical reduction of the network (blue). (c) Same as in b but for the threshold z. (d) Same as in b but for the error rate. (e) Same as in b but for the decision time. (f) Same as in b but for the instantaneous reward rate (correct trials per second). (g) Learning trajectory in speed-accuracy space for simulations, simulation mean, and analytical reduction (theory). Optimal performance curve (OPC) is shown in red.

Linear RNN

Our model takes the form of a simple RNN, depicted unrolled through time in Figure 3a. The network receives noisy sensory input over time during a trial, amplifies this evidence through weighted synaptic connections, and integrates the result until a threshold is reached. After making a decision and receiving feedback, the synaptic connections are updated a small amount according to an error-corrective gradient descent learning rule. Therefore, there are two key timescales in the model: first, the fast activity dynamics during a single trial, which produces a single decision with a certain reaction time; and second, the slow weight dynamics due to learning across many trials. In the following, we denote time within trial as the variable t, and the trial number as trial. We now describe the dynamics on each timescale in greater detail.

Within a trial, $N$-dimensional inputs $s(t) \in \mathbb{R}^N$ arrive at discrete times $t = 1\,dt, 2\,dt, \ldots$, where $dt$ is a small time step parameter. In our experimental task, $s(t)$ might represent the activity of LGN neurons in response to a given visual stimulus. Because of eye motion and noise in the transduction from light intensity to visual activity, the response of individual neurons will only probabilistically relate to the correct answer at any given instant. In our simulations, we take $s(t)$ to be the pixel values of the exact images presented to the animals, but transformed at each time point by small rotations (±20°) and translations (±25% of the image width and height), as depicted in Figure 3a. This input variability over time makes temporal integration valuable even in this visual classification task. To perform this integration, each input $s(t)$ is filtered through perceptual weights $w(\text{trial}) \in \mathbb{R}^N$ and added to a read-out node (decision variable) $\hat{y}(t)$ along with i.i.d. integrator noise $\eta(t) \sim \mathcal{N}(0, c_o^2\,dt)$. This integrator noise models internal neural noise. The evolution of the decision variable is given by the simple linear recurrence

$$\hat{y}(t+dt) = \hat{y}(t) + w(\text{trial})^\top s(t) + \eta(t), \qquad (1)$$

until the decision variable hits a threshold $\pm z(\text{trial})$ that is constant on each trial. Here, the RNN already performs an integration through time (a choice motivated by prior experiments in rodents; Brunton et al., 2013), and improvements in performance come from adjusting the input-to-integrator weights $w(\text{trial})$ to better extract task-relevant sensory information.

Across trials, the perceptual weights w(trial) are updated to improve performance. In principle this could be accomplished with many possible learning mechanisms such as reinforcement learning (Law and Gold, 2009) or Bayesian inference (Drugowitsch et al., 2019). Here, we investigate gradient-based optimization of an objective function, as commonly used in deep learning approaches (Richards et al., 2019, Saxe et al., 2021). In particular, we consider using gradient descent on the hinge loss, corresponding to standard practice in deep learning. The hinge loss is

$$\text{Loss}(\text{trial}) = \max\big(0,\, 1 - y(\text{trial})\,\hat{y}(\text{trial})\big) \qquad (2)$$

where y(trial)=±1 is the correct output sign for the trial. Then the weights are updated by gradient descent on this loss,

$$w(\text{trial}+1) = w(\text{trial}) - \lambda\,\frac{\partial\,\text{Loss}(\text{trial})}{\partial w}, \qquad (3)$$

where λ is a small learning rate. The hinge loss is a proxy for accuracy, and so this weight update implements a learning scheme based on error feedback. In essence, perceptual weights are updated after error trials to improve the likelihood of answering correctly in the future.

To summarize the key parameters of the RNN, the model requires specifying the input distribution $s(t)$, the initial perceptual weights $w(0)$, the integrator noise variance $c_o^2$, the gradient descent learning rate $\lambda$, and the decision threshold $z(\text{trial})$ used on each trial. With these parameters specified, the model can be simulated to make predictions for how behavior will evolve over training, as shown in Figure 3c–f, gray and black traces.
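The sketch below is one minimal, self-contained realization of this training loop (Equations 1–3). The stand-in ‘stimuli’ (random unit-norm prototypes), noise scales, learning rate, and threshold are hypothetical illustration choices rather than the images or fitted parameters used with the rats, and the gradient is taken with respect to the input accumulated within the trial, one plausible way to implement Equation 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters and stand-in "stimuli" (unit-norm random prototypes);
# not the rat images or any fitted values.
N, dt, c_o, lam, z = 100, 0.01, 1.0, 1e-3, 2.0
stim = {+1: rng.normal(size=N) / np.sqrt(N),
        -1: rng.normal(size=N) / np.sqrt(N)}
w = np.zeros(N)                                  # initial perceptual weights

def run_trial(w):
    """Integrate the decision variable (Equation 1) until it hits +/- z."""
    y = rng.choice([1, -1])                      # correct output sign for this trial
    y_hat, s_sum = 0.0, np.zeros(N)
    while abs(y_hat) < z:
        s = stim[y] * dt + np.sqrt(dt) * rng.normal(size=N)    # noisy input frame
        y_hat += w @ s + c_o * np.sqrt(dt) * rng.normal()      # Equation 1
        s_sum += s                               # accumulated input, reused for the gradient
    return y, y_hat, s_sum

errors = []
for trial in range(2000):
    y, y_hat, s_sum = run_trial(w)
    errors.append(np.sign(y_hat) != y)
    if 1.0 - y * y_hat > 0.0:                    # hinge loss is active (Equation 2)
        w += lam * y * s_sum                     # gradient descent step (Equation 3)

print("error rate, first vs last 500 trials:",
      float(np.mean(errors[:500])), float(np.mean(errors[-500:])))
```

With these settings the error rate should drift down from chance as the weights align with the discriminative direction, though how quickly depends entirely on the illustrative parameter choices.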

Reduction to LDDM

While the behavior of the RNN model obtained in simulations can be compared to data, deep network models remain challenging to understand (Saxe et al., 2021). We therefore sought to mathematically analyze this setting to derive a simple theory of the average learning dynamics that highlights key trade-offs.

We start by noting that the input to the decision variable $\hat{y}$ at each time step is a weighted sum of many random variables, which by the law of large numbers will be approximately Gaussian. We therefore develop a reduction of this model based on an effective Gaussian scalar input distribution. At each time step the input pathway receives a Gaussian input $x(t) \sim \mathcal{N}(A y\,dt,\, c_i^2\,dt)$, where $A$ parametrizes the signal related to $y$, and the input noise variance $c_i^2$ parametrizes irreducible noise in input channels that cannot be rejected. This input is multiplied by a scalar weight $u$, added to output noise $\eta$ of variance $c_o^2$ and sent into the integrating node $\hat{y}$,

$$\hat{y}(t+dt) = \hat{y}(t) + u(\text{trial})\,x(t) + \eta(t), \qquad (4)$$

where we emphasize that $u$ and $x(t)$ are now both scalar. We may then perform gradient descent on the hinge loss, yielding the update $u(\text{trial}+1) = u(\text{trial}) - \lambda\,\partial\,\text{Loss}(\text{trial})/\partial u$. As expected from the law of large numbers, for the right choice of input signal and parameters $A$ and $c_i$, simulations of this effective Gaussian model closely match the full simulation from pixels, as shown in Figure 3c–f, pink trace.

Next, to relate these dynamics to the well-studied DDM framework, we examine behavior when the time step is small ($dt \to 0$) to obtain a continuous time formulation. In the continuum limit, these discrete within-trial dynamics of the network yield decision variables with identical distributions to a drift-diffusion process with an effective SNR $\bar{A}$ and normalized threshold $\bar{z}$

$$\bar{A} = \frac{A^2 u^2}{u^2 c_i^2 + c_o^2}, \qquad (5)$$
$$\bar{z} = \frac{z}{A u}, \qquad (6)$$

yielding the mean error rate (ER) and decision time (DT)

$$ER = \frac{1}{1 + e^{2\bar{z}\bar{A}}}, \qquad (7)$$
$$DT = \bar{z}\,\tanh(\bar{z}\bar{A}). \qquad (8)$$
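As a numerical sanity check of this reduction, the sketch below simulates the scalar model of Equation 4 with a fixed weight and threshold and compares the resulting error rate and mean decision time with the closed forms of Equations 5–8. All parameter values are illustrative assumptions, not quantities fit to the rats.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (signal, input/output noise, weight, threshold, time step).
A, c_i, c_o, u, z, dt = 1.0, 1.0, 1.0, 0.8, 2.0, 1e-3

def one_trial():
    """Simulate the scalar reduction (Equation 4) until the threshold is crossed."""
    y = rng.choice([1.0, -1.0])
    y_hat, steps = 0.0, 0
    while abs(y_hat) < z:
        x = A * y * dt + c_i * np.sqrt(dt) * rng.normal()    # Gaussian input
        y_hat += u * x + c_o * np.sqrt(dt) * rng.normal()    # Equation 4
        steps += 1
    return np.sign(y_hat) != y, steps * dt

errs, dts = zip(*(one_trial() for _ in range(1000)))

A_bar = A**2 * u**2 / (u**2 * c_i**2 + c_o**2)               # Equation 5
z_bar = z / (A * u)                                          # Equation 6
print("simulated ER, DT:  ", float(np.mean(errs)), float(np.mean(dts)))
print("closed-form ER, DT:", 1.0 / (1.0 + np.exp(2.0 * z_bar * A_bar)),   # Equation 7
      z_bar * np.tanh(z_bar * A_bar))                                     # Equation 8
```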

Finally, we assume that the learning rate is small ($\lambda \ll 1$), such that weights change little on any given trial and the gradient dynamics are driven by the mean update,

$$u(\text{trial}+dt) = u(\text{trial}) - \lambda\left\langle \frac{\partial\,\text{Loss}(\text{trial})}{\partial u} \right\rangle, \qquad (9)$$

where $\langle\cdot\rangle$ denotes the average with respect to the distribution of outputs obtained with perceptual weights $u(\text{trial})$ and threshold $z(\text{trial})$. These average dynamics depend in a complex way on the current performance of the network. We compute these average dynamics analytically (see Methods), yielding the continuous time change in effective SNR in the DDM that is equivalent to gradient descent learning in the underlying neural network model. In particular, gradient descent in the RNN is equivalent to the following SNR dynamics in the DDM:

$$\tilde{\tau}\,\frac{d}{dt}\bar{A}(t) = \frac{2\bar{A}(t)\,\bar{A}^*}{c}\left(1-\frac{\bar{A}(t)}{\bar{A}^*}\right)^{5/2}\frac{ER(t)}{DT(t)+D_{tot}(t)}\left[DT(t)-\frac{\log\!\left(1/ER(t)-1\right)}{\bar{A}^*\left(1-\frac{\bar{A}(t)}{\bar{A}^*}\right)^{2}}\right]. \qquad (10)$$

Here, time $t$ measures seconds of task engagement (i.e. it measures time passing within a trial as well as intertrial time and any penalty delays after error trials), and $D_{tot}(t) = (1-ER(t))\,D_{corr} + ER(t)\,D_{err}$ is the average non-decision task engagement time per trial (where $D_{corr}$ and $D_{err}$ are the average non-decision task engagement times after correct and error trials). The SNR dynamics depend on five parameters: the time constant $\tilde{\tau}$ related to the learning rate, the initial SNR $\bar{A}(0)$, the asymptotic achievable SNR after learning $\bar{A}^*$, the integration-noise to input-noise variance ratio $c \equiv c_o^2/c_i^2$, and the choice of threshold $z(t)$ over training. We note that the dependence of the dynamics on the choice of threshold $z(t)$ is implicit in $ER(t)$, $DT(t)$, and $D_{tot}(t)$ in Equation 10. The dynamics of this LDDM closely track simulated trajectories of the full network from pixels (Figure 3c–f, blue trace; Figure 3—figure supplement 1; see Methods).

Remarkably, this reduction shows that the high-dimensional dynamics of the RNN receiving stochastic pixel input and performing gradient descent on the weights (Figure 3, gray trace) can be described by a DDM with a single deterministic scalar variable – the effective SNR – that changes over time (Figure 3, blue trace). Notably, without the mapping to the original RNN, it is not possible to understand what effect error-corrective gradient descent learning would have at the level of the DDM, or how the learning process is influenced by choice of decision times. In particular, the change in SNR that arises from gradient descent on the underlying RNN weights (Equation 10) is not equivalent to that arising from gradient descent on the SNR parameter in the DDM directly because gradient descent is not parametrization invariant.

Learning speed trades off with instantaneous reward rate

The LDDM reveals that learning dynamics depend on the choice of threshold z(t) on each trial over learning, because threshold impacts both error rate and decision time, which appear in the SNR dynamics of Equation 10. We next sought to qualitatively understand this relationship. A key prediction of the LDDM is a tension between learning speed and iRR, the LS/iRR trade-off. This tension is clearest early in learning when ERs are near 50%. Then the rate of change in SNR is

$$\frac{d}{dt}\bar{A} \propto \frac{DT}{DT + D_{tot}}, \qquad (11)$$

where the proportionality constant does not depend on DT (see derivation, Methods). Hence learning speed increases with increasing DT. By contrast, when accuracy is 50% the iRR decreases with increasing DT,

$$iRR(t) \propto \frac{1/2}{DT + D_{tot}}. \qquad (12)$$

When encountering a new task, therefore, agents face a dilemma: they can either harvest a large iRR or they can learn quickly.
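A few illustrative numbers make the dilemma concrete; Dtot below is a hypothetical non-decision time per trial, and only the proportionalities of Equations 11 and 12 are meaningful.

```python
D_tot = 5.0                          # hypothetical non-decision time per trial (s)
for DT in (0.2, 0.5, 1.0, 2.0):
    learning = DT / (DT + D_tot)     # proportional to learning speed at chance (Equation 11)
    reward   = 0.5 / (DT + D_tot)    # proportional to iRR at chance (Equation 12)
    print(f"DT = {DT:.1f} s   learning speed ~ {learning:.3f}   iRR ~ {reward:.3f}")
```

Longer decision times buy faster learning at the cost of instantaneous reward, which is the trade-off explored by the threshold policies below.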

Learning dynamics depend on threshold policies

Just as the standard DDM instantiates different decision making strategies as different choices of threshold (for instance aimed at maximizing iRR, accuracy, or robustness) (Holmes and Cohen, 2014; Zacksenhouse et al., 2010), the LDDM instantiates different learning strategies through the choice of threshold trajectory over learning. Threshold affects DT and ER, and through these, the learning dynamics in Equation 10. To consider a range of strategies, we developed four potential threshold policies.

Constant threshold. This policy implements a fixed constant threshold zc(t)=z0. It serves as a control for behavior that would arise without the ability to modulate decision threshold. Constant thresholds across difficulties have been found to be used as part of near-optimal and presumably cognitively cheaper strategies in humans (Balci et al., 2011b). This policy introduces the parameter z0.

iRR-greedy. This policy sets the threshold to the value that maximizes instantaneous reward on each trial, $z_g(t) = z^*(\bar{A})$, such that behavior always lies on the OPC. This instantiates a ‘myopic’ strategy that does not consider how threshold can impact long-term learning. This policy is similar to a previously proposed neural network model of rapid threshold adjustment based on reward rate (Simen et al., 2006). The policy introduces no parameters.

iRR-sensitive. This policy implements a threshold zs(t) that decays with time constant γ from an initial value zs(0)=z0 toward the iRR-optimal threshold,

$$\gamma\,\frac{d}{dt}z_s(t) = z^*(\bar{A}(t)) - z_s(t).$$

Notably, as the SNR changes due to learning, the target threshold also changes through time. Asymptotically, this policy converges to greedy iRR-optimal behavior; however, by starting with a high initial threshold, it can undergo a transient period where responses are slower or faster than iRR-optimal, potentially influencing learning. It instantiates a heuristic strategy in which behavior differs from iRR-optimal behavior early in learning. This policy introduces two parameters, z0 and γ.

Global optimal. This policy selects the threshold zo(t) that maximizes total cumulative reward at some known predetermined end to the task Ttot,

$$z_o(t) = \underset{z(t)}{\arg\max} \int_0^{T_{tot}} RR(t)\,dt.$$

We approximately compute this threshold function using automatic differentiation (see Methods). This policy serves as a normative oracle to which behavior may be compared. We note that this optimal policy considers the full time course of learning and is aware of all task parameters such as the duration of total task engagement Ttot, asymptotically achievable SNR A*, etc. In practice these parameters cannot be known before experiencing the task, and so this policy is not an implementable strategy but a normative reference point. The policy introduces no parameters.
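The sketch below illustrates three of these policies using the closed forms of Equations 7 and 8 and a simplified iRR. The grid resolution, Dtot, z0, and γ values are illustrative assumptions, and the global optimal policy is omitted because it additionally requires unrolling the learning dynamics and differentiating cumulative reward with respect to the whole threshold schedule (e.g., with an automatic differentiation library, as described in the Methods).

```python
import numpy as np

def er_dt(A_bar, z_bar):
    """DDM error rate and mean decision time (Equations 7 and 8)."""
    return 1.0 / (1.0 + np.exp(2.0 * z_bar * A_bar)), z_bar * np.tanh(z_bar * A_bar)

def irr(A_bar, z_bar, D_tot):
    """Simplified instantaneous reward rate: correct responses per second."""
    er, dt = er_dt(A_bar, z_bar)
    return (1.0 - er) / (dt + D_tot)

Z_GRID = np.linspace(0.01, 10.0, 1000)       # candidate normalized thresholds

def constant_threshold(A_bar, z_prev, D_tot, z0=3.0):
    return z0                                              # fixed threshold z_c(t) = z0

def irr_greedy(A_bar, z_prev, D_tot):
    return Z_GRID[np.argmax(irr(A_bar, Z_GRID, D_tot))]    # always sit on the OPC

def irr_sensitive(A_bar, z_prev, D_tot, gamma=50.0, dt=1.0):
    # gamma * dz/dt = z*(A_bar) - z: relax toward the greedy threshold over time
    return z_prev + (dt / gamma) * (irr_greedy(A_bar, z_prev, D_tot) - z_prev)
```

Each function maps the current inferred SNR and previous threshold to the next threshold, so any of them can be dropped into a loop that alternates threshold selection with an SNR update such as Equation 10.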

In designing this model, we kept components as simple as possible to highlight key qualitative trade-offs between learning speed and decision strategy. Because of its simplicity, like the standard DDM, it is not meant to quantitatively describe all aspects of behavior. We instead use it to investigate qualitative features of decision making strategy, and expect that these features would be preserved in other related models of perceptual decision making (Usher and McClelland, 2001; Mazurek et al., 2003; Gold and Shadlen, 2007; Heekeren et al., 2004; Heekeren et al., 2008; Ma et al., 2006; Brown and Heathcote, 2008; Ratcliff and McKoon, 2008; Beck et al., 2008; Roitman and Shadlen, 2002; Purcell et al., 2010; Bejjanki et al., 2011; Drugowitsch et al., 2012; Fard et al., 2017).

Model reveals that prioritizing learning can maximize total reward

In order to qualitatively understand how these models behave through time, we visualized their learning dynamics. To approximately place the LDDM task parameters in a similar space to the rats, we performed maximum likelihood fitting using automatic differentiation through the discretized reduction dynamics (see Methods). The four policies we considered clustered into two groups, distinguished by their behavior early in learning. A ‘greedy’ group, which contained just the iRR-greedy policy, remained always on the OPC (Figure 4a), and had fast initial response times (Figure 4b), a long initial period at high error (Figure 4c), and high initial iRR (Figure 4d). By contrast, a ‘non-greedy’ group, which contained the iRR-sensitive, constant threshold, and global optimal policies, started far above the OPC (Figure 4a), and had slow initial response times (Figure 4b), rapid improvements in ER (Figure 4c), and low iRR (Figure 4d). Notably, while members of the non-greedy group started off with lower iRR, they rapidly surpassed the slow learning group (Figure 4d) and ultimately accrued more total reward (Figure 4e). Overall, these results show that threshold strategy strongly impacts learning dynamics due to the learning speed/iRR trade-off (Figure 4f), and that prioritizing learning speed can achieve higher cumulative reward than prioritizing instantaneous reward rate.

Figure 4. Model reveals rat learning dynamics lead to higher instantaneous reward rate and long-term rewards than greedily maximizing instantaneous reward rate.

(a) Model learning trajectories in speed-accuracy space plotted against the optimal performance curve (OPC) (black). (b) Decision time through learning for the four different threshold policies in a. (c) Error rate throughout learning for the four different threshold policies in a. (d) Instantaneous reward rate as a function of task engagement time for the full learning trajectory and a zoom-in on the beginning of learning (inset). (e) Cumulative reward as a function of task engagement time for the full learning trajectory and a zoom-in on the beginning of learning (inset). Threshold policies: iRR-greedy (green), constant threshold (blue), iRR-sensitive (orange), and global optimal (red). (f) In the speed-accuracy trade-off (left), ER (blue) decreases with increasing initial mean RT, and iRR (green) at high error rates (∼0.5) also decreases with increasing initial mean RT. Thus, at high ERs, an agent solves the speed-accuracy trade-off by choosing fast RTs that result in higher ERs and maximize iRR. In the learning speed/iRR trade-off (right), initial learning speed (dSNR/dt, pink) increases with increasing initial mean RT, whereas iRR (green) follows the opposite trend. Thus, an agent must trade iRR in order to access higher learning speeds. Plots generated using the linear drift-diffusion model (LDDM).

Figure 4.

Figure 4—figure supplement 1. Allowing both drift rate and threshold to vary with learning provides the best drift-diffusion model (DDM) fits.

Figure 4—figure supplement 1.

(a) Deviance information criterion (DIC) for different hierarchical DDM (HDDM) fits to learning during stimulus pair 1 (lower value indicates a better fit). The models were fit to the first 1000 and last 1000 trials for every animal using the HDDM framework (Wiecki et al., 2013). Different parameters were allowed to vary with learning phase while the rest were fixed across learning phase. We fit three simple DDMs, one model that only allowed drift rate variability to vary with learning, three DDMs that included a fixed drift rate variability across learning phase (‘include drift variability’), and three DDMs where drift rate variability varied with learning in addition to different combinations of drift rate and threshold. The best models were those that allowed both drift rate and threshold to vary with learning. Including drift rate variability and allowing it to also vary with learning phase did not improve the model fits. Parameters for these model fits are included in the subsequent figures. (b) Same as a but for stimulus pair 2. The models were fit to the last 500 trials of baseline sessions with stimulus pair 1, and the first 500 trials and last 500 trials of stimulus pair 2, with each 500 trial batch serving as a learning phase. As with stimulus pair 1, the best models were those that allowed both drift rate and threshold to vary with learning, and drift rate variability did not appear to allow a better model fit.
Figure 4—figure supplement 2. Simple drift-diffusion model (DDM) fits indicate threshold decreases and drift rate increases during learning.

Figure 4—figure supplement 2.

(a) The learning data from stimulus pair 1 (a, b, c) and 2 (d, e, f) were fit with a simple DDM using the hierarchical DDM (HDDM) framework (Wiecki et al., 2013) as described in Figure 4—figure supplement 1. The HDDM reports posterior probability estimates for its parameters. The posterior for mean parameters across subjects is on the left of every panel, and the mean of the posterior for every individual fit is on the right of every panel. (a) While holding threshold constant, drift increased with learning. (b) While holding drift rate constant, threshold decreased with learning. (c) When allowing both drift rate and threshold to vary with learning, drift rate increased and threshold decreased with learning. (d) For stimulus pair 2, while holding threshold constant, drift increased with learning, matching its value during baseline sessions. (e) While holding drift rate constant, threshold decreased with learning, matching its value during baseline sessions. (f) When allowing both drift rate and threshold to vary with learning, drift rate increased and threshold decreased with learning, matching their values during baseline sessions. p-Values for mean estimates were calculated by taking the difference of the posteriors and counting the proportion of differences that was, depending on directionality, above or below 0. p-Values for individual estimates were calculated with a Wilcoxon rank-sum test across pairs.
Figure 4—figure supplement 3. Simple drift-diffusion model (DDM) + fixed drift rate variability fits indicate threshold decreases and drift rate increases during learning.

Figure 4—figure supplement 3.

(a) The learning data from stimulus pair 1 (a, b, c) and 2 (d, e, f) were fit with a simple DDM + fixed drift rate variability using the hierarchical DDM (HDDM) framework (Wiecki et al., 2013) as described in Figure 4—figure supplement 1. (a) While holding threshold constant, drift increased with learning. (b) While holding drift rate constant, threshold decreased with learning. (c) When allowing both drift rate and threshold to vary with learning, drift rate increased and threshold decreased with learning. Drift rate variability estimates were close to 0. (d) For stimulus pair 2, while holding threshold constant, drift increased with learning, matching its value during baseline sessions. (e) While holding drift rate constant, threshold decreased with learning, matching its value during baseline sessions. (f) When allowing both drift rate and threshold to vary with learning, drift rate increased and threshold decreased with learning, matching their values during baseline sessions. p-Values were calculated as in Figure 4—figure supplement 2.
Figure 4—figure supplement 4. Simple drift-diffusion model (DDM) + variable drift rate variability fits indicate threshold decreases and drift rate increases during learning.

Figure 4—figure supplement 4.

(a) The learning data from stimulus pair 1 (a, b, c) and 2 (d, e, f) were fit with a simple DDM + variable drift rate variability using the hierarchical DDM (HDDM) framework (Wiecki et al., 2013) as described in Figure 4—figure supplement 1. (a) While holding threshold constant, drift and drift rate variability increased with learning. (b) While holding drift rate constant, threshold and drift rate variability decreased with learning. (c) When allowing both drift rate and threshold to vary with learning, drift rate and drift rate variability increased and threshold decreased with learning. (d) For stimulus pair 2, while holding threshold constant, drift increased with learning, matching its value during baseline sessions, while drift rate variability trended toward decreasing with stimulus pair 2. (e) While holding drift rate constant, threshold decreased with learning, matching its value during baseline sessions, and drift rate variability decreased. (f) When allowing both drift rate and threshold to vary with learning, drift rate increased and threshold decreased with learning, matching their values during baseline sessions. Drift rate variability trended toward decreasing with stimulus pair 2. p-Values were calculated as in Figure 4—figure supplement 2.
Figure 4—figure supplement 5. Model reveals rat learning dynamics resemble optimal trajectory without relinquishing initial rewards.

Figure 4—figure supplement 5.

(a) Fraction of instantaneous reward rate with respect to the iRR-greedy policy for all model threshold policies during learning. The instantaneous reward rates of all policies were normalized by the iRR-greedy policy’s instantaneous reward rate through task engagement time. (b) Same as a but for the full trajectory of the simulation. (c) Fraction of instantaneous reward rate with respect to the global optimal policy for all model threshold policies during learning. The instantaneous reward rates of all policies were normalized by the global optimal policy’s instantaneous reward rate through task engagement time. (d) Same as c but for the full trajectory of the simulation.

We further analyzed the differences between the three strategies in the non-greedy group. The global optimal policy selects extremely slow initial DTs to maximize the initial speed of learning. By contrast, the iRR-sensitive and constant threshold policies start with moderately slow responses. Nevertheless, we found that these simple strategies accrued 99% of the total reward of the global optimal strategy (Figure 4—figure supplement 5). Hence these more moderate policies, which do not require oracle knowledge of future task parameters, derive most of the benefit in terms of total reward and may reflect a reasonable approach when the duration of task engagement is unknown.

Considering the rats’ trajectories in light of these strategies, their slow responses early in learning stand in stark contrast to the fast responses of the iRR-greedy policy (Figure 2b, Figure 4a). Equally, their responses were faster than the extremely slow initial DTs of the global optimal model. Both the iRR-sensitive and constant threshold models qualitatively matched the rats’ learning trajectory. However, the best DDM parameter fits of the rats’ behavior allowed their thresholds to decrease throughout learning, failing to support the constant threshold model (Figure 4—figure supplements 1–4). Subsequent experiments (Figure 6) provide further evidence against a simple constant threshold strategy. Consistent with substantial improvements in perceptual sensitivity through learning, DDM fits to the rats also showed an increase in drift rate throughout learning (Figure 4—figure supplements 1–4). Similar increases in drift rate have been observed as a universal feature of learning throughout numerous studies fitting learning data with the DDM (Ratcliff et al., 2006; Dutilh et al., 2009; Petrov et al., 2011; Balci et al., 2011b; Liu and Watanabe, 2012; Zhang and Rowe, 2014). These qualitative comparisons suggest that rats adopt a ‘non-greedy’ strategy that trades initial rewards to prioritize learning in order to harvest a higher iRR sooner and accrue more total reward over the course of learning.

Learning speed scales with reaction time

To test the central prediction of the LDDM that learning (change in SNR) scales with mean DT, we designed an RT restriction experiment and studied the effects of the restriction on learning in the rats. Previously trained rats (n=12) were randomly divided into two groups in which they would have to learn a new stimulus pair while responding above or below their individual mean RTs (‘slow’ and ‘fast’) for the previously trained stimulus pair (Figure 5a). Before introducing the new stimuli, we carried out practice sessions with the new timing restrictions to reduce potential effects related to a lack of familiarity with the new regime. After the restriction, RTs were significantly different between the two groups (Figure 5b). In the model, we simulated an RT restriction by setting two different DTs (Figure 5c).

Figure 5. Longer reaction times lead to faster learning and higher instantaneous reward rates.

Figure 5.

(a) Schematic of experiment and hypothesized results. Previously trained animals were randomly divided into two groups that could only respond above (blue, n=7) or below (black, n=5) their individual mean reaction times for the previously trained stimulus and the new stimulus. Subjects responding above their individual mean reaction times were predicted to learn faster, reach a higher instantaneous reward rate sooner and accumulate more total reward. (b) Mean and individual reaction times before and after the reaction time restriction in rats. The mean reaction time for subjects randomly chosen to respond above their individual mean reaction times (blue, n=7) was not significantly different from that of subjects randomly chosen to respond below their individual means (black, n=5) before the restriction (Wilcoxon rank-sum test p > 0.05), but was significantly different after the restriction (Wilcoxon rank-sum test p < 0.05). Errors represent 95% confidence intervals. (c) In the model, a long (blue) and a short (black) target decision time were set through a control feedback loop on the threshold, $\frac{d}{dt}z(t) = \gamma\,(DT_{targ} - DT(t))$ with parameter γ=0.01. (d) Mean accuracy ±95% confidence interval across sessions for rats required to respond above (blue, n=7) or below (black, n=5) their individual mean reaction times for a previously trained stimulus. Both groups had initial accuracy below chance because rats assume a response mapping based on an internal assessment of similarity of new stimuli to previously trained stimuli. To counteract this tendency and ensure learning, we chose the response mapping for new stimuli that contradicted the rats’ mapping assumption, having the effect of below-chance accuracy at first. * denotes p < 0.05 in two-sample independent t-test. Inset: accuracy change (slope of linear fit to accuracy across sessions for both groups, units: fraction per session). * denotes p < 0.05 in a Wilcoxon rank-sum test. (e) Mean inferred signal-to-noise ratio (SNR), (f) mean iRR, and (g) mean cumulative reward across task engagement time for new stimulus pair for animals in each group. (h) Accuracy, (i) SNR, (j) iRR, and (k) cumulative reward across task engagement time for long (blue) and short (black) target decision times in the linear drift-diffusion model (LDDM).

We found no difference in initial mean session accuracy between the two groups, followed by significantly higher accuracy in the slow group in subsequent sessions (Figure 5d). The slope of accuracy across sessions was significantly higher in the slow group (Figure 5d, inset). Importantly, the fast group had a positive slope and an accuracy above chance by the last session of the experiment, indicating this group learned (Figure 5d).

Because of the SAT in the DDM, however, accuracy could be higher in the slow group even with no difference in perceptual sensitivity (SNR) or learning speed simply because on average they view the stimulus for longer during a trial, reflecting a higher threshold. To see if underlying perceptual sensitivity increased faster in the slow group, we computed the rats’ inferred SNR throughout learning (see Methods, Equation 24), which takes account of the relationship between RT and ER. The SNR of the slow group increased faster (Figure 5e), consistent with a learning speed that scales with DT.
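For reference, one standard way to invert Equations 7 and 8 and recover an inferred SNR from behavior alone (valid for ER < 0.5) is sketched below; the exact form and parameterization of Equation 24 in the Methods may differ.

```python
import numpy as np

def inferred_snr(er, dt):
    """Effective SNR implied by an (ER, DT) pair under the simple DDM.
    Uses ER = 1/(1 + exp(2*z*A)) and DT = z*tanh(z*A); requires er < 0.5."""
    k = 0.5 * np.log((1.0 - er) / er)    # k = z_bar * A_bar
    z_bar = dt / (1.0 - 2.0 * er)        # because tanh(z_bar * A_bar) = 1 - 2*ER
    return k / z_bar

# hypothetical session summaries for a 'slow' and a 'fast' animal
print(inferred_snr(er=0.25, dt=0.60))    # slower, more accurate
print(inferred_snr(er=0.35, dt=0.30))    # faster, less accurate
```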

We found that the slow group had a lower initial iRR, but that this iRR exceeded that of the fast group halfway through the experiment (Figure 5f). Similarly, the slow group trended toward a higher cumulative reward by the end of the experiment (Figure 5g). The LDDM qualitatively replicates all of our behavioral findings (Figure 5h–k). These results demonstrate the potential total reward benefit of faster learning, which in this case was a product of enforced slower RTs.

Our experiments and simulations demonstrate that longer RTs lead to faster learning and higher reward for our task setting both in vivo and in silico. Moreover, they are consistent with the hypothesis that rats choose high initial RTs in order to prioritize learning and achieve higher iRRs and cumulative rewards during the task.

Rats choose reaction time based on learning prospects

The previous experiments suggest that rats trade initial rewards for faster learning. Nonetheless, it is unclear how much control rats exert over their RTs. A control-free heuristic approach, such as adopting a fixed high threshold (our constant threshold policy), might incidentally appear near optimal for our particular task parameters, but might not be responsive to changed task conditions. If an agent is controlling the reward investment it makes in the service of learning, then it should only make that investment if it is possible to learn.

To test whether the rats’ RT modulations were sensitive to learnability, we conducted a new experiment in which we divided rats into a group that encountered new learnable visible stimuli (n=16, sessions = 13), and another that encountered unlearnable transparent or near-transparent stimuli (n=8, sessions = 11) (Figure 6a). From the perspective of the LDDM, both groups start with approximately zero SNR; however, only the group with the visible stimuli can improve that SNR. Because the rats do not know the learnability of new stimuli, we initialize the LDDM with a high threshold to model the belief that any new stimuli may be learnable. If the rats choose their RTs based on how much it is possible to learn, then: (1) rats encountering new stimuli that they can learn will increase their RTs to learn quickly and increase future iRR. (2) Rats encountering new stimuli that they cannot learn might first increase their RTs to learn that there is nothing to learn, but (3) will subsequently decrease RTs to maximize iRR.

Figure 6. Rats choose reaction time based on stimulus learnability.

(a) Schematic of experiment: rats trained on stimulus pair 1 were presented with new visible stimulus pair 2 or transparent (alpha = 0, 0.1) stimuli. If rats change their reaction times based on stimulus learnability, they should increase their reaction times for the new visible stimuli to increase learning and future iRR, and decrease their reaction times to increase iRR for the transparent stimuli. (b) Learning across normalized sessions in speed-accuracy space for new visible stimuli (n=16, crosses) and transparent stimuli (n=8, squares). Color map indicates time relative to start and end of the experiment. (c) iRR-sensitive threshold model runs with ‘visible’ (crosses) and ‘transparent’ (squares) stimuli (modeled as containing some signal, and no signal) plotted in speed-accuracy space. The crosses are illustrative and do not reflect any uncertainty. Color map indicates time relative to start and end of simulation. (d) Mean change in reaction time across sessions for visible stimuli or transparent stimuli compared to previously known stimuli. Positive change means an increase relative to previous average. Inset: first and second half of first session for transparent stimuli. * denotes p<0.05 in permutation test. (e) Correlation between initial individual mean change in reaction time (quantity in d) and change in signal-to-noise ratio (SNR) (learning speed: slope of linear fit to SNR per session) for first three sessions with new visible stimuli. R2 and p from linear regression in d. Error bars reflect standard error of the mean in b and d. (f) Decision time across task engagement time for visible and transparent stimuli runs in model simulation. (g) Instantaneous change in SNR ($\frac{d}{dt}\bar{A}$) as a function of initial reaction time (decision time + non-decision time T0) in model simulation.


Figure 6—figure supplement 1. Reaction time analysis of transparent stimuli experiment.


(a) During transparent stimuli, the reaction time (RT) minimum was relaxed to 0 ms to fully measure a possible shift in RT behavior. To be able to ascertain whether transparent stimuli led to a significant change in RT, the RT histogram of transparent stimuli sessions (early [first two sessions]: purple, late [last two sessions]: yellow) was compared to control sessions with visible stimuli (gray) with no RT minimum. Medians indicated with dashed lines. Kolmogorov–Smirnov two-sample tests over distributions found significant differences ($p < 10^{-4}$). (b) Vincentized RTs for transparent and control visible stimuli sessions with no minimum reaction time showed the early transparent sessions were slower than the control sessions, and the late sessions were faster across quantiles.
Figure 6—figure supplement 2. Vincentized reaction time distributions throughout learning.


(a) Vincentized reaction time distributions for n=26 subjects learning stimulus pair 1 (first 3 sessions, purple; last 10 asymptotic sessions, yellow). (b) Vincentized reaction time distributions for n=16 subjects learning stimulus pair 2 (first 2 sessions, purple; last 2 sessions, yellow). (c) Vincentized reaction time distributions for n=8 subjects learning transparent stimuli (first 2 sessions, purple; last 2 sessions, yellow).
Figure 6—figure supplement 3. Simple drift-diffusion model (DDM) + variable drift rate variability fits for transparent stimuli.


(a) The learning data from transparent stimuli were fit with a simple DDM + variable drift rate variability using the hierarchical DDM (HDDM) framework (Wiecki et al., 2013). Three learning phases were included: the last 500 trials with control visible stimuli, and the first 500 and the last trials with transparent stimuli. We allowed drift rate, threshold, drift rate variability, and T0 to vary with learning phase. Drift rate decreased with transparent stimuli, remaining constant throughout. Threshold monotonically decreased with transparent stimuli. Drift rate variability appeared to decrease and stay constant with transparent stimuli, albeit at a value near 0. T0 appeared to decrease with transparent stimuli. p-Values were calculated as in Figure 4—figure supplement 2.
Figure 6—figure supplement 4. Analysis of stimulus-independent strategies for transparent stimuli.


In order to measure the extent of stimulus-independent strategies for transparent stimuli, we fit baseline sessions with visible stimuli and sessions with the transparent stimuli with PsyTrack, a flexible generalized linear model (GLM) package for measuring the weights of different inferred psychophysical variables (Roy et al., 2021). We fit our data with a model that included bias, win-stay/lose-switch (previous trial outcome), perseverance (previous trial choice), and the actual stimulus as potential explanatory variables for left/right choice behavior. (a) Measurement of bias across n=8 animals. Generally, bias and bias variability increased with transparent stimuli compared to visible stimuli. Although not uniformly, animals tended to become more biased toward the side they already favored during visible stimuli. (b) During visible stimuli, the stimulus had strong non-zero weights, indicating it influenced choice behavior. Stimulus has positive weights for some animals and negative for others because stimulus mappings were counterbalanced across animals. Win-stay/lose-switch and perseverance weights varied across animals during visible stimuli. Generally, the weights and variability of these variables increased during transparent stimuli, while the stimulus weight collapsed to 0 (as expected, given that the stimulus was transparent). The weight of the bias variable was omitted for visual clarity as the actual bias was reported in a.
Figure 6—figure supplement 5. Post-error slowing during rat learning dynamics.


(a) Individual (gray) and mean (black) post-error slowing across first 15 sessions for n=26 animals. Post-error slowing was calculated by taking the difference between RTs on trials with previous correct trials and previous error trials. A positive difference indicates post-error slowing. (b) Individual mean (gray) and population mean (black) post-error slowing for first 2 sessions of learning and last 2 sessions of learning for n=26 animals. A Wilcoxon signed-rank test found no significant difference in post-error slowing between the first 2 and last 2 sessions for every animal (p = 0.585). (c) Same as in a for n=16 rats, with the addition of 4 baseline sessions with stimulus pair 1 plus the 13 sessions while subjects were learning stimulus pair 2. (d) Same as in b but comparing the last 2 baseline sessions with stimulus pair 1 and the first 2 sessions learning stimulus pair 2. A Wilcoxon signed-rank test found no significant difference in post-error slowing when the animals started learning stimulus pair 2 (p = 0.255).

We found that the rats with the visible stimuli qualitatively replicated the same trajectory in speed-accuracy space that we found when rats were trained for the first time (Figure 2b, Figure 6b). Indeed, the best DDM fits were those that allowed both threshold and drift rate to vary with learning, as was the case with the first stimuli the rats encountered, and in line with the LDDM (Figure 4—figure supplements 1–4). Because these previously trained rats had already mastered the task mechanics, this result rules out non-stimulus-related learning effects as the sole explanation for long RTs at the beginning of learning and supports our hypothesis that the slowdown in RT was attributable to the rats trying to learn the new stimuli efficiently. We calculated the mean change in RT (mean ΔRT) of new stimuli versus known stimuli. The visible stimuli group had a significant slowdown in RT lasting many sessions that returned to baseline by the end of the experiment (Figure 6d, black trace).

Rats with the transparent stimuli also approached the OPC by decreasing their RTs across sessions to better maximize iRR (Figure 6b). After a brief initial increase in RT in the first half of the first session (Figure 6d, inset), RTs rapidly decreased (Figure 6d, gray trace). Notably, RTs fell below the baseline RTs, indicating a strategy of responding quickly, which approaches iRR-optimal behavior for this zero SNR task. Additionally, we considered the rats’ entire RT distributions to investigate the effect of learnability beyond RT means. We found that while the RT distributions changed similarly from the beginning to end of learning for the learnable stimuli (stimulus pair 1 and 2), they differed for the unlearnable (transparent) stimuli, indicating an effect of learnability on the entire RT distributions (Figure 6—figure supplement 2). Hence, rodents are capable of modulating their strategy depending on their learning prospects.

Although there is no informative signal in this task with transparent stimuli, the rats could still be using stimulus-independent signals, such as choice history or feedback, to drive heuristic strategies. Indeed, DDM fits indicated a non-zero drift rate even in the absence of informative stimuli (Figure 6—figure supplement 3). To investigate whether the rats implemented stimulus-independent heuristic strategies in addition to random choice, we measured left/right bias and quantified the weights of bias, perseverance (choose the same port as the previous trial), and win-stay/lose-switch (choose the port that was correct on the previous trial) (Roy et al., 2021). In general, bias seemed to increase with transparent stimuli in the direction that each individual was already biased during visible stimuli. Perseverance and win-stay/lose-switch also seemed to increase and fluctuate more during transparent stimuli, suggesting a greater reliance on these heuristics now that the stimulus was uninformative (Figure 6—figure supplement 4). Engaging these heuristics may be a way that the rats expedited their choices in order to maximize iRR while still ‘monitoring’ the task for any potentially informative changes or patterns. Although the animals still engaged these non-optimal heuristics, the lack of learnability in the transparent stimuli nonetheless led to a change in strategy that was distinct from that with learnable stimuli.

Importantly, this learnability experiment argues against other simple strategies accounting for the changes in RTs. If rats respond more slowly after error trials, a phenomenon known as post-error slowing (PES), they might exhibit slower RTs early in learning when errors are frequent (Notebaert et al., 2009). Indeed, we found a slight mean post-error slowing effect of about 50 ms that was on average constant throughout learning, though it was highly variable across individuals (Figure 6—figure supplement 5). However, rats viewing transparent stimuli had ERs constrained to 50%, yet their RTs systematically decreased (Figure 6b), such that post-error slowing alone cannot account for their strategy. Similarly, choosing RTs as a simple function of time since encountering a task would not explain the difference in RT trajectories between visible and transparent stimuli (Figure 6d).

A simulation of this experiment with the iRR-sensitive threshold LDDM qualitatively replicated the rats’ behavior (Figure 6c, f and g). Rodent behavior is thus consistent with a threshold policy that starts with a relatively long DT upon encountering a new task, and then decays toward the iRR-optimal DT. All other threshold strategies we considered fail to account for the totality of the results. The iRR-greedy strategy – as before – stays pinned to the OPC and speeds up upon encountering the novel stimuli rather than slowing down. The constant threshold strategy fails to predict the speed-up in DT for the transparent stimuli if we assume constant diffusion noise. This is because when the perceptual signal is small, mean DT can be shown to be the squared ratio of threshold to diffusion noise (see Methods). It is thus also possible to explain the speed-up with a constant threshold and increasing diffusion noise. With either interpretation, however, it is clear that a policy where the ratio of threshold to diffusion noise is constant is not compatible with the results. Finally, the global optimal strategy (which has oracle knowledge of the prospects for learning in each task) behaves like the iRR-greedy policy from the start on the transparent stimuli as there is nothing to learn.
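To make the small-signal claim concrete, here is a brief sketch, assuming the normalized parametrization of Bogacz et al., 2006, in which the normalized threshold is $\bar{z} = z/A$ and the SNR is $\bar{A} = (A/c)^2$ for drift $A$, threshold $z$, and diffusion noise $c$. From Equation 26 in the Methods, $DT = \bar{z}\tanh(\bar{z}\bar{A})$; since $\tanh(x) \approx x$ for small $x$, as $\bar{A} \to 0$ we have $DT \approx \bar{z}^2\bar{A} = (z/A)^2 (A/c)^2 = z^2/c^2$, the squared ratio of threshold to diffusion noise.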

Our RT restriction experiment showed that higher initial RTs led to faster learning, a higher iRR and more cumulative reward. Consistent with these findings, there was a correlation between initial mean ΔRT and initial ΔSNR across subjects viewing the visible stimuli, indicating the more an animal slowed down, the faster it learned (Figure 6e). We further tested these results in the voluntary setting by tracking iRR and cumulative reward for the rats in the learnable stimuli setting with the largest (blue, n=4) and smallest (black, n=4) ‘self-imposed’ change in RT (Figure 7a). The rats with the largest change started with a lower but ended with a higher mean iRR, and collected more cumulative reward (Figure 7b and c). Thus, in the voluntary setting there is a clear relationship between RT, learning speed, and its total reward benefits.

Figure 7. Rats that slowed down reaction times the most reached a higher instantaneous reward rate sooner and collected more reward.


(a) Schematic showing segregation of top 25% of subjects (n=4) with the largest initial ΔRTs for the new visible stimuli and the bottom 25% of subjects (n=4) with the smallest initial ΔRTs. Initial ΔRTs were calculated as an average of the first two sessions for all subjects. (b) Mean iRR for subjects with largest and smallest mean changes in reaction time across task engagement time. (c) Mean cumulative reward over task engagement time for subjects as in b.

Discussion

Summary and limitations

Our theoretical and empirical results identify a trade-off between the need to learn rapidly and the need to accrue immediate reward in a perceptual decision making task. We find that rats adapt their decision strategy to improve learning speed and approximately maximize total reward, effectively navigating this trade-off over the total period of task engagement. In our experiments, rats responded slowly upon encountering novel stimuli, but only when there was a visual stimulus to learn from. This result indicates that they chose to respond more slowly in order to learn quickly, and only made the investment when learning was possible. This behavior requires foregoing both a cognitively easier strategy – fast random choice – and relinquishing a higher immediately available reward for several sessions spanning multiple days. By imposing different response times in groups of animals, we empirically verified our theoretical prediction that slow responses lead to faster learning and greater total reward in our task. These findings collectively show that rats exhibit cognitive control of the learning process, that is, the ability to engage in goal-directed behavior that would otherwise conflict with default or more immediately rewarding responses (Dixon et al., 2012, Shenhav et al., 2013; Shenhav et al., 2017, Cohen et al., 1990; Cohen and Egner, 2017).

Our high-throughput behavioral study with a controlled training protocol permits examination of the entire trajectory of learning, revealing hallmarks of non-greedy decision making. Nonetheless, it is accompanied by several experimental limitations. Our estimation of SNR improvements during learning relies on the DDM. Importantly, while this approach has been widely used in prior work (Brunton et al., 2013; Ratcliff et al., 2006; Balci et al., 2011b; Drugowitsch et al., 2019; Petrov et al., 2011 ), our conclusions are predicated on this model’s approximate validity for our task. Future work could address this issue by using a paradigm in which learners with different response deadlines are tested at the same fixed response deadline, equalizing the impact of stimulus exposure at test. This model-free paradigm is not trivial in rodents, because response deadlines cannot be rapidly instructed. Our study also focuses on one visual perceptual task. Further work should verify our findings with other perceptual tasks across difficulties, modalities, and organisms.

To understand possible learning trajectories, we introduced a theoretical framework based on an RNN, and from this derived an LDDM. The LDDM extends the canonical drift-diffusion framework to incorporate long-term perceptual learning, and formalizes a trade-off between learning speed and instantaneous reward. However, it remains approximate and limited in several ways. The LDDM builds off the simplest form of a DDM, while various extensions and related models have been proposed to better fit behavioral data, including urgency signals (Ditterich, 2006; Cisek et al., 2009; Deneve, 2012; Hanks et al., 2011; Drugowitsch et al., 2012), history-dependent effects (Busse et al., 2011; Scott et al., 2015; Akrami et al., 2018; Odoemene et al., 2018; Pinto et al., 2018; Lak et al., 2018; Mendonça et al., 2018), imperfect sensory integration (Brunton et al., 2013), confidence (Kepecs et al., 2008; Lak et al., 2014; Drugowitsch et al., 2019), and multi-alternative choices (Krajbich and Rangel, 2011, Tajima et al., 2019). Prior work in the DDM framework has investigated learning dynamics with a Bayesian update and constant thresholds across trials (Drugowitsch et al., 2019). Our framework uses simpler error-corrective learning rules, and focuses on how the decision threshold policy over many trials influences long-term learning dynamics and total reward. Future work could combine these approaches to understand how Bayesian updating on each trial would change long-term learning dynamics, and potentially, the optimality of different threshold strategies.

More broadly, it remains unclear whether the drift-diffusion framework in fact underlies perceptual decision making, with a variety of other proposals providing differing accounts (Gold and Shadlen, 2007, Zoltowski et al., 2019; Stine et al., 2020). We speculate that the qualitative learning speed/instantaneous reward rate trade-off that we formally derive in the LDDM would also arise in other models of within-trial decision making dynamics. In addition, on a long timescale over many trials, the LDDM improves performance through error-corrective learning. Future work could investigate learning dynamics under other proposed learning algorithms such as feedback alignment (Lillicrap et al., 2016), node perturbation (Williams, 1992), or reinforcement learning (Law and Gold, 2009). Additionally, the LDDM does not currently include a meta-learning component with which the agent can dynamically gauge the learnability of the task explicitly in order to set its decision threshold. Instead, the LDDM assumes a ‘learnability prior’ implemented as a high initial threshold condition for every new task. This limitation could be solved with a Bayesian observer that predicts learnability based on experience and controls the threshold accordingly. One potential avenue in this direction would be the implementation of the learned value of control theory, which provides a mechanism through which an agent can compare stimulus features to those it has encountered in the past in order to determine control allocation (Lieder et al., 2018). Moreover, the link between the LDDM and cognitive control is implicit: we interpret the choice of threshold in the DDM as a control process (a higher threshold than is optimal reflects control because it requires foregoing present reward in the service of future reward). Future modeling work should make the choice of control explicit, taking into account the inherent cost of control (Shenhav et al., 2013), and then using that choice to determine the decision threshold. Doing so would allow control to not only reflect the choice of threshold, as we have done, but also as a gain term on the drift rate (Leng et al., 2021), which may more completely capture control’s role in two-choice decisions.

Explore/exploit trade-off

Conceptually, the learning speed/instantaneous reward rate trade-off is related to the explore/exploit trade-off common in reinforcement learning, but differs in level of analysis. As traditionally framed in reinforcement learning, an agent has the option of maximizing reward based on its current information (exploitation), or of reaching a potentially larger future reward by expanding its current information (exploration). When framed this way, learning is an act of exploration. However, as framed in our study, learning is a systematic, directed strategy (or ‘action’), that is, exploitation, employed in order to maximize total future reward. The reconciliation between these seemingly contradictory accounts occurs at the meta-level: when an agent is aware that learning is the optimal strategy to maximize total future discounted reward, it is exploiting a strategy that trades learning speed for instantaneous reward rate. However, when that agent is not yet aware whether it can learn, then it must explore this question (i.e. meta-learn) before deciding whether it should exploit an explicit learning strategy (‘exploitation of exploration’) that will also come at the cost of instantaneous reward. Although explained sequentially, these two mechanisms can occur in parallel (i.e. an agent constantly probing its learning prospects). One intriguing finding is that state-of-the-art deep reinforcement learning agents, which succeed in navigating the traditional explore/exploit dilemma on complicated tasks like Atari games (Mnih et al., 2016), nevertheless fail to learn perceptual decisions like those considered here (Leibo et al., 2018). This may be because exploration and exploitation can mean different things depending on the level of analysis, and efficiently learning a perceptual task may require the ‘exploitation of exploration’. Our findings may thus offer routes for improving these artificial systems.

Cognitive control

In order to navigate the learning speed/instantaneous reward rate trade-off, our findings suggest that rats deploy cognitive control of the learning process. Two main features of cognitive control govern its use: it is limited (Shenhav et al., 2017), and it is costly (Krebs et al., 2010; Padmala and Pessoa, 2011, Kool et al., 2010, Dixon et al., 2012; Westbrook et al., 2013; Kool and Botvinick, 2018; Westbrook et al., 2019). If control is costly, then its application needs to be justified by the benefits of its application. The expected value of control (EVC) theory posits that control is allocated in proportion to the EVC (Shenhav et al., 2013). Previous work demonstrated that rats are capable of the economic reasoning required for optimal control allocation (Niyogi et al., 2014a; Niyogi et al., 2014b; Sweis et al., 2018). We demonstrated that rats incur a substantial initial instantaneous reward rate opportunity cost to learn the task more quickly, foregoing a cognitively less demanding fast random strategy that would yield higher initial rewards. Rather than optimizing instantaneous reward rate, which has been the focus of prior theories (Gold and Shadlen, 2002, Balci et al., 2011b; Bogacz et al., 2006), our analysis suggests that rats approximately optimize total reward over task engagement. Relinquishing initial reward to learn faster, a cognitively costly strategy, is justified by a larger total reward over task engagement. This pattern of behavior matches theoretical predictions of the value of learning based on a recent expansion of the EVC theory (Masís et al., 2021).

Assessing the expected value of learning in a new task requires knowing how much can be learned, how quickly one can learn, and for how long the task will be performed (Masís et al., 2021). None of these quantities is directly observable upon first encountering a new task, leading to the question of how rodents know to slow down in one task but not another. Importantly, rats only traded reward for information when learning was possible, a result in line with data demonstrating that humans are more likely to trade reward for information during long experimental time horizons, when learning is more likely (Wilson et al., 2014). Monkeys also reduce their reliance on expected value during decision making in order to explore strategically when it is deemed beneficial (Jahn et al., 2022). Moreover, previous work has highlighted the explicit opportunity cost of longer deliberation times (Drugowitsch et al., 2012), a trade-off that will differ during learning and at asymptotic performance, as we demonstrate here. One possibility is that rats estimate learnability and task duration through meta-learning processes that learn to estimate the value of learning through experience with many tasks (Finn et al., 2017; Wang et al., 2018; Metcalfe, 2009). The amount of control allocated to learning the current task could be proportional to its estimated value, determined based on similarity to previous learning situations and their reward outcomes and control costs (Lieder et al., 2018). Some of this bias for new information, termed curiosity, could be partly endogenous, serving as a useful heuristic for organisms outside of the lab, where rewards are sparse and action spaces are broad (Gottlieb and Oudeyer, 2018). Previous observations of suboptimal decision times in humans analogous to those we observed in rats might reflect incomplete learning, or subjects who think they still have more to learn (Balci et al., 2011b; Bogacz et al., 2010; Cohen et al., 1990). Future work could test further predictions emerging from a control-based theory of learning. An agent should assess both the predicted duration of task engagement and the predicted difficulty of learning in order to determine the optimal decision making strategy early in learning, and this can be tested by, for instance, manipulating the time horizon and difficulty of the task. From a control-based perspective, the expected reward from a task is also relevant to control allocation. Indeed, recent work in humans shows that externally motivating learners with the prospect of a test at the end of a task led to a much higher allocation of time on the harder-to-learn items compared to the case when learners were not warned of a test (Ten et al., 2020).

The trend of a decrease in response time and an increase in accuracy through practice – which we observed in our rats – has been widely observed for decades in the skill acquisition literature, and is known as the Law of Practice (Thorndike, 1913, Newell and Rosenbloom, 1981, Logan, 1992, Heathcote et al., 2000). Accounts of the Law of Practice have posited a cognitive control-mediated transition from shared/controlled to separate/automatic representations of skills with practice (Posner and Snyder, 1975, Shiffrin and Schneider, 1977, Cohen et al., 1990). On this view, control mechanisms are a limited, slow resource that impose unwanted processing delays. Our results suggest an alternative non-mutually exclusive reward-based account for why we may so ubiquitously observe the Law of Practice. Slow responses early in learning may be the goal of cognitive control, as they allow for faster learning, and faster learning leads to higher total reward. When faced with the ever-changing tasks furnished by naturalistic environments, it is the speed of learning which may exert the strongest impact on total reward.

Bounded optimality

More broadly, the optimization of behavior, not in a vacuum, but in the context of one’s constraints – intrinsic and environmentally determined – underlies several general theories of cognition, including theories that explain the allocation of cognitive control (Shenhav et al., 2013; Lieder et al., 2018), the selection of decision heuristics (Gigerenzer, 2008), and the rationale of seemingly irrational economic choices (Kahneman and Tversky, 1979, Juechems et al., 2021). These theories are instances of bounded optimality – a prominent theoretical framework of biological and artificial cognition stating that an agent is optimal when it maximizes reward per unit time within the limitations of its computational architecture (Russell and Subramanian, 1994, Lewis et al., 2014; Gershman et al., 2015; Griffiths et al., 2015; Bhui et al., 2021; Summerfield and Parpart, 2021).

Instances of this framework typically assume that cognitive constraints remain fixed and, more so, that agents do not take alterations of these constraints into account when choosing what to do. There exists, however, a novel theoretical avenue within this framework. An agent can optimize its behavior not only through maximization of reward within constraints, but also through the minimization of those constraints themselves. If an agent can change itself to minimize its constraints by, for example, improving its perceptual representations through learning, the future reward prospects of doing so should be considered in its current choices, even if it is at the expense of current reward. Intelligent agents, like humans, can and do change themselves through learning in order to improve future reward prospects. Our study formalizes this phenomenon in the context of two-choice perceptual decisions, but much work remains to be done in other contexts, modalities, and organisms.

Methods

Behavioral training

Subjects

All care and experimental manipulation of animals were reviewed and approved by the Harvard Institutional Animal Care and Use Committee (IACUC), protocol 27–22. We trained animals on a high-throughput visual object recognition task that has been previously described (Zoccolan et al., 2009). A total of 44 female Long-Evans rats were used for this study, with 38 included in analyses. Twenty-eight rats (AK1–12 and AL1–16) initiated training on stimulus pair 1, and 26 completed it (AK8 and AL12 failed to learn). Another 8 animals (AM1–8) were trained on stimulus pair 1 but were not included in the initial analysis focusing on asymptotic performance and learning (Figure 1d and e; Figure 2) because they were trained after the analyses had been completed. Subjects AM5–8, although trained, did not participate in other behavioral experiments so do not appear in this study. Sixteen animals (AL1–8, AL13–16, and AM1–8) participated in learning stimulus pair 2 (‘new visible stimuli’; canonical-only training regime) while 10 animals (AK1–3, 5–7, 9–12) initially participated in viewing transparent (alpha = 0; AK1, 3, 6, 7, 11) or near-transparent stimuli (alpha = 0.1; AK2, 5, 9, 10, 12), with the subjects sorted randomly into each group. The transparent and near-transparent groups were aggregated but two animals from the near-transparent group were excluded for performing above chance (AK5 and AK12) as this experiment focused on the effects of stimuli that could not be learned. The same 16 animals used for stimulus pair 2 were used for learning stimulus pair 3 under two different reaction time restrictions in which the subjects were sorted randomly. One rat (AL1) was excluded from the outset for not having learned stimulus pair 2. Two additional rats (AL4 and AL7) were excluded for not completing enough trials during practice sessions with the new reaction time restrictions. A final rat (AM1) was excluded because she failed to learn the task. The 12 remaining rats were grouped into seven subjects required to respond above (AL3, AL8, AL13, AL15, AL16, AM3, AM4) and five subjects required to respond below their individual average reaction times (AL2, AL5, AL6, AL14, AM2). Finally, eight rats (AN1–8) were trained on a simplified training regime (‘canonical only’) used as a control for the typical ‘size and rotation’ training object recognition regime (described below). Table 1 summarizes individual subject participation across behavioral experiments.

Table 1. Individual animal participation across behavioral experiments.
| Animal | Sex | Stimulus pair 1 | Stimulus pair 2 | Transparent stimuli | Stimulus pair 3 |
| --- | --- | --- | --- | --- | --- |
| AK1 | F | Size and rotation | | Alpha = 0 | |
| AK2 | F | Size and rotation | | Alpha = 0.1 | |
| AK3 | F | Size and rotation | | Alpha = 0.0 | |
| AK4 | F | Size and rotation | | | |
| AK5 | F | Size and rotation | | Alpha = 0.1 (excluded) | |
| AK6 | F | Size and rotation | | Alpha = 0 | |
| AK7 | F | Size and rotation | | Alpha = 0 | |
| AK8 | F | Size and rotation (excluded)* | | | |
| AK9 | F | Size and rotation | | Alpha = 0.1 | |
| AK10 | F | Size and rotation | | Alpha = 0.1 | |
| AK11 | F | Size and rotation | | Alpha = 0.0 | |
| AK12 | F | Size and rotation | | Alpha = 0.1 (excluded) | |
| AL1 | F | Size and rotation | Canonical only (excluded)§ | | |
| AL2 | F | Size and rotation | Canonical only | | Below |
| AL3 | F | Size and rotation | Canonical only | | Above |
| AL4 | F | Size and rotation | Canonical only | | Below (excluded) |
| AL5 | F | Size and rotation | Canonical only | | Below |
| AL6 | F | Size and rotation | Canonical only | | Below |
| AL7 | F | Size and rotation | Canonical only | | Below (excluded) |
| AL8 | F | Size and rotation | Canonical only | | Above |
| AL9 | F | Size and rotation | | | |
| AL10 | F | Size and rotation | | | |
| AL11 | F | Size and rotation | | | |
| AL12 | F | Size and rotation (excluded)* | | | |
| AL13 | F | Size and rotation | Canonical only | | Above |
| AL14 | F | Size and rotation | Canonical only | | Below |
| AL15 | F | Size and rotation | Canonical only | | Above |
| AL16 | F | Size and rotation | Canonical only | | Above |
| AM1 | F | Size and rotation | Canonical only | | Below (excluded)** |
| AM2 | F | Size and rotation | Canonical only | | Below |
| AM3 | F | Size and rotation | Canonical only | | Above |
| AM4 | F | Size and rotation | Canonical only | | Above |
| AM5 | F | Size and rotation | | | |
| AM6 | F | Size and rotation | | | |
| AM7 | F | Size and rotation | | | |
| AM8 | F | Size and rotation | | | |
| AN1 | F | Canonical only | | | |
| AN2 | F | Canonical only | | | |
| AN3 | F | Canonical only | | | |
| AN4 | F | Canonical only | | | |
| AN5 | F | Canonical only | | | |
| AN6 | F | Canonical only | | | |
| AN7 | F | Canonical only | | | |
| AN8 | F | Canonical only | | | |
* Failed to learn task.

† Not included in initial learning experiment.

‡ Above chance for near-transparent stimuli.

§ Failed to learn previous stimuli.

¶ Not enough practice trials with reaction time restrictions.

** Failed to learn stimuli with reaction time restrictions.

Behavioral training boxes

Rats were trained in high-throughput behavioral training rigs, each made up of four vertically stacked behavioral training boxes. In order to enter the behavioral training boxes, the animals were first individually transferred from their home cages to temporary plastic housing cages that would slip into the behavioral training boxes and snap into place. Each plastic cage had a porthole in front where the animals could stick out their head. In front of the animal in the behavior boxes were three easily accessible stainless steel lickports electrically coupled to capacitive sensors, and a computer monitor (Dell P190S, Round Rock, TX, USA; Samsung 943-BT, Seoul, South Korea) at approximately 40° visual angle from the rats’ location. The three sensors were arranged in a straight horizontal line approximately a centimeter apart and at mouth-height for the rats. The two side ports (L/R) were connected to syringe pumps (New Era Pump Systems, Inc NE-500, Farmingdale, NY, USA) that would automatically dispense water upon a correct trial. The center port was connected to a syringe that was used to manually dispense water during the initial phases of training (see below). Each behavior box was equipped with a computer (Apple Macmini 6,1 running OsX 10.9.5 [13F34] or Macmini 7.1 running OSX El Capitan 10.11.13, Cupertino, CA, USA) running MWorks, an open source software for running real-time behavioral experiments (MWorks 0.5.dev [d7c9069] or 0.6 [c186e7], The MWorks Project https://mworks.github.io/). The capacitive sensors (Phidget Touch Sensor P/N 1129_1, Calgary, Alberta, Canada) were controlled by a microcontroller (Phidget Interface Kit 8/8/8P/N 1018_2) that was connected via USB to the computer. The syringe pumps were connected to the computer via an RS232 adapter (Startech RS-232/422/485 Serial over IP Ethernet Device Server, Lockbourne, OH, USA). To allow the experimenter visual access to the rats’ behavior, each box was, in addition, illuminated with red LEDs, not visible to the rats.

Habituation

Long-Evans rats (Charles River Laboratories, Wilmington, MA, USA) of about 250 g were allowed to acclimate to the laboratory environment upon arrival for about a week. After acclimation, they were habituated to humans for 1 or 2 days. The habituation procedure involved petting and transfer of the rats from their cage to the experimenter’s lap until the animals were comfortable with the handling. Once habituated to handling, the rats were introduced to the training environment. To allow the animals to get used to the training plastic cages, the feedback sounds generated by the behavior rigs, and to become comfortable in the behavior training room, they were transferred to the temporary plastic cages used in our high-throughput behavioral training rigs and kept in the training room for the duration of a training session undergone by a set of trained animals. This procedure was repeated after water deprivation, and during the training session undergone by the trained animals, the new animals were taught to poke their head out of a porthole available in each plastic cage to receive a water reward from a handheld syringe connected to a lickport identical to the ones in the behavior training boxes in the training rigs. Once the animals reliably stuck their head out of the porthole (1 or 2 days) and accessed water from the syringe, they were moved into the behavior boxes.

Early shaping

On their first day in the behavior boxes, rats were individually tutored as follows: Water reward was manually dispensed from the center lickport which is normally used to initiate a trial. When the animal licked the center lickport, a trial began. After a 500 ms tone period, one of two visual objects (stimulus pair 1) appeared on the screen (large front view, degree of visual angle 40°) chosen pseudo-randomly (three randomly consecutive presentations of one stimulus resulted in a subsequent presentation of the other stimulus). This appearance was followed by a 350 ms minimum reaction time that was instituted to promote visual processing of the stimuli. If the animal licked one of the side (L/R) lickports during this time, then the trial was aborted, there would be a minimum intertrial time (1300 ms), and the process would begin again.

At the time of stimulus presentation, a free water reward was dispensed from the correct side (L/R) lickport. If the animals licked the correct side lickport within the allotted amount of time (3500 ms) then an additional reward was automatically dispensed from that port. This portion of training was meant to begin teaching the animals the task mechanics, that is to first lick the center port, and then one of the two side ports.

After the rats were sufficiently engaged with the lickports and began self-initiating trials by licking the center lickport (usually 1 to several days, determined by experimenter) no more water was dispensed manually through the center lickport, but the free water rewards from the side lickports were still given. Once the rats were self-initiating enough trials without manual rewards from the center lickport (>200 per session), the free reward condition was stopped, and only correct responses were rewarded.

Training

Data collection for this study began once the rats had demonstrated proficiency of the task mechanics (as described above). The training curriculum followed was similar to that by Zoccolan et al., 2009. Rats performed the task for about 2 hr daily. Initially, the rats were only presented with large front views (40° visual angle, 0° of rotation) of the two stimuli (stimulus pair 1). Once the rats reached a performance level of ≥70% with these views, the stimuli decreased in size to 15° visual angle in a staircased fashion with steps of 2.5° visual angle. Once the rats reached 15° visual angle, rotations of the stimuli to the left or right were staircased in steps of 5° at a constant size of 30° visual angle. Once the rats reached ±60° of rotation, they were considered to have completed training and were presented with random transformations of the stimuli at different sizes (15°–40° visual angle, step = 15°; 0° of rotation) or different rotations (-60° to +60° of rotation, step = 15°; 30° visual angle). After this point, 10 additional training sessions were collected to allow the animals’ performance to stabilize with this expanded stimulus set.

During training, there was a bias correction that tracked the animals’ tendency to be biased to one side. If biased, stimuli mapped to the unbiased side were presented for a maximum of three consecutive trials. For example, if the bias correction detected an animal was biased to the right, the left-mapped stimulus would appear three trials at a time in a non-random fashion and the animals’ performance would drop from 50% to 25%, reducing the advantageousness of a biased strategy dramatically. If the animals continued to exhibit bias after one or two sessions of bias correction, then the limit was pushed to five consecutive trials. Once the bias disappeared, stimulus presentation resumed in a pseudo-random fashion.

The left/right mapping of the stimuli to lickports was counterbalanced across animals, ruling out any effects related to left/right stimulus-independent biases, or left/right-independent stimulus bias, across animals.

Training regime comparison

Although object recognition is supposed to be a fairly automatic process (Cox, 2014), it is possible that the 14 possible presentations of each stimulus of stimulus pair 1 (6 sizes at constant rotation, and 8 rotations at constant size) varied in difficulty. To rule out any possible difficulty effects during training and at asymptotic performance, we trained n=8 different rats to asymptotic performance on the task but only on large, front views of the visual objects (Figure 2—figure supplement 1a). We compared the learning and asymptotic performance of the ‘size and rotation’ cohort and the ‘canonical only’ cohort across a wide range of behavioral measures. During learning, animals in both regimes followed similar learning trajectories in speed-accuracy space (Figure 2—figure supplement 1b), and clustered around the OPC at asymptotic performance (Figure 2—figure supplement 1c). Comparisons of accuracy, reaction time, and fraction maximum instantaneous reward rate trajectories during learning and at asymptotic performance revealed no detectable differences (Figure 2—figure supplement 1d–f). Total trials per session and voluntary intertrial intervals after error trials did show slightly different trajectories during learning, though there were no differences in their means after learning (Figure 2—figure supplement 1g, h). The difference in total trials per session could be unrelated to the difference in training regimes. The difference in voluntary intertrial intervals, however, could be related to the introduction of different sizes and rotations: a sudden spike in this metric is seen about halfway through normalized sessions and decays over time. If this is the case, it is a curious result that rats choose to display their purported ‘surprise’ in between trials, and not during trials, as we found no difference in the reaction time trajectories. Both training regimes had overlapping fraction-of-trials-ignored metrics during learning, with a sharp decrease after the start, and a small but significant difference in their number at asymptotic performance (Figure 2—figure supplement 1i). We note that we do not consider voluntary intertrial intervals or ignored trials in our analysis, so the differences between the regimes do not affect our conclusions. Overall, these results suggest that there is not a measurable or relevant difficulty effect based on our training regime with a variety of stimulus presentations.

Stimulus learnability experiment

Transparent stimuli. In order to assess how animals behaved in a scenario with non-existent learning potential, a subset of already well-trained animals were presented with transparent (n=5, alpha = 0) or near-transparent (n=5, alpha = 0.1) versions of the familiar stimulus pair 1 for a duration of 11 sessions. Before these sessions, 4 sessions with stimulus pair 1 at full opacity (alpha = 1) were conducted to ensure animals could perform the task adequately before the manipulation. We predicted that the near-transparent condition would segregate animals into two groups, those that could perform the task and those that could not, based on each individual’s perceptual ability. The animals in the near-transparent condition that remained around chance performance (n=3, rat AK2, AK9, and AK10) were grouped with the animals from the transparent condition, while those that performed well above chance (n=2, rat AK5 and AK12) were excluded.

Reaction times were predicted to decrease during the course of the experiment, so to measure the change most effectively, the minimum reaction time requirement of 350 ms was removed. However, removing the requirement could lead to reduced reaction times regardless of the presented stimuli. To be able to measure whether the transparent stimuli led to a significant difference in reaction times compared to visible stimuli, we ran sessions with visible stimuli with no reaction time requirement for the same animals and compared these reaction times with those from the transparent condition. We found that the aggregate reaction time distributions were significantly different (Figure 6—figure supplement 1a). A comparison of vincentized reaction times revealed that there was a significant difference in the fastest reaction time decile (Figure 6—figure supplement 1b), confirming that reaction times decreased significantly during presentation of transparent stimuli.

New visible stimuli. In order to assess how animals behaved in a scenario with high learning potential, a subset (n=16) of already well-trained animals on stimulus pair 1 were presented with a never before seen stimulus pair (stimulus pair 2) for a duration of 13 sessions. Before these sessions, 5 sessions with the familiar stimulus pair 1 were recorded immediately preceding the stimulus pair 2 sessions in order to compare performance and reaction time after the manipulation for every animal. Previous pilot experiments showed that the animals immediately assigned a left/right mapping to the new stimuli based on presumed similarity to previously trained stimulus pair, so in order to enforce learning, the left/right mapping contrary to that predicted by the animals in the pilot tests was chosen. Because of this, animals typically began with an accuracy below 50%, as they first had to undergo reversal learning for their initial mapping assumptions. Because the goal of this experiment was to measure effects during learning and not demonstrate invariant object recognition, the new stimuli were presented in large front views only (visual angle = 40°, rotation = 0°).

Behavioral data analysis

Software

Behavioral psychophysical data was recorded using the open-source MWorks 0.5.1 and 0.6 software (https://mworks.github.io/downloads/). The data were analyzed using Python 2.7 with the pymworks extension. We employed the hierarchical estimation of the DDM in python (HDDM) package for DDM fits (Wiecki et al., 2013). To measure stimulus-independent psychophysical strategies such as bias and perseverance, we employed PsyTrack, a generalized linear model package for fitting dynamic psychophysical models to behavioral data (Roy et al., 2021) in conjunction with Python 3.8.

DDM fit

In order to verify that our behavioral data could be modeled as a drift-diffusion process, the data were fit with an HDDM (Wiecki et al., 2013), permitting subsequent analysis (such as comparison to the OPC) based on the assumption of a drift-diffusion process. To verify that a DDM was appropriate for our data, we fit a simple DDM to 10 asymptotic sessions after learning stimulus pair 1 for n=26 subjects (Figure 1—figure supplement 3). In order to assess parameter changes across learning, we fit DDMs to the stimulus pair 1 experiment and the stimulus pair 2 experiment where the learning epochs were treated as conditions in each experiment. This allowed us to hold some parameters constant while conditioning others on learning. We fit both simple DDMs and DDMs with drift rate variability to the two experiments, allowing drift rate, threshold, and drift rate variability to vary with learning epoch. In particular, we fit three broad types of models: (1) simple DDMs (Figure 4—figure supplement 2), (2) DDMs + fixed drift rate variability (Figure 4—figure supplement 3), and (3) DDMs + drift rate variability that varied freely with learning epoch (Figure 4—figure supplement 4). For each of the types of models we held drift constant, threshold constant, or allowed both to vary with learning. The best fits, as determined by the deviance information criterion (DIC), came from models where we allowed both drift and threshold to vary with learning; the addition of drift rate variability did not appear to improve model fits (Figure 4—figure supplement 1). For both learning experiments, drift rates increased and thresholds decreased by the end of learning, in agreement with previous findings (Ditterich, 2006; Ratcliff et al., 2006; Dutilh et al., 2009; Balci et al., 2011b; Liu and Watanabe, 2012; Zhang and Rowe, 2014). In addition, for the transparent stimuli experiment we fit a DDM that allowed drift rate, threshold, drift rate variability, and T0 to vary with learning phase in order to observe the changes in drift rate and threshold (Figure 6—figure supplement 3).
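As a concrete illustration, the sketch below shows how such a fit could be set up with the HDDM package; the column names ('rt', 'response', 'subj_idx', 'phase'), the file name, and the sampler settings are illustrative assumptions rather than the exact configuration used in this study.

```python
# Minimal sketch of a hierarchical DDM fit (HDDM; Wiecki et al., 2013).
# Column names and sampler settings are illustrative placeholders.
import hddm

# Expected columns: 'rt' (s), 'response' (0/1 accuracy coding),
# 'subj_idx' (animal ID), 'phase' (learning epoch used as a condition).
data = hddm.load_csv('learning_trials.csv')

# Let drift rate (v) and threshold (a) vary with learning phase;
# 'sv' adds across-trial drift rate variability (held fixed across phases here).
model = hddm.HDDM(data,
                  depends_on={'v': 'phase', 'a': 'phase'},
                  include=('sv',))
model.find_starting_values()
model.sample(2000, burn=200)

# Alternative parameterizations can then be compared with the DIC.
print(model.dic)
```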

PsyTrack fit

In order to estimate stimulus-independent psychophysical strategies in the transparent stimulus experiment (Figure 6—figure supplement 4), we used PsyTrack to fit a generalized linear model to our behavioral data (Roy et al., 2021). The model assigns weights to user-determined input variables to explain the output variable. The output variable consists of a vector for left/right choices on every trial for an individual subject, where left = 0 and right = 1 (or 1 and 2). PsyTrack automatically calculates weights on bias, with a positive weight indicating a rightward bias, and a negative weight indicating a leftward bias. We fit the model by providing it with three explanatory input variables: stimulus, perseverance, and win-stay/lose-switch. For the input variables, the left/right coding differed from that for the output variable as per the model documentation (left = -1, right = +1). The stimulus variable indicated the stimulus that appeared on that trial (stimulus A = -1, stimulus B = +1). However, the left/right mapping of these stimuli was counterbalanced across subjects, so depending on the mapping, subjects could have a strong positive or negative weight, both indicating the stimuli explained choices. The perseverance variable indicated the left/right location of the subject’s choice on the previous trial. The win-stay/lose-switch variable indicated the location of the correct choice on the previous trial. For both perseverance and win-stay/lose-switch, positive weights indicate the predicted presence of perseverance and win-stay/lose-switch rather than left/right information.
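The sketch below illustrates, under the coding conventions described above, how the three input regressors could be constructed from a single subject's trial records before being passed to PsyTrack; the array names and example values are hypothetical, and the call into PsyTrack itself is omitted.

```python
# Sketch of the stimulus-independent regressors used in the dynamic GLM
# (PsyTrack; Roy et al., 2021). Arrays below are hypothetical examples.
import numpy as np

# Per-trial records for one animal:
# choice: 0 = left, 1 = right; stimulus_id: 0 = stimulus A, 1 = stimulus B;
# correct_side: 0 = left, 1 = right.
choice = np.array([0, 1, 1, 0, 1])
stimulus_id = np.array([0, 1, 0, 0, 1])
correct_side = np.array([0, 1, 1, 0, 0])

def to_pm1(x):
    """Recode 0/1 variables as -1/+1, the input coding described in the text."""
    return 2 * x - 1

# Stimulus regressor: which stimulus appeared on the current trial.
stimulus = to_pm1(stimulus_id)

# Perseverance: side chosen on the previous trial (first trial has no history).
perseverance = np.r_[0, to_pm1(choice[:-1])]

# Win-stay/lose-switch: side that was rewarded on the previous trial.
wsls = np.r_[0, to_pm1(correct_side[:-1])]

# These three columns (plus PsyTrack's built-in bias term) form the model inputs,
# with the left/right choices as the output variable.
inputs = np.column_stack([stimulus, perseverance, wsls])
```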

Behavioral metrics

Error rate (ER) was calculated by dividing the number of error trials by the number of total trials (error + correct) within a given window of trials in a full behavioral training session. Accuracy was calculated as 1-ER.

$\mathrm{ER} = \dfrac{\text{error trials}}{\text{total trials}}$ (13)

$\text{accuracy} = 1 - \mathrm{ER}$ (14)

Reaction time (RT) for one trial was measured by subtracting the stimulus onset time on the computer monitor from the time of the first lick on a response lickport. Mean RT was calculated by averaging reaction times across trials within a given window of trials or the trials in a full behavioral training session.

$RT = \dfrac{1}{n} \sum_{\text{trial } i=1}^{\text{trial } n} RT_i$ (15)

Vincentized reaction time is one method to report aggregate reaction time data meant to preserve individual distribution shape and be less sensitive to outliers in the group distribution (Ratcliff, 1979, Blokland, 1998), although some scientists have argued parametric fitting (with an ex-Gaussian distribution, for example) and parameter averaging across subjects outperforms Vincentizing as sample size increases (Rouder and Speckman, 2004; Whelan, 2008). Each subject’s reaction time distribution is divided into quantiles (e.g. deciles; similar to percentile, but between 0 and 1), and then the quantiles across subjects are averaged.
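A minimal sketch of this quantile-averaging procedure is shown below; the function and variable names are our own, and the fabricated reaction times are for illustration only.

```python
# Vincentized reaction times: compute the same quantiles for each subject's
# RT distribution, then average quantile-by-quantile across subjects.
import numpy as np

def vincentize(rt_per_subject, quantiles=np.arange(0.1, 1.0, 0.1)):
    """rt_per_subject: list of 1-D arrays of reaction times, one per subject."""
    per_subject_quantiles = np.array(
        [np.quantile(rts, quantiles) for rts in rt_per_subject])
    return quantiles, per_subject_quantiles.mean(axis=0)

# Example with fabricated RTs (seconds) for three hypothetical subjects.
rng = np.random.default_rng(0)
rts = [rng.lognormal(mean=-0.5, sigma=0.3, size=500) for _ in range(3)]
deciles, mean_deciles = vincentize(rts)
```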

Decision time (DT) for one trial was measured by subtracting the non-decision time T0 (see Estimating T0) from RT. Mean DT was calculated by subtracting T0 from the mean RT across trials within a given window of trials or the trials in a full behavioral training session.

$DT = RT - T_0$ (16)

Post-error and correct non-decision task engagement times ($D_{err}$, $D_{corr}$) are defined (for notational simplicity) as the sum of the non-decision time $T_0$ and the experimentally determined response-to-stimulus times after error ($\tilde{D}_{err}$) and correct ($\tilde{D}_{corr}$) trials. Please see Determining $\tilde{D}_{err}$ and $\tilde{D}_{corr}$ for how we determined these experimental variables.

$D_{err} = \tilde{D}_{err} + T_0$ (17)

$D_{corr} = \tilde{D}_{corr} + T_0$ (18)

Mean normalized decision time ($DT/D_{err}$) was measured by dividing mean $DT$ by $D_{err}$, the sum of the non-decision time $T_0$ and $\tilde{D}_{err}$, the mean non-decision task engagement time in an error trial (see $D_{err}$, $D_{corr}$).

Mean difference in mean reaction time (ΔRT) was calculated by subtracting the mean reaction time of a number of baseline sessions from the mean reaction time of an experimental session. A positive difference indicates an increase over the baseline mean reaction time. The mean of the two immediately preceding sessions with stimulus pair 1 was subtracted from the mean reaction time of every session with stimulus pair 2 or transparent stimuli for every animal individually (Figure 6d and e). These differences were then averaged to get a mean difference in mean reaction time ΔRT.

Mean instantaneous reward rate (iRR) (regularly referred to as just reward rate, RR) is defined as mean accuracy per mean time per trial (Gold and Shadlen, 2002):

$iRR = \dfrac{\text{mean accuracy}}{\text{mean time per trial}}$ (19)

We define the average non-decision task engagement time per trial,

$D_{tot} = (1 - ER)\,D_{corr} + ER\,D_{err}$ (20)

The mean instantaneous reward rate is then (see Equation A26 in Bogacz et al., 2006)

$iRR = \dfrac{1 - ER}{DT + D_{tot}}$ (21)

N.B. Because our study hinges on the important difference between present, future and cumulative rewards and tracks changes in ER and DT over learning, we write what is traditionally referred to as reward rate RR as instantaneous reward rate iRR to emphasize these differences. Because ER and DT can change throughout learning, reward rate as traditionally defined only captures an ‘instant’ in a learning trajectory.
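For concreteness, a minimal sketch of Equations 20 and 21 is given below; the function name and the example values are illustrative.

```python
# Instantaneous reward rate (Equations 20-21). Times are in seconds.
def instantaneous_reward_rate(er, dt, d_corr, d_err):
    """er: error rate; dt: mean decision time; d_corr/d_err: non-decision
    task engagement times after correct and error trials (T0 included)."""
    d_tot = (1.0 - er) * d_corr + er * d_err      # Equation 20
    return (1.0 - er) / (dt + d_tot)              # Equation 21

# Example: 15% errors, 0.4 s mean DT, 2 s post-correct and 4 s post-error delays.
irr = instantaneous_reward_rate(er=0.15, dt=0.4, d_corr=2.0, d_err=4.0)
```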

Mean total correct trials is a model-free measure of the reward attained by the animals within a given window of trials. Every correct response yields an identical water reward; hence, reward can be counted by counting correct responses across trials. For one subject $a \in [1, 2, 3, \ldots, k]$, total correct trials at trial n is the sum of correct trials up to trial n:

$c_n^a = \sum_{\text{trial } i=1}^{\text{trial } n} o_i^a$ (22)

where $o_i^a$ is an element in a vector $\mathbf{o}^a$ containing the outcomes of those trials, $\mathbf{o}^a = [o_1^a, o_2^a, o_3^a, \ldots, o_n^a]$. For correct and error responses $o_i^a = 1$ and $0$, respectively (e.g. $\mathbf{o}^a = [0, 0, 1, 1, 0, \ldots, 1]$).

Mean total correct trials up to trial n is calculated by taking the average of total correct trials across all animals k up to trial n.

$c_n = \dfrac{1}{k}\left(c_n^1 + c_n^2 + c_n^3 + \ldots + c_n^k\right)$ (23)

Mean cumulative reward is a measure of the reward attained by the animals within a given window of trials. To calculate this quantity, a moving average of RT and accuracy for a given window size are first calculated for every animal individually. To avoid averaging artifacts, only values a full window length from the beginning are considered. Given these moving averages, iRR is then calculated for every animal and subsequently averaged across animals to get a moving average of mean reward rate. To calculate the mean cumulative reward, a numerical integral over a particular task time, such as task engagement time (see Measuring task time), is then calculated using the composite trapezoidal rule.
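A sketch of this calculation is shown below; the window size, variable names, and the exact handling of partial-window trials are illustrative choices rather than the precise settings used in the study.

```python
# Mean cumulative reward: moving-average iRR over trials, integrated over task
# engagement time with the trapezoidal rule. Values and window are illustrative.
import numpy as np

def cumulative_reward(correct, rt, engagement_time, t0, d_corr, d_err, window=100):
    """correct: 0/1 trial outcomes; rt: reaction times (s); engagement_time:
    cumulative task engagement time (s) per trial; t0: non-decision time (s);
    d_corr/d_err: post-correct/post-error non-decision engagement times
    (T0 included, Equations 17-18)."""
    kernel = np.ones(window) / window
    acc = np.convolve(correct, kernel, mode='valid')     # moving-average accuracy
    dt = np.convolve(rt, kernel, mode='valid') - t0      # moving-average decision time
    er = 1.0 - acc
    d_tot = (1.0 - er) * d_corr + er * d_err             # Equation 20
    irr = acc / (dt + d_tot)                             # Equation 21
    t = engagement_time[window - 1:]                     # drop partial-window trials
    # Cumulative reward: numerical integral of iRR over task engagement time.
    return np.array([np.trapz(irr[:i], t[:i]) for i in range(1, len(irr) + 1)])
```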

SNR is a measure of an agent’s perceptual ability in a discrimination task. Given an animal’s particular ER and DT, we infer its SNR $\bar{A}$ using an equation deduced from standard DDM relations (Equation 56 in Bogacz et al., 2006):

$\bar{A}_{\mathrm{infer}} = \dfrac{1 - 2\,ER}{2\,DT}\,\log\dfrac{1 - ER}{ER}$ (24)

The SNR equation defines a U-shaped curve that increases as ERs move away from 0.5. For cases early in learning where ERs were above 0.5 (i.e., below-chance accuracy) because of potential initial biases, we assumed the inferred SNR was negative (meaning the animals had to unlearn the biases in order to learn, and thus had a monotonically increasing SNR during learning).
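A minimal sketch of this inference (the function name and the clipping of extreme error rates are illustrative assumptions):

import numpy as np

def inferred_snr(er, dt):
    # Equation 24; dt is the mean decision time in seconds.
    er = np.clip(er, 1e-6, 1 - 1e-6)                 # avoid log(0) at the extremes
    a_bar = (1 - 2 * er) / (2 * dt) * np.log((1 - er) / er)
    # Below-chance performance (ER > 0.5) is assigned a negative SNR,
    # following the convention described above.
    return np.where(er > 0.5, -np.abs(a_bar), a_bar)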

SNR performance frontier is a measure of an agent’s possible error rate and reaction time combinations based on their current perceptual ability. Because of the SAT, not all combinations of ER and DT are possible. Instead, performance is bounded by an agent’s SNR A¯ at any point in time, and their particular (ER, DT) combination will depend on their choice of threshold.

Given a fixed Derr (as in the case of our experiment), this bound exists in the form of a performance frontier – the combination of all resultant ERs and mean normalized DTs possible given a fixed SNR A¯ and all possible thresholds z¯.

We can use A¯infer (Equation 24) to calculate its performance frontier for a range of thresholds z̄ ∈ [0, ∞) with standard equations from the DDM:

ER_{\bar{A}_{infer}} = \frac{1}{1 + e^{2\bar{z}\bar{A}_{infer}}}  (25)
DT_{\bar{A}_{infer}} = \bar{z}\tanh(\bar{z}\bar{A}_{infer}).  (26)

For every performance frontier there will be one unique (ERA¯infer, DTA¯infer) combination for which reward rate will be greatest, and it will lie on the OPC.

Fraction maximum instantaneous reward rate is a measure of distance to the OPC, that is, optimal performance. Given an animal’s ER and DT, we inferred their SNR and calculated their performance frontier as described above. We then divided the animal’s reward rate by the maximum reward rate on their performance frontier, corresponding to the point on the OPC they could have attained given their inferred SNR A¯infer:

\text{fraction max } iRR = \frac{iRR_{ER,\,DT}}{\max iRR_{\bar{A}_{infer}}}.  (27)

Maximum instantaneous reward rate opportunity cost, like fraction maximum instantaneous reward rate, is also a measure of distance to the OPC, that is, optimal performance, but it emphasizes the reward rate fraction given up by the subject given its current ER and DT combination along its SNR performance frontier. It is simply:

\text{max } iRR \text{ opportunity cost} = 1 - \text{fraction max } iRR.  (28)
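The frontier and the two distance-to-optimality measures can be computed with a short sketch (a minimal illustration under the definitions above; the threshold grid and function names are assumptions):

import numpy as np

def performance_frontier(a_infer, d_err, d_corr, z_grid=np.linspace(1e-3, 20, 2000)):
    # Sweep thresholds to trace the frontier (Equations 25-26) and its iRR.
    er = 1.0 / (1.0 + np.exp(2 * z_grid * a_infer))
    dt = z_grid * np.tanh(z_grid * a_infer)
    d_tot = (1 - er) * d_corr + er * d_err            # Equation 20
    irr = (1 - er) / (dt + d_tot)                     # Equation 21
    return er, dt, irr

def fraction_max_irr(er_obs, dt_obs, a_infer, d_err, d_corr):
    # Equation 27; the opportunity cost (Equation 28) is 1 minus this value.
    d_tot_obs = (1 - er_obs) * d_corr + er_obs * d_err
    irr_obs = (1 - er_obs) / (dt_obs + d_tot_obs)
    _, _, irr_frontier = performance_frontier(a_infer, d_err, d_corr)
    return irr_obs / irr_frontier.max()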

Mean post-error slowing is a metric to account for the potential policy of learning by slowing down after error trials. In order to quantify the amount of post-error slowing in a particular subject, the subject’s reaction times in a session are segregated into correct trials following an error, and correct trials following a correct choice, and separately averaged. The difference between these indicates the degree of post-error slowing present in that subject during that session.

\text{post-error slowing (PES)} = \overline{RT}_{\text{post-error correct trials}} - \overline{RT}_{\text{post-correct correct trials}}.  (29)

The mean post-error slowing for one session is thus the mean of this quantity across all subjects k.

\overline{PES} = \frac{PES_1 + PES_2 + PES_3 + \ldots + PES_k}{k}.  (30)

Left/right bias measures the extent to which a subject is biased to the left or right lickport regardless of the stimulus presented. For every individual, left or right choices for every trial are coded as a binary vector (left = 0, right = 1). The correct response side is also coded as a binary vector. Bias is calculated by taking the difference of these vectors. A Gaussian filter is then applied to smooth the bias vector over time, with negative numbers reflecting a left bias and positive numbers reflecting a right bias.
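A minimal sketch of this measure, assuming the Gaussian smoothing is applied over trials with an illustrative width (sigma):

import numpy as np
from scipy.ndimage import gaussian_filter1d

def smoothed_bias(choices, correct_sides, sigma=25):
    # choices and correct_sides are binary vectors (left = 0, right = 1).
    # Negative values indicate a left bias, positive values a right bias.
    raw_bias = np.asarray(choices, float) - np.asarray(correct_sides, float)
    return gaussian_filter1d(raw_bias, sigma=sigma)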

Computing error

Within-subject session errors (e.g. Figure 1d) for accuracy and reaction times were calculated by bootstrapping trial outcomes and reaction times for each session. We calculated a bootstrapped standard error of the mean by taking the standard deviation of the distribution of means from the bootstrapped samples. A 95% confidence interval can be calculated from the distribution of means as well.
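A minimal sketch of the within-session bootstrap (the resampling count and seed are illustrative):

import numpy as np

def bootstrap_sem(values, n_boot=1000, seed=0):
    # Resample trials with replacement, take the mean of each resample, and
    # return the SD of the distribution of means (bootstrapped SEM) and a 95% CI.
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    means = np.array([rng.choice(values, size=values.size, replace=True).mean()
                      for _ in range(n_boot)])
    return means.std(), tuple(np.percentile(means, [2.5, 97.5]))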

Across-subject session errors (e.g. Figure 6d) were computed by calculating the standard error of the mean of individual animal session means.

Across-subject sliding window errors (e.g. Figure 6b; Figure 7b) were calculated by averaging trials over a sliding window (e.g. 200 trials) for each animal first, then taking the standard error of the mean of each step across animals. Alternatively, the average could be taken across a quantile (e.g. first decile, second decile, etc.), and then the standard error of the mean of each quantile across animals was computed.

Measuring task time

Trials are the smallest unit of behavioral measure in the task and are defined by one stimulus presentation accompanied by one outcome (correct, error) and one reaction time.

Sessions are composed of as many trials as an animal chooses to complete within a set window of wall clock time, typically around 2 hr once daily. An error rate (fraction of error trials over total trials for the session) and a mean reaction time can be calculated for a session.

Normalized sessions are a group of sessions (e.g. 1, 2, 3, …, 10) where a particular session’s normalized index corresponds to its index divided by the total number of sessions in the group (e.g. 0.1, 0.2, 0.3, …, 1.0). Because animals may take different numbers of sessions to learn to criterion, a normalized index for sessions allows better comparison of psychophysical measurements throughout learning.

Stimulus viewing time measures the time that the animals are viewing the stimulus, defined as the sum of all reaction times up to trial n as:

\text{stimulus viewing time} = \sum_{\text{trial } i=1}^{\text{trial } n} RT_i.  (31)

Task engagement time is the sum of reaction times up to trial n plus the mandatory post-correct and post-error response-to-stimulus intervals (D~corr = 6370 ms and D~err = 3136 ms; see Determining D~err and D~corr), counted in proportion to the number of correct and error trials (n = ncorr + nerr):

\text{task engagement time} = \sum_{\text{trial } i=1}^{\text{trial } n} RT_i + n_{corr}\tilde{D}_{corr} + n_{err}\tilde{D}_{err}.  (32)

Statistical analyses

Figure 1d: We wished to test whether the mean fraction maximum reward rate of our subjects over the 10 sessions after having completed training were significantly different from optimal performance. A Shapiro-Wilk test failed to reject (p<0.05) a null hypothesis for normality for 18/26 subjects, with the following p-values (from left to right): (0.8162, 0.1580, 0.3746, 0.6985, 0.0025, 0.0467, 0.0040, 0.6522, 0.0109, 0.1625, 1.8178e-05, 0.0901, 0.7606, 0.0295, 0.0009, 0.2483, 0.5627, 0.0050, 0.4464, 0.6839, 0.5953, 0.0140, 0.1820, 0.1747, 0.6385, 0.2304). Thus, we conducted a one-sided Wilcoxon signed-rank test on our sample against 0.99, testing for the evidence that each subject’s mean fraction max reward rate was greater than 99% of the maximum (p<0.05), and obtained the following p-values (from left to right): (0.0025, 0.0025, 0.0025, 0.1013, 0.2223, 0.0063, 0.0047, 0.0025, 0.0025, 0.0025, 0.0025, 0.0571, 0.6768, 0.0047, 0.7125, 0.0372, 0.8794, 0.4797, 0.7125, 0.8987, 0.0372, 0.0109, 0.9975, 0.9766, 0.9917, 0.9975).

Figure 5b: We wished to test the difference in mean RT between two randomly chosen groups of animals before and after an RT restriction to assess the effectiveness of the restriction. A Shapiro-Wilk test did not support an assumption of normality for the ‘below’ group in either condition resulting in the following (W statistic, p-value) for the pre-RT restriction ‘above’ and ‘below’ groups and post-RT restriction ‘above’ and ‘below’ groups: (0.9073, 0.3777), (0.6806, 0.0059), (0.8976, 0.3168), (0.6583, 0.0033). Hence, we conducted a Wilcoxon rank-sum test for the pre- and post-RT restriction groups and found the pre-RT restriction group was not significant (p=0.570) while the post-RT restriction group was (p=0.007), indicating the two groups were not significantly different before the RT restriction, but became significantly different after the restriction.

Figure 5d: We wished to test the difference in accuracy between the ‘above’ and ‘below’ groups for every session of stimulus pair 3. A Shapiro-Wilk test failed to reject the assumption of normality (p<0.05) for any session from either condition (except session 4, ‘above’, which could be expected given there were 16 tests), with the following (W statistic, p-value) for [session: ‘above’, ‘below’] by session: [1: (0.9340, 0.6240),(0.8959, 0.3068)], [2: (0.9381, 0.6522), (0.8460, 0.1130)], [3: (0.9631, 0.8291), (0.9058, 0.3676)], [4: (0.7608, 0.0374), (0.9728, 0.9177)], [5: (0.8921, 0.3680), (0.9779, 0.9486)], [6: (0.7813, 0.0565), (0.9702, 0.9002)], [7: (0.8942, 0.3786), (0.9711, 0.9062)], [8: (0.7848, 0.0605), (0.9611, 0.8280)].

A Levene test failed to reject the assumption of equal variances for every pair of sessions except the first (statistic, p-value): (6.3263, 0.0306), (2.2780, 0.1621), (1.2221, 0.2948), (0.8570, 0.3764), (2.7979, 0.1253), (0.7364, 0.4109), (0.0871, 0.7739), (0.0088, 0.9269).

Hence, we performed a two-sample independent t-test for every session with the following p-values: (0.4014, 0.04064, 0.0057, 0.0038, 0.0011, 0.0038, 0.0006, 6.3658e-05).

We also wished to test the difference between the slopes of linear fits to the accuracy curves for both conditions. A Shapiro-Wilk test failed to reject the assumption of normality (p<0.05) for either condition, with the following (W statistic, p-value) for ‘above’ and ‘below’: (0.8964, 0.3095), (0.8794, 0.3065). A Levene test failed to reject the assumption of equal variances (p<0.05) for each condition (statistic, p-value): (0.2141, 0.6535). Hence, we performed a two-sample independent t-test and found a significant difference (p=0.0027).

Figure 6d: We wished to test whether the animals had significantly changed their session mean RTs with respect to their individual previous baseline RTs (paired samples). To do this, we conducted a permutation test for every session with the new visible stimuli (stimulus pair 2) or the transparent stimuli. For 1000 repetitions, we randomly assigned labels to the experimental or baseline RTs and then averaged the paired differences. The p-value for a particular session was the fraction of instances where the average permutation difference was more extreme than the actual experimental difference. For sessions with stimulus pair 2, the p-values from the permutation test were: (0.0034, 0.0069, 0.0165, 0.0071, 0.0291, 0.0347, 0.06, 0.0946, 0.3948, 0.244, 0.244, 0.4497, 0.3437). For sessions with transparent stimuli (plus rats AK2, AK9, and AK10 from the near-transparent stimuli), the p-values from the permutation test were (0.0859375, 0.44921875, 0.15625, 0.03125, 0.02734375, 0.015625, 0.26953125, 0.02734375, 0.03125, 0.01953125, 0.0546875). To investigate whether the animals significantly slowed down their mean RTs compared to baseline during the first session of transparent stimuli, we divided RTs in the first session in half and ran a permutation test on each half with the following p-values: (0.0390625, 0.2890625).

Figure 6e: In order to test the correlation between the initial change in RT and the initial change in SNR for stimulus pair 2, we ran a standard linear regression on the average per subject for each of these variables for the first three sessions of stimulus pair 2 with R2 = 0.38 and p-value = 0.01.

Figure 2—figure supplement 1d—i: Statistical significance of differences in means between the two training regimes for a variety of psychophysical measures was determined by a Wilcoxon rank-sum test with p<0.05. The p-values were: (d) accuracy: 0.21, (e) reaction time: 0.81, (f) fraction max iRR: 0.22, (g) total trial number: 0.46, (h): voluntary intertrial interval after error: 0.75, (i) fraction trials ignored: 0.03.

Figure 4—figure supplement 2: Statistical significance for mean parameters was calculated by taking the difference between the posterior distributions and using the proportion of the difference distribution that overlapped with 0 as the p-value. For individuals’ parameters, p-values were determined via a Wilcoxon signed-rank test. The p-values were: (a) drift mean:<1e-4, drift individuals:<1e-4; (b) threshold mean: 0.0012, threshold individuals: 0.0008; (c) drift mean: <1e-4, drift individuals: <1e-4, threshold mean: 0.0298, threshold individuals: 0.0585; (d) drift mean: (baseline versus start learn: <1e-4, baseline versus after learn: 0.0378, start versus after learn: <1e-4), drift individuals: (baseline versus start learn: 0.0004, baseline versus after learn: 0.0703, start versus after learn: 0.0004); (e) threshold mean: (baseline versus start learn: 0.1100, baseline versus after learn: 0.3546, start versus after learn: 0.1904), threshold individuals: (baseline versus start learn: 0.0045, baseline versus after learn: 0.4380, start versus after learn: 0.1627); (f) drift mean: (baseline versus start learn: <1e-4, baseline versus after learn: 0.0616, start versus after learn: <1e-4), drift individuals: (baseline versus start learn: 0.0004, baseline versus after learn: 0.1089, start versus after learn: 0.0004), threshold mean: (baseline versus start learn: 0.2546, baseline versus after learn: 0.4614, start versus after learn: 0.2816), threshold individuals: (baseline versus start learn: 0.0200, baseline versus after learn: 0.7174, start versus after learn: 0.1089).

Figure 4—figure supplement 3: Statistical significance was determined as for Figure 4—figure supplement 2. The p-values were: (a) drift mean: <1e-4, drift individuals: <1e-4; (b) threshold mean: 0.0036, threshold individuals: 0.0013; (c) drift mean: <1e-4, drift individuals: <1e-4, threshold mean: 0.0314, threshold individuals: 0.0585; (d) drift mean: (baseline versus start learn: <1e-4, baseline versus after learn: 0.0428, start versus after learn: <1e-4), drift individuals: (baseline versus start learn: 0.0004, baseline versus after learn: 0.0703, start versus after learn: 0.0004); (e) threshold mean: (baseline versus start learn: 0.0866, baseline versus after learn: 0.4192, start versus after learn: 0.1252), threshold individuals: (baseline versus start learn: 0.0038, baseline versus after learn: 0.5349, start versus after learn: 0.1089); (f) drift mean: (baseline versus start learn: <1e-4, baseline versus after learn: 0.0618, start versus after learn: <1e-4), drift individuals: (baseline versus start learn: 0.0004, baseline versus after learn: 0.0980, start versus after learn: 0.0004), threshold mean: (baseline versus start learn: 0.2436, baseline versus after learn: 0.4596, start versus after learn: 0.2860), threshold individuals: (baseline versus start learn: 0.0200, baseline versus after learn: 0.7174, start versus after learn: 0.1089).

Figure 4—figure supplement 4: Statistical significance was determined as for Figure 4—figure supplement 2. The p-values were: (a) drift mean: <1e-4, drift individuals: <1e-4, drift variability: <1e-4; (b) threshold mean: 0.0036, threshold individuals:< 1e-4, drift variability: <1e-4; (c) drift mean: <1e-4, drift individuals: <1e-4, threshold mean: 0.0696, threshold individuals: 0.1587, drift variability: <1e-4; (d) drift mean: (baseline versus start learn: <1e-4, baseline versus after learn: 0.0422, start versus after learn: <1e-4), drift individuals: (baseline versus start learn: 0.0004, baseline versus after learn: 0.0557, start versus after learn: 0.0004), drift variability: (baseline versus start learn: 0.7940, baseline versus after learn: 0.8104, start versus after learn: 0.4564); (e) threshold mean: (baseline versus start learn: <1e-4, baseline versus after learn: 0.4188, start versus after learn: 0.0002), threshold individuals: (baseline versus start learn: 0.0004, baseline versus after learn: 0.5014, start versus after learn: 0.0004), drift variability: (baseline versus start learn: <1e-4, baseline versus after learn: 0.4442, start versus after learn: <1e-4); (f) drift mean: (baseline versus start learn:<1e-4, baseline versus after learn: 0.0596, start versus after learn: <1e-4), drift individuals: (baseline versus start learn: 0.0004, baseline versus after learn: 0.0980, start versus after learn: 0.0004), threshold mean: (baseline versus start learn: 0.2474, baseline versus after learn: 0.4702, start versus after learn: 0.2652), threshold individuals: (baseline versus start learn: 0.0200, baseline versus after learn: 0.7174, start versus after learn: 0.1089), drift variability: (baseline versus start learn: 0.2392, baseline versus after learn: 0.5294, start versus after learn: 0.2132).

Figure 6—figure supplement 1a, b: We tested for a difference in the aggregate reaction time distributions of a transparent stimuli condition in the first two and last two sessions (n=8 subjects), and a no minimum reaction time condition with known stimuli (n=8 subjects) via a two-sample Kolmogorov-Smirnov test and found a p-value of <1e-4 for both comparisons.

Figure 6—figure supplement 3: Statistical significance was determined as for Figure 4—figure supplement 2. The p-values were: (a) drift mean: (visible versus start transparent: 0.0002, visible versus end transparent: <1e-4, start versus end transparent: 0.3180), drift individuals: (visible versus start transparent: 0.0117, visible versus end transparent: 0.0117, start versus end transparent: 0.5754), threshold mean: (visible versus start transparent: 0.1362, visible versus end transparent: 0.0094, start versus end transparent: 0.0658), threshold individuals: (visible versus start transparent: 0.0499, visible versus end transparent: 0.0173, start versus end transparent: 0.0173), drift variability: (visible versus start transparent: 0.1494, visible versus end transparent: 0.1614, start versus end transparent: 0.5032), T0 mean: (visible versus start transparent: 0.0068, visible versus end transparent: 0.0194, start versus end transparent: 0.6106), T0 individuals: (visible versus start transparent: 0.0117, visible versus end transparent: 0.0117, start versus end transparent: 0.0929).

Figure 6—figure supplement 5b, d: We tested for a difference in mean post-error slowing between the first two sessions and last two sessions of training for each animal for stimulus pair 1 (b) or the last two sessions of stimulus pair 1 and the first two sessions of stimulus pair 2 (d) via a Wilcoxon-signed rank test. The p-values were (b) 0.585 and (d) 0.255.

Evaluation of optimality

Under the assumptions of a simple drift-diffusion process, the OPC defines a set of optimal threshold-to-drift ratios with corresponding decision times and error rates for which an agent maximizes instantaneous reward rate (Bogacz et al., 2006). Decision times are scaled by the particular task timing as mean normalized decision time: DT/Derr. The OPC is parameter free and can thus be used to compare performance across tasks, conditions, and individuals. An optimal agent will lie on different points on the OPC depending on differences in task timing (Derr) and stimulus difficulty (SNR). Assuming constant task timing, the SNR will determine different positions along the OPC for an optimal agent. For DT>0 and 0<ER<0.5, the OPC is defined as:

\frac{DT}{D_{err}} = \left[\frac{1}{ER\log\frac{1 - ER}{ER}} + \frac{1}{1 - 2ER}\right]^{-1}  (33)

and exists in speed-accuracy space, defined by DT/Derr and ER. Given an estimate of Derr, the ER and DT for any given animal can be compared to the optimal values defined by the OPC in speed-accuracy space.

Moreover, because ER should decrease with learning, learning trajectories for different subjects and models can also be compared to the OPC and to each other in speed-accuracy space.
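For reference, the OPC in Equation 33 can be evaluated directly; the sketch below is a straightforward transcription of that formula (the grid of error rates is illustrative):

import numpy as np

def opc(er):
    # Optimal DT/D_err as a function of error rate (Equation 33), for 0 < ER < 0.5.
    er = np.asarray(er, float)
    return 1.0 / (1.0 / (er * np.log((1 - er) / er)) + 1.0 / (1 - 2 * er))

er_grid = np.linspace(0.01, 0.49, 100)
optimal_dt_norm = opc(er_grid)   # compare measured (ER, DT/D_err) points to this curve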

Mean normalized decision time depends only on Derr.

For completeness, we include a derivation showing that the appropriate normalized decision time for the OPC depends only on Derr, not Dcorr. According to Gold and Shadlen, 2002, average reward rate is defined as:

RR = \frac{\text{average accuracy}}{\text{average time per trial}}.  (34)

We can write the average reward rate as (see A26 from Bogacz et al., 2006),

RR = \frac{1 - ER}{DT + D_{corr} + ER\,(D_{err} - D_{corr})}.  (35)

Optimal behavior is defined as maximizing reward rate with respect to the thresholds in the DDM. We thus rewrite ER and DT in terms of average threshold and average SNR,

RR = \frac{1 - \frac{1}{1 + e^{2\bar{z}\bar{A}}}}{\bar{z}\tanh(\bar{z}\bar{A}) + D_{corr} + \frac{1}{1 + e^{2\bar{z}\bar{A}}}\,(D_{err} - D_{corr})}  (36)
= \frac{1}{\bar{z} + D_{corr} + (D_{err} - \bar{z})\,e^{-2\bar{z}\bar{A}}}.  (37)

Next, to find the extremum, we take the derivative of RR with respect to the threshold and set it to zero,

\frac{\partial RR}{\partial\bar{z}} = \frac{-1 + \left[1 + 2\bar{A}(D_{err} - \bar{z})\right]e^{-2\bar{z}\bar{A}}}{\left(\bar{z} + D_{corr} + (D_{err} - \bar{z})\,e^{-2\bar{z}\bar{A}}\right)^2}  (38)
0 = 1 - \left[1 + 2\bar{A}(D_{err} - \bar{z})\right]e^{-2\bar{z}\bar{A}}  (39)
= \frac{1 - 2ER}{ER} - \frac{D_{err}}{DT}(1 - 2ER)\log\frac{1 - ER}{ER} + \log\frac{1 - ER}{ER},  (40)

where in the final step we have rewritten z¯ and A¯ in terms of ER and DT.

Rearranging to place DT on the left-hand side reveals an OPC where decision time is normalized by the post-error non-decision time Derr:

\frac{DT}{D_{err}} = \left[\frac{1}{ER\log\frac{1 - ER}{ER}} + \frac{1}{1 - 2ER}\right]^{-1}.  (41)

Notably, the post-correct non-decision time Dcorr is not part of the normalization. Intuitively, this is because post-correct delays are an unavoidable part of accruing reward and therefore do not influence the optimal policy.

Estimating T0

T0 is defined as the non-decision time component of a reaction time, comprising motor and perceptual processing time (Holmes and Cohen, 2014). It can be estimated by fitting a DDM to the psychophysical data. Because of the experimentally imposed minimum reaction time meant to ensure visual processing of the stimuli, however, our reaction time distributions were truncated at 350 ms, meaning a DDM fit estimate of T0 is likely to be an overestimate. To address this issue, we set out to determine possible boundaries for T0 and estimated it in a few ways, all of which did indeed fall between those boundaries (Figure 1—figure supplement 4e).

We found that after training, in the interval between 350 and 375 ms, nearly all of our animals had accuracy measurements above chance (Figure 1—figure supplement 4b), meaning that the minimum reaction time of 350 ms served as an upper bound to possible T0 values.

To determine a lower bound, we obtained measurements for the two components comprising T0: motor and initial perceptual processing times. To measure the minimum motor time required to complete a trial, we analyzed licking times across the different lickports. The latency from the last lick in the central port to the first lick in one of the two side ports peaked at around 80 ms (Figure 1—figure supplement 4c). In addition, the latency from one lick to the next lick at the same port at any of the lickports was also around 80 ms (data not shown). Because the latencies in lick times between lickports (requires movement of the head) and within the same lickport (does not require movement of the head) were about equal, we concluded that the minimum motor time was determined by the limit on lick frequency, and not by a movement of the head redirecting the animal from the central port to one of the side ports. To measure the initial perceptual processing times, we looked to published latencies of visual stimuli traveling to higher visual areas in the rat. Published latencies reaching area TO (predicted to be after V1, LM, and LI in the putative ventral stream in the rat) were around 80 ms (Figure 1—figure supplement 4d; Vermaercke et al., 2014). Based on these measurements, we estimated a T0 lower bound of approximately 160 ms.

One worry is that our lower bound could potentially be too low, as it is only estimated indirectly. Recent work on the SAT in a low-level visual discrimination task in rats found that accuracy was highest at a reaction time of 218 ms (Kurylo et al., 2020). However, accuracy was still above chance for reaction times binned between 130 and 180 ms. In this task, reaction time was measured when an infrared beam was broken, which means we can assume there was no motor processing time. This leaves decision time and initial perceptual processing time (part of T0) within the 130–180 ms duration. The difference in complexity between a high-level visual task like ours and a low-level one will result in substantial differences in decision time, but should not in principle affect non-decision time. A latency estimate of 80 ms based on physiological evidence (Vermaercke et al., 2014) can account for the initial perceptual processing component of T0 and gives an estimate of T0 = 80 ms for that study.

Because a reaction time around T0 should not allow for any decision time, accuracy should be around 50%. To estimate T0 based on this observation, we extrapolated the time at which accuracy would drop to 50% after plotting accuracy as a function of reaction time (Figure 1—figure supplement 4a) and found values of 165 and 225 ms for linear and quadratic extrapolations respectively. Finally, we fit our behavioral data with an HDDM (Wiecki et al., 2013) and found a T0 estimate of 295 ±4 ms (despite there being no data below 350 ms). To address this issue, we fit a DDM to a small number of behavioral sessions we conducted with animals trained on the minimum reaction time of 350 ms but where that constraint was eliminated and found a T0 estimate of 265 ± 120 (SD) ms. We stress that because the animals were trained with a minimum reaction time, they likely would have required extensive training without that constraint to fully make use of the time below the minimum reaction time, thus this estimate is likely to also be an overestimate. We do note however that the estimate is lower than the estimate with an enforced minimum reaction time and has a much higher standard deviation (spanning our lower and upper bound estimates).
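A minimal sketch of the extrapolation estimate (a hypothetical helper; the fit degree, units, and root-selection rule are assumptions, not the exact procedure used):

import numpy as np

def extrapolate_t0(rt_bins, accuracy, degree=1):
    # Fit accuracy as a function of binned RT (degree 1 = linear, 2 = quadratic)
    # and return the positive crossing of 50% accuracy nearest the low RT end.
    rt_bins = np.asarray(rt_bins, float)
    coeffs = np.polyfit(rt_bins, accuracy, degree)
    roots = np.roots(coeffs - np.concatenate([np.zeros(degree), [0.5]]))
    candidates = roots[np.isreal(roots)].real
    candidates = candidates[candidates > 0]
    if candidates.size == 0:
        return np.nan
    return candidates[np.argmin(np.abs(candidates - rt_bins.min()))]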

Despite the range of possible T0 values, we find that our qualitative findings (in terms of learning trajectory and near-optimality after learning) do not change (Figure 1—figure supplement 4f, g), and proceed with a T0=160 ms for the main text.

Determining D~err and D~corr

The experimental protocol imposes a mandatory post-error and post-correct response-to-stimulus time (D~err and D~corr, respectively). However, these times may not be accurate because of delays in the software communicating with different components such as the syringe pumps, and other delays such as screen refresh rates. We thus determined the actual mandatory post-error and post-correct response-to-stimulus times by measuring them based on timestamps on experimental file logs and found that D~err=3136 ms, and D~corr=6370 ms (Figure 1—figure supplement 6).

Voluntary intertrial interval

We assume that the animals optimize reward rate based on task engagement time, that is, the sum of reaction times and all mandatory task delays, but not including any extra voluntary intertrial intervals. Therefore, our measures of non-decision task engagement time Derr and Dcorr do not include voluntary intertrial intervals. In essence this amounts to the assumption that animals exit the task between trials, potentially pursuing other goals, and do not count this voluntary interval when measuring their within-task reward rate.

We conducted a detailed analysis of the voluntary intertrial intervals after both correct and error trials (Figure 1—figure supplement 5). To prevent a new trial from initiating while the animals were licking one of the side lickports, the task included a 300 ms interval at the end of a trial where an extra 500 ms were added if the animal licked one of the side lickports (Figure 1—figure supplement 6). There was no stimulus (visual or auditory) to indicate the presence of this task feature so the animals were not expected to learn it. It was clear that the animals did not learn this task feature as most voluntary intertrial intervals are clustered in 500 ms intervals and decay after each boundary (Figure 1—figure supplement 5a). Aligning the voluntary intertrial distributions every 500 ms reveals substantial overlap (Figure 1—figure supplement 5c, d), indicating similar urgency in every 500 ms interval, with an added amount of variance the farther the interval from zero. Moreover, measuring the median voluntary intertrial interval from 0 to 500, 0–1000, and 0–2000 ms showed very similar values (47, 67, 108 ms after error trials, Figure 1—figure supplement 5b). The median was higher after correct trials (55, 134, 512 ms, Figure 1—figure supplement 5b) because the animals were collecting reward from the side lickports and much more likely to trigger the extra 500 ms penalty times.

Reward rate sensitivity to T0 and voluntary intertrial interval

To ensure that our results did not depend on our chosen estimate for T0 and our choice to ignore voluntary intertrial intervals when computing metrics like Dtot and reward rate, we computed fraction maximum instantaneous reward rate as a function of T0 and voluntary intertrial interval. We conducted this analysis across n=26 rats at asymptotic performance (Figure 1—figure supplement 7a, b), and during the learning period (Figure 1—figure supplement 7c, d). During asymptotic performance, sweeping T0 from our estimated minimum to our maximum possible values generated negligible changes in reward rate across a much larger range of possible voluntary intertrial intervals than we observed (Figure 1—figure supplement 7a). Reward rate was more sensitive to voluntary intertrial intervals, but did not drop below 90% of the possible maximum when considering a median voluntary intertrial interval up to 2000 ms (the median when allowing up to a 2000 ms window after a trial, after which agents are considered to have ‘exited the task’) (Figure 1—figure supplement 7b). During learning, we found similar results, with possible voluntary intertrial interval values having a larger effect on reward rate than T0. However, even with the most extreme combination of a maximum T0=350 ms and the median voluntary intertrial interval up to 2000 ms (Figure 1—figure supplement 7d, light gray trace), fraction maximum reward rate was at most 10–15% away from the least extreme combination of T0=160 ms and voluntary intertrial interval = 0 (Figure 1—figure supplement 7c, horizontal line along the bottom of the heat map) for most of the learning period. These results confirm that our qualitative findings do not depend on our estimated values of T0 and choice to ignore voluntary intertrial intervals.

Ignore trials

Because of the free-response nature of the task, animals were permitted to ignore trials after having initiated them (Figure 1—figure supplement 2). Although the fraction of ignored trials did seem to be higher at the beginning of learning for the first set of stimuli the animals learned (stimulus pair 1; Figure 1—figure supplement 2a), this effect did not repeat for the second set (stimulus pair 2, Figure 1—figure supplement 2b). This suggests that the cause for ignoring the trials during learning was not stimulus-based but rather related to learning the task for the first time. Overall, the mean fraction of ignored trials remained consistently low across stimulus sets and ignore trials were excluded from our analyses.

Post-error slowing

In order to verify that the increase in reaction time we saw at the beginning of learning relative to the end of learning was not solely attributable to a post-error slowing policy, we quantified the amount of post-error slowing during learning for both stimulus pair 1 and stimulus pair 2. For stimulus pair 1, we found that there was a consistent but slight amount of average post-error slowing (Figure 6—figure supplement 5a). This amount was not significantly different at the start and end of learning (Figure 6—figure supplement 5b).

We re-did this analysis for stimulus pair 2 and found similar results: animals had a consistent, modest amount of post-error slowing but it did not change across sessions during learning (Figure 6—figure supplement 5c). We tested for a significant difference in post-error slowing between the last two sessions of stimulus pair 1 and the first two sessions of the completely new stimulus pair 2 and found none (Figure 6—figure supplement 5d) even though there was a large immediate change in error rate. In fact, there was a trend toward a decrease in post-error slowing (and toward post-correct slowing) in the first few sessions of stimulus pair 2. This is consistent with the hypothesis that post-error slowing is an instance of a more general policy of orienting toward infrequent events (Notebaert et al., 2009). As correct trials became more infrequent than error trials when stimulus pair 2 was presented, we observed a trend toward post-correct slowing, as predicted by this interpretation.

Our subjects exhibit a modest, consistent amount of post-error slowing, which could at least partially explain the reaction time differences we see throughout learning. An experiment with transparent stimuli where error rate was constant but reaction times dropped, however, strongly contradicts the account that the rats implement a simple strategy like post-error slowing to modulate their reaction times during learning.

RNN model and LDDM reduction

We consider a recurrent network receiving noisy visual inputs over time. In particular, we imagine that an input layer projects through weighted connections to a single recurrently connected read-out node, and that the weights must be tuned to extract relevant signals in the input. The read-out node activity is compared to a modifiable threshold which governs when a decision terminates. This network model can then be trained via error-corrective gradient descent learning or some other procedure. In the following we derive the average dynamics of learning.

To reduce this network to a DDM with time-dependent SNR, we first note that due to the law of large numbers, activity increments of the read-out node will be Gaussian provided that the distribution of input stimuli has bounded moments. We can thus model the input-to-readout pathway at each time step as a Gaussian input x(t) flowing through a scalar weight u, with noise of variance c_o² added before the signal is sent into an integrating network. Taking the continuum limit, this yields a drift-diffusion process with effective drift rate Ã = Au and noise variance c̃² = u²c_i² + c_o². Here, A parameterizes the perceptual signal, c_i² is the input noise variance (noise in input channels that cannot be rejected), and c_o² is the output noise variance (internal noise in output circuitry). The resulting decision variable ŷ at time T is Gaussian distributed as N(AuTy, u²c_i²T + c_o²T), where y is the correct binary choice. A decision is made when ŷ hits a threshold of ±z.

Within-trial drift-diffusion dynamics

On every trial, therefore, the subject’s behavior is described by a drift-diffusion process, for which the average reward rate as a function of signal to noise and threshold parameters is known (Bogacz et al., 2006). The accuracy and decision time of this scheme are determined by two quantities. First, the SNR

\bar{A} = \left(\frac{\tilde{A}}{\tilde{c}}\right)^2 = \frac{A^2u^2}{u^2c_i^2 + c_o^2},  (42)

and second, the threshold-to-drift ratio z̄ = z/Ã = z/(Au(t)).

We can rewrite the SNR as

\bar{A}(t) = \frac{A^2u(t)^2}{c_i^2u(t)^2 + c_o^2} = \frac{A^2}{c_i^2 + c_o^2/u(t)^2}.  (43)

From this it is clear that, when learning has managed to amplify the input signals such that u(t) → ∞, the asymptotic SNR is simply Ā* = A²/c_i². Further, rearranging to

\bar{A}(t) = \frac{\bar{A}^*}{1 + (c_o^2/c_i^2)/u(t)^2}  (44)

shows that there are in fact just two parameters: the asymptotic achievable SNR Ā* and the output-to-input noise variance ratio c ≡ c_o²/c_i²,

\bar{A}(t) = \frac{\bar{A}^*}{1 + c/u(t)^2}.  (45)

The mean error rate (ER), mean decision time (DT), and mean reward rate (RR) are therefore

ER = \frac{1}{1 + e^{2\bar{z}\bar{A}}}  (46)
DT = \bar{z}\tanh(\bar{z}\bar{A})  (47)
RR = \frac{1 - ER}{DT + D_{tot}},  (48)

where we have suppressed the dependence of A¯ and z¯ on time for clarity. Here, Dtot=T0+ERDerr+(1-ER)Dcorr is the average non-decision task engagement time.

The term z¯A¯ is a measure of the total evidence accrued on average, and is equal to

\bar{z}\bar{A} = \frac{z}{Au(t)}\cdot\frac{\bar{A}^*}{1 + c/u(t)^2}  (49)
= \frac{z\bar{A}^*/A}{u(t) + c/u(t)}.  (50)

Here, for a fixed threshold z, the denominator shows the trade-off for increasing perceptual sensitivity: small u(t) causes errors due to output noise, while large u(t) causes errors due to overly fast integration for the specified threshold level.
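The quantities in Equations 42-48 can be collected in a small helper (a sketch; here Dtot is passed in as a fixed value for simplicity, whereas in the full model it depends on ER):

import numpy as np

def ddm_quantities(u, z, A, c_i, c_o, d_tot):
    # Reduced-model error rate, decision time, and reward rate for a given
    # perceptual sensitivity u and threshold z.
    a_tilde = A * u                                  # effective drift rate
    c2_tilde = u**2 * c_i**2 + c_o**2                # effective noise variance
    a_bar = a_tilde**2 / c2_tilde                    # SNR (Equation 42)
    z_bar = z / a_tilde                              # threshold-to-drift ratio
    er = 1.0 / (1.0 + np.exp(2 * z_bar * a_bar))     # Equation 46
    dt = z_bar * np.tanh(z_bar * a_bar)              # Equation 47
    rr = (1 - er) / (dt + d_tot)                     # Equation 48
    return er, dt, rr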

Across-trial error-corrective learning dynamics

To model learning, we consider that animals adjust perceptual sensitivities u over time in service of minimizing an objective function. In this section we derive the average learning dynamics when the objective is to minimize the error rate. The LDDM can be conceptualized as an ‘outer-loop’ that modifies the SNR of a standard DDM ‘inner-loop’ described in the preceding subsection. If perceptual learning is slow, there is a strong separation of timescales between these two loops. On the timescale of a single trial, the agent’s SNR is approximately constant and evidence accumulation follows a standard DDM, whereas on the timescale of many trials, the specific outcome on any one trial has only a small effect on the network weights w, such that the learning-induced changes are driven by the mean ER and DT.

To derive the mean effect of error-corrective learning updates, we suppose that on each trial the network uses gradient descent on the hinge loss to update its parameters, corresponding to standard practice for supervised neural networks. The hinge loss is

L(u, y) = \max(0,\, 1 - \hat{y}y),  (51)

yielding the gradient descent update.

u[r+1] \leftarrow u[r] - \lambda\frac{\partial L(u[r], y)}{\partial u},  (52)

where λ is the learning rate and r is the trial number.

When the learning rate is small (λ ≪ 1), each trial changes the weights minimally and the overall update is approximately given by the average continuous time dynamics

\frac{du}{dr} = -\lambda\left\langle\frac{\partial L(u, y)}{\partial u}\right\rangle  (53)
= -\lambda\left(\left\langle\frac{\partial L(u, y)}{\partial u}\right\rangle\bigg|_{\text{error}} + \left\langle\frac{\partial L(u, y)}{\partial u}\right\rangle\bigg|_{\text{correct}}\right)  (54)
= -\lambda\,ER\left\langle\frac{\partial L(u, y)}{\partial u}\right\rangle\bigg|_{\text{error}}  (55)
= \lambda\,ER\left\langle y\frac{\partial\hat{y}}{\partial u}\right\rangle\bigg|_{\text{error}},  (56)

where ⟨·⟩ denotes an average over the correct answer y, the inputs, and the output noise. The first step follows from iterated expectation. The second step follows from the fact that the probability of an error is simply the error rate ER, and for correct trials, the derivative of the hinge loss is zero. Next,

\frac{\partial\hat{y}}{\partial u} = \frac{\partial}{\partial u}\left(\sum_{i=0}^{T} ux_i + \eta_i\right)  (57)
= \sum_{i=0}^{T} x_i,  (58)

where T is the time step at which ŷ crosses the decision threshold ±z. Returning to Equation (56),

\lambda\,ER\left\langle y\frac{\partial\hat{y}}{\partial u}\right\rangle\bigg|_{\text{error}} = \lambda\,ER\left\langle y\sum_{i=1}^{T} x_i\right\rangle\bigg|_{\text{error}}.  (59)

Hence, the magnitude of the update depends on the typical total sensory evidence given that an error is made. To calculate this, let x̄_t = ∑_{i=0}^{t} x_i be the total sensory evidence up to time t, and η̄_t = ∑_{i=0}^{t} η_i be the total decision noise up to t. These are independent and normally distributed as

\bar{x}_t \sim \mathcal{N}\!\left(yAt\,dt,\; c_i^2 t\,dt\right)  (60)
\bar{\eta}_t \sim \mathcal{N}\!\left(0,\; c_o^2 t\,dt\right).  (61)

Therefore, we have

\left\langle y\sum_{i=1}^{T} x_i\right\rangle\bigg|_{\text{error}} = \left\langle y\bar{x}_T\right\rangle\big|_{\text{error}}  (62)
= \left\langle y\bar{x}_T \,\middle|\, u\bar{x}_T + \bar{\eta}_T = -yz\right\rangle  (63)
= \left\langle y\bar{x}_T \,\middle|\, u\bar{x}_T/y + \bar{\eta}_T/y = -z\right\rangle.  (64)

These variables are jointly Gaussian. Letting v_1 = yx̄_T and v_2 = ux̄_T/y + η̄_T/y, the means μ_1, μ_2, variances σ_1², σ_2², and covariance Cov(v_1, v_2) of v_1, v_2 given the hitting time T are

\mu_1 = A\,T\,dt  (65)
\mu_2 = uA\,T\,dt  (66)
\sigma_1^2 = c_i^2\,T\,dt  (67)
\sigma_2^2 = u^2c_i^2\,T\,dt + c_o^2\,T\,dt  (68)
\mathrm{Cov}\!\left(y\bar{x}_T,\, u\bar{x}_T/y + \bar{\eta}_T/y\right) = \left\langle y\bar{x}_T\left(u\bar{x}_T/y + \bar{\eta}_T/y\right)\right\rangle - \left\langle y\bar{x}_T\right\rangle\left\langle u\bar{x}_T/y + \bar{\eta}_T/y\right\rangle  (69)
= u\left\langle\bar{x}_T^2\right\rangle - u\left\langle\bar{x}_T\right\rangle^2  (70)
= uc_i^2\,T\,dt.  (71)

The conditional expectation is therefore

\left\langle\left\langle v_1 \,\middle|\, v_2 = -z,\, T\right\rangle\right\rangle_{T|\text{error}} = \left\langle\mu_1 + \frac{\mathrm{Cov}(v_1, v_2)}{\sigma_2^2}\left(-z - \mu_2\right)\right\rangle_{T|\text{error}}  (72)
= \left\langle A\,T\,dt + \frac{uc_i^2}{u^2c_i^2 + c_o^2}\left(-z - uA\,T\,dt\right)\right\rangle_{T|\text{error}}  (73)
= A\,DT - \frac{uc_i^2}{u^2c_i^2 + c_o^2}\left(z + uA\,DT\right),  (74)

where we have used the fact that ⟨T dt⟩_{T|error} = DT, because in the DDM the mean decision time is the same for correct and error trials. Inserting Equation (74) into Equation (59) yields

\frac{d}{dr}u = \lambda\,ER\left(A\,DT - \frac{1}{1 + \frac{c_o^2}{u^2c_i^2}}\left(\frac{z}{u} + A\,DT\right)\right)  (75)
= \lambda\,ER\left(A\,DT - \frac{1}{1 + c/u^2}\left(\frac{z}{u} + A\,DT\right)\right).  (76)

Finally, we switch the units of the time variable from trials to seconds using the relation dt=(DT+Dtot)dr, yielding the dynamics

\tau\frac{d}{dt}u = \frac{ER}{DT + D_{tot}}\left(A\,DT - \frac{1}{1 + c/u^2}\left[\frac{z}{u} + A\,DT\right]\right).  (77)

The above equation describes the dynamics of u under gradient descent learning. We note that here, the dependence of the dynamics on threshold trajectory is contained implicitly in the DT, ER, and Dtot terms.

To obtain equivalent dynamics for the SNR A¯, we have

\tau\frac{d}{dt}\bar{A} = \frac{2A^2c_o^2}{u(t)^3\left(c_i^2 + c_o^2/u(t)^2\right)^2}\frac{d}{dt}u  (78)
= \frac{2c\,\bar{A}^2}{\bar{A}^*u^3}\frac{d}{dt}u.  (79)

Rearranging the definition of A¯ yields

u^2 = \frac{c\bar{A}}{\bar{A}^* - \bar{A}}.  (80)

Inserting (Equation 80) into (Equation 79) and simplifying, we have

τddtA¯=2A¯(A¯)c(1A¯A¯)3/2ddtu (81)
=2A¯(A¯)c(1A¯A¯)3/2ERDT+Dtot(A(DT)11+c/u2[zu+A(DT)]) (82)
=2AA¯(A¯)c(1A¯A¯)5/2ERDT+Dtot[DTlog(1/ER1)A¯(1A¯A¯)2]. (83)

Here, in the second step we have used the fact that Ā = (1 − 2ER)/(2DT) · log((1 − ER)/ER) and Equation (80). Finally, absorbing the drift rate A into the time constant τ̃ = 1/(Aλ), we have the dynamics

\tilde{\tau}\frac{d}{dt}\bar{A} = \frac{2\sqrt{\bar{A}\bar{A}^*}}{\sqrt{c}}\left(1 - \frac{\bar{A}}{\bar{A}^*}\right)^{5/2}\frac{ER}{DT + D_{tot}}\left[DT - \frac{\log(1/ER - 1)}{2\bar{A}^*\left(1 - \bar{A}/\bar{A}^*\right)}\right].  (84)

This equation reveals that the LDDM has four scalar parameters: the asymptotic SNR A¯*, the output-to-input-noise variance ratio c, the initial SNR at time zero A¯(0), and the combined drift rate/learning rate time constant τ~. In addition, it requires the choice of threshold trajectory z(t).
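A forward-Euler sketch of these dynamics under a chosen threshold policy is given below (an illustration, not the simulation code used for the paper; Dtot is held fixed, the example parameter values are only roughly implied by the fits reported under Parameter fitting below, and the learning time constant is arbitrary):

import numpy as np

def simulate_lddm(a_star, c, a0, tau, z_bar_fn, dt_step=1.0, n_steps=200000, d_tot=9.0):
    # Integrate Equation 84: at each step, compute ER and DT from the current SNR
    # and threshold, then update the SNR.
    a_bar = np.empty(n_steps)
    a_bar[0] = a0
    for i in range(1, n_steps):
        a = a_bar[i - 1]
        z_bar = z_bar_fn(i * dt_step, a)
        er = 1.0 / (1.0 + np.exp(2 * z_bar * a))                      # Equation 46
        dt = z_bar * np.tanh(z_bar * a)                               # Equation 47
        drive = dt - np.log(1.0 / er - 1.0) / (2 * a_star * (1 - a / a_star))
        dadt = (2 * np.sqrt(a * a_star) / np.sqrt(c)
                * (1 - a / a_star)**2.5 * er / (dt + d_tot) * drive) / tau
        a_bar[i] = min(max(a + dadt * dt_step, 1e-9), a_star - 1e-9)  # keep SNR in (0, A*)
    return a_bar

# Example with a constant-threshold policy; a_star ~ (A/c_i)^2 ~ 8.8 and c ~ (c_o/c_i)^2 ~ 8.7e3
# follow from the fitted A, c_i, c_o reported below, while tau and the threshold are illustrative.
# trace = simulate_lddm(a_star=8.8, c=8.7e3, a0=1e-3, tau=1e4, z_bar_fn=lambda t, a: 1.0)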

To reveal the basic learning speed/instantaneous reward rate trade-off in this model, we investigate the limit where A¯ is small but finite (low signal-to-noise) and the threshold is small, such that the error rate is near ER=1/2. Then the final term in Equation (84) goes to zero, giving

\tilde{\tau}\frac{d}{dt}\bar{A} \approx \frac{\sqrt{\bar{A}\bar{A}^*}}{\sqrt{c}}\left(1 - \frac{\bar{A}}{\bar{A}^*}\right)^{5/2}\frac{DT}{DT + D_{tot}}  (85)
\propto \frac{DT}{DT + D_{tot}},  (86)

such that learning speed is increasing in DT. By contrast the instantaneous reward rate when ER=1/2 is

RR = \frac{1/2}{DT + D_{tot}},  (87)

which is a decreasing function of DT.

We note that when the perceptual signal is small, DT is determined by the ratio of threshold to diffusion noise. Starting with Equation 47, we rewrite it in terms of threshold, perceptual signal, and noise:

DT = \frac{z}{\tilde{A}}\tanh\!\left(\frac{z}{\tilde{A}}\cdot\frac{\tilde{A}^2}{\tilde{c}^2}\right).  (88)

In the limit in which the perceptual signal is small, applying L’Hôpital’s rule gives:

\lim_{\tilde{A}\to 0} DT = \lim_{\tilde{A}\to 0}\frac{\frac{d}{d\tilde{A}}\left[z\tanh\!\left(\frac{z}{\tilde{c}^2}\tilde{A}\right)\right]}{\frac{d}{d\tilde{A}}\tilde{A}} = \lim_{\tilde{A}\to 0}\frac{\frac{z^2}{\tilde{c}^2}\,\mathrm{sech}^2\!\left(\frac{z}{\tilde{c}^2}\tilde{A}\right)}{1}  (89)

Leaving:

\lim_{\tilde{A}\to 0} DT = \left(\frac{z}{\tilde{c}}\right)^2.  (90)

Thus, a change in DT when perceptual signal is low could be caused by either a changing threshold with fixed diffusion noise, a constant threshold with varying diffusion noise, or a combination thereof, without the immediate ability to tell these apart. In these cases, however, we note that the ratio of threshold to diffusion noise cannot stay constant if DT changes.

Threshold policies

We evaluate several simple threshold policies.

  • The iRR-greedy policy sets z̄=z̄*, the instantaneous reward rate optimal policy, at all times.

  • The constant threshold policy sets z̄ to a fixed constant throughout learning.

  • The iRR-sensitive policy implements a threshold z_s(t) that decays with time constant γ from an initial value z_s(0) = z_0 toward the iRR-optimal threshold,

\gamma\frac{d}{dt}z_s(t) = z(\bar{A}(t)) - z_s(t),  (91)

where γ controls the rate of convergence.

Finally, the global optimal policy optimizes the entire function z¯(t) to maximize total cumulative reward during exposure to the task. To compute the optimal threshold trajectory, we discretize the reduction dynamics in Equation 77 and perform gradient ascent on z¯(t) using automatic differentiation in the PyTorch python package. While this procedure is not guaranteed to find the global optimum (due to potential nonconvexity of the optimization problem), in practice we found highly reliable results from a range of initial conditions and believe that the identified threshold trajectory is near the global optimum.
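A minimal PyTorch sketch of this procedure is given below. It is an illustration under stated assumptions rather than the authors' implementation: the Adam optimizer, the log-parameterization of the threshold (to keep z̄(t) positive), the step sizes, and the fixed Dtot are all choices made here for concreteness.

import torch

def optimize_threshold_trajectory(a_star, c, a0, tau, d_tot, n_steps=500,
                                  dt_step=10.0, lr=0.05, n_iter=300):
    # Gradient ascent on a discretized threshold trajectory z_bar(t) to maximize
    # total cumulative reward under the discretized reduction dynamics.
    log_z = torch.zeros(n_steps, requires_grad=True)      # z_bar(t) = exp(log_z) > 0
    opt = torch.optim.Adam([log_z], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        z_bar = torch.exp(log_z)
        a = torch.tensor(a0)
        total_reward = torch.tensor(0.0)
        for i in range(n_steps):
            er = 1.0 / (1.0 + torch.exp(2 * z_bar[i] * a))
            dt = z_bar[i] * torch.tanh(z_bar[i] * a)
            total_reward = total_reward + (1 - er) / (dt + d_tot) * dt_step
            drive = dt - torch.log(1.0 / er - 1.0) / (2 * a_star * (1 - a / a_star))
            dadt = (2 * torch.sqrt(a * a_star) / c**0.5
                    * (1 - a / a_star)**2.5 * er / (dt + d_tot) * drive) / tau
            a = torch.clamp(a + dadt * dt_step, min=1e-9, max=a_star - 1e-6)
        (-total_reward).backward()                        # ascend on cumulative reward
        opt.step()
    return torch.exp(log_z).detach()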

Parameter fitting

The LDDM has several parameters governing its performance, including the asymptotic optimal SNR, the output/input noise variance ratio, the learning rate, and parameters controlling threshold policies where applicable. To fit these, we discretized the reduction dynamics and performed gradient ascent on the log likelihood of the observed data under the LDDM, again using automatic differentiation in the PyTorch python package. Because our model is highly simplified, our goal was only to place the parameters in a reasonable regime rather than obtain quantitative fits. We note that our fitting procedure could become stuck in local minima, and that a range of other parameter settings might also be consistent with the data. The best fitting parameters we obtained and used in all model results were A=0.9542,ci=0.3216,co=30,u0=0.0001. We used a discretization time step of dt=160. For the constant threshold and iRR-sensitive policies, the best fitting initial threshold was z(0)=30. For the iRR-sensitive policy, the best fitting decay rate was γ=0.00011891.

Acknowledgements

We thank Chris Baldassano, Christopher Summerfield, Rahul Bhui, Grigori Guitchounts, Laura Bustamante, Sebastian Musslick, and Jonathan D Cohen for useful discussions. We thank Joshua Breedon for summer assistance in developing faster animal training procedures. We thank Ed Soucy and the NeuroTechnology Core for help with improvements to the behavioral response rigs. This work was supported by the Richard A and Susan F Smith Family Foundation and IARPA contract # D16PC00002. JM was supported by the Harvard Brain Science Initiative (HBI) and the Department of Molecular and Cellular Biology at Harvard, and a Presidential Postdoctoral Research Fellowship at Princeton. AMS was supported by a Swartz Postdoctoral Fellowship in Theoretical Neuroscience and a Sir Henry Dale Fellowship from the Wellcome Trust and Royal Society (Grant Number 216386/Z/19/Z).

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. For the purpose of Open Access, the authors have applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

Contributor Information

Javier Masís, Email: jmasis@princeton.edu.

Andrew M Saxe, Email: a.saxe@ucl.ac.uk.

Valentin Wyart, École normale supérieure, PSL University, INSERM, France.

Michael J Frank, Brown University, United States.

Funding Information

This paper was supported by the following grants:

  • Intelligence Advanced Research Projects Activity D16PC00002 to David D Cox.

  • Richard and Susan Smith Family Foundation to David D Cox.

  • Harvard University Harvard Brain Science Initiative to Javier Masís.

  • Princeton University Presidential Postdoctoral Research Fellowship to Javier Masís.

  • Royal Society Sir Henry Dale Fellowship (216386/Z/19/Z) to Andrew M Saxe.

  • Wellcome Trust Sir Henry Dale Fellowship (216386/Z/19/Z) to Andrew M Saxe.

  • Swartz Foundation Swartz Postdoctoral Fellowship in Theoretical Neuroscience to Andrew M Saxe.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Writing – original draft, Writing – review and editing, Conceived the work.

Software, Methodology, Aided JM in establishing initial operant training procedures and behavioral analysis.

Resources, Methodology, Designed the behavioral response rigs.

Resources, Supervision, Funding acquisition, Methodology, Provided input to experimental design and acquired funding for the project.

Conceptualization, Formal analysis, Methodology, Project administration, Software, Supervision, Visualization, Writing – original draft, Writing – review and editing, Conceived the work.

Ethics

All care and experimental manipulation of animals were reviewed and approved by the Harvard Institutional Animal Care and Use Committee (IACUC), protocol 27-22.

Additional files

Transparent reporting form

Data availability

The data and code are freely available at https://github.com/jmasis/strategiclearning_and_lddm (copy archived at swh:1:rev:26aa21d1e830657896325b1a26e9b84f5e3be93d).

References

  1. Abraham NM, Spors H, Carleton A, Margrie TW, Kuner T, Schaefer AT. Maintaining accuracy at the expense of speed: stimulus similarity defines odor discrimination time in mice. Neuron. 2004;44:865–876. doi: 10.1016/j.neuron.2004.11.017.
  2. Akrami A, Kopec CD, Diamond ME, Brody CD. Posterior parietal cortex represents sensory history and mediates its effects on behaviour. Nature. 2018;554:368–372. doi: 10.1038/nature25510.
  3. Balci F, Freestone D, Simen P, Desouza L, Cohen JD, Holmes P. Optimal temporal risk assessment. Frontiers in Integrative Neuroscience. 2011a;5:56. doi: 10.3389/fnint.2011.00056.
  4. Balci F, Simen P, Niyogi R, Saxe A, Hughes JA, Holmes P, Cohen JD. Acquisition of decision making criteria: reward rate ultimately beats accuracy. Attention, Perception & Psychophysics. 2011b;73:640–657. doi: 10.3758/s13414-010-0049-7.
  5. Beck JM, Ma WJ, Kiani R, Hanks T, Churchland AK, Roitman J, Shadlen MN, Latham PE, Pouget A. Probabilistic population codes for bayesian decision making. Neuron. 2008;60:1142–1152. doi: 10.1016/j.neuron.2008.09.021.
  6. Bejjanki VR, Beck JM, Lu ZL, Pouget A. Perceptual learning as improved probabilistic inference in early sensory areas. Nature Neuroscience. 2011;14:642–648. doi: 10.1038/nn.2796.
  7. Bhui R, Lai L, Gershman SJ. Resource-rational decision making. Current Opinion in Behavioral Sciences. 2021;41:15–21. doi: 10.1016/j.cobeha.2021.02.015.
  8. Blokland A. Reaction time responding in rats. Neuroscience and Biobehavioral Reviews. 1998;22:847–864. doi: 10.1016/s0149-7634(98)00013-x.
  9. Bogacz R, Brown E, Moehlis J, Holmes P, Cohen JD. The physics of optimal decision making: a formal analysis of models of performance in two-alternative forced-choice tasks. Psychological Review. 2006;113:700–765. doi: 10.1037/0033-295X.113.4.700.
  10. Bogacz R, Hu PT, Holmes PJ, Cohen JD. Do humans produce the speed-accuracy trade-off that maximizes reward rate? Quarterly Journal of Experimental Psychology. 2010;63:863–891. doi: 10.1080/17470210903091643.
  11. Brown SD, Heathcote A. The simplest complete model of choice response time: linear ballistic accumulation. Cognitive Psychology. 2008;57:153–178. doi: 10.1016/j.cogpsych.2007.12.002.
  12. Brunton BW, Botvinick MM, Brody CD. Rats and humans can optimally accumulate evidence for decision-making. Science. 2013;340:95–98. doi: 10.1126/science.1233912.
  13. Busse L, Ayaz A, Dhruv NT, Katzner S, Saleem AB, Schölvinck ML, Zaharia AD, Carandini M. The detection of visual contrast in the behaving mouse. The Journal of Neuroscience. 2011;31:11351–11361. doi: 10.1523/JNEUROSCI.6689-10.2011.
  14. Cisek P, Puskas GA, El-Murr S. Decisions in changing conditions: the urgency-gating model. The Journal of Neuroscience. 2009;29:11560–11571. doi: 10.1523/JNEUROSCI.1844-09.2009.
  15. Cohen JD, Dunbar K, McClelland JL. On the control of automatic processes: a parallel distributed processing account of the stroop effect. Psychological Review. 1990;97:332–361. doi: 10.1037/0033-295x.97.3.332.
  16. Cohen JD, Egner T. The Wiley Handbook of Cognitive Control. Chapter Cognitive Control: Core Constructs and Current Considerations. John Wiley & Sons, Ltd; 2017.
  17. Cox DD. Do we understand high-level vision? Current Opinion in Neurobiology. 2014;25:187–193. doi: 10.1016/j.conb.2014.01.016.
  18. Deneve S. Making decisions with unknown sensory reliability. Frontiers in Neuroscience. 2012;6:75. doi: 10.3389/fnins.2012.00075.
  19. Ditterich J. Evidence for time-variant decision making. The European Journal of Neuroscience. 2006;24:3628–3641. doi: 10.1111/j.1460-9568.2006.05221.x.
  20. Dixon ML, Christoff K, de Lange FP. The decision to engage cognitive control is driven by expected reward-value: neural and behavioral evidence. PLOS ONE. 2012;7:e51637. doi: 10.1371/journal.pone.0051637.
  21. Drugowitsch J, Moreno-Bote R, Churchland AK, Shadlen MN, Pouget A. The cost of accumulating evidence in perceptual decision making. The Journal of Neuroscience. 2012;32:3612–3628. doi: 10.1523/JNEUROSCI.4010-11.2012.
  22. Drugowitsch J, DeAngelis GC, Klier EM, Angelaki DE, Pouget A. Optimal multisensory decision-making in a reaction-time task. eLife. 2014;3:e03005. doi: 10.7554/eLife.03005.
  23. Drugowitsch J, DeAngelis GC, Angelaki DE, Pouget A. Tuning the speed-accuracy trade-off to maximize reward rate in multisensory decision-making. eLife. 2015;4:e06678. doi: 10.7554/eLife.06678.
  24. Drugowitsch J, Mendonça AG, Mainen ZF, Pouget A. Learning optimal decisions with confidence. PNAS. 2019;116:24872–24880. doi: 10.1073/pnas.1906787116.
  25. Dutilh G, Vandekerckhove J, Tuerlinckx F, Wagenmakers EJ. A diffusion model decomposition of the practice effect. Psychonomic Bulletin & Review. 2009;16:1026–1036. doi: 10.3758/16.6.1026.
  26. Fard PR, Park H, Warkentin A, Kiebel SJ, Bitzer S. A bayesian reformulation of the extended drift-diffusion model in perceptual decision making. Frontiers in Computational Neuroscience. 2017;11:29. doi: 10.3389/fncom.2017.00029.
  27. Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning (ICML); 2017. pp. 1126–1135.
  28. Garrett HE. A Study of the Relation of Accuracy to Speed. Columbia University; 1922.
  29. Gershman SJ, Horvitz EJ, Tenenbaum JB. Computational rationality: a converging paradigm for intelligence in brains, minds, and machines. Science. 2015;349:273–278. doi: 10.1126/science.aac6076.
  30. Gigerenzer G. Why heuristics work. Perspectives on Psychological Science. 2008;3:20–29. doi: 10.1111/j.1745-6916.2008.00058.x.
  31. Gold JI, Shadlen MN. Banburismus and the brain: decoding the relationship between sensory stimuli, decisions, and reward. Neuron. 2002;36:299–308. doi: 10.1016/s0896-6273(02)00971-6.
  32. Gold JI, Shadlen MN. The neural basis of decision making. Annual Review of Neuroscience. 2007;30:535–574. doi: 10.1146/annurev.neuro.29.051605.113038.
  33. Gottlieb J, Oudeyer PY. Towards a neuroscience of active sampling and curiosity. Nature Reviews. Neuroscience. 2018;19:758–770. doi: 10.1038/s41583-018-0078-0.
  34. Griffiths TL, Lieder F, Goodman ND. Rational use of cognitive resources: levels of analysis between the computational and the algorithmic. Topics in Cognitive Science. 2015;7:217–229. doi: 10.1111/tops.12142.
  35. Hanks TD, Mazurek ME, Kiani R, Hopp E, Shadlen MN. Elapsed decision time affects the weighting of prior probability in a perceptual decision task. The Journal of Neuroscience. 2011;31:6339–6352. doi: 10.1523/JNEUROSCI.5613-10.2011.
  36. Heathcote A, Brown S, Mewhort DJ. The power law repealed: the case for an exponential law of practice. Psychonomic Bulletin & Review. 2000;7:185–207. doi: 10.3758/bf03212979.
  37. Heekeren HR, Marrett S, Bandettini PA, Ungerleider LG. A general mechanism for perceptual decision-making in the human brain. Nature. 2004;431:859–862. doi: 10.1038/nature02966.
  38. Heekeren HR, Marrett S, Ungerleider LG. The neural systems that mediate human perceptual decision making. Nature Reviews. Neuroscience. 2008;9:467–479. doi: 10.1038/nrn2374.
  39. Heitz RP, Schall JD. Neural mechanisms of speed-accuracy tradeoff. Neuron. 2012;76:616–628. doi: 10.1016/j.neuron.2012.08.030.
  40. Heitz RP. The speed-accuracy tradeoff: history, physiology, methodology, and behavior. Frontiers in Neuroscience. 2014;8:150. doi: 10.3389/fnins.2014.00150.
  41. Henmon VAC. The relation of the time of a judgment to its accuracy. Psychological Review. 1911;18:186–201. doi: 10.1037/h0074579.
  42. Holmes P, Cohen JD. Optimality and some of its discontents: successes and shortcomings of existing models for binary decisions. Topics in Cognitive Science. 2014;6:258–278. doi: 10.1111/tops.12084.
  43. Jahn CI, Grohn J, Cuell S, Emberton A, Bouret S, Walton ME, Kolling N, Sallet J. Strategic Exploration in the Macaque’s Prefrontal Cortex. bioRxiv. 2022. doi: 10.1101/2022.05.11.491468.
  44. Juechems K, Balaguer J, Spitzer B, Summerfield C. Optimal utility and probability functions for agents with finite computational precision. PNAS. 2021;118:e2002232118. doi: 10.1073/pnas.2002232118.
  45. Kahneman D, Tversky A. Prospect theory: an analysis of decision under risk. Econometrica. 1979;47:263. doi: 10.2307/1914185.
  46. Kepecs A, Uchida N, Zariwala HA, Mainen ZF. Neural correlates, computation and behavioural impact of decision confidence. Nature. 2008;455:227–231. doi: 10.1038/nature07200.
  47. Kool W, McGuire JT, Rosen ZB, Botvinick MM. Decision making and the avoidance of cognitive demand. Journal of Experimental Psychology. General. 2010;139:665–682. doi: 10.1037/a0020198.
  48. Kool W, Botvinick M. Mental labour. Nature Human Behaviour. 2018;2:899–908. doi: 10.1038/s41562-018-0401-9.
  49. Krajbich I, Rangel A. Multialternative drift-diffusion model predicts the relationship between visual fixations and choice in value-based decisions. PNAS. 2011;108:13852–13857. doi: 10.1073/pnas.1101328108.
  50. Krebs RM, Boehler CN, Woldorff MG. The influence of reward associations on conflict processing in the stroop task. Cognition. 2010;117:341–347. doi: 10.1016/j.cognition.2010.08.018.
  51. Kurylo D, Lin C, Ergun T. Visual discrimination accuracy across reaction time in rats. Animal Behavior and Cognition. 2020;7:23–38. doi: 10.26451/abc.07.01.03.2020.
  52. Lak A, Costa GM, Romberg E, Koulakov AA, Mainen ZF, Kepecs A. Orbitofrontal cortex is required for optimal waiting based on decision confidence. Neuron. 2014;84:190–201. doi: 10.1016/j.neuron.2014.08.039.
  53. Lak A, Okun M, Moss M, Gurnani H, Wells MJ, Reddy CB, Harris KD, Carandini M. Dopaminergic and Frontal Signals for Decisions Guided by Sensory Evidence and Reward Value. bioRxiv. 2018. doi: 10.1101/411413.
  54. Law CT, Gold JI. Reinforcement learning can account for associative and perceptual learning on a visual-decision task. Nature Neuroscience. 2009;12:655–663. doi: 10.1038/nn.2304.
  55. Leibo JZ, d’Autume C, Zoran D, Amos D, Beattie C, Anderson K, García Castañedo A, Sanchez M, Green S, Gruslys A, Legg S, Hassabis D, Botvinick MM. Psychlab: A Psychology Laboratory for Deep Reinforcement Learning Agents. arXiv. 2018. https://arxiv.org/abs/1801.08116
  56. Leng X, Yee D, Ritz H, Shenhav A. Dissociable influences of reward and punishment on adaptive cognitive control. PLOS Computational Biology. 2021;17:e1009737. doi: 10.1371/journal.pcbi.1009737.
  57. Lewis RL, Howes A, Singh S. Computational rationality: linking mechanism and behavior through bounded utility maximization. Topics in Cognitive Science. 2014;6:279–311. doi: 10.1111/tops.12086.
  58. Lieder F, Shenhav A, Musslick S, Griffiths TL. Rational metareasoning and the plasticity of cognitive control. PLOS Computational Biology. 2018;14:e1006043. doi: 10.1371/journal.pcbi.1006043.
  59. Lillicrap TP, Cownden D, Tweed DB, Akerman CJ. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications. 2016;7:13276. doi: 10.1038/ncomms13276.
  60. Liu CC, Watanabe T. Accounting for speed-accuracy tradeoff in perceptual learning. Vision Research. 2012;61:107–114. doi: 10.1016/j.visres.2011.09.007.
  61. Logan GD. Shapes of reaction-time distributions and shapes of learning curves: a test of the instance theory of automaticity. Journal of Experimental Psychology. Learning, Memory, and Cognition. 1992;18:883–914. doi: 10.1037//0278-7393.18.5.883. [DOI] [PubMed] [Google Scholar]
  62. Ma WJ, Beck JM, Latham PE, Pouget A. Bayesian inference with probabilistic population codes. Nature Neuroscience. 2006;9:1432–1438. doi: 10.1038/nn1790. [DOI] [PubMed] [Google Scholar]
  63. Maddox WT, Bohil CJ. Base-rate and payoff effects in multidimensional perceptual categorization. Journal of Experimental Psychology. Learning, Memory, and Cognition. 1998;24:1459–1482. doi: 10.1037//0278-7393.24.6.1459. [DOI] [PubMed] [Google Scholar]
  64. Manohar SG, Chong TTJ, Apps MAJ, Batla A, Stamelou M, Jarman PR, Bhatia KP, Husain M. Reward pays the cost of noise reduction in motor and cognitive control. Current Biology. 2015;25:1707–1716. doi: 10.1016/j.cub.2015.05.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Masís JA, Musslick S, Cohen J. The value of learning and cognitive control allocation. Proceedings of the Annual Meeting of the Cognitive Science Society; 2021. [Google Scholar]
  66. Mazurek ME, Roitman JD, Ditterich J, Shadlen MN. A role for neural integrators in perceptual decision making. Cerebral Cortex. 2003;13:1257–1269. doi: 10.1093/cercor/bhg097. [DOI] [PubMed] [Google Scholar]
  67. Mendonça AG, Drugowitsch J, Inês Vicente M, DeWitt E, Pouget A, Mainen ZF. The Impact of Learning on Perceptual Decisions and Its Implication for Speed-Accuracy Tradeoffs. bioRxiv. 2018 doi: 10.1101/501858. [DOI] [PMC free article] [PubMed]
  68. Metcalfe J. Metacognitive judgments and control of study. Current Directions in Psychological Science. 2009;18:159–163. doi: 10.1111/j.1467-8721.2009.01628.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K. Asynchronous methods for deep reinforcement learning. International conference on machine learning; 2016. pp. 1928–1937. [Google Scholar]
  70. Newell A, Rosenbloom PS. Cognitive Skills and Their Acquisition. Chapter Mechanisms of Skill Acquisition and the Law of Practice. Erlbaum; 1981. [Google Scholar]
  71. Niyogi RK, Breton YA, Solomon RB, Conover K, Shizgal P, Dayan P. Optimal indolence: a normative microscopic approach to work and leisure. Journal of the Royal Society, Interface. 2014a;11:20130969. doi: 10.1098/rsif.2013.0969. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Niyogi RK, Shizgal P, Dayan P. Some work and some play: microscopic and macroscopic approaches to labor and leisure. PLOS Computational Biology. 2014b;10:e1003894. doi: 10.1371/journal.pcbi.1003894. [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Notebaert W, Houtman F, Opstal FV, Gevers W, Fias W, Verguts T. Post-error slowing: an orienting account. Cognition. 2009;111:275–279. doi: 10.1016/j.cognition.2009.02.002. [DOI] [PubMed] [Google Scholar]
  74. Odoemene O, Pisupati S, Nguyen H, Churchland AK. Visual evidence accumulation guides decision-making in unrestrained mice. The Journal of Neuroscience. 2018;38:10143–10155. doi: 10.1523/JNEUROSCI.3478-17.2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Pachella RG. Human Information Processing: Tutorials in Performance and Cognition. Chapter The Interpretation of Reaction Time in Information Processing Research. Erlbaum; 1974. [Google Scholar]
  76. Padmala S, Pessoa L. Reward reduces conflict by enhancing attentional control and biasing visual cortical processing. Journal of Cognitive Neuroscience. 2011;23:3419–3432. doi: 10.1162/jocn_a_00011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  77. Petrov AA, Van Horn NM, Ratcliff R. Dissociable perceptual-learning mechanisms revealed by diffusion-model analysis. Psychonomic Bulletin & Review. 2011;18:490–497. doi: 10.3758/s13423-011-0079-8. [DOI] [PubMed] [Google Scholar]
  78. Pew RW. The speed-accuracy operating characteristic. Acta Psychologica. 1969;30:16–26. doi: 10.1016/0001-6918(69)90035-3. [DOI] [Google Scholar]
  79. Pinto L, Koay SA, Engelhard B, Yoon AM, Deverett B, Thiberge SY, Witten IB, Tank DW, Brody CD. An accumulation-of-evidence task using visual pulses for mice navigating in virtual reality. Frontiers in Behavioral Neuroscience. 2018;12:36. doi: 10.3389/fnbeh.2018.00036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  80. Posner MI, Snyder CRR. Information Processing and Cognition: The Loyola Symposium. Chapter Attention and Cognitive Control. Erlbaum; 1975. [Google Scholar]
  81. Purcell BA, Heitz RP, Cohen JY, Schall JD, Logan GD, Palmeri TJ. Neurally constrained modeling of perceptual decision making. Psychological Review. 2010;117:1113–1143. doi: 10.1037/a0020311. [DOI] [PMC free article] [PubMed] [Google Scholar]
  82. Rahnev D, Denison RN. Suboptimality in perceptual decision making. The Behavioral and Brain Sciences. 2018;41:e223. doi: 10.1017/S0140525X18000936. [DOI] [PMC free article] [PubMed] [Google Scholar]
  83. Ratcliff R. A theory of memory retrieval. Psychological Review. 1978;85:59–108. doi: 10.1037/0033-295X.85.2.59. [DOI] [Google Scholar]
  84. Ratcliff R. Group reaction time distributions and an analysis of distribution statistics. Psychological Bulletin. 1979;86:446–461. doi: 10.1037/0033-2909.86.3.446. [DOI] [PubMed] [Google Scholar]
  85. Ratcliff R, Rouder JN. Modeling response times for two-choice decisions. Psychological Science. 1998;9:347–356. doi: 10.1111/1467-9280.00067. [DOI] [Google Scholar]
  86. Ratcliff R, Thapar A, McKoon G. Aging, practice, and perceptual tasks: a diffusion model analysis. Psychology and Aging. 2006;21:353–371. doi: 10.1037/0882-7974.21.2.353. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Ratcliff R, McKoon G. The diffusion decision model: theory and data for two-choice decision tasks. Neural Computation. 2008;20:873–922. doi: 10.1162/neco.2008.12-06-420. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Reinagel P. Speed and accuracy of visual image discrimination by rats. Frontiers in Neural Circuits. 2013a;7:200. doi: 10.3389/fncir.2013.00200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  89. Reinagel P. Speed and accuracy of visual motion discrimination by rats. PLOS ONE. 2013b;8:e68505. doi: 10.1371/journal.pone.0068505. [DOI] [PMC free article] [PubMed] [Google Scholar]
  90. Richards BA, Lillicrap TP, Beaudoin P, Bengio Y, Bogacz R, Christensen A, Clopath C, Costa RP, de Berker A, Ganguli S, Gillon CJ, Hafner D, Kepecs A, Kriegeskorte N, Latham P, Lindsay GW, Miller KD, Naud R, Pack CC, Poirazi P, Roelfsema P, Sacramento J, Saxe A, Scellier B, Schapiro AC, Senn W, Wayne G, Yamins D, Zenke F, Zylberberg J, Therien D, Kording KP. A deep learning framework for neuroscience. Nature Neuroscience. 2019;22:1761–1770. doi: 10.1038/s41593-019-0520-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  91. Rinberg D, Koulakov A, Gelperin A. Speed-Accuracy tradeoff in olfaction. Neuron. 2006;51:351–358. doi: 10.1016/j.neuron.2006.07.013. [DOI] [PubMed] [Google Scholar]
  92. Roitman JD, Shadlen MN. Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. The Journal of Neuroscience. 2002;22:9475–9489. doi: 10.1523/JNEUROSCI.22-21-09475.2002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  93. Rouder JN, Speckman PL. An evaluation of the vincentizing method of forming group-level response time distributions. Psychonomic Bulletin & Review. 2004;11:419–427. doi: 10.3758/bf03196589. [DOI] [PubMed] [Google Scholar]
  94. Roy NA, Bak JH, Laboratory TIB, Akrami A, Brody CD, Pillow JW. Extracting the dynamics of behavior in sensory decision-making experiments. Neuron. 2021;109:597–610. doi: 10.1016/j.neuron.2020.12.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  95. Russell SJ, Subramanian D. Provably bounded-optimal agents. Journal of Artificial Intelligence Research. 1994;2:575–609. doi: 10.1613/jair.133. [DOI] [Google Scholar]
  96. Ruthruff E. A test of the deadline model for speed-accuracy tradeoffs. Perception & Psychophysics. 1996;58:56–64. doi: 10.3758/BF03205475. [DOI] [PubMed] [Google Scholar]
  97. Saxe A, Nelli S, Summerfield C. If deep learning is the answer, what is the question? Nature Reviews. Neuroscience. 2021;22:55–67. doi: 10.1038/s41583-020-00395-8. [DOI] [PubMed] [Google Scholar]
  98. Scott BB, Constantinople CM, Erlich JC, Tank DW, Brody CD. Sources of noise during accumulation of evidence in unrestrained and voluntarily head-restrained rats. eLife. 2015;4:e11308. doi: 10.7554/eLife.11308. [DOI] [PMC free article] [PubMed] [Google Scholar]
  99. Shenhav A, Botvinick MM, Cohen JD. The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron. 2013;79:217–240. doi: 10.1016/j.neuron.2013.07.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  100. Shenhav A, Musslick S, Lieder F, Kool W, Griffiths TL, Cohen JD, Botvinick MM. Toward a rational and mechanistic account of mental effort. Annual Review of Neuroscience. 2017;40:99–124. doi: 10.1146/annurev-neuro-072116-031526. [DOI] [PubMed] [Google Scholar]
  101. Shiffrin RM, Schneider W. Controlled and automatic human information processing: II. perceptual learning, automatic attending and a general theory. Psychological Review. 1977;84:127–190. doi: 10.1037/0033-295X.84.2.127. [DOI] [Google Scholar]
  102. Simen P, Cohen JD, Holmes P. Rapid decision threshold modulation by reward rate in a neural network. Neural Networks. 2006;19:1013–1026. doi: 10.1016/j.neunet.2006.05.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
  103. Simen P, Contreras D, Buck C, Hu P, Holmes P, Cohen JD. Reward rate optimization in two-alternative decision making: empirical tests of theoretical predictions. Journal of Experimental Psychology. Human Perception and Performance. 2009;35:1865–1897. doi: 10.1037/a0016926. [DOI] [PMC free article] [PubMed] [Google Scholar]
  104. Starns JJ, Ratcliff R. The effects of aging on the speed-accuracy compromise: boundary optimality in the diffusion model. Psychology and Aging. 2010;25:377–390. doi: 10.1037/a0018022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  105. Stine GM, Zylberberg A, Ditterich J, Shadlen MN. Differentiating between integration and non-integration strategies in perceptual decision making. eLife. 2020;9:e55365. doi: 10.7554/eLife.55365. [DOI] [PMC free article] [PubMed] [Google Scholar]
  106. Summerfield C, Parpart P. Normative Principles for Decision-Making in Natural Environments. PsyArXiv. 2021 doi: 10.1146/annurev-psych-020821-104057. https://psyarxiv.com/s2wvz/ [DOI] [PubMed]
  107. Sweis BM, Abram SV, Schmidt BJ, Seeland KD, MacDonald AW, Thomas MJ, Redish AD. Sensitivity to “sunk costs” in mice, rats, and humans. Science. 2018;361:178–181. doi: 10.1126/science.aar8644. [DOI] [PMC free article] [PubMed] [Google Scholar]
  108. Tajima S, Drugowitsch J, Patel N, Pouget A. Optimal policy for multi-alternative decisions. Nature Neuroscience. 2019;22:1503–1511. doi: 10.1038/s41593-019-0453-9. [DOI] [PubMed] [Google Scholar]
  109. Ten A, Kaushik P, Oudeyer PY, Gottlieb J. Humans monitor learning progress in curiosity-driven exploration. Nature Communications. 2021;12:5972. doi: 10.1038/s41467-021-26196-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  110. Thorndike EL. Educational Psychology, Vol 2: The Psychology of Learning. Teachers College; 1913. [DOI] [Google Scholar]
  111. Uchida N, Mainen ZF. Speed and accuracy of olfactory discrimination in the rat. Nature Neuroscience. 2003;6:1224–1229. doi: 10.1038/nn1142. [DOI] [PubMed] [Google Scholar]
  112. Usher M, McClelland JL. The time course of perceptual choice: the leaky, competing accumulator model. Psychological Review. 2001;108:550–592. doi: 10.1037/0033-295x.108.3.550. [DOI] [PubMed] [Google Scholar]
  113. Vermaercke B, Gerich FJ, Ytebrouck E, Arckens L, Op de Beeck HP, Van den Bergh G. Functional specialization in rat occipital and temporal visual cortex. Journal of Neurophysiology. 2014;112:1963–1983. doi: 10.1152/jn.00737.2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  114. Wang JX, Kurth-Nelson Z, Kumaran D, Tirumala D, Soyer H, Leibo JZ, Hassabis D, Botvinick M. Prefrontal cortex as a meta-reinforcement learning system. Nature Neuroscience. 2018;21:860–868. doi: 10.1038/s41593-018-0147-8. [DOI] [PubMed] [Google Scholar]
  115. Westbrook A, Kester D, Braver TS. What is the subjective cost of cognitive effort? load, trait, and aging effects revealed by economic preference. PLOS ONE. 2013;8:e68210. doi: 10.1371/journal.pone.0068210. [DOI] [PMC free article] [PubMed] [Google Scholar]
  116. Westbrook A, Lamichhane B, Braver T. The subjective value of cognitive effort is encoded by a domain-general valuation network. The Journal of Neuroscience. 2019;39:3934–3947. doi: 10.1523/JNEUROSCI.3071-18.2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  117. Whelan R. Effective analysis of reaction time data. The Psychological Record. 2008;58:475–482. doi: 10.1007/BF03395630. [DOI] [Google Scholar]
  118. Wickelgren WA. Speed-accuracy tradeoff and information processing dynamics. Acta Psychologica. 1977;41:67–85. doi: 10.1016/0001-6918(77)90012-9. [DOI] [Google Scholar]
  119. Wiecki TV, Sofer I, Frank MJ. HDDM: hierarchical Bayesian estimation of the drift-diffusion model in python. Frontiers in Neuroinformatics. 2013;7:14. doi: 10.3389/fninf.2013.00014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  120. Williams RJ. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning. 1992;8:229–256. doi: 10.1007/BF00992696. [DOI] [Google Scholar]
  121. Wilson RC, Geana A, White JM, Ludvig EA, Cohen JD. Humans use directed and random exploration to solve the explore-exploit dilemma. Journal of Experimental Psychology. General. 2014;143:2074–2081. doi: 10.1037/a0038199. [DOI] [PMC free article] [PubMed] [Google Scholar]
  122. Woodworth RS. Accuracy of voluntary movement. The Psychological Review. 1899;3:i–114. doi: 10.1037/h0092992. [DOI] [Google Scholar]
  123. Zacksenhouse M, Bogacz R, Holmes P. Robust versus optimal strategies for two-alternative forced choice tasks. Journal of Mathematical Psychology. 2010;54:230–246. doi: 10.1016/j.jmp.2009.12.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  124. Zhang J, Rowe JB. Dissociable mechanisms of speed-accuracy tradeoff during visual perceptual learning are revealed by a hierarchical drift-diffusion model. Frontiers in Neuroscience. 2014;8:69. doi: 10.3389/fnins.2014.00069. [DOI] [PMC free article] [PubMed] [Google Scholar]
  125. Zoccolan D, Oertelt N, DiCarlo JJ, Cox DD. A rodent model for the study of invariant visual object recognition. PNAS. 2009;106:8748–8753. doi: 10.1073/pnas.0811583106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  126. Zoltowski DM, Latimer KW, Yates JL, Huk AC, Pillow JW. Discrete stepping and nonlinear ramping dynamics underlie spiking responses of LIP neurons during decision-making. Neuron. 2019;102:1249–1258. doi: 10.1016/j.neuron.2019.04.031. [DOI] [PubMed] [Google Scholar]

Editor's evaluation

Valentin Wyart 1

This manuscript provides a fresh view on the fundamental trade-off between the speed and accuracy of perceptual decision-making. Using computational modeling, the authors establish the important finding that adopting a momentary suboptimal trade-off for maximizing reward rate at the beginning of learning can yield better decisions and larger rewards at later stages. This novel prediction is tested in rodent experiments. The experiments and their detailed analysis provide compelling evidence for the authors' theoretical predictions.

Decision letter

Editor: Valentin Wyart1
Reviewed by: Konstantinos Tsetsos2

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

Thank you for submitting your article "Strategically managing learning during perceptual decision making" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by Valentin Wyart as the Reviewing Editor and Michael Frank as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Konstantinos Tsetsos (Reviewer #1).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Classical descriptions of the speed-accuracy trade-off during perceptual decision-making assume that agents balance decision speed and accuracy given a fixed level of perceptual sensitivity. These descriptions ignore how agents learn to process the incoming sensory information for the purpose of decision-making. This manuscript develops a theory for how this perceptual learning ought to occur, and tests predictions from the theory using rodent experiments. This theory of perceptual learning leads to a new way of understanding suboptimal slow decisions at the early stages of learning. The manuscript is theoretically and technically sound. Additionally, the experiments are ingeniously designed and rigorously analysed, and their results provide empirical support for the theory proposed by the authors. There are however additional analyses that should be performed to validate the authors' specific claims regarding the strategic adaptation of perceptual sensitivity throughout task execution. Furthermore, the manuscript could be improved for clarity.

A current weakness of the manuscript concerns the preference for a strategic adaptation of perceptual sensitivity throughout learning (iRR-sensitive policy) over a simpler gradual increase in perceptual sensitivity (constant-threshold policy). Indeed, the strategies provide qualitatively similar predictions (see, e.g., Figure 3). Validating the iRR-sensitive policy over the constant-threshold policy is critical to the overall conclusions of the work (namely, that rats are strategically adapting their decision times to promote learning). However, it is currently unclear how the constant-threshold policy (in which rats do not control the speed of their decision but benefit from a gradual improvement in perceptual sensitivity) has been conclusively ruled out.

To rule out the constant-threshold policy, the authors argue first that drift-diffusion model (DDM) fits to the data show both drift rate and decision boundary changes throughout learning (Figure S5). However, it is not clear that a concurrent change in drift rate and decision boundary is the most parsimonious explanation of the data. The authors should establish that indeed both parameters change via comparing different versions of the DDM where a subset of the parameters are allowed to vary while others remain fixed throughout learning. Additionally, it would be interesting to know whether the conclusions remain the same when a more 'complete' version of the DDM is used (including drift-rate variability as a free parameter).

The second argument in support of the iRR-sensitive policy comes from experiment 2, in which the authors convincingly show that the improvement in perceptual sensitivity (or SNR) scales with decision times. It is indeed important to show that longer viewing leads to larger SNR improvements. However, it is currently unclear how this observation rules out the constant-threshold policy. Unless additional analyses are performed to show that the constant-threshold policy does not make this prediction, this observation appears necessary but not sufficient to validate the iRR-sensitive policy.

The third argument in support of the iRR-sensitive policy comes from experiment 3, in which a first group of rats performed a 'learnable' perceptual experiment while a second group performed an experiment with 'unlearnable' (transparent) stimuli. Indeed, this second group showed a reduction in reaction times as the experiment progressed, which (a) is reward-rate optimal, and (b) can be understood as a strategic change of decision boundary, since the SNR in this experiment is theoretically zero. However, it is unclear how rodents behave in this 'unlearnable' context. Presumably, during the first two sessions, the rats may be trying to figure out what the task is (e.g., waiting to see if there are visible stimuli in a subset of trials). In the third session, the rats speed up but it is not clear if they keep speeding up later in the experiment. Are reaction times significantly decreasing beyond the third session? Finally, it is not obvious that the effective SNR in this experiment is zero. In this 'unlearnable' experiment, rats may use some non-sensory information (e.g., choice history information such as their preceding response and whether they got rewarded) as input to their drift rate.

Another weakness comes from the use of recurrent neural networks (RNNs) to model the accumulation of decision-relevant evidence. Indeed, these networks are tuned such that they become equivalent to DDMs. Framed in this way, the connection between RNNs and DDMs appears somewhat trivial, such that the introduction of RNNs does not add anything to the manuscript, and might even be confusing to some readers. The authors should either reframe the specific role of RNNs for supporting their key findings, or possibly remove them if they do not provide unique insights beyond classical drift-diffusion modeling.

The detailed description of the models is currently hidden in the Methods section, even though it is essential for understanding their learning dynamics. In particular, the authors assume two sources of noise in the model: one on the input, and one on the accumulator. Learning is achieved by re-scaling the input by an 'input weight'. Increasing the input weight boosts the input signal compared to the accumulation noise, such that the latter can be effectively suppressed by making the input weight large enough. By contrast, the input noise cannot be suppressed by such re-scaling, such that it is this noise that ultimately limits asymptotic performance, and determines the asymptotic SNR. This important constraint is currently not clear after reading the manuscript. The authors should reframe the manuscript to highlight and discuss the model variables that affect the SNR, including the input weight. This would clarify what the authors mean by 'learning' in their theory. The authors initialize the model with a low input weight to reflect that the agent has not yet learned how to interpret the sensory information for the purpose of decision-making. Thus, the input weight is not something that would have a direct mechanistic implementation in the brain. Instead, it is an abstract quantity that describes how well a decision-maker can turn sensory information into a perceptual decision. Once this interpretation of input weight is stated clearly in the manuscript (which it is not in the current version), starting the task with a low input weight makes sense.

Finally, a few choices made by the authors are critical for their findings, but are not sufficiently described in the main text. First, the observed reaction time (RT) is composed of a non-decision time (T0) and a decision time (DT). Experiments allow RT to be measured, but the theory makes predictions about DT, which requires inferring T0. The magnitude of T0 impacts the results, but how it is inferred is currently buried in the Methods section. Second, the rodents sometimes choose to not immediately initiate a new trial (the "voluntary inter-trial interval"). The authors assume that rodents ignore this interval when maximizing reward rate, and find near-optimal reward rates under this assumption. Importantly, including this voluntary inter-trial interval makes reward rates drop significantly. However, these details are again buried in the Methods section, and are neither mentioned nor discussed in the main text.

– The main conclusions hinge upon concurrent changes in both drift rate and decision boundary throughout learning. This change is assessed via fitting HDDM models. It is important that the authors fit and compare the following 3 DDM variants: (a) drift rate changes throughout learning but decision boundary remains fixed, (b) decision boundary changes throughout learning but drift rate remains fixed, (c) both drift rate and decision boundary change throughout learning.

– The HDDM fitting procedure is not fully described. In particular, it is not clear what parameters, aside from the drift rate, the decision boundary, and the non-decision time, varied. For instance, was the drift rate variability a free parameter? It is important to report more precisely the details of the model fits and, if simple variants of the DDM were used, to fit the data using more complex DDMs that include drift rate variability as a free parameter. Additionally, please report in Figure S5 all parameter estimates during learning and not just the drift rate and the decision boundary. Details around Equation (40) should also be in the main text. Furthermore, except for the internal noise, the setup looks very similar to that of Drugowitsch et al. (2019), and the relationship should be discussed somewhere in the main text.

– Currently the evolution of the mean RT during learning is examined. Plotting the change in the RT distribution (averaged/vincentised across participants) can be more informative about changes in the strategy being used than plotting the mean RT alone.

– The results of experiment 3 are interesting, but at the same time the behaviour in the transparent group requires more scrutiny. How do rats behave in this condition? It appears that random choice in combination with heuristic strategies (e.g. win-stay/lose-switch) is a viable possibility. The argument is that the SNR is intrinsically zero in this task. However, the signal in this experiment may contain non-zero, irrelevant information (such as choice history or feedback information). Plotting the RT distributions as a function of learning could provide insight in this regard, because boundary changes and SNR changes manifest differently in the shapes of RT distributions.

– The use of RNNs is not sufficiently supported beyond classical drift-diffusion modeling. The authors should either reframe the specific role of RNNs for supporting their key findings, or possibly remove them if they do not provide unique insights beyond classical drift-diffusion modeling.

– The authors should clarify early on in the manuscript what "learning" means in their model and theory, and why it makes sense to start with a low input weight (after describing the meaning of the input weight). The authors should also explain what T0 and D_RSI are, how they are determined (and the choices made to determine them), and how these parameters impact the results (also related to Figure S13). In particular the relationship between RT and DT is already required to understand Figure 1c, and many plots thereafter.

– Box 1 provides some details of the model, but leaves out others – e.g., the different sources of noise in the model. From Box 1 alone, it is unclear how the asymptotic SNR or the iRR-sensitive threshold are computed. It is indeed nice that it is possible to derive Equation (3), but the equation itself is not particularly informative for the exposition of the main findings, and so it could be moved to the Methods section.

eLife. 2023 Feb 14;12:e64978. doi: 10.7554/eLife.64978.sa2

Author response


The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

1) Classical descriptions of the speed-accuracy trade-off during perceptual decision-making assume that agents balance decision speed and accuracy given a fixed level of perceptual sensitivity. These descriptions ignore how agents learn to process the incoming sensory information for the purpose of decision-making. This manuscript develops a theory for how this perceptual learning ought to occur, and tests predictions from the theory using rodent experiments. This theory of perceptual learning leads to a new way of understanding suboptimal slow decisions at the early stages of learning. The manuscript is theoretically and technically sound. Additionally, the experiments are ingeniously designed and rigorously analysed, and their results provide empirical support for the theory proposed by the authors. There are however additional analyses that should be performed to validate the authors' specific claims regarding the strategic adaptation of perceptual sensitivity throughout task execution. Furthermore, the manuscript could be improved for clarity.

A current weakness of the manuscript concerns the preference for a strategic adaptation of perceptual sensitivity throughout learning (iRR-sensitive policy) over a simpler gradual increase in perceptual sensitivity (constant-threshold policy). Indeed, the strategies provide qualitatively similar predictions (see, e.g., Figure 3). Validating the iRR-sensitive policy over the constant-threshold policy is critical to the overall conclusions of the work (namely, that rats are strategically adapting their decision times to promote learning). However, it is currently unclear how the constant-threshold policy (in which rats do not control the speed of their decision but benefit from a gradual improvement in perceptual sensitivity) has been conclusively ruled out.

To rule out the constant-threshold policy, the authors argue first that drift-diffusion model (DDM) fits to the data show both drift rate and decision boundary changes throughout learning (Figure S5). However, it is not clear that a concurrent change in drift rate and decision boundary is the most parsimonious explanation of the data. The authors should establish that indeed both parameters change via comparing different versions of the DDM where a subset of the parameters are allowed to vary while others remain fixed throughout learning. Additionally, it would be interesting to know whether the conclusions remain the same when a more 'complete' version of the DDM is used (including drift-rate variability as a free parameter).

We have fit the data with several different versions of the DDM where a subset of the parameters are allowed to vary with learning while others remain fixed, including versions with drift rate variability, as requested by the reviewers (Figure 4—figure supplements 9-12). We found that the best model fits were those where both drift rate and threshold were allowed to vary with learning, and the presence of drift rate variability did not seem to noticeably improve model fits (Figure 4—figure supplement 9).

The second argument in support of the iRR-sensitive policy comes from experiment 2, in which the authors convincingly show that the improvement in perceptual sensitivity (or SNR) scales with decision times. It is indeed important to show that longer viewing leads to larger SNR improvements. However, it is currently unclear how this observation rules out the constant-threshold policy. Unless additional analyses are performed to show that the constant-threshold policy does not make this prediction, this observation appears necessary but not sufficient to validate the iRR-sensitive policy.

The reviewers are correct that this experiment does not rule out the constant-threshold policy, and therefore in the text we did not make this claim. The purpose of this experiment was solely to verify that longer viewing times lead to larger SNR improvements (a key prediction of our learning framework regardless of threshold policy).

The third argument in support of the iRR-sensitive policy comes from experiment 3, in which a first group of rats performed a 'learnable' perceptual experiment while a second group performed an experiment with 'unlearnable' (transparent) stimuli. Indeed, this second group showed a reduction in reaction times as the experiment progressed, which (a) is reward-rate optimal, and (b) can be understood as a strategic change of decision boundary, since the SNR in this experiment is theoretically zero. However, it is unclear how rodents behave in this 'unlearnable' context. Presumably, during the first two sessions, the rats may be trying to figure out what the task is (e.g., waiting to see if there are visible stimuli in a subset of trials). In the third session, the rats speed up but it is not clear if they keep speeding up later in the experiment. Are reaction times significantly decreasing beyond the third session? Finally, it is not obvious that the effective SNR in this experiment is zero. In this 'unlearnable' experiment, rats may use some non-sensory information (e.g., choice history information such as their preceding response and whether they got rewarded) as input to their drift rate.

The reviewers make an astute observation that the effective SNR for the group experiencing the transparent stimuli may not be zero due to stimulus-independent information, such as choice and reward history. To test for these strategies, we fit a generalized linear model to trial-by-trial choices for every subject and found that overall there was an increase in the weights associated with these strategies (perseverance and win-stay/lose-switch) (Figure 6—figure supplement 4). HDDM fits for this experiment also indicated that there appeared to be a drift rate greater than 0, indicating a non-zero SNR, perhaps due to these stimulus-independent strategies (Figure 6—figure supplement 3). However, the HDDM fits still found a decrease in threshold, as predicted by our model, and did not find an increase in drift rate (Figure 6—figure supplement 3). These threshold and drift rate trajectories differ from the ones found with the learnable stimuli, indicating that although the rats could be implementing stimulus-independent “monitoring” strategies, their choice of threshold still indicated a reduced belief in stimulus learnability. Overall, we find that the presence of these strategies does not invalidate our conclusions. We also make a terminological distinction: the SNR may refer to the signal in the task that is predictive of the correct answer. On this definition, because the correct answer is random in this experiment, the SNR is zero regardless of any information employed by the animal, as no strategy will attain better than 50% accuracy. Said another way, in the DDM, if drift rate is drift toward the correct answer, then it can only be zero. However, in the way used by the reviewer, SNR could correspond to a more mechanistic account based on the DDM and reflect signals supplied to the integrator even if uncorrelated with the rewarded side on a trial. These new analyses focus on this second definition, but we believe our point about speeding up responses because of a lack of task-relevant information can still be appreciated using the first definition.

Another weakness comes from the use of recurrent neural networks (RNNs) to model the accumulation of decision-relevant evidence. Indeed, these networks are tuned such that they become equivalent to DDMs. Framed in this way, the connection between RNNs and DDMs appears somewhat trivial, such that the introduction of RNNs does not add anything to the manuscript, and might even be confusing to some readers. The authors should either reframe the specific role of RNNs for supporting their key findings, or possibly remove them if they do not provide unique insights beyond classical drift-diffusion modeling.

We thank the reviewer for pointing to this important lack of clarity in the text. The RNN is essential to the analysis, because it is only from the RNN that the learning dynamics for the DDM parameters can be derived. The derivation of the differential equation on the SNR (Equation 8) is a central result of our work, and relies on the RNN interpretation. Starting with the RNN, we derive the effect of doing standard gradient based learning of the underlying weight parameters. We then back out the SNR dynamics in a DDM that exactly correspond to these gradient based updates. With only the DDM, it is not clear what differential equation should govern changes in SNR due to learning. Therefore, the main result that learning speed scales with viewing time is derived from the RNN interpretation and cannot be seen just from a DDM framework. Said another way, while the RNN and DDM are trivially the same for the within-trial dynamics where parameters are constant, by itself the DDM makes no predictions for across-trial learning dynamics, where parameters change as a function of mean performance. For this, the RNN interpretation is required. We note that gradient descent is not invariant to parametrization and so performing gradient descent directly on the SNR variable of a DDM is not equivalent to the SNR dynamics we have derived based on gradient descent on the weights in the underlying RNN. We have expanded the model explanation in the main text considerably, and included the following sentences highlighting the necessity of an RNN for our reduction to a DDM:

“Remarkably, this reduction shows that the high-dimensional dynamics of the RNN receiving stochastic pixel input and performing gradient descent on the weights (Figure 3, grey trace) can be described by a drift diffusion model with a single deterministic scalar variable--the effective SNR--that changes over time (Figure 3, blue trace). Notably, without the mapping to the original recurrent neural network, it is not possible to understand what effect error-corrective gradient descent learning would have at the level of the DDM, or how the learning process is influenced by choice of decision times. In particular, the change in SNR that arises from gradient descent on the underlying RNN weights (Equation 10) is not equivalent to that arising from gradient descent on the SNR parameter in the DDM directly because gradient descent is not parametrization invariant.”

The detailed description of the models is currently hidden in the Methods section, even though it is essential for understanding their learning dynamics. In particular, the authors assume two sources of noise in the model: one on the input, and one on the accumulator. Learning is achieved by re-scaling the input by an 'input weight'. Increasing the input weight boosts the input signal compared to the accumulation noise, such that the latter can be effectively suppressed by making the input weight large enough. By contrast, the input noise cannot be suppressed by such re-scaling, such that it is this noise that ultimately limits asymptotic performance, and determines the asymptotic SNR. This important constraint is currently not clear after reading the manuscript. The authors should reframe the manuscript to highlight and discuss the model variables that affect the SNR, including the input weight. This would clarify what the authors mean by 'learning' in their theory. The authors initialize the model with a low input weight to reflect that the agent has not yet learned how to interpret the sensory information for the purpose of decision-making. Thus, the input weight is not something that would have a direct mechanistic implementation in the brain. Instead, it is an abstract quantity that describes how well a decision-maker can turn sensory information into a perceptual decision. Once this interpretation of input weight is stated clearly in the manuscript (which it is not in the current version), starting the task with a low input weight makes sense.

We have modified the manuscript to explain these components of the model in the main text. In particular, we include sections explaining what “learning” means in the LDDM. First, we now explicitly describe and simulate a recurrent neural network from pixels, which provides a better account of the meaning of the input weights–taken literally, these could be synaptic weights from a population of neurons representing the visual input, to a population of neurons that integrate this signal, and gestures toward a mechanistic implementation in the brain. As the reviewer notes, though, these could in principle also be more abstract, and the model may describe functional connectivity implemented in more complex circuits.

“Within a trial, N-dimensional inputs s(t) ∈ ℝ^N arrive at discrete times t = 1dt, 2dt, … where dt is a small time step parameter. In our experimental task, s(t) might represent the activity of LGN neurons in response to a given visual stimulus. Because of eye motion and noise in the transduction from light intensity to visual activity, the response of individual neurons will only probabilistically relate to the correct answer at any given instant. In our simulations, we take s(t) to be the pixel values of the exact images presented to the animals, but transformed at each time point by small rotations (±20°) and translations (±25% of the image width and height), as depicted in Figure 3a. This input variability over time makes temporal integration valuable even in this visual classification task. To perform this integration, each input s(t) is filtered through perceptual weights w(trial) ∈ ℝ^N and added to a read-out node (decision variable) ŷ(t) along with i.i.d. noise η(t) ∼ N(0, c₀²dt). This integrator noise models internal neural noise. The evolution of the decision variable is given by the simple linear recurrence ŷ(t+dt) = ŷ(t) + w(trial)·s(t) + η(t) until the decision variable hits a threshold ±z(trial) that is constant on each trial. Here the RNN already performs an integration through time (a choice motivated by prior experiments in rodents [37]), and improvements in performance come from adjusting the input-to-integrator weights w(trial) to better extract task-relevant sensory information.”
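To make the recurrence concrete, a minimal simulation sketch in Python/NumPy follows. This is not the manuscript's simulation code: the scalar (rather than pixel-level) input, the parameter values, and the simple error-corrective weight update are illustrative assumptions, included only to show how longer decision times expose the learner to more stimulus samples per trial.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trial(w, s_mean=1.0, dt=0.001, c_in=1.0, c0=0.5, z=1.0, max_steps=5000):
    """Toy version of the within-trial recurrence y(t+dt) = y(t) + w*s(t) + eta(t),
    with a scalar noisy input whose mean encodes the correct side (fixed to + here)."""
    y, seen = 0.0, 0.0
    for step in range(1, max_steps + 1):
        s = s_mean * dt + c_in * np.sqrt(dt) * rng.standard_normal()   # noisy evidence sample
        eta = c0 * np.sqrt(dt) * rng.standard_normal()                 # integrator noise
        y += w * s + eta
        seen += s                                                      # total stimulus exposure
        if abs(y) >= z:
            break
    return (y > 0), step * dt, seen   # (correct choice?, decision time, accumulated input)

# Toy error-corrective update across trials (a stand-in for gradient descent on the
# RNN weights): its magnitude scales with the stimulus accumulated during the trial,
# so longer decision times drive faster learning.
w, lr = 0.05, 0.1
for trial in range(200):
    _, _, seen = run_trial(w)
    w += lr * (1.0 - np.tanh(w * seen)) * seen
print(f"input weight after 200 trials: {w:.2f}")
```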

In the second section, we describe the reduction of this RNN to the LDDM. The input weight in the reduction summarizes the functional effect of many individual weights in the RNN. In this sense, exactly as the reviewer says, the LDDM input weight is a more abstract quantity. However, it summarizes the behavior of the more mechanistically plausible weights of the RNN. We show that simulations of the reduction match those of a real RNN trained from pixels, verifying our methods.

“We start by noting that the input to the decision variable ŷ at each time step is a weighted sum of many random variables, which by the law of large numbers will be approximately Gaussian. […] The dynamics of this “learning DDM” (LDDM) closely tracks simulated trajectories of the full network from pixels (Figure 3c-f, blue trace, Figure 3—figure supplement 1; see Methods).”
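The constraint discussed above (input noise, unlike integrator noise, cannot be suppressed by scaling up the input weight) can be summarized with a back-of-the-envelope version of the reduction. The symbols below (A for the mean evidence per unit time, c_in for input noise, c_0 for integrator noise, w for the scalar input weight) are illustrative labels, not necessarily the manuscript's notation:

```latex
% Increment of the decision variable once the pixel inputs are summarized by a
% scalar drift A dt plus input noise (illustrative notation, not the paper's)
\[
  d\hat{y} \;=\; \underbrace{w A\,dt}_{\text{signal}}
  \;+\; \underbrace{w\,c_{\mathrm{in}}\,dW_{\mathrm{in}}}_{\text{input noise}}
  \;+\; \underbrace{c_{0}\,dW_{0}}_{\text{integrator noise}},
\qquad
  \mathrm{SNR}(w) \;=\; \frac{w^{2}A^{2}}{w^{2}c_{\mathrm{in}}^{2} + c_{0}^{2}}
  \;\xrightarrow{\;w\to\infty\;}\; \frac{A^{2}}{c_{\mathrm{in}}^{2}}.
\]
```

Growing w suppresses the integrator-noise term relative to the signal, but the input-noise term grows with w as well, so asymptotic performance is capped by the input SNR A²/c_in².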

Finally, a few choices made by the authors are critical for their findings, but are not sufficiently described in the main text. First, the observed reaction time (RT) is composed of a non-decision time (T0) and a decision time (DT). Experiments allow RT to be measured, but the theory makes predictions about DT, which requires inferring T0. The magnitude of T0 impacts the results, but how it is inferred is currently buried in the Methods section. Second, the rodents sometimes choose to not immediately initiate a new trial (the "voluntary inter-trial interval"). The authors assume that rodents ignore this interval when maximizing reward rate, and find near-optimal reward rates under this assumption. Importantly, including this voluntary inter-trial interval makes reward rates drop significantly. However, these details are again buried in the Methods section, and are neither mentioned nor discussed in the main text.

We added the following paragraph near the beginning of the Results section describing these parameters:

“Calculating mean normalized DT for comparison with the OPC requires knowing two quantities, DT and the average non-decision time per error trial Derr. The average non-decision time Derr = T0 + D̈err contains the motor and initial perceptual processing components of RT, denoted T0, and the post-response timeout on error trials D̈err. Mean normalized DT is then the ratio DT/Derr. In order to determine each subject’s DT, we estimated T0 through a variety of methods, opting for a biological estimate (measured lickport latency response times and published visual processing latencies; Fig. 1–figure supplement 4). To ensure that our results did not depend on our choice of T0, we ran a sensitivity analysis on a wide range of possible values of T0 (Fig. 1–figure supplement 4f). We then had to determine D̈err, which can contain mandatory and voluntary intertrial intervals. We found that the rats generally kept voluntary intertrial intervals to a minimum, and we interpreted longer intervals as effectively “exiting” the DDM framework (Fig. 1–figure supplement 5). As such, we defined D̈err to only contain mandatory intertrial intervals (see Methods, Fig. 1–figure supplement 6). To ensure that our results did not depend on either choice, we ran a sensitivity analysis on the combined effects of T0 and a D̈err containing voluntary intertrial intervals on RR (Fig. 1–figure supplement 7). A full discussion of how these parameters were determined is included in the Methods.”
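As a worked toy example of these quantities (all numbers below are placeholders, and the reward-rate expression is the standard DDM-style one rather than a quotation of the paper's exact formula):

```python
import numpy as np

# Placeholder values, for illustration only
rt = np.array([0.62, 0.55, 0.71, 0.48, 0.66])   # observed reaction times (s)
accuracy = 0.80                                  # fraction of correct trials
T0 = 0.20                                        # estimated non-decision time (s)
timeout_err = 4.0                                # mandatory post-error timeout (s)

DT = rt.mean() - T0                  # mean decision time
D_err = T0 + timeout_err             # average non-decision time per error trial
norm_DT = DT / D_err                 # mean normalized DT (x-axis of the OPC)

# Reward rate: correct responses per unit time, charging the timeout only on
# error trials and ignoring voluntary inter-trial intervals.
RR = accuracy / (rt.mean() + (1 - accuracy) * timeout_err)
print(f"DT = {DT:.3f} s, normalized DT = {norm_DT:.3f}, RR = {RR:.3f} rewards/s")
```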

– The main conclusions hinge upon concurrent changes in both drift rate and decision boundary throughout learning. This change is assessed via fitting HDDM models. It is important that the authors fit and compare the following 3 DDM variants: (a) drift rate changes throughout learning but decision boundary remains fixed, (b) decision boundary changes throughout learning but drift rate remains fixed, (c) both drift rate and decision boundary change throughout learning.

We fit the 3 DDM variants suggested by the reviewers and found that models that allowed both drift and threshold to vary with learning provided the best fits. We discuss these model variants in the point below.

“Indeed, the best DDM model fits were those that allowed both threshold and drift rate to vary with learning, as was the case with the first stimuli the rats encountered, and in line with the LDDM model (Figure 4—figure supplement 1, Figure 4—figure supplement 2, Figure 4—figure supplement 3, Figure 4—figure supplement 4).”

– The HDDM fitting procedure is not fully described. In particular, it is not clear what parameters, aside from the drift rate, the decision boundary, and the non-decision time, varied. For instance, was the drift rate variability a free parameter? It is important to report more precisely the details of the model fits and, if simple variants of the DDM were used, to fit the data using more complex DDMs that include drift rate variability as a free parameter. Additionally, please report in Figure S5 all parameter estimates during learning and not just the drift rate and the decision boundary. Details around Equation (40) should also be in the main text. Furthermore, except for the internal noise, the setup looks very similar to that of Drugowitsch et al. (2019), and the relationship should be discussed somewhere in the main text.

In our original submission, we used the HDDM package to first fit a simple DDM to 10 sessions of asymptotic behavior after learning stimulus pair 1 (n = 26) as a sanity check that a simple DDM was adequate (Figure 1—figure supplement 3). (We kept this figure unchanged.)

To assess parameter changes during learning, in our original submission we fit separate, simple DDMs to each of the learning epochs of stimulus pair 1, and stimulus pair 2. As these were simple DDMs, drift rate variability was not a free parameter. For stimulus pair 1, the two learning epochs were the (1) first and (2) last 1000 trials for every subject. For stimulus pair 2, the three learning epochs were (1) 500 trials from a baseline session of stimulus pair 1 prior to introduction of stimulus pair 2, the (2) first 500 trials and the (3) last 500 trials of stimulus pair 2. Fewer trials were used because the stimulus pair 2 experiment had fewer sessions and trials overall. In order to succinctly address the modeling requests from the paper reviews, we modified the fitting procedure (described below) and replaced Figure S5 with new Figure 4—figure supplements 1-4, where we report the values of all parameters in every model fit.

The qualitative results were the same across all model fit variations: drift rate increased with learning and threshold decreased, even with the addition of drift rate variability. We report DICs (Figure 4—figure supplement 1) for each fit and arrive at two conclusions. First, models where both drift and threshold varied with learning fit the best, as opposed to models where drift and/or threshold were held constant. Second, more complex models involving drift rate variability did not provide a better fit than a simple DDM, and those that performed the best were those where drift and threshold were already allowed to vary with learning. This means that drift rate variability did not seem to provide a substantial benefit in explaining the data. Moreover, the direction in which drift rate variability changed with learning (increasing or decreasing with learning epoch) seemed to depend on which parameters also varied with learning (drift and/or threshold), leading to differing accounts. However, we cannot rule out that drift rate variability may indeed play a role in learning strategy. We leave the interpretability of such an account to our reviewers and our readers.

We have provided a more complete description of the revised HDDM fitting procedure in the Methods section:

“In order to verify that our behavioral data could be modeled as a drift-diffusion process, the data were fit with a hierarchical drift-diffusion model [121], permitting subsequent analysis (such as comparison to the optimal performance curve) based on the assumption of a drift-diffusion process. To verify that a DDM was appropriate for our data, we fit a simple DDM to 10 asymptotic sessions after learning stimulus pair 1 for n = 26 subjects (Fig. 1–figure supplement 3). In order to assess parameter changes across learning, we fit DDMs to the stimulus pair 1 experiment and the stimulus pair 2 experiment where the learning epochs were treated as conditions in each experiment. This allowed us to hold some parameters constant while conditioning others on learning. We fit both simple DDMs and DDMs with drift rate variability to the two experiments, allowing drift rate, threshold and drift rate variability to vary with learning epoch. In particular, we fit three broad types of models: (1) simple DDMs (Fig. 4–figure supplement 2), (2) DDMs + fixed drift rate variability (Fig. 4–figure supplement 3), and (3) DDMs + drift rate variability that varied freely with learning epoch (Fig. 4–figure supplement 4). For each of the types of models we held drift constant, threshold constant, or allowed both to vary with learning. The best fits, as determined by the deviance information criterion (DIC), came from models where we allowed both drift and threshold to vary with learning; the addition of drift rate variability did not appear to improve model fits (Fig. 4–figure supplement 1). For both learning experiments, drift rates increased and thresholds decreased by the end of learning, in agreement with previous findings [19, 53–57]. In addition, for the transparent stimuli experiment we fit a DDM model that allowed drift rate, threshold, drift rate variability and T0 to vary with learning phase in order to observe the changes in drift rate and threshold (Fig. 6–figure supplement 3).”
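For readers unfamiliar with this kind of model comparison, nested variants like these can be specified compactly with the HDDM Python package. The sketch below is illustrative only: the data file, the 'epoch' condition column, and the sampler settings are hypothetical, and the actual fits used the procedures described above.

```python
import hddm

# Hypothetical CSV with columns: rt, response, subj_idx, epoch (early vs late learning)
data = hddm.load_csv('rat_trials.csv')

variants = {
    'drift only':             hddm.HDDM(data, depends_on={'v': 'epoch'}),
    'threshold only':         hddm.HDDM(data, depends_on={'a': 'epoch'}),
    'drift + threshold':      hddm.HDDM(data, depends_on={'v': 'epoch', 'a': 'epoch'}),
    'drift + threshold + sv': hddm.HDDM(data, depends_on={'v': 'epoch', 'a': 'epoch'},
                                        include=('sv',)),   # drift rate variability
}

for name, model in variants.items():
    model.sample(2000, burn=500)        # short chains, for illustration only
    print(f"{name}: DIC = {model.dic:.1f}")
```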

We included details around Equation 40 in the main text (Within-trial Drift-Diffusion Dynamics subsection).

We now discuss Drugowitsch et al., 2019 at greater length in the discussion. In short, they treat learning in a Bayesian formulation, using primarily simulation-based methods, under the assumption that the diffusion boundaries (thresholds) are constant. By accounting for uncertainty in the weights, their Bayesian learning algorithm is more powerful than ours. The principal difference with our work, however, is the assumption of constant thresholds throughout learning. Our core interest is in how modification of response times (through changes in the decision threshold) impacts long term learning. Our simpler gradient descent learning model (still widely used throughout deep learning) permits analytical computation of average trajectories and updates that reveal the basic tradeoff between viewing time and learning speed in this setting. We believe future work could profitably combine Bayesian learning with threshold trajectories.

“Prior work in the DDM framework has investigated learning dynamics with a Bayesian update and constant thresholds across trials [35]. Our framework uses simpler error corrective learning rules, and focuses on how the decision threshold policy over many trials influences long-term learning dynamics and total reward. Future work could combine these approaches to understand how Bayesian updating on each trial would change long-term learning dynamics, and potentially, the optimality of different threshold strategies.”

– Currently the evolution of the mean RT during learning is examined. Plotting the change in the RT distribution (averaged/vincentised across participants) can be more informative about changes in the strategy being used than plotting the mean RT alone.

We have provided the change in correct RT distributions across learning in Figure 6—figure supplement 2. The change in RT distributions for stimulus pairs 1 and 2 is qualitatively similar (slower RTs get faster, and faster RTs remain about the same), suggesting a similar mechanism (an increase in drift accompanied by a decrease in threshold, which is indeed what we see in our DDM fits).
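For reference, a minimal sketch of how vincentized (quantile-averaged) RT distributions can be computed across subjects; the function and variable names are illustrative, not the analysis code used for the figure.

import numpy as np

def vincentize(rt_per_subject, quantiles=np.arange(0.05, 1.0, 0.05)):
    # rt_per_subject: list of 1-D arrays of correct-trial RTs, one per rat.
    # Returns each RT quantile computed within subject, then averaged
    # across subjects (the vincentized distribution).
    per_subj = np.array([np.quantile(rts, quantiles) for rts in rt_per_subject])
    return quantiles, per_subj.mean(axis=0)

# Comparing vincentized distributions early vs. late in learning shows which
# part of the distribution changes: e.g. whether only the slow tail speeds up
# or the whole distribution shifts.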

– The results of experiment 3 are interesting, but at the same time the behaviour in the transparent group requires more scrutiny. How do rats behave in this condition? It appears that random choice and heuristic strategies (e.g. win-stay/lose-switch) are viable possibilities. The authors argue that the SNR is intrinsically zero in this task. However, the signal in this experiment may contain non-zero, irrelevant information (such as choice history or feedback information). Plotting the RT distributions as a function of learning could provide insight in this regard, because boundary changes and SNR changes manifest differently in the shapes of RT distributions.

We have provided the change in vincentized correct RT distributions as a function of learning for the transparent stimuli experiment in Figure 6—figure supplement 2. The change in the RT distribution for transparent stimuli differs from that of stimulus pairs 1 and 2 (slow and fast RTs all get faster), suggesting a different mechanism (a decrease in threshold with a constant drift rate, as supported by our DDM fits in Figure 6—figure supplement 3). We added the following passage to the text noting this: “Additionally, we considered the rats' entire RT distributions to investigate the effect of learnability beyond RT means. We found that while the RT distributions changed similarly from the beginning to end of learning for the learnable stimuli (stimulus pairs 1 and 2), they differed for the unlearnable (transparent) stimuli, indicating an effect of learnability on the entire RT distributions (Figure 6—figure supplement 2). Hence rodents are capable of modulating their strategy depending on their learning prospects.”

The reviewers raise the point that the rats may be using non-stimulus information in order to sustain an SNR > 0. In fact, HDDM fits of this experiment (Figure 6—figure supplement 3) reveal that although drift rate does decrease with the transparent stimuli, it is still above 0. That being said, we observe a systematic decrease in threshold throughout learning with the transparent stimuli, in agreement with our model. This monotonic decrease in threshold is in contrast to the experiment with new visible stimuli (stimulus pair 2), where we also first observe a decrease in drift, but that decrease in drift is accompanied by an increase in threshold (Figure 4—figure supplements 2–4). After learning, threshold decays back to its baseline value, and drift increases back to its baseline value (Figure 4—figure supplements 2–4).

To investigate whether the rats implemented stimulus-independent heuristic strategies in addition to random choice, we measured left/right bias and quantified the weights of bias, perseverance (choose the same port as the previous trial) and win-stay/lose-switch (choose the port that was correct on the previous trial) in addition to the stimulus presented, using a dynamic generalized linear model (PsyTrack: Roy et al., 2021; Figure 6—figure supplement 4). In general, bias seemed to increase with transparent stimuli in the direction that each individual was already biased during visible stimuli. The weights for perseverance and win-stay/lose-switch also seemed to increase and fluctuate more during transparent stimuli, suggesting a greater reliance on these heuristics now that the stimulus was uninformative (stimulus weights dropped to 0, whereas they were strongly positive or negative during visible stimuli depending on each individual’s left/right stimulus mapping).
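To illustrate the regressors behind this analysis, here is a minimal Python sketch. PsyTrack itself fits time-varying weights under a random-walk prior; as a simple stand-in (not PsyTrack's API), the sketch below fits a logistic regression in sliding windows. The regressor coding, window size, and function name are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def heuristic_weights(stim, choice, correct_side, win=500, step=100):
    # stim, choice, correct_side: per-trial arrays coded in {-1, +1}.
    # Returns sliding-window weights for bias, stimulus, perseverance
    # (previous choice) and win-stay/lose-switch (previously correct side).
    prev_choice = np.r_[0, choice[:-1]]            # perseverance regressor
    prev_correct = np.r_[0, correct_side[:-1]]     # win-stay/lose-switch
    X = np.column_stack([stim, prev_choice, prev_correct])
    y = (choice > 0).astype(int)
    weights = []
    for start in range(0, len(y) - win, step):
        sl = slice(start, start + win)
        clf = LogisticRegression().fit(X[sl], y[sl])
        weights.append(np.r_[clf.intercept_, clf.coef_.ravel()])
    return np.array(weights)   # columns: bias, stimulus, perseverance, WSLS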

These results support the reviewers’ suggestion that the rats may indeed be relying more on stimulus-independent heuristic strategies. Although the rats may be sustaining a non-zero drift rate through stimulus-independent information, this might amount to a “monitoring” of the task in case any changes or reliably informative patterns do arise, rather than a strategy to improve perceptual learning as we argue is the case with the new visible stimuli.

We note that we still observe a monotonic decrease in threshold in our DDM fits, and a non-zero drift rate due to these heuristics does not contravene our conclusions.

Regarding heuristic strategies, we have added the following paragraph to the text:

“Although there is no informative signal in this task with transparent stimuli, the rats could still be using stimulus-independent signals, such as choice history or feedback, to drive heuristic strategies. Indeed, DDM fits indicated a non-zero drift rate even in the absence of informative stimuli (Figure 6—figure supplement 3). To investigate whether the rats implemented stimulus-independent heuristic strategies in addition to random choice, we measured left/right bias and quantified the weights of bias, perseverance (choose the same port as the previous trial) and win-stay/lose-switch (choose the port that was correct on the previous trial) [58]. In general, bias seemed to increase with transparent stimuli in the direction that each individual was already biased during visible stimuli. Perseverance and win-stay/lose-switch also seemed to increase and fluctuate more during transparent stimuli, suggesting a greater reliance on these heuristics now that the stimulus was uninformative (Figure 6—figure supplement 4). Engaging these heuristics may have been a way for the rats to expedite their choices in order to maximize iRR while still 'monitoring' the task for any potentially informative changes or patterns. Although the animals still engaged these non-optimal heuristics, the lack of learnability in the transparent stimuli nevertheless led to a change in strategy that was distinct from that with learnable stimuli.”

– The use of RNNs is not sufficiently supported beyond classical drift-diffusion modeling. The authors should either reframe the specific role of RNNs for supporting their key findings, or possibly remove them if they do not provide unique insights beyond classical drift-diffusion modeling.

Please see the response above to point #1.

– The authors should clarify early on in the manuscript what "learning" means in their model and theory, and why it makes sense to start with a low input weight (after describing the meaning of the input weight). The authors should also explain what T0 and D_RSI are, how they are determined (and the choices made to determine them), and how these parameters impact the results (also related to Figure S13). In particular the relationship between RT and DT is already required to understand Figure 1c, and many plots thereafter.

We added a clearer explanation of what “learning” means in the LDDM, explained T0 in more detail early in the Results section, and simplified our timing terms to remove the confusion around D_RSI.
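For readers unfamiliar with the RT/DT decomposition raised by the reviewers, the relationship is the standard one from the DDM literature (e.g. Bogacz et al., 2006); the expressions below are textbook results for a symmetric diffusion with drift A, noise c, and thresholds ±z started at 0, rather than anything specific to this paper:

\mathrm{RT} = T_0 + \mathrm{DT}, \qquad
\langle \mathrm{DT} \rangle = \frac{z}{A}\,\tanh\!\left(\frac{Az}{c^{2}}\right), \qquad
\mathrm{ER} = \frac{1}{1 + e^{\,2Az/c^{2}}},

where T_0 is the non-decision time (sensory and motor delays), DT the decision time, and ER the error rate.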

Please see the full response to this comment above point #1.

– Box 1 provides some details of the model, but leaves out others – e.g., the different sources of noise in the model. From Box 1 alone, it is unclear how the asymptotic SNR or the iRR-sensitive threshold are computed. It is indeed nice that it is possible to derive Equation (3), but the equation itself is not particularly informative for the exposition of the main findings, and so it could be moved to the Methods section.

We replaced Box 1 with a more complete explanation of the model in the main text (Linear Drift-Diffusion Model (LDDM) section), including information on the asymptotic SNR and the threshold policies. We believe that including Equation 3 in the main text is important to demonstrate that the dynamics of learning in the RNN can be expressed as these specific dynamics for SNR in DDM terms, making it readily applicable to other settings. It is in our view a central result, not a method.

