Author manuscript; available in PMC: 2023 Oct 1.
Published in final edited form as: Behav Neurosci. 2022 Oct;136(5):445–452. doi: 10.1037/bne0000529

Dopamine mediates the bidirectional update of interval timing

Anthony MV Jakob 1,2, John G Mikhael 2,3, Allison E Hamilos 2,3, John A Assad 2,4, Samuel J Gershman 5,6
PMCID: PMC9725808  NIHMSID: NIHMS1851630  PMID: 36222637

Abstract

The role of dopamine as a reward prediction error signal in reinforcement learning tasks has been well established over the past decades. Recent work has shown that the reward prediction error interpretation can also account for the effects of dopamine on interval timing by controlling the speed of subjective time. According to this theory, the timing of the dopamine (DA) signal relative to reward delivery dictates whether subjective time speeds up or slows down: Early DA signals speed up subjective time and late signals slow it down. To test this bidirectional prediction, we reanalyzed measurements of dopaminergic neurons in the substantia nigra pars compacta of mice performing a self-timed movement task. Using the slope of ramping dopamine activity as a read-out of subjective time speed, we found that trial-by-trial changes in the slope could be predicted from the timing of dopamine activity on the previous trial. This result provides a key piece of evidence supporting a unified computational theory of reinforcement learning and interval timing.

Keywords: dopamine, interval timing, temporal difference learning, reward prediction error

Introduction

How does dopamine (DA) influence time perception? This question has been an active subject of debate. While some researchers have found that DA increases the rate at which subjective time progresses (Lake and Meck, 2013; Maricq et al., 1981; Maricq and Church, 1983), others have found the exact opposite effect (Soares et al., 2016). Recent work has developed a coherent framework to explain these phenomena (Mikhael and Gershman, 2019), which relates these timing effects to the role of DA in signaling reward prediction error (RPE; for reviews, see Gershman et al., 2014; Petter et al., 2018).

According to the RPE hypothesis, DA reports the difference between received and expected reward. In a seminal experiment, Schultz et al. (1997) presented monkeys with repeated rewards (after a fixed delay from a cue) and simultaneously recorded from putative DA neurons in the midbrain. The authors found that an unexpected reward elicited a burst of DA neuron activity, but that, when the reward was expected, it no longer elicited DA neuron activity. Furthermore, a reward omission at the time of expected reward elicited a dip in activity. These experimental observations are consistent with the RPE hypothesis, and have been buttressed by several decades of research (e.g., Eshel et al., 2015; Glimcher, 2011; Niv and Schoenbaum, 2008; Schultz et al., 1997; Steinberg et al., 2013; Bayer and Glimcher, 2005; Roesch et al., 2007). The computational importance of this hypothesis is due to the role of RPE in reinforcement learning (RL) algorithms, specifically the temporal difference learning algorithm (Sutton, 1988; Sutton and Barto, 2018). An agent can use RPEs to learn long-term reward predictions: Unexpected rewards indicate that the agent should increase its future expectation of reward, while omissions of expected rewards indicate that it should decrease its future expectation of reward.

The RPE hypothesis does not by itself explain the role of DA in interval timing, since it is compatible with many different assumptions about the representation of time (Starkweather et al., 2017; Daw et al., 2006; Ludvig et al., 2008). However, the choice of time representation can have a dramatic influence on the effectiveness of RL algorithms. If there is some limit on the precision with which time can be represented, then the limited representational capacity should be concentrated on time scales (or more generally time intervals) that are important for reward prediction. Since animals need to deal with multiple time scales for different tasks, this representation should be rescalable. For example, if time is represented by the firing rate of “time cells” tuned to particular time intervals (e.g., MacDonald et al., 2011; Salz et al., 2016; Tiganj et al., 2017; Bright et al., 2020), then the tuning functions should stretch or compress if the task-relevant interval is increased or decreased, respectively. Evidence for task-dependent rescaling has been reported in both striatum (Mello et al., 2015) and hippocampus (Shimbo et al., 2021).

Mikhael and Gershman (2019) formalized this rescaling idea in a temporal difference learning model of DA. The key idea was to treat the time scale of the temporal representation as a parameter that could be adjusted by the RPE signal. In this way, DA could modify the speed of subjective time in order to optimize reward prediction. In particular, the model predicted a bidirectional plasticity rule for the timing parameter: Positive RPEs that occur before expected reward delivery should tend to increase the speed of subjective time, while positive RPEs that occur after expected reward delivery should decrease the speed of subjective time (see a derivation of this result in the next section). Mikhael and Gershman (2019) showed that this model could account for a number of dopaminergic effects on interval timing behavior.

In this paper, we undertake a more direct test of the bidirectional plasticity hypothesis, using DA measurements collected from mice performing a self-timed movement task (Hamilos et al., 2021). In this task, mice received a reward for licks performed after a fixed interval. Even after extensive training, the authors observed ramping DA signals and variable trial-to-trial lick times. Furthermore, the authors found that steeply rising DA ramps preceded early lick times and slowly rising DA ramps preceded late lick times. Based on our earlier theoretical work (Gershman, 2014; Kim et al., 2020; Mikhael et al., 2022), we argue that the slope of DA ramps is a proxy for the speed of subjective time. We then ask whether the timing of DA activity relative to the time of reward delivery predicts the ramp slope on the subsequent trial in accordance with the bidirectional plasticity rule.

In contrast to deterministic RL paradigms in which RPEs eventually flatten out to zero because the task is perfectly learned (the rewards are no longer surprising), in the present setup the RPE remains non-zero even after the task is well-learned. Indeed, previous studies have shown that dopamine signals are sensitive to the predicted timing of reward delivery (Hollerman and Schultz, 1998; Fiorillo et al., 2008; Starkweather et al., 2017). Thus, even after having learned to expect a reward at time T, the mice must rely on a noisy estimate of the current time to determine whether they have truly reached T, and a reward received at that moment will elicit some positive RPE. In other words, despite having learned the reward’s magnitude, the animals cannot make a perfect prediction about the reward’s timing.

Methods

The computational problem

We construe animals as facing the problem of learning to predict long-term reward, or value, defined as the expected discounted future return (cumulative reward):

$$V_t = \mathbb{E}\left[\sum_{k=0}^{T} \gamma^k r_{t+k}\right], \tag{1}$$

where $t$ indexes intra-trial time ($t = 0$ corresponds to trial onset), $r_t$ is the reward received at time $t$, $T$ is the trial duration, and $\gamma \in (0, 1)$ is a discount factor. In Hamilos et al. (2021), the animal receives a single reward $r$ at time $T$ in each trial, so Equation (1) can be simply written as:

$$V_t = \gamma^{T-t} r. \tag{2}$$

The value function and RPE are illustrated in Figure 1A. If, as commonly assumed, the rewards follow a Markov process, then Equation (1) can be written recursively:

$$V_t = r_t + \gamma V_{t+1}. \tag{3}$$

Figure 1: Simulations of ramping RPEs with a temporal difference learning model.

(A) Convex value function (black) and ramping RPE (gray). (B) Simulated DA signal (black) and estimated DA ramp (linear regression between trial start and reward delivery, gray). The DA signal corresponds to the RPE under temporal uncertainty (Methods). (C) Partial derivative of the estimated value function with respect to time, which gives the bidirectional update rule of the pacemaker rate η its qualitative shape.

This recursive expression is known as the Bellman equation (Bellman, 1957), and is the basis for efficient RL algorithms such as temporal difference learning (Sutton, 1988).
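
As a sanity check, the single-reward value function of Equation (2) satisfies this recursion exactly. Below is a minimal sketch in Julia (the language of our released code; see Source code); the parameter values are illustrative assumptions, not the simulation settings.

```julia
# Minimal sketch: verify that V_t = γ^(T-t)·r (Equation 2) satisfies the
# Bellman recursion V_t = r_t + γ·V_{t+1} (Equation 3).
γ, T, r = 0.95, 40, 1.0

V(t) = γ^(T - t) * r              # Equation (2)
reward(t) = t == T ? r : 0.0      # a single reward delivered at t = T

for t in 0:T-1
    @assert V(t) ≈ reward(t) + γ * V(t + 1)   # Equation (3)
end
println("Bellman recursion verified for t = 0, …, $(T-1)")
```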

Note that, for simplicity, we do not directly model action selection in this paper. Of course, action selection is a critical aspect of the tasks facing animals in the experiment that we model. However, for the purposes of predicting dopamine responses, we will show that we do not need to invoke the additional complexity entailed by a model of action. We leave this more complete model as a task for future work.

Temporal difference learning model

To learn the value function $V_t$, we first define a parametric function class and then present a learning algorithm that adjusts the parameters to minimize the discrepancy between the estimator and the true value function. A standard parametrization is the linear function approximator, which approximates the value function as a linear projection of time-varying features (Schultz et al., 1997; Ludvig et al., 2008, 2012):

$$\hat{V}_t = \sum_d w_d x_{d,t}, \tag{4}$$

where $x_{d,t}$ is the $d$th feature at time $t$, and $w_d$ is the feature weight. For example, a feature may represent the presence ($x_{d,t} = 1$) or absence ($x_{d,t} = 0$) of a stimulus at time $t$. Alternatively, it may represent the physical proximity to a reward location.

The weights $w_d$ are updated by gradient descent to reduce the mismatch between $V_t$ and $\hat{V}_t$:

$$\Delta w_d = \alpha \delta_t \nabla_{w_d} \hat{V}_t, \tag{5}$$

where $\alpha \in (0, 1)$ is the learning rate, $\nabla_{w_d} \hat{V}_t = x_{d,t}$ is the gradient of $\hat{V}_t$ with respect to the weight $w_d$, and $\delta_t$ is the reward prediction error (RPE):

$$\delta_t = r_t + \gamma \hat{V}_{t+1} - \hat{V}_t. \tag{6}$$

Notice that $\delta_t$ equals the mismatch between the agent’s estimates of the right-hand side and left-hand side of Equation (3). When $\delta_t = 0$ on average, $\hat{V}_t = V_t$, and hence the value is well-learned. Otherwise, the agent continues to update $\hat{V}_t$ to minimize $\delta_t$.
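
To make the learning rule concrete, the following sketch runs Equations (4)–(6) with a one-hot feature per time step; the feature choice, learning rate, and trial count are illustrative assumptions rather than the feature used in our simulations (described below).

```julia
using LinearAlgebra

# TD learning with linear function approximation (Equations 4–6).
γ, α, T = 0.95, 0.1, 40
x = Matrix{Float64}(I, T + 1, T + 1)   # x[:, t+1] is the one-hot feature at time t
w = zeros(T + 1)
reward(t) = t == T ? 1.0 : 0.0

# V̂_t = Σ_d w_d·x_{d,t} (Equation 4); value is zero beyond the trial end
Vhat(t) = t > T ? 0.0 : dot(w, x[:, t + 1])

for trial in 1:2000
    for t in 0:T
        δ = reward(t) + γ * Vhat(t + 1) - Vhat(t)   # RPE (Equation 6)
        w .+= α * δ .* x[:, t + 1]                  # gradient step (Equation 5)
    end
end

println(Vhat(0), " ≈ ", γ^T)   # learned value approaches Equation (2)
```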

The shape of $\delta_t$ after a task is well-learned will depend on the choice of features. For instance, Gershman (2014) showed that, for a single feature taking a sufficiently convex shape across states, $\delta_t$ will exhibit the shape of a ramp (see also Morita and Kato, 2014; Lloyd and Dayan, 2015; Mikhael et al., 2022, for alternative approximation architectures that result in ramps, such as time cells). For simplicity, we will assume in what follows a single feature $x$ taking a sufficiently convex shape across subjective time (the animal’s estimate of elapsed time since the beginning of the trial). This will produce ramping (Figure 1A); a mathematical analysis of this point appears below. Convexity can arise in a variety of ways, but is broadly consistent with the idea that temporal sensitivity is higher around temporal landmarks such as motor responses and reward delivery. A more biologically realistic model could generate this differential sensitivity by narrowing the tuning curves of time-encoding neurons selective for short time intervals relative to these temporal landmarks (see Ludvig et al., 2008; Mikhael and Gershman, 2019).

It is important to note that perfectly learning a value function depends on having a perfect internal clock (i.e., subjective and objective time coincide). Instead, animals are noisy timers, and are furthermore subject to Weber’s law, which asserts that the standard deviation of an animal’s temporal estimate increases linearly with the elapsed time (Church and Meck, 2003; Gibbon, 1977; Staddon, 1965). This has the effect of ‘blurring’ the value function in proportion to the animal’s temporal uncertainty. Because the RPE is a function of value, it too gets blurred, and this blurring determines the shape of the ramp (Figure 1B). Specifically, the predicted DA response is computed as the convolution of the RPE with a Gaussian temporal uncertainty kernel determined by Weber’s law:

$$\mathrm{DA}_t = \sum_{\tau} \delta_\tau \, \mathcal{N}\!\left(\tau; t, (\beta \eta t)^2\right), \tag{7}$$

where $\beta$ is the Weber fraction.¹ In our previous work (Mikhael et al., 2022), we showed that temporal uncertainty can explain diverse DA dynamics across different tasks, including positive ramps, negative ramps, flat functions, and even non-monotonic functions.
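
A minimal sketch of Equation (7) on a toy RPE trace; the trace itself and the small floor on the kernel width at $t = 0$ are assumptions for illustration.

```julia
# Blur a toy RPE trace with a Weber-scaled Gaussian kernel (Equation 7).
β, η, T = 0.2, 1.0, 40
δ = [t == T ? 1.0 : 0.01 * (t / T)^2 for t in 0:T]   # toy ramping RPE (assumption)

gauss(τ, μ, σ) = exp(-(τ - μ)^2 / (2σ^2)) / (σ * sqrt(2π))

# DA_t = Σ_τ δ_τ · N(τ; t, (βηt)²); the floor on σ avoids a degenerate
# kernel at t = 0 (numerical guard, an assumption)
DA(t) = sum(δ[τ + 1] * gauss(τ, t, max(β * η * t, 1e-3)) for τ in 0:T)

da = [DA(t) for t in 0:T]   # blurred, ramp-shaped DA signal (cf. Figure 1B)
```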

The key addition of the model presented in Mikhael and Gershman (2019) is to account for the role of DA in modulating the speed of subjective time. We formalize this speed variable as a parameter η that rescales the relationship between objective and subjective time: τ = ηt. Thus, when η increases, subjective time (τ) runs faster. Importantly, we can view η as another parameter in the function approximation architecture, and optimize it via gradient descent just as we did for the weights:

$$\Delta \eta = \alpha_\eta \, \delta_t \, t \, \frac{\partial \hat{V}_t}{\partial \tau}, \tag{8}$$

where $\alpha_\eta$ is the learning rate (the factor $t \, \partial \hat{V}_t / \partial \tau$ is the gradient of $\hat{V}_t$ with respect to $\eta$, by the chain rule applied to $\tau = \eta t$). Note that the derivative of $\hat{V}_t$ with respect to $\tau$ is greater than zero roughly before reward delivery but less than zero afterwards. It follows that the contribution of the RPE is bidirectional: DA signals occurring before reward time should increase $\eta$, and DA signals occurring after reward time should decrease it (Figure 1C).
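
The sign structure of this update can be illustrated directly. In the sketch below, the piecewise value-derivative profile is a toy assumption chosen so that $\partial \hat{V}_t / \partial \tau$ is positive before reward time and negative afterwards, mimicking Figure 1C.

```julia
# Bidirectional update of the pacemaker rate η (Equation 8).
αη, T = 0.01, 40

# ∂V̂_t/∂τ: rises before reward time, falls after (toy piecewise profile)
dVdτ(t) = t < T ? 0.05 : -0.10

Δη(δ, t) = αη * δ * t * dVdτ(t)   # Equation (8)

println(Δη(1.0, 20))   # positive RPE before reward time: Δη > 0, clock speeds up
println(Δη(1.0, 45))   # positive RPE after reward time:  Δη < 0, clock slows down
```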

Choice of feature shape

Our choice of feature $x$ results in a ramping RPE. To see this, note that an RPE ramps if and only if $\ddot{x} + \dot{x} \ln \gamma > 0$ (Mikhael et al., 2022). Intuitively, by Equation (6), $r_t = 0$ during the trial but prior to receiving reward. With a single feature, it follows that $\delta_t = \gamma \hat{V}_{t+1} - \hat{V}_t = w(\gamma x_{t+1} - x_t)$. Because $\gamma$ is close to 1, the term in the parentheses is approximately the derivative of $x$. This term, and hence the RPE, ramps when its own derivative is positive, i.e., when the second derivative of $x$ is positive ($\ddot{x} > 0$). The second term in our exact requirement accounts for the more general case when $\gamma$ is not equal to 1 (see Mikhael et al., 2022, for a full derivation of this result). Using our choices of $x$ and $\gamma$ (specified below), the requirement is satisfied for $t < 58$, which is a superset of the temporal domain chosen for our simulations.

Simulation parameters

We have chosen $\gamma = 0.95$, $T = 40$, $\beta = 0.2$, $r_T = 1$ at time $t = T$ and $r_t = 0$ otherwise, $\tau = t$, $\alpha_\eta = 0.01$, and a single feature $x_t = k t^4$ if $t \le T$ and $x_t = 0$ otherwise, with $k = r_T T^{-4}$.
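
With these choices, the ramping condition of the previous subsection can be verified numerically; for $x_t = k t^4$ it reduces analytically to $t < -3/\ln \gamma \approx 58.5$. A short sketch:

```julia
# Check the ramping condition ẍ + ẋ·ln(γ) > 0 for x_t = k·t⁴
# with the simulation parameters above.
γ, T, rT = 0.95, 40, 1.0
k = rT / T^4

xdot(t)  = 4 * k * t^3    # first derivative of x_t = k·t⁴
xddot(t) = 12 * k * t^2   # second derivative

ramping(t) = xddot(t) + xdot(t) * log(γ) > 0

println(-3 / log(γ))        # analytic bound ≈ 58.5
println(all(ramping, 1:T))  # condition holds on the simulated domain t ≤ 40
```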

Data analysis

We obtained $F(t)$ from the raw GCaMP6f measurements by removing outliers (> 15 standard deviations from the mean) via interpolation, as done in Hamilos et al. (2021). To correct for bleaching, we then computed the DA signal as $\mathrm{dF/F}(t) = \frac{F(t) - F_0(t)}{F_0(t)}$, where $F_0(t)$ is a 200 s moving average of $F(t)$, as reported in the original study. Subsequently, we divided each trial into $M$ time-bins; we chose $M = 20$ time-bins of length 0.85 s each to cover the trial length of 17 s. We aligned the time-bins around the first-lick time in each trial $n$ and computed the average DA level $D_{n,m}$ within each time-bin $m$. We computed the baseline DA level for each trial, defined as the average DA level between lamp-off (a signal indicating the imminence of the cue; see Hamilos et al., 2021) and cue, and subtracted it from each corresponding time-bin.
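
A sketch of this preprocessing; the sampling rate, the stand-in trace, and the helper names are assumptions, while the pipeline structure follows the description above.

```julia
using Statistics

fs = 10                       # samples per second (assumption)
F  = 100.0 .+ randn(30_000)   # stand-in raw GCaMP6f trace (assumption)

# Outlier removal (> 15 SD from the mean) by simple interpolation
function remove_outliers!(F; nsd = 15)
    μ, σ = mean(F), std(F)
    for i in eachindex(F)
        if abs(F[i] - μ) > nsd * σ
            F[i] = F[max(i - 1, firstindex(F))]   # replace with previous sample
        end
    end
    return F
end

# Bleaching correction: dF/F against a 200 s moving-average baseline F₀(t)
function dFF(F, fs; window_s = 200)
    h = (window_s * fs) ÷ 2
    F0 = [mean(@view F[max(1, i - h):min(end, i + h)]) for i in eachindex(F)]
    return (F .- F0) ./ F0
end

signal = dFF(remove_outliers!(F), fs)
```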

Then, we computed the DA ramp slope $s_n$ during each trial by fitting a straight line to the DA signal from 0.7 s post-cue to 0.6 s pre-lick. These buffer lengths were taken from Hamilos et al. (2021) to eliminate the effect of perception- and motion-induced transients in the signal. Hence, in order to guarantee the presence of a start and end point for the computation of the ramp slope, we restricted our analysis to trials containing a lick.
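
A minimal sketch of the slope fit, with a toy noiseless trial as a usage example (function and variable names are assumptions):

```julia
# Fit the DA ramp slope s_n: a least-squares line through the DA signal
# between 0.7 s post-cue and 0.6 s pre-lick (buffers from Hamilos et al., 2021).
function ramp_slope(t, da, t_cue, t_lick; post = 0.7, pre = 0.6)
    idx = findall(τ -> t_cue + post ≤ τ ≤ t_lick - pre, t)
    X = [ones(length(idx)) t[idx]]    # design matrix: intercept and time
    coeffs = X \ da[idx]              # least-squares line fit
    return coeffs[2]                  # the slope
end

# Toy usage: a noiseless ramp between cue (0 s) and lick (4 s)
t  = collect(0:0.01:5)
da = 0.3 .* t
println(ramp_slope(t, da, 0.0, 4.0))  # ≈ 0.3
```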

We then defined $a_n = s_{n+1} - s_n$, the difference in DA slope between the current trial $n$ and the next trial $n+1$, which is a neural proxy for the change in $\eta$ from the current trial to the next. We then solved the linear system $Db = a$, where $b$ is the contribution of each bin to the change in DA ramp slope. The solution to this optimization problem (equivalent to maximum likelihood estimation of a linear regression model) is $\hat{b} = (D^\top D)^{-1} D^\top a$. This analysis was done for each mouse individually as well as on pooled data.
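
In Julia, this least-squares solution can be computed directly with the backslash operator; a sketch on stand-in data of the sizes described above (values are random placeholders):

```julia
# Solve Db = a in the least-squares sense: b̂ = (DᵀD)⁻¹Dᵀa.
n_trials, M = 500, 20
D = randn(n_trials, M)   # stand-in binned, baseline-subtracted DA levels
a = randn(n_trials)      # stand-in slope changes a_n = s_{n+1} - s_n

bhat  = D \ a               # least-squares solution via QR
bhat2 = (D' * D) \ (D' * a) # the normal-equations form, same result
@assert bhat ≈ bhat2
```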

Furthermore, for each rewarded trial, we averaged DA levels in a 500 ms window around the cue, a 500 ms window around the lick, and over the whole trial from cue to lick. We classified a trial as high-DA-around-cue or high-DA-around-lick if the average DA level in the corresponding window exceeded the mean trial DA. We then plotted the DA ramps of trials immediately following high-DA-around-cue and high-DA-around-lick trials.
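
A sketch of this classification; the windowing details and names are assumptions consistent with the description above.

```julia
using Statistics

# Classify a trial by where its DA is concentrated: mean DA in a 500 ms
# window around an event vs. the mean over the whole cue-to-lick segment.
function high_da_around(t, da, event_t; half_width = 0.25)
    win = findall(τ -> abs(τ - event_t) ≤ half_width, t)
    return mean(da[win]) > mean(da)   # da is the cue-to-lick segment
end

# A trial is high-DA-around-cue (resp. high-DA-around-lick) when the test
# is true for the cue time (resp. the lick time).
```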

Source code

All simulations and analyses were performed using Julia, version 1.6.2. Source code can be found at https://github.com/amvjakob/dopa-rpe-interval-timing.

Results

Hamilos et al. (2021) trained mice to perform an interval timing task by initiating a self-timed lick at least 3.3 seconds after a start-timing cue. First licks occurring during the reward window (3.3 to 7 seconds after the cue) were rewarded with juice, while no reward was delivered on early lick (< 3.3 seconds) and no-lick (> 7 seconds) trials. The total duration for one run of the task was set to 17 seconds. Despite highly variable first lick times from trial to trial, the authors found that DA signals ramped up during the self-timed interval following the start-timing cue. Crucially, they found that the DA ramp slope was highly predictive of lick time, with larger slopes being associated with earlier lick times. They also found that higher baseline DA levels correlated with greater ramp slopes and earlier lick times, consistent with the view that higher DA levels lead to faster clocks.

To examine our prediction of a bidirectional effect of DA on the speed of subjective time, we reanalyzed the data from Hamilos et al. (2021). Using the linear regression model detailed in the Methods, we studied the association between DA levels at particular points in time during a trial and the ramp slope (a measurable proxy for the speed of subjective time) on the subsequent trial. In this way, we could extract a detailed temporal plasticity function and compare it to the theoretical plasticity function (Figure 1C).

Figure 2A shows the estimated regression coefficients for each time-bin. Consistent with our model predictions, the estimated coefficients revealed that early DA signals in a trial had a positive effect on the change in ramp slope, and late signals had a negative effect. In other words, an increase in DA activity shortly after cue presentation resulted in an increase in ramp slope on the next trial, whereas an increase in DA activity shortly after licking resulted in a decrease in ramp slope on the next trial. Note that in comparison to Figure 1C, the bidirectional plasticity function appears shifted relative to lick time, which may stem from greater temporal uncertainty (leading to more value-function blurring) or from measurement delays. Since the shape of the bidirectional plasticity function scales horizontally with trial duration, we also report the estimated regression coefficients for trials pooled by lick time (Figure 3). The function’s qualitative shape remained the same regardless of trial duration or reward delivery.

Figure 2: Bidirectional update rule.

(A) Empirical bidirectional plasticity function for rewarded trials with a lick 3.3–7 s post-cue, for each mouse (colors) and for pooled data (black), smoothed with a 1.7 s moving average filter. The function’s qualitative shape is not sensitive to the precise choice of lick interval (see Figure 3). Early (late) DA signals correspond to an increase (decrease) in DA ramp slope on the next trial. Shaded area represents standard error of the mean. The regression coefficients of the bin after the cue and the bin immediately before the lick are statistically different (t(11) = 8.7, p < 10⁻⁵). (B) Empirical plasticity function for reversed trial order to rule out a possible slow temporal confound, smoothed with a 1.7 s moving average filter. The regression coefficients of the bin after the cue and the bin immediately before the lick are not statistically different (t(11) = −0.4, p = 0.68). (C) Average DA signal for trials with a lick 3.5–4 s post-cue, smoothed with a 1 s moving average kernel and classified by DA level on the previous trial: high DA around the cue (average DA in a 500 ms window around the cue is larger than average trial DA, black) and high DA around licking (average DA in a 500 ms window around the lick is larger than average trial DA, gray). The dashed lines correspond to the average DA signal, while the thick lines are fitted to the signal between the gray rectangles, which represent buffers after the cue and before the lick, as given in Hamilos et al. (2021) to eliminate perception- and motion-related transients. A high-DA-around-cue condition (n = 573) corresponds to a steeper DA ramp slope on the next trial, compared to a high-DA-around-lick condition (n = 4805; t(5376) = 6.3, p < 10⁻⁹).

Figure 3: Bidirectional update rule for different lick times.

Empirical bidirectional plasticity function for all trials containing a lick, pooled by lick time and smoothed with a 1.7s moving average filter. Vertical dashed lines are plotted at mean cue time (left) and lick time (right). The plasticity function’s qualitative shape is the same irrespective of the trial duration or outcome (early lick or rewarded). Shaded area represents standard error of the mean.

Due to slow drift in the behavioral timing distribution between the beginning and end of sessions, higher baseline amplitude at the beginning of the session may lead to steeper slopes on nearby trials generally, without any causal effect. Although baseline normalization of activity on each trial should diminish the effect of slow drift, it is possible that residual drift is driving our results. We reasoned that if the slow-drift hypothesis were correct, the analysis should produce the same results when run on trials in reverse order. We therefore reran the regression analysis on the reversed sequence of trials, which eliminated the relationship between within-trial DA signaling and ramp-slope change (Figure 2B). This analysis, coupled with baseline normalization, rules out the slow temporal confound.

Figure 2C illustrates how ramp slope changes as a function of DA activity at different points during the previous trial. When DA activity is high following cue presentation, the ramp on the next trial tends to be steeper compared to when DA activity is high immediately before licking. Our model asserts that this difference arises from the proposed bidirectional plasticity rule.

Discussion

By reanalyzing recordings of dopaminergic neurons in mice performing a self-timed movement task (Hamilos et al., 2021), we have shown that DA has a bidirectional effect on the speed of subjective time. We showed that the contribution of DA on the current trial to the change in DA ramp slope (a proxy for the speed of subjective time) on the next trial exhibits the predicted bidirectional shape: DA signals occurring before reward time tend to increase the DA ramp slope on the next trial, and those occurring after reward time tend to decrease it, consistent with our RL theory of temporal optimization (Mikhael and Gershman, 2019). This theory was previously invoked by Hamilos and Assad (2020) to suggest that the observed DA ramps may qualitatively correspond to an RPE (derivative-like) computation, but that study left open the question of why time rescaling itself should vary across trials and how previous DA signals affect the current clock speed. Here we address this question by showing how time rescaling can be endogenized by a model that optimizes the rescaling parameter using temporal difference learning.

For simplicity, we have chosen a feature in our temporal difference model that produces ramps. However, the cause of ramps—and how they relate mechanistically to the flow of time—remains an open question. Indeed, DA ramps have been observed in various operant conditioning tasks, both during the pre-action period (Totah et al., 2013) as well as during action execution (Howe et al., 2013). Ramps have furthermore been observed in classical conditioning tasks that provided cues indicating proximity to reward (Kim et al., 2020). Recent work has suggested that these ramps occur as a consequence of sensory feedback (Mikhael et al., 2022), although they may also be captured by a “forgetting” mechanism within an RL framework (i.e., a decay term in the value update; Morita and Kato, 2014), or by state-dependent biases such as an overestimation of time or distance to reward, if the biases decrease with proximity to the reward (Mikhael et al., 2022).

Our interpretation of the data from Hamilos et al. (2021) rests on a reverse inference about ramp slope: Steeper ramps indicate faster subjective time. Is this reverse inference valid? One cause for doubt is that some past work on ramping suggests that it occurs in the absence of any obvious demand on time-keeping. For example, Howe et al. (2013) found ramping in a T-maze task, where it was unnecessary for the animal to keep track of elapsed time. Moreover, ramp slope is modulated by other factors, such as learning stage and task engagement (Farrell et al., 2021; Guru et al., 2020). Our goal in this paper is not to provide a comprehensive theory of ramping (see Mikhael et al., 2022), but rather to leverage one factor determining ramp slope. Even if it is true that ramp slope is also determined by other factors, this does not logically invalidate the reverse inference as long as these other factors are not highly correlated with the timing factor. The fact that we are able to predict trial-by-trial variations in ramp slope based on a timing model suggests that this assumption is plausible.

While the model we put forward provides a joint explanation for the role of DA in time perception and reward prediction, the precise mechanisms through which DA signals are translated into movements remain unclear. Recent work has investigated the effect of DA activity on action initiation thresholding (Coddington and Dudman, 2018, 2019), thus providing another dimension to the role of DA in driving motor behavior.

Our model of timing optimization by RL can potentially be related to several existing models of interval timing. In the striatal beat frequency model, cortical neurons are assumed to fire in an oscillating pattern with different phases (Matell and Meck, 2004). It follows that the neurons active during both the reward-predicting cue and the reward represent a neural code for the interval to be timed. Assuming that DA affects the firing frequency of the cortical oscillators, our bidirectional update rule provides a compatible extension to this model to account for interval timing modulation effects.

Alternatively, in pacemaker-accumulator (PA) models, time is represented by counting the number of ticks emitted by a noisy clock (Gibbon et al., 1997; Zakay and Block, 1997). Given the similarity between the ticking of the clock and the successive state-to-state transitions typical of an RL model as representations of the passage of time, our model provides a natural extension to the PA framework: by letting the rescaling parameter η influence the speed of the clock or the tick-count threshold, DA-mediated modulation of interval timing can be accounted for. Despite the differences between these two classes of timing models, it is notable that in both cases the parametrized rescaling of a quantity through a bidirectional plasticity rule endows the model with the ability to account for interval timing modulation effects.

In conclusion, we have shown here that RL and interval timing are critically linked by a common dopaminergic mechanism. To our knowledge, this is the first theory that captures the bidirectional effect of DA on interval timing. More broadly, the idea that prediction errors can drive representation learning may extend beyond interval timing to other domains (Alexander and Gershman, 2021). An important project for future work will be to examine empirically whether the same dopaminergic signal serves this function across domains.

Acknowledgments

This work was supported by a Bertarelli Fellowship (AMVJ), the Air Force Office of Scientific Research grant FA9550-20-1-0413 (SJG), and the National Institutes of Health grants T32GM007753 (JGM), T32MH020017 (JGM), and U19 NS113201-01 (SJG, JAA). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Source code for all simulations and analyses can be found at https://github.com/amvjakob/dopa-rpe-interval-timing.

Footnotes

¹ The assumption of Weber noise is not necessary for the results we present in this paper, but we include it here for consistency with past work.

References

1. Alexander WH and Gershman SJ (2021). Representation learning with reward prediction errors. arXiv preprint arXiv:2108.12402.
2. Bayer HM and Glimcher PW (2005). Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47:129–141.
3. Bellman R (1957). Dynamic Programming. Princeton University Press.
4. Bright IM, Meister ML, Cruzado NA, Tiganj Z, Buffalo EA, and Howard MW (2020). A temporal record of the past with a spectrum of time constants in the monkey entorhinal cortex. Proceedings of the National Academy of Sciences, 117:20274–20283.
5. Church RM and Meck W (2003). A concise introduction to scalar timing theory. In Meck W, editor, Functional and Neural Mechanisms of Interval Timing, pages 3–22. CRC Press/Routledge/Taylor & Francis Group.
6. Coddington LT and Dudman JT (2018). The timing of action determines reward prediction signals in identified midbrain dopamine neurons. Nature Neuroscience, 21(11):1563–1573.
7. Coddington LT and Dudman JT (2019). Learning from action: reconsidering movement signaling in midbrain dopamine neuron activity. Neuron, 104(1):63–77.
8. Daw ND, Courville AC, and Touretzky DS (2006). Representation and timing in theories of the dopamine system. Neural Computation, 18:1637–1677.
9. Eshel N, Bukwich M, Rao V, Hemmelder V, Tian J, and Uchida N (2015). Arithmetic and local circuitry underlying dopamine prediction errors. Nature, 525:243–246.
10. Farrell K, Lak A, and Saleem AB (2021). Midbrain dopamine neurons provide teaching signals for goal-directed navigation. bioRxiv.
11. Fiorillo CD, Newsome WT, and Schultz W (2008). The temporal precision of reward prediction in dopamine neurons. Nature Neuroscience, 11(8):966–973.
12. Gershman SJ (2014). Dopamine ramps are a consequence of reward prediction errors. Neural Computation, 26:467–471.
13. Gershman SJ, Moustafa AA, and Ludvig EA (2014). Time representation in reinforcement learning models of the basal ganglia. Frontiers in Computational Neuroscience, 7:194.
14. Gibbon J (1977). Scalar expectancy theory and Weber’s law in animal timing. Psychological Review, 84:279–325.
15. Gibbon J, Malapani C, Dale CL, and Gallistel CR (1997). Toward a neurobiology of temporal cognition: advances and challenges. Current Opinion in Neurobiology, 7:170–184.
16. Glimcher PW (2011). Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences, 108(Supplement 3):15647–15654.
17. Guru A, Seo C, Post RJ, Kullakanda DS, Schaffer JA, and Warden MR (2020). Ramping activity in midbrain dopamine neurons signifies the use of a cognitive map. bioRxiv.
18. Hamilos AE and Assad JA (2020). Application of a unifying reward-prediction error (RPE)-based framework to explain underlying dynamic dopaminergic activity in timing tasks. bioRxiv.
19. Hamilos AE, Spedicato G, Hong Y, Sun F, Li Y, and Assad JA (2021). Slowly evolving dopaminergic activity modulates the moment-to-moment probability of reward-related self-timed movements. eLife, 10:e62583.
20. Hollerman JR and Schultz W (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neuroscience, 1(4):304–309.
21. Howe MW, Tierney PL, Sandberg SG, Phillips PE, and Graybiel AM (2013). Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature, 500:575.
22. Kim HR, Malik AN, Mikhael JG, Bech P, Tsutsui-Kimura I, Sun F, Zhang Y, Li Y, Watabe-Uchida M, Gershman SJ, and Uchida N (2020). A unified framework for dopamine signals across timescales. Cell, 183:1600–1616.
23. Lake JI and Meck WH (2013). Differential effects of amphetamine and haloperidol on temporal reproduction: dopaminergic regulation of attention and clock speed. Neuropsychologia, 51:284–292.
24. Lloyd K and Dayan P (2015). Tamping ramping: Algorithmic, implementational, and computational explanations of phasic dopamine signals in the accumbens. PLoS Computational Biology, 11:e1004622.
25. Ludvig EA, Sutton RS, and Kehoe EJ (2008). Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Computation.
26. Ludvig EA, Sutton RS, and Kehoe EJ (2012). Evaluating the TD model of classical conditioning. Learning & Behavior, 40:305–319.
27. MacDonald CJ, Lepage KQ, Eden UT, and Eichenbaum H (2011). Hippocampal “time cells” bridge the gap in memory for discontiguous events. Neuron, 71:737–749.
28. Maricq AV and Church RM (1983). The differential effects of haloperidol and methamphetamine on time estimation in the rat. Psychopharmacology, 79:10–15.
29. Maricq AV, Roberts S, and Church RM (1981). Methamphetamine and time estimation. Journal of Experimental Psychology: Animal Behavior Processes, 7:18–30.
30. Matell MS and Meck WH (2004). Cortico-striatal circuits and interval timing: coincidence detection of oscillatory processes. Cognitive Brain Research, 21:139–170.
31. Mello GB, Soares S, and Paton JJ (2015). A scalable population code for time in the striatum. Current Biology, 25:1113–1122.
32. Mikhael JG and Gershman SJ (2019). Adapting the flow of time with dopamine. Journal of Neurophysiology, 121:1748–1760.
33. Mikhael JG, Kim HR, Uchida N, and Gershman SJ (2022). The role of state uncertainty in the dynamics of dopamine. Current Biology, 32:1077–1087.
34. Morita K and Kato A (2014). Striatal dopamine ramping may indicate flexible reinforcement learning with forgetting in the cortico-basal ganglia circuits. Frontiers in Neural Circuits, 8:36.
35. Niv Y and Schoenbaum G (2008). Dialogues on prediction errors. Trends in Cognitive Sciences, 12(7):265–272.
36. Petter EA, Gershman SJ, and Meck WH (2018). Integrating models of interval timing and reinforcement learning. Trends in Cognitive Sciences, 22:911–922.
37. Roesch MR, Calu DJ, and Schoenbaum G (2007). Dopamine neurons encode the better option in rats deciding between differently delayed or sized rewards. Nature Neuroscience, 10:1615–1624.
38. Salz DM, Tiganj Z, Khasnabish S, Kohley A, Sheehan D, Howard MW, and Eichenbaum H (2016). Time cells in hippocampal area CA3. Journal of Neuroscience, 36:7476–7484.
39. Schultz W, Dayan P, and Montague PR (1997). A neural substrate of prediction and reward. Science, 275:1593–1599.
40. Shimbo A, Izawa E-I, and Fujisawa S (2021). Scalable representation of time in the hippocampus. Science Advances, 7:eabd7013.
41. Soares S, Atallah BV, and Paton JJ (2016). Midbrain dopamine neurons control judgment of time. Science, 354:1273–1277.
42. Staddon J (1965). Some properties of spaced responding in pigeons. Journal of the Experimental Analysis of Behavior, 8:19–28.
43. Starkweather CK, Babayan BM, Uchida N, and Gershman SJ (2017). Dopamine reward prediction errors reflect hidden-state inference across time. Nature Neuroscience, 20:581–589.
44. Steinberg EE, Keiflin R, Boivin JR, Witten IB, Deisseroth K, and Janak PH (2013). A causal link between prediction errors, dopamine neurons and learning. Nature Neuroscience, 16(7):966–973.
45. Sutton RS (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44.
46. Sutton RS and Barto AG (2018). Reinforcement Learning: An Introduction. MIT Press.
47. Tiganj Z, Jung MW, Kim J, and Howard MW (2017). Sequential firing codes for time in rodent medial prefrontal cortex. Cerebral Cortex, 27:5663–5671.
48. Totah NK, Kim Y, and Moghaddam B (2013). Distinct prestimulus and poststimulus activation of VTA neurons correlates with stimulus detection. Journal of Neurophysiology, 110:75–85.
49. Zakay D and Block RA (1997). Temporal cognition. Current Directions in Psychological Science, 6:12–16.
