Author manuscript; available in PMC: 2019 Apr 1.
Published in final edited form as: Curr Opin Neurobiol. 2017 Oct 31;49:1–7. doi: 10.1016/j.conb.2017.10.006

Model-based predictions for dopamine

Angela J Langdon 1, Melissa J Sharpe 1,2,3, Geoffrey Schoenbaum 2, Yael Niv 1
PMCID: PMC6034703  NIHMSID: NIHMS978215  PMID: 29096115


Phasic dopamine responses are thought to encode a prediction-error signal consistent with model-free reinforcement learning theories. However, a number of recent findings highlight the influence of model-based computations on dopamine responses, and suggest that dopamine prediction errors reflect more dimensions of an expected outcome than scalar reward value. Here, we review a selection of these recent results and discuss the implications and complications of model-based predictions for computational theories of dopamine and learning.


The striking correspondence between the phasic responses of midbrain dopamine neurons and the temporal-difference reward prediction error posited by reinforcement-learning theory is by now well established [1–5]. According to this theory, dopamine neurons broadcast a prediction error – the difference between the learned predictive value of the current state, signaled by cues or features of the environment, and the sum of the current reward and the value of the next state. Central to the normative grounding of temporal-difference reinforcement learning (TDRL) is the definition of ‘value’ as the expected sum of future (possibly discounted) rewards [6], from whence the learning rule can be derived directly. The algorithm also provides a simple way to learn such values using prediction errors, which is thought to be implemented in the brain through dopamine-modulated plasticity in corticostriatal synapses [7,8] (Figure 1, left). This theory provides a parsimonious account of a number of features of dopamine responses in a range of learning tasks [9–12].
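As an illustration of the prediction error described above, the following minimal sketch applies the standard temporal-difference update, δ = r + γV(s′) − V(s), to a toy cue-then-reward sequence. The state names (`cue`, `delay`, `end`), the learning rate and the discount factor are assumptions chosen for illustration only; they are not taken from any of the reviewed studies.

```python
def td_error(reward, value_next, value_current, gamma=0.95):
    """Temporal-difference prediction error: r + gamma * V(next) - V(current)."""
    return reward + gamma * value_next - value_current


def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.95):
    """Nudge the cached value of `state` toward its TD target; return the error."""
    delta = td_error(reward, values[next_state], values[state], gamma)
    values[state] += alpha * delta
    return delta


# Hypothetical three-state trial: cue -> delay (no reward), delay -> end (reward).
values = {"cue": 0.0, "delay": 0.0, "end": 0.0}
reward_deltas = []
for _ in range(200):
    td_update(values, "cue", "delay", reward=0.0)
    reward_deltas.append(td_update(values, "delay", "end", reward=1.0))

print(values)                               # cached values after training
print(reward_deltas[0], reward_deltas[-1])  # error at reward: first vs. last trial
```

In this sketch the error at the time of reward is large on the first trial and shrinks toward zero as the reward becomes predicted, while the cue acquires a (discounted) cached value, which is the qualitative pattern the theory attributes to phasic dopamine responses.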

Figure 1.

Multiple dimensions of prediction in dopamine prediction errors. Consider a simple task in which a brief presentation of a light cue is repeatedly followed by a drop of vanilla milk after some fixed delay (middle). What would happen on a trial in which the light is followed by a drop of equally-preferred chocolate milk after a shorter delay? Model-free TDRL with a complete serial compound stimulus representation proposes that the cue triggers a discrete sequence of activity that represents sequential time points after the presentation of the cue (left; a number of neurons are depicted horizontally; their activity at different timepoints is portrayed vertically). At each timepoint, summation of this weighted representation produces a scalar estimate of future value (V), which dopamine neurons (DA) compare to obtained reward to compute a prediction error signal. The prediction error is then broadcast widely (red) and used to modify the weights for neurons that were recently active (circles on arrows). When an unexpectedly early, chocolate-flavored reward is delivered, the prediction error signals the difference in time-discounted value, and modifies the weights for the part of the representation that is active when the prediction error is signaled. In contrast, we propose that dopamine neurons have access to (and maybe aid in learning) dimensions of prediction other than scalar value, and these are used for computation and signaling of prediction errors (right). For example, after the presentation of the cue, multiple features of the predicted next event (in this case, a liquid reward) may be represented by (perhaps overlapping) populations of neurons through time (color gradient), including the predicted amount (for example, one drop), the delay to reward delivery (it will arrive after several seconds) and the flavor of the reward (vanilla milk). 
At the time of reward delivery, violations of the prediction along any of these dimensions may elicit a phasic response from dopamine neurons, though different neurons may be specialized for prediction errors corresponding to different dimensions. In this case, at the early presentation of a drop of chocolate milk, prediction errors are elicited for the timing of reward delivery as well as for flavor (red) but no prediction error arises for amount (black).

Are model-free dopamine prediction errors a red herring?

A core tenet of TDRL is that it is ‘model-free’: learned state values are aggregate, scalar representations of total future expected reward, in some common currency [1,13]. That is, the value of a state is a quantitative summary of future reward amount, irrespective of either the specific form of the expected reward (e.g., water, food, a combination of the two), or the sequence of future states through which it will be obtained (e.g., will water be presented before or after food). Critically, model-free TDRL assigns these summed values to temporally-defined states; accordingly, the algorithm binds together predictions about the amount of reward and the expected time of delivery (Figure 1). In many studies, dopamine signals appear to reflect such temporally-precise, unitary value expectations, which also correlate with conditioned responding and choice preferences [14,15]. However, little work has tested this strong hypothesis directly, by, for instance, having a single cue predict several rewards of different types within a single trial, or by testing the effects of changes in type of reward on dopamine signaling, while keeping the reward value constant.

Another important feature of model-free learning (including TDRL) is that it posits that scalar state values are accrued solely through experiencing the relationship between the current state and the (possibly rewarded) state that follows [6,16]. That is, state values are learned through experience and ‘cached’ for future use. This is in contrast to model-based decision making [17], where values are computed anew each time a state is encountered by mentally simulating possibly distant futures using a learned internal ‘world model’, which captures the sequences of transitions between non-adjacent states and their associated rewards (but see below for some more nuanced distinctions).

Although phasic dopamine signals have predominantly been interpreted as model-free temporal difference prediction errors, a growing number of studies leveraging complex behavioral tasks, alongside novel optogenetic and imaging techniques, are revealing an increasingly detailed picture of dopamine reward prediction errors during learning, and the multiple dimensions of reward prediction on which they are based. Intriguingly, several of these studies have demonstrated a significant degree of heterogeneity in dopaminergic responses during learning, suggesting greater complexity in these signals than previously appreciated. Below we review evidence from these recent studies, asking what is the nature of dopamine signals? Do they reflect an aggregate (scalar) error, or a vector-based signal that includes not only the magnitude of deviation from predictions, but also the identity of the deviation (did I get more food than expected, or water instead of food)? And how might these signals be incorporated into learning algorithms implemented throughout the brain?

Temporal representation and dopamine

One notable property of dopamine prediction errors is that they are temporally precise: if an expected reward is omitted, the phasic decrease in dopamine neuron activity appears just after the time the reward would have occurred [2]. It is this phenomenon that inspired the TDRL algorithm, which models such temporally precise predictions by postulating sequences of time-point states that are triggered by a stimulus (known as the ‘complete serial compound,’ CSC stimulus representation, or ‘tapped delay line’; Figure 1), each of which separately accrues value through experience [6]. However, when a reward is delivered unexpectedly early, dopamine neurons do not display a phasic decrease in activity at the original expected time of reward, as would be implied by the CSC, in which a prediction error updates the value of the current, and not subsequent, timepoint states [18,19]. Reset mechanisms, in which reward delivery terminates the CSC representation, have been proposed to address this [19], but other challenges suggest that the CSC is perhaps not as viable an explanation for learned timing. Specifically, prediction errors are only slightly enhanced to temporally variable rewards, suggesting that under some conditions reward predictions may have low temporal precision [20]. In addition, multiple studies in humans (first inspired by [21]) have shown that a not-fully-predicted reward (or reward omission) affects choice of its related cue on the very next trial, suggesting that the CSC includes only a single time-point, which in turn leaves unexplained how the timing of reward (relative to stimulus onset) is learned.
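The early-reward problem for the CSC can be made concrete with a small simulation. The sketch below is an assumption-laden toy, not a reproduction of any published model: each time step after cue onset has its own weight, rewards are of size 1, and the `reset_on_reward` flag is a hypothetical stand-in for the proposed reset mechanisms, implemented here simply by ending the time-step sequence once reward arrives.

```python
import numpy as np


def run_trial(weights, reward_time, gamma=0.98, alpha=0.2,
              learn=True, reset_on_reward=False):
    """One trial of TD learning over a CSC: one weight (value) per time step."""
    n_steps = len(weights)
    deltas = np.zeros(n_steps)
    for t in range(n_steps):
        reward = 1.0 if t == reward_time else 0.0
        rewarded = reward > 0
        # With a reset, reward delivery terminates the representation,
        # so no later time-step state follows the rewarded one.
        if t + 1 < n_steps and not (reset_on_reward and rewarded):
            value_next = weights[t + 1]
        else:
            value_next = 0.0
        deltas[t] = reward + gamma * value_next - weights[t]
        if learn:
            weights[t] += alpha * deltas[t]
        if reset_on_reward and rewarded:
            break
    return deltas


# Train with reward always delivered at time step 8 after cue onset.
weights = np.zeros(12)
for _ in range(500):
    run_trial(weights, reward_time=8)

# Probe trials: reward arrives early (step 4); weights are not updated.
early_no_reset = run_trial(weights.copy(), reward_time=4, learn=False)
early_reset = run_trial(weights.copy(), reward_time=4, learn=False,
                        reset_on_reward=True)

print(early_no_reset[[4, 8]])  # positive error at 4, negative error at 8
print(early_reset[[4, 8]])     # positive error at 4, no error at 8
```

Without the reset, the plain CSC produces a negative error at the originally expected reward time, which is the dip that dopamine neurons do not show; with the reset, that error is never generated.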

An alternative is to allow task states to persist for learned durations (formally, a ‘semi-Markov’ framework), with reward predictions tied to a temporally-evolving belief about the current latent state. Learning values for latent states, rather than cues, incorporates a rich world model, and suggests that prediction error signaling is ‘gated’ by inference about when one state has transitioned to another [19,22]. Recent work has directly demonstrated that dopamine reward prediction errors are consistent with this framework [23]. Here, when a cue predicted reward delivery with an unknown (but capped) delay, the passage of time since cue onset made reward delivery more likely, eliciting smaller dopamine prediction errors to later rewards. In contrast, when reward delivery was probabilistic, as time passed it became more likely that the trial would not be rewarded, and indeed dopamine responses increased with reward delay. Consistent with this theory, other studies have shown that dopamine activity reflects evolving temporal predictions, suggesting at the very least that inference about the timing of events (e.g., the hazard rate) influences the computation of dopamine reward prediction errors [20,24–26]. More broadly, optogenetic manipulation of midbrain dopamine activity is sufficient to bidirectionally change judgments on a temporal categorization task [25], directly implicating dopamine signaling in timing processes. It also appears that the generation of prediction errors due to mistimed reward delivery is neurally separable from computing prediction errors due to an unexpected amount of reward, as ventral striatum lesions abolish the former (so a mistimed reward does not elicit a prediction error signal) while leaving prediction errors due to reward magnitude intact. 
This finding argues against the time-bound representation of value in the CSC representation, suggesting instead a semi-Markov model in which the duration of states and the amount of reward associated with each state are separately learned, and the ventral striatum plays a key role in learning or representing the former, but not necessarily the latter [22].
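The two temporal inferences contrasted above (reward becoming more likely as time passes when it is certain to arrive, versus the trial becoming less likely to be rewarded when delivery is probabilistic) follow from a simple application of Bayes' rule. The sketch below is illustrative only: the function name, the discrete uniform delay distribution and the 90% reward probability are assumptions, and the mapping from these quantities to actual prediction errors is left to the full belief-state models cited in the text.

```python
import numpy as np


def temporal_beliefs(delay_probs, p_reward):
    """Beliefs at each candidate reward time, given no reward has arrived yet.

    Returns (belief_rewarded, hazard):
      belief_rewarded[t]: probability that this is a rewarded trial
      hazard[t]: probability that reward arrives at time t
    """
    delay_probs = np.asarray(delay_probs, dtype=float)
    delay_probs = delay_probs / delay_probs.sum()
    # P(delay >= t | rewarded trial): reward has not arrived before time t.
    survival = 1.0 - np.concatenate(([0.0], np.cumsum(delay_probs)[:-1]))
    # P(no reward before t), pooling rewarded and unrewarded trials.
    still_waiting = p_reward * survival + (1.0 - p_reward)
    belief_rewarded = p_reward * survival / still_waiting
    hazard = p_reward * delay_probs / still_waiting
    return belief_rewarded, hazard


uniform_delays = np.ones(5)  # five equally likely reward times (assumed)

belief_certain, hazard_certain = temporal_beliefs(uniform_delays, p_reward=1.0)
belief_prob, hazard_prob = temporal_beliefs(uniform_delays, p_reward=0.9)

print(hazard_certain)  # rises with elapsed time when reward is certain
print(belief_prob)     # falls with elapsed time when reward is probabilistic
```

When reward is certain, the hazard rate climbs to 1 at the last possible delivery time, so later rewards are better predicted; when reward is probabilistic, the belief that the trial is rewarded decays with waiting, so a late reward is increasingly surprising.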

In general, it is often implicitly assumed that states correspond directly to percepts of cues in the environment [27,28]. However, apart from the challenges that timing poses to such an account, even straightforward neural representations of the environment are an interpretation of the external reality through, at minimum, a relevance filter [29,30]. It is therefore natural to extend TDRL models by allowing expected value to be calculated with respect to inferred states that capture the learned structure of a task [17,31–34]. The mapping between observations (such as cues and rewards) and underlying task states may be probabilistic (as in ‘partially observable environments’) or ambiguous (for example in the case of conflicting or mixed cues) [19,35–37], making state inference itself a non-trivial process. However, it is important to keep in mind that both model-free and model-based values can be learned/computed for states that do not correspond directly to observable cues. Prediction errors based on inferred states are not, in and of themselves, a departure from model-free TDRL, since at the time the errors are generated, they may still be based on cached values attached to the hidden states through direct experience.

Not all dopaminergic predictions are learned through direct experience

Indeed, a central aspect of TDRL that makes it model free is that, in the algorithm, values for states are learned (and cached) through direct experience with those states. Recent work suggests, however, that phasic dopamine may reflect values that have been learned indirectly. Of particular relevance is a sensory preconditioning experiment showing that reward predictions that are ascribed to a cue solely through its relationship to another neutral cue are reflected in dopamine neuron firing. Here, two neutral cues (A and B) were first presented in sequence multiple times (A→B), and then one of the cues, B, was paired with food in a separate training session. Behaviorally, this later training is known to endow cue A with reward-predictive value. Importantly, the authors showed that after B→food training, the presentation of cue A elicited a phasic increase in dopamine, which was correlated with activity elicited by presentation of cue B. This suggests that the expectation that A would lead to reward, presumably computed through model-based forward simulation of A→B and B→food, was available to dopamine neurons [38].
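The logic of the sensory preconditioning design can be sketched by comparing cached (model-free) values with values computed by forward simulation through a learned model. Everything in the sketch is a simplifying assumption for illustration: transitions are deterministic, food delivers a reward of 1, there is no discounting, and the function and variable names are hypothetical.

```python
def cached_values(phases, alpha=0.5, gamma=1.0, n_repeats=20):
    """Model-free TD: values change only for states that are directly experienced."""
    values = {}
    for phase in phases:
        for _ in range(n_repeats):
            for state, next_state, reward in phase:
                current = values.get(state, 0.0)
                delta = reward + gamma * values.get(next_state, 0.0) - current
                values[state] = current + alpha * delta
    return values


def learn_model(phases):
    """Record the observed (deterministic) transitions and their rewards."""
    transitions, rewards = {}, {}
    for phase in phases:
        for state, next_state, reward in phase:
            transitions[state] = next_state
            rewards[(state, next_state)] = reward
    return transitions, rewards


def model_based_value(state, transitions, rewards, gamma=1.0, depth=5):
    """Compute value anew by simulating forward through the learned model."""
    if depth == 0 or state not in transitions:
        return 0.0
    next_state = transitions[state]
    future = model_based_value(next_state, transitions, rewards, gamma, depth - 1)
    return rewards[(state, next_state)] + gamma * future


preconditioning = [("A", "B", 0.0)]   # A -> B, no reward
conditioning = [("B", "food", 1.0)]   # B -> food, in a separate session
phases = [preconditioning, conditioning]

cached = cached_values(phases)
transitions, rewards = learn_model(phases)

print(cached["A"], cached["B"])                        # A has no cached value
print(model_based_value("A", transitions, rewards))    # A -> B -> food
```

Because A is never experienced after B acquires value, its cached value stays at zero, whereas chaining A→B and B→food through the model yields a reward expectation for A at test.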

Notably, TDRL has no mechanism by which value can transfer between predictive cues retrospectively. Attempts have been made to explain these results by enhancing TDRL to operate not only on the current state, but also on states that are inferred to be related to the current state (a departure from pure model-free reinforcement learning), as in ‘mediated learning’ [39,40] or the Kalman TD model [32,41]. These explanations suggest that during the pairing of B with food, a neural representation of A is activated by association to B, and therefore also associated with the food. However, if the orbitofrontal cortex (OFC), an area associated with model-based computing of values, is inactivated at test, responding to A is abolished, while responding to B is intact [42]. Given that OFC has been repeatedly shown to be unnecessary for conditioned responding to cues directly paired with reward (for example, cue B in this experiment), this result strongly suggests that the value of A is computed in OFC at the time of the test and not during the B→food training. That dopamine prediction errors may reflect this computed-on-the-fly value is also consistent with accumulating evidence from fMRI showing that prediction error signals include model-based information and that model-based decisions are sensitive to striatal dopamine [43–45].

We note that even if model-based values are used to compute prediction errors, the error itself may still influence only model-free learning, for instance of a behavioral policy [46]. Indeed, it is possible that at test A invokes a model-based representation of the inferred B, the cached value of which is available to dopamine neurons. Under this scenario, the prediction error signaled to A arises from the cached value of B not A [47]. It is also important to note that adding inferred states and access to model-based values does not (yet) require that dopamine convey a prediction error signal that is used for learning the model itself. However, optogenetic silencing in a related task shows that dopamine transients are in fact required for the initial formation of associations between cues A and B, even though no rewards were present, and therefore learning in that phase could not have been driven by scalar prediction errors [48].

Multiple dimensions of prediction in dopamine responses

Another fundamental property of TDRL is that it learns aggregate, scalar predictions of the sum of future rewards predicated on occupying the current state—a ‘common currency’ value that sums over apples, oranges, sex and sleep. As alluded to above, and complicating the mapping between dopamine and TDRL even further, it appears that dopamine neurons respond to deviations from predictions in dimensions other than scalar value [49]. In particular, prediction errors have been recorded for an unexpected change in the flavor of reward pellets, even though there was no change in their subjective value [50]. Such “state prediction errors,” that is, prediction errors due to an unexpected state (“I got chocolate milk rather than vanilla”), suggest that the identity of the outcome is a component of reward prediction in dopamine circuits, at odds with the model-free framework that explicitly ignores specific identities and compares values in common currency. Information about outcome identity may reflect inputs from the orbitofrontal cortex [51] which track multiple specific features of outcomes beyond reward amount [52,53].

Model-based learning with dopamine prediction errors

All told, current findings suggest that dopamine neurons have access to model-based representations of expected rewards that reflect learned properties beyond a scalar representation of value (Figure 1, right). However, the convergence of TDRL to a useful value representation stems from the alignment between the computational goal of the agent (to maximize total reward through value-guided action) and the single dimension along which reward predictions are represented (i.e. scalar value). Unless used judiciously, a generalized prediction error signal [54] that responds to any mismatch along multiple dimensions of an outcome (e.g., the color of a reward, or the oddly shaped plate it was served in) might erroneously perturb value representations upon which choices are putatively based, biasing the animal away from the normative goal (for example, towards preferring low-quality food served in ever-changing plates, rather than high-quality food served in more mundane dinnerware). Such biases have indeed been identified in the influence of novelty and information on both dopamine reward prediction errors and value-guided choice [55,56], but it is unclear how widespread they are.

Indeed, to be truly useful for learning a world model, ‘model-based prediction errors’ must be computed for every aspect of the model in parallel—a multidimensional (i.e., vector) prediction error that signals not only that there is a mismatch between expectation and reality, but exactly what dimension of prediction was misaligned [34,57,58]. Do dopamine neurons signal such model-based prediction errors? If so, ideally, these would be broadcast in parallel so that the correct component of the model might be updated via its respective prediction error [19,22] (Figure 1, right). This would allow a segregation of learning across different dimensions of reward prediction such as value, state identity, or time, supported by separable neural populations. Such segregation might account for the distinct pattern of prediction-error signaling in dopamine terminals across striatal subregions [59,60], and might be a more prominent feature of dopamine activity than previously detected, in part due to a sampling bias whereby experiments investigating dopamine signaling have almost exclusively manipulated reward value, not other state dimensions.
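A multidimensional prediction error of the kind discussed above can be sketched with the Figure 1 example, in which vanilla milk is expected after a fixed delay but an equal amount of chocolate milk arrives early. The dimension names, the numerical delays, and the coding of an identity mismatch as a 0/1 indicator are all assumptions made for illustration; how the brain might represent or combine such errors is exactly the open question raised in the text.

```python
def vector_prediction_error(predicted, observed):
    """Per-dimension mismatch between a predicted and an observed outcome.

    Numeric dimensions give a signed difference (observed - predicted);
    categorical dimensions give 1.0 for a mismatch and 0.0 for a match
    (an illustrative coding choice, not an established convention).
    """
    errors = {}
    for dimension, expected in predicted.items():
        actual = observed[dimension]
        if isinstance(expected, (int, float)):
            errors[dimension] = float(actual - expected)
        else:
            errors[dimension] = 0.0 if actual == expected else 1.0
    return errors


# Hypothetical outcome predictions after the light cue (cf. Figure 1).
predicted = {"amount_drops": 1, "delay_s": 5.0, "flavor": "vanilla"}
observed = {"amount_drops": 1, "delay_s": 3.0, "flavor": "chocolate"}

errors = vector_prediction_error(predicted, observed)
violated = [dimension for dimension, error in errors.items() if error != 0.0]

print(errors)    # no error for amount; errors for delay and flavor
print(violated)  # only these components of the model would need updating
```

Keeping the errors separate by dimension is what allows each component of the model to be updated by its own error; summing them into a single number would discard the information about which prediction was violated.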

Moreover, because much of what we know about dopamine activity is derived from the analysis of activity of individual neurons or localized dopamine release or from techniques that average these signals over large populations, we may be missing more complex spatiotemporal and network interactions that can only be uncovered by treating these neurons as ensembles with unique input and output relationships. For example, target regions that receive, and learn from, dopamine prediction error signals might locally separate the incoming signal into distinct components, allowing the relevant dimensions of prediction to be flexibly decoded, depending on the current task and internal goals. Consistent with this possibility, cholinergic signaling in the striatum is known to powerfully modulate dopamine release [61,62], implying local circuit control over the influence of dopamine signals according to the current state of the task [63,64]. However, exactly how a truly multiplexed prediction error could be separated into its orthogonal components is not trivial, to say the least.

So what is the role of dopamine in learning?

One thing that these recent studies make clear is that a better understanding of the computational role of dopamine entails a broader consideration of what it means for a reinforcement learning algorithm to be ‘model-based’ [34]. Model-based prediction in RL has been most strongly identified with the use of models for forward planning, enabling values to be computed on the fly (as opposed to cached) in order to flexibly support goal-directed behavior [65]. But models may also be exploited to enable learning over hidden states, for example in algorithms that combine inference with TDRL [36,66]. Indeed, the necessity to represent states through time, either by a CSC or other, more complex state representation [67,68], can be thought of as a model of the past—and now unobservable—state of the environment. Overall, the dopaminergic signatures of model-based prediction we have highlighted draw attention to the question of what is being learned about—while a relatively straightforward stimulus representation may be evident to an experimenter, such a representation may not form the basis of learning for a behaving animal in more complex tasks [66].

The suggestion that dopamine signals a multidimensional model-based prediction-error signal departs considerably from the claim (and supporting evidence) that all dopamine neurons broadcast a single, scalar quantity across vast areas of the brain. But it is hard to see how lumping together all model-based prediction errors into one aggregate signal would be useful for downstream learning, unless we modify what we think the prediction error does downstream. One possibility is that the dopamine prediction-error signal enhances learning in target areas indiscriminately, without signaling the direction of learning (similar to a salience signal, in the service of learning rather than action), and that information about what exact prediction was violated is available from other sources. Indeed, sensory and associative areas that have a detailed representation of the current state (including all cue and reward properties deemed relevant to the task) may be in the best position to know exactly in what ways this state is unexpected. Unfortunately, this re-envisioning of the role of phasic dopamine signals would not explain why some prediction errors, namely those to reward omission, are signaled by pauses in firing. Multiplexing of model-free scalar prediction errors and model-based multidimensional prediction errors may be the answer, but only future experiments directly testing for the existence of several of these errors at once will tell. In any case, what is becoming clear is that phasic dopamine signals, until recently a beacon of computationally-interpretable brain activity, may not be as simple as we once hoped they were.


  • Recent work shows that dopamine reward prediction error signals reflect model-based information.

  • These model-based predictions rely on complex internal representations of multiple dimensions of the expected outcome, including reward identity, delay, and variability.

  • We review recent work establishing the role of dopamine in model-based learning, with a focus on computational implications for how dopamine signals influence learning in the brain.


This work was funded by grant R01DA042065 from the National Institute on Drug Abuse (AJL, YN), grant W911NF-14-1-0101 from the Army Research Office (YN, MJS), an NHMRC CJ Martin fellowship (MJS), and the Intramural Research Program at the National Institute on Drug Abuse (ZIA-DA000587) (MJS, GS). The opinions expressed in this article are the authors’ own and do not reflect the view of the NIH/DHHS. The authors have no conflicts of interest to report.


