The basal ganglia (BG), and in particular the striatum, are central to theories of behavioral control and are often identified as a seat of action selection. Reinforcement learning (RL) models, which have driven much recent experimental work on this region, cast striatum as a dynamic controller, integrating sensory and motivational information to construct efficient and enriching behavioral policies. Befitting this informationally central role, the BG sit at the nexus of multiple anatomical “loops” of synaptic projections, connecting a wide range of cortical and sub-cortical structures. Numerous pioneering anatomical studies conducted over the past several decades have meticulously catalogued these loops, and labeled them according to the inferred functions of the connected regions. The specific cotermina of the projections are highly localized to several different subregions of the striatum, leading to the suggestion that these subregions perform complementary but distinct functions. However, until recently, the dominant computational framework outlined only a bipartite, dorsal/ventral, division of striatum. We review recent computational and experimental advances that argue for a more finely fractionated delineation. In particular, experimental data provide extensive insight into the unique functions subserved by the dorsomedial striatum (DMS). These functions appear to correspond well to theories of a “model-based” RL subunit, and may also shed light on the suborganization of ventral striatum. Finally, we discuss the limitations of these ideas and how they point the way toward future refinements of neurocomputational theories of striatal function, bringing them into contact with other areas of computational theory and other regions of the brain.
Keywords: Reinforcement learning, Basal Ganglia, Action Selection, Model-based learning, Goal-directed, POMDP
Perhaps more than for any other brain area, recent advances in understanding the basal ganglia (BG) have been driven by computational models. This is largely because the core functions commonly ascribed to the BG (action selection and value learning) have been the subject of intensive study in both economics and computer science, particularly the subfield of artificial intelligence known as reinforcement learning (RL) [1]. Theories from these areas propose mathematical definitions for quantities relevant to these functions and step-by-step procedures for computing them. Accordingly, these models have rapidly progressed from general frameworks for interpreting data toward playing a more integral quantitative role in experimental design and analysis, and now often serve as explicit hypotheses about trial-by-trial fluctuations in biological signals, such as action potentials or blood oxygenation level dependent (BOLD) signals. The poster children for this approach are the influential, albeit controversial, temporal difference (TD) learning models, which describe a reward prediction error (RPE) signal that has proved a strong match to phasic firing of midbrain dopamine neurons as well as BOLD in the ventral striatum (VS) [2–7]. The present review considers recent work that has expanded upon this initial achievement, shedding further light on the computational and functional suborganization of striatum, and then considers questions raised by this work in light of other empirical data and computational modeling that, together, point the way for future work in this area.
The suborganization of striatum
Although anatomical considerations such as the topographical gradient of afferents from cortex to striatum have long suggested considerable functional heterogeneity (for instance, five distinct corticostriatal loops in one seminal review [8]), there have until recently been surprisingly few clear correlates of this presumptive suborganization in unit recordings or functional neuroimaging within BG. In parallel, there have been few functional subdivisions suggested among the dominant computational models to motivate or guide the search for such prolific variegation.
In RL models, the primary early progress on this question was the functional breakdown between learning to predict rewards (“critic”) and, guided by these predictions, learning advantageous stimulus-action policies (“actor”) [2]. This was suggested [3] to correspond to a two-module breakdown in striatum between motoric actor functions dorsally and evaluative critic functions ventrally, an idea which resonates with data both from rodent lesions [9, 10] and human functional neuroimaging [11–13].
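To make the actor/critic division concrete, the following is a minimal sketch in which a single reward prediction error trains both modules. It is an illustration under stated assumptions rather than any specific published model: the one-state, two-action task, the reward probabilities, the learning rates, and the function names (`actor_critic_step`, `simulate`) are all hypothetical.

```python
import math
import random


def softmax(prefs):
    """Convert action preferences into choice probabilities."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(value, prefs, action, reward, alpha_critic=0.1, alpha_actor=0.1):
    """Apply one shared prediction error to the critic and to the actor."""
    delta = reward - value  # RPE; the trial ends here, so the next state's value is 0
    new_value = value + alpha_critic * delta  # critic: update the reward prediction
    new_prefs = list(prefs)
    new_prefs[action] += alpha_actor * delta  # actor: adjust the chosen action only
    return new_value, new_prefs, delta


def simulate(n_trials=2000, reward_prob=(0.2, 0.8), seed=0):
    """Hypothetical one-state, two-action task with probabilistic reward."""
    rng = random.Random(seed)
    value, prefs = 0.0, [0.0, 0.0]
    for _ in range(n_trials):
        p_first = softmax(prefs)[0]
        action = 0 if rng.random() < p_first else 1
        reward = 1.0 if rng.random() < reward_prob[action] else 0.0
        value, prefs, _ = actor_critic_step(value, prefs, action, reward)
    return value, prefs
```

In this sketch the critic's `value` plays the evaluative role attributed to ventral striatum and the actor's `prefs` the policy role attributed to dorsal striatum; both change only through the same `delta`.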
The actor/critic formalizes a classic psychological distinction between Pavlovian learning (about stimulus-outcome relationships), and instrumental learning (about which actions are advantageous). However, psychologists have long known that these functions are further decomposable. Notably, instrumental learning comprises subtypes that guide actions using different learned representations: “habitual” actions based on stimulus-response associations, vs. “goal-directed” actions supported by representation of the particular goal (such as food) expected for an action [14, 15]. This distinction is typically probed using manipulations that alter action-outcome contingency or outcome value, such as reward devaluation, and then evaluating the subsequent effect on responding. After extensive training, behavior can become insensitive to these manipulations, suggesting an isolated reliance on stimulus-response representations, i.e. habits.
A raft of recent theoretical work exploits a parallel distinction in RL models [16–22]. TD theories, such as the actor/critic, correspond well to classical ideas about stimulus-response habits and how they are reinforced [23, 24], but these “model-free” algorithms cannot explain behavioral phenomena associated with goal-directed behavior such as devaluation sensitivity or latent learning [25]. However, an additional family of “model-based” RL algorithms describe how learning about the structure of an environment can be used to evaluate candidate actions online. This evaluation, typically implemented by some sort of simulation, inference or preplay using a forward model of the task (analogous to an action-outcome representation or cognitive map) [26–29], can plan new actions or re-evaluate old ones drawing on information other than simple reinforcement history.
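The contrast in devaluation sensitivity can be illustrated with a small sketch. It assumes a hypothetical two-action task (the action and outcome names are invented), and, to keep the example deterministic, the model-free learner is trained on expected rather than sampled rewards. The model-based evaluator recomputes action values from an action-outcome model and current outcome values, whereas the model-free learner consults only its cached values.

```python
def model_based_values(transitions, outcome_values):
    """Evaluate each action online from the action-outcome model."""
    return {
        action: sum(p * outcome_values[o] for o, p in outcomes.items())
        for action, outcomes in transitions.items()
    }


def train_model_free(transitions, outcome_values, n_sweeps=200, alpha=0.1):
    """Cache action values through incremental error-driven updates."""
    q = {action: 0.0 for action in transitions}
    for _ in range(n_sweeps):
        for action, outcomes in transitions.items():
            # Simplification: learn from the expected reward, not a sampled one.
            expected = sum(p * outcome_values[o] for o, p in outcomes.items())
            q[action] += alpha * (expected - q[action])
    return q


# Hypothetical task: each action reliably yields one outcome.
transitions = {"press_lever": {"food": 1.0}, "pull_chain": {"water": 1.0}}
outcome_values = {"food": 1.0, "water": 0.5}

q_cached = train_model_free(transitions, outcome_values)

# Devalue the food outcome (e.g. by satiety) with no further instrumental training.
devalued = dict(outcome_values, food=0.0)
q_planned = model_based_values(transitions, devalued)

habit_choice = max(q_cached, key=q_cached.get)
goal_directed_choice = max(q_planned, key=q_planned.get)
```

After devaluation the cached values still favor the lever, while the values recomputed from the model favor the chain, which is the behavioral signature used to separate habitual from goal-directed control.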
Much recent experimental research has been guided by the discovery, via lesions of rodent striatum, that these behaviors are supported by distinct subregions of dorsal striatum, respectively lateral and medial (DLS and DMS) [30–33]. The proposal that a model-free TD actor in DLS is accompanied by an additional model-based RL system in DMS has considerable promise. First, it preserves the substantial successes of the TD/dopamine models while correcting some of their shortcomings: for instance its redundancy of control may help to explain why lesions to dopaminergic nuclei do not prevent all instrumental learning [34]. Conversely, just as TD models have helped to shed light on dopaminergic habit mechanisms, model-based RL may provide a framework for understanding neural mechanisms for goal-directed evaluation.
Such work is at a very early stage; although putative correlates of model-based computations have been reported throughout a very wide network [35–43], probably the most developed data thus far concern spiking correlates for both prospective locations (in hippocampus) and associated rewards (in ventral striatum), suggesting a circuit for model-based evaluation of candidate trajectories [26, 44, 45]. Complementary to such “preplay” phenomena, “replay” of neural sequences may play a similar, but offline, role in updating stored (e.g. model-free) value predictions [46–49], perhaps by sampling model-based trajectories estimated from a cognitive map [50].
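One way to picture an offline role for replay is a Dyna-style update, in which stored transitions from a learned model are re-presented to a model-free learner without any new experience. The sketch below is only an assumption-laden illustration of that idea: the two-step task, the state and action names, and the choice to sweep every stored transition (rather than sample trajectories) are hypothetical simplifications.

```python
def replay(q, model, n_sweeps=50, alpha=0.5, gamma=0.9):
    """Update cached values offline by replaying transitions stored in a model."""
    for _ in range(n_sweeps):
        for (state, action), (reward, next_state) in model.items():
            # Value of the best action at the successor; 0 if it is terminal.
            next_best = max(q.get(next_state, {}).values(), default=0.0)
            delta = reward + gamma * next_best - q[state][action]
            q[state][action] += alpha * delta
    return q


# Hypothetical learned model: A leads to B, and B leads to reward.
model = {("A", "go"): (0.0, "B"), ("B", "go"): (1.0, "end")}
q = {"A": {"go": 0.0}, "B": {"go": 0.0}}

replay(q, model)
```

Here the value of acting in state A rises toward the discounted reward even though the reward was never experienced immediately after A, showing how replay could propagate model knowledge into stored predictions.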
Theories of model-based and model-free RL in the basal ganglia envision parallel circuits. This is consistent with findings from lesion studies that the two learning processes appear to evolve side-by-side [30, 31], even though they tend to dominate behavior serially: progressing, with training, from goal-directed to habitual responding. Several features of DMS and DLS unit recordings appear to reflect these differential temporal dynamics, with task-related responsivity in DMS peaking early in training (and in retraining, following task changes) [51, 52]; as well as with changes in ensemble responses [53] and several measures of synaptic potentiation [54] peaking earlier in DMS than DLS.
These last results [54] resonate with at least two other predictions from RL models. The authors measured a greater concentration of dopamine D2 receptors on neurons in DLS (relative to DMS). That these receptors are uniquely sensitive to extrasynaptic tonic dopamine concentrations (which are considerably lower than those resulting from phasic bursts) supports a previously posited computational role for tonic dopamine in the modulation and motivational control of habitual expression [18]. Further, the D2-containing neurons (which are known to primarily project into the “indirect”, striatopallidal, pathway) were also a primary site of synaptic potentiation during behavioral training. This observation is consistent with a body of computational and experimental work suggesting that these receptors are involved in learning as well as expression, perhaps specifically in learning which actions to avoid [55, 56].
Questions and anomalies
At the same time, many of these studies point to three serious questions for the RL models: on their overall architecture, their mechanisms for learning, and how they are deployed during choice.
Architecture and the model-based critic
First, the basic project of rescuing model-free actor/critic theories of the dopamine system by augmenting them with a separate and parallel model-based RL system is challenged by a number of recent results suggesting that even areas associated with the putatively model-free critic (including ventral striatum [42, 43, 45], downstream ventral pallidum [39], and RPE units in the dopaminergic midbrain [59]) all show properties such as sensitivity to devaluation that are indicative of model-based RL and not easily explained by the standard model-free TD theories (see also [60]). These results suggest that the two hypothetical systems are either more interacting or hybrid than separate, consistent with the overlapping “loop” architecture suggested by anatomical studies [8, 61]. An intriguing alternate suggestion is that the ventral striatal critic also consists of dissociable subcomponents for model-based and model-free Pavlovian evaluation (Figure 1). Indeed, psychologists distinguish preparatory and consummatory forms of Pavlovian conditioning, which may involve distinct circuits in the core and shell of ventral striatum [33, 62–64]. This distinction again closely tracks that between model-based and model-free RL, in that consummatory Pavlovian responses reflect knowledge of the particular outcome expected (suggesting they are derived from predictions using a world model), whereas preparatory responses, like a model-free critic’s predictions, are not outcome-specific [65].
Figure 1. The dual-actor/critic framework.
The dorsal/ventral divide of the actor/critic model is extended to include recent theoretical and experimental advances supporting further functional subdivisions of each region (after Yin et al. [33]). These parallel circuits implement different approaches to reinforcement learning, either “model-free” (dark grey) or “model-based” (light grey).
The dorsal region is now divided medial/lateral (though a gradient may be more accurate [57]), with each division supporting a different “actor” submodule: the dorsolateral region, supporting a model-free actor; and the dorsomedial region, a substrate for representations that enable model-based planning. Further, current evidence suggests that the ventral region may itself be functionally subdivided, along the boundary between nucleus accumbens “core” and “shell”. These regions are each crucial for different forms of Pavlovian responding: preparatory and consummatory, respectively. Computationally, they correspond to model-free and model-based critics, computing the net present expected value of the current state using, respectively, either state-value mappings (purely based on reinforcement history) or state-outcome and outcome-value predictions derived from a world model.
Thus, model-based RL may offer a process-level description of striatal subregion function that encompasses both goal-directed instrumental and consummatory Pavlovian behaviors, paralleling the unification of habitual action and preparatory Pavlovian responses embodied in the original, model-free, actor/critic formulation.
Schematic coronal slice of rat striatum modified with permission from Paxinos and Watson [58].
Learning and hierarchical RL
A related issue is that the considerable algorithmic differences between model-based and model-free RL approaches seem poorly matched to the basic isomorphism of circuitry between different parts of the BG [8]. While the model-free actor and critic both learn from the same error signal operating on different inputs — thought to be consistent with a similar dopaminergic input driving plasticity in both ventral and dorsolateral striatum — the representations used for model-based RL seem to require quite different teaching signals and learning rules [41], offering no obvious role, in these theories, for a dopaminergic RPE in DMS.
One possible direction for resolving this question arises from a somewhat different empirical and theoretical take on the function of DLS, in the “chunking” of behavioral sequences. DLS neurons (and, importantly, not those in DMS) tend with training to cluster their responsivity at the beginning and end of a trial [53], and lesions of DLS [66] (and of a prefrontal area that may be afferent to it [32]) also suggest a causal role in behavioral chunking. In RL, such chunking of actions into multiaction “options” is formalized by a family of “hierarchical” RL models [67]. Taken as an organizing principle for striatum (e.g., with policies operating on elemental actions represented in DMS and, moving laterally, policies on progressively more chunked options), this model has the appealing feature that all levels of the hierarchy learn from a similar (in this case, model-free) RPE signal and a common learning rule.
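The appeal of a common learning rule across levels can be sketched as follows. This is a hedged illustration in the spirit of option-based (semi-Markov) value learning, not the specific formalism of [67]: the state names, the option name `chunk`, and the parameter values are hypothetical. The same prediction error updates either a one-step elemental action or a multi-step option; only the length of the reward sequence differs.

```python
def td_update(q, state, choice, rewards, next_state, alpha=0.1, gamma=0.9):
    """One model-free RPE update for an elemental action or a multi-step option.

    `rewards` lists the rewards collected while the choice ran: a single entry
    for an elemental action, several entries for a chunked option.
    """
    k = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    # Best value available at the state where the choice terminated (0 if none).
    next_best = max(q[next_state].values()) if q.get(next_state) else 0.0
    delta = discounted + (gamma ** k) * next_best - q[state][choice]
    q[state][choice] += alpha * delta
    return delta


# Hypothetical values: one elemental action and one three-step option.
q = {"start": {"step": 0.0, "chunk": 0.0}, "goal": {}}

delta_action = td_update(q, "start", "step", [1.0], "goal")
delta_option = td_update(q, "start", "chunk", [0.0, 0.0, 1.0], "goal", gamma=0.5)
```

Both calls use the identical rule, which is the feature that makes this scheme attractive as an organizing principle for circuitry that looks similar across striatal subregions.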
However, although there appears to be some informal resonance between a stimulus-response habit and an automatized behavioral sequence, in RL, the inclusion of options generally crosscuts the distinction between model-based and model-free evaluative strategies. Since the latter distinction has been used to explain the signature phenomena (such as devaluation sensitivity) tying DMS and DLS to goal-directed and habitual instrumental behaviors, additional theoretical work will be needed to understand if these two approaches can be blended, e.g. using model-based hierarchical RL, to formalize both chunking and devaluation phenomena together.
Arbitration and choice under uncertainty
A third major question raised by theories involving multiple, parallel reinforcement learners is how the brain arbitrates between the two systems’ choices. One theoretical proposal is that the predictions of model-free and model-based reinforcement learners may be competitively combined based on the uncertainty about their predictions [17]. In population code representations, uncertainty may be carried by the entropy of neuronal firing across the population [68, 69]. Indeed, over training Thorn et al. [53] observed differential modulations in population entropy in DLS and DMS, with the DMS representation becoming structured more quickly but ultimately overtaken in this measure by DLS.
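As a rough illustration of uncertainty-based arbitration, the sketch below combines two value estimates by weighting each according to its precision (inverse variance). This is a generic cue-combination rule chosen for simplicity and should not be read as the exact scheme proposed in [17]; the function name `arbitrate` and the numerical values are hypothetical.

```python
def arbitrate(mf_value, mf_var, mb_value, mb_var):
    """Precision-weighted combination of model-free and model-based estimates.

    Returns the combined value and the weight placed on the model-based system.
    """
    # Inverse-variance weighting: the less certain system contributes less.
    w_mb = mf_var / (mf_var + mb_var)
    combined = w_mb * mb_value + (1.0 - w_mb) * mf_value
    return combined, w_mb


# Hypothetical early training: cached (model-free) values are still unreliable.
early_value, early_w_mb = arbitrate(mf_value=1.0, mf_var=4.0, mb_value=0.0, mb_var=1.0)

# Hypothetical late training: cached values have become precise.
late_value, late_w_mb = arbitrate(mf_value=1.0, mf_var=0.25, mb_value=0.0, mb_var=1.0)
```

Under these assumed variances the model-based estimate dominates early and the model-free estimate dominates late, mirroring the progression from goal-directed to habitual control with training.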
Nevertheless, this theory has little to say about the more dynamic processes or mechanisms by which the brain combines these uncertain estimates. The accumulation of multiple noisy evidence sources has, however, been studied extensively in another heretofore largely distinct area of theoretical and experimental work on decision making about noisy sensory displays. Here, reaction times, errors, and ramping activity of neurons in posterior parietal cortex are famously captured by Bayesian models of the accumulation of evidence about stimulus identity [68, 70]. By these models, sensory decision regions interpret stimuli by drawing successive samples of noisy sensory input as represented in upstream sensory cortices (which may themselves incorporate a sensory prior that develops to match the distribution of natural stimuli [71]). This work is rapidly coming directly into contact with research on RL and the BG for a number of reasons.
For one, the success of these Bayesian sequential sampling models is not limited to purely sensory tasks involving the analysis of noisy percepts. Notably, they also capture human behavior in tasks involving more affective, value-driven choices, such as pricing or choosing between snack foods [20, 72, 73]. Thus, goal-directed valuation, too, may involve accumulating stochastic samples, here presumably drawn from memory rather than a noisy percept [74–77]. This implies a rather different mechanism for model-based evaluation than the more systematic tree search or Bayesian graph inference so far hypothesized [17, 28], though perhaps one not incompatible with the relatively noisy preplay phenomena observed neurally [45, 50, 78]. Such a procedure for computing model-based values could incorporate uncertainty-weighted habit information (e.g. as a prior), suggesting a dynamic solution to the arbitration problem. In all these respects, it is interesting that in a sensory decision task, primate caudate neurons display activity related to evidence accumulation not unlike that seen in parietal cortex [79].
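The idea that habit information might enter a sampling process as a prior can be sketched with a simple accumulate-to-bound model. This is a minimal random-walk illustration, assuming that samples of the value difference between two options are summed until a threshold is crossed; the function name, the bound, and treating the prior as a starting offset are all hypothetical simplifications rather than a fitted model.

```python
import random


def accumulate_to_bound(drift, noise_sd, bound=1.0, prior=0.0, max_steps=10_000, seed=0):
    """Sum noisy samples of a value difference until a decision bound is reached.

    `prior` shifts the starting point, standing in for a habitual bias toward
    option "A" (positive) or option "B" (negative).
    """
    rng = random.Random(seed)
    evidence = prior
    for step in range(1, max_steps + 1):
        evidence += drift + rng.gauss(0.0, noise_sd)  # one sample, e.g. from memory
        if evidence >= bound:
            return "A", step
        if evidence <= -bound:
            return "B", step
    return None, max_steps  # no decision within the allotted samples


# Noise-free runs make the effect of the prior easy to see.
choice_flat, steps_flat = accumulate_to_bound(drift=0.25, noise_sd=0.0)
choice_habit, steps_habit = accumulate_to_bound(drift=0.25, noise_sd=0.0, prior=0.5)
```

With a prior favoring the same option as the sampled evidence, fewer samples are needed before the bound is reached, which is one way a strong habit could shorten deliberation.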
These models, finally, speak to the BG’s connections with a broader anatomical and computational universe. Typical sensory and RL decision tasks exercise almost entirely complementary functions: in one case, analyzing a noisy sensory stimulus with the response rule well defined, and, in the other, figuring out which candidate response is most valuable with no perceptual uncertainty. There has been considerable interest in how the neurocomputational mechanisms that have been characterized for each function, separately, might interact in more complicated tasks involving both value learning and perceptual (or, in RL terms, “state”) uncertainty [80–83]. Models based on RL for so-called partially observable Markov decision processes (POMDP) [80, 82, 84] suggest that these two mechanisms can operate serially: a cortical perceptual inference module infers a distribution over possible stimuli and this serves as input for a BG RL module operating much as before. Such a model explains dopaminergic responses in a perceptual inference task as related to prediction error as the cortical model “figures out” whether the animal is facing an easy (likely rewarded) or hard trial [82, 85]. This approach may also provide a route toward explaining dopaminergic responses related to seeking information about future reward prospects [86].
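The serial arrangement described here can be sketched in two steps: a Bayesian update of the belief over hidden trial types, followed by a value computed as the belief-weighted average of state values. The example is an assumption-laden illustration rather than the models of [80, 82, 84]: the two hidden states, their values, and the observation likelihoods are invented numbers.

```python
def update_belief(prior, likelihoods):
    """Bayes' rule over hidden states, given the likelihood of one observation."""
    unnormalized = {s: prior[s] * likelihoods[s] for s in prior}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}


def belief_value(belief, state_values):
    """Expected value under the current belief over hidden states."""
    return sum(belief[s] * state_values[s] for s in belief)


# Hypothetical trial types: easy trials are more likely to end in reward.
state_values = {"easy": 1.0, "hard": 0.5}
prior = {"easy": 0.5, "hard": 0.5}

# Hypothetical percept that is four times more likely on easy trials.
posterior = update_belief(prior, {"easy": 0.8, "hard": 0.2})

# Prediction error produced purely by the change in belief, before any reward.
rpe = belief_value(posterior, state_values) - belief_value(prior, state_values)
```

In this sketch a positive prediction error arises as soon as the percept shifts the belief toward the easy trial type, which is the kind of inference-related signal the text attributes to dopaminergic responses in perceptual tasks.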
Last, since the perceptual system, on this view, must in general base its percepts on learning about the statistical structure of the task and percepts, this idea situates the RL circuit alongside substantial recent work on Bayesian models of learning latent structure [87–90]. More generally, such structure learning is a valuable component of an efficient model-based system. Indeed, in realistic RL tasks involving perceptual uncertainty, such latent structure learning is also an important component of learning the world model for model-based RL. Thus, exploring how inference guides the construction and employment of associative representations may ultimately provide a synthesis between cortical belief computations and model-based striatal RL.
The authors are supported by a Scholar Award from the McKnight Foundation, a NARSAD Young Investigator Award, Human Frontiers Science Program Grant RGP0036/2009-C, and NIMH grant 1R01MH087882-01, part of the CRCNS program. We thank Dylan Simon for helpful ideas on hierarchical RL, and Amitai Shenhav and Fenna Krienen for comments on an earlier version of this manuscript.
References and recommended reading
