2008 Jul 9;2(1):86–99. doi: 10.3389/neuro.01.014.2008

Figure 1.

The basic Actor/Critic architecture and its suggested neural implementation. (A) The (external or internal) environment provides two signals to the system: S, indicating the current state or stimuli, and r, indicating the current reward. The Actor consists of a mapping between states S and action propensities π(a|S) (through modifiable weights or associative strengths). Its ultimate output is an action, which feeds back into the environment and serves to (possibly) earn rewards and change the state of the environment. The Critic consists of a mapping between states S and values V (also through modifiable weights). The value of the current state provides input to a temporal difference (TD) module that integrates the value of the current state, the value of the previous state (indicated by the feedback arrow) and the current reward to compute a prediction error signal $\delta_t = r(S_t) + V(S_t) - V(S_{t-1})$. This signal is used to modify the mappings in both the Actor and the Critic. (B) A suggested mapping of the Actor/Critic architecture onto neural substrates in the cortex and basal ganglia. The mapping between states and actions in the Actor is realized through plastic synapses between the cortex and the dorsolateral striatum. The mapping between states and their values is realized through similarly modifiable synaptic strengths in cortical projections to the ventral striatum. The prediction error is computed in the two midbrain dopaminergic nuclei, the ventral tegmental area (VTA) and the substantia nigra pars compacta (SNc), based on state values from ventral striatal afferents and on outcome information from sources such as the pedunculopontine nucleus (PPTN) and the habenula (Christoph et al., 1986; Ji and Shepard, 2007; Kobayashi and Okada, 2007; Matsumoto and Hikosaka, 2007).
Nigrostriatal and mesolimbic dopaminergic projections to the dorsolateral and ventral striatum, respectively, are used to modulate synaptic plasticity according to temporal difference learning.
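
The loop in panel (A) can be illustrated with a minimal tabular sketch in Python. Only the prediction error formula is taken from the caption; everything else is an assumption made for illustration: the two-step toy task, the softmax rule that turns action propensities into choice probabilities, the single learning rate shared by Actor and Critic, and all function and variable names.

```python
import math
import random

TERMINAL = 2      # hypothetical toy task: states 0 and 1 are decision states, 2 ends the trial
STAY, GO = 0, 1   # hypothetical action labels


def softmax(prefs):
    """Turn action propensities into choice probabilities pi(a|S) (assumed rule)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def td_error(reward, v_current, v_previous):
    """Prediction error from the caption: delta_t = r(S_t) + V(S_t) - V(S_{t-1})."""
    return reward + v_current - v_previous


def step(state, action):
    """Toy environment (an assumption): GO advances one state, and GO from
    state 1 ends the trial with reward 1; STAY ends the trial unrewarded."""
    if action == STAY:
        return TERMINAL, 0.0
    if state == 1:
        return TERMINAL, 1.0
    return state + 1, 0.0


def train(n_trials=500, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    values = [0.0, 0.0, 0.0]          # Critic: V(S); the terminal value is never updated
    prefs = [[0.0, 0.0], [0.0, 0.0]]  # Actor: one propensity per (state, action)
    for _ in range(n_trials):
        state = 0
        while state != TERMINAL:
            probs = softmax(prefs[state])
            action = rng.choices([STAY, GO], weights=probs)[0]
            next_state, reward = step(state, action)
            # The same error signal trains both mappings, as in the figure.
            delta = td_error(reward, values[next_state], values[state])
            values[state] += learning_rate * delta         # Critic update
            prefs[state][action] += learning_rate * delta  # Actor update
            state = next_state
    return values, prefs


if __name__ == "__main__":
    values, prefs = train()
    print("state values:", values)
    print("action propensities:", prefs)
```

In this sketch a positive prediction error strengthens both the value of the state just left and the propensity of the action just taken, so the rewarded GO action comes to be preferred in both decision states. The update is undiscounted because the caption's formula contains no discount factor.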