Hum Brain Mapp. 2006 Jun 15;28(4):294–302. doi: 10.1002/hbm.20274

Figure 1.

The standard tapped-delay-line assumption effectively yields a set of states for representing a trial. Each state is given a unique index (written above the circles), and each state maintains a value estimate (inside each circle). The value estimate is central to all temporal difference methods and represents, for each state, the expected future reward from that state. Each state has an intrinsic reward value, r, which is supplied by the environment when that state is activated. The reward value of the US state was set to 1 or −1, while the reward value of all other states was set to 0. Each time a state is entered, a prediction error signal is generated. This signal, which is positive for “better than expected” and negative for “worse than expected,” is then used to update the value estimate of the preceding state. The value estimate of each state is updated once per trial, and over successive trials the prediction error signal effectively propagates backward from the US to the CS via the intermediate states. As learning progresses, the value estimates represent with increasing accuracy the future reward associated with each state. An implicit assumption is that the appropriate state is recognized from the current position within the current trial type. The above representation was maintained separately for the three CS types (CSAppetitive, CSAversive, and CSNeutral).
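The per-state update described above corresponds to the standard TD(0) rule, δ_t = r_t + γV(s_{t+1}) − V(s_t) followed by V(s_t) ← V(s_t) + αδ_t. Below is a minimal sketch of this scheme for a single CS type; the number of states, the learning rate α, and the discount γ are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of tapped-delay-line TD(0) learning for one CS type.
# N_STATES, ALPHA, and GAMMA are assumed values for illustration only.

N_STATES = 10          # states spanning CS onset ... US delivery
ALPHA = 0.1            # learning rate (assumed)
GAMMA = 1.0            # discount factor within a trial (assumed)
N_TRIALS = 200

# Intrinsic reward r for each state: only the final (US) state is rewarded.
# Set rewards[-1] = -1.0 instead to model the aversive CS type.
rewards = [0.0] * N_STATES
rewards[-1] = 1.0

# One value estimate per state (the numbers inside the circles in Fig. 1).
V = [0.0] * N_STATES

for trial in range(N_TRIALS):
    for t in range(N_STATES):
        # Value of the successor state; 0 once the trial has ended.
        v_next = V[t + 1] if t + 1 < N_STATES else 0.0
        # Prediction error: positive = "better than expected",
        # negative = "worse than expected".
        delta = rewards[t] + GAMMA * v_next - V[t]
        # Each state's estimate is updated once per trial.
        V[t] += ALPHA * delta

print([round(v, 2) for v in V])
```

Running the sketch shows the value estimate of the first (CS) state climbing toward 1 over successive trials as the prediction error propagates backward through the intermediate states, mirroring the account in the caption; a separate set of value estimates would be maintained for each of the three CS types.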