Neural Computation. 2022 Feb 17;34(3):642–685. doi: 10.1162/neco_a_01475

Predicting the Future With a Scale-Invariant Temporal Memory for the Past

Wei Zhong Goh 1, Varun Ursekar 2, Marc W Howard 3
PMCID: PMC8944185  PMID: 35026027

Abstract

In recent years, it has become clear that the brain maintains a temporal memory of recent events stretching far into the past. This letter presents a neurally inspired algorithm to use a scale-invariant temporal representation of the past to predict a scale-invariant future. The result is a scale-invariant estimate of future events as a function of the time at which they are expected to occur. The algorithm is time-local, with credit assigned to the present event by observing how it affects the prediction of the future. To illustrate the potential utility of this approach, we test the model on simultaneous renewal processes with different timescales. The algorithm scales well on these problems despite the fact that the number of states needed to describe them as a Markov process grows exponentially.

1. Using Memory to Predict the Future

Reinforcement learning (RL) models that are designed for Markov processes (Watkins & Dayan, 1992; Sutton, 1988) have been extraordinarily successful in accounting for reward systems in the brain (Schultz, Dayan, & Montague, 1997; Waelti, Dickinson, & Schultz, 2001) and have led to remarkable achievements in artificial intelligence (Mnih et al., 2015; Silver et al., 2018). Despite the success of RL, its affinity for Markov statistics may be a serious limitation. The real world contains many distinct causes that predict their effects at a range of timescales, presenting a challenge for learners optimized for Markov statistics. Of course, random processes with memory can be turned into Markov processes at the cost of defining additional states. However, the cost in terms of memory, and time to learn transition probabilities among an exponentially growing number of states, may be excessive in some settings.

It has been proposed that a primary function of the mammalian brain is to predict future events to enable adaptive behavior (Clark, 2013; Friston, 2010). Evidence from neuroscience has made clear that the brain contains robust memory for the identity and time of recent events extending well into the past. For instance, sequentially activated time cells in the hippocampus, prefrontal cortex, and striatum (MacDonald, Lepage, Eden, & Eichenbaum, 2011; Tiganj, Cromer, Roy, Miller, & Howard, 2018; Mello, Soares, & Paton, 2015) maintain information about the time at which recent events were experienced over at least tens of seconds, and perhaps much longer. Experimental presentation of distinct stimuli triggers different sequences of time cells (Tiganj et al., 2018; Taxidis et al., 2020; Cruzado, Tiganj, Brincat, Miller, & Howard, 2020) so that these populations maintain information about what happened when. In addition to sequentially activated time cells, neurons in the entorhinal cortex (Tsao et al., 2018; Bright et al., 2020) and other cortical regions (Bernacchia, Seo, Lee, & Wang, 2011; Murray et al., 2017) carry temporal information via populations of neurons that respond with a spectrum of characteristic timescales, in some cases up to at least minutes (Tsao et al., 2018). This letter, inspired by work arguing that conditioning results from an attempt to learn temporal contingencies between stimuli (Balsam & Gallistel, 2009; Gallistel, Craig, & Shahan, 2019), presents a formal model that learns to predict the future given a temporal record of the past. This proposed mechanism is computable given a temporal history that can be translated in time and proposes a solution for how to estimate the future from a past that includes information about many past events.

This letter proceeds as follows. In the rest of this section, we review a model for retaining a record of past events and associations between event pairs. In section 2, we present the model for predicting the future given a temporal record of the past. In section 3, we discuss its computational complexity, timescale invariance, and several other properties. In section 4, we present a numerical demonstration of the efficacy of this algorithm. Finally, in section 5, we compare this algorithm to traditional RL algorithms and point out its connections to neuroscience.

1.1. A Formal Model for Temporal Record of the Past

We start with an agent capable of observing and remembering several types of events, such as the onset of a 440-Hz tone or the appearance of an image of an apple. In this section, we describe a model for its capabilities. We will see that the agent maintains a fuzzy time line of past events, which it uses to make pairwise associations between events. Neurobiological justification for this model is outlined in section A.1 of the appendix.

1.1.1. Events in Continuous Time

We assume that the world provides a series of discrete events that occur in continuous time. For simplicity, without loss of generality, suppose there are three types of events, which we call x, y, and z. Whenever we need to avoid confusion, we use event type to refer to a type of event and event episode to refer to an individual occurrence of an event. We encode the occurrence of the event type x as a signal fX(t), which is the sum of Dirac delta functions centered at the occurrence times of episodes of x (see Figure 1a). (We discuss quantities in relation to x; such statements hold analogously for y and z.) We call t, the argument for the signal fX(t), real time or external time, to emphasize that this time axis is a feature of the world rather than being constructed by the observer. We denote the collection of all three signals as f(t) and analogously for the quantities to follow. At every instant in (external) time t0, the agent has direct access to f(t0) (which is zero unless an event of interest occurs precisely at t0), but not to f at any other time. Signals are shown in Figure 1a in the case where x, y, and z occur at times 0, 1, and 2, respectively.

Figure 1:

Memory is a fuzzy representation of the signal up to the present. (a) Signal as a function of external time, for three event types, x, y, and z. This is the scenario considered in Figures 2 through 6. (b) Memory for a recent event as a function of internal past time, at varying (external) times since the event occurred. As a function of internal past time, peaks in the memory are present at approximately the time interval since the event.

1.1.2. Temporal Memory

At every instant in time t0, the agent's memory for x, denoted f~X(τ;t0), is a fuzzy representation of the signal up to the present, fX(t0-τ). From the agent's perspective, the internal past time, τ>0, indexes how long ago events in memory might have occurred. The degree of fuzziness of the memory varies inversely with a sharpness parameter k, which is typically a small even integer; throughout this letter, it is fixed at 8.

Consider an event that happens exactly at a particular time, τ0. At time τ0+t, the memory element for that event is given by f~(τ;τ0+t)=Φk(t/τ)/τ, where the fuzziness, Φk(·), is given by the dimensionless equation,

\Phi_k(x) = u(x)\,\kappa_0\, x^{k} e^{-kx},  (1.1)

where \kappa_0 = k^{k+1}/k! is a normalizing constant and u is the unit step function. Memories for a recent event are shown in Figure 1b for various values of t. For an arbitrary signal f, the associated memory up to time t is

\tilde{f}(\tau; t) = \frac{1}{\tau} \int_{-\infty}^{t} f(\tau')\, \Phi_k\!\left(\frac{t - \tau'}{\tau}\right) d\tau'.  (1.2)

(The origin for equations 1.1 and 1.2 is outlined in section A.1) In other words, the memory for an event type is the sum of the memory elements associated with each episode of that event type. On its face, equation 1.2 appears to assume that the agent has access to the infinite past of f(t). However, previous work has shown that f~(τ;t) can be efficiently and time-locally constructed from a set of leaky integrators with a spectrum of time constants (see section A.1; Shankar & Howard, 2013). Using this approach, the number of leaky integrators necessary to remember the past to some bound T goes up like logT. Previous papers (e.g., Shankar & Howard, 2013) using this formalism have made explicit use of this property in choosing to sample values of τ on a logarithmic scale and logarithmically compress the τ axis in integrals. In this letter, we do not logarithmically compress the τ axis in integrals. However, one may adopt an alternative interpretation, consistent with this letter, as follows. Within this alternative interpretation, the expressions for integrals over τ used below are logarithmically compressed (i.e., a factor of 1/τ is added to the integrands). At the same time, the prefactor of 1/τ is removed from equation 1.2. Neurally, this would amount to a statement that the peak firing rate of time cells triggered by a delta function is constant as a function of τ.
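
To make the construction concrete, here is a minimal numerical sketch (our own illustration, not the authors' code) of the kernel in equation 1.1 and the memory of a single impulse from equation 1.2; the log-spaced τ grid follows the logarithmic sampling discussed above, and all specific values are assumptions.

```python
import numpy as np
from math import factorial

k = 8                                  # sharpness parameter, fixed at 8 in the text
kappa0 = k ** (k + 1) / factorial(k)   # normalizing constant kappa_0 = k^{k+1}/k!

def phi(x):
    """Dimensionless kernel Phi_k(x) = u(x) kappa_0 x^k exp(-k x)   (eq. 1.1)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, kappa0 * x ** k * np.exp(-k * x), 0.0)

# Log-spaced internal past times tau; the number of units needed to reach a
# horizon T grows like log T, mirroring the leaky-integrator implementation.
tau = np.geomspace(0.1, 100.0, 64)

def memory_of_impulse(elapsed):
    """Memory f~(tau; t0 + elapsed) of a single event at time t0   (eq. 1.2)."""
    return phi(elapsed / tau) / tau

m = memory_of_impulse(5.0)
print(tau[np.argmax(m)])   # peaks near (slightly below) the elapsed time of 5
```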

The signal f up to any given external time t0 fixes the event occurrence history. However, due to the agent's fuzzy memory, the agent is only able to form a fuzzy subjective belief distribution about the event occurrence history leading up to the present. We may interpret the memory for x as the agent's subjective estimate of the instantaneous rate of occurrence of x at time t-τ. In other words, we have, for an infinitesimal time element dτ,

\tilde{f}_X(\tau; t)\, d\tau \approx P\!\left[\, x \,@\, t - \tau \,(d\tau) \,\right],  (1.3)

where P(·), the probability of an event, is used in the subjective Bayesian sense to describe the agent's belief, and “x@t-τ(dτ)” stands for “an episode of event x occurred within the infinitesimal time interval between t-τ and t-τ+dτ.” Since f~ gives the agent access to the identities of past events and the approximate times at which they might have happened, we describe f~(τ) as a time line of the past.

At each instant in time t, the agent is also able to compute the state of the memory a time interval δ into the future, assuming that no events of interest occur during that interval. For an arbitrary signal f, this quantity is given by

\tilde{f}_\delta(\tau; t) = \frac{1}{\tau} \int_{-\infty}^{t} f(\tau')\, \Phi_k\!\left(\frac{t + \delta - \tau'}{\tau}\right) d\tau'.  (1.4)

Translation can be efficiently implemented based on the set of leaky integrators. Prior work has shown that this can be done in a neurobiologically reasonable way (see section A.2; Shankar, Singh, & Howard, 2016).

1.1.3. Estimating Pairwise Time-Lagged Statistics

Many models of memory make use of associations between the temporal context describing the recent past and the currently available stimulus. The agent described here builds pairwise associations from x (the cue) to y (the outcome) as the average state of memory for x whenever y occurs, and analogously for other event pairs:

\Delta M_{XY}(\tau) \propto \tilde{f}_X(\tau; t)\, f_Y(t).  (1.5)

We denote the collection of pairwise associations between event pairs as M(τ), which may be thought of as an n×n matrix at every τ, where n is the number of possible events. We denote the collection of pairwise associations with x as the cue as MX(τ), which may be thought of as a vector with n elements, one for each possible outcome, at every τ.

Note that as a neural network, equation 1.5 simply requires Hebbian learning. At the end of learning, we normalize MX by the number of episodes of the cue x, ∫ fX(t) dt.
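
As a rough sketch of how this learning could be implemented in practice (our own illustration with assumed shapes and names, not the authors' code), the Hebbian accumulation of equation 1.5 and the normalization by cue count might look like this:

```python
import numpy as np

n = 3                                   # number of event types (x, y, z)
tau = np.geomspace(0.1, 100.0, 64)      # internal past times
M = np.zeros((n, n, tau.size))          # M[cue, outcome, tau]
cue_counts = np.zeros(n)                # number of episodes of each cue, int f_X(t) dt

def hebbian_update(M, f_tilde, occurring):
    """Delta M_XY(tau) ~ f~_X(tau; t) f_Y(t): whenever outcomes occur, add the
    current memory state of every cue (f_tilde, shape (n, len(tau)))."""
    for outcome in occurring:
        M[:, outcome, :] += f_tilde
    return M

# After training, normalize each cue's associations by its episode count.
M_normalized = M / np.maximum(cue_counts, 1.0)[:, None, None]
```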

For example, suppose that x always precedes y by a time interval τXY. Then, by the end of learning, we would have the pairwise association

M_{XY}(\tau) = \Phi_k(\tau_{XY}/\tau)\, /\, \tau.  (1.6)

Figure 2 shows the pairwise associations between two pairs of events, occurring 1 and 2 time units apart, respectively.

Figure 2:

Pairwise associations fuzzily represent the rate of finding two events occurring a certain time interval apart. Other event types are ignored in computing the association between two event types. The associations shown here are based on the signals in Figure 1a. As a function of internal time, the associations peak at around the time interval separating the event pairs.

We may view MXY from two complementary perspectives. First, given the occurrence of y in the present, the agent could use MXY(τ) as a subjective estimate, based on an average over occurrences of y, of the instantaneous rate of occurrence of x at time τ in the past, that is,

M_{XY}(\tau)\, d\tau \approx P\!\left[\, x \,@\, t_Y - \tau \,(d\tau) \,\right].  (1.7)

Second, given the occurrence of x in the present, the agent may use MXY(δ) as a subjective estimate of the instantaneous rate of occurrence of y at time δ in the future, that is,

M_{XY}(\delta)\, d\delta \approx P\!\left[\, y \,@\, t_X + \delta \,(d\delta) \,\right].  (1.8)

We use τ or δ as the time argument for M according to the interpretation that applies.

A limitation of directly selecting an element of the pairwise association M to predict the future is that the prediction can only be based on a single cue in the present (i.e., the cue corresponding to that element). To overcome this, the agent constructs the pairwise prediction m, which integrates cue–outcome pairing information (encoded by M) from multiple simultaneous cues in the present to predict the future. We estimate the rate of future occurrences of y based on the events in the present (time t) as

m_Y(\delta; t)\, d\delta \approx P\!\left[\, y \,@\, t + \delta \,(d\delta) \,\right],  (1.9)

where

m(\delta; t) = \kappa_1\, e^{[\mathbf{M} f_\delta](t)}  (1.10)

is the collection of rates of outcomes (a vector with n elements, one for each possible outcome, at every δ) and κ1 is a normalization constant whose form is given in section A.3. The exponential applies element-wise. The operator M is defined by

[\mathbf{M} f_\delta](t) = \frac{1}{|E_t|} \sum_{\alpha \in E_t} \int f_\delta^{\alpha}(\tau; t)\, \log M_\alpha(\tau)\, d\tau,  (1.11)

where Et is the set of events occurring at time t, |Et| is the number of events occurring at time t, Mα is the collection of pairwise associations with α as the cue (a vector with n elements, one for each possible outcome in correspondence with m, at every δ), and

f_\delta^{\alpha}(\tau; t) = \Phi_k(\delta/\tau)\, /\, \tau  (1.12)

denotes the future state of the memory element associated with the currently occurring episode of α. (Contrast this with f~δα (see equation 1.4), which denotes the future memory state induced by all past occurrences of α.) The logarithm in equation 1.11 applies element-wise. The operator M may be thought of as operating on the precomputed future memory state of the current events (see equation 1.4) to generate a prediction for the future. In general, equations 1.8 and 1.9 provide similar estimates for the future. The normalization constant κ1 is such that precisely when x is the only cue for y and the time delay between them is fixed, equations 1.8 and 1.9 provide exactly the same estimate for y. Mathematical details are in section A.3.

Intuitively, the integral on the right-hand side of equation 1.11 takes the product of logMX with the future state of the memory element fδX. Let us consider y to be the outcome of interest and consider the case where the time interval between x and y, τXY, is constant. In this case, the integral (and thus, mY) attains a maximum as a function of δ when δ coincides with the peak of logMXY(τ), which is approximately τXY. Fittingly, this is behavior we expect of mY.
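
A hedged sketch of equations 1.10 to 1.12 in code (our own reconstruction; the exact form of κ1 is given in section A.3, so it appears here only as a placeholder, and the grids, quadrature, and small floor on M are numerical conveniences we assume):

```python
import numpy as np
from math import factorial

k = 8
kappa0 = k ** (k + 1) / factorial(k)
tau = np.geomspace(0.1, 100.0, 256)      # internal past times
delta = np.geomspace(0.1, 100.0, 256)    # internal future times

def phi(x):
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, kappa0 * x ** k * np.exp(-k * x), 0.0)

def f_delta_alpha(d):
    """Future memory state of a currently occurring event   (eq. 1.12)."""
    return phi(d / tau) / tau

def pairwise_prediction(M, current_events, kappa1=1.0, eps=1e-12):
    """m(delta; t) = kappa_1 exp( (1/|E_t|) sum_alpha int f_delta^alpha log M_alpha dtau )."""
    n = M.shape[1]                       # M[cue, outcome, tau]
    dtau = np.gradient(tau)
    log_term = np.zeros((n, delta.size))
    for alpha in current_events:
        for i, d in enumerate(delta):
            w = f_delta_alpha(d)         # weights over tau
            log_term[:, i] += np.sum(w * np.log(M[alpha] + eps) * dtau, axis=-1)
    log_term /= max(len(current_events), 1)
    return kappa1 * np.exp(log_term)     # shape (n_outcomes, len(delta))
```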

Both MXY and the corresponding equation 1.11 integral are smooth functions that peak around the delay interval between the two events. This may prompt the question of why the integral is used in place of MXY (Tiganj et al., 2019). The strength of the present formulation is that equations 1.10 to 1.12 closely parallel equations 2.2 and 2.3 in the next section, for which the integral is necessary. Using equations of similar form is more neurobiologically realistic because it suggests that analogous neural architecture supports the computation for both the pairwise prediction m and the prediction p, to be introduced later.

2. Predicting the Future with a Scale-Invariant Past

It would be straightforward to build a prediction for the future based on a single event (e.g., the most recent event) using the pairwise associations M. The challenge is to build a prediction that is based on multiple events in the recent past. One difficulty arises when associations overlap. For example, we associate the sound of rain (x) with a chance of hearing thunder (z). We also associate the sight of wet ground (y) with a chance of hearing thunder (z). Having heard the sound of rain, the prediction for thunder should not be increased by the sight of wet ground when we step outdoors. This example illustrates one of the pitfalls of simply adding the predictions suggested by the pairwise associations.

To address double-counting, in addition to pairwise associations, we construct credit associations between event pairs, which are the key to this algorithm's ability to generate a time line for the future. In section 2.1, we explain how the agent constructs a time line of the future by integrating over a time line of the past, weighted by credit associations between cues and outcomes. In section 2.2, we show how the agent learns credit associations between cues and outcomes by comparing predictions prior to the cue with predictions due to the cue.

2.1. Generating Predictions from Credit Associations

In addition to the pairwise associations M, we build the credit associations C between each pair of events (a cue and an outcome) as a function of internal time δ since the cue. The credit associations C(δ) may be thought of as an n×n matrix at every δ. We denote the collection of credit associations with x as the cue as CX(δ), which may be thought of as a vector with n elements, one for each possible outcome, at every δ. We interpret CXY(δ) as the logarithm of the factor by which an agent adjusts its subjective estimate of the instantaneous rate of occurrence of y at time δ in the future, having just observed x. Denoting p-(δ) as the agent's prior estimate (just before observing x), we have

\left[ p^{-}(\delta) \right]_Y \exp C_{XY}(\delta)\, d\delta \approx P\!\left[\, y \,@\, t_X + \delta \,(d\delta) \,\right].  (2.1)

Equation 2.1 relates to the observation of one cue at one time (the present). For cues in the past, the further in the past they occur, the more imminent outcomes should seem. For example, if x has credit for y peaking at δ=5 and x occurred three time units ago, y should be expected in two time units. Accounting for multiple cues over the past, we find that at time t, the agent's internal time line for a time δ into the future is

p(\delta; t) = \Lambda \odot e^{[\mathbf{C} \tilde{f}_\delta](t)},  (2.2)

where p stands for prediction and is a vector over event types, Λ consists of the long-term average of each event type, ⊙ denotes the element-wise product, and the exponential applies element-wise. The operator C is defined by

[\mathbf{C} \tilde{f}_\delta](t) = \sum_{E} \int C_E(\tau)\, \tilde{f}_\delta^{E}(\tau; t)\, d\tau,  (2.3)

where the index of summation E indexes the possible cue types, and CE(τ) represents the collection of credit associations with E as the cue (a vector with n elements, one for every possible outcome, at every τ). Intuitively, the integral sums products of CE with the projected memory f~δE. Let us consider y to be the outcome of interest and x the only cue, and consider the case where the time interval between x and y, τXY, is constant. In this case, the integral (and thus, pY) attains a maximum as a function of δ at the value of δ at which the peaks of CXY and f~δX coincide. The credit CXY peaks around τXY, the time delay between x and y. The projected memory f~δE peaks around the time τX+δ, where τX is the time that has elapsed since x. Therefore, the integral (and thus, pY) peaks around δ=τXY-τX, which is the time remaining to y, the outcome of interest. In other words, the agent's expectation for y would be the highest at a time when y is, in fact, due. Mathematical details are in section A.4.
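
In code, the operator of equations 2.2 and 2.3 is a weighted integral over the projected memory of every cue. The sketch below is our own illustration under assumed array shapes (the credit associations C, the projected memories, and the long-run rates Λ are taken as given):

```python
import numpy as np

def predict(C, f_tilde_delta, Lam, tau):
    """
    C:             (n_cues, n_outcomes, len(tau))   credit associations C_E(tau)
    f_tilde_delta: (n_cues, len(delta), len(tau))   projected memories f~^E_delta(tau; t)
    Lam:           (n_outcomes,)                    long-term average rate of each type
    Returns p(delta; t) with shape (n_outcomes, len(delta)).
    """
    dtau = np.gradient(tau)
    # [C f~_delta](t) = sum_E int C_E(tau) f~^E_delta(tau; t) dtau        (eq. 2.3)
    integral = np.einsum("eot,edt,t->od", C, f_tilde_delta, dtau)
    # p(delta; t) = Lambda (element-wise) exp([C f~_delta](t))            (eq. 2.2)
    return Lam[:, None] * np.exp(integral)
```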

We interpret pY(δ;t) as the agent's subjective estimate, made at time t, of the instantaneous rate of occurrence of y at time δ in the future, that is,

p_Y(\delta; t)\, d\delta \approx P\!\left[\, y \,@\, t + \delta \,(d\delta) \,\right].  (2.4)

Unlike equations 1.8 and 2.1, this estimate takes into account all of the events that have occurred in the recent past. A schematic distinguishing the utility of the pairwise associations M and the credit associations C in making predictions is shown in Figure 3. Just as we consider f~(τ) a time line of the past, we consider p(δ) a time line of the future. Note that pY(δ=0;t) would correspond to the agent's internal model for, in the language of point process theory, the conditional intensity function of y (see Rasmussen, 2018).

Figure 3:

Predictions can be made using credit associations C based on memories of the multiple events in the recent past. The horizontal axis shows events occurring in real time. The event signal for this scenario is shown in Figure 1a. Associated with each point in real time is an agent's internal time axis, shown here diagonally at t=1.0, which indexes memories of the past (bottom half) and predictions for the future (top half). The agent may make a prediction for the future with M (see equation 1.8) based on the currently observed event (here, y). As a better alternative, the agent may make a prediction for the future with C (see equation 2.2) based on multiple events in the present and the recent past.

As an illustration, consider again the scenario where events x, y, and z always occur consecutively, once on each trial, at relative times 0, 1, and 2, respectively, with a very long gap between trials. Once x occurs, the proposed algorithm (explained in the following sections) generates predictions for y and z that become more and more imminent as time elapses (see Figure 4). As a function of δ, the predictions peak at approximately the time when the events are, in fact, due.

Figure 4:

Predictions for events peak at about the right time and become more imminent with time. The events x, y, and z occur on each trial at times 0, 1, and 2, respectively, as with previous figures. (a) A schematic of the state of memory and prediction as a function of time. The axes have the same interpretation as in Figure 3. At real time 0.00, x is observed, leading to a prediction for y and z, depicted along the diagonal internal time axis. As real time passes, the memory of x recedes into the past, and the predictions for y and z become more imminent, depicted by the events' downward movement along the internal time axis. (b) Prediction for y and z generated by simulation using the proposed algorithm as the memory for x recedes into the past, depicted at four time points. The peak times for the prediction for y and z correspond roughly to when the events are in fact due and move toward zero as time passes. For example, in the top-most plot, right after x occurs, y and z are to occur in 1 and 2 time units, respectively. Indeed, the generated predictions for y and z peak at approximately δ= 1 and 2, respectively.

2.2. Computing Credit Associations

Loosely speaking, we assign credit for an outcome to an event according to how much the event's occurrence would revise the prediction for that future outcome. In our example, wet ground would be assigned little to no predictive value, since the chance of thunder has already been predicted by the sound of rain. During training, we update the credit assigned to an event when that event occurs. In this section, we describe the update that happens when x occurs with no loss of generality.

Formally, as we have stated, expCXY(δ) is the factor by which we should adjust the prediction for y at time δ in the future, having just observed x. Therefore, to compute expCXY(δ), whenever x is observed, we will first compute the prediction for y before and due to the observation of x, and analogously for other possible outcomes.

2.2.1. Prediction prior to Event Observation

Prior to event observation at time t, the prediction associated with internal future time δ is given simply by

p^{-}(\delta; t) = \lim_{t' \to t^{-}} p(\delta; t').  (2.5)

This prediction arises from the memory of cues in the past and specifically excludes the effects of what occurs at time t.

Consider the scenario in Figure 1, where x, y, and z occur consistently at trial times 0, 1, and 2, respectively. When x occurs, (p-)Y=ΛY, the long-term average rate of y, for all δ. This is because p- is computed based on memory of events occurring before x, of which there are none (see Figure 5c). In contrast, when y occurs, (p-)Z shows a peak at δ=1, based on memory of events occurring before y (i.e., x), and the credit association between x and z (see Figure 5f).

Figure 5:

Observed events receive less credit for future events that have already been predicted based on past events. As with all previous figures, the events x, y, and z occur on each trial at times 0, 1, and 2, respectively. (a) A schematic of the state of memory and prediction as a function of time, as in Figure 4(a). At real time 0.00, x is observed and the memory is empty. (b) An illustration of an agent's inferences at the time x occurs. No memory of past events exists to suggest a prediction, whereas the currently observed event x suggests that y occurs soon. (c) Plots of p+, p- (top) and eCXY (bottom) as a function of internal future time, δ, at the time x occurs, for the prediction of y. The quantity p+ (red) is the pairwise association between x and y, while p- (purple) is flat as a function of δ as there is no memory of events. The quantity eCXY=p+/p- (orange). (d) Same as panel a, but at real time 1.00. y is observed, and x is in memory. (e) An illustration of an agent's inferences at the time y occurs. The agent remembers x, prompting a prior prediction of z. The currently observed event y suggests the same, but the agent does not gain much information from y, and hence assigns y less credit. (f) Same as panel c, but at the time y occurs, for the prediction of z. The quantity p+ is the pairwise association between y and z, which is the same as that between x and y. However, p- reflects the prior prediction for z based on the memory for x. (This is pC from the bottom-most plot in Figure 4b.) Thus, eCYZ is diminished.

2.2.2. Prediction due to Event Observation

For the prediction due to the observed event x itself, we use the pairwise prediction in accordance with equation 1.9,

p^{+}(\delta; t) = m(\delta; t).  (2.6)

For the scenario in Figure 1, when x occurs, (p+)Y=mY=MXY (see Figure 5c), and when y occurs, (p+)Z=mZ=MYZ (see Figure 5f). Both of these have the same form, peaking sharply at δ=1, since the time intervals between x and y and between y and z are fixed and equal.

2.2.3. Updating C

When x is observed at time t, we update CX in the following manner:

\Delta \exp C_X(\delta) \propto \frac{p^{+}(\delta; t)}{p^{-}(\delta; t)} - \exp C_X(\delta).  (2.7)

The division and exponentiation are performed element-wise for each possible outcome. This update depends on the previous state of C (through equations 2.5 and 2.2). During training, as events occur, we update the respective components of C, which in turn enhances the agent's predictions of the future as training proceeds. This update rule squares with the intuition that events should be assigned credit in accordance with their association with outcomes that were not already predicted. As training proceeds, expCX(δ) approaches p+(δ;t)/p-(δ;t) in expectation, up to the variability of event occurrence history in recent episodes of x during training. Since we assume stationary statistics, a small learning rate (i.e., the constant of proportionality in equation 2.7) should be used to minimize the effects of such variability.
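
A minimal sketch of this update in code (our own illustration; the learning rate and the floor on p- are assumed numerical choices):

```python
import numpy as np

def update_credit(expC, alpha, p_plus, p_minus, lr=0.05, eps=1e-12):
    """
    expC:    (n_cues, n_outcomes, len(delta))   current exp C
    alpha:   index of the cue that just occurred
    p_plus:  (n_outcomes, len(delta))           prediction due to the event (eq. 2.6)
    p_minus: (n_outcomes, len(delta))           prediction prior to the event (eq. 2.5)
    """
    target = p_plus / np.maximum(p_minus, eps)
    # Delta exp C_alpha is proportional to p+/p- minus exp C_alpha        (eq. 2.7)
    expC[alpha] += lr * (target - expC[alpha])
    return expC
```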

For the scenario in Figure 1, as noted, the observation of x generates a prior prediction for z that is present when y occurs. Thus, via equation 2.7, y receives less credit for z than x receives for y at each δ (see Figure 5), even though the xy and yz pairwise associations are the same (see Figure 6). As a practical matter, since the learning of C depends on the accurate learning of M, for best results, C should be learned only after M stabilizes during training.

Figure 6:

For event pairs, credit density can differ despite having the same pairwise associations. A summary of the event pair associations (top) and credit densities (bottom) for all nontrivial event pairs for the scenario in all previous figures, where the events x, y, and z occur on each trial at times 0, 1, and 2, respectively. The pairwise associations MXY and MYZ, overlapping perfectly, are slightly displaced for clarity. However, CXY is greater than CYZ due to the memory of x allowing a prior prediction for z when y occurs, as shown in Figure 5(f).

2.3. Summary

The agent's memory f~ encodes a time line of past events (see equation 1.2). Using Hebbian association, the agent makes pairwise associations M between each pair of event types as a function of internal time (see equation 1.5). This lets the agent form a pairwise prediction m for the future whenever events occur, but only based on the pairwise correlations associated with those events as cues. To predict future events based on past events, the agent learns credit associations C between each pair of event types as a function of internal time. The agent uses C and f~ to generate a time line of future events (see equation 2.2). While the agent learns, each time an event occurs, we step expCαβ (where α is the event that occurred) toward the ratio of the prediction for β due to α (based on M), to the prediction for β prior to α (based on C) (see equation 2.7). This design curbs double-counting of correlations for an outcome associated with multiple cues at different points in the past. Through learning, we expect the agent to produce better and better predictions for events in its future.

3. Properties of the Prediction Algorithm

The algorithm described above has interesting computational properties. We will discuss how it scales with the number of event types that can be distinguished and the timescales over which prospection is implemented. It can be shown that the model is optimal for pairwise predictions modulo the uncertainty that comes from finite temporal resolution of memory. Moreover, the model is invariant to rescaling of time, which may be useful in applications where the relevant time scale is not known a priori.

3.1. Scaling Properties

As with traditional associative models, the computational time and space required for this algorithm vary quadratically with the number of event types considered. In typical RL models, each state s must be defined to include all of the information that could affect the transition to the next state in order to fit into the Markov structure. If transitions depend on the indefinite past, the number of possible states would become unwieldy. In contrast, the event types used here are economically defined to be those events that occupy a single point in time (e.g., x, y), which are much smaller in number.

In addition, this algorithm runs in time and space polynomial in the number of τ time points considered in f~(τ) and δ time points considered in p(δ). For example, in equation 2.3, for each δ, the numerical integral is computed in time linear in the number of τ, the variable of integration, corresponding to how far in the past memories are considered. The full prediction, over all δ that the agent considers, is computed in time linear in the number of δ. Translation to different values of δ can be implemented serially, consistent with neural considerations (Shankar et al., 2016), or be parallelized in silico. The quick performance comes at the cost of the ability to directly handle some forms of joint statistics among cues. We discuss this shortcoming in section 5.2.

The longest timescale over which predictions are based, and over which they are made, increases exponentially with the computational resources committed. Although the integral form in equation 1.2 would seem to require memory for the entire history up to the present, f~ can be generated from a set of leaky integrators with a finite number of time constants (Shankar & Howard, 2013). The scale invariance of Φk allows us to choose the distribution of time constants as a geometric series, resulting in a logarithmic relationship between the number of integrators and the longest timescale that can be represented.
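
For instance, a geometric series of time constants reaching a longest timescale T needs only on the order of log T integrators; the growth factor below is an arbitrary illustrative choice:

```python
import numpy as np

tau_min, T, growth = 0.1, 1000.0, 1.1
n_units = int(np.ceil(np.log(T / tau_min) / np.log(growth))) + 1
time_constants = tau_min * growth ** np.arange(n_units)
print(n_units, "integrators cover", tau_min, "to", round(float(time_constants[-1]), 1))
```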

3.2. Equivalence of Fuzzy Memory and Input Temporal Uncertainty

Even when the time interval between events is fixed, fuzzy memory (finite k) leads to temporal fuzziness in both the pairwise association M and prediction p(δ). At every instant in time, this induced fuzziness is equivalent, in its effect on the prediction, to fuzziness due to intrinsic temporal uncertainty in the signal f faced by an agent with perfect memory (infinite k).

As an example, consider an agent with fuzzy memory encountering x, followed by y after a fixed time interval τ. Precisely at the time x occurs, the agent's prediction for y is given by

p_Y(\delta; t_X) = \kappa_1^{-1}\, \frac{1}{\delta}\, \Phi_k\!\left(\frac{\tau}{\delta}\right),  (3.1)

where κ1 is as given in section A.3. Another agent with perfect memory encountering x, followed by y after a random time interval τ, whose probability density function is given by qτ(t)=Φk(t/δ)/δ, makes an optimal prediction following x equivalent to equation 3.1. The derivation of equation 3.1 is given in section A.5.1.

Although the fuzzy memory agent's prediction for y some time after encountering x is different from equation 3.1, this equivalence property still holds: at every instant in time, there exists a perfect memory agent, with observations subject to some density function of τ, with an equivalent optimal prediction.

3.3. Timescale Invariance

The prediction algorithm inherits the timescale invariance of the temporal record of the past. If the input signals are time-dilated, the resulting predictions would be time-dilated, rescaled in magnitude and otherwise unchanged (see Figure 7). Therefore, the prediction algorithm, with an appropriate range of τ and δ, supports chains of events that happen over any timescale.

Figure 7:

Predictions are timescale-invariant. Top: Credit density and, as an example, the prediction after x occurs for y and z, as a function of internal future time, δ, for the scenario in all previous figures. Middle, bottom: When the scenario is time-dilated, shown here by 10 and 100 times, the model output is unchanged as a function of dilated internal time. In the case of predictions, the magnitude rescales to preserve the area under the curve. This suggests that the proposed algorithm supports chains of events that occur over any timescale.

Formally, for any constant λ, the estimated probability of event occurrence within a small duration dδ, p(δ;t) dδ, is invariant under the transformation

t \to \lambda t, \qquad \tau \to \lambda\tau, \qquad \delta \to \lambda\delta.

This means that within the limits of a computational implementation, that is, far from the smallest and largest values of τ and δ (which grow exponentially with the resources committed to representing time), the model provides the same relative temporal resolution.
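
For readers who want to see where this comes from, here is a brief sketch of the scaling argument (our own reconstruction rather than a quotation of the appendix). It tracks the memory of a single impulse, and the line for C assumes that exp C has converged to the ratio p+/p- as described in section 2.2.3:

```latex
% How each quantity transforms under t -> lambda t, tau -> lambda tau, delta -> lambda delta
\begin{align*}
\tilde f(\lambda\tau;\,\lambda t)
  &= \frac{1}{\lambda\tau}\,\Phi_k\!\left(\frac{\lambda t-\lambda\tau_0}{\lambda\tau}\right)
   = \frac{1}{\lambda}\,\tilde f(\tau;\,t), \\
M_{XY}(\lambda\tau) &= \frac{1}{\lambda}\,M_{XY}(\tau),
  \qquad
  \exp C_{XY}(\lambda\delta) = \exp C_{XY}(\delta)\ \text{(a ratio of two rates)}, \\
p(\lambda\delta;\,\lambda t) &= \frac{1}{\lambda}\,p(\delta;\,t)
  \;\Longrightarrow\;
  p(\lambda\delta;\,\lambda t)\,d(\lambda\delta) = p(\delta;\,t)\,d\delta .
\end{align*}
```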

One may wonder whether, as an alternative to computing C, one can generate a future time line p(δ) and directly update it using p+ and p- whenever an event occurs. A difficulty with this approach is that a timescale would have to be chosen for the evolution of p(δ) between events, violating the timescale invariance property that we desire.

3.4. With Fuzzy Memory, Credit Is Assigned Based on Temporal Proximity

Consider the scenario where x occurs, then y, then z, always with the same time delays. In the limit of perfect memory, y would receive no credit for z. This is because the occurrence of x would allow the time of occurrence of z to be predicted perfectly at all times. The occurrence of y would not improve the (already perfect) prediction. When memory is fuzzy, the xz pairwise association would have a larger temporal uncertainty than the yz pairwise association, since y and z are closer in time than x and z (see equation 1.5). Therefore, the occurrence of y would improve the prediction for z. The closer y occurs to z, the more y sharpens the prediction for z and the more credit is assigned to y for z. Figure 8 illustrates this effect, and supporting equations are worked out in section A.5.2.

Figure 8:

Temporal proximity promotes credit assignment. Events x, y, and z occur at times 0, 2-tYZ, and 2, respectively. (a) A schematic of the state of memory and prediction at the time y occurs, for six values of tYZ. The axes have the same interpretation as in Figure 3. (b) Credit assigned to y for z is shown here for the six values of tYZ, as a function of internal future time, δ. In other words, each line represents different amounts of temporal proximity between y and z, while the interval between x and z remains fixed. For tYZ=1.9, y is much closer in time to x than to z. In this case, the credit is almost flat, as the prediction for z due to x is still fresh. The case tYZ=1 is the scenario in Figure 1 through 6. For lower and lower values of tYZ, credit density is more and more sharply peaked. The prediction for z due to x has flattened out, allowing the effect of the pairwise association between x and z to dominate. The analytic form of the lines plotted is worked out in section A.5.2.

4. Demonstration: Event Streams with Memory and Multiple Characteristic Timescales

We have seen that the algorithm described here is able to predict the future based on a temporally extended record of the past containing multiple possible cues. In addition, this prediction does not require selection of a preferred timescale, allowing for generalization across an exponentially large range of times. As a consequence of these two properties, this approach is well suited to applications where the relevant timescale is not known a priori or to situations where there are multiple processes at different characteristic timescales that must be simultaneously learned. To illustrate these properties, we demonstrate learning of the algorithm on a time series of discrete events generated from multiple Markov renewal processes (MRP).

In principle, the algorithm we describe is capable of handling multiple cues with additive effects (but see section 5.2) stretching into the indefinite past. However, for simplicity, we generate a scenario such that each event has exactly one cue. This cue is usually found at most 15 time units before the event. For comparison, most consecutive events have an intervening time of between 0.1 and 15 time units. Crucially, the cue is not usually the immediately preceding event, but one of the several preceding events. Thus, one cannot merely predict the future based on the most recent event. To add realism, we introduce a small amount of variability in the event type of the outcome, as well as a small amount of gaussian variability in the time of the outcome.

The way we generated a scenario with such properties is to superpose several MRPs, each with three base event types, u, v, w. MRPs have the property that the type of each event is the sole determiner of the probability distribution of the type and time of the next event. In other words, each event has a single cue. Superposing MRPs destroys the guarantee that the cue immediately precedes its outcome. We generated the scenario using two approaches, mainly differing in the way event types are determined in the superposed process. For the first approach, event types in the superposed process are determined according to the base type of the event and the MRP of origin. For example, for a superposition of 7 MRPs, there would be 3×7=21 event types (1u, 1v, 1w, …, 7w). An example of such a scenario with two MRPs superposed is shown in Figure 9a. Figure 9b shows the corresponding mean transition times for each type of transition. The drawback of this approach is that as the number of MRPs increases, the number of event types increases, making the prediction task inherently harder. For the second approach, event types in the superposed process are determined only according to the base types of the events, even if they originate from different MRPs. This way, for the prediction of the type of an event, there are always two wrong answers and one correct answer, for a fair comparison regardless of the number of MRPs superposed.

Figure 9:

The algorithm provides good predictions for cues at multiple timescales. (a) The top panel shows the first few events in a superposed process. The bottom two panels show the corresponding events from the two component MRPs, which are composed of events {1u,1v,1w} and {2u,2v,2w}, respectively. Note that successive events in the superposed process (e.g., the first two events in the top-most panel) may be from different MRPs, and thus the earlier event is not predictive of the later event. (b) Graph depicting mean transition times between event types within each component MRP. Weights are associated with the arrowhead closest to them. The variances of the normally distributed transition times are not shown here. Note, for example, how the 1v → 1u transition takes place at the scale of about 1.5 time units, while the 1u → 1w transition takes place at a different scale of about 10 time units. The two MRPs depicted here are two of the component MRPs in the simulation used to generate panel c. (c) We superpose MRPs such that event types from different MRPs are deemed different event types in the superposed process. (d) We superpose MRPs such that event types from different MRPs are identified by base types (u, v, w) irrespective of their MRP of origin, resulting in exactly three event types in the superposed process. For panels c and d, each point represents an average accuracy computed by repeating the training and testing procedures six times for each choice of number of MRPs superposed. Regardless of the method of superposing MRPs, the algorithm (labeled C) performs well above chance, showing that it provides good predictions for cues at multiple timescales. See text for a comparison of the C- and M-based predictions.

The algorithm we describe can be used to predict both the time and type of likely events in the future. However, for simplicity, we evaluate the algorithm on its average accuracy of predicting the type of the next event, given the time to the next event, whenever an event occurs. We generate this prediction via argmax_i p_i(δ = t_{n+1} − t_n; t = t_n), where t_n is the time of the nth event. We call this the C-based prediction. As a comparison, we generate an M-based prediction via argmax_i m_i(δ = t_{n+1} − t_n; t = t_n), where j is the type of the event at t_n, and evaluate its average accuracy. Notice that the M-based prediction only invokes pairwise associations with event j as the cue, whereas the C-based prediction invokes credit associations with current and past events as cues. Finally, we compare these to a baseline of always predicting the most frequent event type. Our method is described in detail in appendix A.6.
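
A sketch of this evaluation loop (our own illustration; `predict_at` stands in for whichever of the C- or M-based predictors is being scored, and the δ grid is the assumed one from earlier sketches):

```python
import numpy as np

def next_event_accuracy(event_times, event_types_seq, predict_at, delta):
    """predict_at(t) returns p(delta; t) with shape (n_types, len(delta))."""
    correct = 0
    for n in range(len(event_times) - 1):
        p = predict_at(event_times[n])
        # index of the grid point closest to the actual time until the next event
        i = np.argmin(np.abs(delta - (event_times[n + 1] - event_times[n])))
        correct += int(np.argmax(p[:, i]) == event_types_seq[n + 1])
    return correct / (len(event_times) - 1)
```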

The average accuracies of the prediction methods are shown in Figures 9c and 9d, as a function of the number of MRPs superposed, for the first and second approach of scenario generation, respectively. The C- and M-based predictions generally outperform the baseline model. Across both figures, the results are qualitatively similar. The accuracies of C- and M-based predictions are comparable for a single MRP. This is expected since for an MRP, the cue and its outcome are neighbors. Whenever an event occurs, the M-based predictor uses the pairwise associations between that event and its possible outcomes to predict the type of the next event. However, as more and more MRPs are superposed, the C-based algorithm outperforms the M-based algorithm.

What drives the difference in performance between the C- and M-based algorithms? Although the C-based algorithm uses the credit associations C while the M-based algorithm uses the pairwise associations M, this difference is immaterial in this case. Since each event only has one cue, expCα is proportional to Mα. Rather, the M-based algorithm suffers when successive events originate from separate MRPs, in which case the pairwise association between the respective event types is not predictive. The M-based algorithm makes predictions only based on events in the present. In contrast, the C-based algorithm makes predictions based on events in the present and in the past, so the correct cue is included in such situations.

This demonstration provides a proof of concept that the algorithm provides reasonable predictions for cues at timescales spanning one order of magnitude. We accomplished this without selecting any single operating scale. The demonstration gives a flavor for the advantages of the algorithm we describe over Markov models. A classic approach based on nth-order Markov models would entail discretizing time at some lowest-level scale (but see Kurth-Nelson & Redish, 2009; Ludvig, Sutton, & Kehoe, 2008) and sizing the memory buffer to encompass most of the longest transitions. For simplicity, we have constructed a relatively tame scenario for this demonstration, in which most event relationships only span about 1 to 15 time units, and events are sparse. In reality, the wider the range of timescales, the harder it is for standard algorithms operating at the lowest-level timescale, which fumble at timescales significantly different from their operating scale (Mozer, 1992). In scenarios where events have long-range temporal dependencies, Markov models would be significantly limited by the exponential growth in the number of states (and, thus, computational demands) with the size of the memory buffer. The algorithm we describe does not face these limitations (see section 3.1).

5. Discussion

We have proposed an algorithm that generates a scale-invariant time line of the future. This algorithm is time-local in the sense that predictions at time t are derived from f~(τ;t), which represents events that are, in fact, nonlocal in real time. Moreover, the translation mechanism enables event rates at future time points to be estimated. In addition to associative memory, as developed by model-free RL algorithms, this capability would let an agent construct an estimate over possible futures (McGuire & Kable, 2013).

5.1. Theoretical Properties of the Current Model

This model has properties that are quite different from traditional RL paradigms. First, this algorithm naturally runs in continuous time, which suits applications dealing with natural processes unfolding in time. This feature contrasts with basic RL algorithms, which only allow agents to move among discrete states in discrete time. In principle, this proposed algorithm can be extended such that position in higher-dimensional spaces replaces or augments time, allowing agents to navigate real and abstract spaces. Translation can be along an angle or perhaps even along a trajectory instead of being confined to a given axis (see equation 1.4).

Second, the scale invariance of the model is useful in applications where the timescale of event relationships is not known in advance. In principle, the model is indifferent toward the absolute time intervals between events. Instead, within a given scenario, it is only concerned about time intervals relative to other time intervals. In comparison, in traditional RL systems, a timescale for history dependency, if any, is set by the size of the history that the designer defines as part of the state s. Moreover, in many aspects of the world that we might be interested in, such as in natural language (Altmann, Cristadoro, & Esposti, 2012), network traffic (Cohen, Erez, Ben-Avraham, & Havlin, 2000), and financial markets (Cont, 2005), event dependencies exist simultaneously across a wide range of scales. This model is potentially suited for such applications, since it incorporates past events across a range of timescales, and an increase in computing resources provides an exponential increase in the length of history considered.

Third, in the context of RL, this model may be incorporated into algorithms to allow an agent to naturally form a prediction of its own trajectory as a function of future time. This can be done by considering the agent's arrival at some or all states as events. In addition, by combining the predictions for future states s as a function of future time, ps(δ), with a reward function over future states, the agent can generate the predicted future reward as a function of future time, r(δ). By learning and comparing weighted integrals of r(δ) for several alternative policies, the agent can choose flexibly among these policies according to task demands. For instance, if the agent knows it only has 10 time units to complete the task, it can choose the policy with the highest ∫_0^{10} r(δ) dδ. The model's ability to form a prediction as a function of future time stands in contrast to RL paradigms, which tend to flatten the dimension of future time. For example, a naive RL agent assigns values to states according to the expected sum of future reward starting from that state; a successor representation agent (Dayan, 1993) learns the expected future state occupancy, summed over future time, starting from each state (but see Tano, Dayan, & Pouget, 2020; Momennejad & Howard, 2018).

Finally, we note that this model provides information usually associated with model-based RL but with very different computational properties. Like model-based RL, this model provides an explicit prediction as a function of future time δ. However, a constraint of model-based RL is that the time to compute an event δ in the future goes up linearly with δ. In the present model, because the calculation of the prediction at a particular value of δ does not depend on the prediction at previous values of δ, one could in principle compute all values of δ in parallel. Moreover, this means that it is possible to sample the δ axis in whatever way is convenient. Integrals over δ give hyperbolic discounting if the δ axis is sampled evenly as a function of logδ. See also Shankar et al. (2016) for considerations related to physically instantiating translation across a population of neurons.

5.2. Theoretical Limitations of the Current Model

We highlight two limitations relating to applying this algorithm toward machine learning. First, the algorithm, as currently described, is not directly sensitive to joint statistics of two or more cues. For example, the model would be unable to capture the conditional structure “z occurs exactly if either x or y occurs, but not both.” As a consequence, the algorithm is also unable to deal appropriately with the number of events. For instance, the algorithm has no basis to differentiate “x precedes y by 10 s” from “half of the time, x precedes two closely spaced occurrences of y by 10 s and the other half of the time, y does not occur.” We can mitigate this issue by letting perceived events depend on context. For example, the agent can perceive the y after an x as the event xy, enabling sensitivity to joint statistics of at most two cues. In terms of computational complexity, naively implementing this would introduce a quadratic factor in the number of base events. However, we can reduce the resource complexity by finding a compressed representation of the event history while preserving information about future events; that is, dealing with the information bottleneck problem (Tishby, Pereira, & Bialek, 2000). Since existing deep neural network algorithms efficiently extract joint statistics, it would be natural to pursue research that seeks to merge this approach with deep network algorithms.

Second, this algorithm is limited in prescribing how to achieve optimal policies in the context of RL. Our focus has been on how to predict future events, not how to learn the best policy. In many contexts, it is natural to define events such that events occur depending on actions of the agent (e.g., in a spatial navigation task where events occur based on the agent's trajectory). In these cases, in effect, we presume that the agent follows an existing policy π, and the model deduces event associations and makes predictions with respect to π. The agent can certainly flexibly choose among several alternative policies, say, between π and π', by comparing predictions from the start state and selecting the more rewarding alternative. However, unlike basic RL algorithms, we do not prescribe a method for learning a policy that scales in complexity with the number of states, such as a policy to navigate a grid. In the context of grid navigation, we have, in effect, avoided assigning values to coordinates on the grid, since this contradicts our design principle of allowing history to influence events (rewards). More research would be needed if one wished to pursue policy learning within the framework we describe.

5.3. Neuroscience Considerations

We now discuss two potential points of contact between the formal model presented in this letter and computational and systems neuroscience.

5.3.1. Reward Prediction Error and Dopamine

The success of RL algorithms in accounting for the firing of dopaminergic neurons in the midbrain (Schultz et al., 1997) is arguably the greatest achievement in computational cognitive neuroscience. The basic empirical story is well known. Dopaminergic neurons respond to unpredicted rewarding outcomes. However, with learning, as the reward becomes predicted by a neutral stimulus, the cells no longer fire to the predicted reward but instead fire to the neutral stimulus that predicts the future rewarding outcome (see Schultz, 2006, for a review of the early literature). While there are undoubtedly many details that would need to be worked out, at least the rough outline of the classical empirical story about dopamine can be mapped onto this framework.

Let us suppose that expected future value is computed at each moment by integrating over future time δ, taking the projection from the vector of predicted events p(δ) onto some vector A that describes the reward value of each possible event type in p:

V(t) = \int_0^{\infty} \mathbf{A} \cdot p(\delta; t)\, g(\delta)\, d\delta,  (5.1)

where g(δ) denotes the factor arising from compression of the δ axis. As discussed above, it is reasonable to sample δ on a logarithmic scale to implement hyperbolic discounting, in which case, g(δ)=1/δ. Let us suppose further that reward prediction error E is computed as the sum of r(t), the actual reward observed at time step t, and the change in V at time step t:

E(t) = r(t) + V(t) - V(t - \Delta t),  (5.2)

where we have chosen a discrete time interval Δt to acknowledge that the computation of value may take a substantial amount of time in the brain. For instance, Shankar et al. (2016) proposed that integrals over δ could be completed within a theta oscillation, suggesting Δt could be as long as a few hundred milliseconds. Now, consider slowly learning an association between an inherently neutral event x and a rewarding event y, separated by a fixed delay τ. Initially, y is unpredicted. When x is presented, there is no change in V. Similarly, V is zero both before and after y is presented. Because y is rewarding, the reward prediction error is positive around the time of presentation of y. After learning, immediately after x is presented, p(δ) includes the prediction for y, a time δ ≈ τ in the future. This means that V(t) changes abruptly around the time that x is presented, resulting in a positive reward prediction error. Now, after learning, consider a time τ after presentation of x. If the rewarding stimulus is omitted, negative reward prediction error is observed as the peak in p(δ) corresponding to y becomes increasingly truncated. However, if y is presented at the time it is expected, then the positive reward from y is balanced by a rapidly decreasing V(t). Note that because y does not predict itself at a short lag, observation of y abruptly decreases the prediction of itself.
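
A hedged sketch of equations 5.1 and 5.2 in code (our own illustration; the log-spaced δ grid, which corresponds to g(δ) = 1/δ and hence hyperbolic discounting, and the reward vector A are assumptions):

```python
import numpy as np

delta = np.geomspace(0.1, 100.0, 128)   # log-spaced internal future times
g = 1.0 / delta                         # compression factor for hyperbolic discounting

def value(p, A):
    """V(t) = int_0^inf A . p(delta; t) g(delta) ddelta    (eq. 5.1)."""
    return float(np.sum((A @ p) * g * np.gradient(delta)))

def reward_prediction_error(r_t, V_now, V_prev):
    """E(t) = r(t) + V(t) - V(t - Delta t)                  (eq. 5.2)."""
    return r_t + V_now - V_prev
```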

This approach aligns well with the classic understanding of reward prediction error with one very important exception. Rather than estimating expected future reward via temporal difference learning, predictions for an extended future are available at each moment. Unlike temporal difference learning algorithms, there is no sense in which value moves gradually along intermediate time points between x and y. This model thus has no difficulty accounting for the finding that value seems to rapidly “jump” between events (Pan, Schmidt, Paton, & Hyland, 2005).

5.3.2. Translation and Theta Oscillations

The algorithm described here relies on the ability to translate f~ toward the past. Shankar et al. (2016) suggested that hippocampal theta (4–12 Hz) oscillations could provide a mechanism for translation of temporal representations. The basic conjecture of that model for translation is that different values of δ map onto different phases of theta oscillations. If the time line δ maps onto different phases of the theta oscillation, this places a lower limit on the order of 100 ms on the time lines indexed by τ and δ. Theoretical and neurobiological considerations led Shankar et al. (2016) to the conclusion that δ ought to accelerate exponentially with the theta cycle, resulting in a logarithmic sampling of the δ axis.

This conjecture made sense of several neurophysiological findings, including the gradual ramping of firing in striatal neurons accompanied by phase precession with respect to theta recorded in the hippocampus, a brain structure that is relatively distant from the striatum (van der Meer & Redish, 2011). The fact that spikes in the striatum are organized by hippocampal theta suggests that theta oscillations reflect a computation that is extended over a significant part of the brain. The learning rule presented here, equation 2.7, describes changes in the strength of connections in C by noting the difference between p+ and p- at each value of δ. This suggests convergent connections between axons communicating M and C arriving at target neurons representing predicted future outcomes. Perhaps the coordination implied by the involvement of theta oscillations in prediction could lead to a difference in the timing of spikes communicated via M and C. Coupled with spike-timing-dependent plasticity, perhaps this could lead to the learning rule in equation 2.7.

Acknowledgments

This work was supported by NIBIB R01EB022864 and NSF IIS 1631460. We gratefully acknowledge inspiring conversations with Randy Gallistel and work in early stages of this project by Kostya Tiurev.

Appendix

A.1. A Formal Model for Temporal Record of the Past

Let multiple types of discrete events occur in continuous time. For each event type, we denote the signal by $f(t)$, where each event is represented by a Dirac delta function at the instant it occurs. For each event type, an array of leaky integrators, F, with a range of decay rates s, receives the signal as input:

$\frac{\partial}{\partial t} F(s;t) = -s\, F(s;t) + f(t).$ (A.1)

The array of leaky integrators F(s;t) encodes the real Laplace transform of the signal up to time t, where s is the Laplace domain variable. For each event type, an array of time cells f~(τ) approximately inverts the Laplace transform (see Post, 1930). This yields an estimate of the signal up to time t, at time offsets τ prior:

$\tilde{f}(\tau;t) = \tilde{f}(k/s;t) = \frac{(-1)^k}{k!}\, s^{k+1}\, \frac{\partial^k}{\partial s^k} F(s;t) = \mathbf{L}^{-1}_k F(s;t).$ (A.2)

The constant k is a sharpness parameter. As $k \to \infty$, the estimate $\tilde{f}(\tau)$ becomes precise, at the cost of infinite resources to implement the model. As stated in equation 1.2, for an arbitrary signal f,

$\tilde{f}(\tau;t) = \frac{1}{\tau} \int_{-\infty}^{t} f(\tau')\, \Phi_k\!\left(\frac{t - \tau'}{\tau}\right) d\tau'.$ (A.3)

In other words, for a given $\tau$, $\tilde{f}(\tau;t)$ is proportional to a causal convolution of the signal f with a kernel $\Phi_k$ that describes the smearing.
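As a rough illustration of equations A.1 and A.2, the sketch below integrates an array of leaky integrators driven by one delta-function event and then applies Post's formula numerically. The grid of s values, the choice k = 4, and the use of repeated finite differences (np.gradient) for the k-th derivative are our own illustrative choices, not the authors' implementation.

```python
# A minimal sketch, assuming a forward-Euler discretization of equation A.1 and a
# finite-difference version of the Post inversion in equation A.2.
import numpy as np
from math import factorial

k = 4                                   # sharpness parameter (illustrative)
s = np.logspace(-1, 2, 200)             # decay rates; internal time tau = k / s
dt, T = 0.01, 10.0                      # forward Euler is stable here since s * dt <= 1
event_times = [2.0]                     # one event (Dirac delta) at t = 2

F = np.zeros_like(s)
for step in range(int(T / dt)):
    t = step * dt
    F += dt * (-s * F)                  # dF/dt = -s F
    if any(abs(t - te) < dt / 2 for te in event_times):
        F += 1.0                        # the delta-function input adds 1 to every integrator

# Post approximation: f~(tau = k/s) = (-1)^k / k! * s^(k+1) * d^k F / d s^k
dkF = F.copy()
for _ in range(k):
    dkF = np.gradient(dkF, s)
f_tilde = (-1) ** k / factorial(k) * s ** (k + 1) * dkF
tau = k / s
# The estimate peaks somewhat below the true elapsed time (8 units), at roughly
# 8 * k / (k + 1); the peak approaches 8 as k grows.
print(tau[np.argmax(f_tilde)])
```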

A.2. Time Translation to Estimate the Future State of the Past

The future state of the memory (see equation 1.4) can be readily computed through translation in the Laplace domain:

$\tilde{f}_\delta(\tau;t) = \mathbf{L}^{-1}_k \mathbf{R}_\delta F(s;t) = \mathbf{L}^{-1}_k\, e^{-s\delta} F(s;t).$ (A.4)

Building a translation operator out of realistic neurons and synapses is a nontrivial but tractable problem. It has been proposed that the brain implements translation to various amounts δ by mapping δ onto different phases of theta oscillations (Shankar et al., 2016). Theta oscillations, a prominent 4–12 Hz rhythm in the local field potential, have long been believed to be crucial in the neurobiology of memory (Buzsáki, 2002; Hasselmo, Bodelón, & Wyble, 2002; Kahana, Seelig, & Madsen, 2001). Requiring scale invariance, and considering the problem from the perspective of individual neurons, leads to the conclusion that the sweep through δ should accelerate exponentially through the theta cycle.
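The translation in equation A.4 is just a pointwise multiplication in the Laplace domain before inversion. The sketch below illustrates this, taking F in closed form for a single event observed a known time ago; the inversion helper and all parameter values are illustrative assumptions.

```python
# A minimal sketch of equation A.4: multiply F(s;t) by exp(-s*delta) and then
# apply a finite-difference Post inversion.  The closed-form F and all numbers
# here are illustrative assumptions.
import numpy as np
from math import factorial

def post_invert(F, s, k):
    """Approximate inverse Laplace transform f~(tau = k/s) via Post's formula."""
    dkF = F.copy()
    for _ in range(k):
        dkF = np.gradient(dkF, s)       # k-th derivative with respect to s
    return (-1) ** k / factorial(k) * s ** (k + 1) * dkF

k, delta, elapsed = 4, 3.0, 5.0
s = np.logspace(-1, 2, 200)
F = np.exp(-s * elapsed)                # Laplace transform of an event `elapsed` ago
f_future = post_invert(np.exp(-s * delta) * F, s, k)   # R_delta, then inversion
tau = k / s
# The translated estimate behaves as if the event lay (elapsed + delta) in the past.
print(tau[np.argmax(f_future)])         # roughly (elapsed + delta) * k / (k + 1)
```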

A.3. Pairwise Association and Pairwise Prediction

The agent makes pairwise associations M between each pair of event types using Hebbian learning. As the agent experiences the world, the pairwise prediction m allows the agent to generate predictions for the future based on pairwise associations with the currently occurring events as cues. The pairwise prediction is a building block for the learning of the credit associations expC, from which the prediction p is derived. This section consists of two subsections. The first subsection motivates the form of the pairwise prediction m (see equations 1.10 to 1.12; in particular, the form of the integral in equation 1.11). The second subsection highlights and proves the numerical coincidence between M and m in a simple case, from which the normalization for m derives.

A.3.1. Equation for Pairwise Prediction

In this subsection, we motivate the equations for the pairwise prediction. We do this by showing that when memory is perfect, m reduces to the geometric mean of the elements of M associated with the events at time t, as desired.

Hebbian learning can be used to make pairwise associations M between events (see equation 1.5). The parameters of M are the possible cue, the possible outcome, and the internal time. The pairwise prediction m uses the pairwise associations M to make a prediction about future events based on possibly multiple currently occurring events. In other words, m serves to integrate cue–outcome pairwise information from multiple simultaneous cues in the present. M, in the definition for m, plays the role that expC does in the definition for p. In the algorithm, m serves the function of p+ (see equation 2.6).

The pairwise prediction m is computed when a set of events $\mathcal{E}_t$ occurs at time t, as follows:

$m_\beta(\delta;t) = \kappa_1 \exp\!\left[\frac{1}{|\mathcal{E}_t|} \sum_{\alpha \in \mathcal{E}_t} \int f_\delta^\alpha(\tau;t)\, \log M_{\alpha\beta}(\tau)\, d\tau\right], \quad \delta > 0,$ (A.5)

where the constant $\kappa_1 = \left(k\, e^{-\psi(k)}\right)^{k+1}$, $\psi(k)$ is the digamma function, and

$f_\delta^\alpha(\tau;t) = \Phi_k(\delta/\tau)/\tau$

denotes the future state of the memory element associated with the currently occurring episode of α. (For $k = 2$, $\kappa_1 \approx 2.3$; for $k = 8$, $\kappa_1 \approx 1.8$; as $k \to \infty$, $\kappa_1 \to \sqrt{e} \approx 1.65$.) The notation $|\mathcal{E}_t|$ denotes the number of elements in $\mathcal{E}_t$, that is, the number of events co-occurring at time t. The function $f_\delta^\alpha(\tau;t)$ is a gaussian-like function that peaks around $\tau = \delta$ and reflects the fact that at a time δ in the future, α would have occurred τ in the past and the memory $\tilde{f}^\alpha$ would reflect this.

We can motivate the form of the pairwise prediction m as follows. Let us imagine that memory were perfect such that events were localized in time exactly, so $\phi_\delta^\alpha(\tau) = \delta(\tau - \delta)\, I(\alpha \in \mathcal{E}_t)$, where $\phi_\delta^\alpha$ is the analog of $f_\delta^\alpha$ when memory is perfect (i.e., $k \to \infty$), $\delta(\cdot)$ is the Dirac delta function, and $I(\cdot)$ is the indicator function. We would then have

$\int \phi_\delta^\alpha(\tau;t)\, \log M_{\alpha\beta}(\tau)\, d\tau = I(\alpha \in \mathcal{E}_t) \int \delta(\tau - \delta)\, \log M_{\alpha\beta}(\tau)\, d\tau = I(\alpha \in \mathcal{E}_t)\, \log M_{\alpha\beta}(\delta).$

The integral on the right-hand side is a convolution of logMαβ with the delta function, which returns the former unchanged. In the case where exactly x occurs at time t,

$m_\beta(\delta;t) \propto \exp\!\left[\int \phi_\delta^X(\tau;t)\, \log M_{X\beta}(\tau)\, d\tau\right] = \exp\!\left[\log M_{X\beta}(\delta)\right] = M_{X\beta}(\delta),$ (A.6)

so m would be proportional to the appropriate elements of M. In the case where exactly x and y occur at time t,

$m_\beta(\delta;t) \propto \exp\!\left[\frac{1}{|\mathcal{E}_t|} \sum_{\alpha \in \mathcal{E}_t} \int \phi_\delta^\alpha(\tau;t)\, \log M_{\alpha\beta}(\tau)\, d\tau\right] = \exp\!\left[\tfrac{1}{2}\left(\log M_{X\beta}(\delta) + \log M_{Y\beta}(\delta)\right)\right] = \sqrt{M_{X\beta}(\delta)\, M_{Y\beta}(\delta)},$

so m would be proportional to the geometric mean of the elements of M associated with the events at time t.

In the model with fuzzy memory, $f_\delta$ approximates $\phi_\delta$, so the above relationships hold only approximately. In other words, in general, according to equations 1.10 to 1.12, $m_\beta(\delta;t)$ is not always exactly $M_{X\beta}(\delta)$ (when x occurs at time t) or $\sqrt{M_{X\beta}(\delta)\, M_{Y\beta}(\delta)}$ (when x and y occur at time t), and so on. Instead of using equations 1.10 to 1.12, which involve an integral, one could have defined $m_\beta(\delta;t)$ directly as the geometric mean of the relevant elements of $M_\beta(\delta)$. However, we do not do so. We use equations 1.10 to 1.12 because they closely parallel equations 2.2 to 2.3 for the prediction, for which the integral is necessary (see section A.4). Using equations of similar form is more neurobiologically realistic, because it suggests that analogous neural architecture supports the computation for both the pairwise prediction m and the prediction p. Integrals in time are straightforward to implement with neural networks.

In summary, we have shown that when memory is perfect, m reduces to the geometric mean of the elements of M associated with the events at time t, as desired. The strength of the equations underlying m is that they closely parallel those for p.
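A small numerical sketch of equation A.5 may help. Under the illustrative choices below (a single cue, log-spaced τ samples, simple trapezoid quadrature, and $\Phi_k$ written out explicitly as $\kappa_0 x^k e^{-kx}$), the computed m is close to the corresponding element of M, as lemma 3 in the next subsection guarantees.

```python
# A minimal sketch of equation A.5, under illustrative numerical choices.
import numpy as np
from math import factorial

k = 8
kappa0 = k ** (k + 1) / factorial(k)                              # normalization of Phi_k
psi_k = -0.5772156649015329 + sum(1.0 / n for n in range(1, k))   # digamma(k), integer k
kappa1 = (k * np.exp(-psi_k)) ** (k + 1)

Phi = lambda x: kappa0 * x ** k * np.exp(-k * x)

def pairwise_prediction(delta, tau_grid, M_rows):
    """m_beta(delta) given the rows M_{alpha beta}(tau) for the cues present now."""
    terms = []
    for M in M_rows:
        y = Phi(delta / tau_grid) / tau_grid * np.log(M(tau_grid))
        terms.append(np.sum((y[1:] + y[:-1]) / 2 * np.diff(tau_grid)))  # trapezoid rule
    return kappa1 * np.exp(np.mean(terms))

# One cue x predicting beta at a fixed lag of 4 time units.
t_xb = 4.0
M_xb = lambda tau: Phi(t_xb / tau) / tau
tau_grid = np.geomspace(0.05, 200.0, 4000)
print(pairwise_prediction(4.0, tau_grid, [M_xb]))      # approximately M_xb(4.0)
print(M_xb(4.0))
```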

A.3.2. Normalization for Pairwise Prediction

In the bulk of section A.3.1, we imagined that memory were perfect such that events were localized in time exactly. In the actual formulation, memory is fuzzy, and this is reflected in the form of $f_\delta$. Therefore, the above proportionality relationships do not hold exactly in general. However, even in the case of fuzzy memory, equation A.6 holds at the time of occurrence of x when x cues y at a fixed time interval and no event co-occurs with x. We will prove this result (in lemma 3), which suggests the value that the normalization constant $\kappa_1$ should take. Lemmas 1 and 2 evaluate integrals that are used to prove lemma 3.

Lemma 1.

If the constant a is positive and k is a nonnegative integer,

$\int_0^\infty \frac{e^{-a/x}}{x^{k+2}}\, dx = \frac{k!}{a^{k+1}}.$
Proof.

The integral is

$I = \int_0^\infty \frac{e^{-a/x}}{x^{k+2}}\, dx = \frac{1}{a} \int_0^\infty \frac{a\, e^{-a/x}}{x^2}\, x^{-k}\, dx.$

Noting that $\frac{d}{dx} e^{-a/x} = \frac{a}{x^2}\, e^{-a/x}$, we integrate by parts:

$u = x^{-k}, \quad dv = \frac{a\, e^{-a/x}}{x^2}\, dx, \quad du = -k\, x^{-k-1}\, dx, \quad v = e^{-a/x},$

so our integral is now

$I = \frac{1}{a}\left(\left[x^{-k} e^{-a/x}\right]_0^\infty + k \int_0^\infty \frac{e^{-a/x}}{x^{k+1}}\, dx\right) = \frac{k}{a} \cdot \frac{1}{a} \int_0^\infty \frac{a\, e^{-a/x}}{x^2}\, x^{-(k-1)}\, dx.$

If we were to integrate by parts again, we would have

$u = x^{-(k-1)}, \quad dv = \frac{a\, e^{-a/x}}{x^2}\, dx, \quad du = -(k-1)\, x^{-k}\, dx, \quad v = e^{-a/x}.$

Each ith iteration reduces the exponent on x in the denominator of the integrand by 1 and introduces a factor of $(k - i + 1)/a$, and k iterations are needed to go from $x^{k+2}$ to $x^2$ in the denominator of the integrand. Noting that $k(k-1)\cdots 2 \cdot 1 = k!$, we thus have

$I = \frac{k!}{a^k} \cdot \frac{1}{a} \int_0^\infty \frac{a\, e^{-a/x}}{x^2}\, dx = \frac{k!}{a^k} \cdot \frac{1}{a}\left[e^{-a/x}\right]_0^\infty = \frac{k!}{a^{k+1}}.$
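The identity can also be checked numerically; the sketch below (illustrative, not part of the proof) compares direct quadrature of the left-hand side against the closed form for one choice of a and k.

```python
# A numerical sanity check of lemma 1 (illustrative values a = 3, k = 4).
import numpy as np
from math import factorial
from scipy.integrate import quad

a, k = 3.0, 4
numeric, _ = quad(lambda x: np.exp(-a / x) / x ** (k + 2), 0, np.inf)
print(numeric, factorial(k) / a ** (k + 1))   # both about 0.0988
```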
Lemma 2.

If the constants A, a, and b are positive and k and m are positive integers, then

$\int_0^\infty \frac{e^{-a/x}}{x^{k+1}} \log\!\left(\frac{A\, e^{-b/x}}{x^m}\right) dx = \frac{(k-1)!}{a^k}\left[m\, \psi(k) - \frac{kb}{a} + \log\frac{A}{a^m}\right],$

where ψk is the digamma function.

Proof.

The integrand is

$\frac{e^{-a/x}}{x^{k+1}} \log\!\left(\frac{A\, e^{-b/x}}{x^m}\right) = \frac{e^{-a/x}}{x^{k+1}} \log A - \frac{b\, e^{-a/x}}{x^{k+2}} - \frac{m\, e^{-a/x}}{x^{k+1}} \log x.$

We integrate term by term. Applying lemma 1, the first term is

$\log A \int_0^\infty \frac{e^{-a/x}}{x^{k+1}}\, dx = \log A\, \frac{(k-1)!}{a^k},$

noting that k above is at least 1, and the second term is

$-b \int_0^\infty \frac{e^{-a/x}}{x^{k+2}}\, dx = -\frac{b\, k!}{a^{k+1}}.$

The third term is

$-m \int_0^\infty \frac{e^{-a/x}}{x^{k+1}} \log x\, dx, \quad k \geq 1.$

Substituting $t = a/x$, $dt = -(a/x^2)\, dx$, we have

$-\frac{m}{a^{k-1}} \int_0^\infty t^{k-1} e^{-t} \log\frac{a}{t}\, \frac{1}{a}\, dt = \frac{m}{a^k} \int_0^\infty t^{k-1} e^{-t} \log\frac{t}{a}\, dt = \frac{m}{a^k}\left[-\log a \int_0^\infty t^{k-1} e^{-t}\, dt + \int_0^\infty t^{k-1} e^{-t} \log t\, dt\right] = \frac{m}{a^k}\left[-\log a\, \Gamma(k) + \Gamma'(k)\right] = \frac{m}{a^k}\, \Gamma(k)\left[\psi(k) - \log a\right] = \frac{m}{a^k}\, (k-1)!\left[\psi(k) - \log a\right],$

where we have applied equations 5.2.1, 5.9.19, and 5.2.2 from DLMF (2021), and used the fact that k is a positive integer. Putting everything together, we have

$\int_0^\infty \frac{e^{-a/x}}{x^{k+1}} \log\!\left(\frac{A\, e^{-b/x}}{x^m}\right) dx = \log A\, \frac{(k-1)!}{a^k} - \frac{b\, k!}{a^{k+1}} + \frac{m}{a^k}\, (k-1)!\left[\psi(k) - \log a\right] = \frac{(k-1)!}{a^k}\left[\log A - \frac{kb}{a} + m\, \psi(k) - m \log a\right] = \frac{(k-1)!}{a^k}\left[m\, \psi(k) - \frac{kb}{a} + \log\frac{A}{a^m}\right].$
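As with lemma 1, the closed form can be checked by quadrature; the sketch below uses one arbitrary choice of constants (the log of the integrand is expanded so that the numerics stay finite near x = 0).

```python
# A numerical sanity check of lemma 2 (illustrative constants).
import numpy as np
from math import factorial, log
from scipy.special import digamma
from scipy.integrate import quad

A, a, b, k, m = 2.0, 3.0, 1.5, 4, 5
integrand = lambda x: np.exp(-a / x) / x ** (k + 1) * (np.log(A) - b / x - m * np.log(x))
lhs, _ = quad(integrand, 0, np.inf)
rhs = factorial(k - 1) / a ** k * (m * digamma(k) - k * b / a + log(A / a ** m))
print(lhs, rhs)   # should agree to several decimal places
```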
Lemma 3.

Let event i cue event j with a fixed time interval tij, and let no event co-occur with event i. After training, at the time event i occurs,

$m_j(\delta) = M_{ij}(\delta),$

for all δ>0.

Proof.

On the right-hand side, we have

$M_{ij}(\delta) = \frac{K\, t_{ij}^k}{\delta^{k+1}}\, e^{-k t_{ij}/\delta}, \quad \delta > 0,$

where $K = \frac{k^{k+1}}{k!}$.

To compute the left-hand side, we note that

$f_\delta^i(\tau) = \frac{K\, \delta^k}{\tau^{k+1}}\, e^{-k\delta/\tau}, \quad \tau, \delta > 0.$

Then

$\int f_\delta^i(\tau)\, \log M_{ij}(\tau)\, d\tau = \int_0^\infty \frac{K\, \delta^k}{\tau^{k+1}}\, e^{-k\delta/\tau} \log\!\left(\frac{K\, t_{ij}^k}{\tau^{k+1}}\, e^{-k t_{ij}/\tau}\right) d\tau$ (A.7)
$= K\, \delta^k \int_0^\infty \frac{e^{-k\delta/\tau}}{\tau^{k+1}} \log\!\left(\frac{K\, t_{ij}^k}{\tau^{k+1}}\, e^{-k t_{ij}/\tau}\right) d\tau.$ (A.8)

Substituting $a = k\delta$, $b = k t_{ij}$, $A = K t_{ij}^k$, $m = k + 1$ into lemma 2, the above evaluates to

$\int f_\delta^i(\tau)\, \log M_{ij}(\tau)\, d\tau = K\, \delta^k\, \frac{(k-1)!}{(k\delta)^k}\left[(k+1)\, \psi(k) - \frac{k^2 t_{ij}}{k\delta} + \log\frac{K\, t_{ij}^k}{(k\delta)^{k+1}}\right] = (k+1)\, \psi(k) - \frac{k\, t_{ij}}{\delta} + \log\frac{K\, t_{ij}^k}{\delta^{k+1}} - (k+1)\log k = \log\frac{K\, t_{ij}^k}{\delta^{k+1}} - \frac{k\, t_{ij}}{\delta} - \log \kappa_1.$

Thus, the left-hand side of the lemma is

$m_j(\delta) = \kappa_1 \exp\!\left[\frac{1}{|\mathcal{E}_t|} \sum_{\alpha \in \mathcal{E}_t} \int f_\delta^\alpha(\tau;t)\, \log M_{\alpha j}(\tau)\, d\tau\right] = \kappa_1 \exp\!\left[\int f_\delta^i(\tau;t)\, \log M_{ij}(\tau)\, d\tau\right] = \kappa_1 \exp\!\left[\log\frac{K\, t_{ij}^k}{\delta^{k+1}} - \frac{k\, t_{ij}}{\delta} - \log \kappa_1\right] = \frac{K\, t_{ij}^k}{\delta^{k+1}}\, e^{-k t_{ij}/\delta} = M_{ij}(\delta).$

Our choice of $\kappa_1 = \left(k\, e^{-\psi(k)}\right)^{k+1}$ allowed the equality $m_j(\delta) = M_{ij}(\delta)$ to hold when event i cues event j with a fixed time interval and no event co-occurs with event i, without any additional constant of proportionality in the equation.

Remark.

The conclusion of lemma 3 does not hold in general if the delay interval is not fixed. Let i and then j occur at times 1 or 2 apart (with equal probability). Let $k = 2$ (for ease of calculation), so $K = k^{k+1}/k! = 4$. We have

$M_{ij}(\tau) = \frac{\int \tilde{f}^i_{\delta=0}(\tau;t)\, f_j(t)\, dt}{\int f_i(t)\, dt} = \frac{1}{2}\, \frac{K}{\tau^{k+1}}\left(e^{-k/\tau} + 2^k e^{-2k/\tau}\right) = \frac{2}{\tau^3}\left(e^{-2/\tau} + 4 e^{-4/\tau}\right),$
$M_{ij}(\tau = 1) = 2\left(e^{-2} + 4 e^{-4}\right) = 0.417195\ldots,$

and as before,

$f_\delta^i(\tau) = \frac{4\delta^2}{\tau^3}\, e^{-2\delta/\tau}.$

But when $\delta = 1$,

$\int f_\delta^i(\tau)\, \log M_{ij}(\tau)\, d\tau = 4\delta^2 \int_0^\infty \frac{e^{-2\delta/\tau}}{\tau^3} \log\!\left[\frac{2}{\tau^3}\left(e^{-2/\tau} + 4 e^{-4/\tau}\right)\right] d\tau = -1.51366\ldots,$
$m_j(\delta = 1) = \kappa_1 \exp(-1.51366\ldots) = 8\, e^{-3(1-\gamma)} \exp(-1.51366\ldots) = 0.495310\ldots.$

Since the two values ($M_{ij}(\tau = 1) = 0.417195\ldots$ and $m_j(\delta = 1) = 0.495310\ldots$) are different, the equality $m_j(\delta) = M_{ij}(\delta)$ does not hold in general.
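The two numbers in the remark are straightforward to reproduce numerically; the sketch below does so by quadrature (the log of $M_{ij}$ is expanded so the integrand stays finite at small τ).

```python
# Numerical reproduction of the remark's counterexample (k = 2, delays 1 or 2).
import numpy as np
from scipy.integrate import quad

gamma = 0.5772156649015329                       # Euler-Mascheroni constant
kappa1 = 8 * np.exp(-3 * (1 - gamma))            # (k exp(-psi(k)))**(k+1) for k = 2

M_ij = lambda tau: 2 / tau ** 3 * (np.exp(-2 / tau) + 4 * np.exp(-4 / tau))
log_M = lambda tau: np.log(2) - 3 * np.log(tau) - 2 / tau + np.log1p(4 * np.exp(-2 / tau))
f_delta1 = lambda tau: 4 / tau ** 3 * np.exp(-2 / tau)     # f_delta^i(tau) at delta = 1

integral, _ = quad(lambda tau: f_delta1(tau) * log_M(tau), 0, np.inf)
print(M_ij(1.0))                  # ~0.4172
print(integral)                   # ~-1.514
print(kappa1 * np.exp(integral))  # ~0.4953
```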

A.4. Credit Association and Prediction

The agent maintains credit associations expC between each pair of event types, which estimates the multiplier for the agent's belief about the rate of each outcome whenever it sees a potential cue. The agent maintains predictions p, a time line of future events based on the credit associations. As the agent experiences the world, expC and p are iteratively updated.

This section motivates the form of the prediction p (see equations 2.2 and 2.3), by showing that the integral approximately subtracts the time elapsed since each cue, from the time delay between that cue and the outcome of interest. The integral thus produces a function of δ that peaks approximately at the time remaining to the outcome. This section consists of three subsections. First, we calculate the projected memory f~δ, which is an element in the prediction p, in the case of perfect memory. Then we calculate the prediction p in the case of perfect memory. Finally, we discuss the case of fuzzy memory.

A.4.1. Perfect Memory: Projected Memory

We proposed the following form for the prediction:

$p_\beta(\delta;t) = \Lambda_\beta \exp\!\left[\sum_{E \in \mathcal{E}} \int C_{E\beta}(\tau)\, \tilde{f}_\delta^E(\tau;t)\, d\tau\right],$

where $\Lambda_\beta$ denotes the long-term average rate of event type β, and $\mathcal{E}$ denotes the set of possible cue types. We motivate the above functional form by deriving the prediction in the case of perfect memory (i.e., $k \to \infty$) and discrete events. To do so, we need to find the projected memory $\tilde{f}_\delta^E$ for event type E. Note (see equation 1.1) that

$\lim_{k \to \infty} \Phi_k(x) = \delta(x - 1),$

where $\delta(\cdot)$ represents the Dirac delta function. Let the input function f be a series of discrete events of type $e_i$ occurring at times $t_i < t$,

$f_E(\tau') = \sum_i \delta(t_i - \tau')\, I(e_i = E).$

If we imagine that memory were perfect, then from equation 1.4, the projected memory would be represented as

$\tilde{\phi}_\delta^E(\tau;t) = \lim_{k \to \infty} \frac{1}{\tau} \int_{-\infty}^{t} f_E(\tau')\, \Phi_k\!\left(\frac{t + \delta - \tau'}{\tau}\right) d\tau' = \frac{1}{\tau} \int_{-\infty}^{t} \sum_i \delta(t_i - \tau')\, I(e_i = E)\, \delta\!\left(\frac{t + \delta - \tau'}{\tau} - 1\right) d\tau' = \sum_i I(e_i = E) \int_{-\infty}^{t} \delta(t_i - \tau')\, \frac{1}{\tau}\, \delta\!\left(\frac{t + \delta - \tau' - \tau}{\tau}\right) d\tau' = \sum_i I(e_i = E) \int_{-\infty}^{t} \delta(t_i - \tau')\, \delta(t + \delta - \tau' - \tau)\, d\tau' = \sum_i \delta(t - t_i + \delta - \tau)\, I(e_i = E),$

where $\tilde{\phi}_\delta$ is the analog of $\tilde{f}_\delta$ when memory is perfect, and we have used the property $\delta(\alpha x) = \delta(x)/\alpha$ (for $\alpha > 0$) of the Dirac delta function.

A.4.2. Perfect Memory: Prediction

We are now ready to derive the prediction:

$p_\beta(\delta;t) = \Lambda_\beta \exp\!\left[\sum_{E \in \mathcal{E}} \int C_{E\beta}(\tau)\, \tilde{\phi}_\delta^E(\tau;t)\, d\tau\right] = \Lambda_\beta \exp\!\left[\sum_i \sum_{E \in \mathcal{E}} \int C_{E\beta}(\tau)\, \delta(t - t_i + \delta - \tau)\, I(e_i = E)\, d\tau\right].$

Consider the summand associated with i=1:

$\sum_{E \in \mathcal{E}} \int C_{E\beta}(\tau)\, \delta(t - t_1 + \delta - \tau)\, I(e_1 = E)\, d\tau = \int C_{e_1\beta}(\tau)\, \delta(t - t_1 + \delta - \tau)\, d\tau.$

The above integral is a convolution of $C_{e_1\beta}(\delta)$ with $\delta\big(\delta - (-(t - t_1))\big) = \delta(t - t_1 + \delta)$. The result is a translation of $C_{e_1\beta}(\delta)$ by $-(t - t_1)$, the time interval since event $i = 1$:

$\int C_{e_1\beta}(\tau)\, \delta(t - t_1 + \delta - \tau)\, d\tau = C_{e_1\beta}(t - t_1 + \delta).$

This makes sense because $C_{e_1\beta}$ peaks at the time that β is expected after having observed $e_1$ (say, $\tau$). Since $e_1$ occurred $t - t_1$ ago, the view of $C_{e_1\beta}$ that matters for the prediction should be translated by $-(t - t_1)$. This view peaks at $\tau - (t - t_1)$, which is the time remaining until β is expected on the basis of $e_1$. (For example, if β is expected five time units after $e_1$ ($C_{e_1\beta}(\delta)$ peaks at $\delta = 5$) and $e_1$ occurred three time units ago, then β is expected in two time units ($C_{e_1\beta}(t - t_1 + \delta) = C_{e_1\beta}(3 + \delta)$ peaks at $\delta = 2$).)

If there is only one event episode, indexed by i=1,

$p_\beta(\delta;t) = \Lambda_\beta \exp\!\left[C_{e_1\beta}(t - t_1 + \delta)\right].$

Thus, the prediction $p_\beta$ will also peak at $\delta = \tau - (t - t_1)$. (Continuing with the example, the prediction will also peak at $\delta = 2$.) The form of the prediction makes sense. The quantity $\Lambda_\beta$ is the rate of occurrence of β without any evidence, or the agent's prior belief. The quantity $\exp C_{e_1\beta}$ is the adjustment factor for the rate of β due to $e_1$. The product is the new, adjusted estimate for the rate of β.

If there are exactly two event types with one episode each,

$p_\beta(\delta;t) = \Lambda_\beta \exp\!\left[C_{e_1\beta}(t - t_1 + \delta)\right] \exp\!\left[C_{e_2\beta}(t - t_2 + \delta)\right],$

and so on. In other words, the adjustment factors for different event types multiply, as would be desired.

A.4.3. Fuzzy Memory

Previously, we have assumed that memory is perfect to illustrate the principles behind the form of the prediction p. However, memory is fuzzy in real, resource-bounded systems, so the foregoing comments apply only approximately. The form of the integral in the prediction,

$\int C_{E\beta}(\tau)\, \tilde{f}_\delta^E(\tau;t)\, d\tau,$

is not a convolution since the width of $\tilde{f}_\delta^E$ increases with δ. However, the purpose of the integral is to approximate a convolution, so that the relevant view of $C_{E\beta}$ can be used to generate a prediction. For example, if E occurred three time units ago, $\tilde{f}_\delta^E(\tau)$ would peak at $\tau = \delta + 3$ units (as δ increases, $\tilde{f}_\delta^E(\tau)$ becomes increasingly wider). If E cues β with a delay of five time units, $C_{E\beta}(\tau)$ would peak at around five time units. The integral attains its largest value approximately when the peaks of $\tilde{f}_\delta^E(\tau)$ and $C_{E\beta}(\tau)$ are aligned, which is around $\delta = 2$. This makes sense, because β is expected in two time units from the present.

In summary, we have shown that the integral in equation 2.3 approximately subtracts the time elapsed since each cue, from the time delay between that cue and the outcome of interest. The integral thus produces a function of δ that peaks approximately at the time remaining to the outcome.
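The alignment argument above is easy to see numerically. In the sketch below, cue E occurred 3 time units ago and predicts β at a lag of 5; the integral is largest for δ close to 2 (slightly less, because the projected kernel widens with δ). The bump used for C, the value k = 32, and the quadrature grid are illustrative assumptions, not the model's learned quantities.

```python
# A minimal sketch of the integral in equation 2.3 for the example in the text.
import numpy as np
from math import factorial

k = 32                                            # illustrative sharpness
kappa0 = k ** (k + 1) / factorial(k)
Phi = lambda x: kappa0 * x ** k * np.exp(-k * x)

tau = np.geomspace(0.05, 200.0, 4000)
elapsed, lag = 3.0, 5.0
C = Phi(lag / tau) / tau                          # stand-in credit: beta expected 5 after E

def overlap(delta):
    f_proj = Phi((delta + elapsed) / tau) / tau   # projected memory of E, peaks near tau = delta + 3
    y = C * f_proj
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(tau))   # trapezoid rule

deltas = np.linspace(0.1, 8.0, 80)
vals = [overlap(d) for d in deltas]
print(deltas[int(np.argmax(vals))])               # around 2 (slightly below)
```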

A.5. Worked Examples

In section 3.2, we noted that fuzzy memory induces fuzziness in the prediction p and that, for every given snapshot in time, the latter could equivalently have been induced by intrinsic temporal uncertainty in the input in an agent with perfect memory. In section 3.4, we noted that, other things equal, cues closer in time to outcomes tend to receive more credit for those outcomes in the case of fuzzy memory. We provided one example of each for illustration. Here we provide the mathematical details, first for the example in section 3.2 and then for the example in section 3.4.

A.5.1. Worked Example 1: Forward Conditioning

For this example, event x cues event y with a fixed delay $\tau^*$ (written with a star to distinguish it from the internal-time argument τ). We work out the agent's prediction for y at the time x occurs.

The pairwise association is given in equation 1.6,

$M_{XY}(\tau) = \Phi_k(\tau^*/\tau)/\tau.$

By lemma 3, when x occurs,

$m_Y(\tau) = M_{XY}(\tau) = \Phi_k(\tau^*/\tau)/\tau.$

When x occurs, the memory is empty, so

$(p^-)_Y = \Lambda_Y.$

Thus, after training, the credit due to x for y is

$\exp C_{XY}(\tau) = \frac{p^+(\tau)}{p^-(\tau)} = \frac{\Phi_k(\tau^*/\tau)}{\Lambda_Y\, \tau}.$

The projected memory for x when x occurs is

$\tilde{f}_\delta^X(\tau) = \Phi_k(\delta/\tau)/\tau.$

The prediction for y at the time x occurs is

$p_Y(\delta) = \Lambda_Y \exp\!\left[\sum_{E \in \mathcal{E}} \int C_{EY}(\tau)\, \tilde{f}_\delta^E(\tau)\, d\tau\right] = \Lambda_Y \exp\!\left[\int C_{XY}(\tau)\, \tilde{f}_\delta^X(\tau)\, d\tau\right].$ (A.9)

The integral is

$\int C_{XY}(\tau)\, \tilde{f}_\delta^X(\tau)\, d\tau = \int \frac{\Phi_k(\delta/\tau)}{\tau} \log\frac{\Phi_k(\tau^*/\tau)}{\Lambda_Y\, \tau}\, d\tau = \kappa_0\, \delta^k \int \frac{e^{-k\delta/\tau}}{\tau^{k+1}} \log\frac{\kappa_0\, {\tau^*}^k\, e^{-k\tau^*/\tau}}{\Lambda_Y\, \tau^{k+1}}\, d\tau.$ (A.10)

By lemma 2, we have

$\int_0^\infty \frac{e^{-a/x}}{x^{k+1}} \log\!\left(\frac{A\, e^{-b/x}}{x^m}\right) dx = \frac{(k-1)!}{a^k}\left[m\, \psi(k) - \frac{kb}{a} + \log\frac{A}{a^m}\right],$

so we substitute $a = k\delta$, $A = \kappa_0\, {\tau^*}^k/\Lambda_Y$, $b = k\tau^*$, and $m = k + 1$ to find

$\int_0^\infty \frac{e^{-k\delta/\tau}}{\tau^{k+1}} \log\frac{\kappa_0\, {\tau^*}^k\, e^{-k\tau^*/\tau}}{\Lambda_Y\, \tau^{k+1}}\, d\tau = \frac{(k-1)!}{(k\delta)^k}\left[(k+1)\, \psi(k) - \frac{k\tau^*}{\delta} + \log\frac{\kappa_0\, {\tau^*}^k}{\Lambda_Y\, (k\delta)^{k+1}}\right].$

Therefore,

$\int C_{XY}(\tau)\, \tilde{f}_\delta^X(\tau)\, d\tau = \kappa_0\, \delta^k\, \frac{(k-1)!}{(k\delta)^k}\left[(k+1)\, \psi(k) - \frac{k\tau^*}{\delta} + \log\frac{{\tau^*}^k}{\Lambda_Y\, k!\, \delta^{k+1}}\right] = (k+1)\, \psi(k) - \frac{k\tau^*}{\delta} + \log\frac{{\tau^*}^k}{\Lambda_Y\, k!\, \delta^{k+1}},$

and the prediction for y at the time x occurs is

$p_Y(\delta) = \Lambda_Y \exp\!\left[\int C_{XY}(\tau)\, \tilde{f}_\delta^X(\tau)\, d\tau\right] = \Lambda_Y \exp\!\left[(k+1)\, \psi(k) - \frac{k\tau^*}{\delta} + \log\frac{{\tau^*}^k}{\Lambda_Y\, k!\, \delta^{k+1}}\right] = \Lambda_Y\, e^{(k+1)\psi(k)}\, \frac{{\tau^*}^k}{\Lambda_Y\, k!\, \delta^{k+1}}\, e^{-k\tau^*/\delta} = \frac{e^{(k+1)\psi(k)}}{k!}\, \frac{{\tau^*}^k}{\delta^{k+1}}\, e^{-k\tau^*/\delta} = \frac{e^{(k+1)\psi(k)}}{k^{k+1}}\, \frac{1}{\delta}\, \kappa_0\left(\frac{\tau^*}{\delta}\right)^k e^{-k\tau^*/\delta} = \kappa_1^{-1}\, \frac{1}{\delta}\, \Phi_k\!\left(\frac{\tau^*}{\delta}\right).$

This is equation 3.1; section 3.2 makes use of this result.
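Equation 3.1 can also be verified numerically: the sketch below evaluates the defining integral for $p_Y$ by quadrature and compares it with the closed form $\kappa_1^{-1}\Phi_k(\tau^*/\delta)/\delta$. The values of $\tau^*$, $\Lambda_Y$, k, and δ are arbitrary illustrative choices.

```python
# Numerical check of worked example 1 (equation 3.1) for illustrative parameters.
import numpy as np
from math import factorial

k, tau_star, Lambda_Y, delta = 8, 4.0, 0.2, 3.0
kappa0 = k ** (k + 1) / factorial(k)
psi_k = -0.5772156649015329 + sum(1.0 / n for n in range(1, k))   # digamma(k), integer k
kappa1 = (k * np.exp(-psi_k)) ** (k + 1)
Phi = lambda x: kappa0 * x ** k * np.exp(-k * x)

tau = np.geomspace(0.05, 400.0, 6000)
# C_XY(tau) = log[Phi(tau*/tau) / (Lambda_Y * tau)], expanded so it stays finite at small tau.
C_XY = np.log(kappa0) + k * np.log(tau_star / tau) - k * tau_star / tau - np.log(Lambda_Y * tau)
f_proj = Phi(delta / tau) / tau                   # projected memory of x at lag delta

y = C_XY * f_proj
integral = np.sum((y[1:] + y[:-1]) / 2 * np.diff(tau))   # trapezoid rule
print(Lambda_Y * np.exp(integral))                # numerical p_Y(delta)
print(Phi(tau_star / delta) / (kappa1 * delta))   # closed form kappa_1^{-1} Phi(tau*/delta)/delta
```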

A.5.2. Worked Example 2: Credit and Temporal Proximity

For this example, event x cues event y and event z with fixed delays of $2 - t_{YZ}$ and 2 time units, respectively (relative to the time of occurrence of x). We work out $\exp C_{YZ}$ as a function of $t_{YZ}$. The significance of this result is discussed in section 3.4.

We need $p^{+Y}_Z$, which we write as shorthand for "the degree to which z appears in $p^+$ due to y," and $p^-_Z$ at the time y occurs,

$p^-_Z = \Lambda_Z \exp\!\left[\sum_{E \in \mathcal{E}} \int C_{EZ}(\tau)\, \tilde{f}_\delta^E(\tau; t_Y^-)\, d\tau\right],$

where $t_Y^-$ refers to the moment just before y occurs. At that time, the memory contains only x, which occurred $(2 - t_{YZ})$ time units ago, so

$p^-_Z = \Lambda_Z \exp\!\left[\int C_{XZ}(\tau)\, \tilde{f}_\delta^X(\tau; t_Y^-)\, d\tau\right].$ (A.11)

The credit due to x for z is

$\exp C_{XZ}(\tau) = \frac{\Phi_k(2/\tau)}{\Lambda_Z\, \tau},$

and the projected memory for x when y occurs is

$\tilde{f}_\delta^X(\tau) = \Phi_k\!\left(\frac{\delta + 2 - t_{YZ}}{\tau}\right)\Big/\tau.$

Thus, the integral is

$\int C_{XZ}(\tau)\, \tilde{f}_\delta^X(\tau; t_Y^-)\, d\tau = \int \frac{1}{\tau}\, \Phi_k\!\left(\frac{\delta + 2 - t_{YZ}}{\tau}\right) \log\frac{\Phi_k(2/\tau)}{\Lambda_Z\, \tau}\, d\tau.$

This integral is the same as equation A.10 in the previous example with $\delta$, $\tau^*$, and $\Lambda_Y$ replaced by $\delta + 2 - t_{YZ}$, 2, and $\Lambda_Z$, respectively. The same relationship holds between $p^-_Z$ here (see equation A.11) and $p_Y$ in the previous example (see equation A.9). Referring to the previous result, we thus have

$p^-_Z(\delta) = \frac{\kappa_1^{-1}}{\delta + 2 - t_{YZ}}\, \Phi_k\!\left(\frac{2}{\delta + 2 - t_{YZ}}\right).$

On the other hand,

$p^{+Y}_Z(\delta) = m_Z(\delta) = M_{YZ}(\delta) = \Phi_k(t_{YZ}/\delta)/\delta.$

Thus, after training, the credit due to y for z is

$\exp C_{YZ}(\delta) = \frac{p^{+Y}_Z(\delta)}{p^-_Z(\delta)} = \frac{\Phi_k(t_{YZ}/\delta)/\delta}{\dfrac{\kappa_1^{-1}}{\delta + 2 - t_{YZ}}\, \Phi_k\!\left(\dfrac{2}{\delta + 2 - t_{YZ}}\right)} = \frac{(t_{YZ}/\delta)^k\, e^{-k t_{YZ}/\delta}/\delta}{\dfrac{\kappa_1^{-1}}{\delta + 2 - t_{YZ}}\left(\dfrac{2}{\delta + 2 - t_{YZ}}\right)^k e^{-\frac{2k}{\delta + 2 - t_{YZ}}}} = \kappa_1\left(\frac{t_{YZ}}{2}\right)^k\left(\frac{\delta + 2 - t_{YZ}}{\delta}\right)^{k+1} \exp\!\left[-k\left(\frac{t_{YZ}}{\delta} - \frac{2}{\delta + 2 - t_{YZ}}\right)\right].$

Figure 8 plots this expression for various values of $t_{YZ}$. See section 3.4 for a discussion.
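For reference, the closed form just derived is easy to evaluate; the short sketch below computes $\exp C_{YZ}$ at the expected y-to-z lag for a few values of $t_{YZ}$ (k = 8 is an illustrative choice). Consistent with the discussion in section 3.4, the credit at the expected lag is larger when y is closer in time to z.

```python
# Evaluating the closed form for exp(C_YZ) derived above (illustrative k = 8).
import numpy as np

k = 8
psi_k = -0.5772156649015329 + sum(1.0 / n for n in range(1, k))
kappa1 = (k * np.exp(-psi_k)) ** (k + 1)

def expC_YZ(delta, t_yz):
    shifted = delta + 2.0 - t_yz
    return (kappa1 * (t_yz / 2.0) ** k * (shifted / delta) ** (k + 1)
            * np.exp(-k * (t_yz / delta - 2.0 / shifted)))

for t_yz in (0.5, 1.0, 1.5):
    print(t_yz, expC_YZ(delta=t_yz, t_yz=t_yz))   # equals 2 * kappa1 / t_yz at delta = t_yz
```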

A.6. Demonstration: Methods

Given a time-ordered set of events $[e_1, e_2, \ldots, e_n]$, where each $e_i = (x_i, t_i)$ comprises a discrete-valued type and a real-valued timestamp, we are interested in predicting the type $x_{n+1}$ of the next event given its time of occurrence $t_{n+1}$. In the demonstration, we apply the prediction algorithm ("C-based") to a superposition of independent MRPs and compare its predictions to those of a pairwise event association model ("M-based"). In the simulation, both the C- and M-based predictors have memories spanning $10^{-5}$ to 80 time units into the past, each covered by 200 log-spaced memory nodes.

Within each MRP, the probability of the type and time of an event depends solely on the type of the most recent past event, so for MRP k,

$P\big((x_{n+1}^k, t_{n+1}^k)\,\big|\,\{(x_m^k, t_m^k)\}_{m \leq n}\big) = P\big((x_{n+1}^k, t_{n+1}^k)\,\big|\,x_n^k\big),$

where $t_{n+1}^k > t_n^k$. The set of event types within each MRP is discrete and finite, while transition times $\Delta t_{n+1} = t_{n+1} - t_n > 0$ are real and strictly positive; this allows only one event to occur at a given time. Within each MRP, the probability of the type of the next event is given by the transition matrix,

$P_{ij} = P(x_{n+1}^k = j \mid x_n^k = i) = \begin{pmatrix} 0.05 & 0.75 & 0.2 \\ 0.2 & 0.05 & 0.75 \\ 0.75 & 0.2 & 0.05 \end{pmatrix}.$

The transition times from i to j in MRP k follow a truncated normal distribution $\mathcal{N}(\mu_{ij}^k, (\sigma_{ij}^k)^2)$, with a lower-bound cutoff of $10^{-5}$ (to ensure positivity).

We use two approaches that generate superposed processes differently. We discuss the first approach, used for Figure 9c. The means $\mu_{ij}^k$ and variances $(\sigma_{ij}^k)^2$ of the transition time distributions are drawn uniformly from the intervals (0, 10) and (0, 2), respectively. The same values are used across all six runs of the simulation. For each run, we generate exactly seven MRPs, labeled $k = 1, \ldots, 7$, each with 500 event episodes. We then construct seven superposed processes from the MRPs as follows. The first superposed process consists of one MRP, namely, the MRP $k = 1$; the second superposed process consists of two MRPs, namely, the MRPs with $k = 1$ and $k = 2$; and so on. Each component MRP has three types of events, so the total number of event types in the superposed process is 3N, where N is the number of MRPs superposed.

We now discuss the second approach, used for Figure 9d. We draw exactly one set of transition time distribution parameters $\mu_{ij}$ and $\sigma_{ij}^2$ as before. This same set of parameters is used across all six runs of the simulation, and for all MRPs, $k = 1, \ldots, 7$. We generate exactly seven MRPs of 20,000 event episodes each and generate seven superposed processes therefrom by incrementally superposing the MRPs as in the first approach. Every MRP has three types of events (u, v, w). In the superposed processes, the event types are not distinguished according to the MRP of origin (e.g., a u from one MRP and a u from another MRP are both of type u in the superposed process). Thus, in contrast to the previous approach, the algorithms observe only three types of events in the superposed MRPs.

In both Figures 9c and 9d, 80% of each superposed process is used for training and the rest for testing. For the C-based prediction, accuracy on the test set is computed by checking whether, at every time $t_n$ that an event occurs, the prediction evaluated at $t_{n+1}$, the time of the next event, $\arg\max_i p_i(\delta = \Delta t_{n+1}; t = t_n)$, matches the event that actually occurs at that time. For the M-based prediction, the computation is analogous, except the prediction is found via $\arg\max_i m_i(\delta = \Delta t_{n+1}; t = t_n)$, with the cue being $j = x_n$, the type of the event at $t_n$. The simulation is run six times and the average accuracy is reported.
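To make the generative process concrete, the sketch below builds a superposed stream of MRPs along the lines of the first approach (per-MRP transition-time parameters, event types tagged by MRP of origin) and splits it 80/20 for training and testing. It does not reproduce the C- or M-based predictors; the seed, the number of MRPs, and the clipping used in place of a proper truncated normal are illustrative simplifications.

```python
# A minimal sketch of the superposed-MRP generative process (first approach).
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.05, 0.75, 0.20],
              [0.20, 0.05, 0.75],
              [0.75, 0.20, 0.05]])        # transition matrix from the text

def make_mrp(n_events, mu, sigma, label):
    """One Markov renewal process; types are tagged with the MRP label (3N types total)."""
    x, t, events = int(rng.integers(3)), 0.0, []
    for _ in range(n_events):
        x_next = int(rng.choice(3, p=P[x]))
        dt = max(rng.normal(mu[x, x_next], sigma[x, x_next]), 1e-5)  # clip at 1e-5
        t, x = t + dt, x_next
        events.append((f"{label}:{x}", t))
    return events

N = 3                                     # number of MRPs to superpose (illustrative)
events = []
for mrp in range(N):
    mu = rng.uniform(0, 10, size=(3, 3))          # per-MRP means
    sigma = np.sqrt(rng.uniform(0, 2, size=(3, 3)))   # per-MRP standard deviations
    events.extend(make_mrp(500, mu, sigma, mrp))

stream = sorted(events, key=lambda ev: ev[1])     # time-sorted superposed process
n_train = int(0.8 * len(stream))
train, test = stream[:n_train], stream[n_train:]
print(len(train), len(test), train[:3])
```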

Contributor Information

Wei Zhong Goh, Email: weizhong@bu.edu.

Varun Ursekar, Email: varunu@bu.edu.

Marc W. Howard, Email: marc777@bu.edu.

Code Availability

The code that supports the demonstration in section 4 can be found at https://predicting.gitlab.io.

References

1. Altmann, E. G., Cristadoro, G., & Esposti, M. D. (2012). On the origin of long-range correlations in texts. Proceedings of the National Academy of Sciences, 109(29), 11582–11587. 10.1073/pnas.1117723109
2. Balsam, P. D., & Gallistel, C. R. (2009). Temporal maps and informativeness in associative learning. Trends in Neuroscience, 32(2), 73–78. 10.1016/j.tins.2008.10.004
3. Bernacchia, A., Seo, H., Lee, D., & Wang, X. J. (2011). A reservoir of time constants for memory traces in cortical neurons. Nature Neuroscience, 14(3), 366–372. 10.1038/nn.2752
4. Bright, I. M., Meister, M. L. R., Cruzado, N. A., Tiganj, Z., Buffalo, E. A., & Howard, M. W. (2020). A temporal record of the past with a spectrum of time constants in the monkey entorhinal cortex. Proceedings of the National Academy of Sciences, 117, 20274–20283. 10.1073/pnas.1917197117
5. Buzsáki, G. (2002). Theta oscillations in the hippocampus. Neuron, 33(3), 325–340.
6. Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181–204. 10.1017/S0140525X12000477
7. Cohen, R., Erez, K., Ben-Avraham, D., & Havlin, S. (2000). Resilience of the Internet to random breakdowns. Physical Review Letters, 85(21), 4626.
8. Cont, R. (2005). Long range dependence in financial markets. In Lévy-Véhel J. & Lutton E. (Eds.), Fractals in engineering (pp. 159–179). London: Springer.
9. Cruzado, N. A., Tiganj, Z., Brincat, S. L., Miller, E. K., & Howard, M. W. (2020). Conjunctive representation of what and when in monkey hippocampus and lateral prefrontal cortex during an associative memory task. Hippocampus, 30, 1332–1346. 10.1002/hipo.23282
10. Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4), 613–624. 10.1162/neco.1993.5.4.613
11. DLMF. NIST Digital Library of Mathematical Functions. (2021). Edited by Olver F. W. J., Daalhuis A. B. Olde, Lozier D. W., Schneider B. I., Boisvert R. F., Clark C. W., … McClain M. A. Release 1.1.2 of 2021-06-15. http://dlmf.nist.gov/
12. Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11, 127–138. 10.1038/nrn2787
13. Gallistel, C., Craig, A. R., & Shahan, T. A. (2019). Contingency, contiguity, and causality in conditioning: Applying information theory and Weber's law to the assignment of credit problem. Psychological Review, 126(5), 761. 10.1037/rev0000163
14. Hasselmo, M. E., Bodelón, C., & Wyble, B. P. (2002). A proposed function for hippocampal theta rhythm: Separate phases of encoding and retrieval enhance reversal of prior learning. Neural Computation, 14, 793–817. 10.1162/089976602317318965
15. Kahana, M. J., Seelig, D., & Madsen, J. R. (2001). Theta returns. Current Opinion in Biology, 11(6), 739–744.
16. Kurth-Nelson, Z., & Redish, A. D. (2009). Temporal-difference reinforcement learning with distributed representations. PLOS One, 4(10), e7362. 10.1371/journal.pone.0007362
17. Ludvig, E. A., Sutton, R. S., & Kehoe, E. J. (2008). Stimulus representation and the timing of reward-prediction errors in models of dopamine system. Neural Computation, 20, 3034–3054. 10.1162/neco.2008.11-07-654
18. MacDonald, C. J., Lepage, K. Q., Eden, U. T., & Eichenbaum, H. (2011). Hippocampal “time cells” bridge the gap in memory for discontiguous events. Neuron, 71(4), 737–749. 10.1016/j.neuron.2011.07.012
19. McGuire, J. T., & Kable, J. W. (2013). Rational temporal predictions can underlie apparent failures to delay gratification. Psychological Review, 120(2), 395–410. 10.1037/a0031910
20. Mello, G. B., Soares, S., & Paton, J. J. (2015). A scalable population code for time in the striatum. Current Biology, 25(9), 1113–1122. 10.1016/j.cub.2015.02.036
21. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. 10.1038/nature14236
22. Momennejad, I., & Howard, M. W. (2018). Predicting the future with multi-scale successor representations. bioRxiv:449470.
23. Mozer, M. C. (1992). Induction of multiscale temporal structure. In Moody J., Hanson S. J., & Lippmann R. (Eds.), Advances in neural information processing systems, 4 (pp. 51–58). San Mateo, CA: Morgan Kaufmann.
24. Murray, J. D., Bernacchia, A., Roy, N. A., Constantinidis, C., Romo, R., & Wang, X.-J. (2017). Stable population coding for working memory coexists with heterogeneous neural dynamics in prefrontal cortex. Proceedings of the National Academy of Sciences, 114(2), 394–399. 10.1073/pnas.1619449114
25. Pan, W. X., Schmidt, R., Paton, J. R., & Hyland, B. I. (2005). Dopamine cells respond to predicted events during classical conditioning: Evidence for eligibility traces in the reward-learning network. Journal of Neuroscience, 25(26), 6235–6242. 10.1523/JNEUROSCI.1478-05.2005
26. Post, E. (1930). Generalized differentiation. Transactions of the American Mathematical Society, 32, 723–781. 10.1090/S0002-9947-1930-1501560-X
27. Rasmussen, J. G. (2018). Lecture notes: Temporal point processes and the conditional intensity function. arXiv:1806.00221.
28. Schultz, W. (2006). Behavioral theories and the neurophysiology of reward. Annual Review of Psychology, 57, 87–115. 10.1146/annurev.psych.56.091103.070229
29. Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599. 10.1126/science.275.5306.1593
30. Shankar, K. H., & Howard, M. W. (2013). Optimally fuzzy temporal memory. Journal of Machine Learning Research, 14, 3753–3780.
31. Shankar, K. H., Singh, I., & Howard, M. W. (2016). Neural mechanism to simulate a scale-invariant future. Neural Computation, 28, 2594–2627. 10.1162/NECO_a_00891
32. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., … Hassabis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419), 1140–1144. 10.1126/science.aar6404
33. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9–44.
34. Tano, P., Dayan, P., & Pouget, A. (2020). A local temporal difference code for distributional reinforcement learning. In Larochelle H., Ranzato M., Hadsell R., Balcan M. F., & Lin H. (Eds.), Advances in neural information processing systems, 33 (pp. 13662–13673). Red Hook, NY: Curran.
35. Taxidis, J., Pnevmatikakis, E. A., Dorian, C. C., Mylavarapu, A. L., Arora, J. S., Samadian, K. D., … Golshani, P. (2020). Differential emergence and stability of sensory and temporal representations in context-specific hippocampal sequences. Neuron, 108(5), 984–998.e9. 10.1016/j.neuron.2020.08.028
36. Tiganj, Z., Cromer, J. A., Roy, J. E., Miller, E. K., & Howard, M. W. (2018). Compressed timeline of recent experience in monkey lPFC. Journal of Cognitive Neuroscience, 30, 935–950. 10.1162/jocn_a_01273
37. Tiganj, Z., Gershman, S. J., Sederberg, P. B., & Howard, M. W. (2019). Estimating scale-invariant future in continuous time. Neural Computation, 31(4), 681–709. 10.1162/neco_a_01171
38. Tishby, N., Pereira, F. C., & Bialek, W. (2000). The information bottleneck method. arXiv:0004057.
39. Tsao, A., Sugar, J., Lu, L., Wang, C., Knierim, J. J., Moser, M.-B., & Moser, E. I. (2018). Integrating time from experience in the lateral entorhinal cortex. Nature, 561, 57–62. 10.1038/s41586-018-0459-6
40. van der Meer, M. A. A., & Redish, A. D. (2011). Theta phase precession in rat ventral striatum links place and reward information. Journal of Neuroscience, 31(8), 2843–2854. 10.1523/JNEUROSCI.4869-10.2011
41. Waelti, P., Dickinson, A., & Schultz, W. (2001). Dopamine responses comply with basic assumptions of formal learning theory. Nature, 412(6842), 43–48. 10.1038/35083500
42. Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
