Author manuscript; available in PMC: 2024 Apr 11.
Published in final edited form as: Adv Neural Inf Process Syst. 2023;36:44341–44355.

Sequential Memory with Temporal Predictive Coding

Mufeng Tang 1, Helen Barron 1, Rafal Bogacz 1
PMCID: PMC7615819  EMSID: EMS194757  PMID: 38606302

Abstract

Forming accurate memory of sequential stimuli is a fundamental function of biological agents. However, the computational mechanism underlying sequential memory in the brain remains unclear. Inspired by neuroscience theories and recent successes in applying predictive coding (PC) to static memory tasks, in this work we propose a novel PC-based model for sequential memory, called temporal predictive coding (tPC). We show that our tPC models can memorize and retrieve sequential inputs accurately with a biologically plausible neural implementation. Importantly, our analytical study reveals that tPC can be viewed as a classical Asymmetric Hopfield Network (AHN) with an implicit statistical whitening process, which leads to more stable performance in sequential memory tasks of structured inputs. Moreover, we find that tPC exhibits properties consistent with behavioral observations and theories in neuroscience, thereby strengthening its biological relevance. Our work establishes a possible computational mechanism underlying sequential memory in the brain that can also be theoretically interpreted using existing memory model frameworks.

1. Introduction

The ability to memorize and recall sequences of events with temporal dependencies is a crucial function of biological memory systems that also underpins many other neural processes [1–3]. For example, forming correct memories of words requires humans to memorize not only individual letters but also the order in which the letters appear (e.g., “cat” and “act”). However, despite extensive research into models of static, temporally unrelated memories from both neuroscience and machine learning [4–10], computational modeling of sequential memory is not as developed. Existing models for sequential memory are either analytically intractable or have not yet been systematically evaluated in challenging sequential memory tasks [11–15], which hinders a comprehensive understanding of the computational mechanism underlying sequential memory, arguably a more general form of memory in the natural world than static memories.

In this work, we propose a novel approach to modeling sequential memory based on predictive coding (PC) [16–18], a biologically plausible neural network model able to reproduce many phenomena observed in the brain [19], which has also been shown to have a close relationship to backpropagation in artificial neural networks [20, 21]. Using PC to model sequential memory is motivated by two key factors. Firstly, neuroscience experiments and theories have suggested that (temporal) predictive processing and memory are two highly related computations in the hippocampus, the brain region crucial for memory [22, 23]. Secondly, the modeling of static memories (e.g., a single image) using PC has recently demonstrated significant success [8, 10, 24, 25], raising the question of whether PC can also be employed to model sequential memory. To take into account the temporal dimension in sequential memory, in this work we adopt a temporal extension of the original PC models, which has been employed in filtering problems and in modeling the visual system [26–29]. Here we investigate its performance in sequential memory tasks. Our contributions can be summarized as follows:

  • We propose temporal predictive coding (tPC), a family of PC models capable of sequential memory tasks that inherit the biologically plausible neural implementation of classical PC models [18];

  • We present an analytical result showing that the single-layer tPC can be viewed as the classical Asymmetric Hopfield Network (AHN) performing an implicit statistical whitening step during memory recall, providing a possible mechanism of statistical whitening in the brain [30–32];

  • Experimentally, we show that the whitening step in single-layer tPC models results in more stable performance than the AHN and its modern variants [15] in sequential memory, due to the highly variable and correlated structure of natural sequential inputs;

  • We show that tPC can successfully reproduce several behavioral observations in humans, including the impact of sequence length in word memories and the primacy/recency effect;

  • Beyond memory, we show that our tPC model can also develop context-dependent representations [2, 11] and generalize learned dynamics to unseen sequences, suggesting a potential connection to cognitive maps in the brain [33, 34].

2. Background and related work

Predictive Coding Models for Static Memory

The original PC model for memory follows a hierarchical and generative structure, where higher layers generate top-down predictions of lower layers’ activities and the network’s learning and inference are driven by the minimization of all the prediction errors [8]. Since the memorized patterns minimize the prediction errors in PC, they can be considered attractors of the energy function defined as the (summed) prediction errors, which is similar to the energy perspective of Hopfield Networks (HNs) [4]. Subsequent research has also shown that PC models for memory can be formulated as recurrent networks to account for the recurrently connected hippocampal network [10, 25], and that continual memorization and recall of a stream of patterns can be achieved by combining the hierarchical PC model with conjugate Bayesian updates [24]. Despite these abundant investigations into the memory capability of PC models, they all focused on static memories. Although the BayesPCN model [24] is capable of recalling patterns memorized in an online manner, the memories stored in this model are still static, order-invariant patterns, rather than sequences of patterns with underlying temporal dependencies.

Predictive Coding with Time

PC was extended to include temporal predictions by earlier works (see [35] for a review) noticing its relationship to Kalman filters [36]. However, the Kalman filtering approach to temporal predictions sacrifices the plausible neural implementation of PC models due to non-local computations. A recent work proposed an alternative way of formulating PC with temporal predictions that inherits the plausible implementation of static PC while approximating Kalman filters [37], but none of these models were examined in memory tasks. More broadly, models based on temporal predictions have been proposed in both neuroscience and machine learning [29, 38–40]; these models are either trained by the implausible backpropagation or rely on complex architectures to achieve temporal predictions. In this work, our tPC model for sequential memory is based on the model in [37], which inherits the simple and biologically plausible neural implementation of classical PC models that requires only local computations and Hebbian plasticity.

Hopfield Networks for Sequential Memory

Although there exist other models of sequential memory [11–13], these models are mostly on the conceptual level, used to provide theoretical accounts for physiological observations from the brain, and are thus hard to analyze mathematically. Therefore, here we focus our discussion on the AHN [14] and its modern variants [15], which extended the HN [4] to account for sequential memory, and have a more explicit mathematical formulation. Denoting a sequence of $P + 1$ patterns $x^\mu$ ($\mu = 1, \ldots, P + 1$), $x^\mu \in \{-1, 1\}^N$, the $N \times N$ weight matrix of an AHN is set as follows:

$$W^{AHN} = \sum_{\mu=1}^{P} x^{\mu+1} (x^{\mu})^\top \tag{1}$$

Notice that it differs from the static HN only by encoding the asymmetric autocovariance rather than the symmetric covariance $\sum_{\mu=1}^{P} x^{\mu} (x^{\mu})^\top$, thus the name. The single-shot retrieval, which we define as $R$, is then triggered by a query $q \in \{-1, 1\}^N$:

$$R^{AHN}(q) = \mathrm{sgn}(W^{AHN} q) = \mathrm{sgn}\left(\sum_{\mu=1}^{P} x^{\mu+1} (x^{\mu})^\top q\right) \tag{2}$$

The retrieval process in Eq. 2 can be viewed as follows: the query $q$ is first compared with each $x^\mu$ using a dot product function $(x^\mu)^\top q$ that outputs a similarity score, then the retrieval is a weighted sum of all $x^{\mu+1}$ ($\mu = 1, \ldots, P$) using these scores as weights. Thus, if $q$ is identical to a certain $x^\mu$, the next pattern $x^{\mu+1}$ will be given the largest weight in the output retrieval. Following the Universal Hopfield Network (UHN) framework [7], we can generalize this process to define a retrieval function of a general sequential memory model with real-valued patterns $x^\mu \in \mathbb{R}^N$:

$$R^{UHN}(q) = \sum_{\mu=1}^{P} x^{\mu+1}\, \mathrm{sep}(\mathrm{sim}(x^{\mu}, q)) \tag{3}$$

where sim is a similarity function such as dot product or cosine similarity, and sep is a separation function that separates the similarity scores, i.e., emphasizes large scores and de-emphasizes smaller ones [7]. When sim is the dot product and sep is an identity function, we get the retrieval function of the original AHN (for binary patterns an ad hoc sgn function can be applied to the retrievals). Chaudhry et al. [15] have shown that it is possible to extend the AHN to a model with a polynomial sep function of degree $d$, and to a model with a softmax sep function, which we call “Modern Continuous AHN” (MCAHN):

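$$R^{AHN}(q, d) = \sum_{\mu=1}^{P} x^{\mu+1} \left((x^{\mu})^\top q\right)^{d} \tag{4}$$
$$R^{MCAHN}(q, \beta) = \sum_{\mu=1}^{P} x^{\mu+1}\, \mathrm{softmax}\left(\beta (x^{\mu})^\top q\right) \tag{5}$$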

where β is the temperature parameter that controls the separation strength of the MCAHN. Note that these two models can be respectively viewed as sequential versions of the Modern Hopfield Network [5] and the Modern Continuous Hopfield Network [6], which is closely related to the self-attention mechanism in Transformers [41]. However, the family of AHNs has not yet been investigated in sequential memory tasks with structured and complex inputs such as natural movies.
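The retrieval rules in Eqs. 2 to 5 differ only in the separation function, which can be illustrated with a minimal NumPy sketch. The function names (`retrieve`, `softmax`), the pattern dimension and the random test patterns are our own illustrative assumptions and are not taken from any reference implementation.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))  # subtract the max for numerical stability
    return e / e.sum()

def retrieve(X, q, sep=lambda s: s):
    """UHN-style retrieval (Eq. 3) with dot-product similarity.

    X: array of shape (P + 1, N), one stored pattern per row.
    q: query of shape (N,).
    sep: separation function applied to the P similarity scores.
    """
    scores = X[:-1] @ q            # sim(x^mu, q) for mu = 1..P
    return X[1:].T @ sep(scores)   # weighted sum of the successors x^{mu+1}

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(6, 200))  # P + 1 = 6 random binary patterns
q = X[2]                                    # query with the third pattern

r_ahn = np.sign(retrieve(X, q))                           # Eq. 2 (identity sep)
r_poly = np.sign(retrieve(X, q, sep=lambda s: s ** 3))    # Eq. 4 with d = 3
r_mcahn = retrieve(X, q, sep=lambda s: softmax(5.0 * s))  # Eq. 5 with beta = 5
```

With a handful of uncorrelated patterns, all three variants should return the successor `X[3]` of the queried pattern; the differences discussed later only appear with longer or correlated sequences.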

Other Models for Sequential Memory

Beyond Hopfield Networks, many other computational models have been proposed to study the mechanism underlying sequential memory. Theoretical properties of self-organizing networks in sequential memory were discussed as early as in [42]. In theoretical neuroscience, models by Jensen et al. [43] and Mehta et al. [44] suggested that the hippocampus performs sequential memory via neuron firing chains. Other models have suggested the role of contextual representation in sequential memory [11, 45], with contextual representations successfully reproducing the recency and contiguity effects in free recall [46]. Furthermore, Howard et al. [47] proposed that sequential memory is represented in the brain via approximating the inverse Laplace transform of the current sensory input. However, these models were still at the conceptual level, lacking neural implementations of the computations. Recurrent networks with backpropagation and large spiking neural networks also demonstrate sequential memory [48, 49]. We compare our model with [48] to validate tPC’s alignment with behavior.

Our model is also closely related to the concept of cognitive maps in the hippocampal formation [50–52], which is often discussed within the context of sequence learning to explain knowledge abstraction and generalization. In this work, we present two preliminary results related to cognitive maps, showing that our tPC model can 1) disambiguate aliased observations via latent representations and 2) generalize with simple sequential dynamics as a result of performing sequential memory [52]. However, as this work centers on memory, we leave cognitive maps for future explorations of tPC.

3. Models

In this section, we introduce the tPC models by describing their computations during memorization and recall respectively, as well as the neural implementations of these computations. We describe the single-layer tPC first, and then move to the 2-layer tPC.

3.1. Single-layer tPC

Memorization

The most straightforward intuition behind “good” sequential memory models is that they should learn the transition between every pair of consecutive patterns in the sequence, so that accurate recall of the full sequence can be achieved recursively by recalling the next pattern based on the current one. Given a sequence of real-valued patterns xμ, μ = 1, …, P + 1, this intuition can be formalized as the model minimizing the following loss at each time-step:

$$F_\mu(W) = \left\| x^{\mu} - W f(x^{\mu-1}) \right\|_2^2 \tag{6}$$

which is simply the squared temporal prediction error. $W$ is the weight parameter of the model, and $f(\cdot)$ is a nonlinear function. Similar to static PC models [8, 10], we assume that the model has two populations of neurons: value neurons that are loaded with the inputs $x^\mu$ at step $\mu$, and error neurons representing the temporal prediction error $\varepsilon^{\mu} := x^{\mu} - W f(x^{\mu-1})$. To memorize the sequence, the weight parameter $W$ is updated at each step following gradient descent:

$$\Delta W \propto -\partial F_\mu(W) / \partial W = \varepsilon^{\mu} f(x^{\mu-1})^\top \tag{7}$$

and the model can be presented with the sequence for multiple epochs until W converges. Note that the model only has one weight parameter for the whole sequence, rather than P weight parameters for a sequence of length P + 1.
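The memorization procedure of Eqs. 6 and 7 can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `memorize`, the identity nonlinearity, the learning rate and the number of epochs are assumptions chosen for a small toy problem, not the settings used in our experiments.

```python
import numpy as np

def memorize(X, f=lambda x: x, lr=0.01, epochs=100):
    """Learn a single weight matrix W from a sequence X of shape (P + 1, N)."""
    n_patterns, N = X.shape
    W = np.zeros((N, N))
    for _ in range(epochs):
        for mu in range(1, n_patterns):
            pre = f(X[mu - 1])            # delayed input from step mu - 1
            eps = X[mu] - W @ pre         # error neurons (temporal prediction error)
            W += lr * np.outer(eps, pre)  # Eq. 7: error times presynaptic activity
    return W

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(5, 50))   # toy sequence: P + 1 = 5, N = 50
W = memorize(X, lr=0.5 / X.shape[1])        # step size scaled by the pattern norm

# Summed loss of Eq. 6 over the sequence after training
loss = sum(np.sum((X[mu] - W @ X[mu - 1]) ** 2) for mu in range(1, len(X)))
```

Note that the update is an outer product of the postsynaptic error and the presynaptic activity, which is what makes the rule local.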

Recall

During recall, the weight matrix W is fixed to the learned values, and the value neurons no longer receive the correct patterns xμ. Instead, while trying to recall the pattern xμ based on the query q, the value neurons are updated to minimize the squared temporal prediction error based on the query q and the learned W:

$$F_\mu(\hat{x}^{\mu}) = \left\| \hat{x}^{\mu} - W f(q) \right\|_2^2 \tag{8}$$

where we denote the value neurons’ activities during recall as $\hat{x}^{\mu}$ to differentiate them from the memorized patterns $x^{\mu}$. The value neurons then perform the following inferential dynamics to minimize the loss $F_\mu(\hat{x}^{\mu})$:

$$\dot{\hat{x}}^{\mu} \propto -\partial F_\mu(\hat{x}^{\mu}) / \partial \hat{x}^{\mu} = -\varepsilon^{\mu} \tag{9}$$

and the converged $\hat{x}^{\mu}$ is the final retrieval. Note that the error neurons’ activities during recall are defined as $\varepsilon^{\mu} := \hat{x}^{\mu} - W f(q)$, which is also different from their activities during memorization.

In the case of sequential memory, there are two types of recall. We define the first type as “online” recall, where the query $q$ at each step $\mu$ is the ground-truth pattern at the previous step, $x^{\mu-1}$. It is called online as these ground-truth queries can be viewed as real-time online feedback during the sequence recall. In this case, the original $x^{\mu}$ will define a memory attractor as it defines an optimum of the loss, or energy, in Eqs. 6 and 8. The second type is referred to as “offline” recall, where $q$ is the recall from the previous step, i.e., $\hat{x}^{\mu-1}$, except at the first step, where a ground-truth $x^{1}$ is supplied to elicit recall of the whole sequence. This is called offline as there is no real-time feedback. In this case, errors from earlier steps may accumulate through time and $x^{\mu}$ is no longer an ascertained attractor unless $\hat{x}^{\mu-1} = x^{\mu-1}$, which makes it more challenging and analogous to the replay of memories during sleep [53].
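The difference between online and offline recall can be made concrete with a short sketch. The names `infer` and `recall` are hypothetical, the Euler step size is arbitrary, and for brevity the weights are obtained in closed form by least squares (an assumed stand-in for the converged gradient descent on Eq. 6 with an identity nonlinearity).

```python
import numpy as np

def infer(W, q, f=lambda x: x, steps=100, dt=0.2):
    """Euler integration of the inferential dynamics in Eq. 9, starting from zero."""
    pred = W @ f(q)
    x_hat = np.zeros_like(pred)
    for _ in range(steps):
        eps = x_hat - pred   # error neurons during recall
        x_hat -= dt * eps    # value neurons descend the loss in Eq. 8
    return x_hat

def recall(W, X, mode="online"):
    """Recall patterns 2..P+1; the first pattern X[0] is always given as the cue."""
    out, q = [], X[0]
    for mu in range(1, len(X)):
        x_hat = infer(W, q)
        out.append(x_hat)
        # online: next query is the ground truth; offline: the model's own recall
        q = X[mu] if mode == "online" else x_hat
    return np.stack(out)

rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(5, 50))
W = X[1:].T @ np.linalg.pinv(X[:-1].T)   # least-squares stand-in for learned weights

online = recall(W, X, mode="online")
offline = recall(W, X, mode="offline")
```

In this easy toy setting both modes recover the sequence; in harder settings the offline mode is the one in which early errors can propagate to later steps.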

Neural implementation

A possible neural network implementation of these computations is shown in Fig. 1A, which is similar to that of static PC models [18] characterized by separate populations of value and error neurons. The difference from static models is that the predictions are now from the previous time-step μ – 1. To achieve this, we assume that the value neurons are connected to the error neurons via two pathways: the direct pathway (the straight arrows between value and error neurons) and the indirect pathway through an additional population of inhibitory interneurons, which provides inhibitory inputs to the error neurons via W. These interneurons naturally introduce a synaptic delay of one time-step, such that when the inputs from step μ – 1 reach the error neurons through the indirect pathway, the error neurons are already receiving inputs from step μ via the direct pathway, resulting in the temporal error. Moreover, we assume that memory recall is a much faster process than the time-steps μ so that the interneurons can hold a short working memory of q during the (iterative) inferential dynamics in Eq. 9 at step μ, which can be achieved by the mechanisms described in [54]. Notice that in this implementation, the learning rule (Eq. 7) is Hebbian and the inference rule (Eq. 9) is also local, inheriting the plausibility of static PC implementations [18].

Figure 1. Neural implementations of the tPC models. A: single-layer tPC. B: 2-layer tPC.

3.2. 2-layer tPC

Similar to multi-layer static PC models for memory, we can have multiple layers in tPC to model the hierarchical processing of raw sensory inputs by the neocortex, before they enter the memory system [8, 10]. We focus on a 2-layer tPC model in this work. In this model, we assume a set of hidden value neurons zμ to model the brain’s internal neural responses to the sequential sensory inputs xμ. The hidden neurons make not only hierarchical, top-down predictions of the current activities in the sensory layer like in static PC models [8], but also temporal predictions like in the single-layer tPC. We also assume that the sensory layer xμ does not make any temporal predictions in this case. Thus, this 2-layer tPC can be viewed as an instantiation of the hidden Markov model [55].

Memorization

During memorization, the 2-layer tPC tries to minimize the sum of squared errors at step μ, with respect to the model parameters and the hidden activities:

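$$F_\mu(z^{\mu}, W_H, W_F) = \left\| z^{\mu} - W_H f(\hat{z}^{\mu-1}) \right\|_2^2 + \left\| x^{\mu} - W_F f(z^{\mu}) \right\|_2^2 \tag{10}$$

where $W_H$ governs the temporal prediction in the hidden state, $W_F$ is the forward weight for the top-down predictions, and $\hat{z}^{\mu-1}$ is the hidden state inferred at the previous time-step. During memorization, the 2-layer tPC follows a similar optimization process to that of static hierarchical PC [8]. It first infers the hidden representation of the current sensory input $x^{\mu}$ by:

$$\dot{z}^{\mu} \propto -\partial F_\mu(z^{\mu}, W_H, W_F) / \partial z^{\mu} = -\varepsilon^{z,\mu} + f'(z^{\mu}) \odot W_F^\top \varepsilon^{x,\mu} \tag{11}$$

where $\odot$ denotes the element-wise product between two vectors, and $\varepsilon^{z,\mu}$ and $\varepsilon^{x,\mu}$ are defined as the hidden temporal prediction error $z^{\mu} - W_H f(\hat{z}^{\mu-1})$ and the top-down error $x^{\mu} - W_F f(z^{\mu})$ respectively. After $z^{\mu}$ converges, $W_H$ and $W_F$ are updated following gradient descent on $F_\mu$:

$$\Delta W_H \propto -\partial F_\mu(z^{\mu}, W_H, W_F) / \partial W_H = \varepsilon^{z,\mu} f(\hat{z}^{\mu-1})^\top; \quad \Delta W_F \propto -\partial F_\mu(z^{\mu}, W_H, W_F) / \partial W_F = \varepsilon^{x,\mu} f(z^{\mu})^\top \tag{12}$$

which are performed once for every presentation of the full sequence. The converged $z^{\mu}$ is then used as $\hat{z}^{\mu}$ for the memorization at time-step $\mu + 1$.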
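One memorization pass of the 2-layer model (Eqs. 10 to 12) might be sketched as follows. This is not the pseudocode from the SM: the tanh nonlinearity, the zero initial hidden state, the initialization of the hidden activities at their temporal prediction, the per-step weight update and all hyperparameters are assumptions made for this toy illustration.

```python
import numpy as np

def f(z):
    return np.tanh(z)

def df(z):
    return 1.0 - np.tanh(z) ** 2

def infer_hidden(x, z_prev, W_H, W_F, steps=50, dt=0.1):
    """Euler integration of Eq. 11 with the sensory layer clamped to x."""
    pred = W_H @ f(z_prev)
    z = pred.copy()                       # assumed initialization
    for _ in range(steps):
        eps_z = z - pred                  # hidden temporal prediction error
        eps_x = x - W_F @ f(z)            # top-down prediction error
        z += dt * (-eps_z + df(z) * (W_F.T @ eps_x))
    return z

def memorize_epoch(X, W_H, W_F, lr=0.05):
    """One pass over the sequence; updates weights in place, returns summed loss."""
    z_prev = np.zeros(W_H.shape[0])       # assumed initial hidden state
    total = 0.0
    for x in X:
        z = infer_hidden(x, z_prev, W_H, W_F)
        eps_z = z - W_H @ f(z_prev)
        eps_x = x - W_F @ f(z)
        total += eps_z @ eps_z + eps_x @ eps_x    # Eq. 10
        W_H += lr * np.outer(eps_z, f(z_prev))    # Eq. 12
        W_F += lr * np.outer(eps_x, f(z))
        z_prev = z                        # converged z becomes z_hat for next step
    return total

rng = np.random.default_rng(0)
N, H = 8, 16                              # toy sensory and hidden sizes
X = rng.uniform(size=(5, N))
W_H = 0.1 * rng.normal(size=(H, H))
W_F = 0.1 * rng.normal(size=(N, H))
losses = [memorize_epoch(X, W_H, W_F) for _ in range(100)]
```

The two weight updates use only the error and activity of the neurons they connect, mirroring the single-layer rule.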

Recall

After learning/memorization, WH and WF are fixed. We also assume that the hidden activities zμ are unable to store memories, by resetting their values to randomly initialized ones. Thus, the sequential memories can only be recalled through the weights WH and WF. Again, the sensory layer has no access to the correct patterns during recall and thus needs to dynamically change its value to retrieve the memories. The loss thus becomes:

$$F_\mu(z^{\mu}, \hat{x}^{\mu}) = \left\| z^{\mu} - W_H f(\hat{z}^{\mu-1}) \right\|_2^2 + \left\| \hat{x}^{\mu} - W_F f(z^{\mu}) \right\|_2^2 \tag{13}$$

where $\hat{x}^{\mu}$ denotes the activities of value neurons in the sensory layer during recall. Both the hidden and sensory value neurons are updated to minimize the loss. The hidden neurons will follow similar dynamics specified in Eq. 11, with the top-down error $\varepsilon^{x,\mu}$ defined as $\hat{x}^{\mu} - W_F f(z^{\mu})$, whereas the sensory neurons are updated according to:

$$\dot{\hat{x}}^{\mu} \propto -\partial F_\mu(z^{\mu}, \hat{x}^{\mu}) / \partial \hat{x}^{\mu} = -\varepsilon^{x,\mu} \tag{14}$$

and the converged $\hat{x}^{\mu}$ is the final retrieval. Similar to the single-layer case, if the converged $\hat{z}^{\mu}$ is used for the recall at the next step directly, the recall is offline; on the other hand, if we query the model with $q = x^{\mu}$, i.e., the ground truth, and use the query to infer $\hat{z}^{\mu}$, and then use $\hat{z}^{\mu}$ for the recall at the next step, the recall is online.
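A single recall step of the 2-layer model can be sketched by relaxing the hidden and sensory value neurons jointly. The function name `recall_step`, the tanh nonlinearity, the step size and the randomly drawn (untrained) weights are assumptions used only to show the dynamics; they are placeholders rather than learned parameters.

```python
import numpy as np

def f(z):
    return np.tanh(z)

def df(z):
    return 1.0 - np.tanh(z) ** 2

def recall_step(z_prev, W_H, W_F, rng, steps=2000, dt=0.1):
    """Jointly relax hidden and sensory value neurons on the loss in Eq. 13."""
    pred = W_H @ f(z_prev)
    z = 0.1 * rng.normal(size=W_H.shape[0])  # hidden activities are reset randomly
    x_hat = np.zeros(W_F.shape[0])           # sensory layer has no input in recall
    for _ in range(steps):
        eps_z = z - pred
        eps_x = x_hat - W_F @ f(z)            # top-down error during recall
        # hidden dynamics: descend both error terms of Eq. 13 with respect to z
        z += dt * (-eps_z + df(z) * (W_F.T @ eps_x))
        x_hat += dt * (-eps_x)                # Eq. 14
    return z, x_hat

rng = np.random.default_rng(0)
N, H = 8, 16
W_H = rng.normal(size=(H, H)) / np.sqrt(H)    # placeholder weights, not trained
W_F = rng.normal(size=(N, H)) / np.sqrt(H)
z_prev = rng.normal(size=H)                   # hidden state from the previous step

z, x_hat = recall_step(z_prev, W_H, W_F, rng)
# Offline recall would feed z back in as z_prev for the next step; online recall
# would instead infer the hidden state with the sensory layer clamped to x^mu.
```

At convergence both errors vanish, so the retrieval is the top-down prediction from the temporally predicted hidden state.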

Neural implementation

A neural implementation of the computations above is shown in Fig. 1B. The hidden layer follows the same mechanism in the single-layer tPC, with interneurons introducing a synaptic delay for temporal errors and short-term memory to perform the inferential dynamics (Eq. 11). The connection between the hidden layer and the sensory layer WF is modeled in the same way as in static PC models, which requires only Hebbian learning and local computations [18]. The memorization and recall pseudocode for the tPC models is provided in SM.

4. Results

4.1. Theoretical relationship to AHNs

We first develop a theoretical understanding of single-layer tPC by relating it to AHNs:

Property 1

Assume, without loss of generality, a sequence of memories $x^{\mu} \in \mathbb{R}^N$ ($\mu = 1, \ldots, P + 1$) with zero mean. With an identity nonlinear function $f(x) = x$ in Eq. 6, the retrieval of the single-layer tPC with query $q$, defined as $R^{tPC}(q)$, can be written as:

$$R^{tPC}(q) = \sum_{\mu=1}^{P} x^{\mu+1} (M x^{\mu})^\top M q \tag{15}$$

where $M$ is an empirical whitening matrix such that:

$$\left\langle M x^{\mu} (M x^{\mu})^\top \right\rangle_{\mu} = I_N \tag{16}$$

where $\langle \cdot \rangle$ is the expectation operation over the $x^{\mu}$’s.

Proof of this property is provided in the SM. Essentially, this property implies that our single-layer tPC, in its linear form, can be regarded as a special case of the UHN for sequential memories (Eq. 3), with a “whitened dot product” similarity function where the two vectors $x^{\mu}$ and $q$ are first normalized and decorrelated (to have identity covariance $I_N$) before the dot product. AHNs, on the other hand, calculate the dot product directly. Biologically, this property provides a possible mechanism of statistical whitening in the brain that, unlike earlier models of biological whitening with explicit objectives [30–32], performs this computation implicitly via the circuit shown in Fig. 1 that minimizes the temporal prediction errors.
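Property 1 can be checked numerically on a toy example. In the sketch below, the converged linear tPC weights are taken to be the least-squares minimizer of the summed loss in Eq. 6, and $M$ is built to whiten the summed (unnormalized) second-moment matrix of the stored patterns; the latter is an assumption that folds the $1/P$ factor of the expectation in Eq. 16 into $M$. Variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 5, 200
A = rng.normal(size=(N, N))
X = rng.normal(size=(P + 1, N)) @ A.T   # patterns with correlated features
X -= X.mean(axis=0)                     # zero mean, as assumed in Property 1

prev, nxt = X[:-1], X[1:]               # x^mu and x^{mu+1} for mu = 1..P

# Least-squares weights minimizing sum_mu ||x^{mu+1} - W x^mu||^2
W = nxt.T @ np.linalg.pinv(prev.T)

# Symmetric whitening matrix with M^T M = (sum_mu x^mu x^mu^T)^{-1}
C = prev.T @ prev
evals, evecs = np.linalg.eigh(C)
M = evecs @ np.diag(evals ** -0.5) @ evecs.T

q = rng.normal(size=N)
r_tpc = W @ q                                   # converged recall, W f(q) with f = id
r_whitened = nxt.T @ ((prev @ M.T) @ (M @ q))   # Eq. 15: whitened dot products
```

The two retrievals agree up to numerical precision, which shows that recall through the learned weights is a dot-product readout applied to whitened patterns and queries.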

4.2. Experimental comparison to AHNs

To understand how the whitening step affects the performance in sequential memory tasks, we compare our single-layer tPC with the family of AHNs experimentally. To ensure consistency with Property 1 above, we use an identity nonlinearity f(x) = x for all these experiments. Empirically, we found that using a tanh nonlinearity makes subtle differences irrelevant to our main discussion in this work, and we discuss it in SM.

Polynomial AHNs

We first compare tPC with polynomial AHNs (Eq. 4) in sequences of uncorrelated binary patterns, where AHNs are known to work well [14, 15]. We plot their sequence capacity $P_{max}$ against the number of value neurons of the models, i.e., the pattern dimension $N$. Here, $P_{max}$ is defined as the maximum length of a memorized sequence, for which the probability of incorrectly recalled bits is less than or equal to 0.01. Fig. 2A shows that the capacity of our single-layer tPC is greater than that of the original AHN ($d = 1$) but smaller than that of a quadratic AHN ($d = 2$). Notice that the single-layer tPC has an identity sep function like the original AHN. Therefore the whitening operation has indeed improved the performance of the dot product sim function. Inspired by the decorrelation effect of statistical whitening, we then generated binary patterns with $N = 100$ correlated features, with a parameter $b$ controlling the level of correlation ($b = |\text{correlation}|$). The approach that we followed to generate the correlated patterns is provided in SM. As shown in Fig. 2B, as the correlation increases, all AHNs up to $d = 3$ suffer from a quick decrease of capacity $P_{max}$, whereas the capacity of single-layer tPC almost remains constant. This observation is consistent with the theoretical property that the whitening transformation essentially decorrelates features such that patterns with any level of correlation are regarded as uncorrelated in tPC recall (Eq. 15). This result also explains the comparison in Fig. 2A: although the patterns generated in this panel are theoretically uncorrelated, the small correlation introduced due to experimental randomness will result in the performance gap between AHN and single-layer tPC.
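The capacity measure defined above can be estimated empirically along the following lines for the original AHN with uncorrelated binary patterns. This is a rough sketch rather than the procedure behind Fig. 2: the helper names, the number of trials, the grid of sequence lengths and the early-stopping search are all assumptions, and the correlated-pattern generator from the SM is not reproduced here.

```python
import numpy as np

def bit_error_rate(P, N, trials=20, seed=0):
    """Fraction of wrongly recalled bits for an AHN under online recall."""
    rng = np.random.default_rng(seed)
    wrong = total = 0
    for _ in range(trials):
        X = rng.choice([-1.0, 1.0], size=(P + 1, N))
        W = X[1:].T @ X[:-1]               # Eq. 1: sum of x^{mu+1} (x^mu)^T
        recalled = np.sign(X[:-1] @ W.T)   # Eq. 2 applied to every query x^mu
        wrong += np.sum(recalled != X[1:])
        total += recalled.size
    return wrong / total

def capacity(N, threshold=0.01, P_grid=range(2, 61, 2)):
    """Largest P on the grid before the error rate first exceeds the threshold."""
    best = 0
    for P in P_grid:
        if bit_error_rate(P, N) > threshold:
            break
        best = P
    return best

short_rate = bit_error_rate(2, 100)    # very short sequence
long_rate = bit_error_rate(80, 100)    # sequence far beyond capacity
p_max = capacity(100)
```

The same loop could be run with any retrieval function in place of the AHN rule to compare capacities across models.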

Figure 2. Comparison between single-layer tPC and AHNs. A: Capacity of models with uncorrelated binary patterns. B: Capacity of models with binary patterns with increasing feature correlations. C: Recall performance with sequences of binary MNIST digits.

We then investigate the performance of these models with sequences of binarized MNIST images [56] in Fig. 2C. It can be seen that the AHNs with d = 1 and d = 2 quickly fail as the sequence length P reaches 3, whereas the single-layer tPC performs well. These results suggest that our tPC with a whitening step is superior to simple AHNs due to the inherently correlated structure of natural inputs such as handwritten digits. For all the experiments with binary patterns, we used the online recall mentioned above, and polynomial AHNs already fail in this simpler recall scenario.

MCAHN

Due to the quick failure of AHNs with a polynomial sep function, we now compare our single-layer tPC with the MCAHN (Eq. 5). In static memory tasks, it is known that the softmax separation leads to exponentially high capacity, especially when β is high [7, 57]. In this work we use β = 5 for all MCAHNs. We first compare the performances of our single-layer tPC model and an MCAHN on random sequences of MNIST digits with varying lengths. Here we trigger recalls with online queries. Fig. 3A shows that the performance of our single-layer tPC, measured as the mean squared error (MSE) between the recalled sequence and the ground truth, is better than that of the MCAHN, further demonstrating the usefulness of the implicit whitening in our model.

Figure 5. Replicating behavioral data with tPC. A: Experimental data from [61] that studies the impact of sequence length on serial recall of English words and the replications by Botvinick and Plaut [48] and tPC. B: tPC replicates the primacy/recency effects in serial recall experiments [62].

Figure 3. A: Recall MSE of MNIST sequences with increasing length. B: Recall MSE of MovingMNIST sequences of a fixed length 10 but with an increasing number of sequences. Error bars obtained with 5 seeds.

Despite the superior performance of our model in this task, we note that random sequences of MNIST images are not naturally sequential inputs, i.e., there are no sequential dynamics underlying them. We thus examine the models on the MovingMNIST dataset [58]. Each video in this dataset consists of 20 frames of 2 MNIST digits moving inside a 64 × 64 patch. Due to the fixed sequence length, for experiments with MovingMNIST we vary the total number of sequences to memorize and fix the sequence length to the first 10 frames of the videos. The performance of the models is shown in Fig. 3B. On average, the recall MSE of MCAHN has a slower increase than that of the single-layer tPC as the total number of sequences increases. However, the performance of MCAHN has very large variations across all sequence numbers. To probe into this observation, we visually examined 3 examples of the MovingMNIST movies recalled by MCAHN and our single-layer tPC in Fig. 4A, when the total number of sequences to memorize is 40. MCAHN produces very sharp recalls for the first 2 example sequences, but totally fails to recall the third one by converging to a different memory sequence after the red triangle in Fig. 4A. On the other hand, the recall by single-layer tPC is less sharp but stably produces the correct sequence. This phenomenon can be understood using the UHN framework [7]: in Eq. 5, when β is large, the softmax separation function used by MCAHN will assign a weight close to 1 to the memory whose preceding frame is most similar to q (measured by dot product), and weights close to 0 to all other memories, which results in the sharp recall. In contrast, our single-layer tPC model uses an identity separation function that fails to suppress the incorrect memories, resulting in blurry recalls. Importantly, however, the failure of MCAHN in sequence 3 in Fig. 4A suggests that there are “strong attractors” in the memories with an undesirable advantage in dot product similarity, which resulted in the large variance in the numerical results. This problem is addressed by the whitening step in single-layer tPC, as it effectively normalizes the patterns before the dot product.

Figure 4. Visual results of offline memory recall with 3 datasets. A: MovingMNIST. B: CIFAR10. C: UCF101.

The importance of whitening in tPC is further demonstrated in our experiments with random sequences of CIFAR10 [59] images and movies from the UCF101 [60] dataset, shown in Figs. 4B and C. When recalling these colored sequences, MCAHN can easily converge to strong attractors preceded by frames with many large pixel values that lead to large dot products e.g., the third image in the CIFAR10 example with bright backgrounds, and the penultimate frame in the UCF101 example with a large proportion of sand, which give their subsequent frames large similarity scores. This problem is consistent with earlier findings with static memories [8], and is circumvented in our single-layer tPC model with the whitening matrix M normalizing the pixel values across the sequence, yielding the correct and more stable memory recalls. However, they are less sharp due to the identity separation function. For all the experiments in Fig. 3B and Fig. 4, we used offline recall to make the tasks more challenging and more consistent with reality. Results with online recalls are shown in SM.
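The strong-attractor effect can be reproduced with three hand-picked patterns, one of which has uniformly large values. The numbers below are purely illustrative and are not taken from any dataset; the whitened scores follow the form of Eq. 15, with the pseudo-inverse of the summed second-moment matrix standing in for $M^\top M$ (an assumption made for this small example).

```python
import numpy as np

# Columns are stored patterns x^1, x^2, x^3; x^2 is a "bright" pattern.
X = np.array([[0.2, 1.0, 0.0],
              [0.1, 1.0, 0.3],
              [0.0, 1.0, 0.1]])
q = X[:, 0]                                 # query with the dim pattern x^1

# Plain dot-product similarity, as used by the AHN family
dot_scores = X.T @ q

# Whitened similarity: (M x^mu)^T M q with M^T M = (sum_mu x^mu x^mu^T)^+
white_scores = X.T @ np.linalg.pinv(X @ X.T) @ q

best_dot = int(np.argmax(dot_scores))       # index of the pattern favored by dot product
best_white = int(np.argmax(white_scores))   # index favored after whitening
```

Under the plain dot product the bright pattern obtains the highest score even though the query is identical to the first pattern, so a sharp separation function would select the wrong successor; after whitening, the score concentrates on the pattern that was actually queried.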

4.3. Comparison to behavioral data in sequential memory experiments

We further demonstrated the biological relevance of tPC by comparing it to data from behavioral experiments in sequential memory. In Fig. 5A, our 2-layer tPC is compared with Crannell and Parrish’s [61] study on sequence length’s impact on serial recall of English words. Using one-hot vectors to represent letters (i.e., each “word” in our experiment is an ordered combination of one-hot vectors of “letters”, a minimal example of a word with 3 letters being: [0, 1, 0], [1, 0, 0], [0, 0, 1]), we measure accuracy as the proportion of perfectly recalled sequences across varying lengths. Our model aligns consistently with experimental data in [61] as well as the recurrent network model by Botvinick and Plaut [48], displaying a sigmoidal accuracy drop with increasing sequence length.

Fig. 5B introduces a qualitative comparison to Henson’s [62] experimental data, examining primacy/recency effects in serial recall of letters. These effects involve higher accuracy in recalling early (primacy) and late (recency) entries in a sequence, with the recency effect slightly weaker than the primacy effect. Using one-hot vectors and a fixed sequence length of 7 (6 positions are shown as the first position is given as the cue to initiate recall in our experiment), we visualize recall frequency at different positions across simulated sequences (100 repetitions, multiple seeds for error bars).

Each bar in Fig. 5B indicates the frequency of an entry at a particular position being recalled at each position. Our 2-layer tPC reproduces primacy/recency effects, albeit weaker than the model in [62] and previous models [48]. Additionally, the model tends to recall neighboring entries upon errors, echoing Henson’s data. We attribute the weaker effects to tPC’s memory storage in weights, leading to overall improved performance across positions. Details for these experiments are shown in SM.
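The one-hot letter encoding and the accuracy measure can be sketched as follows. The 26-letter lowercase alphabet, the argmax decoding and the helper names are illustrative assumptions; the exact experimental details are those given in the SM.

```python
import string
import numpy as np

ALPHABET = string.ascii_lowercase   # assumed alphabet of 26 letters

def encode(word):
    """Map a word to a (len(word), 26) array of one-hot letter vectors."""
    idx = [ALPHABET.index(c) for c in word]
    return np.eye(len(ALPHABET))[idx]

def decode(seq):
    """Read out each step as the letter with the largest activity."""
    return "".join(ALPHABET[i] for i in np.argmax(seq, axis=1))

def sequence_accuracy(recalled, targets):
    """Proportion of sequences recalled perfectly at every position."""
    hits = [decode(r) == decode(t) for r, t in zip(recalled, targets)]
    return float(np.mean(hits))

targets = [encode("cat"), encode("cat")]
recalled = [encode("cat"), encode("act")]   # second recall has the wrong order
acc = sequence_accuracy(recalled, targets)
```

A sequence counts as correct only if every letter is recalled in the right position, so “act” does not count as a correct recall of “cat”.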

4.4. tPC develops context-dependent representations

In Fig. 3, we plotted the recall MSE of the 2-layer tPC model on MovingMNIST, which is similar to that of the single-layer tPC. The close performance of these models raises the question of whether and when the hidden layer and hierarchical processing of sequential inputs are necessary. Inspired by earlier neuroscience theories that the hippocampus develops neuron populations signaling the sensory inputs and the context of the inputs separately [3, 11], we hypothesize that the hidden neurons in our model represent when an element occurs in a memory sequence, i.e., its context. We thus designed a sequential memory task with aliased, or repeating, inputs at different time-steps [28, 63]. An example of such an aliased sequence can be seen in Fig. 6A, where the second and fourth frames of a short MNIST sequence are exactly the same (“2”). Recalling such a sequence is inherently more challenging, as the models have to determine, when queried with the “2” during recall (either online or offline), whether they are at the second or the fourth step in order to give the correct recall at the next step (“1” or “3”). As can be seen in Fig. 6A, both MCAHN (which can be regarded as a single-layer network [7]) and single-layer tPC fail in this task, recalling an average frame of “1” and “3” after the aliased steps, whereas the 2-layer tPC can recall the correct sequence.

We then conducted a numerical investigation into this problem. We first plotted the (online) recall MSEs of the single- and 2-layer tPC models on random MNIST sequences (sampled from the training set) of varying lengths, shown as the solid lines in Fig. 6B. We then randomly replaced 20% (rounded up to the closest integer) of the elements in these sequences with a single digit from the MNIST test set, so that each sequence now has 20% repeating, i.e., aliased, elements, and plotted the recall MSEs as the dotted lines in Fig. 6B. The results suggest that aliased inputs affect the single-layer model much more than the 2-layer one: the single-layer model produces significantly larger MSEs than it does when recalling sequences without aliased inputs. It is worth noting that aliased inputs are ubiquitous in natural sequential memories. In the SM, we provide a natural example from the UCF101 dataset where the 2-layer tPC successfully de-aliased repeating inputs whereas the MCAHN and single-layer tPC failed.
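The aliasing procedure described above (replacing 20% of the elements, rounded up, with one repeated frame) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, random arrays stand in for flattened MNIST digits, and the replaced positions are drawn uniformly without replacement, which the text does not specify.

```python
import math
import numpy as np

def make_aliased_sequence(sequence, alias_frame, fraction=0.2, rng=None):
    """Replace ceil(fraction * T) frames of `sequence` with the same `alias_frame`."""
    rng = np.random.default_rng() if rng is None else rng
    seq_len = sequence.shape[0]
    n_alias = math.ceil(fraction * seq_len)  # "rounded up to the closest integer"
    # Assumption: aliased positions are chosen uniformly, without replacement.
    positions = rng.choice(seq_len, size=n_alias, replace=False)
    aliased = sequence.copy()
    aliased[positions] = alias_frame
    return aliased, np.sort(positions)

rng = np.random.default_rng(0)
sequence = rng.random((8, 784))  # placeholder for 8 flattened MNIST training digits
alias_frame = rng.random(784)    # placeholder for a single MNIST test digit
aliased, positions = make_aliased_sequence(sequence, alias_frame, rng=rng)
print(positions)                 # the time-steps that now share an identical frame
```

For a sequence of length 8, two frames are replaced, so the same input appears at two time-steps that are generally followed by different frames.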

Figure 6.

Context representation and generalization of tPC. A: A simple aliased example with MNIST. B: Numerical investigation into the impact of aliased or repeating elements on model performance. C: Different latent representations of aliased inputs by the 2-layer tPC. D/E: Recall/generalization of sequences with rotational dynamics. “GT” stands for ground truth and 16 and 1024 are the numbers of training sequences. F: Recall and generalization MSE of seen and unseen rotating MNIST images. Error bars obtained with 5 seeds.

To further understand the mechanism underlying the 2-layer tPC with aliased memories, we used a simpler synthetic sequential memory shown in Fig. 6C (bottom), where a white bar moves first down and then up in a 5 × 5 frame so that steps 2 and 4 are aliased [28]. We then trained a 2-layer tPC with a hidden size of 5 to memorize this sequence and queried it offline. The small hidden size allows us to plot, once the recall dynamics (Eq. 11) have converged, the exact hidden activities in Fig. 6C (top), where each vertical line represents the activity of a hidden value neuron z^μ. As can be seen in the circled time-steps, the 2-layer model represents the aliased inputs differently in its hidden states, which helps it recall the next frame correctly. This property is consistent with early observations in neuroscience that, when memorizing sequences, the hippocampus develops a conjunction of neurons representing individual inputs, as well as neurons signaling the temporal context within which an individual input appears [3, 11]. In our simple 2-layer model, the “sensory” layer represents individual inputs, whereas the hidden layer plays the role of indexing time and context.
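The synthetic bar sequence can be reconstructed in a few lines, which also makes the source of the ambiguity explicit. This is a sketch under assumptions: the bar is taken to be horizontal and to visit rows 0, 1, 2, 1, 0, and the helper names are hypothetical; the exact frames used in Fig. 6C may differ.

```python
import numpy as np

def bar_sequence(rows=(0, 1, 2, 1, 0), size=5):
    """Frames of a white horizontal bar; rows[t] is the bar's row at step t (assumed path)."""
    frames = np.zeros((len(rows), size, size))
    for t, row in enumerate(rows):
        frames[t, row, :] = 1.0
    return frames

def ambiguous_steps(frames):
    """1-indexed pairs of steps with identical frames but different next frames."""
    pairs = []
    n_steps = len(frames)
    for i in range(n_steps - 1):
        for j in range(i + 1, n_steps - 1):
            same_input = np.array_equal(frames[i], frames[j])
            same_next = np.array_equal(frames[i + 1], frames[j + 1])
            if same_input and not same_next:
                pairs.append((i + 1, j + 1))
    return pairs

frames = bar_sequence()
print(ambiguous_steps(frames))
```

Under the assumed path, steps 2 and 4 show the same frame but are followed by different frames, so any predictor that depends only on the current frame cannot be correct at both steps; some additional state, such as the hidden layer's activity, is needed to tell them apart.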

4.5. tPC generalizes learned dynamics to unseen sequences

After memorizing a large number of training sequences sharing underlying dynamics, can tPC generalize the dynamics to unseen sequences? In this experiment, we train the 2-layer tPC with sequences of rotating MNIST digits with identical rotational dynamics (i.e., the digits rotate in a fixed direction by a fixed angle at each time step) and vary the number of training sequences (“training size”). An example of these rotating MNIST digits can be seen in Fig. 6D, row “ground truth”. The model’s performance is assessed by its ability to rotate both seen MNIST digits and unseen EMNIST letters. For small training sizes (16), tPC can recall seen rotating digits but struggles to generalize to unseen letters (Fig. 6D and E, second row). Increasing the training size to 1024 improves generalization, evident in clearer rotating sequences (Fig. 6D and E, bottom row). Panel F quantitatively confirms this trend: the generalization MSE on unseen EMNIST drops as the MNIST training size increases, indicating that the model learns the underlying dynamics. Interestingly, the recall MSEs for seen MNIST sequences also decrease, because the model extracts rotational dynamics from the larger training set; this differs from the behavior observed with random MNIST sequences (Fig. 3A). Generalization and the capability of developing contextual representations that disambiguate aliased inputs are two critical functions underlying the flexible behavior of animals, thus connecting our tPC to models of cognitive maps in the hippocampus and related brain regions [51].
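To make the notion of "identical rotational dynamics" concrete, the sketch below builds sequences in which every frame is the previous one rotated by the same step, together with the MSE used to score recall or generalization. The 90-degree step is an assumption chosen so that the rotation is exact with NumPy alone (the angle used in the experiments is not stated in this section), random arrays stand in for MNIST and EMNIST images, and the function names are hypothetical.

```python
import numpy as np

def rotating_sequence(image, n_steps=4, quarter_turns=1):
    """Stack n_steps frames; frame t is `image` rotated by t * quarter_turns * 90 degrees."""
    return np.stack([np.rot90(image, k=t * quarter_turns) for t in range(n_steps)])

def mse(prediction, target):
    """Mean squared error between a recalled sequence and the ground truth."""
    return float(np.mean((prediction - target) ** 2))

rng = np.random.default_rng(0)
seen = rng.random((28, 28))    # placeholder for an MNIST digit used in training
unseen = rng.random((28, 28))  # placeholder for an EMNIST letter never trained on
seen_seq = rotating_sequence(seen)
unseen_seq = rotating_sequence(unseen)

# Both sequences obey the same transition rule, which is what a model must
# extract in order to generalize from seen digits to unseen letters.
for seq in (seen_seq, unseen_seq):
    assert all(np.array_equal(seq[t + 1], np.rot90(seq[t])) for t in range(len(seq) - 1))
```

With a trained model, the recall MSE would be `mse(model_output, seen_seq)` and the generalization MSE `mse(model_output, unseen_seq)`, where `model_output` is whatever sequence the model produces when queried; no model is included in this sketch.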

5. Conclusion

Inspired by experimental and theoretical discoveries in neuroscience, in this work we have proposed a temporal predictive coding model for sequential memory tasks. We have shown that our tPC model can memorize and recall sequential inputs such as natural movies, and performs more stably than earlier models based on Asymmetric Hopfield Networks. We have also provided a theoretical understanding of the stable performance of the tPC model, showing that it is achieved by an additional statistical whitening operation that is missing in AHNs. Importantly, this whitening step is achieved implicitly by a plausible neural circuit performing local error minimization. Moreover, our 2-layer tPC has exhibited representational and behavioral properties consistent with biological observations, including contextual representations and generalization. Overall, our model has not only provided a possible neural mechanism underlying sequential memory in the brain but also suggested a close relationship between PC and HN, two influential computational models of biological memories. Future directions include systematic investigations into tPC with more than 2 layers and modelling cognitive maps with tPC.

Supplementary Material

SI

6. Acknowledgement

This work has been supported by Medical Research Council UK grant MC_UU_00003/1 to RB, an E.P. Abraham Scholarship in the Chemical, Biological/Life and Medical Sciences to MT, and UKRI Future Leaders Fellowship to HB (MR/W008939/1). The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work (http://dx.doi.org/10.5281/zenodo.22558).

Contributor Information

Mufeng Tang, Email: mufeng.tang@bndu.ox.ac.uk.

Helen Barron, Email: helen.barron@bndu.ox.ac.uk.

Rafal Bogacz, Email: rafal.bogacz@bndu.ox.ac.uk.

References

  • [1]. Tulving Endel. Episodic and semantic memory. 1972.
  • [2]. Eichenbaum Howard. Memory on time. Trends in cognitive sciences. 2013;17(2):81–88. doi: 10.1016/j.tics.2012.12.007.
  • [3]. Eichenbaum Howard. Time cells in the hippocampus: a new dimension for mapping memories. Nature Reviews Neuroscience. 2014;15(11):732–744. doi: 10.1038/nrn3827.
  • [4]. Hopfield John J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences. 1982;79(8):2554–2558. doi: 10.1073/pnas.79.8.2554.
  • [5]. Krotov Dmitry, Hopfield John J. Dense associative memory for pattern recognition. Advances in neural information processing systems. 2016;29
  • [6]. Ramsauer Hubert, Schäfl Bernhard, Lehner Johannes, Seidl Philipp, Widrich Michael, Adler Thomas, Gruber Lukas, Holzleitner Markus, Pavlović Milena, Sandve Geir Kjetil, et al. Hopfield networks is all you need. arXiv preprint. 2020:arXiv:2008.02217
  • [7]. Millidge Beren, Salvatori Tommaso, Song Yuhang, Lukasiewicz Thomas, Bogacz Rafal. Universal hopfield networks: A general framework for single-shot associative memory models; International Conference on Machine Learning; 2022. pp. 15561–15583.
  • [8]. Salvatori Tommaso, Song Yuhang, Hong Yujian, Sha Lei, Frieder Simon, Xu Zhenghua, Bogacz Rafal, Lukasiewicz Thomas. Associative memories via predictive coding. Advances in Neural Information Processing Systems. 2021;34
  • [9]. Iatropoulos Georgios, Brea Johanni, Gerstner Wulfram. Kernel memory networks: A unifying framework for memory modeling. arXiv preprint. 2022:arXiv:2208.09416
  • [10]. Tang Mufeng, Salvatori Tommaso, Millidge Beren, Song Yuhang, Lukasiewicz Thomas, Bogacz Rafal. Recurrent predictive coding models for associative memory employing covariance learning. PLOS Computational Biology. 2023;19(4):e1010719. doi: 10.1371/journal.pcbi.1010719.
  • [11]. Wallenstein Gene V, Hasselmo Michael E, Eichenbaum Howard. The hippocampus as an associator of discontiguous events. Trends in neurosciences. 1998;21(8):317–323. doi: 10.1016/s0166-2236(97)01220-4.
  • [12]. Rolls Edmund T. A computational theory of episodic memory formation in the hippocampus. Behavioural brain research. 2010;215(2):180–196. doi: 10.1016/j.bbr.2010.03.027.
  • [13]. Hawkins Jeff, George Dileep, Niemasik Jamie. Sequence memory for prediction, inference and behaviour. Philosophical Transactions of the Royal Society B: Biological Sciences. 2009;364(1521):1203–1209. doi: 10.1098/rstb.2008.0322.
  • [14]. Sompolinsky Haim, Kanter Ido. Temporal association in asymmetric neural networks. Physical review letters. 1986;57(22):2861. doi: 10.1103/PhysRevLett.57.2861.
  • [15]. Chaudhry Hamza Tahir, Zavatone-Veth Jacob A, Krotov Dmitry, Pehlevan Cengiz. Long sequence hopfield memory. arXiv preprint. 2023:arXiv:2306.04532
  • [16]. Rao Rajesh PN, Ballard Dana H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience. 1999;2(1):79–87. doi: 10.1038/4580.
  • [17]. Friston Karl. Learning and inference in the brain. Neural Networks. 2003;16(9):1325–1352. doi: 10.1016/j.neunet.2003.06.005.
  • [18]. Bogacz Rafal. A tutorial on the free-energy framework for modelling perception and learning. Journal of mathematical psychology. 2017;76:198–211. doi: 10.1016/j.jmp.2015.11.003.
  • [19]. Walsh Kevin S, McGovern David P, Clark Andy, O’Connell Redmond G. Evaluating the neurophysiological evidence for predictive processing as a model of perception. Annals of the new York Academy of Sciences. 2020;1464(1):242–268. doi: 10.1111/nyas.14321.
  • [20]. Whittington James CR, Bogacz Rafal. An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity. Neural computation. 2017;29(5):1229–1262. doi: 10.1162/NECO_a_00949.
  • [21]. Song Yuhang, Lukasiewicz Thomas, Xu Zhenghua, Bogacz Rafal. Can the brain do backpropagation?—exact implementation of backpropagation in predictive coding networks. Advances in neural information processing systems. 2020;33:22566–22579.
  • [22]. Barron Helen C, Auksztulewicz Ryszard, Friston Karl. Prediction and memory: A predictive coding account. Progress in neurobiology. 2020;192:101821. doi: 10.1016/j.pneurobio.2020.101821.
  • [23]. Lisman John, Redish A David. Prediction, sequences and the hippocampus. Philosophical Transactions of the Royal Society B: Biological Sciences. 2009;364(1521):1193–1201. doi: 10.1098/rstb.2008.0316.
  • [24]. Yoo Jinsoo, Wood Frank. Bayespcn: A continually learnable predictive coding associative memory. Advances in Neural Information Processing Systems. 2022;35:29903–29914.
  • [25]. Salvatori Tommaso, Pinchetti Luca, Millidge Beren, Song Yuhang, Bao Tianyi, Bogacz Rafal, Lukasiewicz Thomas. Learning on arbitrary graph topologies via predictive coding. Advances in neural information processing systems. 2022;35:38232–38244.
  • [26]. Rao Rajesh PN. Correlates of attention in a model of dynamic visual recognition. Advances in neural information processing systems. 1997;10
  • [27]. Rao Rajesh PN, Ballard Dana H. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural computation. 1997;9(4):721–763. doi: 10.1162/neco.1997.9.4.721.
  • [28]. Rao Rajesh PN. An optimal estimation approach to visual perception and learning. Vision research. 1999;39(11):1963–1989. doi: 10.1016/s0042-6989(98)00279-x.
  • [29]. Ororbia Alexander, Mali Ankur, Lee Giles C, Kifer Daniel. Continual learning of recurrent neural networks by locally aligning distributed representations. IEEE Transactions on Neural Networks and Learning Systems. 2020;31(10):4267–4278. doi: 10.1109/TNNLS.2019.2953622.
  • [30]. Duong Lyndon R, Lipshutz David, Heeger David J, Chklovskii Dmitri B, Simoncelli Eero P. Statistical whitening of neural populations with gain-modulating interneurons. arXiv preprint. 2023:arXiv:2301.11955
  • [31]. Pehlevan Cengiz, Chklovskii Dmitri B. Neuroscience-inspired online unsupervised learning algorithms: Artificial neural networks. IEEE Signal Processing Magazine. 2019;36(6):88–96.
  • [32]. Golkar Siavash, Tesileanu Tiberiu, Bahroun Yanis, Sengupta Anirvan, Chklovskii Dmitri. Constrained predictive coding as a biologically plausible model of the cortical hierarchy. Advances in Neural Information Processing Systems. 2022;35:14155–14169.
  • [33]. Tolman Edward C. Cognitive maps in rats and men. Psychological review. 1948;55(4):189. doi: 10.1037/h0061626.
  • [34]. Behrens Timothy EJ, Muller Timothy H, Whittington James CR, Mark Shirley, Baram Alon B, Stachenfeld Kimberly L, Kurth-Nelson Zeb. What is a cognitive map? organizing knowledge for flexible behavior. Neuron. 2018;100(2):490–509. doi: 10.1016/j.neuron.2018.10.002.
  • [35]. Rao Rajesh PN, Jiang Linxing Preston. Predictive coding theories of cortical function. 2022. URL https://oxfordre.com/neuroscience/view/10.1093/acrefore/9780190264086.001.0001/acrefore-9780190264086-e-328.
  • [36]. Kalman Rudolph Emil. A new approach to linear filtering and prediction problems. 1960
  • [37]. Millidge Beren, Tang Mufeng, Osanlouy Mahyar, Bogacz Rafal. Predictive coding networks for temporal prediction. bioRxiv. 2023 doi: 10.1101/2023.05.15.540906. URL https://www.biorxiv.org/content/early/2023/05/16/2023.05.15.540906.
  • [38]. Chen Yusi, Zhang Huanqiu, Sejnowski Terrence J. Hippocampus as a generative circuit for predictive coding of future sequences. bioRxiv. 2022:2022–05
  • [39]. Jiang Linxing Preston, Rao Rajesh PN. Dynamic predictive coding: A new model of hierarchical sequence learning and prediction in the cortex. bioRxiv. 2022:2022–06. doi: 10.1371/journal.pcbi.1011801.
  • [40]. van den Oord Aaron, Li Yazhe, Vinyals Oriol. Representation learning with contrastive predictive coding. arXiv preprint. 2018:arXiv:1807.03748
  • [41]. Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser Łukasz, Polosukhin Illia. Attention is all you need. Advances in neural information processing systems. 2017;30
  • [42]. Amari S-I. Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on computers. 1972;100(11):1197–1206.
  • [43]. Jensen Ole, Idiart MA, Lisman John E. Physiologically realistic formation of autoassociative memory in networks with theta/gamma oscillations: role of fast nmda channels. Learning & Memory. 1996;3(2-3):243–256. doi: 10.1101/lm.3.2-3.243.
  • [44]. Mehta Mayank R, Quirk Michael C, Wilson Matthew A. Experience-dependent asymmetric shape of hippocampal receptive fields. Neuron. 2000;25(3):707–715. doi: 10.1016/s0896-6273(00)81072-7.
  • [45]. Levy William B. Psychology of learning and motivation. Vol. 23. Elsevier; 1989. A computational approach to hippocampal function; pp. 243–305.
  • [46]. Howard Marc W, Kahana Michael J. A distributed representation of temporal context. Journal of mathematical psychology. 2002;46(3):269–299.
  • [47]. Howard Marc W, MacDonald Christopher J, Tiganj Zoran, Shankar Karthik H, Du Qian, Hasselmo Michael E, Eichenbaum Howard. A unified mathematical framework for coding time, space, and sequences in the hippocampal region. Journal of Neuroscience. 2014;34(13):4692–4707. doi: 10.1523/JNEUROSCI.5808-12.2014.
  • [48]. Botvinick Matthew M, Plaut David C. Short-term memory for serial order: a recurrent neural network model. Psychological review. 2006;113(2):201. doi: 10.1037/0033-295X.113.2.201.
  • [49]. Eliasmith Chris, Stewart Terrence C, Choo Xuan, Bekolay Trevor, DeWolf Travis, Tang Yichuan, Rasmussen Daniel. A large-scale model of the functioning brain. science. 2012;338(6111):1202–1205. doi: 10.1126/science.1225266.
  • [50]. Whittington James, Muller Timothy, Mark Shirley, Barry Caswell, Behrens Tim. Generalisation of structural knowledge in the hippocampal-entorhinal system. Advances in neural information processing systems. 2018;31
  • [51]. Whittington James CR, Muller Timothy H, Mark Shirley, Chen Guifen, Barry Caswell, Burgess Neil, Behrens Timothy EJ. The tolman-eichenbaum machine: unifying space and relational memory through generalization in the hippocampal formation. Cell. 2020;183(5):1249–1263. doi: 10.1016/j.cell.2020.10.024.
  • [52]. Whittington James CR, McCaffary David, Bakermans Jacob JW, Behrens Timothy EJ. How to build a cognitive map. Nature neuroscience. 2022;25(10):1257–1272. doi: 10.1038/s41593-022-01153-y.
  • [53]. Freyja Ólafsdóttir H, Bush Daniel, Barry Caswell. The role of hippocampal replay in memory and planning. Current Biology. 2018;28(1):R37–R50. doi: 10.1016/j.cub.2017.10.073.
  • [54]. Barak Omri, Tsodyks Misha. Working models of working memory. Current opinion in neurobiology. 2014;25:20–24. doi: 10.1016/j.conb.2013.10.008.
  • [55]. Roweis Sam, Ghahramani Zoubin. A unifying review of linear gaussian models. Neural computation. 1999;11(2):305–345. doi: 10.1162/089976699300016674.
  • [56]. LeCun Yann, Cortes Corinna, Burges Christopher J. Mnist handwritten digit database. 2010;7(23):6. URL http://yann.lecun.com/exdb/mnist.
  • [57]. Demircigil Mete, Heusel Judith, Löwe Matthias, Upgang Sven, Vermet Franck. On a model of associative memory with huge storage capacity. Journal of Statistical Physics. 2017;168:288–299.
  • [58]. Srivastava Nitish, Mansimov Elman, Salakhudinov Ruslan. Unsupervised learning of video representations using lstms; International conference on machine learning; 2015. pp. 843–852.
  • [59]. Krizhevsky Alex, Nair Vinod, Hinton Geoffrey. The CIFAR-10 dataset. 2014;55(5) online: http://www.cs.toronto.edu/kriz/cifar.html.
  • [60]. Soomro Khurram, Zamir Amir Roshan, Shah Mubarak. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint. 2012:arXiv:1212.0402
  • [61]. Crannell CW, Parrish JM. A comparison of immediate memory span for digits, letters, and words. The Journal of Psychology. 1957;44(2):319–327.
  • [62]. Henson Richard NA. Short-term memory for serial order: The start-end model. Cognitive psychology. 1998;36(2):73–137. doi: 10.1006/cogp.1998.0685.
  • [63]. Whitehead Steven D, Ballard Dana H. Learning to perceive and act by trial and error. Machine Learning. 1991;7:45–83.
