Author manuscript; available in PMC: 2023 Feb 6.
Published in final edited form as: Proc Mach Learn Res. 2022 Jul;162:15561–15583.

Universal Hopfield Networks: A General Framework for Single-Shot Associative Memory Models

Beren Millidge 1, Tommaso Salvatori 2, Yuhang Song 1,2, Thomas Lukasiewicz 3,2, Rafal Bogacz 1
PMCID: PMC7614148  EMSID: EMS163745  PMID: 36751405

Abstract

A large number of neural network models of associative memory have been proposed in the literature. These include the classical Hopfield networks (HNs), sparse distributed memories (SDMs), and more recently the modern continuous Hopfield networks (MCHNs), which possess close links with self-attention in machine learning. In this paper, we propose a general framework for understanding the operation of such memory networks as a sequence of three operations: similarity, separation, and projection. We derive all these memory models as instances of our general framework with differing similarity and separation functions. We extend the mathematical framework of Krotov & Hopfield (2020) to express general associative memory models using neural network dynamics with local computation, and derive a general energy function that is a Lyapunov function of the dynamics. Finally, using our framework, we empirically investigate the capacities of these associative memory models under different similarity functions, beyond the dot-product similarity measure, and demonstrate that Euclidean or Manhattan distance similarity metrics perform substantially better in practice on many tasks, enabling more robust retrieval and higher memory capacity than existing models.

1. Introduction

Associative, or ‘semantic’, memories are memory systems where data points are retrieved not by an explicit address, but by making a query to the system of approximately the same type as the data points that it stores. The system then returns the closest data point to the query according to some metric. For instance, an associative memory system, when given an image, can be used to return other ‘similar’ images. It is often argued that the brain similarly stores and retrieves its own memories (Hinton & Anderson, 2014; Rolls, 2013; Tsodyks & Sejnowski, 1995), as it is a common experience to be able to recall a memory given a partial cue, e.g., recalling a song given just a few notes (Bonetti et al., 2021). A large literature of neuroscience and computational theories has developed models of how such associative memory systems could be implemented in relatively biologically plausible neural network architectures (Kanerva, 1988; 1992; Hopfield, 1982; Hinton & Anderson, 2014).

Two classical and influential models are the Hopfield network (HN) (Hopfield, 1982; 1984) and the sparse distributed memory (SDM) (Kanerva, 1988; 1992; Jaeckel, 1989). More recently, they have been generalized to the modern continuous Hopfield network (MCHN) (Ramsauer et al., 2020) and the modern continuous sparse distributed memory (MCSDM) (Bricken & Pehlevan, 2021), which have substantially improved performance, possess close relationships with transformer attention, and can handle continuous inputs.

Here, we propose a unified framework that encompasses all these models as simple instantiations of a more general model, which we call the universal Hopfield network (UHN). Mathematically, the UHN can be described as a function UHN: ℝI → ℝO mapping a vector in an input space of dimension I to a vector in an output space of dimension O, with two additional inputs: a memory matrix M of size N × I, consisting of a set of N stored patterns, and a projection matrix P of size O × N, consisting of a potentially different set of stored patterns of dimension O for heteroassociation. The dimensionalities of the input and output patterns are allowed to differ so that heteroassociative memories can be described in the same framework; for autoassociative memories, I = O. The UHN function can be factorized into a sequence of three operations, illustrated in Figure 1: similarity, separation, and projection. First, similarity: the query is matched against the stored set of memory vectors to produce a vector of similarity scores. Second, separation: small differences in the original similarity scores are numerically magnified into large differences in the output scores, so as to increase their relative separation. Finally, projection: the vector of separated similarity scores is multiplied by a projection matrix, so that the output is constructed essentially from a list of stored data points in the memory1 weighted by the separated similarity scores, and the network’s output is most influenced by memories similar to the query vector. The main contributions of this paper are briefly as follows.

  • We define a general framework of universal Hopfield networks, which clarifies the core computation underlying single-shot associative memory models.

  • We demonstrate how existing models in the literature are special cases of this general framework, which can be expressed as an extension of the energy-based model proposed by Krotov & Hopfield (2020).

  • We demonstrate that our framework allows straightforward generalizations to define novel associative memory networks with superior capacity and robustness to MCHNs by using different similarity functions.

Figure 1.


Left: Schematic of the key equations that make up the general theory of the abstract Hopfield network, showing the factorization of a UHN into similarity, separation, and projection. Right: Visual representation of the factorization diagram when performing an associative memory task on three stored memories. The corrupted data point is scored against the three memories (similarity). The differences in scores are then exaggerated (separation) and used to retrieve a stored memory (projection).

It is also important to draw a distinction between feedforward and iterative associative memory models. In feedforward models, memory retrieval is performed through a fixed computation mapping a query to its retrieved output. Examples of feedforward associative memory models include the DAM (Krotov & Hopfield, 2016), the MCHN (Ramsauer et al., 2020), and the MCSDM (Bricken & Pehlevan, 2021), which effectively instantiate 2-layer MLPs and perform retrieval as a feedforward pass through the network. Conversely, iterative associative memory models retrieve memories either by iterating over neurons (Hopfield, 1982; Demircigil et al., 2017) or by iterating over multiple forward passes of the network, feeding the output back into the network as a new input. It has been shown empirically that standard autoencoder networks (Radhakrishnan et al., 2018; 2020; Jiang & Pehlevan, 2020) and predictive coding networks (Salvatori et al., 2021) can store memories as fixed points of these dynamics. In Section 3 and our experiments, we primarily investigate feedforward associative memories, while in Section 4 we derive a general framework and energy function that can support both feedforward and iterative associative memory models.

The rest of this paper is organized as follows. In Section 2, we define the mathematical framework of universal Hopfield networks. In Section 3, we show how existing models can be derived as special cases of our framework. In Section 4, we extend the neural model of Krotov & Hopfield (2020) to define an energy function and associated neural dynamics for the UHN. In Section 5, we show that our framework enables generalization to novel similarity and separation functions, which result in higher capacity and more robust networks, while experiments on the separation functions empirically confirm theoretical results regarding the capacities of associative memory models.

2. Universal Hopfield Networks (UHNs)

A single-shot associative memory can be interpreted as a function that takes an input vector q (ideally, a corrupted version of a data point already in memory) and outputs a vector corresponding to the closest stored data point. Mathematically, our framework argues that every feedforward associative memory in the literature admits the following factorization, which defines an abstract and general universal Hopfield network (UHN):

$$z = \underbrace{P}_{\text{Projection}} \cdot \underbrace{\mathrm{sep}}_{\text{Separation}}\big(\underbrace{\mathrm{sim}(M, q)}_{\text{Similarity}}\big), \qquad (1)$$

where z is the O × 1 output vector of the memory system, P is a projection matrix of dimension O × N, sep is the separation function, sim is the similarity function, M is an N × I matrix of stored memories or data points, and q is the query vector of dimension I × 1.

The intuition behind this computation is that, given an input query, we first want to rank how similar this query is to all the stored memories. This is achieved by the similarity function, which outputs a vector of similarity scores between each data point held in the memory and the query. Given these similarity scores, since we will ultimately retrieve a linear combination of the patterns stored in the projection matrix, weighted by their similarity scores, and since we want to produce one clear output pattern without interference from the other patterns, we need a way to emphasize the top score and de-emphasize the rest. This is achieved by the separation function. It is well known that separation functions of higher polynomial degree lead to capacity increases of the order of $C \cdot N^{n-1}$, where N is the number of visible (input) neurons, and n is the order of the polynomial (Chen et al., 1986; Horn & Usher, 1988; Baldi & Venkatesh, 1987; Abbott & Arian, 1987; Caputo & Niemann, 2002; Krotov & Hopfield, 2016), while exponential separation functions (such as the softmax) lead to exponential memory capacity (Demircigil et al., 2017; Ramsauer et al., 2020). Taking this further, it is easy to see that simply using a max separation function leads to a theoretically unbounded capacity in terms of the dimension of the query vector, since presenting an already stored pattern as a query will then always return that pattern. However, the ‘attractors’ in such a network grow increasingly small so that, in practice, the real bound on performance is not the capacity but rather the ability of the similarity function to distinguish between the query and the various possible stored patterns. This pattern is already clear with the ‘exponential-capacity’ MCHN, which despite its theoretical exponential capacity often performs relatively poorly at retrieval in practice with corrupted or noisy queries. Finally, the projection matrix takes the vector of separated similarity scores and maps it to the correct output expected of the network.
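To make the factorization concrete, the following is a minimal NumPy sketch of the retrieval rule in Equation 1. The dot-product similarity and softmax separation shown here recover the MCHN; the function names, the β inverse-temperature argument, and the toy data are illustrative choices rather than definitions taken from the paper.

```python
import numpy as np

def dot_similarity(M, q):
    # One similarity score per stored memory (per row of M).
    return M @ q                              # shape (N,)

def softmax_separation(scores, beta=1.0):
    # Exaggerate the gap between the best match and the rest.
    s = beta * scores
    e = np.exp(s - s.max())                   # subtract max for numerical stability
    return e / e.sum()                        # shape (N,)

def uhn_retrieve(P, M, q, sim=dot_similarity, sep=softmax_separation):
    # Equation 1: z = P . sep(sim(M, q)).
    return P @ sep(sim(M, q))                 # shape (O,)

# Autoassociative toy example: P = M^T, so retrieval returns a stored pattern.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 16))              # N = 5 memories of dimension I = 16
q = M[2] + 0.1 * rng.standard_normal(16)      # corrupted copy of memory 2
z = uhn_retrieve(M.T, M, q, sep=lambda s: softmax_separation(s, beta=10.0))
print(np.linalg.norm(M - z, axis=1).argmin()) # expected to print 2
```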

Importantly, Equation 1 can be interpreted as a feedforward pass through an artificial neural network with a single hidden layer, where the activation function of the first layer is the separation function, and the activation function of the output is linear or else some post-processing function such as binarization (as in the classical HN). Interpretations of memory networks in this way have been independently proposed by Kanerva (1988) for the SDM and recently by Krotov (2021) for the MCHN (Ramsauer et al., 2020). Furthermore, coming from the other direction, recent work has also begun to suggest that standard 2-layer multi-layer perceptrons (MLPs) may naturally tend to function as associative memory models in practice. For instance, Geva et al. (2020) show that the feedforward layers of the transformer appear to serve as key-value memories, and it has been suggested that these feedforward layers can be replaced simply with persistent memory vectors (Sukhbaatar et al., 2019).

3. Instances of Universal Hopfield Networks

Now that we have defined our universal Hopfield network (UHN), we shall show how the currently existing main associative memory models can be derived as specific instances of the UHN. The equivalences are summarized in Table 1.

Table 1. Associative memory models.

Memory Network Similarity Function Separation Function
(Classical) Hopfield Network (HN) Dot Product Identity
Sparse Distributed Memory (SDM) Hamming Distance Threshold
Dense Associative Memory (DAM) Dot Product Polynomial
Modern Continuous Hopfield Network (MCHN) Dot Product Softmax

3.1. Hopfield Networks

Hopfield networks (HNs) (Hopfield, 1982; 1984) consist of a single neural network layer that stores an array of binary memories M = [m1, m2, . . . , mN], where M is an N × I matrix, I is the dimension of each memory vector, and N is the number of memories stored. The memories are stored in a synaptic weight matrix W = MMT. Memories are retrieved by fixing the input neurons to a query pattern q, which is a binary vector of length I. While the original HN of Hopfield (1984) iteratively minimized the energy function over individual neurons, we here describe the ‘feedforward’ Hopfield networks described in (Krotov & Hopfield, 2016; Little, 1974), which retrieve memories by performing a forward pass through the network to compute an output z = sign(W · q), where sign is the sign function, and z, the retrieved pattern, is also a binary vector of length I (since the HN is autoassociative). This process can be repeated, if necessary, to further minimize the energy by feeding the reconstructed output back into the network as its input. This network can be interpreted as minimizing a ‘Hopfield energy function’, which is equivalent to the energy function of an Ising spin-glass model (Kirkpatrick & Sherrington, 1978; Keeler, 1988). To show that the HN is an example of a UHN, first recall that the synaptic weight matrix in the HN is defined not as the stored pattern matrix but as the outer product W = MMT. By substituting this into the HN update rule, we obtain $z = \mathrm{sign}(MM^T q) = \mathrm{sign}(M\,\mathcal{I}(M^T q))$, where we use $\mathcal{I}$ to denote the identity function. Thus, we can understand the HN within our framework as using a dot-product similarity function and an identity separation function (which is the cause of the HN’s relatively poor storage capacity). The sign function plays no part in memory retrieval and simply binarizes the network’s output.
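A minimal sketch of the HN read-out written in the UHN form just described: dot-product similarity, identity separation, and a sign post-processing step. Memories are stored as rows of M here, following the N × I convention of Section 2 (so the weight matrix corresponds to M^T M rather than the M M^T written above); the sizes and the corruption level are illustrative.

```python
import numpy as np

def hopfield_retrieve(M, q):
    scores = M @ q                    # similarity: one dot product per stored memory
    separated = scores                # identity separation (the source of the HN's low capacity)
    return np.sign(M.T @ separated)   # projection through the memories, then binarize

rng = np.random.default_rng(1)
M = rng.choice([-1.0, 1.0], size=(2, 64))        # 2 bipolar memories of dimension 64
q = M[0].copy()
q[:5] *= -1                                      # flip 5 bits of memory 0
print((hopfield_retrieve(M, q) != M[0]).sum())   # number of incorrect bits; expected 0
```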

3.2. Sparse Distributed Memories

Sparse distributed memories (Kanerva, 1988; 1992) (SDM) are designed to heteroassociate long binary vectors. The network consists of two matrices — an ‘Address’ matrix and a ‘Pattern’ matrix. Memories are thought of as being stored in a data-type with both an ‘Address’ and a ‘Pattern’ pointer.

To retrieve a memory, a query vector is compared against all stored addresses in the Address matrix, and the binary Hamming distance between the query and each address is computed. All addresses within a threshold Hamming distance of the query are then activated. The memory is retrieved by summing the pattern pointers of all the addresses activated by the query. The ‘read’ phase of the SDM (Kanerva, 1988) can be written mathematically as P · thresh(d(M, q)), where d is the Hamming distance function, and thresh is a threshold function that returns 1 if the Hamming distance is less than some threshold, and 0 otherwise. It is thus clear that the SDM can be naturally understood within our framework, with similarity function d (Hamming distance) and separation function thresh, which implements a top-k operation to cut out poor matches.
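A sketch of the SDM read phase P · thresh(d(M, q)) for binary vectors, with Hamming-distance similarity and a hard-threshold separation, followed by a majority vote to rebinarize the output; the matrix sizes and the threshold value are illustrative.

```python
import numpy as np

def sdm_read(A, P, q, hamming_threshold):
    # A: (N, I) binary addresses, P: (N, O) binary patterns, q: (I,) binary query.
    d = (A != q).sum(axis=1)                          # similarity: Hamming distance per address
    active = (d < hamming_threshold).astype(float)    # separation: hard threshold keeps close matches
    summed = P.T @ active                             # projection: sum the patterns of active addresses
    return (summed > active.sum() / 2).astype(int)    # majority vote to rebinarize the output

rng = np.random.default_rng(2)
A = rng.integers(0, 2, size=(50, 100))   # 50 stored addresses of 100 bits
P = rng.integers(0, 2, size=(50, 20))    # associated 20-bit patterns
q = A[7].copy(); q[:5] ^= 1              # query: address 7 with 5 bits flipped
print(np.array_equal(sdm_read(A, P, q, hamming_threshold=20), P[7]))   # expected True
```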

3.3. Dense Associative Memories and Modern Continuous Hopfield Networks

In recent years, the capabilities of both of these classical memory models have been substantially improved, and a number of new Hopfield architectures have been developed based on the dense associative memory (DAM) initially proposed by Krotov & Hopfield (2016) and extended by Demircigil et al. (2017). Specifically, Krotov & Hopfield (2016) argued for generalizing the standard Hopfield energy function $E = q^T W q + q^T b$ (where b is an I × 1 bias vector to convert between binary and bipolar representations) to an arbitrary function of q and W: $E = F(W \cdot q)$, and showed that as F becomes a polynomial of increasing order, the memory storage capacity of the network increases as $C \cdot I^{n-1}$, where I is the number of visible neurons, and n is the order of the polynomial. Demircigil et al. (2017) extended this argument to exponential interaction functions of the form $F(x) = e^x$ and showed that the resulting networks have exponential storage capacity. Then, Ramsauer et al. (2020) generalized these networks to continuous (instead of binary) inputs to derive the modern continuous Hopfield network (MCHN). The MCHN uses the energy function $E = q^T q - \operatorname{logsumexp}(Wq)$, which can be minimized with the concave-convex procedure (Yuille & Rangarajan, 2003), giving the update rule $z = W^T \sigma(Wq)$, where σ is the softmax function. This update enables exponential capacity and memory retrieval in a single step, and is extremely similar to the feedforward pass of a self-attention unit $z = V^T \sigma(KQ)$ with ‘Query Matrix’ Q, ‘Key Matrix’ K, and ‘Value Matrix’ V, where we can associate Q = q, K = W, and V = W (Bahdanau et al., 2014; Vaswani et al., 2017; Devlin et al., 2018; Brown et al., 2020; Radford et al., 2019). Lastly, Krotov & Hopfield (2020) presented a unified set of neural dynamics that can reproduce the original HN, the polynomial interaction functions of Krotov & Hopfield (2016), and the exponential Hopfield networks of Demircigil et al. (2017) and Ramsauer et al. (2020), using only local computations; Tang & Kopp (2021) have shown that these dynamics are also related to the spherical normalization dynamics in the recent MLP-Mixer (Tolstikhin et al., 2021).
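As a quick numerical illustration of the MCHN/attention correspondence noted above, the sketch below computes the MCHN retrieval z = W^T σ(βWq) and the same quantity written in attention form, σ(βqW^T)W, with K = V = W; the β inverse temperature and the toy sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
W = rng.standard_normal((8, 32))           # 8 stored patterns, one per row ("keys"/"values")
q = W[5] + 0.05 * rng.standard_normal(32)  # noisy query
beta = 8.0

z_mchn = W.T @ softmax(beta * (W @ q))     # MCHN single-step retrieval z = W^T sigma(beta W q)
z_attn = softmax(beta * (q @ W.T)) @ W     # the same computation written as attention: sigma(q K^T) V
print(np.allclose(z_mchn, z_attn))         # True: identical up to transposition
```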

3.4. Continuous Sparse Distributed Memories

Recent work has also uncovered a close link between SDMs and transformer attention (Bricken & Pehlevan, 2021). Recall that the SDM read rule can be expressed as P · thresh(d(A, q)), where thresh is a threshold function, A is an M × N matrix of addresses, P is a K × O matrix mapping each stored data point to its associated pattern, and d is the Hamming distance between each of the stored addresses in A and the query pattern q. Bricken & Pehlevan (2021) first generalized the SDM from binary vectors to the ‘continuous SDM’, where P, A, and q contain real values instead of bits. Then, they replaced the Hamming distance (which only applies to binary vectors) with the dot product, using the argument that the Hamming distance is the dot product (mod 2) of binary vectors, and thus that the dot product is the natural generalization of the Hamming distance to continuous variables. Finally, they noted that the number of addresses that are not cut off by the threshold function decays approximately exponentially as the Hamming distance threshold decreases. The mathematical reason for this is that the distribution of addresses within a given Hamming distance of a query is binomial, which is well approximated by a Gaussian at large N, and the tails of a Gaussian distribution decay exponentially. This approximately exponential decay in the number of addresses passing the threshold allows the threshold function to be heuristically replaced with an exponential function, resulting in the approximate update rule for the ‘continuous SDM’ model $z = P\,\sigma(Aq)$, which is closely related to the self-attention update rule and is identical to the rule for the MCHN.

3.5. Auto- and Heteroassociative Memories

Our framework also provides a simple explanation of the difference between autoassociative memories (which map a corrupted version of a memory to itself) and heteroassociative memories (which map some input memory to some other memory, potentially allowing for memory chains and sequence retrieval): autoassociative memories set the projection matrix P equal to the memory matrix M, i.e., one recalls the same memories used for similarity matching, while heteroassociative memory networks set the projection matrix equal to the associated heteroassociated memories. It is thus clear why the HN and MCHN are autoassociative, and how to convert them into heteroassociative memory networks: for the MCHN, set the update rule to $z = P\,\sigma(Mq)$, and for the HN, set the weight matrix to $W = PM^T$. Demonstrations of these novel heteroassociative HNs and MCHNs are given in Appendix B. Interestingly, the heteroassociative MCHN update rule is equivalent to the self-attention update rule found in transformer networks (Vaswani et al., 2017), which suggests that the fundamental operation performed by transformer networks is heteroassociation of inputs (the queries) and memories (the keys) with other memories (the values).
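A small sketch of this point: the similarity and separation stages are identical in the two cases, and only the projection matrix changes between auto- and heteroassociation. The matrices and sizes below are illustrative; V plays the role of the heteroassociated output memories (the ‘values’).

```python
import numpy as np

def softmax(x, beta=10.0):
    e = np.exp(beta * (x - x.max()))
    return e / e.sum()

rng = np.random.default_rng(4)
M = rng.standard_normal((6, 20))      # input memories ("keys"), N = 6, I = 20
V = rng.standard_normal((6, 12))      # associated output memories ("values"), O = 12
q = M[3] + 0.1 * rng.standard_normal(20)

scores = softmax(M @ q)               # similarity + separation: shared by both cases
z_auto = M.T @ scores                 # autoassociative: P = M^T, recalls (approximately) M[3]
z_hetero = V.T @ scores               # heteroassociative: P = V^T, recalls (approximately) V[3]
print(np.linalg.norm(M - z_auto, axis=1).argmin(),
      np.linalg.norm(V - z_hetero, axis=1).argmin())   # expected: 3 3
```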

4. Neural Dynamics

In this section, extending the work of Krotov & Hopfield (2020), we present an abstract energy function for the UHN and a set of neural dynamics that minimize it, which can be specialized to reproduce any of the associative memory models in the literature. By framing associative memory models in terms of an energy function, we can describe the operation of both iterative and feedforward associative memory models, as well as mathematically investigate the properties of the fixed points that they use as memories. We define a general neural implementation and energy function for our abstract associative memory model that uses only local interactions. In this model, there are two types of ‘neurons’: ‘value neurons’ v and ‘memory neurons’ h. We can think of the ‘value neurons’ v as being initialized to the query pattern q, such that $v_{t=0} = q$, and then updated to produce the output pattern z. This is because the UHN effectively implements a two-layer artificial neural network where the value neurons are the input layer and the memory neurons are the hidden layer. The memory and value neurons are interconnected by the memory matrix M. The neural activities v and h are also passed through two activation functions g and f, such that f = f(h) and g = g(v). The network has the following recurrent neural dynamics:

$$\tau_v \frac{dv_i}{dt} = -\frac{\partial E}{\partial v_i} = \sum_j \frac{\partial\, \mathrm{sim}(M_{i,j}, v_i)}{\partial v_i}\, f_i - g(v_i), \qquad \tau_h \frac{dh_i}{dt} = -\frac{\partial E}{\partial h_i} = f'(h_i)\Big[\sum_j \mathrm{sim}(M_{i,j}, v_i) - h_i\Big], \qquad (2)$$

where τv and τh are time-constants of the dynamics. These dynamics can be derived from the energy function:

$$E(M, v, h) = \Big[\sum_i (v_i - I_i)\, g_i - L_v\Big] + \Big[\sum_i f_i h_i - L_h\Big] - \sum_{i,j} f_i\, \mathrm{sim}(M_{i,j}, v_i), \qquad (3)$$

where we define the ‘Lagrangian’ functions $L_v$ and $L_h$ such that their derivatives are equal to the activation functions, $g = \partial L_v / \partial v$ and $f = \partial L_h / \partial h$. The energy function is defined such that it only includes second-order interactions between the value and memory neurons in the third term, while the first two terms in square brackets each involve only a single set of neurons. In Appendix A, we show that the energy function is a Lyapunov function of the dynamics, i.e., it always decreases over time, as long as the Hessians of the activation functions f and g are positive definite. Starting from the dynamics and energy in Equations 2 and 3, we first define the function f to be the separation function, $f(h) = \mathrm{sep}(h)$, such that $L_h = \int \mathrm{sep}(h)\, dh$, and denote by $f'(h) = \partial f(h)/\partial h$ the derivative of f. We also set $L_v = \tfrac{1}{2}\sum_i v_i^2$, which implies $g_i = \partial L_v / \partial v_i = v_i$. A further simplification of the energy and dynamics occurs if we assume that $\tau_h$ is small, so that the dynamics of the hidden neurons are fast compared to those of the value neurons and can safely be assumed to have converged. This allows us to write $h_i = \sum_j \mathrm{sim}(M_{i,j}, v_i)$, since when setting $dh_i/dt = 0$, we can cancel the $f'(h_i)$ terms as long as $f'(h) \neq 0$, which is true for all separation functions we consider in this paper except the max function, which is therefore heuristic. This gives us the simpler and intuitive energy function:

$$E = \sum_i v_i^2 - \frac{1}{2}\sum_i v_i^2 + \sum_i f_i \sum_j \mathrm{sim}(M_{i,j}, v_i) - L_h - \sum_{i,j} f_i\, \mathrm{sim}(M_{i,j}, v_i) = \sum_i \frac{1}{2} v_i^2 - \int \mathrm{sep}\Big(\sum_j \mathrm{sim}(M_{i,j}, v_i)\Big), \qquad (4)$$

where the integral is over the input to the separation function. It is now straightforward to derive the HN and MCHN. To do so, we set $\mathrm{sim}(M, v) = Mv$ and $\mathrm{sep}(x) = x$ for the HN, and $\mathrm{sep}(x)_i = e^{x_i} / \sum_j e^{x_j}$ (the softmax) for the MCHN. Following Krotov & Hopfield (2020), for the MCHN, we can derive a single step of the dynamics by taking the gradient of the energy:

$$E = \sum_i \frac{1}{2} v_i^2 - \log \sum_j e^{\,\mathrm{sim}(M_{i,j}, v_i)}, \qquad \tau_v \frac{dv_i}{dt} = -\frac{\partial E}{\partial v_i} = -v_i + \frac{e^{\sum_j \mathrm{sim}(M_{i,j}, v_i)}}{\sum_i e^{\sum_j \mathrm{sim}(M_{i,j}, v_i)}}\, \frac{\partial\, \mathrm{sim}(M_{i,j}, v_i)}{\partial v_i}. \qquad (5)$$

If we then perform an Euler discretization of these dynamics and set Δt = τv = 1, then we obtain the following update step:

$$v^{t+1} = M^T \sigma(M v^t), \qquad (6)$$

where $\sigma(x)_i = e^{x_i} / \sum_j e^{x_j}$ is the softmax function, using the fact that the MCHN uses the dot-product similarity function $\mathrm{sim}(M, v) = Mv$. It was proven in (Ramsauer et al., 2020) that this update converges to a local minimum in a single step. This thus derives the MCHN update

$$v = M^T \sigma(M q), \qquad (7)$$

since $v_{t=0} = q$, as the visible neurons are initialized to the input query, and equilibrium occurs after a single step. Similarly, to derive the HN, we set the separation function to the identity ($\mathrm{sep}(x) = x$) and the similarity function to the dot product. Using the fact that $\int d(Mv)\, \mathrm{sep}(\mathrm{sim}(M, v)) = \int d(Mv)\, Mv = \tfrac{1}{2}(Mv)^2$, we obtain the energy function and equilibrium update rule:

$$E = \sum_i \frac{1}{2} v_i^2 - \frac{1}{2}\sum_{i,j} v_i M_{i,j} M_{i,j}^T v_i, \qquad \tau_v \frac{dv_i}{dt} = -\frac{\partial E}{\partial v_i} = -v_i + \sum_j M_{i,j} M_{i,j}^T v_i, \qquad (8)$$

where, again, if we perform the Euler discretization, we obtain the following update step:

$$v^{t+1} = M M^T v^t, \qquad (9)$$

which, with a final normalizing sign function to binarize the output reconstruction, is identical to the HN update rule. We thus see that using this abstract energy function, we can derive a Lyapunov energy function and associated local neural dynamics for any associative memory model that fits within our framework. Moreover, our framework also describes iterative associative memory models if these inference dynamics are integrated over multiple steps instead of converging in a single step.
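To illustrate how this energy view supports both single-step and iterative retrieval, here is a small sketch that Euler-integrates the MCHN instance of the dynamics and prints the energy E(v) = ½‖v‖² − β⁻¹ log Σ exp(βMv) along the trajectory. The β parameter, step size, and toy data are illustrative assumptions, and the printed energies should be non-increasing (cf. the Lyapunov argument in Appendix A).

```python
import numpy as np

def lse(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def energy(M, v, beta=1.0):
    # MCHN instance of the UHN energy (Equations 4 and 5), with an inverse temperature beta folded in.
    return 0.5 * v @ v - lse(beta * (M @ v)) / beta

def euler_step(M, v, beta=1.0, dt=1.0):
    # tau_v dv/dt = -v + M^T softmax(beta * M v); dt = tau_v = 1 recovers Equation 6.
    return v + dt * (-v + M.T @ softmax(beta * (M @ v)))

rng = np.random.default_rng(5)
M = rng.standard_normal((10, 40))             # 10 stored patterns of dimension 40
v = M[0] + 0.3 * rng.standard_normal(40)      # value neurons initialized to a noisy query
for _ in range(3):
    print(round(energy(M, v, beta=4.0), 3))   # non-increasing sequence of energies
    v = euler_step(M, v, beta=4.0)
```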

5. Experiments

Our general framework allows us to define an abstract associative memory model with arbitrary similarity and separation functions, as well as a set of neural dynamics and an associated energy function for that model. A natural question is whether we can use this abstract framework to derive more performant associative memory models by using different similarity and separation functions. In this section, we empirically test a wide range of potential separation and similarity functions on associative memory retrieval tasks. We find that similarity functions such as the Manhattan (or absolute, or l1-norm) distance metric perform substantially better than the dot-product similarity used in the MCHN across datasets and are more robust to input distortion. We thereby define novel associative memory models with state-of-the-art performance, which scale beyond what has previously been considered in the literature, especially on the Tiny ImageNet dataset. In Appendix E, we discuss the detailed numerical normalizations and other implementation details that are necessary to achieve good performance in practice.

5.1. Capacity under Different Similarity Functions

We investigate the capacity of the associative memory models under increasing numbers of stored memories for a suite of potential similarity functions. The similarity and separation functions tested are defined in Appendix D. We tested the retrieval capacity on three image datasets: MNIST, CIFAR10, and Tiny ImageNet. All images were normalized such that all pixel values lay between 0 and 1. Before presenting the images to the network as queries, they were flattened into a single vector. When masking the images, the masked-out pixels were set to 0. When adding Gaussian noise to the images, we clipped the pixel values after the noise was added to keep all values between 0 and 1.
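For concreteness, here is a sketch of the evaluation protocol just described (dataset loading omitted): images are flattened into rows of the memory matrix with pixel values in [0, 1], queries are half-masked copies of stored images, and a retrieval counts as correct when the summed squared error to the true image is below a threshold (50, with softmax β = 100, as in Figure 2). The helper names, the negation of the distances to turn them into similarity scores, and the row-major reading of ‘top half’ are assumptions; the exact definitions are given in Appendices D and E.

```python
import numpy as np

def softmax(x, beta=100.0):
    e = np.exp(beta * (x - x.max()))
    return e / e.sum()

# Candidate similarity functions (distances negated so that larger means more similar).
manhattan = lambda M, q: -np.abs(M - q).sum(axis=1)
euclidean = lambda M, q: -np.sqrt(((M - q) ** 2).sum(axis=1))
dot       = lambda M, q: M @ q

def retrieval_accuracy(images, sim, threshold=50.0, mask_frac=0.5):
    M = images                                  # (N, I) memory matrix, one flattened image per row
    correct = 0
    for i in range(len(M)):
        q = M[i].copy()
        q[: int(mask_frac * q.size)] = 0.0      # zero-mask the top fraction of pixels
        z = M.T @ softmax(sim(M, q))            # autoassociative UHN retrieval
        correct += int(((z - M[i]) ** 2).sum() < threshold)
    return correct / len(M)
```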

From Figure 2, we see that the similarity function has a large effect on the memory capacity of the associative memory models. Empirically, we see very robustly that the highest performing and highest capacity similarity function is the Manhattan distance sim(M, q) = |M − q|, where the subtraction is taken over rows of the memory matrix. Moreover, the superiority of the Manhattan distance as a similarity function appears to grow with the complexity of the dataset: it is roughly equivalent to the Euclidean and dot-product similarities on MNIST, slightly better on CIFAR10, and substantially better on Tiny ImageNet. The Euclidean distance also performs very well across image datasets. Other potential measures, such as the KL divergence, Jensen-Shannon distance, and reverse KL divergence, perform substantially worse than the simple Euclidean, dot-product, and Manhattan distance measures. The dot-product metric used in the MCHN also performs very well, although it must be carefully normalized (see Appendix E). Interestingly, for a given similarity function, we see stable levels of performance across a wide range of numbers of stored memories.

Figure 2. Capacity of the associative memory networks with different similarity functions, as measured by increasing the number of stored images.


The capacity is measured as the fraction of correct retrievals. To test retrieval, the top half of the image was masked with all zeros (equivalent to a masked fraction of 0.5 in Figure 4) and then presented as the query vector to the network. A retrieval was counted as correct if the summed squared difference between all pixels in the retrieved image and the true reconstruction was less than a threshold T, set at 50. The queries were presented as the stored images corrupted with independent Gaussian noise with a variance of 0.5. Shown are mean retrievals over 10 runs with different sets of stored images. Error bars are computed as the standard deviations of the correct retrievals over the 10 runs. A softmax separation function was used with a β parameter of 100.

The similarity functions are so important, because they are the fundamental method by which the abstract associative memory model can perform ranking and matching of the query to memory vectors. An ideal similarity function would preserve a high similarity across semantically non-meaningful transformations of the query vectors (i.e., insensitive to random noise, perturbations, and masking of parts of the image), while returning a low similarity for transformed queries originating from other memory vectors. An interesting idea is that, while thus far we have used simple similarity functions such as the dot product and the Euclidean distance, it is possible to define smarter distance metrics native to certain data types, which should be expected to give an improved performance. Moreover, it may be possible to directly learn useful similarity functions by defining the similarity function itself as a neural network trained on a contrastive loss function to minimize differences between variants of the same memory and maximize differences between variants of different ones.

5.2. Capacity under Different Separation Functions

In Figure 3, we considered the effect of the separation function on retrieval capacity by measuring retrieval performance with a fixed similarity function (dot product) for different separation functions (defined in Appendix D). The empirical effect of the separation function on capacity appears to align closely with known theoretical results (Demircigil et al., 2017; Keeler, 1988; Abu-Mostafa & Jacques, 1985; Ma, 1999; Wu et al., 2012): the exponential and max functions have substantially higher capacity than the other separation functions, and low-order polynomial or weaker separation functions suffer a very rapid decline in retrieval performance as the number of stored memories increases. High-order polynomials perform very well, as predicted by the mathematical capacity results in (Krotov & Hopfield, 2016; Demircigil et al., 2017). Here, the softmax performs relatively poorly compared to the 10th-order polynomial because the β parameter of the softmax was set to 1, which was done for a fair comparison to the other methods. However, as β → ∞, the softmax function tends to the max, so the relative performance of the softmax can be increased simply by increasing β. The importance of the separation function, and especially of using ‘high-powered’ separation functions such as the softmax, max, and 10th-order polynomial, increases with the complexity of the data. This is due to the greater level of interference caused by more complex and larger images, which requires a more powerful separation function to numerically push apart the similarity scores.
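The separation functions compared here can be sketched as follows; the exact definitions and the numerical normalizations needed in practice are given in Appendices D and E, so these are reasonable stand-ins rather than the paper's exact implementations (the max, in particular, is written as a one-hot winner-take-all).

```python
import numpy as np

def identity_sep(s):                  # classical HN: no separation at all
    return s

def polynomial_sep(s, n=10):          # DAM-style: high orders approach the max
    return s ** n

def softmax_sep(s, beta=1.0):         # MCHN: tends to the max as beta -> infinity
    e = np.exp(beta * (s - s.max()))
    return e / e.sum()

def max_sep(s):                       # winner-take-all: no interference between memories
    out = np.zeros_like(s)
    out[np.argmax(s)] = 1.0
    return out
```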

Figure 3. The retrieval capacity of the network on retrieving half-masked images using the dot-product similarity function.


Plotted are the means and standard deviations of 10 runs. A query was classed as correctly retrieved if the sum of squared pixel differences was less than a threshold of 50.

5.3. Retrieval under Different Similarity Functions

We also tested (Figure 4) the effect of the similarity function on the retrieval capacity of the network for different levels of noise or masking of the query vector, as a proxy for the robustness of the memory network. We tested retrieval under two types of query perturbation: Gaussian noise and masking. In the first case, independent zero-mean Gaussian noise with a specified variance σ was added elementwise to the query image. As the image pixel values were restricted to lie in the range [0, 1], a σ of 1 results in a huge distortion of the original image. With masking, the top k fraction of pixels was set to 0; a fraction of 0.9 results in only the bottom 10% of the image being visible in the query vector. Example visualizations of different noise levels and masking fractions are given in Appendix C.
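A sketch of the two query corruptions used here, under the stated conventions (pixel values in [0, 1], images flattened row-major so that the ‘top’ of the image is the first fraction of the vector); note that σ in the text denotes the noise variance, so the standard deviation passed to the sampler is its square root.

```python
import numpy as np

def gaussian_corrupt(x, variance, rng):
    # Add i.i.d. zero-mean Gaussian noise and clip back into [0, 1].
    noisy = x + rng.normal(0.0, np.sqrt(variance), size=x.shape)
    return np.clip(noisy, 0.0, 1.0)

def mask_corrupt(x, frac):
    # Zero out the top `frac` fraction of the flattened image.
    q = x.copy()
    q[: int(frac * q.size)] = 0.0
    return q
```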

Figure 4.


Top Row: Retrieval capability against increasing levels of i.i.d. Gaussian noise added to the query images for different similarity functions. Bottom Row: Retrieval capability against increasing fractions of zero-masking of the query image. The networks used a memory of 100 images with the softmax separation function. Error bars are across 10 separate runs with different sets of stored memories. Datasets used, left to right: MNIST, CIFAR10, and Tiny ImageNet.

We observe in Figure 4 that the choice of similarity function has a strong effect on the robustness of retrieval under different kinds of perturbation. For independent Gaussian noise, it largely appears that the dot-product similarity measures allow for relatively robust reconstructions even up to very high levels of noise, which would make the queries uninterpretable to humans (see Appendix C). The Manhattan distance similarity metric, however, performs better under masking of the image, at least for relatively small masked fractions, although for Tiny ImageNet the dot-product similarity function appears to be more robust to extremely high masking fractions. Overall, it appears that the similarity function plays a large role in the degree of robustness of the memory to corrupted queries, but that the same few similarity functions, such as the dot product and Manhattan distance, consistently perform well across a range of circumstances.

6. Discussion

In this paper, we have proposed a simple and intuitive general framework that unifies existing single-shot associative memory models in the literature. Moreover, we have shown that this scheme comes equipped with a set of local neural dynamics, and that it leads immediately to useful generalizations obtained by varying the similarity function. This led to the discovery of the superior performance of the Manhattan distance, which outperforms the state-of-the-art MCHN on various retrieval tasks with complex images. Finally, our general framework makes clear the natural relationship between auto- and heteroassociative memory models, which amounts entirely to the selection of the projection matrix P, a fact that has often been unclear in the literature.

Our framework gives a clear insight into the two key steps and bottlenecks of current associative memory models. The major bottleneck is the similarity function, which is fundamental to the retrieval capacity and performance of the model; assuming a sufficiently powerful separation function, the similarity metric is the main determinant of retrieval performance, especially for noisy or corrupted queries. Here, we only considered single-layer networks, which apply the similarity function directly to raw image data. However, performance may be increased by first feeding the raw queries through a set of preprocessing steps or, alternatively, through an encoder network trained to produce a useful latent representation of the input, and then performing associative memory on the latent representations.

This naturally leads to a hierarchical scheme for associative memory models, which will be explored in future work. This scheme also has close connections with the field of metric learning (Kulis et al., 2013; Yang, 2007), where we consider the similarity function as defining a metric on the underlying data geometry, and the associative memory then simply performs nearest-neighbour matching with this metric. Using predefined similarity functions corresponds to directly defining a metric on the space; using a deep neural network to map into a latent space and then performing similarity scoring on that latent space is instead equivalent to a learnable metric implicitly parametrized by the deep neural network encoder (Kaya & Bilge, 2019).

A conceptual benefit of our framework is that it makes clear that single-shot associative memory models are simply two-layer MLPs with an unusual activation function (i.e., the separation function), which works best as a softmax or max function, and whose weight matrices directly encode explicit memory vectors instead of being learnt with backpropagation. This leads immediately to the question of whether standard MLPs in machine learning can be interpreted as associative memories instead of hierarchical feature extractors. A crucial requirement for an MLP to function as an associative memory appears to be a high degree of sparsity of the intermediate representations (ideally one-hot), so that an exact memory can be reconstructed instead of a linear combination of multiple memories. With a dense representation at the intermediate layers, no exact memory can be reconstructed, and the network will instead function as a feature detector. This continuum between associative memories and standard MLPs, which depends on the sparsity of the intermediate representations, has resonances in neuroscience, where neural representations are typically highly sparse, and also helps contextualize results showing that associative memory capabilities naturally exist in standard machine learning architectures (Radhakrishnan et al., 2018).

In terms of the separation function, it is clear that for exact retrieval the max function is simply the best option, as it removes any interference between different stored memories. The improvement of the separation function is the fundamental cause behind the vast gulf in theoretical capacity and practical performance between the classical HN and the MCHN. It is straightforward to show that with the max separation function, as long as queries are simply uncorrupted copies of the memory images, and the similarity function of a memory and a query has its minimum at the memory (i.e., sim(x, x) < sim(x, y) for any y ≠ x), the max separation function will achieve a theoretically infinite capacity for any fixed size of input query (although, of course, requiring an infinite-dimensional memory matrix M). However, this theoretical capacity is irrelevant in practice where, for corrupted queries, it is the propensity of the similarity function to detect the right match between query and memory that is the main determinant of retrieval quality.

Our framework also makes the straightforward prediction that the retrieval capacities of hetero- and autoassociative memories are identical for powerful separation functions. This is because the key ‘work’ performed by the memory model is in the first two stages of computing the similarity scores and then separating them, while whether the result is a hetero- or autoassociative memory depends entirely on the projection matrix used to project the resulting separated similarity scores. As such, if the separated scores are nearly a one-hot vector at the correct memory index, the correct image will be ‘retrieved’ by the projection matrix regardless of whether it is a hetero- or autoassociated memory. We verify this prediction by studying the retrieval capacities of hetero- vs. autoassociative MCHNs and HNs in Appendix B.

Finally, while the capabilities and performance of these associative memory models may seem remote from state-of-the-art machine learning, recent work has begun to link the MCHN with self-attention in transformers (Ramsauer et al., 2020), which has also more recently been linked to the SDM (Bricken & Pehlevan, 2021). These close links between associative memory models and transformer attention may therefore indicate that improvements in understanding and increasing the effective capacity of such models may also lead to improvements in transformer performance for large-scale machine learning tasks. Perhaps the most interesting avenue lies in testing different similarity functions in transformer models, which, up to now, have almost entirely utilized the dot-product similarity function. This paper, however, suggests that other similarity functions, such as the Euclidean and Manhattan distances, are highly competitive with the dot-product similarity and may lead to comparable or superior results when used in transformer self-attention. Preliminary results (Appendix F) suggest that the Manhattan and Euclidean distance similarity functions are competitive with dot-product attention in small-scale transformer networks, despite transformer architectures being optimized for the dot product, and suggest that investigating transformer performance more thoroughly with different similarity functions may be an important avenue for future work.

Supplementary Material

Appendix

7. Acknowledgements

We would like to thank Trenton Bricken for many interesting discussions on related topics and Mycah Banks for her help in preparing the figures for this manuscript. Beren Millidge and Rafal Bogacz were supported by the BBSRC grant BB/S006338/1, and Rafal Bogacz was also supported by the MRC grant MC_UU_00003/1. Yuhang Song was supported by the China Scholarship Council under the State Scholarship Fund and by a J.P. Morgan AI Research Fellowship. Thomas Lukasiewicz was supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1, the AXA Research Fund, and the EPSRC grant EP/R013667/1.

Footnotes

1

For heteroassociative memory models, a separate projection memory is used containing the outputs associated with each input.

References

  1. Abbott LF, Arian Y. Storage capacity of generalized networks. Physical Review A. 1987;36(10):5091. doi: 10.1103/physreva.36.5091.
  2. Abu-Mostafa Y, Jacques JS. Information capacity of the Hopfield model. IEEE Transactions on Information Theory. 1985;31(4):461–464.
  3. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint. 2014:arXiv:1409.0473.
  4. Baldi P, Venkatesh SS. Number of stable points for spin-glasses and neural networks of higher orders. Physical Review Letters. 1987;58(9):913. doi: 10.1103/PhysRevLett.58.913.
  5. Bonetti L, Brattico E, Carlomagno F, Donati G, Cabral J, Haumann N, Deco G, Vuust P, Kringelbach M. Rapid encoding of musical tones discovered in whole-brain connectivity. NeuroImage. 2021;245:118735. doi: 10.1016/j.neuroimage.2021.118735.
  6. Bricken T, Pehlevan C. Attention approximates sparse distributed memory. arXiv preprint. 2021:arXiv:2111.05498.
  7. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al. Language models are few-shot learners. arXiv preprint. 2020:arXiv:2005.14165.
  8. Caputo B, Niemann H. Storage capacity of kernel associative memories; International Conference on Artificial Neural Networks; Springer; 2002. pp. 51–56.
  9. Chen H, Lee Y, Sun G, Lee H, Maxwell T, Giles CL. High order correlation model for associative memory; AIP Conference Proceedings; 1986. pp. 86–99.
  10. Demircigil M, Heusel J, Löwe M, Upgang S, Vermet F. On a model of associative memory with huge storage capacity. Journal of Statistical Physics. 2017;168(2):288–299.
  11. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint. 2018:arXiv:1810.04805.
  12. Geva M, Schuster R, Berant J, Levy O. Transformer feed-forward layers are key-value memories. arXiv preprint. 2020:arXiv:2012.14913.
  13. Gupta A, Dar G, Goodman S, Ciprut D, Berant J. Memory-efficient transformers via top-k attention. arXiv preprint. 2021:arXiv:2106.06899.
  14. Hinton GE, Anderson JA. Parallel Models of Associative Memory: Updated Edition. Psychology Press; 2014.
  15. Hopfield JJ. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences. 1982;79(8):2554–2558. doi: 10.1073/pnas.79.8.2554.
  16. Hopfield JJ. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences. 1984;81(10):3088–3092. doi: 10.1073/pnas.81.10.3088.
  17. Horn D, Usher M. Capacities of multiconnected memory models. Journal de Physique. 1988;49(3):389–395.
  18. Jaeckel LA. An alternative design for a sparse distributed memory. Research Institute for Advanced Computer Science, NASA Ames Research Center; 1989.
  19. Jayakumar SM, Czarnecki WM, Menick J, Schwarz J, Rae J, Osindero S, Teh YW, Harley T, Pascanu R. Multiplicative interactions and where to find them; 8th International Conference on Learning Representations; 2020.
  20. Jiang Y, Pehlevan C. Associative memory in iterated overparameterized sigmoid autoencoders; International Conference on Machine Learning; 2020. pp. 4828–4838.
  21. Kanerva P. Sparse Distributed Memory. MIT Press; 1988.
  22. Kanerva P. Sparse distributed memory and related models. Vol. 92. NASA Ames Research Center, Research Institute for Advanced Computer Science; 1992.
  23. Kaya M, Bilge HŞ. Deep metric learning: A survey. Symmetry. 2019;11(9):1066.
  24. Keeler JD. Comparison between Kanerva’s SDM and Hopfield-type neural networks. Cognitive Science. 1988;12(3):299–329.
  25. Kirkpatrick S, Sherrington D. Infinite-ranged models of spin-glasses. Physical Review B. 1978;17(11):4384.
  26. Kitaev N, Kaiser Ł, Levskaya A. Reformer: The efficient transformer. arXiv preprint. 2020:arXiv:2001.04451.
  27. Krotov D. Hierarchical associative memory. arXiv preprint. 2021:arXiv:2107.06446.
  28. Krotov D, Hopfield J. Large associative memory problem in neurobiology and machine learning. arXiv preprint. 2020:arXiv:2008.06996.
  29. Krotov D, Hopfield JJ. Dense associative memory for pattern recognition. Advances in Neural Information Processing Systems. 2016;29:1172–1180.
  30. Kulis B, et al. Metric learning: A survey. Foundations and Trends® in Machine Learning. 2013;5(4):287–364.
  31. Little WA. The existence of persistent states in the brain. Mathematical Biosciences. 1974;19(1-2):101–120.
  32. Ma J. The asymptotic memory capacity of the generalized Hopfield network. Neural Networks. 1999;12(9):1207–1212. doi: 10.1016/s0893-6080(99)00042-8.
  33. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I, et al. Language models are unsupervised multitask learners. OpenAI Blog. 2019;1(8):9.
  34. Radhakrishnan A, Yang K, Belkin M, Uhler C. Memorization in overparameterized autoencoders. arXiv preprint. 2018:arXiv:1810.10333.
  35. Radhakrishnan A, Belkin M, Uhler C. Overparameterized neural networks implement associative memory. Proceedings of the National Academy of Sciences. 2020;117(44):27162–27170. doi: 10.1073/pnas.2005013117.
  36. Ramsauer H, Schäfl B, Lehner J, Seidl P, Widrich M, Adler T, Gruber L, Holzleitner M, Pavlović M, Sandve GK, et al. Hopfield networks is all you need. arXiv preprint. 2020:arXiv:2008.02217.
  37. Rolls E. The mechanisms for pattern completion and pattern separation in the hippocampus. Frontiers in Systems Neuroscience. 2013;7:74. doi: 10.3389/fnsys.2013.00074.
  38. Salvatori T, Song Y, Hong Y, Frieder S, Sha L, Xu Z, Bogacz R, Lukasiewicz T. Associative memories via predictive coding. arXiv preprint. 2021:arXiv:2109.08063.
  39. Sukhbaatar S, Grave E, Lample G, Jegou H, Joulin A. Augmenting self-attention with persistent memory. arXiv preprint. 2019:arXiv:1907.01470.
  40. Tang F, Kopp M. A remark on a paper of Krotov and Hopfield [arXiv:2008.06996]. arXiv preprint. 2021:arXiv:2105.15034.
  41. Tay Y, Dehghani M, Bahri D, Metzler D. Efficient transformers: A survey. arXiv preprint. 2020:arXiv:2009.06732.
  42. Tolstikhin I, Houlsby N, Kolesnikov A, Beyer L, Zhai X, Unterthiner T, Yung J, Steiner AP, Keysers D, Uszkoreit J, et al. MLP-Mixer: An all-MLP architecture for vision; 35th Conference on Neural Information Processing Systems; 2021.
  43. Tsodyks M, Sejnowski T. Associative memory and hippocampal place cells. International Journal of Neural Systems. 1995;6:81–86.
  44. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems. 2017:5998–6008.
  45. Wang S, Li BZ, Khabsa M, Fang H, Ma H. Linformer: Self-attention with linear complexity. arXiv preprint. 2020:arXiv:2006.04768.
  46. Wu Y, Hu J, Wu W, Zhou Y, Du K. Storage capacity of the Hopfield network associative memory; 2012 5th International Conference on Intelligent Computation Technology and Automation; 2012. pp. 330–336.
  47. Yang L. An overview of distance metric learning; Proceedings of the Computer Vision and Pattern Recognition Conference; 2007.
  48. Yuille AL, Rangarajan A. The concave-convex procedure. Neural Computation. 2003;15(4):915–936. doi: 10.1162/08997660360581958.
