Author manuscript; available in PMC: 2023 Apr 21.
Published in final edited form as: Adv Neural Inf Process Syst. 2022 Nov;35:38232–38244.

Learning on Arbitrary Graph Topologies via Predictive Coding

Tommaso Salvatori 1,#, Luca Pinchetti 1,#, Beren Millidge 2, Yuhang Song 1,2, Tianyi Bao 1, Rafal Bogacz 2, Thomas Lukasiewicz 3,1
PMCID: PMC7614467  EMSID: EMS174069  PMID: 37090087

Abstract

Training with backpropagation (BP) in standard deep learning consists of two main steps: a forward pass that maps a data point to its prediction, and a backward pass that propagates the error of this prediction back through the network. This process is highly effective when the goal is to minimize a specific objective function. However, it does not allow training on networks with cyclic or backward connections. This is an obstacle to reaching brain-like capabilities, as the highly complex heterarchical structure of the neural connections in the neocortex is potentially fundamental for its effectiveness. In this paper, we show how predictive coding (PC), a theory of information processing in the cortex, can be used to perform inference and learning on arbitrary graph topologies. We experimentally show how this formulation, called PC graphs, can be used to flexibly perform different tasks with the same network by simply stimulating specific neurons. This enables the model to be queried on stimuli with different structures, such as partial images, images with labels, or images without labels. We conclude by investigating how the topology of the graph influences the final performance, and comparing against simple baselines trained with BP.

1. Introduction

Classical deep learning has achieved remarkable results by training deep neural networks to minimize an objective function. Here, every weight parameter gets updated to minimize this function using reverse differentiation [1, 2]. In the brain, however, every synaptic connection is independently updated to correct the behaviour of its post-synaptic neuron [3] using local information, and it is unknown whether this process minimizes a global objective function. The brain maintains an internal model of the world, which constantly generates predictions of external stimuli. When the predictions differ from reality, the brain immediately corrects this error (the difference between reality and prediction) by updating the strengths of the synaptic connections [4–7]. This theory of information processing, called predictive coding (PC), is highly influential, despite the experimental evidence in the cortex being mixed [8–11], and it is at the centre of a large amount of research in computational neuroscience [12–16].

From the machine learning perspective, PC has promising properties: it achieves excellent results in classification [17–19] and memorization [20, 21], and is able to process information in both a bottom-up and a top-down direction. This last property is fundamental for the functioning of different brain areas, such as the hippocampus [22, 20]. PC also shares the generalization capabilities of standard deep learning, as it is able to approximate backpropagation (BP) on any neural structure [23], and a variation of PC is able to exactly replicate the weight update of BP on any computational graph [24, 25]. Moreover, PC only uses local information to update synapses, allowing the network to be fully parallelized and to be trained on networks with any topology.

Training on networks of any structure is not possible in standard deep learning, where information only flows in one direction via the feedforward pass, and BP is then performed in sequential steps backwards. If a cycle is present inside the computational graph of an artificial neural network (ANN), BP becomes stuck in an infinite loop. More generally, the computational graph of any function F : ℝ^d ⟶ ℝ^k is a poset, and hence acyclic. While the problem of training on some specific cyclic structures has been partially addressed using BP through time [27] on sequential data, the restriction to hierarchical architectures may present a limitation to reaching brain-like intelligence, since the human brain has an extremely complex and entangled neural structure that is heterarchically organized with small-world connections [26], a topology that is likely highly optimized by evolution. This shape of structural brain networks, shown in Fig. 1, generates unique communication dynamics that are fundamental for information processing in the brain, as different aspects of network topology imply different communication mechanisms, and hence support different tasks [26]. The heterarchical topology of brain networks has motivated research that aims to develop learning methods on graphs of any topology. A popular example is the assembly calculus [28, 29], a Hebbian learning method that can perform different operations implicated in cognitive phenomena.

Figure 1. Difference in topology between an artificial neural network (left), and a sketch of a network of structural connections that link distinct neural elements in a brain (right) [26].

In this work, we address this problem by proposing PC graphs, a structure that allows training on any directed graph using the original (error-driven) framework of Rao and Ballard [7]. We then demonstrate the flexibility of such networks by testing the same network on different tasks, which can be interpreted as conditional expectations over different neurons of the network. Our PC graphs framework enables the model to be queried on stimuli with different structures, such as partial images, images with labels, or images without labels. This is significantly more flexible than the strict input-output structure of standard ANNs, which are limited to scenarios in which data and labels are always presented in the same format.

Note that the main goal of this work is not to propose a specific architecture that achieves state-of-the-art (SOTA) results on a particular task, but to present PC graphs as a new flexible and biologically plausible model, which can achieve good results on many tasks simultaneously. In this work, we study the simultaneous generation, classification, and associative memory capabilities of PC graphs, highlighting their flexibility and theoretical advantages over standard baselines. Our contributions are briefly summarized as follows:

  • We present PC graphs, which generalize PC to arbitrary graph topologies, and show how a single model can be queried in multiple ways to solve different tasks by simply altering the values of specific nodes, without the need for retraining when switching between tasks. Particularly, we define two different techniques, which we call query by conditioning and query by initialization.

  • We then experimentally show this in the most general case, i.e., for fully connected PC graphs. Here, we train different models on MNIST and FashionMNIST, and show how the two queries can be used to perform different generation tasks. Then, we test the model on classification tasks, and explore its capabilities as an associative memory model.

  • We next investigate how different graph topologies influence the performance of PC graphs on generative tasks, reproducing common network architectures such as feedforward, recurrent, and residual networks as special cases of PC graphs. Finally, we also show how PC graphs can be used to derive the popular assembly calculus [28].

2. PC Graphs

Let G = (V, E) be a directed graph, where V is a set of n vertices {1, 2, …, n}, and E ⊆ V × V is a set of directed edges between them, where every edge (i, j) ∈ E has a weight parameter θ_{i,j}. The set of vertices V is partitioned into two subsets, the sensory and the internal vertices. External stimuli are always presented to the network via the sensory vertices, which we consider to be the first d vertices of the graph, with d < n. The internal vertices, on the other hand, are used to represent the internal structure of the dataset. Each vertex i encodes several quantities. The main one is its activity, which changes over time and which we refer to as a value node x_{i,t}. We call the value nodes of the sensory vertices sensory nodes. Additionally, each vertex computes a prediction μ_{i,t} of its activity based on its input from the value nodes of other vertices:

\mu_{i,t} = \sum_{j} \theta_{j,i} f(x_{j,t}),     (1)

where the summation is over all the vertices j that have an outgoing edge to i, and f is a nonlinearity. Equivalently, it is possible to take the summation over every j, and set θ_{i,j} = 0 if (i, j) ∉ E. The error of every vertex at every time step t is then given by the difference between its value node and its prediction, i.e., ε_{i,t} = x_{i,t} − μ_{i,t}. This local definition of error, which lies not only in the output, but in every vertex of the network, is what allows PC graphs to learn using only local information. The value nodes x_{i,t} and the weight parameters θ_{i,j} are updated to minimize the following energy function, defined locally on every vertex:

E_t = \frac{1}{2} \sum_{i} (\varepsilon_{i,t})^2.     (2)

A fully connected PC graph with 3 vertices is sketched in Fig. 2a, along with the operations that describe the dynamics of the information flow, showing also how every operation can be represented via inhibitory and excitatory connections.
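For concreteness, the following is a minimal NumPy sketch of the quantities in Eqs. (1) and (2) for a fully connected PC graph. The toy sizes, the variable names, and the choice of tanh as the nonlinearity f are illustrative assumptions, not taken from the paper or its released code.

```python
# Minimal sketch (assumed conventions) of Eqs. (1)-(2) for a fully connected
# PC graph with n vertices: theta[i, j] is the weight on the edge (i, j).
import numpy as np

n = 5                                        # toy number of vertices
rng = np.random.default_rng(0)
theta = 0.1 * rng.standard_normal((n, n))    # weight parameters theta_{i,j}
x = rng.standard_normal(n)                   # value nodes x_{i,t}
f = np.tanh                                  # nonlinearity (assumption)

mu = theta.T @ f(x)                          # Eq. (1): mu_i = sum_j theta_{j,i} f(x_j)
eps = x - mu                                 # errors: eps_i = x_i - mu_i
energy = 0.5 * np.sum(eps ** 2)              # Eq. (2): E_t = 1/2 sum_i eps_i^2
```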

Figure 2. (a) An example of a fully connected PC graph with three vertices; the zoomed-in detail shows the neural implementation of PC, where learning is made local via the depicted inhibitory and excitatory connections. (b) A sketch of the training process, where the value nodes of the sensory vertices are fixed to the pixels of the image. (c) A sketch of query by conditioning, where a fraction of the value nodes is fixed to the top half of an image, and the bottom half is recovered via inference.

Learning

When presented with a training point s¯ taken from a training set, the value nodes of the sensory vertices are fixed to be equal to the entries of s¯ for the whole duration of the training process, i.e., for every t. A sketch of this is shown in Fig. 2b. Then, the total energy of Eq. (2) is minimized in two phases: inference and weight update. During the inference phase, the weights are fixed, and the value nodes are continuously updated via gradient descent for T iterations, where T is a hyperparameter of the model. The update rule is the following (inference):

\Delta x_{i,t} = -\gamma \, \partial E_t / \partial x_{i,t} = \gamma \left( -\varepsilon_{i,t} + f'(x_{i,t}) \sum_{k=1}^{n} \varepsilon_{k,t} \theta_{i,k} \right),     (3)

where γ is the learning rate of the value nodes. This process of iteratively updating the value nodes distributes the output error throughout the PC graph. When the inference phase is completed, the value nodes get fixed, and a single weight update is performed as follows (weight update):

\Delta \theta_{i,j} = -\alpha \, \partial E_T / \partial \theta_{i,j} = \alpha \, \varepsilon_{i,T} f(x_{j,T}),     (4)

where α is the learning rate of the weight update. We now describe two different ways to query the internal representation of a trained model, where the values of some sensory vertices are undefined, and have to be predicted. In both cases, the weight parameters θi,j are now fixed, and the total energy E is continuously minimized using gradient descent on the re-initialized value nodes via Eq. (3).
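Continuing the sketch above, one training step alternates T inference updates of the value nodes (Eq. (3), with the sensory nodes clamped to the data point) with a single weight update obtained as gradient descent on the energy (cf. Eq. (4), written here under the θ-index convention of Eq. (1)). The hyperparameter values and function names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of one PC training step on a single data point s (length d).
import numpy as np

n, d = 800, 784                              # toy sizes: total and sensory vertices
rng = np.random.default_rng(0)
theta = 0.01 * rng.standard_normal((n, n))
f = np.tanh

def f_prime(z):
    return 1.0 - np.tanh(z) ** 2             # derivative of tanh

def train_step(theta, s, T=32, gamma=0.1, alpha=1e-4):
    x = rng.standard_normal(n)
    x[:d] = s                                # fix sensory value nodes to the stimulus
    for _ in range(T):                       # inference phase: weights fixed, Eq. (3)
        eps = x - theta.T @ f(x)             # eps_i = x_i - mu_i
        dx = gamma * (-eps + f_prime(x) * (theta @ eps))
        dx[:d] = 0.0                         # sensory nodes stay clamped
        x = x + dx
    eps = x - theta.T @ f(x)                 # errors at t = T
    theta = theta + alpha * np.outer(f(x), eps)  # -alpha * dE/d(theta), cf. Eq. (4)
    return theta, x

# Usage (illustrative): theta, x = train_step(theta, s=image.flatten())
```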

Query by conditioning

Every value node is randomly re-initialized, except that the value nodes of specific vertices are fixed to some desired value, and hence not allowed to change during the energy minimization process. The unconstrained vertices will then converge to the minimum of the energy given the fixed vertices, thus computing the conditional expectation of the latent vertices given the observed stimulus. Formally, let I = {i_1, …, i_q} ⊂ {1, 2, …, n} be a strict subset of vertices, and assume that we know that the value nodes corresponding to the vertices in I are equal to a stimulus q̄ ∈ ℝ^q. Then, running inference until convergence allows us to estimate the conditional expectation

\mathbb{E}\big( \bar{x}_T \mid \forall t : (x_{i_1,t}, \ldots, x_{i_q,t}) = \bar{q} \big),     (5)

where x̄_T is the vector of value nodes at convergence. Examples of tasks performed this way are (i) classification, where the sensory nodes encoding the pixels are fixed to a test image, and the nodes encoding the label converge to a prediction of its 1-hot vector, (ii) generation, where the value nodes encoding the class information are fixed to a 1-hot label, and the value nodes of the remaining sensory vertices converge to an image of that class, and (iii) reconstruction, such as image completion, where a fraction of the sensory nodes is fixed to the available pixels of an image, and the remaining ones converge to a reasonable completion of it. A sketch of this process is shown in Fig. 2c.

Query by initialization

Again, every value node is randomly initialized, but the value nodes of specific vertices are initialized to some desired value at t = 0 only, and not fixed for all t. This differs from the previous query, as here every value node is unconstrained, and hence free to change during inference. The sensory vertices will then converge to the minimum found by gradient descent when provided with that specific initialization. Again, let I = {i_1, …, i_q} ⊂ {1, 2, …, n} be a strict subset of vertices, and assume that we have an initial stimulus q̄ ∈ ℝ^q. Then, we can estimate the conditional expectation

\mathbb{E}\big( \bar{x}_T \mid (x_{i_1,0}, \ldots, x_{i_q,0}) = \bar{q} \big).     (6)

Examples of tasks performed this way are (i) denoising, such as image denoising, where the sensory neurons are initialized with a noisy version of an image, which is cleaned up during the energy minimization process, and (ii) reconstruction, such as image completion, where the fraction of missing pixels is not known a priori.
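Both query modes can be sketched with a single routine that differs only in whether the provided values are clamped for all t (query by conditioning) or only set at t = 0 (query by initialization). The function below is a hedged illustration under the same toy conventions as the earlier sketches, not the paper's code; all names are assumptions.

```python
# Sketch of querying a trained PC graph with weights theta.
import numpy as np

def query(theta, fixed=None, init=None, T=200, gamma=0.1, seed=0):
    """fixed: {vertex: value} clamped for all t (conditioning, Eq. (5)).
       init : {vertex: value} set at t = 0 only (initialization, Eq. (6))."""
    n = theta.shape[0]
    rng = np.random.default_rng(seed)
    f = np.tanh
    x = rng.standard_normal(n)                       # random re-initialization
    for i, v in (init or {}).items():
        x[i] = v                                     # initialized, but free to change
    for i, v in (fixed or {}).items():
        x[i] = v                                     # conditioned: clamped below
    for _ in range(T):
        eps = x - theta.T @ f(x)
        dx = gamma * (-eps + (1.0 - np.tanh(x) ** 2) * (theta @ eps))
        for i in (fixed or {}):
            dx[i] = 0.0                              # fixed nodes do not move
        x = x + dx
    return x

# Example (illustrative): complete an image from its top half (pixels 0..391),
# as in Fig. 2c:  x_hat = query(theta, fixed={i: top_half[i] for i in range(392)})
```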

3. Proof-of-concept: Experiments on Fully Connected PC Graphs

In this section, we perform experiments on a fully connected PC graph G = (V, E), i.e., where E = V × V. Such PC graphs are fully general and encode no implicit priors on the structure of the dataset. Any graph topology can then be obtained by simply pruning specific weights of G.

Given a dataset of m data points D = {s̄_i}_{i<m}, with s̄_i ∈ ℝ^d, we train the PC graph as described in Section 2: the first d neurons are fixed to the entries of a training point, and the energy function E_t is minimized via inference and weight updates, i.e., via Eqs. (3) and (4). When training is complete, we show the different tasks that can be performed without retraining the model. We use MNIST and FashionMNIST [30], fixing the first d nodes to the data point, and show how to perform generation, denoising, reconstruction (with and without labels), and classification by querying the PC graph as described in Section 2.

Setup

For every dataset, we trained 3 models: one for generation and classification tasks, one for denoising and reconstruction, and one for associative memories. The first two models consist of a fully connected graph with 2000 vertices, with 794 sensory vertices for classification and generation tasks (784 pixels plus a 1-hot vector for the 10 labels), and 784 sensory vertices for reconstruction and denoising. Further details about the other hyperparameters are given in the supplementary material.

Generation

To check the generation capabilities of a trained PC graph, we queried the model by conditioning on the labels: the value nodes dedicated to the 10 labels were fixed to the 1-hot encoding of each label, and the energy of the model (Eq. (2)) was minimized using Eq. (3) until convergence. The generated images are then given by the value nodes of the unconstrained sensory nodes, which were fixed to the pixels of the images during training. An example of the images generated for each label is given in Fig. 3a.

Figure 3. (a) Generation experiments using the first 7 classes of the MNIST and FashionMNIST datasets, given the labels {0, 1, 2, 3, 4, 5, 6} and {t-shirt, trouser, pullover, dress, coat, sandal, shirt}, respectively; (b) reconstruction of incomplete images using query by conditioning, when only the top half is available; (c) reconstruction of corrupted images using query by initialization; (d) reconstruction of incomplete images using query by conditioning when also providing the correct label of the test image; and (e) associative memory experiments when the network is presented with half of a training image (left) or a corrupted version of it (right) that it has already seen and memorized; from top to bottom row: image provided to the network, retrieved image, and original image.

Reconstruction

We provide the PC graph with half of a test image, and ask it to reconstruct the second half. This can be done using both queries: when querying by conditioning, the corresponding sensory nodes are fixed to half of the pixels of a test image; when querying by initialization, the value nodes are simply initialized to the same values. At convergence, we consider the value nodes of the unconstrained nodes, which should reconstruct the missing part of the image based on the information learned during training. The results are given in Fig. 3b. We also replicated the same experiment using a network trained with the labels, providing the label during the reconstruction. This computes the conditional expectation of the missing pixels given the available ones and the label. The results in this case are visibly better, and are given in Fig. 3d.

Denoising

We provide the PC graph with a corrupted image, obtained by adding zero-mean Gaussian noise with variance 0.5. This is done by querying by initialization: before running inference, the value nodes of the sensory nodes are initialized to be equal to the pixels of the corrupted image. At convergence, we consider the value nodes of the unconstrained nodes, which should reconstruct the original image. The results are given in Fig. 3c.

Results

As stated above, we picked a fully connected PC graph due to its generality, and not to obtain the best performance. However, the results show that this framework is able to learn an internal representation of a dataset, and that it can be queried to solve multiple tasks with reasonable accuracy. In generation tasks, the PC graph was in fact always able to generate the correct digit, and almost always able to generate the correct clothing item, and it was always able to provide a noisy but reasonable reconstruction of incomplete test points. The same holds for the denoising experiments, where a cleaner (plausible) image was always produced. In Section 4, we show how to improve all of these results by using different PC graph topologies.

Classification

We consider the same PC graph trained for the generation experiments. To check its generalization capabilities, we query by conditioning: the first 784 sensory nodes are fixed to the pixels of every test image, and inference is run to reconstruct the 1-hot label vector. We do not expect to obtain results directly comparable with standard multilayer perceptrons, for two reasons: first, the model does not contain any implicit hierarchy, which empirically appears crucial for obtaining good classification results; second, the PC graph is simultaneously learning to generate the pixels, which are far more numerous than the labels. However, to check whether the obtained results were acceptable, we tested against different learning algorithms that train on similar or equivalent fully connected architectures, such as Hopfield networks, unconstrained Boltzmann machines, and a local variation of BP introduced in the late 1980s, called Almeida–Pineda after the two scientists who independently invented it [31, 32]. For Hopfield networks, we used the implementation provided in [33]. The results, given in Table 1, show that our model outperforms every other learning algorithm that can be trained on fully connected architectures. Despite this, the results also show that the obtained test accuracy is not comparable to that of multilayer perceptrons, being only slightly better than a linear classifier (which obtains 88% accuracy on MNIST). However, this is not due to the learning rule of PC, which is well known to reach competitive performance when provided with a hierarchical multilayer structure [17]. For the SVHN [34] experiment, we used models with 5000 vertices.

Table 1. Test accuracy of different models on MNIST, FashionMNIST, and SVHN.

Dataset         Ours              Hopfield Nets     Boltzmann Machine   Almeida–Pineda
MNIST           91.76 ± 0.02 %    65.23 ± 2.21 %    79.23 ± 0.15 %      76.36 ± 0.14 %
FashionMNIST    83.72 ± 0.33 %    51.74 ± 3.94 %    61.31 ± 0.17 %      69.63 ± 1.64 %
SVHN            84.51 ± 0.11 %    48.92 ± 3.11 %    55.74 ± 1.23 %      59.14 ± 2.64 %

Associative memory

We now test whether PC graphs are able to memorize training images and retrieve them when given a corrupted or incomplete version. In particular, we show that a fully connected PC graph is able to store complex data points, such as colored images, and retrieve them by running inference. To do that, we trained a new fully connected PC graph on 100 data points from each of the MNIST, FashionMNIST, CIFAR10, and SVHN datasets. We used a model with 1000 vertices for MNIST and FashionMNIST, and 3500 for SVHN and CIFAR10, and asked it to retrieve the original memories when presented with either only half of the original pixels, or a version corrupted with Gaussian noise of variance 0.2. This task is similar to image reconstruction and denoising, with the non-trivial difference that here we only use already seen data points, and hence no generalization is involved. The results of these experiments are given in Fig. 3e, and show that our method is able to successfully store and retrieve data points via energy minimization. More details about the capacity of fully connected PC graphs are given in the supplementary material.

4. Extension to Different PC Graph Topologies

As is well known in deep learning, the performance of a trained model strongly depends on its architecture: the number of vertices, the number of layers, and their intrinsic structure. In Section 3, we studied the general architecture of fully connected PC graphs. Here, we show how to reduce a fully connected PC graph to lighter and even more powerful PC graphs. In particular, we show how to generate different neural architectures by simply pruning specific edges of a fully connected PC graph G = (V, E). The pruning is performed by applying a sparse mask M, although there are multiple equivalent ways of implementing it. Consider the weight matrix θ̄ ∈ ℝ^{n×n}, where every entry θ_{i,j} represents the weight parameter connecting vertex i to vertex j. To generate a neural architecture that consists of a subset of the original connections, it suffices to mask the matrix θ̄ via entry-wise multiplication with a binary matrix M, where M_{i,j} = 1 if the edge (i, j) exists in E, and M_{i,j} = 0 otherwise. This allows the creation of hierarchical discriminative architectures, such as a PC equivalent of the multilayer perceptron (MLP) in Fig. 4a, or of hierarchical generative networks, as in Fig. 4b, c. More generally, it creates a framework to generate and study architectures with any topology, such as small-world networks inspired by brain regions [36], as shown in Fig. 4d. Which topology should be used depends on the task and dataset, and it is hence hard to propose a general theory (as it is with BP). In what follows, however, we provide multiple examples; a sketch of the masking procedure is given below.
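As a concrete illustration of the masking procedure, the following sketch builds the binary mask M for a feedforward multilayer topology (cf. Fig. 4a) and applies it entry-wise to the weight matrix. The layer sizes and the helper name mlp_mask are assumptions chosen for illustration, not part of the paper's code.

```python
# Sketch: turning a fully connected PC graph into a hierarchical one via a mask M,
# so that theta_{i,j} = 0 wherever M_{i,j} = 0.
import numpy as np

def mlp_mask(layer_sizes):
    """Mask for a feedforward topology: layer l only sends edges to layer l+1."""
    n = sum(layer_sizes)
    M = np.zeros((n, n))
    start = 0
    for size, next_size in zip(layer_sizes[:-1], layer_sizes[1:]):
        M[start:start + size, start + size:start + size + next_size] = 1.0
        start += size
    return M

rng = np.random.default_rng(0)
layer_sizes = [784, 256, 64, 10]             # illustrative hierarchy
n = sum(layer_sizes)
theta = 0.01 * rng.standard_normal((n, n))
theta = theta * mlp_mask(layer_sizes)        # prune edges outside the hierarchy
```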

Figure 4. Examples of PC graphs that can be built by masking a part of the weights of a fully connected PC graph. (a) Masking required to build a standard multilayer architecture, such as the one in [17]. (b) Masking required to build a multilayer architecture where the weights go in the opposite direction; here, the sensory nodes are at the end of the hierarchical structure, and this model is equivalent to the generative networks in [20]. (c) Examples of masking needed to implement popular architectures with lateral connections, similar to the model in [35]. (d) The model in [28], which consists of a set of Erdős–Rényi graphs that simulate brain regions (dark squares on the diagonal) and connections between them (dark squares off the diagonal).

Experiments

Here, we study how the network topology influences the final performance, performing the same experiments as on the fully connected PC graph. We expect the generated images to be visibly better due to the enforced hierarchical structure of the PC graph.

Setup

We trained generative PC graphs, recurrent generative PC graphs, assembly-of-neurons PC graphs, and standard BP-trained autoencoders with different numbers of hidden layers and hidden dimensions, and report the best results. For the generation results, we used the same setup, but added an input layer with 10 vertices, whose value nodes during training were initialized with the 1-hot label vector. We performed a search across the learning rates γ and α, and the number of iterations per batch T. More details are given in the supplementary material, along with a longer discussion on how different parameters influence the final performance of the architecture.

Results

The results are given in Fig. 5a and b. As expected, the hierarchical structure of the considered PC graphs improves over the fully connected PC graph, despite having a comparable number of parameters. Compared against autoencoders (Fig. 5d), the standard ANN baseline trained with BP, the PC graph results are similar in image denoising, and better in image reconstruction. FID scores on denoising tasks for different levels of noise are given in Fig. 7.

Figure 5. Query by initialization (top) and query by conditioning (bottom) on three different PC graph architectures and different datasets. In particular, we tested these PC graphs against ANN autoencoders trained with BP (d), which perform comparably to the PC graphs on denoising tasks, but less well on image reconstruction.

5. Conditioning on Labels

Assume that we need to reconstruct a test image from an incomplete version of it, with the further assumption that this time we are also provided with the label of the corrupted image. It would be useful to be able to use this extra information to obtain a better reconstruction. In PC graphs, this is straightforward: it suffices to simultaneously fix the value nodes representing the labels to the 1-hot vector of the provided label, and the sensory nodes to the pixels of the corrupted image. This method can be applied when it is difficult to infer to which class an incomplete image belongs, as providing the label during the reconstruction allows the preferred label to influence the reconstruction. Hence, we perform the following task: we provide images of digits that look similar when incomplete, and ask the model to reconstruct the missing half when given the label information, i.e., to use the additional label information to correctly resolve the inherent ambiguity in the reconstruction task.

Experiments

We used the same PC graphs as above for generation tasks. We provided the PC graph with the bottom 33% of random images representing 7s or 9s. Note that it is hard to distinguish between these two digits when only this small portion of the image is available. Then, we generated the missing 67% of the pixels by first giving 7 as a label, and then giving 9. We repeated the same task using 3s and 5s. The results, shown in Fig. 6b, show that our model is able to perform conditional inference, as the reconstructed digits always agree with the provided labels.

Figure 6. Left: generated images given the labels, using feedforward (top) and recurrent (bottom) PC graphs. Right: conditional inference on the labels.

6. Assembly of Neurons

Recently, a model made of assemblies of neurons that are sparsely connected with each other has been proposed to emulate brain regions [28]. This model consists of m ordered clusters of neurons (C_1, …, C_m), and any two ordered neurons of the same cluster are connected by a synapse with probability p, forming an Erdős–Rényi graph G_{m,p}. Depending on the desired task, two clusters can be connected via sparse connections following the same rule: if cluster C_a is connected to cluster C_b, then, given a vertex v_i ∈ C_a and a vertex v_j ∈ C_b, there exists a synaptic connection from v_i to v_j with probability p. Note that this structure is highly general, and allows networks such as the one represented in Fig. 1 (right) to be built. Finally, at each time step, only the k neurons of every cluster with the highest neural activity fire. In the original work, the authors propose a Hebbian-like learning algorithm; however, we show that such a model can also be trained using PC graphs. A graphical representation of how to encode a network made of assemblies of neurons as a PC graph is given in Fig. 4d. In this case, each dark block on the diagonal represents connections between neurons of the same region. Unlike the other networks in the same figure, these are sparse matrices where every entry is either zero, or one with probability p. As in the brain, not every region is connected to every other, and whether two regions are directly connected has to be decided a priori when designing the architecture. Again, two neurons of connected regions are directly connected with probability p. In Fig. 4d, dark blocks off the diagonal represent the presence of directed connections between two regions C_a and C_b. If situated below the diagonal, the connections go from C_a to C_b, with a < b; if situated above the diagonal, they go from C_b to C_a.
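A possible way to build such an assembly-of-neurons topology as a PC-graph mask is sketched below: Erdős–Rényi blocks on the diagonal for within-cluster connectivity, sparse off-diagonal blocks for the chosen inter-cluster projections, and a top-k helper reflecting the rule that only the most active neurons fire. The cluster sizes are scaled down and all names are illustrative assumptions, not the authors' code.

```python
# Sketch of the assembly-of-neurons mask of Fig. 4d.
import numpy as np

def assembly_mask(cluster_sizes, inter_edges, p=0.1, seed=0):
    """cluster_sizes: sizes of clusters C_1..C_m.
       inter_edges: (a, b) pairs, meaning cluster a projects to cluster b."""
    rng = np.random.default_rng(seed)
    n = sum(cluster_sizes)
    offsets = np.cumsum([0] + list(cluster_sizes))
    M = np.zeros((n, n))
    # diagonal blocks: Erdos-Renyi connectivity within each cluster, probability p
    for a, size in enumerate(cluster_sizes):
        block = (rng.random((size, size)) < p).astype(float)
        np.fill_diagonal(block, 0.0)                 # no self-connections
        M[offsets[a]:offsets[a + 1], offsets[a]:offsets[a + 1]] = block
    # off-diagonal blocks: sparse directed connections between connected clusters
    for a, b in inter_edges:
        block = (rng.random((cluster_sizes[a], cluster_sizes[b])) < p).astype(float)
        M[offsets[a]:offsets[a + 1], offsets[b]:offsets[b + 1]] = block
    return M

def top_k(x, frac=0.2):
    """Only the fraction `frac` of most active neurons fire; the rest are silenced."""
    k = max(1, int(frac * x.size))
    out = np.zeros_like(x)
    idx = np.argsort(x)[-k:]
    out[idx] = x[idx]
    return out

# Scaled-down version of the feedforward chain of four clusters used below.
M = assembly_mask([300, 300, 300, 300], inter_edges=[(0, 1), (1, 2), (2, 3)], p=0.1)
```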

Experiments

We replicated this structure using 4 clusters with 3000 vertices each, connected in a feedforward way: the first cluster is connected to the second, which is connected to the third, and so on. As sparsity and top-k constants, we used p = 0.1 and k = 0.2, and performed the same generative experiments. The results are given in Fig. 5c. While the results look cleaner than those of the other methods, note that this is specific to MNIST and FashionMNIST, as the top-k activation on the last cluster effectively removes the noise surrounding the reconstructions.

7. Related Work

Our work shares similarities, and its final goal, with a whole field of research that aims to improve current neural networks by using techniques from computational neuroscience. In fact, the biological implausibility and limitations of BP highlighted in [37, 38] have fueled research into finding new learning algorithms to train ANNs, with the most promising candidates being energy-based models such as equilibrium propagation [39, 40]. Other interesting energy-based methods are Boltzmann machines [41–43] and Hopfield networks [44, 45]. These differ from PC, as they do not encode the concept of error, but learn in a purely Hebbian fashion. Furthermore, they have undirected synaptic connections, and make predictions by minimizing the energy of a physical system initialized with a specific input. This is different from PC, which has directed synaptic connections and is tested by fixing specific nodes to an input while letting the latent ones converge. The PC literature ranges from psychology to neuroscience and machine learning. In particular, it offers a single mechanism that accounts for diverse perceptual phenomena observed in the brain, examples of which are endstopping [7], repetition suppression [46], illusory motions [47, 48], bistable perception [49, 50], and even attentional modulation of neural activity [51, 52]; it has also been used to describe the retrieval and storage of memories in the human memory system [22].

Although inspired by neuroscience models of the cortex, the computational model introduced by Rao and Ballard [7] still presents some implausibilities, the main one being the presence of symmetric connections. An implementation of PC with no symmetric connections that is able to successfully learn image classification tasks has been presented in [53], as have the neural generative coding models, which have been used for continual learning, generative models, and reinforcement learning [54, 55].

8. Discussion

In this work, we have shown that PC is able to perform machine learning tasks on graphs of any topology, which we call PC graphs. In particular, we have highlighted two main differences between our framework and standard deep learning: flexibility in structure and in querying. On the one hand, a flexible structure allows learning on any graph topology, hence including both classical deep learning models and small-world networks that resemble sparse brain regions. On the other hand, flexible querying allows the model to be trained and tested on data points that carry different kinds of information: supervised, unsupervised, and incomplete signals. On a much broader level, this work strengthens the connection between the machine learning and the neuroscience communities, as it underlines the importance of PC in both areas, both as a highly plausible algorithm to train brain-inspired architectures, and as an approach to solve corresponding problems in machine intelligence.

The research of this paper (and the current PC literature in general) is also of great importance from another perspective: training modern neural networks with BP has become computationally extremely expensive, making modern technologies inaccessible. Biological neural networks, on the other hand, do not have these drawbacks thanks to their biological hardware. Recent breakthroughs in the development of neuromorphic and analog computing, such as the discovery of the "missing" memristor [56], could allow the training of deep neural models using only a tiny fraction of the energy and time that modern GPUs need. To do this, however, we need to train neural networks end-to-end on the same chip, something that is not possible using BP (or BP through time), due to the need for a control signal that passes information between different layers. The energy formulation of neuroscience-inspired models allows this limitation to be overcome, making them perfect candidates for training deep neural models end-to-end on the same chip [57]. This strongly motivates research in PC and other neuroscience-inspired algorithms, with a potentially huge long-term impact.

Supplementary Material

Supplemental Materials

Figure 7. FID score on MNIST for images corrupted with Gaussian noise of different variance.

Acknowledgments

This work was supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1, by the AXA Research Fund, the EPSRC grant EP/R013667/1, the MRC grant MC_UU_00003/1, the BBSRC grant BB/S006338/1, and by the EU TAILOR grant. We also acknowledge the use of the EPSRC-funded Tier 2 facility JADE (EP/P020275/1) and GPU computing support by Scan Computers International Ltd. Yuhang Song was supported by the China Scholarship Council under the State Scholarship Fund and by a J.P. Morgan AI Research Fellowship.

Contributor Information

Tommaso Salvatori, Email: tommaso.salvatori@cs.ox.ac.uk.

Luca Pinchetti, Email: luca.pinchetti@cs.ox.ac.uk.

Beren Millidge, Email: beren.millidge@ndcn.ox.ac.uk.

Yuhang Song, Email: yuhang.song@some.ox.ac.uk.

Tianyi Bao, Email: tianyi.bao@cs.ox.ac.uk.

Rafal Bogacz, Email: rafal.bogacz@ndcn.ox.ac.uk.

Thomas Lukasiewicz, Email: thomas.lukasiewicz@cs.ox.ac.uk.

References

  • [1]. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–536.
  • [2]. Linnainmaa S. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki; 1970. pp. 6–7.
  • [3]. Hebb D. The Organization of Behavior. Wiley; New York: 1949.
  • [4]. Srinivasan MV, Laughlin SB, Dubs A. Predictive coding: A fresh view of inhibition in the retina. Proceedings of the Royal Society of London Series B Biological Sciences. 1982;216(1205):427–459. doi: 10.1098/rspb.1982.0085.
  • [5]. Mumford D. On the computational architecture of the neocortex. Biological Cybernetics. 1992;66(3):241–251. doi: 10.1007/BF00198477.
  • [6]. Friston K. Learning and inference in the brain. Neural Networks. 2003;16(9):1325–1352. doi: 10.1016/j.neunet.2003.06.005.
  • [7]. Rao RP, Ballard DH. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience. 1999;2(1):79–87. doi: 10.1038/4580.
  • [8]. Walsh KS, McGovern DP, Clark A, O'Connell RG. Evaluating the neurophysiological evidence for predictive processing as a model of perception. Annals of the New York Academy of Sciences. 2020;1464(1):242. doi: 10.1111/nyas.14321.
  • [9]. Kell AJ, Yamins DL, Shook EN, Norman-Haignere SV, McDermott JH. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron. 2018;98. doi: 10.1016/j.neuron.2018.03.044.
  • [10]. Millidge B, Seth A, Buckley CL. Predictive coding: A theoretical and experimental review. arXiv:2107.12979. 2021.
  • [11]. Bastos AM, Usrey WM, Adams RA, Mangun GR, Fries P, Friston KJ. Canonical microcircuits for predictive coding. Neuron. 2012;76(4):695–711. doi: 10.1016/j.neuron.2012.10.038.
  • [12]. Friston KJ, Parr T, de Vries B. The graphical brain: Belief propagation and active inference. Network Neuroscience. 2017;1(4):381–414. doi: 10.1162/NETN_a_00018.
  • [13]. Friston K. A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences. 2005;360(1456). doi: 10.1098/rstb.2005.1622.
  • [14]. Spratling MW. A review of predictive coding algorithms. Brain and Cognition. 2017;112:92–97. doi: 10.1016/j.bandc.2015.11.003.
  • [15]. Huang Y, Rao RP. Predictive coding. Wiley Interdisciplinary Reviews: Cognitive Science. 2011;2(5):580–593. doi: 10.1002/wcs.142.
  • [16]. Friston KJ, Trujillo-Barreto N, Daunizeau J. DEM: A variational treatment of dynamic systems. Neuroimage. 2008;41(3):849–885. doi: 10.1016/j.neuroimage.2008.02.054.
  • [17]. Whittington JC, Bogacz R. An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation. 2017;29(5). doi: 10.1162/NECO_a_00949.
  • [18]. Byiringiro B, Salvatori T, Lukasiewicz T. Robust graph representation learning via predictive coding. arXiv:2212.04656. 2022.
  • [19]. Salvatori T, Song Y, Millidge B, Xu Z, Sha L, Emde C, Bogacz R, Lukasiewicz T. Incremental predictive coding: A parallel and fully automatic learning algorithm. arXiv:2212.00720. 2022.
  • [20]. Salvatori T, Song Y, Hong Y, Sha L, Frieder S, Xu Z, Bogacz R, Lukasiewicz T. Associative memories via predictive coding. Advances in Neural Information Processing Systems. 2021;34.
  • [21]. Tang M, Salvatori T, Millidge B, Song Y, Lukasiewicz T, Bogacz R. Recurrent predictive coding models for associative memory employing covariance learning. bioRxiv. 2022. doi: 10.1371/journal.pcbi.1010719.
  • [22]. Barron HC, Auksztulewicz R, Friston K. Prediction and memory: A predictive coding account. Progress in Neurobiology. 2020;192:101821. doi: 10.1016/j.pneurobio.2020.101821.
  • [23]. Millidge B, Tschantz A, Buckley CL. Predictive coding approximates backprop along arbitrary computation graphs. arXiv:2006.04182. 2020. doi: 10.1162/neco_a_01497.
  • [24]. Song Y, Lukasiewicz T, Xu Z, Bogacz R. Can the brain do backpropagation? — Exact implementation of backpropagation in predictive coding networks. Advances in Neural Information Processing Systems. 2020;33.
  • [25]. Salvatori T, Song Y, Lukasiewicz T, Bogacz R, Xu Z. Reverse differentiation via predictive coding. Proc. AAAI; 2022.
  • [26]. Avena-Koenigsberger A, Misic B, Sporns O. Communication dynamics in complex brain networks. Nature Reviews Neuroscience. 2018;19(1):17–33. doi: 10.1038/nrn.2017.149.
  • [27]. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9(8). doi: 10.1162/neco.1997.9.8.1735.
  • [28]. Papadimitriou CH, Vempala SS, Mitropolsky D, Collins M, Maass W. Brain computation by assemblies of neurons. Proceedings of the National Academy of Sciences. 2020. doi: 10.1073/pnas.2001893117.
  • [29]. Dabagia M, Papadimitriou CH, Vempala SS. Assemblies of neurons can learn to classify well-separated distributions. arXiv:2110.03171. 2021.
  • [30]. Xiao H, Rasul K, Vollgraf R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747. 2017.
  • [31]. Almeida LB. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. 1990:102–111.
  • [32]. Pineda FJ. Generalization of back-propagation to recurrent neural networks. Physical Review Letters. 1987;59. doi: 10.1103/PhysRevLett.59.2229.
  • [33]. Belyaev MA, Velichko AA. Classification of handwritten digits using the Hopfield network. IOP Conference Series: Materials Science and Engineering; 2020.
  • [34]. Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY. Reading digits in natural images with unsupervised feature learning. 2011.
  • [35]. Ororbia A, Kifer D. The neural coding framework for learning generative models. arXiv:2012.03405. 2020. doi: 10.1038/s41467-022-29632-7.
  • [36]. Telesford QK, Joyce KE, Hayasaka S, Burdette JH, Laurienti PJ. The ubiquity of small-world networks. Brain Connectivity. 2011;1(5):367–375. doi: 10.1089/brain.2011.0038.
  • [37]. Lillicrap T, Santoro A, Marris L, Akerman C, Hinton G. Backpropagation and the brain. Nature Reviews Neuroscience. 2020;21. doi: 10.1038/s41583-020-0277-3.
  • [38]. Whittington JC, Bogacz R. Theories of error back-propagation in the brain. Trends in Cognitive Sciences. 2019. doi: 10.1016/j.tics.2018.12.005.
  • [39]. Scellier B, Bengio Y. Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience. 2017;11:24. doi: 10.3389/fncom.2017.00024.
  • [40]. Scellier B, Goyal A, Binas J, Mesnard T, Bengio Y. Generalization of equilibrium propagation to vector field dynamics. arXiv:1808.04873. 2018.
  • [41]. Salakhutdinov R, Mnih A, Hinton G. Restricted Boltzmann machines for collaborative filtering. Proceedings of the 24th International Conference on Machine Learning; 2007.
  • [42]. Salakhutdinov R, Hinton G. Deep Boltzmann machines. Artificial Intelligence and Statistics. PMLR; 2009. pp. 448–455.
  • [43]. Hinton GE. Deep belief networks. Scholarpedia. 2009;4(5):5947.
  • [44]. Hopfield JJ. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences. 1982;79. doi: 10.1073/pnas.79.8.2554.
  • [45]. Hopfield JJ. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences. 1984;81. doi: 10.1073/pnas.81.10.3088.
  • [46]. Auksztulewicz R, Friston K. Repetition suppression and its contextual determinants in predictive coding. Cortex. 2016;80. doi: 10.1016/j.cortex.2015.11.024.
  • [47]. Lotter W, Kreiman G, Cox D. Deep predictive coding networks for video prediction and unsupervised learning. arXiv:1605.08104. 2016.
  • [48]. Watanabe E, Kitaoka A, Sakamoto K, Yasugi M, Tanaka K. Illusory motion reproduced by deep neural networks trained for prediction. Frontiers in Psychology. 2018;9:345. doi: 10.3389/fpsyg.2018.00345.
  • [49]. Hohwy J, Roepstorff A, Friston K. Predictive coding explains binocular rivalry: An epistemological review. Cognition. 2008;108(3). doi: 10.1016/j.cognition.2008.05.010.
  • [50]. Weilnhammer V, Stuke H, Hesselmann G, Sterzer P, Schmack K. A predictive coding account of bistable perception – a model-based fMRI study. PLoS Computational Biology. 2017;13(5). doi: 10.1371/journal.pcbi.1005536.
  • [51]. Feldman H, Friston K. Attention, uncertainty, and free-energy. Frontiers in Human Neuroscience. 2010;4. doi: 10.3389/fnhum.2010.00215.
  • [52]. Kanai R, Komura Y, Shipp S, Friston K. Cerebral hierarchies: Predictive processing, precision and the pulvinar. Philosophical Transactions of the Royal Society B: Biological Sciences. 2015;370. doi: 10.1098/rstb.2014.0169.
  • [53]. Millidge B, Tschantz A, Seth A, Buckley CL. Relaxing the constraints on predictive coding models. arXiv:2010.01047. 2020.
  • [54]. Ororbia AG, Mali A. Biologically motivated algorithms for propagating local target representations. Proc. AAAI; 2019. pp. 4651–4658.
  • [55]. Ororbia A, Mali A. Active predicting coding: Brain-inspired reinforcement learning for sparse reward robotic control problems. arXiv:2209.09174. 2022.
  • [56]. Strukov DB, Snider GS, Stewart DR, Williams RS. The missing memristor found. Nature. 2008;453(7191):80–83. doi: 10.1038/nature06932.
  • [57]. Kendall J, Pantone R, Manickavasagam K, Bengio Y, Scellier B. Training end-to-end analog neural networks with equilibrium propagation. arXiv:2006.01981. 2020.
  • [58]. Sacramento J, Costa RP, Bengio Y, Senn W. Dendritic cortical microcircuits approximate the backpropagation algorithm. Advances in Neural Information Processing Systems. 2018:8721–8732.
  • [59]. Krotov D, Hopfield JJ. Dense associative memory for pattern recognition. Advances in Neural Information Processing Systems. 2016.
  • [60]. Kendall J, Pantone R, Manickavasagam K, Bengio Y, Scellier B. Training end-to-end analog neural networks with equilibrium propagation. arXiv:2006.01981. 2020.
  • [61]. Wright LG, Onodera T, Stein MM, Wang T, Schachter DT, Hu Z, McMahon PL. Deep physical neural networks trained with backpropagation. Nature. 2022;601(7894):549–555. doi: 10.1038/s41586-021-04223-6.
