Modelling neural probabilistic computation using vector symbolic architectures

P Michael Furlong; Chris Eliasmith

doi:10.1007/s11571-023-10031-7

. 2023 Dec 16;18(6):1–24. doi: 10.1007/s11571-023-10031-7

Modelling neural probabilistic computation using vector symbolic architectures

P Michael Furlong ^1,^✉, Chris Eliasmith ¹

PMCID: PMC11655797 PMID: 39712100

Abstract

Distributed vector representations are a key bridging point between connectionist and symbolic representations in cognition. It is unclear how uncertainty should be modelled in systems using such representations. In this paper we discuss how bundles of symbols in certain Vector Symbolic Architectures (VSAs) can be understood as defining an object that has a relationship to a probability distribution, and how statements in VSAs can be understood as being analogous to probabilistic statements. The aim of this paper is to show how (spiking) neural implementations of VSAs can be used to implement probabilistic operations that are useful in building cognitive models. We show how similarity operators between continuous values represented as Spatial Semantic Pointers (SSPs), an example of a technique known as fractional binding, induces a quasi-kernel function that can be used in density estimation. Further, we sketch novel designs for networks that compute entropy and mutual information of VSA-represented distributions and demonstrate their performance when implemented as networks of spiking neurons. We also discuss the relationship between our technique and quantum probability, another technique proposed for modelling uncertainty in cognition. While we restrict ourselves to operators proposed for Holographic Reduced Representations, and for representing real-valued data. We suggest that the methods presented in this paper should translate to any VSA where the dot product between fractionally bound symbols induces a valid kernel.

Keywords: Probability, Bayesian modelling, Vector symbolic architecture, Fractional binding, Spatial semantic pointers

Introduction

Researchers studying the brain have proposed myriad ways in which the system under study can be understood through the lens of probability theory. Such work ranges from interpreting neuron activity as sampling distributions (Ma et al. 2008; Kappel et al. 2015a; Hoyer and Hyvärinen 2002) to high-level Bayesian models of concept learning (Lake et al. 2015). Some models propose probabilistic interpretations of abstract vectors representing concepts or beliefs that violate Kolmogorov’s axioms (Busemeyer et al. 2015; Pothos and Busemeyer 2022), while others suggest that populations of neurons provide distributed representations of probability density functions in the service of explicit Bayesian inference. However, there remains an explanatory gap in this literature between detailed models of spiking neuron behavior and sophisticated cognitive behaviors, more typically studied in non-neural models.

Vector Symbolic Architectures/Algebras (VSAs) have been purported to bridge connectionist and symbolic representations of cognition (Smolensky 1990; Plate 1994; Kanerva 1996; Gayler 2004; Levy and Gayler 2008; Eliasmith 2013), but it is unclear how probability might be modelled using VSAs. In this paper we suggest that some VSAs can naturally model probability by showing that if the similarity function between representations of continuous values induces a (quasi-)kernel function, then VSA statements on bundles of vector symbols are analogous to (quasi-)probabilistic statements.

Notably, VSAs have been used as a method for organizing neural systems that execute cognitive systems [e.g., Eliasmith et al. (2012)]. In such work, the algebraic operations of the VSAs constrain the structure of the neural networks, and the vector representations—of stimuli, motor plans, or even complex data structures—constrain the latent representations of populations of neurons.

This work is enabled through the ability to “compute in superposition” (Kleyko et al. 2022). Through the VSA operation of bundling (discussed in more detail in “Vector symbolic architectures” section), we can construct vectors that represent sets of objects. Through the operation of similarity, we can quantify membership of query points in those sets.

We focus on a particular representation of continuous values called Spatial Semantic Pointers [SSPs;(Komer 2020)], a special case of an operation called fractional binding or fractional power encoding (Plate 1992, 1994; Frady et al. 2021). While we restrict ourselves to the operators proposed for Holographic Reduced Representations [HRRs; Plate (2003)], and in particular their use in the Semantic Pointer Architecture [SPA; Eliasmith (2013)], we infer that the presented results translate to VSAs where similarity induces a meaningful kernel or quasi-kernel functions of the encoded data points, and where similarity distributes over bundling.

Several prior approaches use populations of spiking neurons to represent vector representations of probability functions [e.g., Ma et al. (2008), Sharma et al. (2017), Sharma (2018)]. In these representations, individual neurons represent bins around sample points in the domain of a probability distribution and their activity corresponds to the probability mass in that bin. As well, (Eliasmith 2013, §7.4) illustrates how the thalamacortical-basal-ganglia loop can effect Bayesian inference [as has Bogacz (2015)] and further, how to update distributions using HRR operations. In this paper, we go further and provide explicit VSA operations for computing over probability distributions—how to marginalize a distribution, how to compute entropy, and how to compute the mutual information between two random variables. Given the connection between VSAs and probability that we establish in this paper, we are able to construct neural circuits built using these methods implement probabilistic operations.

Recognizing the connection between VSA statements and probability-like statements provides a great deal of flexibility when designing networks to compute probabilistic inference. Implicit in the methods discussed in this paper is the ability to construct neurons whose activity corresponds to a bin around one value, or a multi-modal distribution over multiple values. Populations of neurons can be constructed that represent distributions over points in arbitrary domains. Similarly, if individual neurons’ tuning curves represent distributions over more than one point, we can construct a population of neurons to represent a collection of more abstract variables.

Our characterization of VSAs is useful for analyzing existing VSA-encoded cognitive models. Reframing VSA operations as probability operations provides a toolkit for exposing the probabilistic model underlying existing theories of cognition. This permits making testable hypotheses about animal behaviour and how they may deviate from standard probability models.

The results presented in this paper are an early effort at formulating neural probabilistic computation that bridges symbolic representations and connectionist implementations. Open questions remain about capacity, the limitations of specific implementations, and what kinds of probability models algebras other than the ones we consider will support. In sum, we claim that the main contributions of this paper are:

We explain how to interpret operations in the HRR VSA probabilistically. This permits other uses of these cognitive modelling tools to integrate probability into modelling tasks.
We demonstrate a simple method for modelling probability distribution functions using SSPs.
We enumerate VSA operations for implementing specific functions of probability distributions.
We have constructed novel neural circuits for computing operations on distributions. These operations are useful in designing autonomous systems, and can be understood as the building blocks for constructing neural systems that incorporate uncertainty when responding to stimuli.

The rest of the paper is laid out as follows. First, we briefly review spiking neural implementations of probability (“Neural representations of probability” section) and cognitive models of probability (“Cognitive representations of probability” section). Then we introduce concepts that are necessary to build the connection between VSAs and probability, through kernel density estimators (“Preliminaries” section). Next we draw the analogies between VSA operations and probability statements (“Binding encodes data” to “Unbinding is analogous to conditioning” sections). We then use these operations to implement VSA and spiking neural implementations of these operations (“Other operations” section), including novel circuits for computing entropy and mutual information using SSPs. Finally, we discuss the implications of using VSAs to model probability (“Discussion” section) and conclude.

Background

Probability is used to describe systems that are variable or uncertain. There is a desire to use probability to capture seemingly non-deterministic behaviour of neurons and networks of neurons, as well as in the behaviour of entire organisms. Furthermore, we often take biological systems to themselves be modeling the uncertainty in their environment, so suppose that they may perform probabilistic computation. Accounting for both neural variability and how that relates to internal neural computation is a well-studied problem. We provide a brief review below of prior work in modelling probability in neurons and in cognition.

Neural representations of probability

If one is to propose that brains represent probability distributions or implement probabilistic reasoning, then one requires a mechanistic explanation for how quantities are represented and how probabilistic operations are performed. A number of neural coding hypotheses have been proposed that relate the activity of a single neuron, $r_{i}$ , or population of neurons $r$ , and how they are used to represent distributions over some stimulus, $s$ , denoted $p (s ∣ r)$ . Here we use the terminology of Ma et al. (2008) to classify these coding schemes.

Explicit probability coding proposes the firing rate of an individual neuron represents the probability that its input is its preferred stimulus [e.g., Salinas and Abbott (1994), Pouget et al. (2003)]. We write this $⟨ r_{i} ⟩ \propto p (S = s_{i})$ , where $⟨ r_{i} ⟩$ represents the time average of the ith neuron’s firing rate. For example, the firing rate from a neuron with a Gaussian tuning curve centred at the preferred stimuli, would be proportional to the probability that the stimuli is equal to the preferred stimuli. Using this representation, a population of neurons would represent the probability of a set of samples from the domain of a distribution, and each neuron operates independent of the others. The probability distribution here is represented in the average activity of neurons. Relatedly, there is a similar representation, log probability codes, where the firing rate is proportional to the log of the probability of the that the input is the preferred stimulus of the neuron [e.g., Rao (2004)], and log likelihood ratio codes, where neuron activity is a function of the ratio of the probability of two, binary, stimuli (Deneve 2008).

Building on neurons that represent some function of the probability of the stimulus are convolutional codes (Anderson and Van Essen 1994; Zemel et al. 1996; Eliasmith and Anderson 2003; Barber et al. 2003). Here each neuron, i, has associated with it some function, represented as a vector, $ψ_{i} (s)$ , and the output of the population is the sum of these functions, weighted by their respective neuron’s firing rates. These networks are composed of neurons whose firing rates may be (approximately) probabilities, but the real output is the combined weighted functions:

\begin{matrix} p (s) \approx \frac{\sum_{i} r_{i} ψ_{i} (s)}{\sum_{j} r_{j}} . \end{matrix}

Beyond decoding the probability of a given stimulus, as a function of neural activity, this approach has also been applied to Bayesian inference (Sharma et al. 2017). This is enabled by the recognition that Bayesian inference, $p (y ∣ d) = \int p (y ∣ x) p (x ∣ d) d x$ , can be implemented as a linear operation (Eliasmith and Anderson 2003, Ch. 9).

Probabilistic population codes (PPC) (Ma et al. 2006, 2008) use the activity and the variability of a population of neurons to represent probability distributions over the stimulus to a neural network. Because they chose an exponential family of distributions to represent distributions, Bayesian cue integration can be accomplished by simple linear combinations of neural activities. It should also be noted that PPCs generalize explicit probability codes: as the tuning curves of the neurons approach a Dirac function, explicit codes are recovered. Recent biological experiments for representations of uncertainty in primates have produced observations that are consistent with the PPC (Darlington et al. 2018; Hou et al. 2019; Walker et al. 2020).

Where population codes represent distributions using the activity of neurons, Neural Sampling, takes a different approach, using neural activity as samples from a distribution, but never representing the distribution itself. Hoyer and Hyvärinen (2002) proposed a model where the instantaneous firing rates of neurons represent samples from a distribution. Their model represents distributions by the accumulation of samples either through populations, or over time in the case of a single neuron. Other approaches rely less on rate coding, such as Anastasio et al. (2000) and Buesing et al. (2011), where spikes represent samples from distributions over binary variables. The work of Buesing et al. (2011) was later extended to model structural plasticity (Kappel et al. 2015b). Huang and Rao (2014) represents variables using populations of neurons, and use sub-populations to represent specific variable values. The firing rates of those sub-populations, averaged over space instead of time, represent the probability of the variable taking on the corresponding value.

More recent neural sampling work examines the role of recurrent population dynamics to effect sampling from some latent distribution. Echeveste et al. (2020) demonstrated that networks that perform probabilistic sampling can explain observed cortical phenomena. Complementary findings suggest that cortical oscillations can improve the quality of sampling from spiking generative models (Korcsak-Gorzo et al. 2022). In this formulation they employ non-linear transformations from sampled points in neural activity space to the desired decoded domain. Masset et al. (2022) build on the work of Boerlin and Denève (2011) and Savin and Denève (2014) to construct a recurrent neural population that effects generalized Markov Chain Monte Carlo sampling. These approaches are able to implement sampling for Bayesian inference, however, they lack the kinds of algebraic manipulations that we present in this work, and hence the compositionality that is inherent in VSA-defined systems.

Both neural sampling and PPC promote neural variability to a first-class entity in an attempt to use the information-carrying capacity of that variability. For neural sampling, variability is required to instantaneously code samples, in contrast to PPC where the variability of the spikes characterizes a distribution’s covariance matrix. In our work we do not rely on firing rate variability to represent distributions, and do not explicitly represent probabilities until points in a distribution’s domain need to be evaluated. Our distributions are explicitly represented, as in PPC, but as points in a Hilbert space, as in quantum probability and other vector approaches.

Cognitive representations of probability

When modelling the behaviour of entire agents, and not just isolated neurons or circuits, probabilistic models have been useful in capturing how agents handle variability and uncertainty (Doya et al. 2007; Chater and Oaksford 2008; Goodman et al. 2016). These approaches run the gamut from tightly integrated with neural implementations (e.g., explicit probability codes), to abstract models of cognition, like the symbolic models of Goodman et al. (2016). High level models are extremely useful, in that they provide ways to reason about symbolic and symbol-like representations beyond the level of stimulus responses, and can lead to models of cognition that can support the kind of few-shot learning we desire of cognitive models [e.g., Lake et al. (2015), Xu et al. (2021), Rule et al. (2022)].

However, there is a desire to close the gap between high-level probabilistic representations and neural implementations, if for no other reason than to hypothesize about neural circuit function. The PPC, discussed above, provides mechanisms for integrating information using simple linear operations.

Likewise, the Neural Engineering Framework (NEF) has been used to implement probabilistic inference. Sharma et al. (2017) used convolutional codes to implement a model of life expectancy estimation task in spiking neural circuits. They found that this approach matched human-level performance on a life-estimation task more accurately than purely algebraic Bayesian estimation alone. In other related work, Eliasmith (2013, Sec. 7.4) suggested that the thalamacortical-basal-ganglia loop could provide a circuit to compute a Bayesian update using this vector representation of probability. This circuit has elsewhere been identified as a candidate mechanism for Bayesian inference (see, e.g., Bogacz and Gurney (2007), Bogacz and Larsen (2011), Bogacz (2015), Doya (2021)).

These NEF-style approaches show that neural networks can readily, and are in fact well suited to, encoding probability statements. More generally, they show that the NEF provides a technique to translate probability statements directly into neural networks. This work straddles the line between neural models of probability and cognitive models of probability, as there are tools for translating probability statements that may be at the cognitive level directly into neural populations.

In this paper, we show that there is a relationship between statements in certain VSAs and probability that arises naturally. Coupled with the NEF, this means that we can take probability statements, translate them into VSA statements, and from there, to networks of spiking neurons.

Quantum probability for cognition

The methods described in the previous section are all using a specific form of probability, namely Bayesian probability. Bayesian probability carries with it certain axioms about distributions, and a rule for integrating information: Bayes’ rule. However, human behaviour does not always strictly adhere to the laws of standard probability, and hence quantum probability has been proposed as a model for probabilistic cognition (Busemeyer and Bruza 2012).

Quantum probability is a generalization of Bayesian probability: where Bayesian probability adheres to Kolmolgorov’s axioms,1 quantum probability relaxes these assumptions. Consequently, Pothos, Buesmeyer, and others have exploited the differences between quantum probability and Kolmolgorov probability to explain disagreements between predicted and actual human behaviours (Pothos and Busemeyer 2013; Busemeyer et al. 2015; Pothos and Busemeyer 2022).

This is accomplished by modelling cognitive states as points in a Hilbert space, i.e., as vectors, and more precisely, as cognitive state vectors constrained to be unit vectors. The probability of a given event is determined by first projecting the cognitive state vector into the region of the Hilbert space that represents the event in question, and then taking the square of the magnitude of the resultant vector.

In the quantum case, if there are two events, A, and B, then the probability of any event is given by transforming the state vector, S, using projection matrices, $M_{A}$ and $M_{B}$ . The quantum probability of a given event, say A, is defined as $q (A) = ‖ M_{A} {S ‖}_{2}^{2}$ , and the probability of the conjunction, $A ⋀ B$ is defined by the sequential application of the projection matrices (Eq. 1).

\begin{matrix} q (A \cap B) & = ‖ M_{A} M_{B} {S ‖}_{2}^{2} or \end{matrix}

\begin{matrix} = ‖ M_{B} M_{A} {S ‖}_{2}^{2} \end{matrix}

Because matrix multiplication in general does not commute, under the quantum formulation the order of conditioning matters. Hence there are two statements of joint probability that are equal under the Bayesian formulation, that are not equal under the quantum one.

Under this interpretation, quantum probability implicitly assumes that the probability mass in the state vector, S, contains the probability for the event A and the event B in superposition. However, to show that quantum probability can recapture standard results, Pothos and Busemeyer (2013) suggest the notion of compatible questions, where the state vector does not contain the individual vectors that represent events A and B, but rather the tensor product, $A \otimes B$ .

Pothos and Busemeyer note that the tensor product has been used in vector symbolic architectures to construct vector-symbols that represent both of the constituent elements (Smolensky 1990). The tensor product is one possible implementation of what is called a binding operation in the vector symbolic architecture literature. Binding, which will be discussed more below, produces a new vector-symbol that acts like a conjunction of the two constituent elements.

Like matrix multiplication, the tensor product does not commute. But if arguments are supplied consistently, then the quantum formulation could represent conjunctions in a manner consistent with Bayesian probability. We will show below that the choice of binding operator in the VSA we use commutes and permits an implementation of probability that is consistent with the Bayesian formulation.

Preliminaries

In this section we review the concepts that make the connection between VSAs and kernel density estimators. First, we briefly discuss Kernel Density Estimators (KDEs) and how Random Fourier Features (RFFs) have been used to improve the memory and time complexity of kernel machines like KDEs. Next, we briefly review VSA operations, grounded in the use case of SSPs.

Kernel density estimators and random Fourier features

Kernel Density Estimators (KDEs) estimate the probability of a query point x based on the average of its similarity to members of a dataset of n observations $D = \{x_{1}, \dots, x_{n}\}$ . Similarity is measured using kernel functions, $k (\cdot, \cdot)$ , which are typically valid density functions. KDEs are defined $f_{X} (x) = \frac{1}{nh} \sum_{i = 1}^{n} k_{h} (x, x_{i})$ for kernel bandwidth $h \in R^{+}$ . A problem with KDEs is the memory required to maintain the dataset, $D$ , which can grow without bound, as does the time to compute a query. Rahimi et al. (2007) addressed this problem for KDEs and other kernel machines with the introduction of Random Fourier Features (RFFs).

RFFs project data into vectors so that the dot product between two vectors approximates a kernel function, i.e., $k (x, y) \approx ϕ (x) \cdot ϕ (y)$ . The data projection is computed $ϕ (x) = {(e^{i ω_{1} x}, \dots e^{i ω_{d} x})}^{T}$ , where the frequency components $ω_{i}$ are i.i.d samples from some probability distribution $G (ω)$ . The choice of $G (ω)$ determines the kernel induced by the dot product,2 and as $d \to \infty$ the kernel approximation is exact.

With RFFs, linear methods can approximate nonlinear kernel methods. Kernels that can be approximated with RFFs of dimensionality $d < n$ improve the memory and time complexity of querying a KDE from linear in the number of samples (n) to linear in the feature representation dimensionality (d). We can see the memory benefits by applying the kernel approximation to the definition of a KDE:

\begin{matrix} P (X = x) = \frac{1}{nh} \sum_{i = 1}^{n} ϕ (\frac{x}{h}) \cdot ϕ (\frac{x_{i}}{h}) \end{matrix}

Because the dot product distributes over the summation, we can rewrite it as:

\begin{matrix} P (X = x) = ϕ (\frac{x}{h}) \cdot \frac{1}{nh} \sum_{i = 1}^{n} ϕ (\frac{x_{i}}{h}) \end{matrix}

For a fixed dataset, the term $\frac{1}{nh} \sum_{i = 1}^{n} ϕ (\frac{x_{i}}{h})$ is a vector, making the complexity of querying the KDE O(d), instead of O(n). The EXPoSE algorithm (Schneider et al. 2016; Schneider 2017), for example, uses RFFs for fast anomaly detection in large datastreams in finite memory. Fourier features have also been applied to Gaussian Process Regression (Rahimi et al. 2007; Mutnỳ and Krause 2019). As we will discuss next, there is a connection between RFFs and the generation of random vectors used in fractional binding in VSAs.

Vector symbolic architectures

VSAs generally represent symbols as vectors, and provide operators for acting on those symbol-vectors. The symbols typically represent discrete concepts, like integers or other atomic symbols, and even data structures (Plate 2003; Kanerva 2009; Eliasmith 2013; Levy and Gayler 2008; Smolensky 1990; Kleyko et al. 2021), but have recently been extended to represent continuous quantities (Voelker et al. 2021; Kleyko et al. 2021; Frady et al. 2021). We focus on a special type of continuous vector representation called a Spatial Semantic Pointer [SSP; Komer (2020)]. Algorithm 1 is one possible algorithm for generating new SSPs, called every time a new SSP is needed. There are two choices that can be made in this procedure: the distribution used for generating the frequency components, $θ_{i, j}$ ; and the dimensionality of the vector, d. Like with RFFs, different generating functions for the frequency components induces different kernel functions (Frady et al. 2021), and can provide improvements in the efficiency of the representation (Komer 2020; Dumont and Eliasmith 2020). In this work we will use a uniform distribution over the frequency components. This leaves the choice of dimensionality as a choice constrained by application specific needs.

graphic file with name 11571_2023_10031_Figa_HTML.jpg

Vectors generated by Algorithm 1 are both unit vectors and unitary vectors, meaning that the magnitude of all frequency components is 1. This latter property has two benefits exploited by SSPs: repeated convolution preserves the vector’s magnitude; and the dot product is preserved up to scale between the Fourier and time domains. That is, for two SSPs, $X_{1}, X_{2}$ , it is the case that $X_{1} \cdot X_{2} = F \{X_{1}\} \cdot \bar{F \{X_{2}\}}$ (Voelker 2020).

With our base representation defined we now turn to SSP operations. The four operators used in this document are similarity, bundling, binding, and unbinding, the same operators as used for HRRs, but here with a mapping to continuous representations. Similarity, $\cdot$ , compares two vector symbols. Similarity is the vector dot product. For two vectors, $x, y \in R^{d}$ , when $x = y$ , $x \cdot y$ should be 1, and when $x \neq y$ , $x \cdot y \approx 0$ .

Bundling, also called collecting or superposition, denoted $+$ , combines two vectors into a new vector that maintains some similarity with its constituent elements, i.e., $x \cdot (x + y)$ and $y \cdot (x + y)$ should both be relatively large. For SSPs and HRRs, bundling is vector addition. Bundles of vectors can be understood as sets of the constituent symbols. Similarity distributes over bundling, $x \cdot (y + z) = x \cdot y + x \cdot z$ , meaning the sum of similarity between a vector and all elements of a bundle can be computed with one operation.

Binding, denoted $\otimes$ , combines two vectors into a new vector that is dissimilar to either of the constituent components, i.e., $x \cdot (x \otimes y) \approx 0$ and $y \cdot (x \otimes y) \approx 0$ . We implement binding with circular convolution,3, denoted $⊛$ . We will use $\otimes$ as binding notation except where we exploit properties of circular convolution. Binding can be used as a basis for representing numbers. For integers, one generates a vector symbol, $X$ , and binds it with itself an integer number of times (Choo and Eliasmith 2010) to represent that integer number. We denote this $ϕ_{X} [n] = \overset{n}{\underset{i = 1}{\otimes}} X$ , where $n \in Z$ . Since this binding is circular convolution, and convolution in the time domain is multiplication in the Fourier domain, we can write $ϕ_{X} [n] = F^{- 1} \{e^{i Θ_{X} n}\}$ . For SSPs, the vector being bound is called an axis vector, and binding is applied a real-valued number of times (Plate 1995). Non-integer binding has been called fractional binding (Komer 2020), and SSPs use this technique extensively.

Unbinding, denoted $⊘$ , undoes, approximately, the binding operation. Given vector symbols, $x, y \in R^{d}$ , and $z = x \otimes y$ , then $x \cdot (z ⊘ y) \approx 1$ and $y \cdot (z ⊘ x) \approx 1$ . Unbinding for circular convolution can be implemented by binding with an “inverted” vector. The pseudo-inverse of a vector $x = {(x_{1}, \dots, x_{d})}^{T}$ is $x^{- 1} = {(x_{1}, x_{d}, x_{d - 1}, \dots, x_{2})}^{T}$ . Thus we write unbinding as either $y \approx z ⊘ x$ or $y \approx z ⊛ x^{- 1}$ .

With these operations, VSAs can produce models of cognition that map onto populations of neurons, as well as construct complex data structures and programs for neuromorphic hardware (Eliasmith 2013; Kleyko et al. 2021). Readers interested in other VSAs operators are referred to (Gosmann and Eliasmith 2019; Neubert et al. 2019; Schlegel et al. 2020). In the following section we show how to interpret VSA statements constructed using SSPs as density estimators.

Analogies to probability operations

In this section we show how operations with SSPs relate to probability statements. Eliasmith (2013, §7.4) outlines one (strict) relationship between operations on vector representations and probability distributions. We provide a different interpretation of how to convert similarities into probabilities. Our work differs other from prior work because we represent distributions as points in a Hilbert space—as vectors—which are encoded in the activities of neural populations. We can manipulate distributions by using VSA operators without leaving the VSA space, only decoding out to probability values when query points need to be evaluated.

Below we discuss how the VSA operations discussed above imply probability statements. We assume a fixed dataset, $D = {x_{1}, \dots, x_{n} ∣ x_{i} \in R^{m}}$ of n samples of m-dimensional data. Unless otherwise stated, we have implemented our models using Nengo (Bekolay et al. 2014), using spiking rectified linear neurons with maximum firing rates of 50 spikes per second, and use 2048-dimensional SSP representations. In Appendix 1 we provide a complexity analysis of the proposed methods.

Binding encodes data

SSPs use fractional binding to project data from some domain $X \subseteq R^{m}$ , into vector representations. Fractional binding is mathematically equivalent to the inverse Fourier transform of data encoded with RFFs. As discussed in Voelker (2020), if our SSPs are unitary, the dot product is preserved, up to scale. As with RFFs, the frequency component distribution determines the kernel induced under similarity (Sutherland and Schneider 2015; Frady et al. 2021). In this paper we use the uniform distribution over $[- π, π]$ to generate frequency components. We use a length scale parameter, h, so when we write $ϕ_{X} (x / h)$ we mean $F^{- 1} \{e^{i Θ_{X} x / h}\}$ , for $x \in R^{m}$ .4

In theory, data axes can have different generating distributions for their axis vectors, as long as the generated vectors remain unitary. However, when computing similarity it is important to encode all data with the same axis vectors and length scale parameter(s), h. It is beyond the scope of this paper, but these elements may form a concept of a data type for VSAs. We assume that for each data set the axis vectors will be randomly generated once, and we will denote data encoded with those vectors and a particular length scale as $ϕ_{X} (x / h)$ .

Similarity computes probability

The fundamental analogy we are drawing is between computing probability with KDEs and measuring the similarity of a query point with a bundle of fractionally bound vector symbols. We define our estimator as $\hat{f} (x ∣ D) = ϕ_{X} (x / h) \cdot \frac{1}{nh} \sum_{x_{i} \in D} ϕ_{X} (x_{i} / h)$ . For any domain space $x \in X \subseteq R^{m}$ , we will denote the normalized sum as $M_{X, n} = \frac{1}{nh} \sum_{x_{i} \in D} ϕ_{X} (x_{i} / h)$ . If we wish to highlight a subdivision of the elements in the vector representation x we may denote the sum $M_{XY, n}$ .

Using Algorithm 1 to generate axis vectors, the dot product between two SSPs induces the normalized sinc function (Voelker 2020), which is a quasi-kernel, as it takes on negative values.5 Consequently, $\hat{f} (x ∣ D)$ , is not a KDE, but is the special-case Fourier Integral Estimator [FIE; Davis (1975), Davis (1977)]. An optimal length scale, h, exists for the FIE (Glad et al. 2007; Chacón et al. 2007), and can be estimated by solving the equation $‖ φ_{n} (1 / h) ‖ = \frac{1}{\sqrt{n + 1}}$ on $1 / h \in [0, \sqrt{n}]$ , where $φ_{n} (t)$ is the empirical characteristic function (Glad et al. 2007), or by cross validation.

While the FIE is not a probability estimator, it can be converted to one. Two techniques for doing so are due to Glad et al. (2003, 2007), and Agarwal et al. (2016). Glad et al. (2003) developed corrections for different classes of quasi-kernels. The particular correction for the FIE is:

\begin{matrix} f_{X} (x) \approx max \{0, {\hat{f}}_{FIE} (x ∣ D) - ξ\} \end{matrix}

where $ξ \in R$ is selected so $\int_{- \infty}^{\infty} max \{0, {\hat{f}}_{FIE} (x ∣ D) - ξ\} d x = 1$ . Using our SSP implementation of the FIE, we can rewrite the conversion as:

\begin{matrix} f_{X} (x) \approx max \{0, ϕ_{X} (x / h) \cdot M_{X, n} - ξ\} . \end{matrix}

By inspection, this conversion is equivalent to a ReLU neuron with a bias, $b = - ξ$ , and either $W = M_{X, n}$ or $W = ϕ_{X} (x / h)$ are the synaptic weights.6 Letting $W = M_{X, n}$ frames populations of neurons (assuming a different $M_{X, n}$ for each neuron in the population) as estimating the probability of a query point, $ϕ_{X} (x / h)$ , under different distributions. In this interpretation, populations of neurons can be understood as collections of (inexact) density estimators, where imprecision is compounded by the differences between a ReLU and biological neuron’s transfer functions. A population of neurons could be used as a boosted version of a KDE. In the same way, each neuron could be understood to represent a distribution conditioned on a random variable.

On the other hand, if we allow the weights of a neuron to represent one point, $W = ϕ_{X} (x / h)$ , then a population of neurons could be sampling points, $x \in X$ , in the domain of the probability distribution that $M_{X, n}$ represents. In the case where the memory, $M_{X, n}$ , contains only one encoded point, this network would be an explicit probability code, in the terminology of Ma et al. (2008), and multiple neurons could be combined to produce a convolutional code. Figure 1 shows the performance of a spiking neural network using both styles of probability encoding using a 2048-d SSP approximation of a FIE.

Fig. 1 — Neuron activity encodes probability. We plot two different techniques for representing probability: single neuron representations where the average vector is stored in one neuron’s synaptic weights (A, C), and population encoding, where the average vector is input to a population that samples a distribution’s domain (B, VSA). We show this for a beta distribution (A, B) and a 1D Gaussian mixture model (C, D). We also plot a KDE with a radial basis function kernel for comparison. All data were encoded as 2048-d SSPs and distributions were generated using $N = 5000$ samples. Neural activity was average over 5 s of operation in the case of the population encoding, and for one second of stimulus presentation for the single neuron encoding. For the neural population encoding we used 1000 neurons uniformly sampling the domain [0, 1] for the beta distrbution, and $[- 10, 10]$ for the GMM

Regardless of whether $M_{X, n}$ or $ϕ_{X} (x / h)$ is playing the role of the synaptic weights, the choice of $ξ$ determines what the neuron is computing. If we choose $ξ$ such that

\begin{matrix} \int_{- \infty}^{\infty} max \{0, ϕ_{X} (x / h) \cdot M_{X, n} - ξ\} d x = 1 \end{matrix}

then the activity of the neuron(s) will approximate $f_{X} (x ∣ D)$ , regardless how the synaptic weights are specified.

However, if we select $ξ$ such that

\begin{matrix} \int_{- \infty}^{\infty} max \{0, ϕ_{X} (x / h) \cdot ϕ_{X} (x^{'} / h) - ξ\} d x = 1 \end{matrix}

and then a neuron is presented with the stimulus $M_{X, n}$ , the activity of the neuron will be

\begin{matrix} a (M_{X, n}) & = max \{0, \frac{1}{nh} \sum_{x_{i} \in D} ϕ_{X} (x_{\cdot} / h) ϕ_{X} (x_{i} / h) - ξ\} \\ \leq \sum_{x_{i} \in D} max \{0, ϕ_{X} (x / h) \cdot ϕ_{X} (x_{i} / h) - ξ\} \\ = \sum_{x_{i} \in D} P (X = x_{i} ∣ μ = x^{'}) \end{matrix}

where $P (X = x ∣ μ = x^{'})$ specifies a distribution centred at $x^{'}$ . We can understand this relationship of computing the probability of a set of observations, given a distribution centred at location $x^{'}$ , and see that its value is likewise bounded by the union bound. If we have some distribution over $x^{'}$ , $P (X^{'} = x^{'})$ , then we could use the activity of the population of neurons to compute an upper bound on the probability of experiences encoded in $M_{X, n}$ .

An alternative to the ReLU-style conversion is developed by Agarwal et al. (2016). Here the FIE output is squared, $f (x) \approx ‖ ϕ_{X} (x ; h) \cdot M_{X, n}^{'} ‖^{2}$ , using a modified version of the standard memory term $M_{X, n}^{'} = \sum_{x_{i} \in D} c_{i} ϕ_{X} (x_{i} / h)$ . The modification requires solving for a set of weighting parameters, $c_{i}$ . This technique is used by Frady et al. (2021), to construct KDEs using a VSA. We observe that conversion of Agarwal et al. (2016) at least superficially resembles Born’s rule for converting the quantum wave function into a probability (Born 1926), hinting at a deeper connection with models of cognition based on quantum theory (Pothos and Busemeyer 2013), as suggested by Stewart and Eliasmith (2013).

Both conversions have parameters that must be solved for and we do not comment on which method is preferable. Agarwal et al. (2016) provides an efficient method for solving for the weighting parameters, but it does require evaluating the Gram matrix. Solving for Glad et al.’s $ξ$ requires computing the integral of a non-linear function of a VSA estimator. Regardless of the chosen conversion, the analogies to probability operations laid out in this paper hold. In this paper the symbol $\overset{C}{\approx}$ indicates that operations on vector symbols are approximating a density, using one of the above conversions, C. We indicate converted values by $f_{X} (x) \approx C [ϕ_{X} (x / h) \cdot M_{X, n}]$ .

Bundling updates beliefs

Updating a belief with observations is vector addition. If we have a memory unit, $M_{n - 1} = \frac{1}{(n - 1) h} \sum_{x_{i} \in D_{n - 1}} ϕ_{X} (x_{i} / h)$ , then updating the memory, and the distribution, is simply updating the running mean $M_{n}$ . To ensure the KDE stays normalized by the number of samples, we should write the update as:

\begin{matrix} M_{X, n} = \frac{1}{nh} ϕ_{X} (x_{n} / h) + \frac{n - 1}{n} M_{X, n - 1} . \end{matrix}

If $M_{X, n}$ , is represented by a population of neurons, there is a concern of saturating the activity of the population. If the running average is computed exactly, then the length of $M_{X, n}$ should stay at 1, but computing the exact average requires an unbounded representation for n. We observe that the running average update in Eq. 3 is the appropriate Bayesian update for the expected $M_{X}$ , assuming a multivariate Gaussian distribution and an uninformative prior.

There is an implementation concern of whether or not to store the memory as a normalized or unnormalized sum. In either case, if the desire is to operate with normalized quasi-probabilities then there is a need to keep track of the length scale, h, and the number of data points in the kernel estimator, n. However, if an unnormalized $M_{X, n}$ is being represented by a population of neurons, there is the risk of saturating the activity of the neurons. It has been suggested that the saturation itself could act as a form of normalization (Eliasmith 2013, §7.4).

Unbinding is analogous to conditioning

There are three ways to understand the unbinding operator acting on fractionally bound representations. First, $ϕ_{X} (x / h) \otimes ϕ_{Y} (y / h) ⊘ ϕ_{Y} (y^{'} / h)$ can be viewed as shifting the representation $y^{'}$ units along the Y-axis. Committing to all query points being evaluated at $y = 0$ uncovers two other interpretations of unbinding. The second interpretation of unbinding is currying the evaluation of a joint probability distribution, i.e., $g (X) = f (X, Y = y) \overset{C}{\approx} ϕ_{X} (x / h) \otimes ϕ_{Y} (0) \cdot \frac{1}{nh} \sum_{x_{i}, y_{i} \in D} ϕ_{X} (x_{i} / h) \otimes ϕ_{Y} (\frac{y_{i} - y}{h})$ . Finally, recognizing that $f (X ∣ Y = y) = \frac{1}{η} f (X, Y = y)$ , then the unbinding operator can be understood as an unnormalized conditioned distribution. Normalizing the conditional distribution will require either memory or time, as we show below.

Encoding a two-dimensional distribution with observations $D \subseteq X \times Y$ in the usual way, $M_{XY, n} = \frac{1}{nh} \sum_{(x_{i}, y_{i}) \in D} ϕ_{X} (x_{i} / h) ⊛ ϕ_{Y} (y_{i} / h)$ . To condition $M_{XY, n}$ on an observation of the random variable, $Y = y$ , we unbind the value y from the sum, $M_{XY, n}$ , giving us:

\begin{matrix} M_{XY, n} ⊘ ϕ_{Y} (y / h) = \frac{1}{nh} \sum_{(x_{i}, y_{i}) \in D} ϕ_{X} (x_{i} / h) \otimes ϕ_{Y} (\frac{y_{i} - y}{h}) \end{matrix}

Taking the dot product between a query point $ϕ_{X} (x / h) \otimes ϕ_{Y} (y^{'} / h)$ and the unbound memory, $M_{X ∣ Y = y, n} = M_{X Y, n} ⊘ ϕ_{Y} (y)$ , then the result should be:

\begin{matrix} ϕ_{X} (x / h) \otimes ϕ_{Y} (y^{'} / h) \cdot M_{X ∣ Y = y, n} \\ = ϕ_{X} (x / h) \otimes ϕ_{Y} (y^{'} / h) \cdot \frac{1}{nh} \sum_{(x_{i}, y_{i}) \in D} ϕ_{X} (x_{i} / h) \otimes ϕ_{Y} ((y_{i} - y) / h) \end{matrix}

Setting $y^{'} = 0 \Rightarrow ϕ_{X} (x / h) \otimes ϕ_{Y} (0) = ϕ_{X} (x / h)$ , meaning $ϕ_{X} (x / h) \cdot (M_{XY, n} ⊘ ϕ_{Y} (y / h))$ is a valid similarity. The result is analogous to the joint probability of the query point $X = x$ with a fixed $Y = y$ . It can be converted to a kernel-smoothed estimate of $\sum_{x_{i} \in D} ϕ_{X} (x_{i} / h)$ near $Y = y$ by a location-dependent normalizing term (Wand and Jones 1995). This requires representing the marginal distribution over the conditioning variable. However, because the similarities can be negative, scaling by similarities in SSP space alone may estimate $\pm f (x ∣ y)$ . Thus, we convert before normalizing, to effect a conditional kernel density estimator (Rosenblatt 1969)

\begin{matrix} f (X ∣ Y = y) \approx \frac{C [ϕ_{X} (x / h) \cdot M_{X ∣ Y = y, n}]}{C [ϕ_{Y} (y / h) \cdot M_{Y, n}]} \end{matrix}

where $M_{Y, n} = \sum_{x_{i}, y_{i} \in D} ϕ_{Y} (y_{i} / h)$ . This adds a burden of either maintaining memory for every set of conditioning variables or the circuit for marginalizing the distributions. Alternatively, one could normalize by the sum of all possible values of x, re-writing it as:

\begin{matrix} f (X ∣ Y = y) \approx \frac{1}{η} C [ϕ_{X} (x / h) \cdot M_{X ∣ Y = y, n}] \end{matrix}

where $η = \int_{X} C [ϕ_{X} (x / h) \cdot M_{X ∣ Y = y, n}] d x$ . This approach requires the time and mechanisms to compute $η$ . To demonstrate the performance of the conditioned distribution we compare a 2D Gaussian distribution with the analytical solution to conditioning, and the quasi-distribution induced by unbinding.

We created a bundle of $n = 5000$ observations sampled from the 2D Gaussian Mixture Model (GMM) distribution shown in Fig. 2. We computed the solution to $f_{Y ∣ X} (y, x)$ for the conditioning values $x \in \{- 9, - 2, 2\}$ . Figure 3 shows the samples used in the estimator as well as the normalized, $f_{Y ∣ X} (y, x)$ , and unnormalized, $f_{XY} (x, y)$ , conditional distributions. Note that for the dataset generated, there are no samples near any location where $x = - 9$ .

Fig. 3 — Above are the samples used for modelling the 2D GMM (A). In the bottom row of graphs (B) we see the unnormalized (dashed lines) and normalized conditional distributions (solid lines) based on the true distribution. Note for the leftmost distribution, $P (Y ∣ X = - 9)$ , there is not a substantial probability mass in that region, but the normalized, conditional distribution has a clear peak around $y = - 2$

The conditioned distributions computed using unbinding and implemented in spiking neurons are shown in Fig. 4. We also show the conditioned distributions with a purely algebraic implementation, and the conditioned distributions using a KDE. As with the 1D distribution, we used a 2048D SSP representation.

We computed the normalizing constant by integrating over the unnormalized distribution, $f_{XY} (x, y)$ . When the conditioned distribution passes through high sample density points we get a reasonable estimate using the SSP implementation. However, in the case of $x = - 9$ , we get conditioned distributions that appear to be uniformative, while the KDE performs reasonably. One could make an argument that when making predictions about an unsampled region of the domain, that something close to a uniform distribution is a reasonable prediction, but we do note that the KDE and SSP predictions differ here.

Other operations

To further expand on the above relationships we show how some standard probability operations can be implemented using bundles of fractionally bound vector symbols. We explain how to compute marginal distributions, entropy, and mutual information.

Marginalization

Marginalization produces a distribution over a subset of variables $U \subset V$ from a distribution over the variables V. This is conducted by integrating over the marginalized variables. In math: $f_{X} (x) = \int_{Y} f_{XY} (x, y) d y$ . Using our analogy we can re-write the marginalization process as

\begin{matrix} f_{X} (x) \approx \int_{Y} C [ϕ_{X} (x / h) \otimes ϕ_{Y} (y / h) \cdot (\sum_{(x_{i}, y_{i}) \in D} ϕ_{X} (x_{i} / h) \otimes ϕ_{Y} (y_{i} / h))] d y . \end{matrix}

Exactly computing the marginal distribution requires integrating estimated probability after the conversion, $C [\cdot]$ , out of SSP-space and into neural activity. An alternative approach is to marginalize in SSP-space and then project out to neural activity. Instead, this is akin to marginalizing a Fourier Integral Estimator, possibly introducing noise into the estimated marginal distribution, but with the benefit of being able to operate on SSPs. We write the SSP-space marginalization as

\begin{matrix} f_{X} (x) \overset{C}{\approx} \int_{Y} ϕ_{X} (x / h) \otimes ϕ_{Y} (y / h) \cdot (\sum_{(x_{i}, y_{i}) \in D} ϕ_{X} (x_{i} / h) \otimes ϕ_{Y} (y_{i} / h)) d y \end{matrix}

and since binding and the dot product distribute over addition:

\begin{matrix} f_{X} (x) \overset{C}{\approx} (ϕ_{X} (x / h) \otimes \int_{Y} ϕ_{Y} (y / h) d y) \cdot (\sum_{(x_{i}, y_{i}) \in D} ϕ_{X} (x_{i} / h) \otimes ϕ_{Y} (y_{i} / h)) \end{matrix}

Note that $\int_{Y} ϕ_{Y} (y / h) d y$ is another vector, and can be approximated by sampling the space Y, or it can be computed directly if the range of integration is finite. Denote $Φ_{Y} = \int_{Y} ϕ_{Y} (y / h) d y$ , then marginalization becomes $f_{X} (x) \overset{C}{\approx} (ϕ_{X} (x / h) \otimes Φ_{Y}) \cdot M_{X Y, n}$ . Assuming that $ϕ_{X} (x / h), Φ_{Y},$ and $M_{X Y, n}$ are column vectors, noting that convolution is commutative, and that circular convolution can be written as a matrix–vector product between one argument and the circulant matrix, $Circ (\cdot)$ , of the other argument, we can make the following simplification:

\begin{matrix} (ϕ_{X} (x / h) \otimes Φ_{Y}) \cdot M_{X Y, n} & = {(Circ, (Φ_{Y}), ϕ_{X}, (x / h))}^{T} M_{X Y, n} \end{matrix}

\begin{matrix} = ϕ_{X} (x / h) \cdot (Circ, {(Φ_{Y})}^{T}, M_{X Y, n}) \end{matrix}

So there is a linear map that marginalizes $M_{X Y, n}$ . The SSP estimator and the true marginalized distribution of the multivariate distribution over ${(x y)}^{T}$ is shown in Fig. 5.

In Fig. 5 we can see the effect of the noise due to a pre-rectification marginalization process. While the post-rectification marginal distributions do not have noise outside of the main probability mass, the post-rectification distribution clearly does, in both dimensions. However, the distribution modelled by pre-rectification marginalization is not unreasonable, and provides a means of computing the operation without leaving the SSP abstraction.

Entropy

Entropy ( $H [\cdot]$ ), is a non-linear function of probability distributions. It can be estimated by sampling observations $ϕ_{X} (x / h) \sim G (M_{X, n})$ , and computing an average of the negative log probability of the sample points, $\hat{H} [X] = \sum_{i = 1}^{S} - f_{X} (x_{i}) log (f_{X} (x_{i})) x$ . If one is using a single neuron encoding, then entropy could be estimated using a running average, ${\hat{H}}_{t} = \frac{1}{t} (- log (C [ϕ_{X} (x_{t} / h) \cdot M_{X, n}])) + \frac{t - 1}{t} {\hat{H}}_{t - 1} .$ Representing an unbounded number of observations, t, in a neural network is challenging. Alternatives to an exact running average include low-pass filtering and computing entropy over a fixed window of samples.

Here we use a fixed number of samples and instead of a single neuron encoding, we estimated entropy using a population encoding, allowing us to compute entropy in additional time step, at the cost of memory. We construct a population of neurons whose synaptic weights represent sample points from the distribution’s domain, $X_{S}$ . For sample points, $x_{s} \in X_{S}$ , one can construct a neural network with a weight matrix, $W_{S}$ :

\begin{matrix} W_{S} = {[\begin{matrix} | & | & | \\ ϕ_{X} (x_{1} / h) & ϕ_{X} (x_{2} / h) & \dots & ϕ_{X} (x_{S} / h) \\ | & | & | \end{matrix}]}^{T} \end{matrix}

and output $a (M_{X, n}) = m a x {0, W_{S} \cdot M_{X, n} - ξ}$ . Then $a (M_{X, n})$ is a vector of the probabilities of the samples $X_{S}$ . We the employ a single hidden-layer for each sample point in $W_{S}$ , to compute the function $- f_{X} (x_{i}) log (f_{X} (x_{i}))$ . This function can be computed by a single layer neural network, trained using transformation principle of the Neural Engineering Framework (Eliasmith and Anderson 2003). Since this function is concave, it is easier to approximate than just $- log (f_{X} (x))$ , and it adds no additional complexity to the network architecture.

The entropy can then be computed by summing the outputs of these populations into a single a single ensemble—this final population represents the entropy of the distribution. Figure 6 panels A and D show the evolution of the output of a network designed to compute entropy for a beta distribution and a 1D GMM, respectively. We also compare an estimate of entropy using a KDE, and a purely algebraic implementation of computing entropy using SSPs.

Fig.
6 — The left column A–C shows performance on the model when estimating a beta distribution, and the right column D–F shows the performance estimating a 1D GMM. The top row shows the evolution of the output of the entropy network (solid blue line) as it evolves over time, as well as the entropy computed using the true distribution, a KDE with a radial basis function kernel, and an algebraic (non-spiking) implementation of the SSP entropy computation. The latency in convergence is due to the initialization of the neuron states in Nengo, and the internal neuron dynamics. The middle row shows the estimate of the probability distribution internal to the network that computes entropy, as well as the KDE and algebraic SSP models. As we can see from panels B and E, there is a high degree of agreement between the algebraic and spiking estimates of probability. Panels C and F show that errors arise when the network computes the quantity $- f_{X} (x) log (f_{X} (x))$ . Particularly for values of $f_{X} (x) < 1$ , the neural implementation diverges from the algebraic model. This error can be reduced by allocating more neural resources to the elements that compute the summands of entropy

There are three things to note about the performance of the entropy calculation network. First, none of the candidate methods exactly achieve the exact value of entropy computed from the true distributions. Second, as can be seen in panels B and E of Fig. 6, the probability distributions are well modelled by the neural and algebraic SSP implementations. Finally, panels C and F of Fig. 6 show the values of the summands of the entropy calculation, $- f_{X} (x_{i}) log (f_{X} (x_{i}))$ , observed in the spiking network. The network produces erroneous outputs for values of $f_{X} (x) < 1$ . The lines marked “Observed” show the values of entropy summands computed by the network, while the lines labelled “Expected” show the exact function of the neural probability estimates from the above panels (B and E, respectively).

This extra noise causes the entropy approximation in panels A and D to over-estimate the true value, as well as the value of entropy computed using SSPs algebraically. This error can reduced by allocating more neural resources to the network which computes $- f_{X} (x_{i}) log (f_{X} (x_{i}))$ , or by pre-training a network to a high degree of accuracy.

Mutual information

Mutual information is a useful tool in a number of applications, including action selection for information gain [e.g., Loredo (2003), Krause et al. (2008), Arora et al. (2019)]. Shannon mutual information can be defined as $I (X ; Y) = H [X] + H [Y] + H [X Y]$ . We can exploit the ability to marginalize distributions to compute these quantities. As with entropy, mutual information should be computable either by time-averaged sampling or by sampling the domain(s), but in this work we only investigate population encodings.

Computing mutual information requires access to the joint and marginal distributions over the variables X and Y. From our previous results we know we can write the marginal distributions as ${\hat{f}}_{X} (x) = C [W_{X} \cdot Circ {(Φ_{Y})}^{T} M_{XY, n}]$ , and $\hat{f} (Y) = C [W_{Y} \cdot Circ {(Φ_{X})}^{T} M_{XY, n}]$ . $W_{X}$ and $W_{Y}$ are the weights of neural networks that sample the domains $X$ and $Y$ , as with entropy, above. Then to compute the mutual information between two variables we can replicate the entropy network three times, and create a neural population that sums all three terms, $H [X]$ , $H [Y]$ , and $- H [X Y]$ .

To demonstrate the performance of the technique, we compute the mutual information between the variables X an Y in the Gaussian Mixture Model shown in Fig. 7. However, the above equation depends on marginalizing distributions and computing entropy. As discussed above, computing the nonlinear function $- f_{X} (x_{i}) log (f_{X} (x_{i}))$ can be noisy, and depending on the form implemented, marginalization can be too. Because of the entropy-induced noise, we only discuss algebraic implementations of mutual information, as opposed to the spiking neural implementations of the preceding functions.

Fig.
7 — To compute mutual information we use the same 2D Gaussian Mixture Model we used in conditioning and marginalization (left). The middle plot shows the fit of the distribtion using a Kernel Density Estimator, and the right shows the fit using the SSP estimator

In Fig. 8 we compare computing mutual information using a KDE with a radial basis function kernel, an SSP implementation using post-rectification marginal distributions, and an SSP implementation using pre-rectification marginal distributions. In other words, the KDE and ’post-rectification’ implementations both assume exact marginalization. All three estimators used the same length scale parameters for the X and Y dimension.

The post-rectification SSP model has comparable performance to the KDE, but the pre-rectification model has considerably worse error, introduced by the pre-rectification marginal computation. Putting aside the question of errors from spiking neural entropy implementations, there is a cost-benefit analysis to consider between using marginalization in SSP space for higher-order operations, versus computing functions of the distribution, post-rectification, in probability space. Alternatively, one could maintain individual memories for relevant variables in a joint distribution, that is, computing $M_{X}$ and $M_{Y}$ , in addition to $M_{XY}$ , at the cost of more memory.

Discussion

So far we have demonstrated how to implement probabilistic statements using certain VSA operations. Next we discuss considerations when translating the algebraic SSP operations to neural implementations (“Considerations for neural mapping” section). We then discuss the relation of such model to cognitive behaviour (“Gaps between probability models and behaviour” section). This is followed by a more in-depth comparison to the quantum probability formulation of cognitive uncertainty (“Relation to quantum probability” section). Finally, we summarize the benefits an limitations of this approach to probabilistic modelling (“Benefits and limitations” section).

Considerations for neural mapping

The benefit of this approach to modelling probability is that one can produce probabilities statements in a VSA and rely on established techniques to map them to (spiking) neural networks. However, this approach has only recently been developed. Below we highlight some considerations for the neural modeller looking to implement these techniques.

Representing Distributions in Synaptic Weights versus Population Activity

When using neurons to compute probability we make reference to two vectors, $M_{X, n}$ , which represents a data set, and $ϕ_{X} (x / h)$ , which encodes a query point. An observation we made is that either of those values can play the role of synaptic weights of a neuron, with the other represented by the activity of a population of afferent neurons. While we primarily represented the distribution in neural population activity and the query points in the synaptic weights, it is not necessarily obvious which vector should play which role. Indeed, that choice is likely to be task dependent. One factor that may guide the choice is which value is changing more frequently.

If one needs to frequently compute the probability of a fixed set of query points given recent experience, then the query points should be the synaptic weights. This formulation may be useful when processing data streams. Take, for example, an agent making foraging decisions, with recent observations stored in working memory, $M_{X, n}$ . One could encode a set of query points $\{x_{i}\}$ as a population of neurons, with associated values, $\{v_{i}\}$ . Then, as $M_{X, n}$ changes with new observations, one could quickly compute the expected value of the current environment. Then, if a threshold value is crossed, the agent could choose to stop and investigate.

If, on the other hand, one is quickly testing query points given a distribution that is likely to have converged, then the distribution, $M_{X, n}$ , should be represented by the synaptic weights and the query point by the activity of neurons. A situation where this may arise is in the case of novelty detection. If an agent has learned a set of typical observations (sounds, smells, etc.) and encounters something novel, neurons that encode long-term experience can provide a signal of low-probability. One could imagine such a system to encourage the agent to change behaviour from exploitative to exploratory, or provide a signal that enables neo-philic or -phobic behaviours.

A third option is to represent both the distribution and the query as the activity of neural populations. This technique has been use in the Semantic Pointer Architecture, but it can become memory intensive, should one need to represent a large number of query points simultaneously. However, if one needs to quickly change both the distribution and the query points, this approach should be considered.

Neural Variability

One motivating factor behind the PPCs and neural sampling approaches to probability, is the desire to explain neural activity variability as an information-carrying quantity. In our implementation we do not require the Poisson variabilty that is explained by these coding methods. However, in NEF-style networks implementing SPA networks, Poisson-like firing statistics have been recovered (Eliasmith et al. 2012). Biological studies indicate that variability may be due to the kinds of stimulation that are provided to neurons—constant stimulation produces Poisson variability, while spike trains produce repeatable outputs (Mainen and Sejnowski 1995). But even these repeatable outputs can vary on the order of 1 millisecond, a quantity that may still have an impact on behaviour (Faisal et al. 2008).

Entropy

In this paper we compute entropy using Shannon’s definition of entropy. This particular choice brings a specific complexity to networks that must be constructed: For every neuron, i, whose activity estimates a probability, $f_{X} (x_{i}) \approx \frac{1}{T} \int_{t - T}^{t} a_{i} (τ) d τ$ , we need to have a population of neurons that compute the function $- f_{X} (x_{i}) log (f_{X} (x_{i})$ . As shown in Fig. 6, while this function is convex, it can be challenging to approximate for small values of $f_{X} (x)$ . To improve the approximation one must increase the number of neurons in the network that stands in for the exact function. Because this requires a population of $N_{log}$ neurons for every query point in the population estimating probability, $N_{prob}$ , the space requirements ( $O (N_{log} N_{prob})$ ) for computing entropy can grow quickly.

The function $f_{X} (x) log f_{X} (x)$ is convex and is itself relatively easy to approximate, for sufficiently large $N_{log}$ , but this nice property vanishes when we consider conditional entropy, where we need to compute the function $f (x, y) = - f_{XY} (x, y) log (f_{Y ∣ X} (y, x))$ , which can go to infinity as $f_{Y ∣ X} (y, x) \to 0$ . Since computer memory is finite, the modeller must trade off the precision in computing entropy and the density with which a distribution’s domain must be sampled.

Furthermore, if one is considering the proposed technique as a hypothesis for the biological computation of entropy, or other quantities like mutual information, the population encoding model implicitly makes strong claims about the brain. First, that the network that computes $- f_{X} (x_{i}) log (f_{X} (x_{i}))$ is repeatedly and reliably constructed. Second, that the network can compute the entropy with high accuracy, meaning it will require a large number of neurons. This may encourage using a temporal smoothing model for computing entropy, but it would be desirable to find more efficient alternative to computing entropy.

A possible solution to these problems is to use Rényi entropy, which is defined $H_{α} (X) = \frac{1}{1 - α} log (\sum_{i = 1}^{n}, p_{i}^{α})$ , where $α \in] 0, 1 [\cup] 1, \infty [$ (Rényi et al. 1961). Rényi entropy generalizes a number of different entropies, and in the limit as $α \to 1$ , we recover Shannon entropy.

Rényi entropy can be used to decrease the complexity of the neural network required to compute entropy, if we make some assumptions. First, we need to compute $p_{i}^{α}$ . In the model presented in this paper we used a ReLU neuron model, and then the firing rate of neurons in probability-computing populations approximates $p_{i}$ . But, we could use neural dynamics to approximate $p_{i}^{α}$ , for a given choice of $α$ . In this case, we would require only one population of neurons to compute the log function once, making the total number of neurons to compute the entropy $N_{prob} + N_{log}$ . However, if another population of neurons is required to compute $p_{i}^{α}$ , then some complexity benefit may be lost.

Any benefits in complexity due to abandoning Shannon entropy come at a cost. When $α \neq 1$ the chain rule of conditional probability, i.e., $H [Y ∣ X] = H [X, Y] - H [X]$ , no longer holds. This causes problems for computing mutual information using the standard formulation. However, Arimoto’s conditional Rényi entropy (Arimoto 1977), Eq. 7, is both monotonic and obeys a weak chain rule (Fehr and Berens 2014), given in Eq. 8.

\begin{matrix} H_{α} [X ∣ Y] = & - \frac{α}{α - 1} log \sum_{y} {(\sum_{x}, P_{XY}, {(x, y)}^{α})}^{\frac{1}{α}} \end{matrix}

\begin{matrix} H_{α} [X ∣ Y] \geq H_{α} [X] - H_{0} [Y] \end{matrix}

These formulations are different from the ones commonly use in information theory, but their existence suggests we may be able to employ similar techniques in information theoretic neural algorithms.

Moving to Rényi entropy reduces the number times the log function needs to be computed, and if the exponent can be computed by a neuron’s activation function, only one additional non-linear computation per population is required, as opposed to one for each sample point, as modelled in this work. Further, because we would be computing the log of a series of sums, we might infer that the likelihood of having to compute the log of small positive values $(< < 1)$ , where an approximation of log will necessarily be worse, should be lower. These benefits warrant further investigation.

Gaps between probability models and behaviour

Beyond the previous neural considerations, there remain open questions about the suitability of modelling cognition probabilistically. As well, it may be unclear how general the approach we have proposed is. In this section, we address both concerns in more detail.

Applicability Across Vector Symbolic Architectures

In this paper we have worked exclusively with circular convolution and continuous representations, and the question stands as to which other VSAs and operators would support this probabilistic interpretation. The primary requirement here would be to permit fractional binding such that the dot product between fractionally bound vectors would induce a (quasi-)kernel. The Fourier Holographic Reduce Representations, used by Frady et al. (2021), is one such VSA.

Using the tensor product for binding, as suggested by Pothos and Busemeyer (2013), would not permit iterated binding, as the memory requirements for representing real numbers would grow exponentially and without bound. Perhaps, if there are other techniques for representing continuous values, the tensor product could be used to represent joint distributions, but then memory would grow with the number of elements in the conjunction. Ignoring continuity, it is conceivable that approximations of continuous distributions could be made, and discrete distributions (histograms) have been represented in VSAs (Joshi et al. 2017). Regardless, the choice of binding operator is a key point of connection between our work and that of quantum probability.

Probability and Behaviour

Beyond the previous neural considerations, there remain open questions about the suitability of modelling cognition probabilistically. This paper serves to outline how certain VSAs may be used to build probabilistic models of cognition, aligning predictions with behaviour remains for future work.

Explaining the mismatch between optimal models and human behaviour is an important aspect of probabilistic cognitive modelling. The question remains open whether brains are imperfectly computing probabilistic models, whether our imposed models do not match the functions brains are trying to compute, or whether brains are (im)perfectly computing non-standard or non-probabilistic models.

The work presented in this paper is one possible way to model uncertainty in neural systems. There are clearly factors in our approach that could cause our models to deviate from ostensibly optimal models, but other approaches to explaining these gaps exist.

Sanborn and Chater (2016) propose that some logical errors arise from inadequately sampling the posterior distribution when using a neural sampling techniques. Pothos and Busemeyer (2013) propose that some logical errors arise from, amongst other things, the difference between the standard implementation of conjunctions and quantum implementations. Sharma et al. (2017) found that the gap between predicted and actual human performance was closed by virtue of implementing the Bayesian update in spiking neurons.

We do not attack the question of model mismatch in this paper, beyond noting that the models we compute are approximations of probability. Given neurons with internal dynamics, latencies may arise in estimating the probability as inputs to networks change, perhaps producing errors similar to what Sanborn and Chater (2016) propose. As previously noted, VSAs are capable of implementing the definitions of conjunction used in quantum probability (Stewart and Eliasmith 2013), so should a quantum model prove desirable, VSAs exist that can accommodate it. As discussed above, order effects from online learning may introduce further deviations from ideal models of behaviour.

We also do not consider the neurological plausibility of interpreting SSP-probability statements as models of biology. Our system is highly regular. For example, all neurons in a population have the same maximum firing rates and biases, that level of uniformity is unlikely in biology. ReLU neurons are not biologically plausible, one would expect to see some degree of saturation. This will effect the accuracy of probabilities estimated. Further closing the gap between the model presented in this paper and biology should yield insights into neural representations of uncertainty.

But the above is motivated by the belief that the brain is imperfectly attempting to compute ideal models. If there is simply a mismatch between the optimal model for a task, and the model that a brain implements, we suggest that the probabilistic understanding of VSAs that we have put forth can prove useful.

Relation to quantum probability

Because our work centres around operations on vectors in a Hilbert space, it is worth drawing comparisons to the quantum probability formulation of probability in cognition. The work we present in this paper differs from the quantum approach by how conjunctions are defined and how values are converted from quasi-probabilities (i.e., values that do not adhere to Kolmolgorov’s axioms) into probabilities.

We define conjunctions through the circular convolution binding operator. Circular convolution commutes and does not increase the dimensionality of the representations involved. Consequently our representations match the formulation of the compatible questions of Pothos and Busemeyer (2013), and do not require unbounded memory resources. It is not obvious how one would represent real-valued data using the standard tensor product.

As discussed in “Quantum probability for cognition” section, quantum probability defines joint distributions through the sequential application of projection matrices, making the implicit assumption that a conjunction in the cognitive state exists in superposition. This technique has successfully explained differences between Bayesian probability models and human behaviour, and representing conjunctions-in-superposition is important to limit the dimensionality of the cognitive state space, but it may be an artefact of the choice of binding operator.

The other aspect of quantum probability that differs from our approach is the conversion from quasi-probability to probability. The quantum formulation uses Born’s rule (Born 1926): that is, the probability of an event is the square of the magnitude of state vector projected into the space representing the event, $q (a) = ‖ M_{A} {S ‖}_{2}^{2}$ . This ensures that all values are non-negative, but not necessarily that values will obey the law of total probability.

In our formulation we bias and rectify the estimated probability, using the method of Glad et al. (2007) in order to both produce non-negative probabilities and to obey the law of total probability. The correction of Glad et al., is arguably more like the behaviour of neurons, which only spike when the input current crosses a certain threshold. That said, should the Born rule be deemed a valuable decoding method for a given application, one can easily construct networks that square values (Gosmann 2015).

Aside from these details, there remain similarities between operations in vector symbolic architectures and quantum probability: mainly that cognitive states can be represented in Hilbert spaces, that events can be stored in superposition, and that probability can be assessed by looking at the similarity between the cognitive state vector and vectors representing individual events. This similarity between quantum probability and vector symbolic architectures has been previously highlighted (Stewart and Eliasmith 2013), but in this work we conducted an investigation of probability and SSPs. Specifically, we investigate distributions over continuous variables and demonstrate how a number of operations within some VSAs are consistent with probability operations.

Regardless of implementation specifics, quantum probability and the model presented in this paper maintain a fundamental mathematical connection through operations in Hilbert spaces. Further investigations into these connections are worthwhile.

Benefits and limitations

There are two major benefits of this framing of bundles of SSPs. First, it lets us interpret SSP representations as probability distributions. Since SSPs have been used to represent neural state spaces, like grid cells (Dumont and Eliasmith 2020) and more complex structures, like trajectories (Voelker et al. 2021), this opens the door to constructing probabilistic models over complex cognitive data. Recognizing probability in these vector representations supports the notion that VSAs enable the description of “soft rules”, getting away from the brittleness of classic symbolic reasoning (Smolensky 1990). We intend to explore modelling the probabilities of mixed integer and real-valued data, as well as more complex structured data, like those discussed in (Eliasmith 2013; Voelker et al. 2021; Frady et al. 2021).

Second, it frames some VSAs as probabilistic neurosymbolic programming languages. If one has probability models of cognition, they could use the techniques discussed in this paper to translate theories to neural circuits. We can then use this to make predictions about observable behaviour. Conversely, we can analyze successful VSA-based models and infer implicit probabilistic models from them. Probabilistic programming techniques like those presented by Goodman et al. (2016) are powerful, and necessary abstractions, but the translation to neural substrates is ultimately important.

Additionally, there may be industrial applications as a language for programming neuromorphic hardware to compute probabilities, possibly in constant time. However, there are some considerations for this approach: The accuracy of kernel approximations induced by similarity may be further limited by hardware, e.g., maximum firing rates, bit precision, and so on.

There are a variety of limitations to the approach proposed in this paper. They revolve around the selection of hyperparameters, computing the memory that represents the dataset, and the difficulty in translating sampling algorithms to this framework.

Like all Kernel Density Estimators, there is a sensitivity to the choice of length scale, h. In the work presented in this paper, we selected length scale parameters by analyzing the training set of data points. If one is simply attempting to model a distribution in neural networks, this may not be a problem. However, a priori selection of the length scale, h, may not be suitable for all models or agents. The initial data set may not be representative of what an agent may expect to encounter in the world, or the distribution that generates observations may drift with time. Similarly, the choice of the bias term, $ξ$ , may need to change as the agent explores the world more.

Rules will need to be determined to update hyperparameters online. Should length scales need to change, then changing the length scales of encoded SSPs should be a linear operation, but the impact of modifying length scales during operations has yet to be determined. Changing $ξ$ may be more challenging, as it requires integrating over the domain of the distribution.

An additional hyperparameter that remained unexplored in this work is the choice of generating distribution in Algorithm 1. We exclusively worked with a uniform distribution, which induces a sinc kernel under the dot product. The choice of distribution effects the choice of kernel, and particular kernels have implications for biology [see (Dumont and Eliasmith 2020)]. The generated kernels can be further augment through binding, bundling, and concatenation of hypervectors, which imply different network architectures. The search for network architectures that produce desirable kernels and probability statements remains an open challenge that could benefit from existing research in neural architecture search.

Similar to the hyperparameters, we took advantage of having entire data sets available to compute the vector that represents the distribution, $M_{X, n}$ . This vector is normalized by the total number of observations, n. Again, this training framework is suitable if one is attempting to encode a specific distribution, but a neural network is unlikely to have explicit access to the value of n. Similarly, organisms are unlikely initially have access to all observations they may encounter in their lifetime. This suggests if we want to understand how distributions are learned, we may need different schemes to encode distributions.

Working with non-normalized probabilities, i.e., $M_{D} = \sum_{x \in D} ϕ_{X} (x / h)$ , is possible, but may saturate neural populations. Avoiding saturation requires some form of forgetting, but what kind of forgetting is best for an application, and what this implies for the probability estimates, remains to be determined. Conversely, saturation itself could provide normalization, as suggested by Eliasmith (2013). A fixed discount factor in the range ]0, 1[, instead of computing the running average, would produce exponential decay. This has the interesting property of inducing something like a temporal kernel, which may produce a recency bias. Similarly, if the distribution vector is being stored in synaptic weights, update rules may introduce temporal decay. It is well understood that the outcome of learning rules is dependent on the order of data presentation. There remain interesting questions to be explored about the interplay of neurally plausible learning rules, SSP representations, and the distributions that are learned.

A final limitation of our approach is the difficulty in generating samples from the learned distribution. Sampling could be useful in, for example, reinforcement learning applications, where actions should be samples from a distribution over state-action pairs. Sampling from standard KDEs is fairly straightforward—randomly select one of the data points, and then sample from the distribution induced by the kernel function centred at that data point. Due to our compressed encoding of the distribution, this is not possible. Algebraically, the solution to this problem is fairly straightforward—for example, one could place a Fisher-von Mises distribution over $M_{D}$ (or an SSP), and sample vectors from that distribution. However, because the space of SSPs is not dense in $R^{d}$ , what we sample will not necessarily be a valid SSP. Consequently, this would require a mechanism to reject invalid samples, and generating valid samples from the distribution may take infeasible numbers of sampling iterations. Research into synaptic sampling (Elliott and Eliasmith 2009; Buesing et al. 2011; Kappel et al. 2015a) and basal ganglia models (Stewart et al. 2010) may provide some benefit, but further investigation is required.

Conclusion

In this paper we have illustrated a connection between continuous representations encoded as SSPs and operations on probability distributions. Specifically, we have sketched novel methods for conditioning distributions and computing entropy and mutual information using circular convolution. VSAs have garnered interest for combining symbolic and connectionist models of cognition, and more recently as a programming framework for neuromorphic computers [e.g., Mundy (2017), Kleyko et al. (2021)]. Open questions remain about best choices for implementation, but the connection between kernel methods and VSAs allows us to naturally bring probabilistic models to cognitive modelling and neuromorphic computing.

At this juncture we are not concerned with strong claims about whether or not the brain is probabilistic, or Bayesian, let alone an optimal computer for probabilities. Rather, the claim we are making is that certain VSAs can act like probabilistic programming languages, ones that we can map to neural structures. If one desires probabilistic models of cognition, then tools are readily available to translate those models to hypotheses about neural structures.

Acknowledgements

The authors would like to thank Nicole Sandra-Yaffa Dumont, Drs. Jeff Orchard, Bryan Tripp, and Terry Stewart for discussions that helped improve this paper. An early version of this work appeared in Furlong and Eliasmith (2022). This work was supported by CFI and OIT infrastructure funding as well as the Canada Research Chairs program, NSERC Discovery grant 261453, NUCC NRC File A-0028850, AFOSR grant FA9550-17-1-0026, and an Intel Neuromorphic Research Community Grant.

Appendix 1 Model complexity analysis

We presented an algebraic interpretation of VSA operations and the results for spiking neural implementations of these algorithms. Next, we present an analysis of the complexity of these networks. We frame in terms of the number of synaptic operations, which would be simple additions in a spiking neural network, or a multiply and accumulate operation in the case of implementation in a graphics processor or general-purpose CPUs. Because we implemented these networks using spiking rectified linear neural networks, we do not account for the complexity of neural dynamics in this analysis. To estimate the complexity, we require the quantities laid out in Table 1. The summary of the analyses is denoted in Big-O notation in Table 2.

Table 1.

Quantities used in computing the complexity of the proposed neural networks

m	The dimensionality of the encoded domain
$m_{marg}$	The number of dimensions being marginalized
d	The dimensionality of the SSP representation
$n_{\log}$	The number of neurons in a population that computes a log function
$n_{ent}$	The number of neurons in a population that computes shannon entropy
$n_{sampling}$	The number of samples along a dimension of the encoded domain. Increasing this quantity reduces aliasing the probability function
$n_{dim}$	The number of neurons to represent one dimension of an SSP. The default value in Nengo is 50

Open in a new tab

Table 2.

A summary of the complexity of the proposed operations in terms of synaptic operations

Operation	Synaptic operations
Probability (single neuron)	O(d)
Probability (sampling)	$O (d {(n_{sampling})}^{m})$
Marginalization (pre-rectification)	$O (d^{2})$
Marginalization (post-rectification)	$O (d {(n_{sampling})}^{m})$
Conditioning	$O (d^{2} n_{prod})$
Entropy	$O ((d + n_{log}) {(n_{sampling})}^{m})$
Mutual Information	$O ((d + n_{log}) {(n_{sampling})}^{2 m})$
Updating	$O (d n_{dim})$

Open in a new tab

Complexities reported here use Big-O notation, the main body of the appendix gives a more detailed analysis. We preserve the term $(d + n_{log})$ to highlight the fact that depending on the VSA dimensionality and the number of neural resources used to compute the log function. In practice, one of these terms will dominate

To produce probability estimates using single neurons we have an input of dimension d, and a neural population size of 1, meaning that estimating the probability of a single observation is d synaptic operations. To estimate a probability density using a population of neurons that represent $n_{sampling}$ sampled points for each input dimension, the number of synaptic operations is $d {(n_{sampling})}^{m}$ .

Pre-rectification marginalization requires one linear mapping in SSP space, which is a $d \times d$ operation. If the SSP is represented as a population of neurons, then we require mapping from the neural population to the SSP latent space, which requires $n_{dim}$ synaptic operations for each SSP dimension, d. Hence, the pre-marginal rectification requires $d^{2} + d n_{dim}$ synaptic operations. Computation of the marginalizing matrix can be computed off-line for each dimension and is not included in the analysis.

Post-rectification marginalization requires first computing a sampled probability distribution, which is $d {(n_{sampling})}^{m}$ , then there must follow the summation over the marginalized dimensions, $m_{marg}$ . This requires, for each point in the unmarginalized dimensions, ${(n_{sampling})}^{m - m_{marg}}$ , computing ${(n_{sampling})}^{m_{marg}}$ sums. This results in a synaptic complexity of $d {(n_{sampling})}^{m} + {(n_{sampling})}^{m}$ .

Conditioning requires computing the binding operator from the HRR VSA, which is circular convolution. In this work we use the default Nengo implementation of binding between two d -dimensional vectors, a and b, which produces a new d-dimensional vector, $c = a ⊛ b$ . The circular convolution is implemented by a series of rotated dot products, defined:

\begin{matrix} c_{i} = \sum_{j = 1}^{d} a_{i} b_{((i - j) mod d)} i \in {1, \dots, d} . \end{matrix}

The multiplication of individual vector elements, $a_{i} b_{((i - j) mod d)}$ , is computed using a product network (Gosmann 2015), which requires $3 n_{prod}$ synaptic operations. Computing the entire circular convolution requires computing d products for all d elements of the vectors, resulting in a complexity of $d^{2} \times 3 n_{prod}$ synaptic operations.

Computing entropy as described in this paper again requires first constructing a sampling of the distribution, which is $O (d {(n_{sampling})}^{m})$ , followed by computing $- p log p$ for every neuron in the distribution, which we implement in a single hidden layer neural network, which contains $n_{log}$ neurons. This requires ${(n_{sampling})}^{m} \times 2 n_{log}$ synaptic operations. This is followed by a population of $n_{ent}$ neurons to represent the sum $\sum_{i} - p_{i} log p_{i}$ . Consequently, the entropy calculation requires $2 n_{ent} {(n_{sampling})}^{m}$ synaptic operations. The function $p log p$ can be difficult to compute, and requires substantial neural resources, so we assume in the worst case that $n_{ent} = n_{log}$ . The total cost to compute the entropy of a distribution is $4 n_{log} {(n_{sampling})}^{m} + O (d {(n_{sampling})}^{m})$ .

To compute mutual information we must first sample the joint probability distribution, which can be over two vector-valued variables. If we have two variables, $X_{1} \in R^{m_{1}}$ and $X_{2} \in R^{m_{2}}$ , we will assume that $m = m a x {m_{1}, m_{2}}$ . Then the initial distribution representation requires $d {(n_{sampling})}^{2 m}$ synaptic operations.

The joint distribution is then marginalized twice, assuming post-rectification for increased accuracy. Exploiting the initial distribution representation permits a complexity of $2 {(n_{sampling})}^{2 m}$ . We then must compute the entropy of the joint and two marginal distributions, which requires $4 n_{log} {(n_{sampling})}^{2 m}$ and $8 n_{log} {(n_{sampling})}^{m}$ , respectively. This results in a final complexity of $(d + 4 n_{log} + 2) {(n_{sampling})}^{2 m} + 8 n_{log} {(n_{sampling})}^{m} .$

Finally, we consider the cost to update an SSP representation of a distribution. To update the distribution we take the neural population whose latent space is representing the distribution using SSPs and project it into the SSP space, this requires $d n_{dim}$ synaptic operations. Then we add it and the new observation together and store them in the neural population, this requires 2d multiplies ( $O (< b^{2})$ for b-bit numbers, depending on the implementation) to compute the running average and $2 d n_{dim}$ synaptic operations to update the population representing the distribution. We note that for more biologically plausible implementations the running average may be replaced by a low pass filter, which has constant multiplication terms that can be integrated into synaptic weights directly.

The above analysis looks at the synaptic operations required for computing the probabilistic operations. This is a measure of resource requirement to construct networks, as well as the total volume of computation that must be executed. However, many of these operations can be parallelized, and on the right computing framework the time between an input being presented to a network to compute these operations can be improved significantly.

Author Contributions

PMF conceived and designed the initial study. CE and PMF discussed and updated the design. Material preparation, data collection and analysis were performed by PMF. The first draft of the manuscript was written by PMF with extensive revision and contribution from CE. CE supervised and administered the project. PMF and CE acquired funding for the project. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding

This work was supported by CFI and OIT infrastructure funding as well as the Canada Research Chairs program, NSERC Discovery Grant 261453, NUCC NRC File A-0028850, and AFOSR Grant FA9550-17-1-0026, and an Intel Neuromorphic Research Community Grant.

Data availability

The data used to generate the figures in this paper is available as jupyter notebooks www.gitlab.com/furlong/vsa-prob.

Code availability

The code used to generate the figures in this paper is available as jupyter notebooks www.gitlab.com/furlong/vsa-prob. Those notebooks additionally depend on code available at www.github.com/ctn-waterloo/ssp-bayesopt.

Declarations

Conflict of interest

Chris Eliasmith has a financial interest in Applied Brain Research, Incorporated, holder of patents related to the material in this paper (patent 62/820,089). P. Michael Furlong has performed consulting services for Applied Brain Research. The company or this cooperation did not affect the authenticity and objectivity of the experimental results of this work. The funders had no role in the direction of this research; in the analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Footnotes

(1) The probability any event is non-negative. (2) The probability of all events is 1. (3) The probability of a set of mutually exclusive events is the sum of their individual probabilities.

Depending on the desired kernel, there are more accurate encodings, see Sutherland and Schneider (2015).

The SPA admits other binding operators, e.g. the Vector-derived transformation binding of Gosmann and Eliasmith (2019).

⁴

In this paper we only denote isotropic kernel approximations, but it is possible to have different length scales, h, for the different dimensions of $x$ . For all examples modelling a 2D Gaussian Mixture Model we fit a length scale for each dimension in the domain of the distribtion.

⁵

The sinc function is not a common choice for a kernel, but it can be demonstrated to be better,in the sense of mean integrated square error, than the Epanechnikov kernel, which is commonly considered to be the “optimal” kernel (Tsybakov 2009, §1.3).

⁶

In this work, the activity of a ReLU neuron is given by $a (t) = a_{max} (W \cdot x (t) + b)$ , where $a_{max} > 0$ is the maximum firing rate of the neuron. To recover probability values, we normalize all computed firing rates by $a_{max}$ , however, we elide that scaling from our notation.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

Agarwal R, Chen Z, Sarma SV (2016) A novel nonparametric maximum likelihood estimator for probability density functions. IEEE Trans Pattern Anal Mach Intelligence 39(7):1294–1308 [DOI] [PubMed] [Google Scholar]
Anastasio TJ, Patton PE, Belkacem-Boussaid K (2000) Using Bayes’ rule to model multisensory enhancement in the superior colliculus. Neural Comput 12(5):1165–1187 [DOI] [PubMed] [Google Scholar]
Anderson CH, Van Essen DC (1994) Neurobiological computational systems. In: Computational intelligence imitating life 213222
Arimoto S (1977) Information measures and capacity of order for discrete memoryless channels. Topics in information theory
Arora A, Furlong PM, Fitch R et al (2019) Multi-modal active perception for information gathering in science missions. Auton Robot 43(7):1827–1853 [Google Scholar]
Barber MJ, Clark JW, Anderson CH (2003) Neural representation of probabilistic information. Neural Comput 15(8):1843–1864 [DOI] [PubMed] [Google Scholar]
Bekolay T, Bergstra J, Hunsberger E et al (2014) Nengo: a python tool for building large-scale functional brain models. Front Neuroinform 7:48 [DOI] [PMC free article] [PubMed] [Google Scholar]
Boerlin M, Denève S (2011) Spike-based population coding and working memory. PLoS Comput Biol 7(2):e1001080 [DOI] [PMC free article] [PubMed] [Google Scholar]
Bogacz R (2015) Optimal decision making in the cortico-basal-ganglia circuit. In: An introduction to model-based cognitive neuroscience. Springer, pp 291–302
Bogacz R, Gurney K (2007) The basal ganglia and cortex implement optimal decision making between alternative actions. Neural Comput 19(2):442–477 [DOI] [PubMed] [Google Scholar]
Bogacz R, Larsen T (2011) Integration of reinforcement learning and optimal decision-making theories of the basal ganglia. Neural Comput 23(4):817–851 [DOI] [PubMed] [Google Scholar]
Born M (1926) Quantenmechanik der stoßvorgänge. Z Phys 38(11):803–827 [Google Scholar]
Buesing L, Bill J, Nessler B et al (2011) Neural dynamics as sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS Comput Biol 7(11):e1002211 [DOI] [PMC free article] [PubMed] [Google Scholar]
Busemeyer JR, Bruza PD (2012) Quantum models of cognition and decision. Cambridge University Press [Google Scholar]
Busemeyer JR, Wang Z, Shiffrin RM (2015) Bayesian model comparison favors quantum over standard decision theory account of dynamic inconsistency. Decision 2(1):1 [Google Scholar]
Chacón J, Montanero J, Nogales A et al (2007) On the existence and limit behavior of the optimal bandwidth for kernel density estimation. Stat Sin 17(1):289–300 [Google Scholar]
Chater N, Oaksford M (2008) The probabilistic mind: prospects for Bayesian cognitive science. Oxford University Press, USA [Google Scholar]
Choo X, Eliasmith C (2010) A spiking neuron model of serial-order recall. In: Cattrambone R, Ohlsson S (eds) 32nd Annual conference of the cognitive science society. Cognitive Science Society, Portland, OR
Darlington TR, Beck JM, Lisberger SG (2018) Neural implementation of Bayesian inference in a sensorimotor behavior. Nat Neurosci 21(10):1442–1451 [DOI] [PMC free article] [PubMed] [Google Scholar]
Davis KB (1975) Mean square error properties of density estimates. Ann Stat 3:1025–1030 [Google Scholar]
Davis KB (1977) Mean integrated square error properties of density estimates. Ann Stat 5:530–535 [Google Scholar]
Deneve S (2008) Bayesian spiking neurons I: inference. Neural Comput 20(1):91–117 [DOI] [PubMed] [Google Scholar]
Doya K (2021) Canonical cortical circuits and the duality of Bayesian inference and optimal control. Curr Opin Behav Sci 41:160–167 [Google Scholar]
Doya K, Ishii S, Pouget A et al (2007) Bayesian brain: probabilistic approaches to neural coding. MIT Press [Google Scholar]
Dumont N, Eliasmith C (2020) Accurate representation for spatial cognition using grid cells. In: CogSci
Echeveste R, Aitchison L, Hennequin G et al (2020) Cortical-like dynamics in recurrent circuits optimized for sampling-based probabilistic inference. Nat Neurosci 23(9):1138–1149 [DOI] [PMC free article] [PubMed] [Google Scholar]
Eliasmith C (2013) How to build a brain: a neural architecture for biological cognition. Oxford University Press [Google Scholar]
Eliasmith C, Anderson CH (2003) Neural engineering: computation, representation, and dynamics in neurobiological systems. MIT Press, Berlin [Google Scholar]
Eliasmith C, Stewart TC, Choo X et al (2012) A large-scale model of the functioning brain. Science 338(6111):1202–1205 [DOI] [PubMed] [Google Scholar]
Elliott L, Eliasmith C (2009) MCMC with spiking neurons. In: NIPS workshop on Bayesian inference in the brain
Faisal AA, Selen LP, Wolpert DM (2008) Noise in the nervous system. Nat Rev Neurosci 9(4):292–303 [DOI] [PMC free article] [PubMed] [Google Scholar]
Fehr S, Berens S (2014) On the conditional Rényi entropy. IEEE Trans Inf Theory 60(11):6801–6810 [Google Scholar]
Frady EP, Kleyko D, Kymn CJ, et al (2021) Computing on functions using randomized vector representations. arXiv preprint arXiv:2109.03429
Furlong PM, Eliasmith C (2022) Fractional binding in vector symbolic architectures as quasi-probability statements. In: Proceedings of the annual meeting of the cognitive science society
Gayler RW (2004) Vector symbolic architectures answer Jackendoff’s challenges for cognitive neuroscience. arXiv preprint cs/0412059
Glad IK, Hjort NL, Ushakov NG (2003) Correction of density estimators that are not densities. Scand J Stat 30(2):415–427 [Google Scholar]
Glad IK, Hjort NL, Ushakov N (2007) Density estimation using the sinc kernel. Preprint Statistics, vol 2, p 2007
Goodman ND, Tenenbaum JB, Contributors TP (2016) Probabilistic models of cognition. http://probmods.org/v2. Accessed 23 Jan 2023
Gosmann J (2015) Precise multiplications with the NEF. Tech. rep, Centre for Theoretical Neuroscience, Waterloo, ON
Gosmann J, Eliasmith C (2019) Vector-derived transformation binding: an improved binding operation for deep symbol-like processing in neural networks. Neural Comput 31(5):849–869. 10.1162/neco_a_01179 [DOI] [PubMed] [Google Scholar]
Hou H, Zheng Q, Zhao Y et al (2019) Neural correlates of optimal multisensory decision making under time-varying reliabilities with an invariant linear probabilistic population code. Neuron 104(5):1010–1021 [DOI] [PubMed] [Google Scholar]
Hoyer P, Hyvärinen A (2002) Interpreting neural response variability as Monte Carlo sampling of the posterior. In: Advances in neural information processing systems, vol 15
Huang Y, Rao RP (2014) Neurons as Monte Carlo samplers: Bayesian inference and learning in spiking networks. In: Advances in neural information processing systems, vol 27
Joshi A, Halseth JT, Kanerva P (2017) Language geometry using random indexing. In: Quantum interaction: 10th international conference, QI 2016, San Francisco, CA, USA, July 20–22, 2016, Revised Selected Papers 10. Springer, pp 265–274
Kanerva P (1996) Binary spatter-coding of ordered k-tuples. In: Artificial neural networks-ICANN 96: 1996 international conference Bochum, Germany, July 16–19, 1996 Proceedings 6. Springer, pp 869–873
Kanerva P (2009) Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognit Comput 1:139–159 [Google Scholar]
Kappel D, Habenschuss S, Legenstein R et al (2015a) Network plasticity as Bayesian inference. PLoS Comput Biol 11(11):e1004485 [DOI] [PMC free article] [PubMed] [Google Scholar]
Kappel D, Habenschuss S, Legenstein R et al (2015b) Synaptic sampling: a Bayesian approach to neural network plasticity and rewiring. Adv Neural Inf Process Syst 28:370–378 [Google Scholar]
Kleyko D, Davies M, Frady EP, et al (2021) Vector symbolic architectures as a computing framework for nanoscale hardware. arXiv preprint arXiv:2106.05268 [PMC free article] [PubMed]
Kleyko D, Davies M, Frady EP et al (2022) Vector symbolic architectures as a computing framework for emerging hardware. Proc IEEE 110(10):1538–1571 [PMC free article] [PubMed] [Google Scholar]
Komer B (2020) Biologically inspired spatial representation. PhD thesis, University of Waterloo
Korcsak-Gorzo A, Müller MG, Baumbach A et al (2022) Cortical oscillations support sampling-based computations in spiking neural networks. PLoS Comput Biol 18(3):e1009753 [DOI] [PMC free article] [PubMed] [Google Scholar]
Krause A, Singh A, Guestrin C (2008) Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. J Mach Learn Res 9(2):235 [Google Scholar]
Lake BM, Salakhutdinov R, Tenenbaum JB (2015) Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338 [DOI] [PubMed] [Google Scholar]
Levy SD, Gayler R (2008) Vector symbolic architectures: a new building material for artificial general intelligence. In: Proceedings of the 2008 conference on artificial general intelligence 2008: proceedings of the first AGI conference, pp 414–418
Loredo T (2003) Bayesian adaptive exploration in a nutshell. Stat Probl Particle Phys Astrophys Cosmol 1:162 [Google Scholar]
Ma WJ, Beck JM, Latham PE et al (2006) Bayesian inference with probabilistic population codes. Nat Neurosci 9(11):1432–1438 [DOI] [PubMed] [Google Scholar]
Ma WJ, Beck JM, Pouget A (2008) Spiking networks for Bayesian inference and choice. Curr Opin Neurobiol 18(2):217–222 [DOI] [PubMed] [Google Scholar]
Mainen ZF, Sejnowski TJ (1995) Reliability of spike timing in neocortical neurons. Science 268(5216):1503–1506 [DOI] [PubMed] [Google Scholar]
Masset P, Zavatone-Veth J, Connor JP et al (2022) Natural gradient enables fast sampling in spiking neural networks. Adv Neural Inf Process Syst 35:22018–22034 [PMC free article] [PubMed] [Google Scholar]
Mundy A (2017) Real time Spaun on SpiNNaker functional brain simulation on a massively-parallel computer architecture. The University of Manchester (United Kingdom)
Mutnỳ M, Krause A (2019) Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. Adv Neural Inf Process Syst 31:9005–9016 [Google Scholar]
Neubert P, Schubert S, Protzel P (2019) An introduction to hyperdimensional computing for robotics. KI-Künstl Intell 33(4):319–330 [Google Scholar]
Plate TA (1992) Holographic recurrent networks. In: Advances in neural information processing systems, vol 5
Plate TA (1994) Distributed representations and nested compositional structure. University of Toronto, Department of Computer Science
Plate TA (1995) Holographic reduced representations. IEEE Trans Neural Netw 6(3):623–641 [DOI] [PubMed] [Google Scholar]
Plate TA (2003) Holographic reduced representation: distributed representation for cognitive structures. CSLI Publications, Stanford [Google Scholar]
Pothos EM, Busemeyer JR (2013) Can quantum probability provide a new direction for cognitive modeling? Behav Brain Sci 36(3):255–274 [DOI] [PubMed] [Google Scholar]
Pothos EM, Busemeyer JR (2022) Quantum cognition. Annu Rev Psychol 73:749–778 [DOI] [PubMed] [Google Scholar]
Pouget A, Dayan P, Zemel RS (2003) Inference and computation with population codes. Annu Rev Neurosci 26(1):381–410 [DOI] [PubMed] [Google Scholar]
Rahimi A, Recht B, et al (2007) Random features for large-scale kernel machines. In: NIPS, Citeseer, p 5
Rao RP (2004) Bayesian computation in recurrent neural circuits. Neural Comput 16(1):1–38 [DOI] [PubMed] [Google Scholar]
Rényi A, et al (1961) On measures of entropy and information. In: Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, Berkeley, California, USA
Rosenblatt M (1969) Conditional probability density and regression estimators. Multivar Anal II 25:31 [Google Scholar]
Rule JS, Piantadosi S, Tenenbaum J (2022) Learning as programming: modeling efficient search in human concept learning. In: Proceedings of the annual meeting of the cognitive science society
Salinas E, Abbott L (1994) Vector reconstruction from firing rates. J Comput Neurosci 1(1–2):89–107 [DOI] [PubMed] [Google Scholar]
Sanborn AN, Chater N (2016) Bayesian brains without probabilities. Trends Cognit Sci 20(12):883–893 [DOI] [PubMed] [Google Scholar]
Savin C, Denève S (2014) Spatio-temporal representations of uncertainty in spiking neural networks. In: Advances in neural information processing systems, vol 27
Schlegel K, Neubert P, Protzel P (2020) A comparison of vector symbolic architectures. arXiv preprint arXiv:2001.11797
Schneider M (2017) Expected similarity estimation for large-scale anomaly detection. PhD thesis, Universität Ulm
Schneider M, Ertel W, Ramos F (2016) Expected similarity estimation for large-scale batch and streaming anomaly detection. Mach Learn 105(3):305–333 [Google Scholar]
Sharma S (2018) Neural plausibility of Bayesian inference. Master’s thesis, University of Waterloo
Sharma S, Voelker A, Eliasmith C (2017) A spiking neural Bayesian model of life span inference. In: CogSci
Smolensky P (1990) Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artif Intell 46(1–2):159–216 [Google Scholar]
Stewart TC, Eliasmith C (2013) Realistic neurons can compute the operations needed by quantum probability theory and other vector symbolic architectures. Behav Brain Sci 36(3):307 [DOI] [PubMed] [Google Scholar]
Stewart TC, Choo X, Eliasmith C, et al (2010) Dynamic behaviour of a spiking model of action selection in the basal ganglia. In: Proceedings of the 10th international conference on cognitive modeling, Citeseer, pp 235–40
Sutherland DJ, Schneider J (2015) On the error of random Fourier features. arXiv preprint arXiv:1506.02785
Tsybakov AB (2009) Introduction to nonparametric estimation. Springer [Google Scholar]
Voelker AR (2020) A short letter on the dot product between rotated Fourier transforms. arXiv preprint arXiv:2007.13462
Voelker AR, Blouw P, Choo X et al (2021) Simulating and predicting dynamical systems with spatial semantic pointers. Neural Comput 33(8):2033–2067 [DOI] [PubMed] [Google Scholar]
Walker EY, Cotton RJ, Ma WJ et al (2020) A neural basis of probabilistic computation in visual cortex. Nat Neurosci 23(1):122–129 [DOI] [PubMed] [Google Scholar]
Wand MP, Jones M (1995) Kernel smoothing. In: Monographs on statistics and applied probability; 060, 1st edn., Chapman & Hall, London
Xu K, Srivastava A, Gutfreund D et al (2021) A Bayesian-symbolic approach to reasoning and learning in intuitive physics. Adv Neural Inf Process Syst 34:2478–2490 [Google Scholar]
Zemel R, Dayan P, Pouget A (1996) Probabilistic interpretation of population codes. In: Advances in neural information processing systems, vol 9

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The data used to generate the figures in this paper is available as jupyter notebooks www.gitlab.com/furlong/vsa-prob.

[CR1] Agarwal R, Chen Z, Sarma SV (2016) A novel nonparametric maximum likelihood estimator for probability density functions. IEEE Trans Pattern Anal Mach Intelligence 39(7):1294–1308 [DOI] [PubMed] [Google Scholar]

[CR2] Anastasio TJ, Patton PE, Belkacem-Boussaid K (2000) Using Bayes’ rule to model multisensory enhancement in the superior colliculus. Neural Comput 12(5):1165–1187 [DOI] [PubMed] [Google Scholar]

[CR3] Anderson CH, Van Essen DC (1994) Neurobiological computational systems. In: Computational intelligence imitating life 213222

[CR4] Arimoto S (1977) Information measures and capacity of order for discrete memoryless channels. Topics in information theory

[CR5] Arora A, Furlong PM, Fitch R et al (2019) Multi-modal active perception for information gathering in science missions. Auton Robot 43(7):1827–1853 [Google Scholar]

[CR6] Barber MJ, Clark JW, Anderson CH (2003) Neural representation of probabilistic information. Neural Comput 15(8):1843–1864 [DOI] [PubMed] [Google Scholar]

[CR7] Bekolay T, Bergstra J, Hunsberger E et al (2014) Nengo: a python tool for building large-scale functional brain models. Front Neuroinform 7:48 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] Boerlin M, Denève S (2011) Spike-based population coding and working memory. PLoS Comput Biol 7(2):e1001080 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] Bogacz R (2015) Optimal decision making in the cortico-basal-ganglia circuit. In: An introduction to model-based cognitive neuroscience. Springer, pp 291–302

[CR10] Bogacz R, Gurney K (2007) The basal ganglia and cortex implement optimal decision making between alternative actions. Neural Comput 19(2):442–477 [DOI] [PubMed] [Google Scholar]

[CR11] Bogacz R, Larsen T (2011) Integration of reinforcement learning and optimal decision-making theories of the basal ganglia. Neural Comput 23(4):817–851 [DOI] [PubMed] [Google Scholar]

[CR12] Born M (1926) Quantenmechanik der stoßvorgänge. Z Phys 38(11):803–827 [Google Scholar]

[CR13] Buesing L, Bill J, Nessler B et al (2011) Neural dynamics as sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS Comput Biol 7(11):e1002211 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] Busemeyer JR, Bruza PD (2012) Quantum models of cognition and decision. Cambridge University Press [Google Scholar]

[CR15] Busemeyer JR, Wang Z, Shiffrin RM (2015) Bayesian model comparison favors quantum over standard decision theory account of dynamic inconsistency. Decision 2(1):1 [Google Scholar]

[CR16] Chacón J, Montanero J, Nogales A et al (2007) On the existence and limit behavior of the optimal bandwidth for kernel density estimation. Stat Sin 17(1):289–300 [Google Scholar]

[CR17] Chater N, Oaksford M (2008) The probabilistic mind: prospects for Bayesian cognitive science. Oxford University Press, USA [Google Scholar]

[CR18] Choo X, Eliasmith C (2010) A spiking neuron model of serial-order recall. In: Cattrambone R, Ohlsson S (eds) 32nd Annual conference of the cognitive science society. Cognitive Science Society, Portland, OR

[CR19] Darlington TR, Beck JM, Lisberger SG (2018) Neural implementation of Bayesian inference in a sensorimotor behavior. Nat Neurosci 21(10):1442–1451 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] Davis KB (1975) Mean square error properties of density estimates. Ann Stat 3:1025–1030 [Google Scholar]

[CR21] Davis KB (1977) Mean integrated square error properties of density estimates. Ann Stat 5:530–535 [Google Scholar]

[CR22] Deneve S (2008) Bayesian spiking neurons I: inference. Neural Comput 20(1):91–117 [DOI] [PubMed] [Google Scholar]

[CR23] Doya K (2021) Canonical cortical circuits and the duality of Bayesian inference and optimal control. Curr Opin Behav Sci 41:160–167 [Google Scholar]

[CR24] Doya K, Ishii S, Pouget A et al (2007) Bayesian brain: probabilistic approaches to neural coding. MIT Press [Google Scholar]

[CR25] Dumont N, Eliasmith C (2020) Accurate representation for spatial cognition using grid cells. In: CogSci

[CR26] Echeveste R, Aitchison L, Hennequin G et al (2020) Cortical-like dynamics in recurrent circuits optimized for sampling-based probabilistic inference. Nat Neurosci 23(9):1138–1149 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR27] Eliasmith C (2013) How to build a brain: a neural architecture for biological cognition. Oxford University Press [Google Scholar]

[CR28] Eliasmith C, Anderson CH (2003) Neural engineering: computation, representation, and dynamics in neurobiological systems. MIT Press, Berlin [Google Scholar]

[CR29] Eliasmith C, Stewart TC, Choo X et al (2012) A large-scale model of the functioning brain. Science 338(6111):1202–1205 [DOI] [PubMed] [Google Scholar]

[CR30] Elliott L, Eliasmith C (2009) MCMC with spiking neurons. In: NIPS workshop on Bayesian inference in the brain

[CR31] Faisal AA, Selen LP, Wolpert DM (2008) Noise in the nervous system. Nat Rev Neurosci 9(4):292–303 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] Fehr S, Berens S (2014) On the conditional Rényi entropy. IEEE Trans Inf Theory 60(11):6801–6810 [Google Scholar]

[CR33] Frady EP, Kleyko D, Kymn CJ, et al (2021) Computing on functions using randomized vector representations. arXiv preprint arXiv:2109.03429

[CR34] Furlong PM, Eliasmith C (2022) Fractional binding in vector symbolic architectures as quasi-probability statements. In: Proceedings of the annual meeting of the cognitive science society

[CR35] Gayler RW (2004) Vector symbolic architectures answer Jackendoff’s challenges for cognitive neuroscience. arXiv preprint cs/0412059

[CR36] Glad IK, Hjort NL, Ushakov NG (2003) Correction of density estimators that are not densities. Scand J Stat 30(2):415–427 [Google Scholar]

[CR37] Glad IK, Hjort NL, Ushakov N (2007) Density estimation using the sinc kernel. Preprint Statistics, vol 2, p 2007

[CR38] Goodman ND, Tenenbaum JB, Contributors TP (2016) Probabilistic models of cognition. http://probmods.org/v2. Accessed 23 Jan 2023

[CR39] Gosmann J (2015) Precise multiplications with the NEF. Tech. rep, Centre for Theoretical Neuroscience, Waterloo, ON

[CR40] Gosmann J, Eliasmith C (2019) Vector-derived transformation binding: an improved binding operation for deep symbol-like processing in neural networks. Neural Comput 31(5):849–869. 10.1162/neco_a_01179 [DOI] [PubMed] [Google Scholar]

[CR41] Hou H, Zheng Q, Zhao Y et al (2019) Neural correlates of optimal multisensory decision making under time-varying reliabilities with an invariant linear probabilistic population code. Neuron 104(5):1010–1021 [DOI] [PubMed] [Google Scholar]

[CR42] Hoyer P, Hyvärinen A (2002) Interpreting neural response variability as Monte Carlo sampling of the posterior. In: Advances in neural information processing systems, vol 15

[CR43] Huang Y, Rao RP (2014) Neurons as Monte Carlo samplers: Bayesian inference and learning in spiking networks. In: Advances in neural information processing systems, vol 27

[CR44] Joshi A, Halseth JT, Kanerva P (2017) Language geometry using random indexing. In: Quantum interaction: 10th international conference, QI 2016, San Francisco, CA, USA, July 20–22, 2016, Revised Selected Papers 10. Springer, pp 265–274

[CR45] Kanerva P (1996) Binary spatter-coding of ordered k-tuples. In: Artificial neural networks-ICANN 96: 1996 international conference Bochum, Germany, July 16–19, 1996 Proceedings 6. Springer, pp 869–873

[CR46] Kanerva P (2009) Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognit Comput 1:139–159 [Google Scholar]

[CR47] Kappel D, Habenschuss S, Legenstein R et al (2015a) Network plasticity as Bayesian inference. PLoS Comput Biol 11(11):e1004485 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR48] Kappel D, Habenschuss S, Legenstein R et al (2015b) Synaptic sampling: a Bayesian approach to neural network plasticity and rewiring. Adv Neural Inf Process Syst 28:370–378 [Google Scholar]

[CR49] Kleyko D, Davies M, Frady EP, et al (2021) Vector symbolic architectures as a computing framework for nanoscale hardware. arXiv preprint arXiv:2106.05268 [PMC free article] [PubMed]

[CR50] Kleyko D, Davies M, Frady EP et al (2022) Vector symbolic architectures as a computing framework for emerging hardware. Proc IEEE 110(10):1538–1571 [PMC free article] [PubMed] [Google Scholar]

[CR51] Komer B (2020) Biologically inspired spatial representation. PhD thesis, University of Waterloo

[CR52] Korcsak-Gorzo A, Müller MG, Baumbach A et al (2022) Cortical oscillations support sampling-based computations in spiking neural networks. PLoS Comput Biol 18(3):e1009753 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR53] Krause A, Singh A, Guestrin C (2008) Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. J Mach Learn Res 9(2):235 [Google Scholar]

[CR54] Lake BM, Salakhutdinov R, Tenenbaum JB (2015) Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338 [DOI] [PubMed] [Google Scholar]

[CR55] Levy SD, Gayler R (2008) Vector symbolic architectures: a new building material for artificial general intelligence. In: Proceedings of the 2008 conference on artificial general intelligence 2008: proceedings of the first AGI conference, pp 414–418

[CR56] Loredo T (2003) Bayesian adaptive exploration in a nutshell. Stat Probl Particle Phys Astrophys Cosmol 1:162 [Google Scholar]

[CR57] Ma WJ, Beck JM, Latham PE et al (2006) Bayesian inference with probabilistic population codes. Nat Neurosci 9(11):1432–1438 [DOI] [PubMed] [Google Scholar]

[CR58] Ma WJ, Beck JM, Pouget A (2008) Spiking networks for Bayesian inference and choice. Curr Opin Neurobiol 18(2):217–222 [DOI] [PubMed] [Google Scholar]

[CR59] Mainen ZF, Sejnowski TJ (1995) Reliability of spike timing in neocortical neurons. Science 268(5216):1503–1506 [DOI] [PubMed] [Google Scholar]

[CR60] Masset P, Zavatone-Veth J, Connor JP et al (2022) Natural gradient enables fast sampling in spiking neural networks. Adv Neural Inf Process Syst 35:22018–22034 [PMC free article] [PubMed] [Google Scholar]

[CR61] Mundy A (2017) Real time Spaun on SpiNNaker functional brain simulation on a massively-parallel computer architecture. The University of Manchester (United Kingdom)

[CR62] Mutnỳ M, Krause A (2019) Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. Adv Neural Inf Process Syst 31:9005–9016 [Google Scholar]

[CR63] Neubert P, Schubert S, Protzel P (2019) An introduction to hyperdimensional computing for robotics. KI-Künstl Intell 33(4):319–330 [Google Scholar]

[CR64] Plate TA (1992) Holographic recurrent networks. In: Advances in neural information processing systems, vol 5

[CR65] Plate TA (1994) Distributed representations and nested compositional structure. University of Toronto, Department of Computer Science

[CR66] Plate TA (1995) Holographic reduced representations. IEEE Trans Neural Netw 6(3):623–641 [DOI] [PubMed] [Google Scholar]

[CR67] Plate TA (2003) Holographic reduced representation: distributed representation for cognitive structures. CSLI Publications, Stanford [Google Scholar]

[CR68] Pothos EM, Busemeyer JR (2013) Can quantum probability provide a new direction for cognitive modeling? Behav Brain Sci 36(3):255–274 [DOI] [PubMed] [Google Scholar]

[CR69] Pothos EM, Busemeyer JR (2022) Quantum cognition. Annu Rev Psychol 73:749–778 [DOI] [PubMed] [Google Scholar]

[CR70] Pouget A, Dayan P, Zemel RS (2003) Inference and computation with population codes. Annu Rev Neurosci 26(1):381–410 [DOI] [PubMed] [Google Scholar]

[CR71] Rahimi A, Recht B, et al (2007) Random features for large-scale kernel machines. In: NIPS, Citeseer, p 5

[CR72] Rao RP (2004) Bayesian computation in recurrent neural circuits. Neural Comput 16(1):1–38 [DOI] [PubMed] [Google Scholar]

[CR73] Rényi A, et al (1961) On measures of entropy and information. In: Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, Berkeley, California, USA

[CR74] Rosenblatt M (1969) Conditional probability density and regression estimators. Multivar Anal II 25:31 [Google Scholar]

[CR75] Rule JS, Piantadosi S, Tenenbaum J (2022) Learning as programming: modeling efficient search in human concept learning. In: Proceedings of the annual meeting of the cognitive science society

[CR76] Salinas E, Abbott L (1994) Vector reconstruction from firing rates. J Comput Neurosci 1(1–2):89–107 [DOI] [PubMed] [Google Scholar]

[CR77] Sanborn AN, Chater N (2016) Bayesian brains without probabilities. Trends Cognit Sci 20(12):883–893 [DOI] [PubMed] [Google Scholar]

[CR78] Savin C, Denève S (2014) Spatio-temporal representations of uncertainty in spiking neural networks. In: Advances in neural information processing systems, vol 27

[CR79] Schlegel K, Neubert P, Protzel P (2020) A comparison of vector symbolic architectures. arXiv preprint arXiv:2001.11797

[CR80] Schneider M (2017) Expected similarity estimation for large-scale anomaly detection. PhD thesis, Universität Ulm

[CR81] Schneider M, Ertel W, Ramos F (2016) Expected similarity estimation for large-scale batch and streaming anomaly detection. Mach Learn 105(3):305–333 [Google Scholar]

[CR82] Sharma S (2018) Neural plausibility of Bayesian inference. Master’s thesis, University of Waterloo

[CR83] Sharma S, Voelker A, Eliasmith C (2017) A spiking neural Bayesian model of life span inference. In: CogSci

[CR84] Smolensky P (1990) Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artif Intell 46(1–2):159–216 [Google Scholar]

[CR85] Stewart TC, Eliasmith C (2013) Realistic neurons can compute the operations needed by quantum probability theory and other vector symbolic architectures. Behav Brain Sci 36(3):307 [DOI] [PubMed] [Google Scholar]

[CR86] Stewart TC, Choo X, Eliasmith C, et al (2010) Dynamic behaviour of a spiking model of action selection in the basal ganglia. In: Proceedings of the 10th international conference on cognitive modeling, Citeseer, pp 235–40

[CR87] Sutherland DJ, Schneider J (2015) On the error of random Fourier features. arXiv preprint arXiv:1506.02785

[CR88] Tsybakov AB (2009) Introduction to nonparametric estimation. Springer [Google Scholar]

[CR89] Voelker AR (2020) A short letter on the dot product between rotated Fourier transforms. arXiv preprint arXiv:2007.13462

[CR90] Voelker AR, Blouw P, Choo X et al (2021) Simulating and predicting dynamical systems with spatial semantic pointers. Neural Comput 33(8):2033–2067 [DOI] [PubMed] [Google Scholar]

[CR91] Walker EY, Cotton RJ, Ma WJ et al (2020) A neural basis of probabilistic computation in visual cortex. Nat Neurosci 23(1):122–129 [DOI] [PubMed] [Google Scholar]

[CR92] Wand MP, Jones M (1995) Kernel smoothing. In: Monographs on statistics and applied probability; 060, 1st edn., Chapman & Hall, London

[CR93] Xu K, Srivastava A, Gutfreund D et al (2021) A Bayesian-symbolic approach to reasoning and learning in intuitive physics. Adv Neural Inf Process Syst 34:2478–2490 [Google Scholar]

[CR94] Zemel R, Dayan P, Pouget A (1996) Probabilistic interpretation of population codes. In: Advances in neural information processing systems, vol 9

PERMALINK

Modelling neural probabilistic computation using vector symbolic architectures

P Michael Furlong

Chris Eliasmith

Abstract

Introduction

Background

Neural representations of probability

Cognitive representations of probability

Quantum probability for cognition

Preliminaries

Kernel density estimators and random Fourier features

Vector symbolic architectures

Analogies to probability operations

Binding encodes data

Similarity computes probability

Fig. 1.

Bundling updates beliefs

Unbinding is analogous to conditioning

Fig. 2.

Fig. 3.

Fig. 4.

Other operations

Marginalization

Fig. 5.

Entropy

Fig. 6.

Mutual information

Fig. 7.

Fig. 8.

Discussion

Considerations for neural mapping

Gaps between probability models and behaviour

Relation to quantum probability

Benefits and limitations

Conclusion

Acknowledgements

Appendix 1 Model complexity analysis

Table 1.

Table 2.

Author Contributions

Funding

Data availability

Code availability

Declarations

Conflict of interest

Ethics approval

Consent to participate

Footnotes

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases