Synaptic plasticity as Bayesian inference

Laurence Aitchison; Jannes Jegminat; Jorge Aurelio Menendez; Jean-Pascal Pfister; Alex Pouget; Peter E Latham

doi:10.1038/s41593-021-00809-5

. Author manuscript; available in PMC: 2024 Dec 7.

Published in final edited form as: Nat Neurosci. 2021 Mar 11;24(4):565–571. doi: 10.1038/s41593-021-00809-5

Synaptic plasticity as Bayesian inference

Laurence Aitchison ^1,^2,^*, Jannes Jegminat ^3,⁴, Jorge Aurelio Menendez ^1,⁵, Jean-Pascal Pfister ^3,⁴, Alex Pouget ^1,⁶, Peter E Latham ¹

PMCID: PMC7617048 EMSID: EMS114909 PMID: 33707754

Abstract

Learning, especially rapid learning, is critical for survival. However, learning is hard: a large number of synaptic weights must be set based on noisy, often ambiguous, sensory information. In such a high-noise regime, keeping track of probability distributions over weights is the optimal strategy. Here we hypothesize that synapses take that strategy; in essence, when they estimate weights, they include error bars. They then use that uncertainty to adjust their learning rates, with more uncertain weights having higher learning rates. We also make a second, independent, hypothesis: synapses communicate their uncertainty by linking it to variability in PSP size, with more uncertainty leading to more variability. These two hypotheses cast synaptic plasticity as a problem of Bayesian inference, and thus provide a normative view of learning. They generalize known learning rules, offer an explanation for the large variability in the size of post-synaptic potentials, and make falsifiable experimental predictions.

1. Introduction

To survive, animals must accurately estimate the state of the world. This estimation problem is plagued by uncertainty: not only is information often extremely limited (e.g., because it is dark) or ambiguous (e.g., a rustle in the bushes could be the wind, or it could be a predator), but sensory receptors, and indeed all neural circuits, are noisy. Historically, models of neural computation ignored this uncertainty, and relied instead on the idea that the nervous system estimates values of quantities in the world, but does not include error bars [1]. However, this does not seem to be what animals do — not only does ignoring uncertainty lead to suboptimal decisions, it is inconsistent with a large body of experimental work [2, 3]. Thus, the current view is that in many, if not most, cases, animals keep track of uncertainty, and use it to guide their decisions [3].

Accurately estimating the state of the world is just one problem faced by animals. They also need to learn, and in particular they need to leverage their past experience. It is believed that learning primarily involves changing synaptic weights. But estimating the correct weights, like estimating the state of the world, is plagued by uncertainty: not only is the information available to synapses often extremely limited (in many cases just pre and post synaptic activity), but that information is highly unreliable. Historically, models of synaptic plasticity ignored this uncertainty, and assumed that synapses do not include error bars when they estimate their weights. However, uncertainty is important for optimal learning — just as it is important for optimal inference of the state of the world.

Motivated by these observations, we propose two hypotheses. The first, Bayesian Plasticity (so named because it is derived using Bayes' rule), states that during learning, synapses do indeed take uncertainty into account. Under this hypothesis, synapses do not just estimate what their weight should be, they also include error bars. This allows synapses to adjust their learning rates on the fly: when uncertainty is high learning rates are turned up, and when uncertainty is low learning rates are turned down. We show that these adjustments allow synapses to learn faster, so there is likely to be considerable evolutionary pressure for such a mechanism. And indeed, the same principle has recently been shown to recover state-of-the-art adaptive optimization algorithms for artificial neural networks [4].

Bayesian Plasticity is a hypothesis about what synapses compute. It does not, however, tell synapses how to set their weights. For that a second hypothesis is needed. Here we propose that weights are sampled from the probability distribution describing the synapse’s degree of uncertainty. Under this hypothesis, which we refer to as Synaptic Sampling, trial to trial variability provides a readout of uncertainty: the larger the trial to trial variability in synaptic strength, the larger the uncertainty. Synaptic Sampling is motivated by the observation that the uncertainty associated with a particular computation should depend on the uncertainty in the weights. Thus, to make optimal decisions, the brain needs to know something about the uncertainty; one way for synapses to communicate that is via variability in the postsynaptic potential (PSP) amplitude (see Supplementary Math Note, Sec. S5, for an extended discussion).

Combined, these two hypotheses make several strong experimental predictions. As we discuss below, one is consistent with re-analysis of existing experimental data; the others, which are feasible in the not so distant future, could falsify one or both hypotheses. We begin by analyzing the first hypothesis, that synapses keep track of their uncertainty (Bayesian Plasticity); after that we discuss our second hypothesis, that synapses sample from the resulting distribution (Synaptic Sampling).

2. Results

Under Bayesian Plasticity, each synapse computes its mean and variance, and updates both based on the pattern of presynaptic spikes. In analogy to classical learning rules, the update rule for the mean pushes it in a direction that reduces a cost function. But in contrast to classical learning rules, the amount the mean changes depends on the uncertainty: the higher the uncertainty, as measured by the variance, the larger the change in the mean. The variance thus sets the learning rate, as shown in Fig. 1. In essence, there is a rule for computing the learning rate of each synapse.

The delta rule is suboptimal. The error bars denote uncertainty (measured by the standard deviation around the mean) in two synapses' estimates of their target weights, w_tar,1 and w_tar,2. The first is reasonably certain; the second less so. The red arrows denote possible changes in response to a negative feedback signal. The arrow labeled “delta rule” represents an equal decrease in the first and second target weights. In contrast, the arrow labeled “optimal” takes uncertainty into account, so there is a larger change in the second, more uncertain, target weight.

To illustrate these ideas, we consider a model of synaptic integration in which postsynaptic potentials combine linearly,

V (t) = \sum_{i} w_{i} (t) x_{i} (t) + ξ_{V} (t)

(1)

where V(t) is the membrane potential, x_i(t) is the synaptic input from neuron i, w_i(t) is the corresponding PSP amplitude, and ξ_V(t) is the membrane potential noise. For simplicity we work in discrete time, so x_i(t) is either 1 (when there is a spike at time t) or 0 (when there is no spike), and we take the time step to be 10 ms, on the order of the membrane time constant [5].

We assume that the goal of the neuron is to set its weights, w_i, so that it achieves a “target” membrane potential (denoted V_tar) - the membrane potential that minimizes some cost to the animal. In this setting, the weights are found using a neuron-specific feedback signal, denoted f. Critically, this feedback signal contains information about the target weights through its dependence on a true error signal, δ - the difference between the target and actual membrane potential,

δ \equiv V_{tar} - V .

(2)

Our focus is on how to use the feedback signal most efficiently, not on where it comes from (an active area of research [6, 7, 8, 9, 10]). Thus, in most of our analysis we simply assume that the neuron receives a feedback signal, and ask how to optimally update the weights via Bayesian inference.

We consider several learning scenarios. In the first, we simply add noise, denoted ξ_δ, to δ, resulting in the error signal f_lin = δ + ξ_δ (the subscript “lin” indicates that the average feedback is linear in δ). The second corresponds to cerebellar learning, in which a Purkinje cell receives a complex spike if its output is too high, thus triggering long term depression [11]. To mimic the all-or-nothing nature of a complex spike [12], we use a cerebellar-like feedback signal: f_cb = Θ(δ + ξ_δ − θ) where Θ is the Heaviside step function. For this feedback signal, f_cb is likely to be 1 if δ is above a threshold, θ, and likely to be 0 if it is below threshold. The third corresponds to reinforcement learning, in which the feedback represents the reward. The reward provides the magnitude of the error signal, but not its sign, so the feedback signal is f_rl = − |δ + ξ_δ|. In the fourth scenario we move beyond analysis of single neurons, and consider learning the output weights of a recurrent neural network. In this scenario, the error signal is δ, without added noise.

The main idea behind Bayesian plasticity is most easily illustrated in the simplest possible setting, linear feedback, f_lin = δ + ξ_δ. In that case there is a well known learning rule, the delta rule [13, 14],

Δ w_{i} = η x_{i} f_{lin} .

(3)

(This is most easily recognized as the delta rule in the absence of noise, so that f_lin = δ.) The change in the weight is the product of a learning rate, η, a presynaptic term, x_i and a postsynaptic term, f_lin. Importantly, the learning rate, η, is the same for all synapses, so all synapses whose presynaptic cells are active (i.e., for which x_i = 1) change by the same amount (the red arrow labeled “delta rule” in Fig. 1).

In the absence of any other information, the delta rule is perfectly sensible. However, suppose, based on previous information, that synapse 1 is relatively certain about its target weight, whereas synapse 2 is uncertain (error bars in Fig. 1). In that case, new information should have a larger effect on synapse 2 than synapse 1, so synapse 2 should update the estimate of its weight more than synapse 1 (red arrow labeled “optimal” in Fig. 1).

Implementing this scheme leads to several features that are not present in classical learning rules. First, the variance needs to be inferred; second, the change in the weight must depend on the inferred variance; and third, because of uncertainty, the “weight” is in fact the inferred mean weight. In Supplementary Math Note, Sec. S1.3, we derive approximate learning rules that take these features into account (see Methods, Sec. M2, for the exact rules). Using μ_i and $σ_{i}^{2}$ to denote the inferred mean and variance of the distribution over weights, those learning rules are

Δ μ_{i} \approx \frac{σ_{i}^{2}}{σ_{δ}^{2}} x_{i} f_{lin} - \frac{1}{τ} (μ_{i} - μ_{prior})

(4a)

Δ σ_{i}^{2} \approx - \frac{σ_{i}^{4}}{σ_{δ}^{2}} x_{i}^{2} - \frac{2}{τ} (σ_{i}^{2} - σ_{prior}^{2})

(4b)

where $σ_{δ}^{2}$ is the variance of f_lin, and τ, μ_prior, and $σ_{prior}^{2}$ are fixed parameters (described shortly). Note that σ_i corresponds to the length of the error bars in Fig. 1.

The update rule for the mean weight, Eq. (4a), is similar to the delta rule, Eq. (3). There are, however, two important differences. First, the fixed learning rate, η, that appears in Eq. (3) has been replaced by a variable learning rate, $σ_{i}^{2} / σ_{δ}^{2}$ , which is proportional to the synapse’s uncertainty, as measured by $σ_{i}^{2}$ . Thus, the more uncertain a synapse is about its target weight, the larger the change in its mean weight when new information arrives — exactly what we expect given Fig. 1. Moreover, as the feedback signal gets noisier (as measured by the variance of f_lin), and thus less informative, the learning rate falls. Second, in the absence of information (x_i = 0, meaning no spikes), the inferred mean weight, μ_i, moves toward the prior, μ_prior. That is because we are considering the realistic case in which the target weights drift randomly over time due to changes in the statistics of the world and/or surrounding circuits. See Methods, Sec. M1.1 for a detailed discussion of this point.

Unlike the update rule for the mean, the update rule for the uncertainty, $σ_{i}^{2}$ (Eq. 4b), does not have a counterpart in classical learning rules. It does, however, have a natural interpretation. The first term in Eq. (4b) reduces uncertainty (note the negative sign) whenever the presynaptic cell is active (x_i = 1); that is, whenever the synapse receives information. The second term has the opposite effect: it continually increases uncertainty (up to the prior uncertainty, $σ_{prior}^{2}$ ), independent of presynaptic spikes. That term arises because random drift slowly reduces knowledge about the target weights.

The learning rules given in Eq. (4) are approximate - their form was optimized for ease of interpretation rather than accuracy. However, the more exact learning rules (Methods, Eqs. (M.41), (M.43) and (M.47), for the three feedback signals) are not that different. In particular, they retain the same flavor: they consist of a presynaptic term (x_i) and a postsynaptic term (a function of f_lin), and the effective learning rate is updated on each timestep. Moreover, the interpretation is the same: the mean is moved, on average, toward its true value, with a rate that scales with uncertainty, and whenever there is a presynaptic spike the uncertainty is reduced.

To determine whether our Bayesian learning rules are able to accurately compute the mean and variance of the weights, we generated a set of target weights, denoted w_tar,i, and used those to construct V_tar,

V_{tar} (t) = \sum_{i} w_{tar, i} (t) x_{i} (t) .

(5)

Simulations show that the mean weights track the target weights very effectively (Fig. 2; compare the black and red lines, which correspond to the target weight and its inferred mean, respectively). Just as importantly, the synapse’s estimate of its uncertainty tracks the difference between its estimate and the actual target (the black line should be inside the 95% confidence interval – the red area in Fig. 2 - 95% of the time, and it is very close to that: linear, 96.1%; cerebellar learning, 95.4%; reinforcement learning, 96.8%). Note that the uncertainty is much lower at high presynaptic firing rate than at low rate (the red regions in the top row of Fig. 2 are much narrower than in the bottom row). That is because for low firing rate x_i is mainly zero, and so there is little decrease in $σ_{i}^{2}$ ; see Eq. (4b).

Bayesian learning rules track the target weight and estimate uncertainty. The black line is the target weight, the red line is the mean of the inferred distribution, and the red area represents 95% confidence intervals of the inferred distribution. Panels a-c correspond to the highest presynaptic firing rate used in the simulations; panels d-f to the lowest. Consistent with our analysis (see in particular Eq. 9), higher presynaptic firing rate resulted in lower uncertainty. a and d. Linear feedback, f_lin = δ + *ξ_δ*. b and e. Cerebellar learning, f_cb = Θ (δ + *ξ_δ* − θ). c and f. Reinforcement learning, f_rl = − |δ +*ξ_δ*|. See Supplementary Math Note, Sec. S3, for simulation details. Note while the red lilnes are all plotted at the same thickness, the greater variability in the lower plots may make those lines appear thicker.

The critical aspect of the learning rules in Eq. (4) is that the learning rate — the change in mean PSP amplitude, μ_i, per presynaptic spike — increases as the synapse’s uncertainty, $σ_{i}^{2}$ , increases. This is a general feature of our learning rules, and not specific to any one of them. Consequently, independent of the learning scenario, we expect performance to be better than for classical learning rules, which do not take uncertainty into account. To check whether this is true, we computed the mean squared error between the actual and target membrane potential, V and V_tar, for classical learning rules, and compared it to the Bayesian learning rules. The results are shown in Fig. 3. The black line in each panel is the mean squared error for the classical learning rules as a function of learning rate; the red line is the mean squared error for the Bayesian learning rules. (The latter is constant because the Bayesian learning rules do not depend on the classical learning rate.) As predicted, the Bayesian learning rules always do better than the classical ones, even if the learning rate is tuned to its optimal value. This result was robust to model mismatch (Supplementary Math Note, Fig. S.6).

Bayesian learning rules exhibit lower error than classical ones. Red: mean squared error between the target and actual membrane potential for the Bayesian learning rules; black: mean squared error for the classical rules. a. Linear feedback, f_lin = δ + *ξ_δ*. b. Cerebellar learning, f_cb = Θ (δ + *ξ_δ* − θ). c. Reinforcement learning, f_rl = −|δ + *ξ_δ*|. See Supplementary Math Note, Sec. S3, for simulation details.

For the examples so far, we considered a single neuron inferring only its own input weights. We focused on this case primarily to illustrate our method in the simplest possible setting. In reality, however, the brain needs to optimize some cost function based on a feedback signal applied to a recurrent neural network. To investigate Bayesian plasticity in this, more realistic, regime, we trained the output weights of a recurrent neural network to produce a target function, using as a feedback signal the difference between the target function and its network estimate (Fig. 4a). The learning rules are very similar to Eq. (4) (see Methods, Sec. M2.3). However, the target weights are not known, so we cannot compare the inferred weights to the target weights, as we did in Fig. 2. We can, however, compare the mean squared error between the target and actual membrane potential, as in Fig. 3. Bayesian plasticity does indeed outperform classical learning rules (Figs. 4b and c). Moreover, the effect is much larger than in Fig. 3: the mean squared error is about an order of magnitude smaller for the Bayesian than for the classical learning rule (note the log scale in Fig. 4c), a result that was highly robust to model mismatch (Supplementary Math Note, Fig. S.7). These simulations suggest that taking into account weight uncertainty has a much larger effect in networks than in single neurons.

Recurrent neural network. a. Schematic of the circuit. I(t) is the input (used to initialize activity) and w corresponds to the learned output weights. The feedback weights (black arrows from V to the recurrent network) are fixed, as are the recurrent weights. During learning, the output of the network, V(t), is compared to the target output, V_tar(t), and the error is used to update the output weights, w. At test time, the target output is not fed back to the circuit. b. Learning curves, measured using mean squared error, for Bayesian and classical learning rules (red and blue, respectively, at a range of learning rates for the classical rule). Although the initial improvement in performance for the Bayesian and classical learning rules was about the same, after 100 time steps Bayesian learning became much more efficient. The arrows correspond to the number of time steps used for the comparison in panel c. c. Mean squared error versus the learning rate of the classical rule. Solid lines: classical learning rules; dashed lines: Bayesian learning rules. The mean squared error for the Bayesian learning rule was about an order of magnitude smaller than for the classical one. In panels c and d we plot the median, taken over n = 400 network/target pairs; error bars are 95% confidence intervals, computed using the percentile bootstrap.

Figures 3 and 4 indicate that there is a clear advantage to using uncertainty to adjust learning rates. But does the brain do this? Addressing that question will require a new generation of plasticity experiments. At present, in typical plasticity experiments only changes in weights are measured; to test our hypothesis, it will be necessary to measure changes in learning rates, and at the same time determine how those changes are related to the synapse’s uncertainty. This presents two challenges. First, measuring changes in learning rates is difficult, as weights must be monitored over long periods of time and under natural conditions, preferably in vivo. Second, we cannot measure the synapse’s uncertainty directly. Here we discuss two approaches to overcoming these challenges.

The first approach is indirect: use neural activity measured over long periods in vivo to estimate the uncertainty a synapse should have; then, armed with that estimate, test the prediction that the learning rate increases with uncertainty. To estimate the uncertainty a synapses should have, we take advantage of a general feature of essentially all learning rules: synapses get information only when the presynaptic neuron spikes. Consequently, the synapse’s uncertainty should fall as the presynaptic firing rate increases. In fact, under mild assumptions, we can derive a very specific relationship: the relative change in weight under a plasticity protocol, Δμ_i/μ_i, should scale approximately as $1 / \sqrt{ν_{i}}$ where ν_i is the firing rate of the neuron presynaptic to synapse i,

\frac{Δ μ_{i}}{μ_{i}} \propto \frac{1}{\sqrt{ν_{i}}},

(6)

a relationship that holds in our simulations for firing rates above about 1 Hz (see Supplementary Math Note, Fig. S.3, bottom row). In essence, firing rate is a proxy for uncertainty, with higher firing rate indicating lower uncertainty and vice versa. This prediction could be tested by observing neurons in vivo, estimating the presynaptic firing rates, then performing plasticity experiments to determine the relative change in synaptic strength, Δμ_i/μ_i.

The second approach is more direct, but it requires an additional hypothesis. While Bayesian Plasticity tells us how to compute the mean and variance of the weights, it does not tell us what weight to use when a spike arrives. But the Synaptic Sampling hypothesis does: it tells us that the mean and variance of the PSP amplitude should be equal to the mean and variance of the inferred distribution over the target weight,

PSP mean = μ_{i}

(7a)

PSP variance = σ_{i}^{2} .

(7b)

Under our learning rules, the change in mean synaptic weight is proportional to the variance, $σ_{i}^{2}$ (Eq. 4a). Consequently, the relative change in weight Δμ_i/μ_i, is proportional to $σ_{i}^{2} / μ_{i}$ ; combining this with Eq. (7) gives us

\frac{Δ μ_{i}}{μ_{i}} \propto \frac{PSP variance}{PSP mean} \equiv \begin{matrix} Normalized \\ Variability \end{matrix},

(8)

where we have defined the normalized variability to be the ratio of PSP variance to its mean. We verify that this relationship holds in simulations (see Supplementary Math Note, Fig. S.2).

Equation (8) implies that when the PSP variance is high, learning rates are also high. Testing that experimentally is technically difficult: it requires monitoring the PSP mean and variance for long periods in vivo, and comparing normalized variability to changes in the mean. However, such experiments are likely to be possible in the near future.

A more indirect approach based on this idea, for which we can apply current data, makes use of Eq. (6) to replace the left hand side of Eq. (8), Δμ_i/μ_i, with $1 / \sqrt{ν_{i}}$ . This gives us

\begin{matrix} Normalized \\ Variability \end{matrix} \propto \frac{1}{\sqrt{ν_{i}}} .

(9)

This is intuitively sensible: as discussed above, higher presynaptic firing rates means the synapse is more certain, and Synaptic Sampling states that higher certainty should reduce the observed variability. This relationship can be tested by estimating presynaptic firing rates in vivo, and comparing them to the normalized variability measured using paired recordings. Such data can be extracted from experiments by Ko and colleagues [15]. In those experiments, calcium signals in mouse visual cortex were recorded in vivo under a variety of stimulation conditions, which provided an estimate of the firing rate of each imaged neuron; subsequently, whole cell recordings of pairs of identified neurons were made in vitro, and the mean and variance of the PSPs were measured. In Fig. 5 we plot the normalized variability versus the firing rate on a log-log scale; on this scale, our theory predicts a slope of −1/2 (red line). The normalized variability does indeed decrease as the firing rate increases (blue line), (p < 0.003), and the slope is not significantly different from the predicted value of −1/2 (p = 0.57). This pattern is broadly matched by simulations, at least at sufficiently high firing rate (Supplementary Math Note, Fig. S.3, top row).

Normalized variability (the ratio of the PSP variance to the mean) versus presynaptic firing rate as a diagnostic of our theory; data supplied to us by the authors of [15] (see Supplementary Math Note, Sec. S4.4). The red line, which has a slope of –1/2, is our prediction (the intercept, for which we do not have a prediction, was chosen to give the best fit to the data). The blue line is fit by linear regression(n =136 points), and the gray region represents 2 standard errors. The slope of the blue line, −0.62, is statistically significantly different from 0 (p < 0.003, t-test) and not significantly different from −1/2 (p = 0.57, t-test; assumes normality which was not formally tested). The firing rate was measured by taking the average signal from a spike deconvolution algorithm [45]. Units are arbitrary because the scale factor relating the average signal from the deconvolution algorithm and the firing rate is not exactly one [46]. Data from layer 2/3 of mouse visual cortex [15].

An alternative explanation for this result is that increases in firing rate reduce the normalized variability because of short term effects on release probability. The release probability, denoted p_r, scales the variance of the PSP by a factor of p_r(1 − p_r) and the mean by a factor of p_r, so the normalized variability (the variance divided by the mean) scales as 1 − p_r. Consequently, an increase in release probability with firing rate would explain Fig. 5. Such increases do indeed occur [16]. However, much more common — especially in rodent layer 2/3, where these experiments were performed — is a decrease in release probability with firing rate [17, 18]. Thus, short term synaptic plasticity would typically lead to an increase, not a decrease, in the normalized variability when firing rate increases; the opposite of what we see experimentally.

3. Discussion

We proposed that synapses do not just keep track of point estimates of their weights, as they do in classical learning rules; they also keep track of their uncertainty. They then use that uncertainty to set learning rates: the higher the uncertainty, the higher the learning rate. This allows different synapses to have different learning rates, and leads to learning rules that allow synapses to exploit all locally available information. This in turn leads to better performance, as measured by mean squared error (Figs. 3 and 4b,c). It also leads to faster learning. That is implicit in Fig. 3 (because the target weights drift, fast learning is essential for achieving low mean squared error) and it is explicit in Fig. 4b (compare red to blue curves).

The critical difference between our learning rules and classical ones is that the learning rates themselves undergo plasticity. We derived three rules, based on three different assumptions about the feedback signal received by the neuron, and in all cases the updates for the mean had the flavor of a classical rule: the change in the mean weight was a function of the presynaptic activity and an error signal. Other assumptions about the feedback signal are clearly possible, and our method can generate a broad range of learning rules. Whether or not they can generate all rules that have been observed experimentally is an avenue for future research.

The hypothesis that synapses keep track of uncertainty, which we refer to as the Bayesian Plasticity hypothesis, makes the general prediction that learning rates, not just synaptic strengths, are a function of pre and postsynaptic activity — something that should be testable with the next generation of plasticity experiments. In particular, it makes a specific prediction about learning rates in vivo: learning rates should vary across synapses, being higher for synapses with lower presynaptic firing rates.

We also made a second, independent, hypothesis, Synaptic Sampling. This hypothesis states that the variability in PSP size associated with a particular synapse matches the uncertainty in the strength of that synapse. This allows synapses to communicate their uncertainty to surrounding circuitry — information that is critical if the brain is to monitor the accuracy of its own computations. The same principle has been applied to neural activity, where it is known as the neural sampling hypothesis [19, 20, 21, 22], which posits that variability in neural activity matches uncertainty about the state of the external world. The neural sampling hypothesis meshes well with synaptic sampling: uncertainty in the weights increases uncertainty in the current estimate of the state of the world, and likewise, variability in the weights increase variability in neural activity (see Supplementary Math Note, Sec. S5). While there is some experimental evidence for the neural sampling hypothesis [21, 23, 22, 24, 25], it has not been firmly established. Whether other proposals for encoding probability distributions with neural activity, such as probabilistic population codes [3, 26], can be combined with Synaptic Sampling is an open question.

By combining our two hypotheses, we were able to make additional predictions. These focused on what we call the normalized variability — the ratio of the variance in PSP size to the mean. First, we predicted that plasticity should increase with normalized variability, which remains to be tested. Second, we predicted that normalized variability should decrease with presynaptic firing rate. Reanalyzing data from [14], we provided evidence that this is indeed the case.

In machine learning, the idea that it is advantageous to keep track of the distribution over weights has a long history [27, 28, 29]. Especially relevant is a recent study in which, as in our scheme, learning rates were reduced when certainty was high [30]. However, rather than updating the uncertainty on every time step, as we do, updating occurred only when there was a change in the task. This occurs on the timescale of minutes to hours; not the millisecond timescale on which uncertainty is updated in our model. Nevertheless, this approach worked well in settings in which deep networks had to learn multiple tasks.

In neuroscience, weight uncertainty was first explored in the context of reinforcement learning [31]. In that work, the weights related sensory stimuli to rewards, and weight correlations that developed due to Bayesian learning provided an exceptionally elegant explanation of backward blocking. The idea lay dormant for over a decade, until it was rediscovered with a slightly different focus, one in which knowledge of weight uncertainty is critical for knowledge of computational uncertainty [3]. Several theoretical studies followed. The first of those [32] bore some resemblance to ours, in that weights were sampled from a distribution. However, the timescale for sampling was hours rather than milliseconds; too slow to explain the spike-to-spike variability in PSP size that is ubiquitous in the brain. More recently, Hiratani and Fukai [33] postulated that the multiple synaptic contacts per connection observed in cortex provides a scaffolding for constructing a non-parametric estimate of the probability distribution over synaptic strength. Weight uncertainty has also been applied to drift diffusion models [34], using methods similar to those in [31]; the main difference was that the reward was binary (correct or incorrect) rather than continuous. Finally, recent work proposed that short-term plasticity is also governed by a Bayesian updating process [35]. It will be interesting to determine which combination of these schemes is used by the brain.

If the Bayesian Plasticity hypothesis is correct, synapses would have to keep track of, and store, two variables: the mean, as is standard, but also the variance (or, equivalently, the learning rate), which is not. The complexity of synapses [36, 37, 38], and their ability to use non-trivial learning rules (e.g., synaptic tagging, in which activity at a synapse “tags” it for future long term changes in strength [39, 40, 41], and metaplasticity, in which the learning rate can be modified by synaptic activity without changing the synaptic strength [42, 43, 44]), suggests that representing uncertainty — or learning rate — is quite possible. It will be nontrivial, but important, to work out how.

Our framework has several implications, both for the interpretation of neurophysiological data and for future work. First, under the Synaptic Sampling hypothesis, PSPs are necessarily noisy. Consequently, noise in synapses (e.g., synaptic failures) is a feature, not a bug. We thus provide a normative theory for one of the major mysteries in synaptic physiology: why neurotransmitter release is probabilistic. Second, our approach allows us to derive local, biologically plausible learning rules, no matter what information is available at the synapse, and no matter what the statistics of the synaptic input. Thus, our approach provides the flexibility necessary to connect theoretical approaches based on optimality to complex biological reality.

In neuroscience, Bayes theorem is typically used to analyze high level inference problems, such as decision-making under uncertainty. Here we demonstrated that Bayes’ theorem, being the optimal way to solve any inference problem, big or small, could be implemented in perhaps the smallest computationally relevant element in the brain: the synapse.

Methods

Here we provide a complete description of our model (Sec. M1) and sketch the derivation of the learning rules (Sec. M2).

M1. Description of our model

In the main text we specified how the membrane potential depends on the weights and incoming spikes (Eq. 1) and how the target membrane potential depends on the target weights and incoming spikes (Eq. 5), and we defined the error signal (Eq. 2). In this section we describe how target weights, w_tar,i, the weights, w_i, the and the spikes, x_i, are generated.

M1.1. Target weights

The target weights are the weights that in some sense optimize the performance of the animal. We do not expect these weights to remain constant over time, for two reasons. First, both the state of the world and the organism change over time, thus changing the target weights. Second, we take a local, single neuron view to learning, and define the target weights on a particular neuron to be the optimal weights given the weights on all the other neurons in the network. Consequently, as the weights of surrounding neurons change due to learning, the target weights on our neuron also change. While these changes may be quite systematic, to a single synapse deep in the brain they are likely to appear random.

Motivated by this last observation, in our model we assume that the target weights evolve according to a random process. To ensure that the weights do not change sign, we work in log space, and on each time step we add a small amount of noise to the log of the target weights. And to ensure that the weights do not become too small or too large, we add a small drift toward a prior log weight. Specifically, defining

λ_{tar, i} = log | w_{tar, i} |

(M.10)

(note the absolute value sign, which allows the weights to be either positive or negative), we let λ_tar,i evolve according to

λ_{tar, i} (t + 1) = λ_{tar, i} (t) - \frac{λ_{tar, i} (t) - m_{prior}}{τ} + \sqrt{\frac{2 s_{prior}^{2}}{τ}} ξ_{tar, i}

(M.11)

where m_prior and $s_{prior}^{2}$ are the prior mean and variance of λ_tar,i(t), τ (which is dimensionless) is the characteristic number of steps over which λ_tar,i(t) changes, and ξ_tar,i is a zero mean, unit variance Gaussian random variable.

We chose the noise process described in Eq. (M.11) for three reasons. First, w_tar,i is equal to either +e^λ_tar,i (for excitatory weights) or −e^λ_tar,i (for inhibitory weights), and thus cannot change sign as λ_tar,i changes with learning. Consequently, excitatory weights cannot become inhibitory, and vice versa, so Dale’s law is preserved. Second, spine sizes obey this stochastic process [47], and while synaptic weights are not spine sizes, they are correlated [48]. Third, this noise process gives a log-normal stationary distribution of weights, as is observed experimentally [49].

The parameters that determine how the weights drift, m_prior and $s_{prior}^{2}$ , were set to the mean and variance of measured log-weights using data from Ref. [49] (Supplementary Math Note, Sec. S4.1). We used a time step of 10 ms, within the range of measured membrane time-constants. For the linear and cerebellar models we set τ to 10⁵; for reinforcement learning we set τ to 5 × 10⁵. These values were chosen so that uncertainty roughly matched observed variability (Supplementary Math Note, Sec. S4.3). For the recurrent network we do not know the target weights, so we do not know the drift rate. Nor do we know the effective drift associated with the fact that the optimal weight on one synapses changes as the surrounding circuit changes. We therefore tried different drifts in our simulations (data not shown). We found that near zero drift was optimal, so we set τ to ∞.

M1.2. Synaptic weights

Our inference algorithm computes a distribution over the target weights. Given that distribution, though, there is nothing in the Bayesian Plasticity hypothesis that tells us how to set the weights when a spike arrives. That is where the Sampling Hypothesis comes in: it tells us to sample the weights, w_i, from the posterior,

w_{i} = e^{m_{i} + s_{i} ξ_{i}},

(M.12)

where m_i and s_i are the mean and standard deviation of the posterior distribution over the log weights, and ξ_i is a zero mean, unit variance Gaussian random variable. The mean and variance of w_i under Eq. (M.12), for which we use μ_i and $σ_{i}^{2}$ , respectively, are the standard expressions for the mean and variance of a log-normal distribution,

𝔼 [w_{i} ∣ 𝒟_{i}] \equiv μ_{i} = e^{m_{i} + s_{i}^{2} / 2}

(M.13a)

Var [w_{i} ∣ 𝒟_{i}] \equiv σ_{i}^{2} = μ_{i}^{2} [e^{s_{i}^{2}} - 1]

(M.13b)

where 𝓓_i is the data seen by the synapse so far (see Eq. M.19 below).

In the main text, we compare our Bayesian learning rules to classical ones (see in particular Figs. 3 and 4). For classical rules there is no posterior to sample from, so we can not use Eq. (M.12). Consequently, for the the classical implementation of linear and cerebellar rules we do not sample, and for Bayesian learning we use w_i = μ_i. The reinforcement learning rule, however, requires sampling, for both Bayesian and classical learning (see Eqs. M.47 and M.48). We thus assumed that the variance is proportional to the mean (as is the case for Poisson statistics). To find the constant of proportionality, denoted k, we use data from Ref. [49]; see Supplementary Math Note, Sec. S4.2 for details. A least squares fit to that data gives k = 0.0877. A naive way to implement this is to sample weights using $w_{i} = μ_{i} + \sqrt{k μ_{i}} ξ_{i}$ with ξ_i ~ 𝒩 (0,1). However, that allows w_i to change sign, so instead we sample the weights using

w_{i} = μ_{i} e^{β_{i} + γ_{i} ξ_{i}},

(M.14)

and choose β_i and γ_i so that the mean and variance of w_i are μ_i and kμ_i, respectively. As is straightforward to show, these conditions are satisfied when β_i and γ_i are given by

β_{i} = - \frac{log (1 + k / μ_{i})}{2}

(M.15a)

γ_{i} = \sqrt{log (1 + k / μ_{i})} .

(M.15b)

M1.3. Synaptic input

For linear, cerebellar and reinforcement learning, neurons receive input from n presynaptic neurons, all firing at different rates. The firing rates, ν_i (i labels presynaptic neuron), are drawn from a log-normal distribution, using a distribution that is intermediate between the narrow range found by some [50] and the broad range found by others [51]: a log-normal with median at 1 Hz and with 95% of firing rates being between 0.1 Hz and 10 Hz,

log ν_{i} \sim 𝒩 (0, {(log \sqrt{10})}^{2})

(M.16)

with ν_i measured in Hz. On each time step, x_i is drawn from a Bernoulli distribution (so it is either 0 or 1),

P (x_{i}) = {(ν_{i} Δ t)}^{x_{i}} {(1 - ν_{i} Δ t)}^{1 - x_{i}} .

(M.17)

M2. Learning rules

Here we outline how a synapse can infer a probability distribution over its target weights. This is done using a well-understood class, hidden Markov models, for which we can use a standard, two-step procedure: in the first step the synapse incorporates new data using Bayes theorem; in the second step it take into account random changes in the target weight.

While straightforward in principle, in practice there are two difficulties with this approach. The first is that it results in a joint distribution over all synaptic weights. It is unlikely, however, that synapses could store such a distribution: even with a Gaussian approximation, for n synapses there are about n²/2 parameters. And it is even more unlikely that they could compute it, as that would require communication among synapses on different dendritic branches. We thus assume that each synapse performs probabilistic inference based only on the data available to it. This makes each synapse locally optimal, and it allows us to derive local learning rules. It is potentially the most important theoretical advance of our analysis. And within the Bayesian framework it is straightforward: each synapse simply integrates over the uncertainty in the target weights of all the other synapses. Nonetheless, this is an unusual approach, and further work is necessary to understand its theoretical properties.

The second difficulty is that even with the local approximation, inference is intractable, as it requires pointwise multiplication of probability distributions and a convolution (see Eqs. M.20 and M.21 below). To remedy this, we approximate the true distribution by a simpler one, a log-normal. The log-normal distribution was chosen for two reasons: it prevent synapses from changing sign, so Dale’s law is respected; and it matches the distribution of the target weights, Eq. (M.11), so it produces the correct distribution in the absence of presynaptic spikes.

M2.1. Single neuron learning rules: general formalism

The goal of a synapse is to compute the probability distribution over synaptic strength given data up to the last time step. Here the data - assumed local, as just discussed - consists of the feedback signal, f (shorthand for f_lin, f_cb or f_rl), the presynaptic input, x_i, and the actual weight, w_i. To reduce clutter, we use d_i(t) to denote the data at time t,

d_{i} (t) \equiv {f (t), x_{i} (t), w_{i} (t)},

(M.18)

and 𝓓_i(t) to denote past data,

𝒟_{i} (t) \equiv {d_{i} (t), d_{i} (t - 1), d_{i} (t - 2), \dots} .

(M.19)

With this notation, the goal of the synapse is to compute P(λ_tar,i(t + 1)|𝓓_i(t)) in terms of P(λ_tar,i(t)|𝓓_i(t − 1)). To reduce clutter even further, here and in what follows all quantities without an explicitly specified time index are evaluated at time step t; thus, we will derive an update rule for P(λ_tar,i(t + 1)|𝓓_i) in terms of P(λ_tar,i|𝓓_i(t − 1)).

Making, as discussed above, the approximation that synapses perform inference based only on local information, the first step in the derivation of the update rule, incorporating new data using Bayes theorem, gives us

P (λ_{tar, i} ∣ 𝒟_{i}) = P (λ_{tar, i} ∣ d_{i}, 𝒟_{i} (t - 1)) \propto P (d_{i} ∣ λ_{tar, i}) P (λ_{tar, i} ∣ 𝒟_{i} (t - 1))

(M.20)

where we used the Markov property: P(d_i|λ_tar,i, 𝓓_i(t − 1)) = P(d_i|λ_tar,i). (Recall that λ_tar,i is the log of the absolute value of the i^th target weight, w_tar,i; see Eq. M.10.) In the second step, the synapse takes into account random changes in the target weight,

P (λ_{tar, i} (t + 1) ∣ 𝒟_{i}) = \int d λ_{tar, i} P (λ_{tar, i} (t + 1) ∣ λ_{tar, i}) P (λ_{tar, i} ∣ 𝒟_{i}) .

(M.21)

The conditional distribution, P(λ_tar,i(t + 1)|λ_tar,i), can be extracted from Eq. (M.11). Combining both steps takes us from the distribution at time t, P(λ_tar,i|𝓓_i(t − 1)), to the distribution at the time t + 1, P (λ_tar,i(t + 1) |𝓓_i).

To make progress analytically, we approximate the true distribution by a log-normal one with mean m_i and variance $s_{i}^{2}$ ; that is, we assume that

λ_{tar, i} ∣ 𝒟_{i} (t - 1) \sim 𝒩 (m_{i}, s_{i}^{2}) .

(M.22)

This is the quantity the synapse needs when it sets the actual weight, w_i. (Recall that quantities with no explicit time dependence are to be evaluated at time t; thus, the left hand side is the probability distribution over λ_tar,i(t) given data up to the previous time step.)

Finalizing the calculation requires two steps: 1) insert Eq. (M.22) into (M.20) and compute P(λ_tar,i|𝓓_i); 2) insert that into Eq. (M.21) and compute P(λ_tar,i(t + 1)|𝓓_i). However, Eq. (M.20) takes us out of our log-normal model class. To remedy this we use Assumed Density Filtering [52], for which posteriors are taken to be log-normal with mean and variance chosen to produce the distribution closest to the true one, where “close” is measured by the KL-divergence between the true and log-normal distributions. This can be achieved by matching moments: the mean and variance of the “closest” log-normal distribution are

m_{i} = 𝔼 [λ_{tar, i} ∣ 𝒟_{i} (t - 1)]

(M.23a)

s_{i}^{2} = Var [λ_{tar, i} ∣ 𝒟_{i} (t - 1)] .

(M.23b)

We will apply this first to Eq. (M.20). Taking the log of both sides of that equation gives

log P (λ_{tar, i} ∣ 𝒟_{i}) = L (λ_{tar, i}) + log P (λ_{tar, i} ∣ 𝒟_{i} (t - 1)) + const

(M.24)

where

L (λ_{tar, i}) \equiv log P (d_{i} ∣ λ_{tar, i})

(M.25)

is the log likelihood of the data at time t given the target weight; we suppress the dependence on d_i to avoid clutter. Under our log-normal assumption, the second term on the right hand side of Eq. (M.24) is Gaussian in λ_tar,i. Motivated by the fact that new data does not provide much information, we assume that the likelihood is a slowly varying function of the target weights. This allows us to make a Laplace approximation: we Taylor expand the log likelihood around m_i, the mean of P(λ_tar,i|𝓓_i(t − 1)), and work only to second order in λ_tar,i − m_i. Also using Eq. (M.22), we have

log P (λ_{tar, i} ∣ 𝒟_{i}) = L^{'} (m_{i}) (λ_{tar, i} - m_{i}) + L^{″} (m_{i}) \frac{{(λ_{tar, i} - m_{i})}^{2}}{2} - \frac{{(λ_{tar, i} - m_{i})}^{2}}{2 s_{i}^{2}} + const .

(M.26)

The right hand side is now quadratic in λ_tar,i. Consequently, P(λ_tar,i|𝓓_i) is Gaussian, with mean and variance given by

𝔼 [λ_{tar, i} ∣ 𝒟_{i}] = m_{i} + \frac{s_{i}^{2} L^{'} (m_{i})}{1 - s_{i}^{2} L^{″} (m_{i})} \approx m_{i} + s_{i}^{2} L^{'} (m_{i})

(M.27a)

Var [λ_{tar, i} ∣ 𝒟_{i}] = s_{i}^{2} + \frac{s_{i}^{4} L^{″} (m_{i})}{1 - s_{i}^{2} L^{″} (m_{i})} \approx s_{i}^{2} + s_{i}^{4} L^{″} (m_{i}) .

(M.27b)

To derive the approximation expressions, we assumed $s_{i}^{2} | L^{″} (m_{i}) | ≪ 1$ . This holds in the limit of slowly varying log likelihood, which we assume throughout our analysis.

Equation (M.27) tells us how to incorporate new data; we now need to incorporate random drift, via the integral in Eq. (M.21). From Eq. (M.11), we see that P(λ_tar,i(t + 1)|λ_tar,i) is Gaussian, so the integral is straightforward, and we have

m_{i} (t + 1) = (1 - \frac{1}{τ}) 𝔼 [λ_{tar, i} ∣ 𝒟_{i}] + \frac{m_{prior}}{τ}

(M.28a)

s_{i}^{2} (t + 1) = {(1 - \frac{1}{τ})}^{2} Var [λ_{tar, i} ∣ 𝒟_{i}] + \frac{2 s_{prior}^{2}}{τ} .

(M.28b)

Inserting Eq. (M.27) into (M.28), and working to lowest nonvanishing order in 1/τ, s_iL’(m_i) and $s_{i}^{2} L^{″} (m_{i})$ , we arrive at our final update equations,

Δ m_{i} = s_{i}^{2} L^{'} (m_{i}) - \frac{m_{i} - m_{prior}}{τ}

(M.29a)

Δ s_{i}^{2} = s_{i}^{4} L^{″} (m_{i}) - \frac{2 (s_{i}^{2} - s_{prior}^{2})}{τ}

(M.29b)

where Δm_i ≡ m_i(t + 1) − m_i and $Δ s_{i}^{2} \equiv s_{i}^{2} (t + 1) - s_{i}^{2}$ . Thus, to update the mean and variance, all we have to do is compute the log likelihood and take the first and second derivatives. In Sec. M2.2 below we sketch how to do that; additional details are provided in Supplementary Math Note, Sec. S1. Note that equality in these expressions (and many that follow) is shorthand for equality under the assumptions and approximations of our model.

M2.2. Single neuron learning rules for our three models

According to the above analysis (see in particular Eq. M.29), to determine the update rules we just need the log likelihood of the current data, d_i(t), given the error signal (either f_lin, f_cb or f_rl). Computing it is nontrivial, as several approximations are required. However, it is not hard to get an intuitive understanding of its form.

Using Eq. (M.18) for the data, d_i, the likelihood - the probability of the data given w_tar,i - may be written

P (d_{i} ∣ λ_{tar, i}) = P (f ∣ x_{i}, w_{i}, λ_{tar, i}) P (x_{i}, w_{i} ∣ λ_{tar, i}) \propto P (f ∣ x_{i}, w_{i}, λ_{tar, i}),

(M.30)

where we are able to drop the term P(x_i, w_i|λ_tar,i) because without an error signal, x_i and w_i do not provide any information about λ_tar,i. For all of our feedback signals, f is a function of f_lin (see main text); we take advantage of this to write

P (f ∣ x_{i}, w_{i}, λ_{tar, i}) = \int d f_{lin} P (f ∣ f_{lin}) P (f_{lin} ∣ x_{i}, w_{i}, λ_{tar, i}) .

(M.31)

We focus here on computing P(f_lin|x_i, w_i, λ_tar,i), and leave the integral for Supplementary Math Note, Sec. S1.1. Using Eqs. (1), (2) and (5) from the main text, we have

f_{lin} = (w_{tar, i} - w_{i}) x_{i} + \sum_{j \neq i} (w_{tar, j} - w_{j}) x_{j} + ξ_{V} + ξ_{δ} .

(M.32)

For synapse i, all the terms in the sum over j are unobserved, and so correspond to noise. By the Central Limit Theorem (and the assumed independence of the synapses), that noise is Gaussian; we take the added noise, ξ_V and ξ_δ, to be Gaussian as well, with total variance

σ_{0}^{2} \equiv Var [ξ_{δ}] + Var [ξ_{V}] .

(M.33)

Consequently, we may write

f_{lin} ∣ w_{i}, x_{i}, λ_{tar, i} \sim 𝒩 ((w_{tar, i} - w_{i}) x_{i}, σ_{δ, i}^{2})

(M.34)

where

σ_{δ, i}^{2} \equiv Var [\sum_{j \neq i} (w_{tar, j} - w_{j}) x_{j} ∣ 𝒟_{i} (t - 1)] + σ_{0}^{2} .

(M.35)

The quantity $σ_{δ, i}^{2}$ depends on synapse, i. However, in the limit that there are a large number of synapses, that dependence is weak. We thus approximate it by including all terms in the sum over j, which we denote $σ_{δ}^{2}$ ,

σ_{δ, i}^{2} \approx σ_{δ}^{2} \equiv Var [\sum_{j} (w_{tar, i} - w_{j}) x_{j} ∣ 𝒟_{i} (t - 1)] + σ_{0}^{2} .

(M.36)

Under this approximation,

f_{lin} ∣ w_{i}, x_{i}, λ_{tar, i} \sim 𝒩 ((\pm e^{λ_{tar, i}} - w_{i}) x_{i}, σ_{δ}^{2}) .

(M.37)

For much of our analysis we use the value of $σ_{δ}^{2}$ under the prior. That quantity, denoted $σ_{δ 0}^{2}$ given by

σ_{δ 0}^{2} \equiv (σ_{prior}^{2} + σ_{w, prior}^{2}) \sum_{j} ν_{j} Δ t (1 - ν_{j} Δ t) + σ_{0}^{2}

(M.38)

where the term ν_jΔt(1 − ν_jΔt) comes from the Bernoulli statistics of x_j (see Eq. M.17), and $σ_{prior}^{2}$ and $σ_{w, prior}^{2}$ are the variances of w_tar,i and w_i under the prior. The latter, $σ_{w, prior}^{2}$ , depends on whether or not we are sampling,

σ_{w, prior}^{2} \equiv {\begin{array}{l} σ_{prior}^{2} & Synaptic Sampling \\ k μ_{prior} & variance proportional to the mean (Sec . M1 . 2) . \end{array}

(M.39)

The prior mean and variance of the weights in terms of the log weights (the quantities we have access to; see Table 1 below) are given by Eq. (M.13),

μ_{prior} \equiv e^{m_{prior} + s_{prior}^{2}} / 2

(M.40a)

σ_{prior}^{2} \equiv μ_{prior}^{2} [e^{s_{prior}^{2}} - 1] .

(M.40b)

This analysis tells us that the distribution P(f_lin|w_i, x_i, λ_tar,i) is Gaussian in e^λ_tar,i. To determine the learning rules, all we have to do is insert Eq. (M.37) into Eq. (M.31), perform an integral, take the log, compute the first two derivatives, and evaluate them at m_i (see Eq. M.29). These steps, which are performed in Supplementary Math Note, Sec. S1.1, are not completely straightforward, as various approximations must be made. However, from a conceptual point of view the approximations do not add much. Thus, here we simply give the results.

Linear feedback, f = f_lin

The Bayesian update rules are

Δ m_{i} = (\frac{s_{i}^{2} μ_{i}}{σ_{δ 0}^{2}}) x_{i} f_{lin} - \frac{1}{τ} (m_{i} - m_{prior})

(M.41a)

Δ s_{i}^{2} = - (\frac{s_{i}^{4} μ_{i}^{2}}{σ_{δ 0}^{2}}) x_{i}^{2} - \frac{2}{τ} (s_{i}^{2} - s_{prior}^{2}) .

(M.41b)

For classical learning, we use the delta rule (main text, Eq. 3),

Δ w_{i} = η x_{i} f_{lin} .

(M.42)

Note that we are not including weight drift in the classical learning rate (both here and below), as weight drift was derived using Bayesian analysis, and has no classical counterpart.

Cerebellar feedback, f = f_cb = Θ(f_lin − θ)

The Bayesian update rules are

Δ m_{i} = (\frac{s_{i}^{2} μ_{i}}{σ_{δ 0}^{2}}) x_{i} σ_{δ 0} (2 f_{cb} - 1) \frac{𝒩 (θ_{cb})}{Φ (θ_{cb})} - \frac{1}{τ} (m_{i} - m_{prior})

(M.43a)

Δ s_{i}^{2} = - (\frac{s_{i}^{4} μ_{i}^{2}}{σ_{δ 0}^{2}}) x_{i}^{2} \frac{𝒩 (θ_{cb})}{Φ (θ_{cb})} [θ_{cb} + \frac{𝒩 (θ_{cb})}{Φ (θ_{cb})}] - \frac{2}{τ} (s_{i}^{2} - s_{prior}^{2})

(M.43b)

where Φ and (in a slight abuse of notation) 𝒩 are the cumulative normal and normal functions, respectively,

Φ (z) \equiv \int_{- \infty}^{z} d u \frac{e^{- u^{2} / 2}}{{(2 π)}^{1 / 2}}

(M.44a)

𝒩 (z) \equiv \frac{e^{- z^{2} / 2}}{{(2 π)}^{1 / 2}}

(M.44b)

and θ_cb is given in terms of the threshold, θ, as

θ_{cb} \equiv (1 - 2 f_{cb}) \frac{θ}{σ_{δ 0}} .

(M.45)

For classical learning, we absorb most of the prefactor in the above mean update into a fixed learning rate,

Δ w_{i} = η (2 f_{cb} - 1) x_{i} \frac{𝒩 (θ_{cb})}{Φ (θ_{cb})} .

(M.46)

Reinforcement learning, f = f_rl = − |f_lin|

The Bayesian update rules are

Δ m_{i} = (\frac{s_{i}^{2} μ_{i}}{σ_{δ}^{2}}) (\frac{f_{rl}^{2}}{σ_{δ}^{2}} - 1) x_{i}^{2} (μ_{i} - w_{i}) - \frac{1}{τ} (m_{i} - m_{prior})

(M.47a)

Δ s_{i}^{2} = - (\frac{s_{i}^{4} μ_{i}^{2}}{σ_{δ}^{2}}) (1 - \frac{f_{rl}^{2}}{σ_{δ}^{2}}) x_{i}^{2} - \frac{2}{τ} (s_{i}^{2} - s_{prior}^{2}) .

(M.47b)

This learning rule appears non-local, as it depends on $σ_{δ}^{2}$ , which in turn depends on all the synapses (Eq. M.36). However, we make it local by changing the feedback signal to $(1 - f_{rl}^{2} / σ_{δ}^{2}) / σ_{δ}^{2}$ . For classical learning, we again absorb most of the prefactor into the learning rate,

Δ w_{i} = η x_{i} (f_{rl} \tanh ((μ_{i} - w_{i}) x_{i} f_{rl} / σ_{δ}^{2}) - (μ_{i} - w_{i}) x_{i}) .

(M.48)

Note that tanh appears in the classical, but not Bayesian, learning rules. That is because for the Bayesian learning rules we made the approximation tanh $((μ_{i} - w_{i}) x_{i} f_{rl} / σ_{δ}^{2}) \approx (μ_{i} -)) w_{i}) x_{i} f_{rl} / σ_{δ}^{2}$ . This approximation, however, made the classical learning rule unstable.

M2.3. Recurrent neural network learning rules

So far we have focused on single neurons. Here we generalize to the more realistic scenario in which the output weights of a recurrent network are trained to produce a time-dependent target function. We will assume that the network, which contains N neurons, evolves according to

τ_{m} \frac{d v_{i}}{d t} = - v_{i} + \sum_{j = 1}^{N} J_{i j} x_{j} + A_{i} V (t) + I_{i} (t)

(M.49a)

x_{j} = \tanh (v_{j})

(M.49b)

V (t) = \sum_{j = 1}^{N} w_{j} x_{j} .

(M.49c)

We interpret v_i as the membrane potential and x_j as the firing rate relative to baseline. The recurrent weights, J_ij, and feedback weights, A_i, are fixed. Parameters of the network, and details of the simulations, are given in Supplementary Math Note, Sec. S3.

The goal of the network is to minimize the distance between V(t) and some target function, denoted V_tar(t); that is, to minimize the error, δ(t), defined, as in Eq. (2), to be

δ (t) \equiv V_{tar} (t) - V (t) .

(M.50)

As with single neurons, we take a Bayesian approach. There are, however, two important differences. The first is that we do not know the target weights (we do not specify them; instead they must be learned). We assume, though, that target weights exist, which means we can write

δ (t) = \sum_{j} (w_{tar, j} (t) - w_{j} (t)) x_{j} (t) .

(M.51)

The second difference is that the feedback signal, δ(t), is a continuous function of time. Consequently, information at times t and t + dt is largely redundant. To deal with this redundancy we make several approximations. First, rather than updating the weights continuously, we update them at times separated by Δt. (In a slight abuse of notation, here Δt does not have the same numerical value as in the single neuron update rules.) Bayes’ theorem, Eq. (M.20), then becomes

P (w_{tar, i} ∣ 𝒟_{i}) \propto P (d_{i} ∣ w_{tar, i}, 𝒟_{i} (t - Δ t)) P (w_{tar, i} ∣ 𝒟_{i} (t - Δ t))

(M.52)

where, as in the single neuron case, the data for synapse i is the presynaptic input, x_i, the actual weight, w_i, and the error signal, δ. To derive this expression we made two simplifications: we did not add noise to the error signal, so the synapses see δ rather than f_lin, and we did not enforce Dale’s law, so the weights can change sign. Because of the latter simplification, we let the weights, rather than the log weights, have a Gaussian distribution; that is why Eq. (M.52) is written in terms of w_tar,i rather than λ_tar,i.

In one respect the analysis is simpler than it was for single neurons: because the target weights do not evolve over time (see comments at the end of Sec. M1.1), we can avoid the integral in Eq. (M.21). However, in another respect it is more complicated: as just discussed, the likelihood (the first term on the right hand side of Eq. M.52) depends on past data. An exact treatment in this regime is beyond the scope of this work. What we do instead is choose the time step, Δt, so it is much larger than the correlation time of δ(t). This allows us to drop the dependence on 𝓓_i(t − Δt) in the likelihood, giving us

P (d_{i} ∣ w_{tar, i}, 𝒟_{i} (t - Δ t)) \approx P (d_{i} ∣ w_{tar, i}) \propto P (δ ∣ x_{i}, w_{i}, w_{tar, i})

(M.53)

where, as in Eq. (M.30), we used the fact that without an error signal, x_i and w_i do not provide any information about w_tar,i.

While this gives us a very good approximation to the likelihood if Δt is large, large Δt means that updates would be made very rarely, and so learning would be slow. We thus make a second approximation, which is to optimize our learning rule (via numerical simulation, as discussed below) with respect to Δt. This gives us approximate Bayesian update rules, which presumably could be improved upon. However, as we will see, the approximate update rules already outperform the classical ones by an order of magnitude. Thus, any improvement would only make the case for Bayesian plasticity stronger.

To find an expression for P(δ|x_i, w_i, w_tar,i), we again write δ as in Eq. (M.32) (but without noise, so ξ_δ = 0, which reduces $σ_{0}^{2}$ ; see Eq. M.33). Now, however, we are interested in the log likelihood with respect to the target weights, w_tar,i, rather than the log of the target weights, λ_tar,i (as mentioned above). Thus, the distribution over δ simplifies relative to Eq. (M.37),

δ ∣ w_{i}, x_{i}, w_{tar, i} \sim 𝒩 ((w_{tar, i} - w_{i}) x_{i}, σ_{δ}^{2}) .

(M.54)

As above, we made the approximation $σ_{δ, i}^{2} \approx σ_{δ}^{2}$ ; see Eq. (M.36). It is now straightforward to write down the log likelihood,

L (w_{tar, i}) = - \frac{{(δ - (w_{tar, i} - w_{i}) x_{i})}^{2}}{2 σ_{δ}^{2}} + const .

(M.55)

The first and second derivatives evaluated at w_tar,i = μ_i are

L^{'} (μ_{i}) = \frac{(δ - (μ_{i} - w_{i}) x_{i}) x_{i}}{σ_{δ}^{2}} \approx \frac{δ x_{i}}{σ_{δ}^{2}}

(M.56a)

L^{″} (μ_{i}) = - \frac{x_{i}^{2}}{σ_{δ}^{2}} .

(M.56b)

(We are justified in dropping the term (μ_i − w_i)x_i because it is a factor of $\sqrt{n}$ smaller than δ. That follows because $σ_{δ}^{2}$ , which is the variance of δ, is $𝒪 (n)$ ; see Eq. M.36.) Inserting these expressions into Eq. (M.29) (with τ taken to ∞ because, as discussed in Sec. M1.1, we are assuming the target weights do not drift over time), we have

Δ μ_{i} = \frac{σ_{i}^{2}}{σ_{δ}^{2}} δ x_{i}

(M.57a)

Δ σ_{i}^{2} = - \frac{σ_{i}^{4}}{σ_{δ}^{2}} x_{i}^{2},

(M.57b)

where Δμ_i = μ_i(t + Δt) − μ_i(t) and similarly for $Δ σ_{i}^{2}$ .

Primarily for convenience, we make a third approximation, which is to update the weights continuously rather than at discrete points separated by Δt. To do that we simply make the approximation Δμ_i ≈ Δtdμ_i/dt, and similarly for $Δ σ_{i}^{2}$ . This allows us to turn the update rules into ordinary differential equations,

\frac{d μ_{i}}{d t} = \frac{1}{Δ t} \frac{σ_{i}^{2}}{σ_{δ}^{2}} δ x_{i}

(M.58a)

\frac{d σ_{i}^{2}}{d t} = - \frac{1}{Δ t} \frac{σ_{i}^{4}}{σ_{δ}^{2}} x_{i}^{2} .

(M.58b)

Then, defining

η_{i} \equiv \frac{σ_{i}^{2}}{σ_{δ}^{2} Δ t},

(M.59)

inserting this into into Eq. (M.58) and, in our fourth approximation, ignoring the time dependence in $σ_{δ}^{2}$ , those equations simplify to

\frac{d μ_{i}}{d t} = η_{i} δ x_{i}

(M.60a)

\frac{d η_{i}}{d t} = - η_{i}^{2} x_{i}^{2} .

(M.60b)

Optimizing over Δt corresponds to optimizing over the initial value of η_i, which we assumed is the same for all i. This optimization is done via numerical simulations.

For the classical learning rules, we drop Eq. (M.60b) and fix η_i to the same value for all synapses.

Further methods details can be found in the attached Life Sciences Reporting Summary.

Supplementary Material

EMS114909-supplement-1.pdf^{(550.2KB, pdf)}

Acknowledgements

LA and PEL were supported by the Gatsby Charitable Foundation; PEL was also supported by the Wellcome Trust (110114/Z/15/Z); JJ and JPP were supported by the Swiss National Science Foundation (PP00P3 150637); JAM was supported by UCL Graduate Research and UCL Overseas Research Scholarships; AP was supported by grants from the Simons Collaboration for the Global Brain and the Swiss National Foundation (31003A 165831).

Footnotes

Author Contributions

AP and PEl were involved in the initial formulation of the problem. LA and PEL were involved in the theoretical development. LA, JJ and JM did the simulations and data analysis. PEL, JP and AP were involved in writing the manuscript.

Competing interests

The authors declare no competing interests.

Code Availability

Code and data are available for download at: https://github.com/Jegmi/the-bayesian-synapse/releases/tag/v2

Data Availability

Code and data are available for download at: https://github.com/Jegmi/the-bayesian-synapse/releases/tag/v2

References

[1].Poggio T. Cold Spring Harbor Symposia on Quantitative Biology. Vol. 55. Cold Spring Harbor Laboratory Press; 1990. A theory of how the brain might work. [DOI] [PubMed] [Google Scholar]
[2].Knill DC, Richards W. Perception as Bayesian Inference. Cambridge University Press; 1996. [Google Scholar]
[3].Pouget A, et al. Probabilistic brains: knowns and unknowns. Nature Neuroscience. 2013;16(9):1170–1178. doi: 10.1038/nn.3495. [DOI] [PMC free article] [PubMed] [Google Scholar]
[4].Aitchison L. Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods. NeurIPS. 2020 [Google Scholar]
[5].Tripathy SJ, et al. Brain-wide analysis of electrophysiological diversity yields novel categorization of mammalian neuron types. Journal of Neurophysiology. 2015;113(10):3474–3489. doi: 10.1152/jn.00237.2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
[6].Schiess M, Urbanczik R, Senn W. Somato-dendritic Synaptic Plasticity and Error-backpropagation in Active Dendrites. PLoS Computational Biology. 2016;12(2):e1004638. doi: 10.1371/journal.pcbi.1004638. [DOI] [PMC free article] [PubMed] [Google Scholar]
[7].Bono J, Clopath C. Modeling somatic and dendritic spike mediated plasticity at the single neuron and network level. Nature Communications. 2017;8:706. doi: 10.1038/s41467-017-00740-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
[8].Sacramento J, et al. In: Advances in Neural Information Processing Systems. 31. Bengio S, et al., editors. Curran Associates, Inc; 2018. Dendritic cortical microcircuits approximate the backpropagation algorithm; pp. 8721–8732. [Google Scholar]
[9].Illing B, Gerstner W, Brea J. Biologically plausible deep learning – But how far can we go with shallow networks? Neural Networks. 2019;118:90–101. doi: 10.1016/j.neunet.2019.06.001. [DOI] [PubMed] [Google Scholar]
[10].Akrout M, et al. Deep Learning without Weight Transport. arXiv. 2019 190ļ.05391. [Google Scholar]
[11].Ito M, Sakurai M, Tongroach P. Climbing fibre induced depression of both mossy fibre responsiveness and glutamate sensitivity of cerebellar Purkinje cells. Journal of Physiology. 1982;324(1):113–134. doi: 10.1113/jphysiol.1982.sp014103. [DOI] [PMC free article] [PubMed] [Google Scholar]
[12].Eccles J, Llinas R, Sasaki K. The excitatory synaptic action of climbing fibres on the Purkinje cells of the cerebellum. Journal of Physiology. 1966;182(2):268–296. doi: 10.1113/jphysiol.1966.sp007824. [DOI] [PMC free article] [PubMed] [Google Scholar]
[13].Widrow B, Hoff ME. Adaptive switching circuits. Office of Naval Research Technical Report. 1960;1553(1) [Google Scholar]
[14].Dayan P, Abbott LF. Theoretical Neuroscience. MIT Press; Cambridge, MA: 2001. [Google Scholar]
[15].Ko H, et al. The emergence of functional microcircuits in visual cortex. Nature. 2013;496(7443):96–100. doi: 10.1038/nature12015. [DOI] [PMC free article] [PubMed] [Google Scholar]
[16].Thomson AM. Presynaptic frequency- and pattern-dependent filtering. Journal of Computational Neuroscience. 2003;15(2):159–202. doi: 10.1023/a:1025812808362. [DOI] [PubMed] [Google Scholar]
[17].Tsodyks MV, Markram H. The neural code between neocortical pyramidal neu-rons depends on neurotransmitter release probability. Proceedings of the National Academy of Sciences; 1997. pp. 719–723. [DOI] [PMC free article] [PubMed] [Google Scholar]
[18].Maffei A, Turrigiano GG. Multiple modes of network homeostasis in visual cortical layer 2/3. Journal of Neuroscience. 2008;28(17):4377–4384. doi: 10.1523/JNEUROSCI.5298-07.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
[19].Hoyer PO, Hyvarinen A. Interpreting neural response variability as Monte Carlo sampling of the posterior. Advances in Neural Information Processing Systems. 2003:293–300. [Google Scholar]
[20].Fiser J, et al. Statistically optimal perception and learning: from behavior to neural representations. Trends in Cognitive Sciences. 2010;14(3):119–130. doi: 10.1016/j.tics.2010.01.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
[21].Berkes P, et al. Spontaneous Cortical Activity Reveals Hallmarks of an Optimal Internal Model of the Environment. Science. 2011;331(6013):83–87. doi: 10.1126/science.1195870. [DOI] [PMC free article] [PubMed] [Google Scholar]
[22].Orbán G, et al. Neural Variability and Sampling-Based Probabilistic Representations in the Visual Cortex. Neuron. 2016;92(2):530–543. doi: 10.1016/j.neuron.2016.09.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
[23].Haefner RM, Berkes P, Fiser J. Perceptual decision-making as probabilistic inference by neural sampling. Neuron. 2016;90(3):649–660. doi: 10.1016/j.neuron.2016.03.020. [DOI] [PubMed] [Google Scholar]
[24].Aitchison L, Lengyel M. The Hamiltonian BraIn: Efficient Probabilistic Inference with Excitatory-Inhibitory Neural Circuit Dynamics. PLoS Computational Biology. 12(12):e1005186. doi: 10.1371/journal.pcbi.1005186. [DOI] [PMC free article] [PubMed] [Google Scholar]
[25].Lange RD, Haefner RM. Task-induced neural covariability as a signature of approximate Bayesian learning and inference. bioRxiv. 2020 doi: 10.1371/journal.pcbi.1009557. [DOI] [PMC free article] [PubMed] [Google Scholar]
[26].Ma WJ, et al. Bayesian inference with probabilistic population codes. Nature Neuroscience. 2006;9(11):1432–1438. doi: 10.1038/nn1790. [DOI] [PubMed] [Google Scholar]
[27].Buntine WL, Weigend AS. Bayesian backpropagation. Complex systems. 1991;5(6):603–643. [Google Scholar]
[28].MacKay DJ. A practical Bayesian framework for backpropagation networks. Neural Computation. 1992;4(3):448–472. [Google Scholar]
[29].Blundell C, et al. Weight uncertainty in neural networks. arXiv. 2015 1505.05424. [Google Scholar]
[30].Kirkpatrick J, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences. 2016;106(25):10296–10301. doi: 10.1073/pnas.1611835114. [DOI] [PMC free article] [PubMed] [Google Scholar]
[31].Dayan P, Kakade S. Advances in Neural Information Processing Systems. Vol. 13. MIT Press; 2001. Explaining Away in Weight Space; pp. 451–457. [Google Scholar]
[32].Kappel D, et al. Network Plasticity as Bayesian Inference. PLoS computational biology. 2015;11(11):e1004485. doi: 10.1371/journal.pcbi.1004485. [DOI] [PMC free article] [PubMed] [Google Scholar]
[33].Hiratani N, Fukai T. Redundancy in synaptic connections enables neurons to learn optimally. Proceedings of the National Academy of Sciences. 2018;115(29):E6871–E6879. doi: 10.1073/pnas.1803274115. [DOI] [PMC free article] [PubMed] [Google Scholar]
[34].Drugowitsch J, et al. Learning optimal decisions with confidence. bioRxiv. 2019 doi: 10.1073/pnas.1906787116. [DOI] [PMC free article] [PubMed] [Google Scholar]
[35].Pfister J-P, Dayan P, Lengyel M. Synapses with short-term plasticity are op-timal estimators of presynaptic membrane potentials. Nature Neuroscience. 2010;13(10):1271–1275. doi: 10.1038/nn.2640. [DOI] [PMC free article] [PubMed] [Google Scholar]
[36].Kasai H, Takahashi N, Tokumaru H. Distinct initial SNARE configurations underlying the diversity of exocytosis. Physiological Reviews. 2012;92:1915–1964. doi: 10.1152/physrev.00007.2012. [DOI] [PubMed] [Google Scholar]
[37].Südhof TC. The presynaptic active zone. Neuron. 2012;75:11–25. doi: 10.1016/j.neuron.2012.06.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
[38].Michel K, et al. The presynaptic active zone: A dynamic scaffold that regulates synap-tic efficacy. Experimental Cell Research. 2015;335:157–164. doi: 10.1016/j.yexcr.2015.02.011. [DOI] [PubMed] [Google Scholar]
[39].Frey U, Morris RG. Synaptic tagging and long-term potentiation. Nature. 1997;385(6616):533–536. doi: 10.1038/385533a0. [DOI] [PubMed] [Google Scholar]
[40].Redondo RL, Morris RGM. Making memories last: the synaptic tagging and capture hypothesis. Nature Reviews Neuroscience. 2011;12(1):17–30. doi: 10.1038/nrn2963. [DOI] [PubMed] [Google Scholar]
[41].Rogerson T, et al. Synaptic tagging during memory allocation. Nature Reviews Neuroscience. 2014;15(3):157–169. doi: 10.1038/nrn3667. [DOI] [PMC free article] [PubMed] [Google Scholar]
[42].Abraham WC, Bear MF. Metaplasticity: the plasticity of synaptic plasticity. Trends in Neurosciences. 1996;4:126–130. doi: 10.1016/s0166-2236(96)80018-x. [DOI] [PubMed] [Google Scholar]
[43].Abraham WC. Metaplasticity: tuning synapses and networks for plasticity. Nature Reviews Neuroscience. 2008;9(5):387. doi: 10.1038/nrn2356. [DOI] [PubMed] [Google Scholar]
[44].Hulme SR, et al. Mechanisms of heterosynaptic metaplasticity. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences. 2014;369(1633) doi: 10.1098/rstb.2013.0148. 20130148. [DOI] [PMC free article] [PubMed] [Google Scholar]
[45].Vogelstein JT, et al. Fast nonnegative deconvolution for spike train inference from population calcium imaging. Journal of Neurophysiology. 2010;104(6):3691–3704. doi: 10.1152/jn.01073.2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
[46].Packer AM, et al. Simultaneous all-optical manipulation and recording of neural cir-cuit activity with cellular resolution in vivo. Nature Methods. 2015;12(2):140–146. doi: 10.1038/nmeth.3217. [DOI] [PMC free article] [PubMed] [Google Scholar]
[47].Loewenstein Y, Kuras A, Rumpel S. Multiplicative Dynamics Underlie the Emergence of the Log-Normal Distribution of Spine Sizes in the Neocortex In Vivo. Journal of Neuroscience. 2011;31(26):9481–9488. doi: 10.1523/JNEUROSCI.6130-10.2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
[48].Matsuzaki M, et al. Structural basis of long-term potentiation in single dendritic spines. Nature. 2004;429(6993):761–766. doi: 10.1038/nature02617. [DOI] [PMC free article] [PubMed] [Google Scholar]
[49].Song S, et al. Highly nonrandom features of synaptic connectivity in local cortical circuits. PLoS Biology. 2005;3(3):e68. doi: 10.1371/journal.pbio.0030068. [DOI] [PMC free article] [PubMed] [Google Scholar]
[50].O’Connor DH, et al. Neural activity in barrel cortex underlying vibrissa-based object localization in mice. Neuron. 2010;67(6):1048–1061. doi: 10.1016/j.neuron.2010.08.026. [DOI] [PubMed] [Google Scholar]
[51].Mizuseki K, Buzsáki G. Preconfigured, Skewed Distribution of Firing Rates in the Hippocampus and Entorhinal Cortex. Cell Reports. 2013;4(5):1010–1021. doi: 10.1016/j.celrep.2013.07.039. [DOI] [PMC free article] [PubMed] [Google Scholar]
[52].Minka TP. A family of algorithms for approximate Bayesian inference. PhD thesis. Massachusetts Institute of Technology; 2001. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

EMS114909-supplement-1.pdf^{(550.2KB, pdf)}

Data Availability Statement

Code and data are available for download at: https://github.com/Jegmi/the-bayesian-synapse/releases/tag/v2

[R1] [1].Poggio T. Cold Spring Harbor Symposia on Quantitative Biology. Vol. 55. Cold Spring Harbor Laboratory Press; 1990. A theory of how the brain might work. [DOI] [PubMed] [Google Scholar]

[R2] [2].Knill DC, Richards W. Perception as Bayesian Inference. Cambridge University Press; 1996. [Google Scholar]

[R3] [3].Pouget A, et al. Probabilistic brains: knowns and unknowns. Nature Neuroscience. 2013;16(9):1170–1178. doi: 10.1038/nn.3495. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] [4].Aitchison L. Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods. NeurIPS. 2020 [Google Scholar]

[R5] [5].Tripathy SJ, et al. Brain-wide analysis of electrophysiological diversity yields novel categorization of mammalian neuron types. Journal of Neurophysiology. 2015;113(10):3474–3489. doi: 10.1152/jn.00237.2015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] [6].Schiess M, Urbanczik R, Senn W. Somato-dendritic Synaptic Plasticity and Error-backpropagation in Active Dendrites. PLoS Computational Biology. 2016;12(2):e1004638. doi: 10.1371/journal.pcbi.1004638. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] [7].Bono J, Clopath C. Modeling somatic and dendritic spike mediated plasticity at the single neuron and network level. Nature Communications. 2017;8:706. doi: 10.1038/s41467-017-00740-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] [8].Sacramento J, et al. In: Advances in Neural Information Processing Systems. 31. Bengio S, et al., editors. Curran Associates, Inc; 2018. Dendritic cortical microcircuits approximate the backpropagation algorithm; pp. 8721–8732. [Google Scholar]

[R9] [9].Illing B, Gerstner W, Brea J. Biologically plausible deep learning – But how far can we go with shallow networks? Neural Networks. 2019;118:90–101. doi: 10.1016/j.neunet.2019.06.001. [DOI] [PubMed] [Google Scholar]

[R10] [10].Akrout M, et al. Deep Learning without Weight Transport. arXiv. 2019 190ļ.05391. [Google Scholar]

[R11] [11].Ito M, Sakurai M, Tongroach P. Climbing fibre induced depression of both mossy fibre responsiveness and glutamate sensitivity of cerebellar Purkinje cells. Journal of Physiology. 1982;324(1):113–134. doi: 10.1113/jphysiol.1982.sp014103. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] [12].Eccles J, Llinas R, Sasaki K. The excitatory synaptic action of climbing fibres on the Purkinje cells of the cerebellum. Journal of Physiology. 1966;182(2):268–296. doi: 10.1113/jphysiol.1966.sp007824. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] [13].Widrow B, Hoff ME. Adaptive switching circuits. Office of Naval Research Technical Report. 1960;1553(1) [Google Scholar]

[R14] [14].Dayan P, Abbott LF. Theoretical Neuroscience. MIT Press; Cambridge, MA: 2001. [Google Scholar]

[R15] [15].Ko H, et al. The emergence of functional microcircuits in visual cortex. Nature. 2013;496(7443):96–100. doi: 10.1038/nature12015. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] [16].Thomson AM. Presynaptic frequency- and pattern-dependent filtering. Journal of Computational Neuroscience. 2003;15(2):159–202. doi: 10.1023/a:1025812808362. [DOI] [PubMed] [Google Scholar]

[R17] [17].Tsodyks MV, Markram H. The neural code between neocortical pyramidal neu-rons depends on neurotransmitter release probability. Proceedings of the National Academy of Sciences; 1997. pp. 719–723. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] [18].Maffei A, Turrigiano GG. Multiple modes of network homeostasis in visual cortical layer 2/3. Journal of Neuroscience. 2008;28(17):4377–4384. doi: 10.1523/JNEUROSCI.5298-07.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] [19].Hoyer PO, Hyvarinen A. Interpreting neural response variability as Monte Carlo sampling of the posterior. Advances in Neural Information Processing Systems. 2003:293–300. [Google Scholar]

[R20] [20].Fiser J, et al. Statistically optimal perception and learning: from behavior to neural representations. Trends in Cognitive Sciences. 2010;14(3):119–130. doi: 10.1016/j.tics.2010.01.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] [21].Berkes P, et al. Spontaneous Cortical Activity Reveals Hallmarks of an Optimal Internal Model of the Environment. Science. 2011;331(6013):83–87. doi: 10.1126/science.1195870. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] [22].Orbán G, et al. Neural Variability and Sampling-Based Probabilistic Representations in the Visual Cortex. Neuron. 2016;92(2):530–543. doi: 10.1016/j.neuron.2016.09.038. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] [23].Haefner RM, Berkes P, Fiser J. Perceptual decision-making as probabilistic inference by neural sampling. Neuron. 2016;90(3):649–660. doi: 10.1016/j.neuron.2016.03.020. [DOI] [PubMed] [Google Scholar]

[R24] [24].Aitchison L, Lengyel M. The Hamiltonian BraIn: Efficient Probabilistic Inference with Excitatory-Inhibitory Neural Circuit Dynamics. PLoS Computational Biology. 12(12):e1005186. doi: 10.1371/journal.pcbi.1005186. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] [25].Lange RD, Haefner RM. Task-induced neural covariability as a signature of approximate Bayesian learning and inference. bioRxiv. 2020 doi: 10.1371/journal.pcbi.1009557. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] [26].Ma WJ, et al. Bayesian inference with probabilistic population codes. Nature Neuroscience. 2006;9(11):1432–1438. doi: 10.1038/nn1790. [DOI] [PubMed] [Google Scholar]

[R27] [27].Buntine WL, Weigend AS. Bayesian backpropagation. Complex systems. 1991;5(6):603–643. [Google Scholar]

[R28] [28].MacKay DJ. A practical Bayesian framework for backpropagation networks. Neural Computation. 1992;4(3):448–472. [Google Scholar]

[R29] [29].Blundell C, et al. Weight uncertainty in neural networks. arXiv. 2015 1505.05424. [Google Scholar]

[R30] [30].Kirkpatrick J, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences. 2016;106(25):10296–10301. doi: 10.1073/pnas.1611835114. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] [31].Dayan P, Kakade S. Advances in Neural Information Processing Systems. Vol. 13. MIT Press; 2001. Explaining Away in Weight Space; pp. 451–457. [Google Scholar]

[R32] [32].Kappel D, et al. Network Plasticity as Bayesian Inference. PLoS computational biology. 2015;11(11):e1004485. doi: 10.1371/journal.pcbi.1004485. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R33] [33].Hiratani N, Fukai T. Redundancy in synaptic connections enables neurons to learn optimally. Proceedings of the National Academy of Sciences. 2018;115(29):E6871–E6879. doi: 10.1073/pnas.1803274115. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] [34].Drugowitsch J, et al. Learning optimal decisions with confidence. bioRxiv. 2019 doi: 10.1073/pnas.1906787116. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] [35].Pfister J-P, Dayan P, Lengyel M. Synapses with short-term plasticity are op-timal estimators of presynaptic membrane potentials. Nature Neuroscience. 2010;13(10):1271–1275. doi: 10.1038/nn.2640. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R36] [36].Kasai H, Takahashi N, Tokumaru H. Distinct initial SNARE configurations underlying the diversity of exocytosis. Physiological Reviews. 2012;92:1915–1964. doi: 10.1152/physrev.00007.2012. [DOI] [PubMed] [Google Scholar]

[R37] [37].Südhof TC. The presynaptic active zone. Neuron. 2012;75:11–25. doi: 10.1016/j.neuron.2012.06.012. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R38] [38].Michel K, et al. The presynaptic active zone: A dynamic scaffold that regulates synap-tic efficacy. Experimental Cell Research. 2015;335:157–164. doi: 10.1016/j.yexcr.2015.02.011. [DOI] [PubMed] [Google Scholar]

[R39] [39].Frey U, Morris RG. Synaptic tagging and long-term potentiation. Nature. 1997;385(6616):533–536. doi: 10.1038/385533a0. [DOI] [PubMed] [Google Scholar]

[R40] [40].Redondo RL, Morris RGM. Making memories last: the synaptic tagging and capture hypothesis. Nature Reviews Neuroscience. 2011;12(1):17–30. doi: 10.1038/nrn2963. [DOI] [PubMed] [Google Scholar]

[R41] [41].Rogerson T, et al. Synaptic tagging during memory allocation. Nature Reviews Neuroscience. 2014;15(3):157–169. doi: 10.1038/nrn3667. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R42] [42].Abraham WC, Bear MF. Metaplasticity: the plasticity of synaptic plasticity. Trends in Neurosciences. 1996;4:126–130. doi: 10.1016/s0166-2236(96)80018-x. [DOI] [PubMed] [Google Scholar]

[R43] [43].Abraham WC. Metaplasticity: tuning synapses and networks for plasticity. Nature Reviews Neuroscience. 2008;9(5):387. doi: 10.1038/nrn2356. [DOI] [PubMed] [Google Scholar]

[R44] [44].Hulme SR, et al. Mechanisms of heterosynaptic metaplasticity. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences. 2014;369(1633) doi: 10.1098/rstb.2013.0148. 20130148. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R45] [45].Vogelstein JT, et al. Fast nonnegative deconvolution for spike train inference from population calcium imaging. Journal of Neurophysiology. 2010;104(6):3691–3704. doi: 10.1152/jn.01073.2009. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R46] [46].Packer AM, et al. Simultaneous all-optical manipulation and recording of neural cir-cuit activity with cellular resolution in vivo. Nature Methods. 2015;12(2):140–146. doi: 10.1038/nmeth.3217. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R47] [47].Loewenstein Y, Kuras A, Rumpel S. Multiplicative Dynamics Underlie the Emergence of the Log-Normal Distribution of Spine Sizes in the Neocortex In Vivo. Journal of Neuroscience. 2011;31(26):9481–9488. doi: 10.1523/JNEUROSCI.6130-10.2011. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R48] [48].Matsuzaki M, et al. Structural basis of long-term potentiation in single dendritic spines. Nature. 2004;429(6993):761–766. doi: 10.1038/nature02617. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R49] [49].Song S, et al. Highly nonrandom features of synaptic connectivity in local cortical circuits. PLoS Biology. 2005;3(3):e68. doi: 10.1371/journal.pbio.0030068. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R50] [50].O’Connor DH, et al. Neural activity in barrel cortex underlying vibrissa-based object localization in mice. Neuron. 2010;67(6):1048–1061. doi: 10.1016/j.neuron.2010.08.026. [DOI] [PubMed] [Google Scholar]

[R51] [51].Mizuseki K, Buzsáki G. Preconfigured, Skewed Distribution of Firing Rates in the Hippocampus and Entorhinal Cortex. Cell Reports. 2013;4(5):1010–1021. doi: 10.1016/j.celrep.2013.07.039. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R52] [52].Minka TP. A family of algorithms for approximate Bayesian inference. PhD thesis. Massachusetts Institute of Technology; 2001. [Google Scholar]

PERMALINK

Synaptic plasticity as Bayesian inference

Laurence Aitchison

Jannes Jegminat

Jorge Aurelio Menendez

Jean-Pascal Pfister

Alex Pouget

Peter E Latham

Abstract

1. Introduction

2. Results

Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

3. Discussion

Methods

M1. Description of our model

M1.1. Target weights

M1.2. Synaptic weights

M1.3. Synaptic input

M2. Learning rules

M2.1. Single neuron learning rules: general formalism

M2.2. Single neuron learning rules for our three models

M2.3. Recurrent neural network learning rules

Supplementary Material

Acknowledgements

Footnotes

Code Availability

Data Availability

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Synaptic plasticity as Bayesian inference

Laurence Aitchison

Jannes Jegminat

Jorge Aurelio Menendez

Jean-Pascal Pfister

Alex Pouget

Peter E Latham

Abstract

1. Introduction

2. Results

Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

3. Discussion

Methods

M1. Description of our model

M1.1. Target weights

M1.2. Synaptic weights

M1.3. Synaptic input

M2. Learning rules

M2.1. Single neuron learning rules: general formalism

M2.2. Single neuron learning rules for our three models

M2.3. Recurrent neural network learning rules

Supplementary Material

Acknowledgements

Footnotes

Code Availability

Data Availability

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases