PLOS Computational Biology. 2024 Mar 5;20(3):e1011926. doi: 10.1371/journal.pcbi.1011926

Unsupervised learning of perceptual feature combinations

Minija Tamosiunaite 1,2,*, Christian Tetzlaff 3,4, Florentin Wörgötter 1
Editor: Daniel Bush 5
PMCID: PMC10942261  PMID: 38442095

Abstract

In many situations it is behaviorally relevant for an animal to respond to co-occurrences of perceptual, possibly polymodal features, while these features alone may have no importance. Thus, it is crucial for animals to learn such feature combinations in spite of the fact that they may occur with variable intensity and occurrence frequency. Here, we present a novel unsupervised learning mechanism that is largely independent of these contingencies and allows neurons in a network to achieve specificity for different feature combinations. This is achieved by a novel correlation-based (Hebbian) learning rule, which allows for linear weight growth and which is combined with a mechanism for gradually reducing the learning rate as soon as the neuron’s response becomes feature-combination specific. In a set of control experiments, we show that other existing advanced learning rules cannot satisfactorily form ordered multi-feature representations. In addition, we show that networks that use this type of learning always stabilize and converge to subsets of neurons with different feature-combination specificity. Neurons with this property may, thus, serve as an initial stage for the processing of ecologically relevant real-world situations for an animal.

Author summary

During foraging and exploration, the neural system of animals is flooded with numerous sensory features. From this confusing signal repertoire, it needs to learn to extract relevant events, which are often encoded by specific perceptual feature combinations. For example, a specific smell and some distinct visual attribute may be meaningful when occurring together, while by themselves these features are irrelevant. Learning this is complicated by the fact that sensory signals occur with different intensity and occurrence frequency, beyond the control of the animal. Here we show that it is possible to train neurons with external signals in an unsupervised way so that they learn to respond specifically to different feature combinations, largely unaffected by such presentation contingencies. This is achieved by a novel learning rule which achieves stable neuronal responses in a simple way by gradually reducing the learning rate at its synapses as soon as the neuron’s response to the feature combination exceeds a certain level. This allows neurons in a network to code for different feature combinations and may facilitate ecologically meaningful evaluation of perceived situations by the animal.

Introduction

Coincident events or features can be highly relevant for animals and humans, and recognizing feature combinations may make all the difference between danger and safety. The red color of a mushroom paired with white surface dots, as compared to a red one with a plain surface, makes the difference between the poisonous Amanita muscaria (toadstool) and the edible Amanita caesarea. While humans usually learn such feature combinations by supervision, animals often do so via trial and error. For example, rats and other animals perform scouting and probing of novel food sources until these are found to be safe. Repeated exposure to sensor-perceivable feature combinations in conjunction with no negative effects will then lead to the conclusion that the food should be safe to eat.

A central problem that arises here is that features will not only occur in combination but also on their own. This will happen with different individual- as well as coincidence-occurrence frequencies and, in addition, usually also with variable intensity. Thus, in order to learn the meaning of combined features, the nervous system must learn this without being distracted by this variability. While supervised learning methods, like the LMS algorithm [1], could address this problem in an efficient way, here we investigate unsupervised learning. The latter is much simpler from the point of view of biological implementation, as it does not require additional evaluative sub-systems and mechanisms. As shown below, with unsupervised learning one can detect feature combinations already at the level of a single neuron. To achieve this, neuronal plasticity must come to a halt as soon as a combination has been recognized, otherwise ongoing plasticity would lead to undesired responses to individual features.

Such an ecologically driven stopping of learning is a non-trivial problem for unsupervised learning, though. For example, Hebbian learning leads to unbounded (divergent) weight growth. Many stabilization methods and/or augmentations of the original Hebbian learning rule have been suggested to prevent this, for example Oja’s rule [2], the Bienenstock, Cooper, Munro rule (BCM, [3]), subtractive normalization methods [4] and several more. More recently, learning rules have been introduced which combine Hebbian weight growth with a homeostatic, balancing term, called synaptic scaling [5–7], for achieving convergence to a target activity [8]. However, below we will show that these methods cannot reliably address the problem of differentiating cases with coincidences of two or more features.

As a consequence, the issue of how to control weight development in an unsupervised way such that a neuron will reliably code for multiple-feature combinations remains unresolved. Here we suggest a rather simple solution to this. When growing weights in a network, combination-selective responses can be achieved by gradually dropping the learning rate to zero (simulated annealing) as soon as the neuron’s activity is getting “large enough”, which happens earlier for combined than for individual stimuli. Before describing details, this mechanism can best be understood by an example (Fig 1). Here two similar inputs (A,B) were presented randomly with some coincidence between them (vertical dashed lines). Due to learning, the neuron’s output (C) gradually gets bigger, where the response amplitudes that occur for coincident inputs will always exceed individual ones. When a response passes the annealing threshold, the learning rate is reduced (red curve in D) and after some time, when the rate has dropped to zero, weight growth will come to a standstill. This mechanism keeps the individual responses small as compared to the combined ones and the resulting output distributions (inset) remain separate with a large gap between single and coincident responses.

Fig 1.

Exemplary development of the response (C), synaptic weights (D), and annealing characteristic (D, red) for a neuron with two inputs (A,B) with mean amplitudes of 1 and 1.2, respectively, and same average occurrence frequency; dashed lines show input coincidences. Inset shows the finally resulting distribution of neural responses between an activation of zero (left) and one (right).

Simulated annealing has become a textbook method in reinforcement learning (RL), for example for step-size reduction [9] or for reducing exploration rates [10] as well as in deep-RL [11]. Annealing is also widely used in supervised learning [12–14] as well as in different variants of Hebbian learning [15–18], the latter being most closely related to the investigations in this study. However, annealing in those studies is applied as an additional mechanism to ensure an efficient convergence of weights, while we are not aware of studies which would analyze annealing as the main factor for activity stabilization on its own.

Central to our approach is that the principle of using the neuron’s output as the determining variable for the annealing leads to the advantageous property that neurons in a network will indeed develop specificity for different input (or feature) combinations. This differs from mere spike-coincidence detection because—as discussed above—the learning of input combination specificity needs to be independent (within reason) of the intensity of the input, represented by its occurrence frequency and its amplitude (or input firing rate). Amplitude invariance can to some degree be achieved using network-intrinsic normalization (e.g. [19]) by which differently strong activity, e.g. from external sensory features that converges onto a cell, will still lead to similar, albeit not identical, responses. These could then serve as the normalized inputs to the learning neuron. Another aspect that leads to problems is that learning needs repetitions. However, the brain has little or no influence on the occurrence frequency of any external stimulus or stimulus combination. Hence, to reliably learn input combination specificity the system must tolerate quite some variability in the occurrence frequencies of the different inputs as well as concerning their coincidences.

The central contribution of this study is showing that the annealing mechanism allows reliably encoding coincident feature combinations of two or more features in spite of amplitude and frequency variations of the input signals. This is achieved without having to adjust the neuron’s parameters for different stimulus situations. In addition, we show that—for multiple inputs—the neurons’ output distributions are ordered by the total (average) input intensities. This is another factor, which could be ecologically relevant as those neurons this way represent quite faithfully “what comes in from the environment”.

We start this investigation in the first part of the paper by analysing a simple case of a neuron with only two inputs and compare our rule to other, conventional learning mechanisms. Then we address the aspect of multi-input ordering. We show that other rules fail to achieve these properties and provide also a detailed analysis of the BCM rule, which could be seen as a contender to our approach. This is then extended to a recurrent network to address the issue of multiple coincidences. We finally discuss possible biological mechanisms that might support this function and also other issues concerning the learning of input coincidences.

Materials and methods

In this section we will first describe our neuron model, then the learning rule that we are proposing. Afterwards we briefly specify the traditional learning rules to which we are comparing the newly proposed method. Finally, we describe a setup, where we have embedded this rule in a recurrent neural network.

Neuron model

To obtain the neuronal response, first we calculate the weighted sum of the inputs:

$$y = \omega^{T} u, \tag{1}$$

where $u = (u_1, \dots, u_n)^{T}$ are inputs, $\omega = (\omega_1, \dots, \omega_n)^{T}$ are weights, and n is the number of inputs. In analogy to real neurons, we will call y the membrane potential. We will first analyze the simplest neuron that can detect coincidences with n = 2, but will increase the number of inputs in the later-shown recurrent network example. To calculate the actual neuronal response v, called spike rate or rate, we apply a nonlinear function:

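$$v = f_s(y) = \begin{cases} \dfrac{1}{0.9}\left(\dfrac{1}{1+e^{-b(y-0.5)}} - 0.1\right), & \text{if } \dfrac{1}{1+e^{-b(y-0.5)}} - 0.1 > 0,\\ 0, & \text{otherwise,} \end{cases} \tag{2}$$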

where b = 10 if not indicated differently. Coefficients were set to obtain the response characteristic shown in Fig 2A. This represents a sigmoidal function with a threshold yT ≈ 0.281, beneath which the firing rate will be zero:

Fig 2. Functions used in model equations.

A: Neural activation, see Eq (2); we use a saturating function, where in case of small membrane potential (y) the activation v is zero, which is based on empirical observations (see, e.g., [20, 21]); B: Annealing function, which renders a close to zero annealing rate until the threshold value va is reached, and afterwards increases abruptly, see Eqs (4) and (5); note that this function is additionally scaled by the annealing rate ρ in Eq (4).
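As an illustration of Eq (2), the activation can be written in a few lines of Python. This is a minimal sketch, not the code from the S1 repository; the names `f_s` and `rate_threshold` and the overflow guard are our own additions. The threshold follows from setting the inner sigmoid equal to 0.1, which gives y_T = 0.5 − ln(9)/b.

```python
import math

def f_s(y, b=10.0):
    """Firing rate v = f_s(y) of Eq (2) for a membrane potential y."""
    z = -b * (y - 0.5)
    if z > 700.0:              # guard: math.exp would overflow for very negative y
        return 0.0
    shifted = 1.0 / (1.0 + math.exp(z)) - 0.1
    # Rescale by 1/0.9 so that the rate saturates at 1; clip at zero below threshold.
    return shifted / 0.9 if shifted > 0.0 else 0.0

def rate_threshold(b=10.0):
    """Membrane potential at which the inner sigmoid equals 0.1 (rate becomes positive)."""
    return 0.5 - math.log(9.0) / b

print(rate_threshold())               # roughly 0.28 for b = 10
print(f_s(0.2), f_s(0.5), f_s(1.5))   # zero below threshold, then rising towards 1
```

With b = 10 the rate is exactly zero for small y and approaches one for large y, so all outputs stay in the interval [0, 1].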

The Hebb rule

We investigate the effect of learning rate annealing on the Hebb rule given by:

$$\frac{d\omega}{dt} = \mu(t)\, u\, g(y), \tag{3}$$

with u the input vector of the neuron, g(y) the influence of the neuron’s output on the learning and μ(t) the learning rate, which will change over time due to annealing.

Learning rate annealing

Central to our method, however, is that the spike rate v guides the annealing of the learning rate μ(t), where we start annealing, as soon as the neuron has reached “high enough” outputs v. The annealing equation is as follows:

$$\frac{d\mu}{dt} = -\rho\, S_a(v - v_a)\, \mu, \tag{4}$$

where ρ is a rate factor, va the annealing threshold, and Sa(x) is another sigmoidal function:

$$S_a(x) = \frac{1}{1+e^{-\beta x}}, \tag{5}$$

where we used β = 100 to obtain a steep step-like transition (see Fig 2B). However, the method will work in a similar way with several times bigger or smaller β. Learning starts at t = 0 with μ(0) = μ0. The sigmoidal function leads to the following effect: at the time when v exceeds va, the annealing rate abruptly increases. The value for va is expected to be around or higher than the inflection point of v. We have investigated va ≥ 0.45. Note that, if annealing happens too early, the neuron’s differentiation capability remains low, as its activation function non-linearity will not play any role.
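The effect of Eqs (4) and (5) can be sketched with a single Euler step. The function names `S_a` and `anneal_step` are hypothetical, and the default values for va, ρ and dt are taken from the figure legends rather than from a definitive implementation.

```python
import math

def S_a(x, beta=100.0):
    """Steep sigmoid of Eq (5)."""
    z = -beta * x
    if z > 700.0:              # guard against overflow in math.exp
        return 0.0
    return 1.0 / (1.0 + math.exp(z))

def anneal_step(mu, v, v_a=0.7, rho=0.1, dt=1.0):
    """One Euler step of Eq (4): mu shrinks only while the rate v is above v_a."""
    return mu + dt * (-rho * S_a(v - v_a) * mu)

mu = 0.0005
print(anneal_step(mu, v=0.5))   # rate below threshold: mu is practically unchanged
print(anneal_step(mu, v=0.9))   # rate above threshold: mu shrinks by about a factor (1 - rho)
```

Because the decay is multiplicative, repeated supra-threshold responses drive μ towards zero geometrically, while sub-threshold responses leave it almost untouched.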

Hebbian learning rules with annealing

We define for Eq 3 different characteristics for g. First, there is a rule, which we call annealed membrane Hebb (AMH) rule, defining g(y) = y, hence:

$$\frac{d\omega}{dt} = \mu(t)\, u\, y \tag{6}$$

This rule leads to exponential weight growth (see Eq 15) due to the fact that the neuronal output coupled with the learning-equation creates a positive feedback loop.

To avoid this problem, we have replaced this rule with one that is largely output independent and leads to linear weight growth (see Eq 13). This so-called annealed Linear Learning (ALL) rule uses g(y) = H(y − η) with H being the Heaviside function:

$$H(y-\eta) = \begin{cases} 1, & \text{if } y - \eta > 0,\\ 0, & \text{otherwise.} \end{cases} \tag{7}$$

Hence, the ALL-rule is given by:

$$\frac{d\omega}{dt} = \mu(t)\, u\, H(y-\eta) \tag{8}$$

This learning rule augments traditional Hebbian learning by the assumption that weight change will not depend on the actual activation of the neuron. Instead learning will start as soon as the membrane potential y exceeds a threshold η and then only depends on the incoming input(s). Analysis of the experimental literature shows (see Discussion) that—especially at dendritic spines—this type of learning may be biophysically more realistic than other variants of the Hebb rule.

Below, we will also show that the ALL rule works best for the here-investigated task of coincidence detection. For simplicity, we used η = 0 but results will not change much as long as one uses reasonably small values for η. Note, that the weight update routine described above in case of η = 0 holds some similarity to Rosenblatt’s perceptron learning rule [22]. However, different from Rosenblatt’s perceptron, where knowledge on the desired outputs is assumed and error terms are used, we analyze unsupervised learning, where self-organization happens without supplying knowledge about the desired output of the neuron. For more considerations on supervised vs. unsupervised learning see Discussion section, subsection “Comparing to other learning principles”.
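To make the interplay of Eqs (2), (4), (5) and (8) concrete, the following sketch trains a two-input neuron with the ALL-rule using Euler integration. It is an illustration under assumptions, not the authors' code: the stimulus protocol (each input present independently with probability `p_present` per step, Gaussian amplitudes) and all function names are hypothetical, while va, ρ, μ0, the initial weights and dt = 1 follow the values quoted in the figure legends.

```python
import math
import random

def f_s(y, b=10.0):
    """Activation of Eq (2)."""
    z = -b * (y - 0.5)
    if z > 700.0:
        return 0.0
    return max(0.0, (1.0 / (1.0 + math.exp(z)) - 0.1) / 0.9)

def S_a(x, beta=100.0):
    """Annealing sigmoid of Eq (5)."""
    z = -beta * x
    if z > 700.0:
        return 0.0
    return 1.0 / (1.0 + math.exp(z))

def train_all(steps=20000, means=(1.0, 1.2), std=0.1, p_present=0.2,
              mu0=0.0005, w0=0.001, v_a=0.7, rho=0.1, eta=0.0, dt=1.0, seed=0):
    """Train a two-input neuron with the ALL-rule; returns (weights, final learning rate)."""
    rng = random.Random(seed)
    w = [w0, w0]
    mu = mu0
    for _ in range(steps):
        # Assumed protocol: each input is active independently with probability p_present.
        u = [rng.gauss(m, std) if rng.random() < p_present else 0.0 for m in means]
        y = w[0] * u[0] + w[1] * u[1]              # Eq (1)
        v = f_s(y)                                 # Eq (2)
        gate = 1.0 if y - eta > 0.0 else 0.0       # Heaviside term, Eq (7)
        w = [wi + dt * mu * ui * gate for wi, ui in zip(w, u)]   # Eq (8)
        mu += dt * (-rho * S_a(v - v_a) * mu)      # Eq (4)
    return w, mu

w, mu = train_all()
print("coincident rate:", f_s(w[0] * 1.0 + w[1] * 1.2))
print("single rates:   ", f_s(w[0] * 1.0), f_s(w[1] * 1.2))
print("final mu:       ", mu)
```

In such a run the response to the coincident mean inputs should end up near or above the annealing threshold, while the responses to either input alone remain small, which is the separation illustrated in Fig 1.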

Note that, in principle, one can also define a Hebb rule that relies on the actual rate v and, hence, considers the output transform (Eq 2) by setting g(y) = v. However, this case, which is governed by the sigmoid output function of the neuron, can be tuned to either approximate the annealed membrane Hebb (AMH) or the annealed linear learning (ALL) rule. Hence, we will not consider it any further.

Reference models

We compared our method to the BCM-rule [3] and the Oja-rule [2] as well as to a newer approach called synaptic scaling [8].

For BCM, there exist several linear as well as non-linear versions in the literature (e.g. [3, 23, 24]). We had analyzed these rules, but here we show results only for the (non-linear) formulation introduced by Intrator and Cooper [23], which superseded the others for the here-investigated tasks. However, in the Results section we will also briefly discuss results from the other BCM rules.

The Intrator-Cooper BCM rule is given by (see [25]):

$$\frac{d\omega}{dt} = \mu\, v\,(v - \Theta_M)\, u\, \frac{dv}{dy}, \tag{9}$$

with $\Theta_M = E(v^2)$, where E represents the expectation value.

We obtain the average described above as given by Toyoizumi et al [24], where also a reference activation value v0 is used (note, the BCM rule works poorly for our task without this variable, see S2 Appendix):

$$\frac{d\Theta_M}{dt} = \gamma\mu\left(-\Theta_M + v\,\frac{v}{v_0}\right). \tag{10}$$

where γ is a factor relating the time constants of the two differential equations (Eqs 9 and 10) and γ needs to be big enough to avoid instabilities. Note that parameterizing it this way makes it easier to focus on the influence of the ratio between the time constants in our analyses.
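A single Euler step of the coupled Eqs (9) and (10) might look as follows. This is a sketch under assumptions: the helper names are hypothetical, dv/dy is taken as the derivative of the activation in Eq (2), and the defaults μ = 0.001, γ = 10 and v0 = 0.2 are the values listed in the legend of Fig 7.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f_s(y, b=10.0):
    """Activation of Eq (2) (intended for moderate |y|; no overflow guard here)."""
    return max(0.0, (_sigmoid(b * (y - 0.5)) - 0.1) / 0.9)

def df_s(y, b=10.0):
    """Derivative dv/dy of Eq (2); zero where the rate is clipped to zero."""
    s = _sigmoid(b * (y - 0.5))
    return b * s * (1.0 - s) / 0.9 if s > 0.1 else 0.0

def bcm_step(w, theta, u, mu=0.001, gamma=10.0, v0=0.2, dt=1.0):
    """One Euler step of Eq (9) for the weights and Eq (10) for the sliding threshold."""
    y = sum(wi * ui for wi, ui in zip(w, u))
    v = f_s(y)
    dvdy = df_s(y)
    w_new = [wi + dt * mu * v * (v - theta) * ui * dvdy for wi, ui in zip(w, u)]
    theta_new = theta + dt * gamma * mu * (-theta + v * v / v0)
    return w_new, theta_new

# Weak drive (v below theta) depresses the weights, strong drive (v above theta) potentiates them.
print(bcm_step([0.2, 0.2], 0.2, [1.0, 1.0]))
print(bcm_step([0.2, 0.2], 0.2, [1.5, 1.5]))
```

The sign change of the weight update at v = ΘM is the defining property of BCM; the factor γ only sets how quickly the threshold tracks the recent activity.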

For the Oja rule, we use the standard formulation from [2]:

$$\frac{d\omega}{dt} = \mu\, y\,(u - \alpha\, y\, \omega), \tag{11}$$

where we set α = 1. This factor leads to the asymptotic convergence of $|\omega|^2 = 1/\alpha$ and is discussed in “Results” section.
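The stated norm property of Eq (11) can be checked numerically. The sketch below is only an illustration: the function name `oja_norm` and the stimulus protocol (two inputs, each present with probability 0.2 per step, Gaussian amplitudes around 1) are assumptions, and the rule is applied to the linear output y as written in Eq (11).

```python
import random

def oja_norm(alpha=1.0, mu=0.001, steps=80000, p_present=0.2, std=0.1, seed=0):
    """Integrate Eq (11) with Euler steps (dt = 1) and return the squared weight norm."""
    rng = random.Random(seed)
    w = [0.001, 0.001]
    for _ in range(steps):
        # Assumed protocol: independent presentation of two noisy inputs.
        u = [rng.gauss(1.0, std) if rng.random() < p_present else 0.0 for _k in range(2)]
        y = w[0] * u[0] + w[1] * u[1]
        w = [wi + mu * y * (ui - alpha * y * wi) for wi, ui in zip(w, u)]
    return w[0] ** 2 + w[1] ** 2

print(oja_norm(alpha=1.0))   # should settle close to 1/alpha = 1
print(oja_norm(alpha=2.0))   # should settle close to 1/alpha = 0.5
```

Since the final norm is fixed by α alone, the output level cannot adapt to the stimulus situation unless α is changed.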

For synaptic scaling, we use the following equation taken from [8]:

$$\frac{d\omega}{dt} = \mu\, y\, u + \xi\,(y_0 - y)\,\omega^{2}, \tag{12}$$

with ξ < μ < 1 and where the parameter y0 determines the value at which the output is stabilizing (for concrete values see figure legends).

Experimental settings

Neuron with two inputs

In the first part of the results section, we focus on the investigation of the ALL-rule on a neuron with only two inputs with varying input amplitudes and occurrence frequencies. We vary amplitudes in the interval [1, 1.5] and the occurrence frequencies using a ratio of 1 : 1 or 2 : 1. We also vary the standard deviation of input amplitude (σ = 0.1 or σ = 0.2) as well as the coincidence rates between inputs (50, 30 and 10%, measured in respect to the less frequent input in case of frequency difference). This is extended by detailed statistics how our system behaves for different annealing-rates ρ and thresholds va. Finally, we compare the results from the ALL-rule to results obtained from a set of the most common learning rules under similar conditions.

Recurrent network

In the second part of the results section, we employed the ALL-rule for generation of all possible coincident combinations of N external inputs (Fig 3) in a randomly connected recurrent network of M neurons with sparse connectivity of c connections per neuron on average. For the connectivity we use a Gaussian distribution with a standard deviation of c/5. However, we enforce a minimum of one connection onto each neuron. We also impose a limit on the maximally allowed number of connections, where for c = 2 this amounts to allowing connection numbers in the interval [1, 3]. We analyse the cases M = 200 and M = 1000, with c = 2 and c = 10 (allowed interval [1, 19]). In addition to those connections, 15% of randomly selected neurons are supplied with one connection each from randomly chosen external input neurons. We analyzed cases of N = 3 and N = 5 external inputs. Inputs can take two values: 0 or 1. For this part of the study we did not vary input amplitudes or frequencies. The goal of this part of the study is to show that such a system can self-organize into creating output neurons that respond to different possible combinations of active inputs. Hence, one such neuron will respond if a certain subset of k inputs is active at the same time (AND operation) and will not respond if any one of these k inputs is not present. In this case the remaining n − k inputs will not be able to drive this output neuron at all. We consider a neuron to be signaling a certain input combination if its activity is above a “classification” threshold for this combination, but below threshold for any other combination. We analyzed a set of thresholds from v = 0.4 to 0.8 in steps of 0.1.
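One possible way to generate such a connectivity is sketched below. Several details are assumptions that the text does not fix: in-degrees are rounded Gaussian samples, the upper limit is taken as 2c − 1 (inferred from the quoted intervals [1, 3] and [1, 19]), self-connections are excluded, and the function name `build_network` is hypothetical.

```python
import random

def build_network(M=200, c=2, N=3, frac_ext=0.15, seed=0):
    """Return (recurrent, external): presynaptic sources per neuron and external input wiring."""
    rng = random.Random(seed)
    lo, hi = 1, 2 * c - 1                  # assumed bounds: [1, 3] for c = 2, [1, 19] for c = 10
    recurrent = []
    for i in range(M):
        k = int(round(rng.gauss(c, c / 5.0)))      # in-degree ~ Gaussian(c, c/5), rounded
        k = max(lo, min(hi, k))                    # at least one, at most `hi` connections
        candidates = [j for j in range(M) if j != i]   # assumption: no self-connections
        recurrent.append(rng.sample(candidates, k))
    # 15% of the neurons receive one connection from a randomly chosen external input.
    ext_targets = rng.sample(range(M), int(round(frac_ext * M)))
    external = {i: rng.randrange(N) for i in ext_targets}
    return recurrent, external

recurrent, external = build_network(M=200, c=2, N=3)
print(sum(len(src) for src in recurrent) / len(recurrent))   # mean in-degree, close to c
print(len(external))                                         # number of neurons with external input
```

Regenerating the network with a different `seed` corresponds to varying the connection matrix trial by trial.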

Fig 3. Schematic diagram of the recurrent network with neurons responsive to different input combinations indicated.

“x” means input can be 0 or 1.

This way, we measured how many neurons, which are signaling a possible combination, appear within a network by calculating statistics from 100 trials to generate and train a network. For this, we varied the network connection matrix trial by trial. Also, neurons in the network were generated with an annealing threshold drawn from a uniform distribution in [0.75, 0.95], which also was re-generated for each trial. Then, we present results as percentage of neurons in the network that represent a certain combination. Hence, 2% means that there were 4 neurons representing that combination in a neural network with M = 200 neurons and 20 neurons representing that combination in the M = 1000 network.

Code for the different experiments is provided in the S1 Code Repository.

Results

First we analyze the properties of the annealing learning rules for a neuron that has only two inputs and compare those to the reference methods (BCM, Oja, Synaptic Scaling). This is started by an analytical calculation that compares annealed membrane Hebb (AMH) with annealed Linear Learning (ALL) after which we show simulation results for a wide variety of cases that cannot be captured by analytical approaches. The central finding here is that the ALL-rule allows for reliable separation between coincidence and no-coincidence cases without having to re-tune neuron parameters for different input situations. Only the BCM rule behaves similarly. However, this part is then extended by analysing more than two inputs. Here we observe now clear differences between BCM and ALL.

Finally this is followed by a study of recurrently connected networks also with more than two inputs, where we ask how reliably such a network could detect various types of coincidences. In addition to this we have performed a set of control experiments, where we added inhibition with a similar characteristic as in the cortical networks (about 20%, with constant synaptic weights and a wider convergence/divergence structure than excitation).

Separation properties

In the following we analyze how well the ALL-rule, as compared to the AMH-rule, separates the output spike rate obtained in the coincidence case from the individual rates obtained from only one input. We can here obtain analytical arguments under the assumption of independent constant inputs in the limit of few coincidences only (where the latter constraint is needed for the AMH-rule only). We then complement these analytical considerations with some simulations that allow relaxing the above constraints.

Hence, we assume two constant inputs, u1 and ϕu1 with ϕ > 1. For the case of the ALL-rule one can calculate weight growths over time as:

$$\begin{aligned} \omega_1(t) &= \mu_0 u_1 t + \omega_0 \\ \omega_2(t) &= \mu_0 \phi u_1 t + \omega_0 \end{aligned} \tag{13}$$

where μ0 is the learning rate before annealing and ω0 the start weight. Accordingly, the membrane potentials are:

$$\begin{aligned} y_1(t) &= \mu_0 u_1^{2} t + \omega_0 u_1 \\ y_2(t) &= \mu_0 \phi^{2} u_1^{2} t + \omega_0 \phi u_1. \end{aligned} \tag{14}$$

For the AMH-rule we get for the weights:

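$$\begin{aligned} \omega_1(t) &= \omega_0 e^{\mu_0 u_1^{2} t} \\ \omega_2(t) &= \omega_0 e^{\mu_0 \phi^{2} u_1^{2} t}, \end{aligned} \tag{15}$$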

and the membrane potentials are given by:

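$$\begin{aligned} y_1(t) &= \omega_0 u_1 e^{\mu_0 u_1^{2} t} \\ y_2(t) &= \omega_0 \phi u_1 e^{\mu_0 \phi^{2} u_1^{2} t}. \end{aligned} \tag{16}$$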
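The closed forms in Eqs (13) to (16) follow from a short integration. The steps are sketched here under the assumptions stated above (constant inputs, no coincidences, learning rate fixed at μ0 before annealing, and H(y − η) = 1 whenever an input is present), writing u_i for the amplitude of input i:

```latex
% ALL-rule, Eq (8) with H = 1: the right-hand side is constant in time
\frac{d\omega_i}{dt} = \mu_0 u_i
\;\Rightarrow\;
\omega_i(t) = \mu_0 u_i t + \omega_0,
\qquad
y_i(t) = \omega_i(t)\, u_i = \mu_0 u_i^{2} t + \omega_0 u_i .

% AMH-rule, Eq (6) with y = \omega_i u_i when only input i is active
\frac{d\omega_i}{dt} = \mu_0 u_i^{2}\, \omega_i
\;\Rightarrow\;
\omega_i(t) = \omega_0 e^{\mu_0 u_i^{2} t},
\qquad
y_i(t) = \omega_0 u_i e^{\mu_0 u_i^{2} t} .

% Setting u_2 = \phi u_1 gives the second line of each equation pair.
```

The linear versus exponential time dependence is what later makes the ratio y2/y1 stay bounded for the ALL-rule but grow without bound for the AMH-rule when ϕ > 1.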

If we allow for (rare) coincidences between the two inputs, then the membrane potential becomes $y_1 + y_2$ and the neuron’s output will be $v_{1+2} = f_s(y_1 + y_2)$ (see Eq (2)). Due to the definitions of $y_1$ and $y_2$ the following ordering holds: $v_{1+2} > v_2 > v_1$. As a consequence, $v_{1+2}$ will eventually hit the annealing threshold $v_a$ at time $t_a$. If we now assume instantaneous annealing, then all weight growth will stop and we can ask which values the individual outputs $v_1$ and $v_2$ will have reached. This way we can assess the separation between the coincidence-driven output (which is then at $v_a$) and the other two outputs. To be able to call such a neuron an AND-operator, a clear separation is needed, and here we are only concerned with $v_2$, which is in any case larger than $v_1$. Hence, we calculate for different parameters $u_1$, $\mu_0$, $\phi$ and $v_a$ how big the separation $s(t_a)$ between $v_{1+2}(t_a) = v_a$ and $v_2(t_a)$ is, with $s(t_a) = v_a - v_2(t_a)$. This last step has to be calculated numerically, as the resulting terms can no longer be solved analytically. Fig 4A shows the results. Note that $\mu_0$ has no influence on the separation; it only determines how early or late the annealing threshold will be reached. The figure shows that only for identical amplitudes the separation between the coincidence case and the individual input case will be the same for the annealed membrane Hebb rule and the annealed Linear Learning rule. For all other situations, the ALL-rule leads to a far better separation. Furthermore, note that separation is largely independent of the annealing threshold, which adds to the robustness of the annealing approach.
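The numerical step described above can be sketched as follows, assuming instantaneous annealing and the closed forms of Eqs (14) and (16). The function names and the bisection solver are our own choices for illustration; the default parameter values are borrowed from the legend of Fig 4.

```python
import math

def f_s(y, b=10.0):
    """Activation of Eq (2)."""
    z = -b * (y - 0.5)
    if z > 700.0:
        return 0.0
    return max(0.0, (1.0 / (1.0 + math.exp(z)) - 0.1) / 0.9)

def f_s_inverse(v, b=10.0):
    """Membrane potential that yields rate v (valid for 0 < v < 1)."""
    s = 0.9 * v + 0.1
    return 0.5 + math.log(s / (1.0 - s)) / b

def separation_all(phi, u1=1.0, mu0=0.0005, w0=0.001, v_a=0.7):
    """s(t_a) = v_a - v_2(t_a) for the ALL-rule, using Eq (14)."""
    y_a = f_s_inverse(v_a)
    # y1 + y2 is linear in t, so t_a has a closed form.
    t_a = (y_a - w0 * u1 * (1.0 + phi)) / (mu0 * u1 ** 2 * (1.0 + phi ** 2))
    y2 = mu0 * phi ** 2 * u1 ** 2 * t_a + w0 * phi * u1
    return v_a - f_s(y2)

def separation_amh(phi, u1=1.0, mu0=0.0005, w0=0.001, v_a=0.7):
    """s(t_a) = v_a - v_2(t_a) for the AMH-rule, using Eq (16) and bisection for t_a."""
    y_a = f_s_inverse(v_a)

    def y_sum(t):
        return w0 * u1 * (math.exp(mu0 * u1 ** 2 * t)
                          + phi * math.exp(mu0 * phi ** 2 * u1 ** 2 * t))

    # Upper bracket: time at which y2 alone would reach y_a.
    lo, hi = 0.0, math.log(y_a / (w0 * u1 * phi)) / (mu0 * phi ** 2 * u1 ** 2)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if y_sum(mid) < y_a:
            lo = mid
        else:
            hi = mid
    t_a = 0.5 * (lo + hi)
    y2 = w0 * phi * u1 * math.exp(mu0 * phi ** 2 * u1 ** 2 * t_a)
    return v_a - f_s(y2)

for phi in (1.0, 1.2, 1.5):
    print(phi, separation_all(phi), separation_amh(phi))
```

For ϕ = 1 both rules give the same separation, and for ϕ > 1 the AMH value collapses because the stronger input alone carries almost the entire membrane potential at time ta. Changing `mu0` leaves both results unchanged, in line with the remark that μ0 only shifts ta.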

Fig 4.

A: Separation properties calculated analytically. B: Histograms of numerical results for ALL- and AMH-rules in case of Gaussian distribution of input amplitudes. In all cases, input presentation frequencies are equal. In B input coincidence is 10% everywhere; input amplitudes: mean for the blue distributions was normalized to 1.0 and for the orange ones to ϕ; standard deviations are indicated above the plots. Annealing parameters are va = 0.7, ρ = 0.2. Initial weights are ω(0) = [0.001, 0.001]T and initial learning rate μ0 = 0.0005; Euler integration with step dt = 1. Disks in A mark the points with the corresponding plots in B. Tilted lines are truncation marks for the blue histograms.

In panel B we show how the ALL- versus AMH-rules behave when using inputs with a Gaussian distribution in amplitudes and the same presentation frequencies for both inputs. The coincidence rate was 10%. Responses to the individual inputs are shown in orange and blue and the coincident case in green. The results are consistent with the analytics in panel A except for a slight increase in separation values due to more balanced weight growth in the simulation, because of the 10% coincidences, where the analytics could only be calculated for the limit case of 0%. The ALL-rule leads to a much stronger separation. Numbers at the bottom show the distance between the mean values of the orange and green distributions, where separability entirely ceases towards the right for AMH. Furthermore, note that the AMH-rule shows the expected exponential run-away property for the stronger (orange) distributions and the blue ones do not develop any firing rate v above zero for unequal amplitudes. Using a rate-based Hebb rule (hence g(y) = v) would mitigate these effects as soon as the membrane potential to rate transformation approaches the Heaviside property.

Annealed Linear Learning rule: Neuron output analysis

In the following we focus on the ALL-rule, which provides a better separation than the AMH-rule as shown above. In Fig 5 we present histograms of neuron outputs for different input combinations. Input amplitudes are drawn from a Gaussian distribution and are characterized by mean and standard deviation (see first column in Fig 5A). We use mean amplitudes of 1, 1.2 and 1.5 and a standard deviation of std = 0.1 for Fig 5. (Cases with std = 0.2, i.e., higher input variance, were already shown in Fig 4 above).

Fig 5. Histograms of neuron inputs (first column) and outputs v for the ALL-rule.

A: Equal presentation frequency; B: Different presentation frequency. Parameters: va = 0.7, ρ = 0.1, std = 0.1. Mean amplitudes of the inputs are indicated in the first column. Initial weights are ω(0) = [0.001, 0.001]T and initial learning rate is μ0 = 0.0005; Euler integration with step dt = 1. For other parameters: see plots. Response histograms (blue or yellow) in case of amplitude or presentation frequency difference are grouping very close to zero, where we truncate the zero bin to optimize for visibility (see truncation marks).

In addition to the amplitude distribution, inputs are characterized by their presentation frequency, which could be understood as how often stimuli are delivered to the neuron by the external world. In Fig 5A, we show results in case both inputs are presented with the same frequency, while in Fig 5B results are shown in which the first stimulus is twice as frequent.

Another important input parameter is how frequently two inputs coincide at the neuron. We consider 50, 30 and 10% coincidence. When the presentation frequency of the two inputs differs, we calculate the percentage of coincidence with respect to the input with smaller presentation frequency.

In Fig 5 we show the input distributions of the neuron (left column) and the neuron output v in case of coincidence in green, while the responses to single inputs are shown in blue and orange. All neuron outputs are limited to the interval [0, 1], due to the non-linear response curve (see Fig 2A).

As expected, the output for coincident inputs (green) is always the highest. We can also observe that the gap between the blue or orange histograms and the green histogram is in almost all cases quite big. Furthermore, this gap “sits at the same location” such that a unique discrimination threshold vd could be defined to differentiate coincident from non-coincident responses (e.g. vd = 0.6). These properties are, thus, largely independent of input amplitudes, frequencies, and percentages of coincidence. Thus, only due to these invariances such a neuron can indeed be called “input coincidence detector” (AND operation-like). Next we will quantify the robustness of these properties.

In Fig 6 we show, for the ALL-rule, how the separation of coincidence vs no coincidence varies with different annealing parameters, where we vary the annealing onset threshold va and the annealing rate ρ (Eq (4)). We show the classification error for coincidence vs no coincidence. The classification threshold is kept at v = 0.5. First, in Fig 6A and 6B we present error plots in parameter space in case both inputs have the same presentation frequency and both amplitudes are equal: mean = 1, std = 0.1, with 30% (A) or 50% (B) coincidence. These are the most favorable of all cases shown here and one can see that the error is zero (or very small) in a very big region of the parameter space (white and light colored patch in the middle of the plots). This patch slightly decreases when amplitudes (E, F), or frequencies (C, D) of the two inputs differ, but differences between the plots remain small. An amplitude increase of the less frequent input can compensate for the frequency decrease (see G, H). The errors in the plots above and to the left of the white patch are false positives, while those below and to the right are false negatives.
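The error measure can be sketched as a simple threshold test on recorded output rates. The exact definition used for Fig 6 is not spelled out in the text, so the fraction-of-misclassified-responses definition below, like the function name, is an assumption made for illustration.

```python
def classification_error(coincident_rates, single_rates, threshold=0.5):
    """Return (error fraction, false positives, false negatives) for a fixed decision threshold."""
    # A coincident presentation at or below threshold is missed (false negative).
    false_neg = sum(1 for v in coincident_rates if v <= threshold)
    # A single-input presentation above threshold is wrongly flagged (false positive).
    false_pos = sum(1 for v in single_rates if v > threshold)
    total = len(coincident_rates) + len(single_rates)
    return (false_pos + false_neg) / total, false_pos, false_neg

# Hypothetical rates: one missed coincidence and one spurious single-input response.
print(classification_error([0.8, 0.7, 0.4], [0.1, 0.0, 0.6]))
```

Keeping false positives and false negatives separate makes it possible to tell which side of the zero-error patch a given parameter pair falls on.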

Fig 6. Classification error (coincidence vs. not coincidence) of the ALL-rule in respect to parameter variations.

Parameters are annealing onset threshold and annealing rate. Decision threshold is 0.5. Panels (A-L) variable amplitude, coincidence and presentation frequency; panels (M-P) extreme cases: bigger variance, smaller coincidence, bigger amplitude difference, bigger frequency difference. Averages over 20 trials are shown. Initial weights are ω(0) = [0.001, 0.001]T and initial learning rate μ0 = 0.0005; Euler integration with step dt = 1.

In the third row (panels I-L) the same type of representation is shown, but for a set of amplitude differences, where the first input average amplitude is always at one, while the second input average amplitude is drawn from the set {0.8, 0.9, 1.0, 1.1, 1.2} (uniform probability), with std = 0.1 everywhere. Also in this case the error is zero in a big patch of the parameter space.

In Fig 6M–6P we present various less favorable cases to investigate the limits of the ALL-rule: higher input variance (std = 0.2, panel M), small coincidence (just 10%, panel N), wide amplitude range in the interval [0.5,1.5] (panel O), as well as the case when one input is five times less frequent (30% coincidence, panel P). Except for the last, five times less frequent case, we always get a parameter region where errors are zero. In the case where one input is five times less frequent (P), however, we still get low classification errors for a large range of parameters. Note that in this case the coincidence percentage is very small, as we calculate the 30% from the less frequent input. Thus, this case is, indeed, very unfavorable.

Comparison to reference methods

In Fig 7 we show results obtained with the three reference methods. Presented results are characteristic for the problems that these methods have with the task of input coincidence detection.

Fig 7. Comparison to reference methods: Results for BCM, Oja and Synaptic Scaling.

Two inputs with coincidence 30% everywhere. Amplitudes and standard deviation (std) are shown above each column. Presentation frequency is equal, except in the last column where it is 2:1. Parameters: μ = 0.001. For Oja and Syn.Scaling: ω(0) = [0.001, 0.001]T, for BCM: ω(0) = [0.2, 0.2]T, ΘM(0) = 0.2, γ = 10 and v0 = 0.2; Synaptic Scaling: y0 = −200, ξ = 0.01.

For synaptic scaling and Oja no unique separation threshold can be found and it depends on the stimulus situation. This could be resolved by using additional mechanisms (e.g. for Oja by adapting the α factor for each stimulus situation individually). However, case-by-case adaptation of parameters is an undesirable feature for biological systems. The Oja rule, in addition, has unfavorably overlapping distributions when input amplitudes or frequencies differ (see third and fifth columns). Note also, that Oja as well as synaptic scaling are in the existing literature normally used in a linear regime and cannot be satisfactorily applied after output transform (2), which we also observed (See S1 Appendix).

For the BCM rule we have investigated different variants, but we will only show the best results. In summary, when using the classical BCM-rule [3] for a 2-input system (linear case), fixed points for the synaptic weights exist, albeit one of which is always negative. Thus, this leads to unrealistic results (see S1 Appendix). This problem can be addressed by using a more advanced version of BCM introduced by Toyoizumi et al. [24]. Their formulation contains several additional parameters, which prevent negative weights. However, here distributions for single features and combinations tend to overlap and the shape and overlap of the distributions depends on those additional parameters.

Different from this, the (non-linear) version introduced by Intrator and Cooper [23] renders results which are, at first glance, quite satisfactory and robust against input as well as parameter variations. Thus, in all panels the same separation threshold can be used. (For this rule, however, the value v0 needs to be chosen correctly, see S2 Appendix). A general observation here, though, is that the single-input distributions heavily overlap. Hence, different input characteristics get lost in the output. This is clearly visible when considering, for example, three inputs (Fig 8A). Here 7 different output distributions exist: 3 represent the responses for one input each, another 3 for two inputs and 1 for all three inputs. We show three examples obtained with the BCM rule with the same parameters and same input statistics, where differences arise due to randomness in stimulus sequencing. In each of them, 5 distributions cluster at small activation values and 2 near an activation of 1.0. The latter consist of the 3-input case “123” and one two-input case, which is, however, not the same in the three BCM examples shown here. The actual outcome, thus, depends on the stimulus sequences, which are randomized and, thus, different in these three examples. Different from this, the ALL-rule renders an input-output transformation which much better reflects the stimulus combinatorics: single-input responses are on the left, those for two inputs in the middle, and the one that belongs to all three inputs is found on the right side of the activation axis. In Fig 8B we show, in addition, the weight development for one 3-input case each for ALL and BCM, where the latter converges only after about 35,000 iterations and shows oscillations during convergence. This type of behavior of BCM for multiple inputs is generic and has also been observed by others [26]. Convergence speed can be increased by changing a parameter in BCM at the cost of stronger oscillations. Note that for five inputs BCM convergence can take above 1 million iterations. Different from this, ALL converges for three inputs smoothly after only about 100 iterations, and this number does not significantly increase for more inputs.

Fig 8. Three input coincidence sorting for ALL and BCM rules.

A: Output histograms. Note that 3 examples for BCM are shown using the same intrinsic parameters but different stimulus sequencing. B: Weight development. Note the different x-axis scales. C: Parameter space analysis: Errors for classification “one active input”, “two active inputs”, “three active inputs” are based on response thresholds 0.25 and 0.75, averages over 20 trials are shown. Light color corresponds to good coincidence sorting. Circles in the error plots show parameter combinations for which histograms are shown in panel (A). Parameters: mean amplitude is 1 in case the input is active, STD = 0.1, ω(0) = [0.2, 0.2, 0.2]T, μ = 0.001, pair-coincidence 30% for every possible combination (12, 13, 23) in respect to that pair, triple co-incidence for 123: 6%; for BCM: ΘM(0) = 0.1; Euler integration with dt = 1 in all cases.

We further evaluate coincidence sorting in the three-input case for the ALL and BCM rules in Fig 8C, where we investigate the parameter spaces of both rules. We set two thresholds, one at 0.25 and the other at 0.75 (based on the approximate boundaries of the distributions in panel A), and evaluate the classification error under the assumption that responses to single inputs remain to the left of the first threshold, two-input combinations lie in the middle, and three-input combinations reside to the right of the second threshold. For the ALL rule we vary the annealing threshold va and the annealing rate ρ; for the BCM rule we vary the target value v0 and the time scale ratio γ; for both methods we vary the steepness of the non-linear activation function by manipulating b in Eq (2). Zero and small errors are visible as whitish patches in the plots.
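The two-threshold evaluation described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, responses falling exactly on a threshold are assigned to the upper class (the text does not specify this), and the example responses are invented.

```python
def predicted_active_inputs(response, low=0.25, high=0.75):
    """Map a response to a predicted number of active inputs (1, 2 or 3)."""
    if response < low:
        return 1  # left of the first threshold: single input
    if response < high:
        return 2  # between the thresholds: two-input combination
    return 3      # right of the second threshold: all three inputs


def sorting_error(responses, true_counts, low=0.25, high=0.75):
    """Fraction of responses whose predicted count differs from the true count."""
    wrong = sum(
        1
        for r, n in zip(responses, true_counts)
        if predicted_active_inputs(r, low, high) != n
    )
    return wrong / len(responses)


# Hypothetical responses (illustration only, not data from the study).
sample_responses = [0.05, 0.10, 0.50, 0.60, 0.95, 0.30]
sample_counts = [1, 1, 2, 2, 3, 1]
sample_error = sorting_error(sample_responses, sample_counts)
print(sample_error)
```

In this invented example only the last response (0.30 for a single active input) lands in the wrong interval.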

One can see that for the ALL rule there is an area in the parameter space with small errors and, thus, input differentiation can be obtained. Such an area is present for all three response function steepness values, though it is smaller for the very steep function. The latter is expected, as a very steep function tends to “squeeze” outputs into two classes more strongly. By contrast, for the BCM no parameter combination brings small classification errors. Also, a different steepness of response function does not mend the situation. As already shown in Panel A, BCM tends to divide outputs into two extremes, thus no combination sorting property in case of more than two inputs is obtained by BCM.

In Fig 9 we further analyze the ALL rule in the case of amplitude variations. We investigate cases where the mean amplitudes for the three inputs are (1.0, 1.0, 1.2), (1.0, 1.0, 1.5) and (1.0, 1.2, 1.5). Once again, one can see substantial areas (light color) in parameter space where inputs are sorted correctly (see top row in Panel A). At the bottom of Panel A we show two output histograms for the instances marked by the two circles in the parameter space above. In those instances, outputs can be differentiated into “one active input”, “two active inputs” and “three active inputs” with a small error, given the chosen thresholds.

Fig 9. Input coincidence sorting properties under more variable conditions.

A: Results for the ALL rule for the three input case with amplitude variation. Errors for classification “one active input”, “two active inputs”, “three active inputs” are based on thresholds 0.25 and 0.75. Light color corresponds to good coincidence sorting. Parameters: average amplitude provided above the plots, STD = 0.1, ω(0) = [0.2, 0.2, 0.2]T, μ(0) = 0.001, pair-coincidence 30% for every possible combination (12, 13, 23) in respect to that pair, triple co-incidence for 123: 6%; plots show averages over 20 trials. Histograms of individual runs below correspond to the two circles in parameter plots above. B: Results on input coincidence sorting for ALL and BCM (Intrator-Cooper) rule for a five input case. For 5 inputs there are 31 possible combinations of neurons driven by n ≥ 1 inputs: 5 × 1, 10 × 2, 10 × 3, 5 × 4 and 1 × 5 inputs as indicated beneath the abscissa. Parameters: ω(0) = [0.1, 0.1, 0.1, 0.1, 0.1]T, μ = 0.001, binary subsets of five presented in equal probability, random order, Euler integration with dt = 1 in both cases, for ALL: va = 0.7, ρ = 0.1, for BCM: ΘM(0) = 0.2, v0 = 0.4, γ = 10.

Finally, we investigate five-input cases for the ALL and BCM (Intrator-Cooper) rules. In Fig 9B we show the outputs obtained after learning for different input combinations (for five inputs there are 31 possible combinations with at least one active input). In this case all inputs have a value of 1 (no amplitude variation) and in the learning phase all input subsets are provided with equal probability. For the ALL rule we reliably and consistently obtain outputs sorted by the number of active inputs, as shown in the left plot in Fig 9B. By contrast, for BCM the situation is different, as can be seen on the right in Fig 9B. Note that where there is no blue column, the output is zero. Outputs by BCM are essentially sorted into two classes, close to zero and close to one, similar to the result shown in Fig 8A. Also in a similar manner, the outcome is variable and depends heavily on the actual input sequence. Thus, the BCM rule cannot sort five-input coincidences.
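The count of 31 combinations, and its split into 5, 10, 10, 5 and 1 combinations of one to five active inputs, can be checked with a short Python sketch. The function name and the numbering of inputs from 1 are illustrative choices only.

```python
from itertools import combinations


def combinations_by_size(n_inputs):
    """Group all non-empty subsets of inputs 1..n_inputs by number of active inputs."""
    inputs = range(1, n_inputs + 1)
    return {k: list(combinations(inputs, k)) for k in range(1, n_inputs + 1)}


five_input_groups = combinations_by_size(5)
five_input_counts = {k: len(subsets) for k, subsets in five_input_groups.items()}
print(five_input_counts, sum(five_input_counts.values()))
```

The same function applied to three inputs yields the 7 output distributions (3 + 3 + 1) discussed for Fig 8A.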

Hence, the ALL-rule has unique properties with respect to other rules in coincidence sorting and detection.

Recurrent networks with the ALL-rule

First we demonstrate that we can obtain cells representing all possible input combinations in a recurrent network. In Fig 10 we provide a box plot for the number of different combinations obtained for N = 3 or N = 5 external inputs in case of M = 200 neurons in a network. Statistics are shown for 100 randomly generated networks. For this we count, after learning, how many neurons respond, for example, to an input combination of “x11xx”. Such a neuron, shown with green index “12” (decimal for the binary code 01100) in the panel B, thus, requires inputs 2 and 3 (encoded as “1”) to be active, where the other inputs may or may not be present (encoded as “x”), but they will not be able to drive this neuron on their own. One can see that for N = 3 the number of cells representing different combinations is essentially uniformly distributed, while for N = 5 the number of neurons representing single inputs is higher than the rest. As expected standard deviations are high but, in spite of this, for any of the possible combinations there are always at least a few cells that represent them.
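The decimal indexing of combinations used here follows directly from reading the required inputs as a binary number. A minimal sketch, with hypothetical function names, treating every “x” (input may or may not be present) as a 0 bit:

```python
def pattern_to_index(pattern):
    """Convert a combination pattern such as 'x11xx' to its decimal index."""
    bits = "".join("1" if ch == "1" else "0" for ch in pattern)
    return int(bits, 2)


def index_to_bits(index, n_inputs):
    """Convert a decimal index back to a zero-padded binary code."""
    return format(index, "0{}b".format(n_inputs))


print(pattern_to_index("x11xx"))  # required inputs 2 and 3
print(index_to_bits(3, 5))        # only the last two inputs required
```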

Fig 10. Box plots for the number of neurons representing different combinations for the ALL-rule.

A: Input number N = 3. B: Input number N = 5. Combinations are aligned in ascending order of active inputs, with color code indicating the number of inputs, see legend at the bottom. Combinations are indicated by decimal numbers corresponding to binary set notation (e.g. “3” means the combination: 00011, where only the two last inputs are active). “o” means other, where this denotes occurrences of cells signaling several different combinations. The size of the neural network is M = 200, average connectivity c = 2, annealing parameters are: annealing rate ρ = 0.3, where the annealing threshold va for each neuron individually is drawn from a uniform distribution [0.75,0.95]. Decision threshold is 0.7. Initial weights are chosen from Gaussian distribution with mean = 0.001 and std = 0.0002. Initial learning rate μ(0) = 0.0005. Euler integration with dt = 1. Median, mean and standard deviation are shown on the basis of 100 trials.

It is important to note here that this network does not produce an excess of neurons that respond to the condition “other” (about 7 out of 200 cells do this in the 5-input case, panel B). “Other” means that a neuron would code, for example, for “x1x1x” as well as for “11xxx” and possibly for even more different combinations. If self-organization were driven by a purely random process, a very strong excess of such neurons would be expected, which is not the case here. Hence, our networks do, indeed, self-organize into a set of input-combination selective neurons.

In Fig 11 we analyze how the proportion of different combinations changes with varying decision threshold (A,C) and, for the same decision threshold, across different network architectures (B,D). We quantify how many neurons are, on average, selective for any one input combination. To achieve this, we first sum the number of neurons that represent the same type of input combination, e.g. single input. Then we divide this sum by the number of possible type-identical combinations. For example, for the N = 5 case there are 5 single, 10 double, 10 triple, 5 quadruple and 1 quintuple possible combinations. Hence, the percentage plots in Fig 11 do not sum to 100. However, to also show the strong difference between combination-selective and non-selective (“other” + “sub-threshold” + “sustained”) neurons, we provide the total percentage of combination-selective neurons, too (numbers in italics at the top of each plot). Standard deviations are of the same order of magnitude as those shown in the box plot above and are omitted here to keep the diagrams readable.
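The normalization described above can be sketched in a few lines of Python. Two points are assumptions rather than statements from the text: that the plotted percentages are taken relative to the network size M, and the neuron counts themselves, which are invented for illustration.

```python
from math import comb


def per_combination_percent(neurons_per_order, n_inputs, n_neurons):
    """Average percentage of neurons per single combination of each order.

    neurons_per_order -- dict: number of active inputs -> summed neuron count
    """
    return {
        k: 100.0 * count / (comb(n_inputs, k) * n_neurons)
        for k, count in neurons_per_order.items()
    }


def total_selective_percent(neurons_per_order, n_neurons):
    """Total percentage of combination-selective neurons in the network."""
    return 100.0 * sum(neurons_per_order.values()) / n_neurons


# Hypothetical counts for N = 5 inputs and M = 200 neurons (not study data).
hypothetical_counts = {1: 50, 2: 60, 3: 40, 4: 20, 5: 10}
per_combo = per_combination_percent(hypothetical_counts, n_inputs=5, n_neurons=200)
total = total_selective_percent(hypothetical_counts, n_neurons=200)
print(per_combo, total)
```

Because each order is divided by a different number of type-identical combinations, the per-combination values need not add up to the total percentage, which is why both are reported.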

Fig 11. Different combination distribution based on decision threshold and neural network architecture.

A and B: three input case; C and D: five input case. Numbers 1 to 5 indicate combinations responsive to corresponding number of inputs; “Other” represent cells signaling more than one combination (see text for explanation), “Sub.” denotes sub-threshold cases, while “Sust.” denotes sustained activity, which does not subside after switching off the inputs, which does not happen here (but in the baseline, see Fig 12). Neural network (NN) architecture notation: “No of neurons”- “connectivity” (M-c). Numbers above column groups denote percentage of combination-selective neurons (vs. “Other” and “Sub.” neurons). Initial settings and learning parameters as in Fig 10. Note, Fig 10 corresponds to the results indicated by ovals.

Decision threshold dependence is analyzed in parts A and C for N = 3 and N = 5, respectively. All combinations are represented with an increasing prevalence of more-complex combinations for higher threshold values. Single cases are over-represented for smaller thresholds, where this over-representation decreases with increasing decision threshold. However, the qualitative outcome remains the same, that different combinations exist in a network, irrespective of threshold value.

While all combinations are present, architectures with a bigger number of inputs (200–10, 1000–10) favor higher-order combinations of inputs (see prevalence of 3-combinations in Fig 11B for architecture 200–10 and 1000–10 and prevalence of 5-combinations in Fig 11D, appropriate architectures).

The green columns in the histograms show “Other” cases, which count all the cells in the network that are active with two or more combinations, where one is not the subset of the other. This number is not substantial when connectivity is low c = 2 and higher if we use connectivity c = 10 in five input case (see panel D). However, note that also in this case there are still around 80% of combination specific cells existing and only 20% others. If the decision threshold becomes too high (see for the threshold value 0.8 in A and C), sub-threshold cases emerge (black column). These are cases where the neuron may “fire” but never reaches decision threshold. None of the networks that we trained this way showed sustained activity, which is a type of activity that persists after the inputs have been switched off (but see next).

Results can be compared to baseline performance, where the weights obtained by the ALL-rule are randomly reshuffled (permuted) in between connections in the network, while the general network connectivity pattern (which neuron connects to which other neuron) remains the same. The percentage of different cases is shown in Fig 12, column group “permuted”. Here we see that both, for N = 3 (A) and N = 5 (B), we only have a few percent of neurons responding to single inputs only, whereas essentially no more-complex combinations emerge (compare to columns “learned” plotted on the left). Instead, for baseline most of the cells remain sub-threshold. The question naturally arises whether this is just a scaling effect? Hence, to investigate if we can get more useful above-threshold combinations with bigger weights, we increased all weights in the baseline by 1.5 or by 2 (columns “permuted x 1.5” and “permuted x 2”). Here we get a few more single responses and, as discussed above, also more “Other” responses (numbers in green), but now also sustained activity emerges (brown column in the diagrams) and dominates for “permuted x 2”. Hence, the network activity does not come to rest after stimuli have been removed. Thus, this baseline shows that the ALL-method, suggested in this study, allows generating in an unsupervised manner neurons selective for specific combinations of inputs (low number of random=“other” combinations) in a stable way, hence, without leading to sustained activation.
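A minimal sketch of this baseline, assuming the network is stored as a mapping from (pre, post) neuron pairs to weights; this representation, the function names and the example values are hypothetical. Only the weight values are shuffled, while the set of existing connections is left untouched.

```python
import random


def permute_weights(weights, rng):
    """Shuffle weight values among existing connections.

    weights -- dict mapping (pre, post) index pairs to learned weights
    rng     -- random.Random instance, for reproducibility
    """
    edges = list(weights.keys())
    values = list(weights.values())
    rng.shuffle(values)  # connectivity pattern (the keys) stays the same
    return dict(zip(edges, values))


def scale_weights(weights, factor):
    """Multiply all weights by a common factor (e.g. 1.5 or 2)."""
    return {edge: factor * w for edge, w in weights.items()}


# Hypothetical learned weights (illustration only).
learned = {(0, 1): 0.30, (1, 2): 0.05, (2, 0): 0.45, (3, 1): 0.10}
permuted = permute_weights(learned, random.Random(0))
permuted_x2 = scale_weights(permuted, 2.0)
print(permuted, permuted_x2)
```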

Fig 12. Comparison to baseline.

A: Three input case, B: Five input case. The column group “learned” shows performance of the ALL-rule, M = 200, c = 2; copied from Fig 11; “permuted” is for the case with learned weights randomly permuted; “permuted x 1.5” and “permuted x 2” for cases with permuted weights multiplied by 1.5 and 2, respectively. Decision threshold kept at 0.7 everywhere. Abbreviations: “Sub.” = sub-threshold, “Sust.” = sustained activity. Green numbers denote percentage of “Other”.

Network with inhibition

This study focuses on the stabilizing effects of annealing in excitatory networks, which otherwise would be prone to effects like sustained activity as shown in the baseline study above. Intuitively, inhibition should not interfere (rather help) with stabilization, but will it affect responses to the input combinatorics?

In Fig 13 we show a boxplot for the number of cells signaling different input combinations in a network with 200 excitatory cells and 40 inhibitory cells (20%). We use a connectivity of the excitatory network c = 2, and N = 5 inputs. In addition, each excitatory cell receives input from ten randomly chosen inhibitory cells. Inhibitory connections are not trainable and their weights are set to 0.01 each. This leads to a total inhibitory strength converging at any target cell, which is only moderately smaller than the learned excitatory weights for which we calculated that we get an average total excitatory weight of ∼0.5. Each inhibitory cell receives inputs from 20 randomly chosen excitatory cells in the network. The plot in Fig 13, shown here, is similar to the one obtained without inhibition (see Fig 10B), where the numbers of cells responsible for combinations are only slightly lower in the case of inhibition. This shows that realistic inhibition, added to the network, does not fundamentally change the behavior of such networks.
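The inhibitory wiring described above can be sketched as follows. The cell counts, fan-in numbers and the fixed weight of 0.01 come from the text; the data structures, the function name and the use of a fixed random seed are assumptions made for illustration, and the excitatory-to-excitatory connections and the learning itself are omitted.

```python
import random


def build_inhibition(n_exc=200, n_inh=40, inh_per_exc=10, exc_per_inh=20,
                     w_inh=0.01, seed=0):
    """Return random inhibitory wiring for an excitatory network."""
    rng = random.Random(seed)
    # Each excitatory cell receives fixed, non-trainable input from
    # inh_per_exc distinct, randomly chosen inhibitory cells.
    inh_to_exc = {
        e: {i: w_inh for i in rng.sample(range(n_inh), inh_per_exc)}
        for e in range(n_exc)
    }
    # Each inhibitory cell is driven by exc_per_inh distinct excitatory cells.
    exc_to_inh = {i: rng.sample(range(n_exc), exc_per_inh) for i in range(n_inh)}
    return inh_to_exc, exc_to_inh


inh_to_exc, exc_to_inh = build_inhibition()
# Summed inhibitory weight onto one excitatory cell: 10 x 0.01.
total_inhibition = sum(inh_to_exc[0].values())
print(total_inhibition)
```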

Fig 13. Box plot for different combinations of inputs for the cases N = 5 inputs for the ALL-rule with 20% inhibitory cells.

Combinations are aligned in ascending order of active inputs, with color code indicating the number of inputs, see legend at the bottom. Combinations are indicated by decimal numbers corresponding to binary set notation (e.g. “3” means the combination: 00011, where only the two last inputs are active). “o” means other, where this denotes occurrences of cells signaling several different combinations. The size of the neural network is M = 200 (excitatory cells) with 40 inhibitory cells added. Average connectivity of excitatory cells onto excitatory cells is c = 2; connectivity onto inhibitory cells c = 20. Each excitatory cell, in addition, is given 10 inhibitory connections, with a fixed weight of 0.01. Annealing parameters are: annealing rate ρ = 0.3, where the annealing threshold va for each neuron individually is drawn from a uniform distribution [0.75,0.95]. Decision threshold is 0.7. Initial weights for excitatory inputs are chosen from Gaussian distribution with mean = 0.001 and std = 0.0002. Initial learning rate μ(0) = 0.0005. Euler integration with dt = 1. Median, mean and standard deviation are shown on the basis of 100 trials.

Discussion

In this study we have introduced an unsupervised synaptic plasticity rule with learning rate annealing that leads to weight stabilization and useful output sorting in case of different input coincidences even for different input amplitudes and occurrence frequencies. To achieve this we have made two modifications of the traditional Hebb rule:

  • We reduced the influence of the neuronal output onto learning to an all-or-none behavior by using the Heaviside function with a threshold η ≥ 0. This way learning starts as soon as the neuronal activity is larger than this threshold but does not depend on the actual magnitude of the neuronal activity.

  • We used annealing of the learning rate, as soon as the neuron has reached a certain output level. While annealing is a well-known supplementary technique in many, also unsupervised, approaches [15–18] we use it as the main mechanism to stabilize learning.
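The interplay of the two modifications can be illustrated with a small Python sketch. It is one possible reading of the two points above and not a reproduction of Eqs (2), (4) and (8), whose exact forms are given earlier in the paper and may differ. The sigmoid and its parameters b and center, the multiplicative form of the annealing step, and the noise-free cyclic stimulus sequence are all assumptions; only the initial weights, the initial learning rate and dt = 1 follow values quoted in the figure captions.

```python
import math


def output(weights, inputs, b=10.0, center=0.5):
    """Hypothetical sigmoidal output transform (stand-in for Eq (2))."""
    y = sum(w * u for w, u in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-b * (y - center)))


def train_all_rule(patterns, steps, w0=0.001, mu0=0.0005,
                   eta=0.0, v_a=0.8, rho=0.3):
    """Euler integration (dt = 1) of an assumed form of the ALL-rule."""
    weights = [w0] * len(patterns[0])
    mu = mu0
    for t in range(steps):
        u = patterns[t % len(patterns)]  # fixed cyclic stimulus sequence
        v = output(weights, u)
        if v > eta:
            # All-or-none Hebbian term: growth depends on the input only,
            # not on the magnitude of the output (linear weight growth).
            weights = [w + mu * ui for w, ui in zip(weights, u)]
        if v > v_a:
            # Annealing: shrink the learning rate once the output is high.
            mu -= rho * mu
    return weights, mu


# Two inputs shown alone and together, without noise (illustration only).
patterns = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
weights, mu = train_all_rule(patterns, steps=3000)
v_single = output(weights, (1.0, 0.0))
v_both = output(weights, (1.0, 1.0))
print(weights, mu, v_single, v_both)
```

Under these assumptions the weights grow linearly until the response to the coincident pattern crosses the annealing threshold, after which the learning rate collapses and the weights freeze, leaving a high response to the combination and a low response to either input alone.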

We have shown that with the ALL-rule neurons can learn coincident feature detection, similar to an AND operator (Fig 5). When restricted to two inputs, BCM is the only one of the other rules analyzed here that also achieves this reliably. However, we found that BCM cannot sort more than two inputs. This is due to its intrinsic multiple non-linearities. While weights will converge, the actual location of the BCM fixed points cannot be predicted in the general case, and neurons may respond too strongly to few inputs and too weakly to combinations of more (see Figs 8A and 9B). The ALL rule, on the other hand, can solve the sorting problem.

In addition, we found that different neurons in a network become specific for different feature combinations when using the ALL rule. Remarkably, this happens reliably even in networks that exclude balancing effects due to inhibition, and we have shown that such excitatory networks do not run into a regime of uncontrolled sustained activity. As might be expected, when realistic inhibition is added to the network the findings remain similar, and the characteristics observed here change only for overly strong point-wise acting inhibition, which appears unrealistic when considering cortical networks.

Biophysics

All-or-none learning

The use of the Heaviside function for Hebbian learning (Eq 8) provides, from a theoretical perspective, several clear advantages because it leads only to linear weight growth. Different from this, the membrane Hebb rule, which uses the membrane potential to drive learning (Eq 6), leads to exponential weight growth and a strong run-away effect of the weights that belong to the stronger inputs (see Fig 4B). Furthermore, (especially at dendritic spines) it appears that the post-synaptic depolarization effects that influence Ca++ influx through NMDA channels, which determine LTP, have an all-or-none effect on plasticity. The absolute values for Ca++ within the dendrite required for the induction of synaptic plasticity have been estimated as 150–500 nM for LTD and >500 nM for LTP [27]. Furthermore, it has been measured that a single EPSP can raise the Ca++-level to 700 nM, where a pairing of post-synaptic depolarization with synaptic stimulation would even drive it up to as much as 12 μM [28].

Based on these findings [29] had designed a model of plasticity in spines that predicts that an EPSP resulting from the activation of a single synapse is sufficient to cause a significant Ca++ influx through NMDA receptors. This is in line with experimental data [30–32]. As a consequence, it appears that every post-synaptic back-propagating spike or dendritic spike will be enough to lead to substantial Ca++ influx to trigger plasticity (at a spine). This argues for a sharp transition of the post-synaptic learning influence, where the use of the Heaviside function would represent a limit case. Sigmoidal transition functions similar to Eq (2) could be used instead, where results for this study will be little affected if the sigmoid is steep enough.

Learning Rate Annealing

In 1998, Bi and Poo [33] had shown that the change in EPSP amplitude is inversely related to the size of the EPSP when employing a plasticity protocol. Hence, large synapses grow less than small synapses. This is potentially a ceiling (saturation) effect of LTP and could, in theoretical terms, indeed be captured by a learning rate annealing mechanism. This, however, points to a core problem: For theoreticians the learning rate is just a single variable and learning rate annealing is essentially just an abstraction of meta-plasticity. Linking this to complex multi-faceted biophysical processes, thus, remains difficult. There is a wealth of literature that suggests that the reduction of LTP, due to meta-plasticity, could rely on effects that influence NMDA receptors [34–38]. However, the time course of this might be too fast as these effects seem to decay within about one hour [34]. Stimulus driven annealing ought to be able to act rather on longer time-scales because the animal may only now and then encounter the relevant stimuli. Longer lasting reduction of LTP could be obtained by mechanisms that operate on its later phases (late-LTP) [39–42] suspected to be essential for establishing synaptic consolidation. However, any potential role of these mechanisms in meta-plasticity related to annealing effects remains unknown. Nevertheless, it seems conceivable that neurons reduce their ‘learning-efforts’ by reducing the synthesis of some relevant biochemical components using a saturation-driven kinetics, as soon as the neuron’s activity has grown enough, which could be understood as learning rate annealing.

Summary

The above discussion provides evidence that the here-assumed novel mechanisms of all-or-none learning paired with annealing are compatible with the biophysics of synapses (especially when considering spines). It is furthermore noteworthy that the biophysical “machinery” to implement the ALL-rule is relatively simple, which is different for most other advanced unsupervised rules (see next).

Other unsupervised rules

BCM

In the theoretical literature, learning rate annealing is a very widely used mechanism applied with different learning rules and for different purposes [17, 18, 43, 44]. Notably, the BCM rule also has a mechanism built in that could be understood as annealing. Its threshold θ relies on the time-averaged level of postsynaptic firing. Thus, if firing levels are maintained at a high level, this threshold shifts, making LTP harder to obtain. With some tuning, this rule was also able to solve the simple AND-operator task investigated in this study. The weight development of BCM, however, does not reflect the ordering of input coincidences in any reliable way (see Figs 8 and 9) and the location of the different output distributions in activation space cannot be predicted for neurons with several inputs. An additional undesired aspect of BCM is that convergence can be very slow for multiple inputs. This had been observed in a recent study [26] and we also found that for five inputs sometimes above one million iterations were needed until convergence. Increasing the learning rate does not much help here as this way quite strong weight oscillations can occur. When considering that we are here dealing with the presentation of external stimuli that “come from the world”, it is impossible for an animal to learn feature constellations using BCM due to delayed convergence. By contrast, the ALL rule converges (by construction always without oscillations) in about 100 iterations, where this number is not much affected by the number of inputs.

Given the complexity of more advanced versions of BCM (e.g. Intrator-Cooper), it is also unclear how this could be modeled in biophysical terms. This holds in particular because saturation-driven kinetic mechanisms, which operate on one or more compounds needed for LTP, do not map well to this rule.

Synaptic scaling

Synaptic scaling has been suggested as a possible mechanism to achieve targeted weight-growth, too, and scaling operates on rather long time scales, slower than learning. Hence, one aspect concerns the question to what degree the ALL-rule might relate to synaptic scaling [5]. Scaling assumes that neurons “want” to achieve a certain target activity [7] and that synaptic changes are driven by this target. Hence, this is indeed related to the operation of the ALL-rule. Alas, the existing mathematical formulations where (Hebbian) plasticity is combined with some scaling term [8], do not reliably lead to this property. Different from this, the ALL-rule does achieve this in a robust manner, where—for the purpose of this study—we have set the target activity to relatively high values, which allows getting the AND-operator property. However, due to the design of the annealing mechanism, other target values can also be obtained by using a different (lower) annealing threshold va.

Oja’s rule

This rule did not allow us to obtain in any reliable way the simple AND-operator property and had—as a consequence—not been further investigated.

Summary

The central problem of the learning rules discussed above appears to be that, while they all converge, the locations of the weights’ fixed points are not directly coupled to the (average) stimulus intensity, given approximately by the product of amplitude and occurrence frequency of the stimuli. Even in cases where the stimulus statistics are identical, different stimulus sequencing will drive weight development into different fixed points on their attractor landscape. This is different for the ALL rule, which leads to a rigorous “sorting” of the outputs according to their driving stimuli even for multiple inputs.

Comparing to other learning principles

Clearly, reinforcement learning and supervised learning would be able to achieve the tasks investigated in this study, too. Both, however, require evaluative feedback in the form of rewards or by use of an error function. While evaluative feedback can help achieving more discriminative results in learning tasks, not excluding the here-addressed task of coincidence sorting, the origin of error terms in biological systems is a large unresolved question in its own right [45]. We had discussed that there are formal similarities between our rule (Eq 8) and Rosenblatt’s perceptron [22], but for the perceptron an error term is needed. Note, that error terms in biological systems do not come “for free”. Any system using evaluative feedback needs additional components and complex processes. The multifaceted properties of the dopaminergic system in the animal brain (i.e. a reward-processing system which, however, also strongly reacts to just a novelty signal) testifies to this complexity [46, 47]. Different from this, our method is non-evaluative and performs a process of self-organized stimulus sorting in a single neuron. Any potential ecologically meaningful evaluation could then come on top and, for example, reinforcement learning of a beneficial behavioral policy could make use of the responses of our feature-combination specific neurons.

Limitations

This study has focused on a stationary environment, where the statistics of the inputs do not change between training and testing of the system. It is, however, straightforward to complement this with a decay term (forgetting) of the weights with which the system can recover its learning rate. Thus, given the fast convergence properties of the ALL-rule, changes in the environment, which usually happen on a slow time-scale, could be accommodated this way.

Currently, the ALL-rule leads only to weight growth. Forgetting would be a passive, possibly slow, mechanism to reduce weights. In contrast, active weight reduction can also be achieved with a mechanism for long-term depression (LTD). This can, for example, be done by using a sigmoidal function G(yη), η > 0 with values between −1 and +1 (or the Sign function) instead of the Heaviside function, which will lead to weight reduction for yη < 0. We are currently investigating both aspects (forgetting as well as LTD), but this goes beyond the scope of the current study.
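The difference between a growth-only gate and an LTD-capable gate can be illustrated with a minimal sketch. It assumes that the gate is evaluated on the output relative to a threshold, i.e. on y − η (an assumption about the notation above), uses tanh as one possible sigmoidal function with values between −1 and +1, and introduces a hypothetical steepness parameter `beta`. All function names and parameter values are illustrative and not taken from the study's code.

```python
import numpy as np

def heaviside_gate(y, eta):
    """Growth-only gate: 1 above the threshold, 0 otherwise."""
    return np.where(y - eta > 0, 1.0, 0.0)

def sigmoidal_gate(y, eta, beta=5.0):
    """Bipolar gate in (-1, 1): negative below the threshold (LTD-like).
    beta is a hypothetical steepness parameter."""
    return np.tanh(beta * (y - eta))

def weight_change(u, y, mu, eta, gate):
    """Gated Hebbian-style update: mu * gate(y, eta) * u (assumed form)."""
    return mu * gate(y, eta) * u

u = np.array([1.0, 1.0])   # both inputs active
mu, eta = 0.1, 1.0         # illustrative learning rate and threshold

dw_low_heaviside = weight_change(u, 0.5, mu, eta, heaviside_gate)   # no change
dw_low_sigmoid = weight_change(u, 0.5, mu, eta, sigmoidal_gate)     # reduction
dw_high_sigmoid = weight_change(u, 1.5, mu, eta, sigmoidal_gate)    # growth
```

With the Heaviside gate a sub-threshold output leaves the weights untouched, whereas the bipolar gate turns the same situation into an active weight reduction.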

Furthermore, we found that the ALL-rule is quite robust to variable stimulus occurrence frequencies and variable amplitudes. Only large amplitude differences will indeed harm performance. However, there is strong evidence that input normalization is a powerful mechanism in many different brain areas ([19], and see review [48]). Note that a factor of 1.5—like for the amplitudes in our experiments—is clearly within the normalization regimes for many of these experimental findings provided in the aforementioned studies [19, 48]. In an ecological setting animals have no control over how often a stimulus will occur and robustness against this kind of variability is useful for the learning process as also observed with the ALL-rule. In addition to this, normalization mechanisms can be used to ameliorate negative effects of amplitude variations.
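As a rough illustration of how normalization could reduce the effect of amplitude variations, the sketch below applies a simplified divisive normalization in the spirit of the models reviewed in [48]. The exponent of 1, the constant `sigma`, and the input values are assumptions made for this example only; the study itself does not specify a normalization stage.

```python
import numpy as np

def divisive_normalization(u, sigma=0.1):
    """Divide each input by the pooled activity: u_i / (sigma + sum_k u_k).
    sigma is a hypothetical semi-saturation constant."""
    u = np.asarray(u, dtype=float)
    return u / (sigma + u.sum())

weak = np.array([1.0, 1.5])   # amplitude ratio of 1.5, as in the text
strong = 10.0 * weak          # same ratio at ten times the intensity

n_weak = divisive_normalization(weak)
n_strong = divisive_normalization(strong)
```

In this simplified form the normalized inputs are nearly identical for the weak and the strong stimulus pair, so overall intensity differences are compressed, while the ratio between the two inputs is preserved.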

The networks investigated here are small, but their general connectivity pattern appears realistic relative to the neuron numbers used. Furthermore, similar types of networks have been used in many studies that address the problem of reservoir computing [49, 50]. The focus on excitation was chosen to demonstrate that even such networks will stabilize, but some general inhibitory connectivity was introduced, too. More targeted inhibitory connections (e.g. lateral inhibition) will begin to make sense only once some topology is introduced in such networks.

Conclusions

With the mechanisms employed here, we demonstrated that neurons can learn to respond to specific input combinations in an unsupervised manner. This can only be achieved if the system reacts in a rather invariant way to stimuli of different amplitude and occurrence frequency, which is ensured by annealing. We believe that this specificity for input combinations may be of ecological relevance for an animal, because it allows learning to respond to sets of inputs that might indicate situations with different (positive or negative) valence, whereas individual features might be irrelevant.

Supporting information

S1 Appendix. Additional analyses of reference methods.

(PDF)

pcbi.1011926.s001.pdf (1.6MB, pdf)
S2 Appendix. Parameter analysis for the BCM rule in case of two inputs.

(PDF)

pcbi.1011926.s002.pdf (1.4MB, pdf)
S1 Code Repository. Code for obtaining result figures presented in this manuscript.

(ZIP)

pcbi.1011926.s003.zip (307.3KB, zip)

Data Availability

All relevant data are within the manuscript and its Supporting information files.

Funding Statement

Supported by the German Science Foundation (DFG), grant WO 388/17-1 (F.W.) and grant TE 1172/7-1 (C.T.), as well as by the European Commission H2020, grant no.: 899265 "ADOPD" (F.W.) and the German Federal Ministry of Education and Research (BMBF) grant no. 01IS22093A "KISSKI" (C.T.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Widrow B, Hoff ME. Adaptive switching circuits. In: 1960 IRE WESCON Convention Record, Part 4. New York; 1960. p. 96–104.
  • 2. Oja E. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology. 1982;15:267–273.
  • 3. Bienenstock EL, Cooper LN, Munro PW. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience. 1982;2(1):32–48. doi: 10.1523/JNEUROSCI.02-01-00032.1982
  • 4. Miller KD, MacKay DJ. The role of constraints in Hebbian learning. Neural Computation. 1994;6(1):100–126. doi: 10.1162/neco.1994.6.1.100
  • 5. Turrigiano GG, Leslie KR, Desai NS, Rutherford LC, Nelson SB. Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature. 1998;391(6670):892–896. doi: 10.1038/36103
  • 6. London M, Segev I. Synaptic scaling in vitro and in vivo. Nature Neuroscience. 2001;4(9):853–854. doi: 10.1038/nn0901-853
  • 7. Turrigiano GG, Nelson SB. Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience. 2004;5(2):97–107. doi: 10.1038/nrn1327
  • 8. Tetzlaff C, Kolodziejski C, Timme M, Wörgötter F. Synaptic scaling in combination with many generic plasticity mechanisms stabilizes circuit connectivity. Frontiers in Computational Neuroscience. 2011;5:47. doi: 10.3389/fncom.2011.00047
  • 9. Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT press; 1998.
  • 10. Eberhart RC, Shi Y, Kennedy J. Swarm intelligence. Elsevier; 2001.
  • 11. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–533. doi: 10.1038/nature14236
  • 12. Nakamura K, Derbel B, Won KJ, Hong BW. Learning-Rate Annealing Methods for Deep Neural Networks. Electronics. 2021;10(16):2029. doi: 10.3390/electronics10162029
  • 13. Huang G, Li Y, Pleiss G, Liu Z, Hopcroft JE, Weinberger KQ. Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109. 2017.
  • 14. Liu C, Huang W, Xu RYD. Implicit bias of deep learning in the large learning rate phase: A data separability perspective. Applied Sciences. 2023;13(6):3961. doi: 10.3390/app13063961
  • 15. Xu L, Oja E, Suen CY. Modified Hebbian learning for curve and surface fitting. Neural Networks. 1992;5(3):441–457. doi: 10.1016/0893-6080(92)90006-5
  • 16. Hyvärinen A, Oja E. Independent component analysis by general nonlinear Hebbian-like learning rules. Signal Processing. 1998;64(3):301–313. doi: 10.1016/S0165-1684(97)00197-7
  • 17. Nessler B, Pfeiffer M, Maass W. Hebbian learning of Bayes optimal decisions. Advances in Neural Information Processing Systems. 2008;21.
  • 18. Krotov D, Hopfield JJ. Unsupervised learning by competing hidden units. Proceedings of the National Academy of Sciences. 2019;116(16):7723–7731. doi: 10.1073/pnas.1820458116
  • 19. Heeger DJ. Normalization of cell responses in cat striate cortex. Visual Neuroscience. 1992;9(2):181–197. doi: 10.1017/S0952523800009640
  • 20. Stevens CF, Zador AM. Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience. 1998;1(3):210–217. doi: 10.1038/659
  • 21. La Camera G, Rauch A, Thurbon D, Luscher HR, Senn W, Fusi S. Multiple time scales of temporal response in pyramidal and fast spiking cortical neurons. Journal of Neurophysiology. 2006;96(6):3448–3464. doi: 10.1152/jn.00453.2006
  • 22. Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review. 1958;65(6):386. doi: 10.1037/h0042519
  • 23. Intrator N, Cooper LN. Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks. 1992;5(1):3–17. doi: 10.1016/S0893-6080(05)80003-6
  • 24. Toyoizumi T, Kaneko M, Stryker MP, Miller KD. Modeling the dynamic interaction of Hebbian and homeostatic plasticity. Neuron. 2014;84(2):497–510. doi: 10.1016/j.neuron.2014.09.036
  • 25. Blais BS, Cooper L. BCM theory. Scholarpedia. 2008;3(3):1570. doi: 10.4249/scholarpedia.1570
  • 26. Froc M, van Rossum MC. Slowdown of BCM plasticity with many synapses. Journal of Computational Neuroscience. 2019;46:141–144. doi: 10.1007/s10827-019-00715-7
  • 27. Cormier R, Greenwood A, Connor J. Bidirectional synaptic plasticity correlated with the magnitude of dendritic calcium transients above a threshold. Journal of Neurophysiology. 2001;85(1):399–406. doi: 10.1152/jn.2001.85.1.399
  • 28. Sabatini BL, Oertner TG, Svoboda K. The life cycle of Ca2+ ions in dendritic spines. Neuron. 2002;33(3):439–452. doi: 10.1016/S0896-6273(02)00573-1
  • 29. Rackham O, Tsaneva-Atanasova K, Ganesh A, Mellor J. A Ca2+-based computational model for NMDA receptor-dependent synaptic plasticity at individual post-synaptic spines in the hippocampus. Frontiers in Synaptic Neuroscience. 2010; p. 31. doi: 10.3389/fnsyn.2010.00031
  • 30. Bloodgood BL, Sabatini BL. Nonlinear regulation of unitary synaptic signals by CaV2.3 voltage-sensitive calcium channels located in dendritic spines. Neuron. 2007;53(2):249–260. doi: 10.1016/j.neuron.2006.12.017
  • 31. Canepari M, Djurisic M, Zecevic D. Dendritic signals from rat hippocampal CA1 pyramidal neurons during coincident pre-and post-synaptic activity: a combined voltage-and calcium-imaging study. The Journal of Physiology. 2007;580(2):463–484. doi: 10.1113/jphysiol.2006.125005
  • 32. Sobczyk A, Svoboda K. Activity-dependent plasticity of the NMDA-receptor fractional Ca2+ current. Neuron. 2007;53(1):17–24. doi: 10.1016/j.neuron.2006.11.016
  • 33. Bi GQ, Poo MM. Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience. 1998;18(24):10464–10472. doi: 10.1523/JNEUROSCI.18-24-10464.1998
  • 34. Huang YY, Colino A, Selig DK, Malenka RC. The influence of prior synaptic activity on the induction of long-term potentiation. Science. 1992;255(5045):730–733. doi: 10.1126/science.1346729
  • 35. Coan E, Irving A, Collingridge G. Low-frequency activation of the NMDA receptor system can prevent the induction of LTP. Neuroscience Letters. 1989;105(1-2):205–210. doi: 10.1016/0304-3940(89)90038-4
  • 36. Youssef FF, Addae JI, Stone TW. NMDA-induced preconditioning attenuates synaptic plasticity in the rat hippocampus. Brain Research. 2006;1073:183–189. doi: 10.1016/j.brainres.2005.12.008
  • 37. Satoshi F, Yoichiro K, Masami M, Hidekazu F, Hiroshi S, Kenya K, et al. The long-term suppressive effect of prior activation of synaptic inputs by low-frequency stimulation on induction of long-term potentiation in CA1 neurons of guinea pig hippocampal slices. Experimental Brain Research. 1996;111:305–312. doi: 10.1007/BF00228720
  • 38. Frey U, Schollmeier K, Reymann K, Seidenbecher T. Asymptotic hippocampal long-term potentiation in rats does not preclude additional potentiation at later phases. Neuroscience. 1995;67(4):799–807. doi: 10.1016/0306-4522(95)00117-2
  • 39. Frey U, Morris RG. Synaptic tagging and long-term potentiation. Nature. 1997;385(6616):533–536. doi: 10.1038/385533a0
  • 40. Redondo RL, Morris RG. Making memories last: the synaptic tagging and capture hypothesis. Nature Reviews Neuroscience. 2011;12(1):17–30. doi: 10.1038/nrn2963
  • 41. Luboeinski J, Tetzlaff C. Memory consolidation and improvement by synaptic tagging and capture in recurrent neural networks. Communications Biology. 2021;4(1):275. doi: 10.1038/s42003-021-01778-y
  • 42. Lehr AB, Luboeinski J, Tetzlaff C. Neuromodulator-dependent synaptic tagging and capture retroactively controls neural coding in spiking neural networks. Scientific Reports. 2022;12(1):17772. doi: 10.1038/s41598-022-22430-7
  • 43. Sandholm TW, Crites RH. Multiagent reinforcement learning in the iterated prisoner’s dilemma. Biosystems. 1996;37(1-2):147–166. doi: 10.1016/0303-2647(95)01551-5
  • 44. Smith AJ. Applications of the self-organising map to reinforcement learning. Neural Networks. 2002;15(8-9):1107–1124. doi: 10.1016/S0893-6080(02)00083-7
  • 45. Burton TJ, Balleine BW. The positive valence system, adaptive behaviour and the origins of reward. Emerging Topics in Life Sciences. 2022;6(5):501–513. doi: 10.1042/ETLS20220007
  • 46. Glimcher PW. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences. 2011;108(supplement_3):15647–15654. doi: 10.1073/pnas.1014269108
  • 47. Duszkiewicz AJ, McNamara CG, Takeuchi T, Genzel L. Novelty and dopaminergic modulation of memory persistence: a tale of two systems. Trends in Neurosciences. 2019;42(2):102–114. doi: 10.1016/j.tins.2018.10.002
  • 48. Carandini M, Heeger DJ. Normalization as a canonical neural computation. Nature Reviews Neuroscience. 2012;13(1):51–62. doi: 10.1038/nrn3136
  • 49. Maass W, Natschläger T, Markram H. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation. 2002;14(11):2531–2560. doi: 10.1162/089976602760407955
  • 50. Jaeger H, Haas H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science. 2004;304(5667):78–80. doi: 10.1126/science.1091277
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011926.r001

Decision Letter 0

Lyle J Graham, Daniel Bush

3 Sep 2023

Dear Dr. Tamosiunaite,

Thank you very much for submitting your manuscript "Unsupervised learning of perceptual feature combinations" (PCOMPBIOL-D-23-01087) for consideration at PLOS Computational Biology. As with all papers peer reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent peer reviewers. Based on the reports, we regret to inform you that we will not be pursuing this manuscript for publication at PLOS Computational Biology. In particular, as highlighted by both reviewers, it is simply not clear that this work represents an advance on existing models.

The reviews are attached below this email, and we hope you will find them helpful if you decide to revise the manuscript for submission elsewhere. We are sorry that we cannot be more positive on this occasion. We very much appreciate your wish to present your work in one of PLOS's Open Access publications.

Thank you for your support, and we hope that you will consider PLOS Computational Biology for other submissions in the future.

Sincerely,

Daniel Bush

Academic Editor

PLOS Computational Biology

Lyle Graham

Section Editor

PLOS Computational Biology

**************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors here analyze the general problem of detecting coincident patterns. In particular, they propose a pair of putative learning algorithms based on Hebbian principles with simulated annealing controlling the learning rate and stabilizing the output of the cell. This acts to prevent the type of runaway growth that Hebbian rules often lead to. However, detecting coincident patterns is easily achievable with a Perceptron learning rule and a multi-layer perceptron network. In fact, I would argue that the learning rules the authors consider are essentially slight modifications to the Perceptron learning algorithm where all patterns cause an increase in weights, up until they reach the annealing threshold. As such, the novelty here is minimal, as time-varying learning rates in the perceptron learning algorithm are considered even as textbook exercises. It is rather unfortunate, as well, that Perceptrons and the Perceptron learning algorithm are never mentioned in this manuscript, despite the large similarity between both the problem the authors are trying to solve and the methodology they use. I outline my points below.

Major Concerns

1) The specific problem these authors have considered is that of detecting coincident patterns. If we receive multi-modal stimuli that coincide, the neuron should respond but not to the stimuli themselves. If we consider the case of two stimuli, then we can define a binary vector [0,0], [0,1], [1,0], and [1,1] where 1 denotes the presence of a stimulus, and 0 denotes the absence. The goal then is to have a neuron respond exclusively to [1,1]. But, this is accomplished with Frank Rosenblatt's Perceptron learning algorithm, and perceptrons, simple computational units that threshold a linear combination of inputs. Note however that this is a convention. We can equally take the entry "0" to denote the presence of a stimulus, and 1 to denote its absence, or we can define any other vector set (e.g. [-1,-1],[-1,1],[1,-1],[1,1]) to denote the presence or absence of a stimulus.
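For illustration, the supervised setting described in this comment can be sketched as a single threshold unit trained with the classic perceptron rule on the four binary patterns, with a target of 1 only for [1,1]. The learning rate, the bias term, and the number of epochs are assumptions chosen for this sketch; they come neither from the manuscript nor from the review.

```python
import numpy as np

# The four binary stimulus patterns and the coincidence (AND) target.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1], dtype=float)

def train_perceptron(X, d, r=0.1, epochs=50):
    """Classic supervised perceptron rule: w += r * (d - y) * x."""
    w = np.zeros(X.shape[1])
    b = 0.0  # bias term (assumed here; plays the role of a threshold)
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = 1.0 if w @ x + b > 0 else 0.0
            w += r * (target - y) * x   # error-driven weight change
            b += r * (target - y)
    return w, b

def respond(w, b, X):
    """Thresholded response of the trained unit to each pattern."""
    return (X @ w + b > 0).astype(float)

w, b = train_perceptron(X, d)
outputs = respond(w, b, X)   # one response per pattern, in the order of X
```

The sketch makes the dependence on the target vector `d` explicit: without this evaluative signal the update cannot be computed.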

2) In fact, I think the authors have largely "rediscovered" the perceptron learning algorithm, and used it somewhat incorrectly. The authors themselves never cite Rosenblatt's original work, or other perceptron-like pattern classifiers that immediately solve this problem. I outline, point by point, how the overall setup is largely identical to a perceptron learning algorithm.

- The computational units have a set of inputs u, and a weight vector w. The computational units take the inner product of w with u and then apply some kind of thresholding. This is either smooth, as in the case of the sigmoid considered in equation (2) or discrete, as in equation (7) where y=w^T u must be thresholded by eta. A Perceptron is a computational unit that takes an input pattern x and applies a weight w as an inner product, and subsequently thresholds it:

y = 1 if w^T x >0, 0 if w^T x < 0

- The learning rules are largely perceptron-like. The perceptron learning rule alters the weights as

w(t+delta) = w(t) + r(d_i-y_i)*x_i.

whereas the Hebbian rule alters the weight as

w(t+delta) = w(t) + mu(t)*y_i*x_i

and the ALL rule alters the weight as

w(t+delta) = w(t) + mu(t)*H(y_i-eta)*x_i

where i is the index of the training sample. Let's consider the Hebbian rule, as it is a simpler rule to work with. For all 2D input patterns, [0,0],[0,1],[1,0],[1,1], the ALL-Rule returns the following weight changes

Delta_W = [0,0] when x_i =[0,0]

Delta_W = [0,mu(t)] when x_i = [0,1]

Delta_W = [mu(t),0] when x_i = [1,0]

Delta_W = [mu(t),mu(t)] when x_i = [1,1]

While the Perceptron learning rule would return the following

Delta_W = [0,0]

Delta_W =[0,r(d_i-y_i)]

Delta_W = [r(d_i-y_i),0]

Delta_W = [r(d_i-y_i),r(d_i-y_i)]

Thus, if we consider a thresholded neuron where y_i is in the interval [0,1], and we set d_i = 1 to all the inputs, then we would largely obtain weight changes in a similar direction as the ALL rule. But this would imply all the patterns return a neural activation (y=1).
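The comparison above can be written out numerically. The following minimal sketch uses assumed step sizes and treats the gate of the unsupervised rule as a plain number (1 by default, matching the listed weight changes); the function names are hypothetical. It only illustrates that the two updates coincide when d_i = 1 and y_i = 0, and that only the perceptron update can change sign.

```python
import numpy as np

patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def gated_hebbian_delta(x, mu, gate=1.0):
    """Unsupervised update mu * gate * x; gate=1 matches the list above."""
    return mu * gate * x

def perceptron_delta(x, r, d, y):
    """Supervised update r * (d - y) * x; requires a target d."""
    return r * (d - y) * x

mu = r = 0.1  # illustrative step sizes
for x in patterns:
    unsupervised = gated_hebbian_delta(x, mu)
    supervised = perceptron_delta(x, r, d=1.0, y=0.0)
    print(x, unsupervised, supervised)

# With target d = 0 and output y = 1 the perceptron update becomes negative.
negative_update = perceptron_delta(patterns[3], r, d=0.0, y=1.0)
```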

The way the rule seems to work is that you just increase the weight corresponding to an input generally. In fact, this is shown in Figure 1. The weights always increase in response to an input pattern; it's just that the threshold is set to decrease the learning rate once a neuron fires. The coincidence detection arises because the two inputs and two positive weights lead to an overall current that is larger than when just a single input is on. Normally, however, in a Perceptron, you would have negative changes to the weights which allow you to set a boundary so that not all the responses have to be +1. The annealing is simply stopping the algorithm from yielding a "spiking" or threshold response to all possible input vectors, and so you obtain, somewhat coincidentally, pattern separation.
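The mechanism described in this paragraph can be mimicked with a deliberately simplified toy model. It is a sketch of the reviewer's verbal description, not the authors' rule: every active input grows by the current learning rate, and the learning rate is halved whenever the output exceeds a threshold. The stimulus order, the initial learning rate, the threshold `eta`, and the halving factor are all assumptions.

```python
import numpy as np

def train_annealed(patterns, cycles=50, mu=0.05, eta=1.0, anneal=0.5):
    """Toy rule: active inputs grow by mu; mu shrinks whenever y > eta."""
    w = np.zeros(2)
    for _ in range(cycles):
        for x in patterns:
            y = w @ x
            if y > eta:        # neuron "fires": reduce the learning rate
                mu *= anneal
            w += mu * x        # weights only ever grow
    return w, mu

# Each cycle presents input 1 alone, input 2 alone, then both together.
patterns = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
w, mu_final = train_annealed(patterns)

eta = 1.0
y_single = [w @ np.array([1.0, 0.0]), w @ np.array([0.0, 1.0])]
y_both = w @ np.array([1.0, 1.0])
```

In this toy setting the learning rate decays geometrically once the combined stimulus drives the output above `eta`, so the weights settle at values for which only the coincident input exceeds the threshold. Whether this holds for other stimulus statistics is not addressed by the sketch.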

Thus, the algorithms considered by the authors are 1) Perceptron-like in their deployment, 2) are likely to be less robust than a direct application of the Perceptron learning rule. This would also imply that the annealing threshold is always hand-tuned to guarantee the correct coincident detection.

Minor Concerns

1) Throughout the manuscript, the authors have typeset the quotation marks wrong. To produce the left quotation mark in LaTeX, one needs to press the ` key twice, while " is used exclusively for the right quotation mark.

2) This also occurs when using the single quote ' on line 32 for 'large enough'

3) Munro vs Munroe in the BCM rule, on line 23.

4) There are many statements made by the authors without supporting references. For example "annealing is usually applied as an additional mechanism to ensure an efficient convergence of weights"...examples?

5) The figure legends are very spartan at times. For example, for Figure 2, the legend consists of 4 words, with little other description.

6) I would advise the authors to put their code on a database with a referee read-only access password. I believe github and modeldb have such options.

Reviewer #2: This article addresses the issue of unsupervised development of an effective coding scheme for feature combinations, an issue that is important for various aspects of cognitive processing.

In my estimation, the most important element of this article is Figure 7B, bottom-right panel. This is a somewhat non-obvious result, in that it shows that the rule proposed by the authors is potentially capable of generating an ordered representation of feature combinations beyond pairwise, such that it lends itself to easy read-out by subsequent stages. I found it puzzling that this result is relegated to a tiny subpanel within a figure panel, and that it receives little attention beyond this demonstration. The authors should have expanded much more on this result, with more detailed investigations of the minimal conditions that are required for its emergence, its dependence on the number of inputs, parameterization, and others.

As for the remaining part of the article, which forms the bulk of the paper and which focuses almost entirely on pairwise combinations, I was not entirely convinced that it adds much on top of existing schemes. Let's focus on the Intrator-Cooper BCM rule, labelled BCM-rule by the authors. This is the only competing rule selected by the authors that appears as a valid contender to the ALL-rule proposed by the authors. Indeed, except for the result in Figure 7B mentioned above, the BCM-rule is able to accommodate pretty much all other results achieved by the ALL-rule (see for example Figure 7A).

When I look at equation 9 (BCM-rule), it seems plausible that it may be made to behave similarly to equation 3 (ALL-rule). \mu and \mathbf{u} are common to both rules. We are left with a bunch of stuff in equation 9 that depends on v, where v is a function of the linear output y, corresponding to g in equation 3, which is a thresholded (Heaviside) function of y. As noted by the authors, v can be made to approximate the thresholding function in equation 3 (lines 127-128), so the question is whether v^2*dv/dy (with \Theta_{M} set to ~0) may behave like v when v is very sharp, which is not implausible depending on details. I may have missed it, but I could not find a detailed description of the parameterization adopted for the BCM-rule, so I assume that the authors chose a parameterization for v that is quite gradual. I wonder how the BCM-rule behaves for the specific case shown in Figure 7B when it is parameterized to implement a sharp thresholding function v. It may produce a better separation for the triple conjunction, for example.

If the Intrator-Cooper BCM rule can be parameterized to encompass the results achieved by the ALL-rule, it is still interesting to show that it can be significantly simplified to acquire unexpected properties like Figure 7 (bottom-right panel), but it does subtract somewhat from the contribution of this study in terms of novelty. The issue here is that the ALL-rule may bring some benefit in relation to a specific problem like the one examined here, but may underperform with respect to a different problem. If the BCM-rule can potentially cover the ALL-rule and also deal with other problems, it would remain a more attractive option.

Another (minor) criticism is that the authors choose relatively narrow ranges for input parameterization, for example they explore two inputs that differ in intensity by about 50%, which is not very useful for real biological processing: under real-world scenarios, features must be combined under much more extreme changes in input intensity. I realize that the approach can be augmented to incorporate normalization schemes that act early, but it is not then clear that the ALL-rule would preserve the ordered structure demonstrated in Figure 7B under those conditions. I think this aspect of the study needs further investigation/clarification.

An issue related to the one above is that, possibly as a consequence of the limited amount of details provided by the authors in relation to the training procedure, it was not clear to me how robust the ALL-rule is to potential differences between the training cohort and a subsequent testing cohort that differs significantly, particularly in relation to coding stability. For example, once the network is frozen, is it stable when coding input distributions that differ from those used during training, and that differ among themselves? By stable I mean that if a given neuron codes for combination 13 or 123, it retains its label regardless of substantial changes in the parameterization of the input. For example, if input 1 was twice as intense as input 2 during training, and the network is characterized for this configuration, when I then characterize for a configuration in which input 1 is half as intense as input 2, is the code stable? These issues do not seem to be adequately addressed.

--------------------

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No: The authors won't make their code public until after the manuscript has been accepted, but have also not included links for a referee.

Reviewer #2: None

--------------------

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011926.r003

Decision Letter 1

Lyle J Graham, Daniel Bush

18 Jan 2024

Dear Dr. Tamosiunaite,

Thank you very much for submitting your manuscript "Unsupervised learning of perceptual feature combinations" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

In particular, the authors should do a little more to clarify the distinction between their unsupervised learning rule and the Perceptron rule, as well as the advantages of taking an unsupervised approach (i.e. biological plausibility, no need for feedback). However, I do not think they need to formally compare the results obtained using each approach.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Daniel Bush

Academic Editor

PLOS Computational Biology

Lyle Graham

Section Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: I should state upfront that, in general, I do not like to re-review rejected papers. This kind of situation makes peer review messy. Having said that, I sympathize with the authors in that they seem to have reasonable grounds for appeal in this case.

I should also note upfront that, when going back to this manuscript, I was irritated by the fact that the authors did not highlight revisions using, for example, red ink. This means that Reviewers must re-read the paper from scratch, which adds to their workload (and I am reviewing another 4 papers on top of this one). In the future, please be more respectful of Reviewers' time by clearly highlighting portions of the manuscript that have been modified during the revision process.

Now, with all that in mind, I am trying to keep an open mind here. The authors have expanded on the feature I singled out during the first round, namely the result in Figure 7 that was originally relegated to one subpanel. In my opinion, that is the entire paper: without that, I would not consider this manuscript above threshold for PLOS CB. With that result, I think it deserves consideration. In my understanding, it is an interesting result that such a simple rule can generate an ordered representation of feature combinations that is essentially transparent for further read-out by higher modules.

I was somewhat disappointed by the fact that the authors did not take on the challenge I set out for them in the first round regarding stability of the coding strategy (the point they refer to as Q6 in their rebuttal). At the same time, I accept that this is a very challenging problem, so I am willing to concede on that point and accept that it will require further research beyond this paper to be fully addressed.

All in all, the authors have addressed my comments to a reasonable degree of satisfaction.

Reviewer #3: I will not provide a detailed review of this paper, as I was brought in after an initial round of reviews had already taken place. Instead, I will focus primarily on the dispute/appeal regarding reviewer 1's comments on the manuscript. Having read through the reviewer comments and author responses, I am satisfied that the authors are correct in their response, and that the reviewer's objection was based on a misconception/failure to distinguish between supervised/unsupervised learning rules. That said, the very fact that the reviewer (who is presumably an expert in the field) was able to come to this erroneous conclusion suggests that the authors have not taken enough care to clarify the essential contributions of their study and the relevance to previous work. Therefore, I would recommend that the authors explicitly discuss the distinction between supervised/unsupervised learning rules, and make clear to the reader how their work goes beyond simple perceptron learning rules/is more biologically plausible, etc. (it is of course up to the authors to decide how to frame the contributions of their paper). While the authors have added a small paragraph attempting to address this, I found this unsatisfactory and somewhat dismissive of reviewer 1's concerns (e.g., "these two approaches cannot really be compared"). Instead, the authors should explain why this distinction/advance is important, and why this makes the proposed learning rule advantageous. Even better would be if the authors could formally (via numerical simulations and/or mathematical analysis) compare and contrast the supervised perceptron learning rule vs their method (of course supervised will do better, and that's not a problem, but some general insight into how the two learning rules relate could be interesting).
Overall, the comparison of this learning rule to the classic perceptron learning rule seems important, and has not been sufficiently elaborated in the manuscript as far as I can tell.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Peter Neri

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011926.r005

Decision Letter 2

Lyle J Graham, Daniel Bush

19 Feb 2024

Dear Dr. Tamosiunaite,

We are pleased to inform you that your manuscript 'Unsupervised learning of perceptual feature combinations' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Daniel Bush

Academic Editor

PLOS Computational Biology

Lyle Graham

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: NA

Reviewer #3: The authors have addressed my comments.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Peter Neri

Reviewer #3: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011926.r006

Acceptance letter

Lyle J Graham, Daniel Bush

27 Feb 2024

PCOMPBIOL-D-23-01087R2

Unsupervised learning of perceptual feature combinations

Dear Dr Tamosiunaite,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Additional analyses of reference methods.

    (PDF)

    pcbi.1011926.s001.pdf (1.6MB, pdf)
    S2 Appendix. Parameter analysis for the BCM rule in case of two inputs.

    (PDF)

    pcbi.1011926.s002.pdf (1.4MB, pdf)
    S1 Code Repository. Code for obtaining result figures presented in this manuscript.

    (ZIP)

    pcbi.1011926.s003.zip (307.3KB, zip)
    Attachment

    Submitted filename: Response_to_reviewers.pdf

    pcbi.1011926.s004.pdf (221.6KB, pdf)
    Attachment

    Submitted filename: Responses_to_reviewer_comments_revision2.pdf

    pcbi.1011926.s005.pdf (98KB, pdf)

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting information files.


    Articles from PLOS Computational Biology are provided here courtesy of PLOS
