Abstract
Spiking neural networks (SNNs) inherently rely on the timing of signals for representing and processing information. Augmenting SNNs with trainable transmission delays, alongside synaptic weights, has recently been shown to increase their accuracy and parameter efficiency. However, existing training methods to optimize such networks rely on discrete time, approximate gradients, and full access to internal variables such as membrane potentials. This limits their precision, efficiency, and suitability for neuromorphic hardware due to increased memory and I/O-bandwidth demands. Here, we propose DelGrad, an analytical, event-based training method to compute exact loss gradients for both weights and delays. Grounded purely in spike timing, DelGrad eliminates the need to track any other variables to optimize SNNs. We showcase this key advantage by implementing DelGrad on the BrainScaleS-2 mixed-signal neuromorphic platform. For the first time, we experimentally demonstrate the parameter efficiency, accuracy benefits, and stabilizing effect of adding delays to SNNs on noisy hardware. DelGrad thus provides a new way for training SNNs with delays on neuromorphic substrates, with substantial improvements over previous results.
Subject terms: Electronics, photonics and device physics; Learning algorithms; Computational science
It has recently been shown that synaptic transmission delays enhance the computational capabilities of spiking neural networks. In this manuscript, the authors introduce an exact, event-based training method for various types of delays and benchmark it on mixed-signal neuromorphic hardware.
Introduction
The mammalian brain has always represented the ultimate example of computational prowess, and therefore remains an important source of inspiration for understanding intelligence and replicating it in artificial substrates. In particular, its specific mechanisms for transmitting and processing information have been the subject of intense scrutiny and debate. Among these, the pulsed communication between neurons, predominantly based on all-or-none events, called action potentials or spikes, stands out as a distinguishing feature, and has thus been suggested to play an important role in the brain’s remarkable combination of computational performance and energy efficiency1,2. Consequently, spike-based communication represents a de facto standard across current neuromorphic platforms, which aim to inherit the proficiency of their biological archetype by replicating chosen aspects of its structure and dynamics3–7.
Among the various encoding schemes proposed for spiking neurons, the representation of information within the specific timing of individual spikes is of particular interest8, as it effectively allows the communication of, ideally, real-valued signals on an energy budget equivalent to only generating and transmitting a single bit. This gives rise to a specific call for SNN training algorithms that exploit the temporal richness of spike timing codes for solving computational tasks efficiently and accurately, while remaining capable of operating under the realistic constraints of the underlying physical substrate, whether biological or artificial.
Recent years have seen an exciting trend in this direction, showing how the performance of SNNs can be improved by optimizing various temporal parameters. Such parameters include neuronal integration time constants9–15, adaptation time constants16, and delay variables17–19.
In particular, spike transmission delays have been predicted to significantly enrich the information processing capabilities of spiking networks20,21, but specific applications to computationally demanding tasks had remained an open issue. However, recent evidence suggests that a co-optimization of synaptic weights and delays is possible and can significantly reduce the number of training parameters in an SNN, without loss of accuracy22,23. This finding is especially important for neuromorphic architectures that target limited on-chip memory.
Nevertheless, from an algorithmic perspective, optimizing delays in SNNs remains an ongoing research problem. Previous literature has largely focused either on exploiting heterogeneity in delay parameters while limiting gradient-based training to the weights, which amounts to selecting suitable delays rather than learning them18,23–25, or on evolutionary, non-gradient-based algorithms for finding delay parameters26. Recently, several approaches based on surrogate gradients27 have been proposed, using convolutional kernels22 or finite difference methods19,28. The underlying surrogate-gradient approach partially addresses the discontinuities of (dis)appearing spikes by smoothing out the spiking threshold, at the expense of both the exactness and the conservative property of the gradient field29. These types of algorithms operate in discrete time, which requires storing neuronal activities as binary vectors over the entire history of the SNN.
In addition, from a hardware perspective, there is a growing number of neuromorphic platforms that support the emulation of delays. These implementations require additional memory elements and parameter sets to retain the information of the incoming spike for a controllable amount of time. Previous implementations of on-chip delays using Complementary Metal-Oxide-Semiconductor (CMOS) technology have used digital circuits23,30–33, active analog circuits34–37, or mixed-signal solutions38. Furthermore, emerging memory technologies such as Resistive Random Access Memory (RRAM) have also been used to realize delay elements, taking advantage of their non-volatile, small three-dimensional footprint, and zero-static-power properties18,32. This increasing abundance of neuromorphic substrates offering configurable delays reveals an implicit call for algorithms capable of exploiting these novel capabilities.
In this work, we present DelGrad, which, to the best of our knowledge, is the first exact, analytical solution for gradient-based, hardware-compatible co-learning of delays and weights, using exclusively spike times for the computation of parameter updates. Its hardware compatibility stems from an algorithm-hardware co-design approach: the algorithm was developed with physical hardware systems in mind. It considers the real-valued nature of spike times on analog systems, as well as system-level constraints such as low I/O bandwidth and limited on-chip memory, resulting in a method that is inherently hardware-friendly. Compared to previous approaches, this spike-time-based method simultaneously increases precision and computational efficiency, while also minimizing the required memory footprint of the model. Under DelGrad, we quantitatively study the effect of different types of delays in relation to the size and performance of SNNs. Finally, to experimentally demonstrate the efficacy of our approach, we utilize DelGrad to perform chip-in-the-loop training of a delay-based SNN on a mixed-signal neuromorphic chip.
Results
Training delays in SNNs with DelGrad
While spiking neural networks share their overall structure with the more well-known artificial neural networks, the intrinsic dynamics in the networks are different. In SNNs, each unit in the network is a model of a spiking neuron, i.e., it communicates via binary all-or-none events called spikes, which carry information in their precise timing (Fig. 1). Here, we employ the leaky integrate-and-fire (LIF) neuron model, which despite its relative simplicity captures the most important properties of biological neurons and therefore often serves as a “standard model” in computational neuroscience and neuromorphic engineering39,40. Among these properties are foremost the spiking communication and the leaky-integrator dynamics: all input events are integrated on the membrane potential, which, after an excitation, slowly decays back to its resting state. The precise dynamics of a network of LIF neurons are determined by parameters of the neurons, but also by the inter-neuron connectivity and its parametrization, which includes synaptic weights and transmission delays.
Fig. 1. Information flow in a spiking neural network (SNN).
a Network architecture of a feed-forward SNN with a spiking input layer at the bottom, a hidden layer in the middle and the output layer on top. While the methods described in this manuscript are applicable to many different network architectures, the structure depicted in (a), with a variable size of the hidden layer, is used in the following. b Zoom in on the information processing in a single leaky integrate-and-fire (LIF) neuron in the hidden layer. Incoming spikes (blue, bottom) are integrated by the neuron’s membrane um and generate postsynaptic potentials (PSPs), which accumulate additively. Once the membrane potential passes a threshold (gray dashed line), an output spike (orange, top) is generated and passed on to the neurons in the next layer. The PSP amplitudes are modulated by the respective synaptic weights w (vertical red arrow); these are the parameters that are conventionally adapted during learning. Learnable transmission delays d (horizontal red arrow) shift PSPs in time, providing additional temporal processing power to the neuron. c Zoom out to a raster plot of the full spiking activity in the network. The information passed between the layers is encoded in the timing of the spikes. As sketched in the raster plot, in the experiments in this manuscript, we employ TTFS coding, i.e., each neuron spikes only once, however our method also generalizes to multi-spike scenarios (Section SI.D) if required by the task.
To train the parameters of an ANN, the error backpropagation algorithm41,42, which optimizes parameters via gradient descent, has become the de facto standard. In contrast, only recently has it become clear that in the case of SNNs the non-differentiability of spikes is, in fact, not an impediment for performing gradient-based optimizations. The developed optimization methods for training SNNs can be roughly split into two groups: approximate, surrogate gradient approaches19,27,43,44, and methods that employ exact spike time gradients24,45–48. In the following, we base our study on the exact gradient methods described in ref. 45.
The subthreshold dynamics of the membrane potential um of an LIF neuron with exponential current-based synapses are governed by the differential equation
$$\tau_\mathrm{m}\,\frac{\mathrm{d}u_\mathrm{m}}{\mathrm{d}t} = E_\ell - u_\mathrm{m} + \frac{I_\mathrm{s}(t)}{g_\ell} \tag{1}$$
with neuronal time constant τm, leak potential Eℓ, leak conductance gℓ and synaptic input current Is defined as
$$I_\mathrm{s}(t) = \sum_i w_i\, \Theta(t - t_i)\, \exp\!\left(-\frac{t - t_i}{\tau_\mathrm{s}}\right) \tag{2}$$
where Θ(t–ti) is the Heaviside step function, wi is the weight associated with the synapse receiving a spike at time ti, and τs is the synaptic time constant. Is thus outputs a current that is a first-order low-pass filter of the input spike train, represented by the exponential kernel Θ(t) exp(−t/τs). Is is itself further leaky-integrated by the neuron’s membrane, with time constant τm. Upon crossing the spiking threshold ϑ, the neuron emits an output spike and the membrane is reset to Vreset for a refractory period τref, during which the neuron does not react to any further input spikes.
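To make these dynamics concrete, the following sketch (an illustration of ours, not part of the study; all constants are arbitrary choices) integrates Eqs. (1) and (2) with forward Euler and returns the time of the first threshold crossing. DelGrad itself never needs such numerical integration, since it works with the closed-form spike times below; this is only to illustrate the neuron model.

```python
import math

def lif_first_spike(inputs, tau_m=2.0, tau_s=1.0, theta=1.0,
                    E_l=0.0, g_l=1.0, dt=1e-3, t_max=20.0):
    """Forward-Euler integration of the LIF dynamics (Eqs. 1-2).

    `inputs` is a list of (spike_time, weight) pairs; returns the time of
    the first threshold crossing, or None if the neuron stays silent.
    """
    u = E_l
    t = 0.0
    while t < t_max:
        # Synaptic current: exponentially filtered input spike train (Eq. 2).
        I_s = sum(w * math.exp(-(t - t_i) / tau_s)
                  for t_i, w in inputs if t >= t_i)
        # Membrane update (Eq. 1).
        u += dt / tau_m * (E_l - u + I_s / g_l)
        if u >= theta:
            return t  # first output spike time T
        t += dt
    return None
```

Note that delaying a lone input spike by some amount shifts the output spike by exactly that amount, illustrating the time-shift invariance of the LIF dynamics that is exploited later when discussing delays.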
The response function of a neuron thus fundamentally maps a sequence of input spike times ti to a sequence of output spike times Ti. Here, we focus on a single output spike per neuron for ease of notation, but the method can be straightforwardly extended to multi-spike scenarios, as described in Section SI.D. For one such output spike time T, under a parametrization given by the synaptic weights wi of the incoming connections, we can write T as a function of the set of input weights {wi} and set of input spikes {ti}:
$$T = T\big(\{w_i\},\{t_i\}\big) \tag{3}$$
Under certain conditions, depending on the values of the neuronal and synaptic time constants τm and τs, the function T becomes analytic, as discussed in ref. 45.
For example, for τm = τs
$$T = \tau_\mathrm{s}\left[\frac{b}{a_1} - W\!\left(-\frac{\vartheta}{a_1}\, e^{\,b/a_1}\right)\right] \tag{4}$$
and for τm = 2τs
$$T = 2\tau_\mathrm{s}\,\ln\!\left(\frac{2 a_1}{a_2 + \sqrt{a_2^2 - 4 a_1 \vartheta}}\right) \tag{5}$$
where ai and b are explicit functions of wi and ti, and W is the Lambert W function (see Eq. SI.5). Writing the spike time in this way enables us both to perform an efficient, event-based forward pass and to train this network, by calculating the exact gradient of the output of the network with respect to the network parameters.
In a multi-spike scenario (Section SI.D), all subsequent spikes after the first spike can be calculated by taking the reset into account and solving the equation for different initial conditions. Ultimately, this results in similar expressions as Eqs. (3–5).
To optimize the network parameters via gradient descent, we base the update of each parameter θ on its influence on the loss ℒ, i.e., the gradient ∂ℒ/∂θ. Employing the chain rule, this gradient is iteratively composed of ∂T/∂θ and ∂T/∂ti, i.e., derivatives of the above equations.
Specifically for a network with parameters wi, ∂T/∂wi allows us to link a deviation in an output spike time to a change in weight parameters, while ∂T/∂ti relates this deviation in the output to a deviation in the input, thereby enabling us to propagate an error in the spike time backwards through the neuron. Crucially, just like the original Eqs. (4 and 5), and in contrast to surrogate-gradient-based approaches, these derivatives only depend on spike times and parameters of the network and can be computed without the calculation or measurement of the membrane potential. This allows us to perform a fully event-based forward and backward pass, without any need for temporal discretization of the forward or backward dynamics.
Transmission delays of spike signals can now simply be introduced as additive parameters di to the original spike times ti:
$$\tilde{t}_i = t_i + d_i \tag{6}$$
These delayed spike times then become the relevant input for the postsynaptic neuron. As above, derivatives of this expression provide the necessary quantities for adapting the delays and for backpropagating the spike timing errors. In this case, the corresponding equations are trivial:
$$\frac{\partial \tilde{t}_i}{\partial d_i} = \frac{\partial \tilde{t}_i}{\partial t_i} = 1 \tag{7}$$
Treating spike times as continuous variables, in contrast to the time-binning performed in other approaches19,22, allows this natural implementation of full-precision delays as well as the exact and simple training of the delay parameters. We note that these considerations do not depend on a specific network setup and thus apply to any activity pattern in arbitrary spiking networks. In the following, we focus our attention on the particular problem of pattern classification, for which we employ a specific network architecture and spike coding scheme (Fig. 1).
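Because Eq. (6) is a plain addition and Eq. (7) makes both derivatives unity, a delay layer is only a few lines of code. The sketch below (class name and list-based interface are our own choices, not from the paper) implements the event-based forward and backward pass for per-neuron axonal delays:

```python
class DelayLayer:
    """Axonal delay layer: one trainable delay per presynaptic neuron."""

    def __init__(self, n_neurons):
        self.d = [0.0] * n_neurons       # trainable delays
        self.grad_d = [0.0] * n_neurons  # accumulated delay gradients

    def forward(self, t_in):
        # Eq. (6): delayed spike time = input spike time + delay.
        return [t + d for t, d in zip(t_in, self.d)]

    def backward(self, grad_t_out):
        # Eq. (7): d(t + d)/dd = 1, so the loss gradient w.r.t. each delay
        # is exactly the gradient arriving at the delayed spike time ...
        self.grad_d = list(grad_t_out)
        # ... and d(t + d)/dt = 1, so it also passes through unchanged
        # to earlier layers.
        return list(grad_t_out)
```

Only spike times and their gradients cross the layer boundary; no membrane potentials or time-discretized state are needed.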
To take advantage of a well-established architectural paradigm, we now consider information propagation in hierarchical feed-forward networks. As also shown in the corresponding computational graph (Fig. 2a, solid black arrow), the input t0 is passed through the sequence of layers until it reaches the output (we use bold symbols to denote non-scalar variables). The gradient of the chosen loss function then goes backwards through the network (dashed red arrow) for optimizing the parameters. In the forward pass, the only information that is transmitted is the spike times tl; in the backward pass, we transmit the gradient of the loss function ℒ, but note that it is also only evaluated at the times when neurons spike.
Fig. 2. Computational graph of a multi-layer SNN with spike-time information encoding and adjustable delay and weight parameters.
a Graph for a multi-layer network with spike times t0 injected into the bottom (1st) layer. In the forward pass (black arrows), each layer l takes spike times as inputs and returns spike times as outputs that go into the next layer. The spike times of the topmost layer are used to compute the loss function ℒ. The backward pass (red dashed arrows) starts at the loss and passes the gradients backwards through the layers. We consider two types of layers: neuron layers and delay layers. b Neuron layer with parameters wl (synaptic weights). These are used together with the input spike times tl−1 to calculate the output spike times tl according to the nonlinear relation described in Eqs. (4 and 5). c Delay layer with parameters dl that are added (linearly) to the input spike times tl−1 to calculate the output spike times tl as in Eq. (6).
For SNNs with delays, the computational graph differentiates between two types of layer: neuron layers and delay layers (Fig. 2). Both layer types receive input spikes tl−1 and return output spikes tl, but use different forward transfer functions, as given by Eqs. (4 and 5) for neuron layers and Eq. (6) for delay layers, respectively. In the backward direction, they pass the partial derivatives discussed above.
Figure 2b, c highlight the similarity of the two layer types: both neuron and delay layers take spike trains as an input and produce spike trains as an output in the forward pass, and propagate gradients of the loss with respect to the corresponding spike times in the backward pass. Their respective computations are carried out sequentially, as depicted in Fig. 2a, with delay layers stacked in between neuron layers.
In Fig. 3a we distinguish between different types of delays: axonal delays daxo on a neuron’s output, dendritic delays dden on a neuron’s input, and synaptic delays dsyn that are specific for every connection between pairs of neurons. Their respective natural representations as column vectors, row vectors and matrices are shown in Fig. 3b. The memory footprint of axonal and dendritic delays thus scales linearly with the number of neurons in the network, while for synaptic delays, it scales linearly with the network depth and quadratically with its width.
Fig. 3. Illustrating different types of delays.
a From bottom to top: axonal delays shift the timing of the neuron’s outgoing spikes by daxo (orange); synaptic delays shift the timing of spikes by a specific value dsyn for each pair of pre- and post-synaptic neuron (purple); dendritic delays shift the timing of the incoming spikes into a neuron by dden (red). b Vector and matrix representation of the different types of delays and their dimensionality as a function of the number of pre- and post-synaptic neurons. c Equivalent effect of the dendritic and axonal delays on the output spike time of a neuron, due to the time-shift invariance of the temporal dynamics of a LIF neuron. d Schematic illustration of the location of synaptic, dendritic and axonal delay components in a generic neuromorphic crossbar architecture.
While in principle different types of delays can be simultaneously present in a network and can be combined with each other, it is important to note that, as illustrated in Fig. 3c, combining dendritic and axonal delays for the same neuron is redundant: as neuronal dynamics are invariant to temporal shifts, it is equivalent for the input spikes to arrive with a delay dden = d, resulting in a delayed output spike (red arrow and gray curve), or for the output spike of the neuron to be directly delayed with daxo = d (orange arrow and membrane dynamics in black).
Given the resource constraints of neuromorphic systems, we investigate the performance benefits afforded by the different delay types, which differ in their demands on memory resources. Although a quantitative evaluation of the exact energy consumption, chip area and design complexity of different delay architectures heavily depends on the system architecture and the chosen design (e.g., analog vs. digital and circuit topology), some generic statements can be made using the mathematical representation of the delay elements.
For typical crossbar architectures (Fig. 3d), the synaptic delay mechanisms are often located within the crossbar array and therefore scale with the product of the array’s input and output size. In contrast, dendritic and axonal delays can be located in the periphery of the array, and thus their required area scales linearly with the input and output array size, respectively. It is worth noting that an important property of axonal delay mechanisms is that they are located directly after the neurons’ output and therefore only need to operate on sparse events. In contrast, dendritic delays are located directly before the neurons’ input, and after the input signals have been scaled by the synaptic weight.
Depending on the design choices, in particular on whether the synaptic integration happens in the synapses or in the neurons, this may require more complex circuitry. Note also that neurons usually receive more spikes than they emit, so the required buffering may also increase the corresponding hardware footprint of dendritic delay implementations.
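These scaling relations can be summarized in a small counting helper (a sketch of ours, assuming every inter-layer connection block carries its own full set of delays; actual counts depend on where delays are placed in a given design):

```python
def delay_param_count(layer_sizes, kind):
    """Number of delay parameters in a feed-forward network with the given
    layer widths, for the three delay types discussed above."""
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    if kind == "axonal":      # one delay per presynaptic neuron
        return sum(n_pre for n_pre, _ in pairs)
    if kind == "dendritic":   # one delay per postsynaptic neuron
        return sum(n_post for _, n_post in pairs)
    if kind == "synaptic":    # one delay per connection (pre x post matrix)
        return sum(n_pre * n_post for n_pre, n_post in pairs)
    raise ValueError(f"unknown delay type: {kind}")
```

For example, a 4-30-3 network as used below carries 210 synaptic but only 34 axonal or 33 dendritic delay parameters, illustrating the quadratic versus linear scaling in the layer widths.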
Simulation results
This section evaluates DelGrad’s ability to co-train delays and weights, demonstrating improved accuracy and parameter efficiency over weight-only training. By systematically studying the effect of hidden layer size and comparing different delay types, we highlight the advantages of incorporating learnable delays.
We benchmark a PyTorch49 implementation of the DelGrad method using the Yin-Yang (YY)50 dataset to evaluate the impact of transmission delays on SNN performance, and assess how this varies with the network size. This dataset is selected for its advantageous properties: its compactness makes it amenable to hardware prototyping and fast training, and it discriminates well between network architectures and training paradigms, leaving ample room for benchmarking above the accuracy achievable with a linear classifier. The task is to classify the region of a Yin-Yang image to which a point in the image plane belongs, as illustrated in Fig. 4a. The coordinates of the point (x, y) and their mirrored values (1–x, 1–y) are encoded into spike times, such that a larger value of a coordinate results in a later spike time and, correspondingly, an earlier spike time for its mirrored version.
Fig. 4. Classification task and simulation results.
a The Yin-Yang (YY) task50 consists of the classification of dots based on whether they belong to the Yin (red), Yang (blue), or dot (green) regions, as illustrated in (a). The input features are the two-dimensional coordinates (x, y) of the image, along with their mirrored values (1–x, 1–y), totaling four features. These features are encoded into spike times, such that a larger value of x or y coordinate results in a later spike time for x or y and an early spike time for its mirrored version 1–x or 1–y respectively. For more details on the encoding, see the original publication50. b Test error as a function of the number of hidden neurons in an SNN, using different delay types. The solid lines and markers show the median of the error, and the shaded areas illustrate the interquartile ranges (IQRs) for 10 seeds. c Same data as in b but as a function of the number of trainable parameters in the networks, i.e., counting the distinct weights and, if applicable, delays. d Impact of axonal delays as a function of the temporal scale of the dataset. The trainable delays cover a range λ as indicated by the orange hue. The network performance without delays is shown in blue.
The network architecture as shown in Fig. 1 is a feed-forward multi-layer configuration with four input neurons, followed by a variable-size neuron layer (hidden layer) and finally an output layer, comprising three neurons for the three classes (a study on deeper networks is provided in Section SI.A.1). Delay layers are inserted between neuron layers, as previously illustrated in the computational graph (Fig. 2). The neurons have no configurable biases, and the time constants are configured such that τm = 2τs. Thus, we utilize Eq. (5) for training. The refractory period τref is set to infinity, such that all neurons only spike once. The output is represented in a time-to-first-spike (TTFS) decoding scheme, where the first output neuron to spike indicates the predicted class for a given input. To avoid negative or excessively large values for the delays, the effective delay d is calculated as a logistic function of a trainable parameter θd such that d = λ σ(θd), which ensures that the delays remain bounded between 0 and λ. Further details can be found in Section SI.A.
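The bounded delay parametrization can be sketched as follows (the function name is our own):

```python
import math

def effective_delay(theta_d, lam):
    """Map the unbounded trainable parameter theta_d to a delay in (0, lam)
    via the logistic function: d = lam * sigma(theta_d)."""
    return lam / (1.0 + math.exp(-theta_d))
```

Gradients for θd follow by the chain rule, dd/dθd = λ σ(θd)(1 − σ(θd)), so training updates act on θd while the effective delay stays strictly inside its bounds.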
We have chosen the time-invariant mean squared error (MSE) loss to improve accuracy and stability of training:
$$\mathcal{L} = \sum_{n \neq n^\star} \left[\max\left(0,\; T_{n^\star} - T_n + \Delta t\right)\right]^2 \tag{8}$$
where n⋆ and n denote the respective indices of the correct and wrong label neurons and Δt is a freely chosen parameter. The purpose of introducing Δt into the loss is to achieve a specific separation of Δt between the spike times of the correct and incorrect label neurons, instead of providing precise target spike times. To ensure a balance between model accuracy and hardware compatibility, Δt is set to 0.2τs in our simulations.
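The described behavior of the loss can be sketched as follows. Note that this squared-hinge form is our reading of the description (a penalty whenever a wrong-label neuron spikes less than Δt after the correct one), not a verbatim transcription of Eq. (8):

```python
def ttfs_loss(t_out, correct, delta_t):
    """Time-invariant spike-time loss sketch: each wrong-label neuron is
    pushed to spike at least delta_t after the correct-label neuron.
    Shifting all spike times by a constant leaves the loss unchanged."""
    t_star = t_out[correct]
    return sum(max(0.0, t_star - t_n + delta_t) ** 2
               for n, t_n in enumerate(t_out) if n != correct)
```

Because only spike-time differences enter, the loss imposes no absolute target times, which is the stated motivation for this form.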
We investigate the effects of different types of delay layers on accuracy, compared to configurations without any delays. Fig. 4 reports the performance of our approach on the YY dataset across different network sizes. Fig. 4b shows the percentage of misclassified samples in the test set (test error) of the network as a function of the number of hidden neurons. It demonstrates that co-training delays alongside the weights always improves performance, regardless of the specific type of delay. Among the delay-augmented configurations, the variant with synaptic delays outperforms the ones with axonal- or dendritic-only parameters. This is in line with expectations, as synaptic delays offer the greatest configurable parameter space among the three delay variants.
Figure 4c displays the same test errors, but now as a function of the number of parameters. This representation reveals that, at least for the YY dataset, delay-augmented networks with the same number of parameters perform similarly well, regardless of the type of delay. As before, for a given number of parameters, the co-training of delays always yields at least as good results as the training of synaptic weights alone. In other words, for the same memory footprint, a mix of both weights and delays is better than just synaptic weights.
Notably, the functional benefit of trainable delays depends to a great extent on the temporal structure of the data. In particular, we expect the training of delays to have a larger impact if the input data spans longer time scales. For YY, it is straightforward to change the temporal volume occupied by the dataset by modifying its span—the time difference between the earliest and latest possible input spikes. Fig. 4d shows the effect of trainable delays across these different spans. For small spans, errors are high because the temporal dynamics in the data are too fast for the intrinsic dynamics of the LIF neurons. However, beyond a certain point, we always observe a clear benefit of co-training delays and weights as opposed to weights alone. Furthermore, for a larger dataset span where input spikes can consequently be further apart, the range λ of available delays that are able to push the PSPs together becomes increasingly relevant.
Optimal learning rates are determined through hyperparameter optimization for each configuration of neuron and delay layers. Across all investigated settings, our approach demonstrates robust training convergence (Fig. SI.2a) as well as exploitation of all available resources (Fig. SI.2b). Overall, these results clearly evince the added value of learning delays, as well as the ability of our algorithm to capitalize on this potential.
Hardware results
To calculate the gradients for training weights and delays in SNNs, DelGrad only requires spike time recordings, in contrast to surrogate-gradient-based approaches, which also require recording membrane potentials. DelGrad is therefore ideally suited for implementation on a variety of neuromorphic substrates, whose output is spike-based by design51. Here, we demonstrate the flexibility of our method by describing a successful application in silico, on the neuromorphic platform BrainScaleS-2 (BSS-2), which does not natively support delays.
The BSS-2 system (Fig. 5b)7,52 is built around a mixed-signal neuromorphic chip with 512 physical neuron circuits. The neuron dynamics are accelerated by a factor of 10³ compared to biological time scales. The neuron circuits emulate the dynamics of the adaptive exponential leaky integrate-and-fire (AdEx) model53 with individually configurable parameters for each neuron54. Neighboring neuron circuits can be connected to form multi-compartment neurons55. The connectivity between the neurons on the chip can be configured arbitrarily within the constraints of the two 256 × 256 synaptic crossbar arrays. The synaptic weights are configured digitally with 6 bit resolution.
Fig. 5. In-the-loop training with on-chip axonal delays on BrainScaleS-2.
a Schematic illustration of the network architecture for on-chip axonal delays; here, we apply this generic approach to the BrainScaleS-2 neuromorphic hardware. Each neuron in the network (black) is paired with a parrot neuron (orange) connected in a one-to-one scheme. The parrot neuron repeats each of its input spikes with a configurable delay. b Photograph of the BrainScaleS-2 neuromorphic chip (taken from ref. 65). c Median test errors and IQR on the Yin-Yang dataset when training network weights and axonal delays (orange) or only weights (blue). The dash-dotted lines indicate a hardware-aware simulation (cf. Section SI.A.3) and the dotted lines the hardware emulation results. For comparison, we also show the ideal software simulation results from Fig. 4b in gray. The shaded areas indicate the IQR over 10 runs with different seeds. The values for networks with 30 hidden neurons (highlighted by the dashed box) are shown for a better comparison in (d). d Detailed comparison of performances at 30 hidden neurons of an ideal simulation, hardware-aware simulation and emulation on neuromorphic hardware.
Although the current generation of BSS-2 does not natively support on-chip delays, we present an approach that allows us to explore the computational potential of delays for the current substrate. We realize on-chip axonal delays by re-purposing a subset of the available neurons as delay elements. For this, we utilize the adaptation circuitry as well as multi-compartment functionality of the neurons on chip. This allows us to perform in-the-loop training of both synaptic weights and the on-chip axonal delays and illustrate the computational advantage obtained by the inclusion of delays. The details of the delay implementation are provided in Section SI.B.1. We additionally provide a proof-of-concept of a different on-chip realization of axonal delays using LIF neuron dynamics, which is the most widely adopted neuron model for hardware platforms (see Section SI.B.2).
Even without an explicit hardware implementation of delays, an effective axonal delay can be achieved by exploiting the dynamics of the on-chip infrastructure. For that, a “parrot neuron” is connected to the output of a neuron that is part of the actual trained network (Fig. 5a). For any spike that the network neuron produces, the parrot neuron is configured to elicit a spike after the desired delay.
In our implementation on BSS-2, this behavior is achieved via the interplay between the two neuron compartments that form a parrot neuron. The first compartment reacts to each incoming spike with a reset, which clamps the membrane voltage to the reset potential for a refractory period, during which the neuron is not responsive to incoming spikes. After the end of the configurable refractory period, the second compartment becomes active and its adaptation mechanism almost instantly triggers an output spike of the parrot neuron. Therefore, a spike is generated after the configurable refractory period, used here as the delay. For a more detailed description of the mechanism see Section SI.B.1. We use this method, as it allows us to control the delay produced by the parrot neuron via its refractory period, which is digitally controlled on BSS-2 using 8 bits of precision. This results in a more precise and easily configurable delay, compared to using an analog variable, and is likely closer to a future implementation of native delays on a BSS-2-like system.
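The 8-bit digital control of the refractory period implies that trained delays are rounded onto a discrete grid before being written to the chip. The following sketch illustrates this quantization step (the linear mapping and clamping are simplifying assumptions of ours, not the actual BSS-2 calibration):

```python
def quantize_delay(d, d_max, bits=8):
    """Round a desired delay onto a digital grid with `bits` of resolution
    over [0, d_max]; returns the integer code and the realized delay."""
    levels = 2 ** bits - 1
    code = round(d / d_max * levels)
    code = max(0, min(levels, code))  # clamp to the representable range
    return code, code * d_max / levels
```

The quantization error is at most half a grid step, d_max / (2^bits − 1) / 2, which bounds the mismatch between the delay used in the software backward pass and the one realized on chip.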
This delay mechanism allows us to train a network with axonal delays on BSS-2. We use an in-the-loop training approach, which means that we present a batch of inputs to the network on chip and record the spike times. The spike times are sent back to the host computer, where the loss and the backward pass are calculated in software. The resulting updates for weights and delays are then used to reconfigure the chip before the next batch is presented.
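The structure of this loop can be sketched with a toy stand-in for the chip. `MockChip` and its one-line "dynamics" (output spike time = input time plus delay, advanced by the weight) are purely illustrative assumptions and bear no relation to the actual BSS-2 software stack or the DelGrad gradients, which follow from the neuron dynamics; the sketch only shows how forward passes on hardware alternate with host-side updates.

```python
class MockChip:
    """Illustrative stand-in for the hardware substrate."""
    def configure(self, w, d):
        self.w, self.d = w, d            # reconfigure chip parameters

    def run(self, t_in):
        return t_in + self.d - self.w    # "emulated" output spike time

def train_in_the_loop(chip, data, w, d, lr=0.1, epochs=50):
    for _ in range(epochs):
        for t_in, target in data:
            chip.configure(w, d)         # write current parameters to the chip
            t_out = chip.run(t_in)       # forward pass on (mock) hardware
            err = t_out - target         # host side: loss is (t_out - target)**2
            w -= lr * (-2.0 * err)       # dL/dw = -2*err (weight advances the spike)
            d -= lr * (+2.0 * err)       # dL/dd = +2*err (delay retards the spike)
    return w, d
```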
With this chip-in-the-loop setup, we train and evaluate networks with synaptic weights alone, as well as networks that incorporate both adjustable weights and axonal delays (Fig. 5c). Similar to the simulation results presented in Fig. 4, we experimentally confirm an accuracy gain across a range of network sizes for the networks with additional delay parameters compared to those with only weight parameters.
Overall, the final test errors reached in the software simulation are lower than the ones measured on hardware. This is expected, as hardware effects such as trial-to-trial variations, fixed-pattern noise and jitter on the on-chip delays disturb the dynamics. To illustrate and characterize these effects, we measured the magnitude of several noise sources found on the hardware and modeled them in a series of hardware-aware simulations. Fig. 5c shows that, when the various sources of noise are realistically modeled based on hardware measurements, the hardware-aware simulations capture the increase in test error similarly to the actual emulation. This confirms that the gap in accuracy between software simulations and hardware experiments is mostly due to the modeled sources of noise. For an in-depth description of the noise models employed in the hardware-aware simulations and an analysis of the impact of the different noise sources on the network performance, we refer to Section SI.A.3.
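As a schematic of the kind of perturbation modeled there, the snippet below adds trial-to-trial Gaussian jitter to recorded spike times. The magnitude `sigma` here is a placeholder of our own; in the actual hardware-aware simulations, the noise amplitudes are calibrated from hardware measurements (Section SI.A.3).

```python
import random

def jitter(spike_times, sigma=0.1, rng=random):
    """Add independent Gaussian trial-to-trial jitter to each spike time
    (sigma is a placeholder, not a measured BSS-2 value)."""
    return [t + rng.gauss(0.0, sigma) for t in spike_times]
```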
For an easier comparison, we focus on the most expressive networks with 30 hidden neurons, highlighted by boxes in Fig. 5c, and collect the achieved test errors in Fig. 5d; they amount to 7.40% with axonal delays and 13.95% in the weight-only case on the hardware. A full report of the achieved test errors and interquartile ranges (IQRs), both in hardware-aware simulation and on chip, can be found in Table SI.1. Additionally, we note that the performance gap between the delay and no-delay setup is significantly wider on hardware than in the ideal software simulations. We hypothesize that this effect arises because the YY classification problem, by design, can be solved by a small network, leading to few learnable parameters, low redundancy, and consequently a greater sensitivity to noise. Introducing axonal delays increases redundancy due to the higher parameter count. However, this increase is rather small compared to the number of parameters in a weight-only setup. This suggests that the computational properties of the delays are at least partially responsible for making the network more noise resilient, explaining the larger performance gap between the two networks on noisy hardware.
These results illustrate that our method for training delays is not only applicable in ideal software simulations but can also be applied to mixed-signal neuromorphic systems. Additionally, they demonstrate the benefit of learnable delays for neuromorphic platforms, especially in resource-constrained scenarios, and might encourage the inclusion of delay mechanisms in future generations of neuromorphic systems.
Discussion
We have introduced an exact event-based algorithm for training temporal variables, specifically transmission delays, in conjunction with synaptic weights in SNNs. Additionally, we have experimentally validated its effectiveness through both software simulations and neuromorphic hardware implementations.
Delay parameters were previously demonstrated to increase the representational power of SNNs even without being optimized themselves: training only the weight parameters can select useful fixed delays for spatio-temporal feature detection18,23,56. However, this optimization-through-selection approach requires an over-allocation of resources in order to provide a sufficiently diverse set of delay parameters from which the best can be selected. To illustrate this, we compare a network with random fixed delays to a network with delays trained using DelGrad. We show in Fig. SI.3 that, for the same number of delay parameters, the network with trained delays and weights has a clear accuracy advantage over the network with randomly initialized delays and weight-only optimization. Therefore, it is advantageous to combine a dedicated learning algorithm for transmission delays with hardware capable of configuring them accordingly.
Algorithms based on surrogate gradients for direct training of delay elements have been explored recently, using temporal convolution kernels22 or numerical solutions that estimate the delay gradients using finite-difference approximations19. However, as pointed out in ref. 22, delay training based on finite-difference approximation19 appears to not be sufficiently accurate to achieve an improvement over fixed, random delays. Additionally, both approaches use a time-stepped framework for calculating the gradients.
As such, delays are represented implicitly in the number of simulation time steps before transmitting a spike. However, as delay parameters are essentially shifts in individual spike times, we argue that it is more natural to have a framework where the information is explicitly represented by these spike times32,45,46,48,57, and delays are learned as additive parameters for these times. Furthermore, the objective of building efficient asynchronous neuromorphic systems, where time represents itself, is an additional motivation for representing temporal information in spike times3. Such representations are naturally available from event-based sensors, where the change in the signal is encoded into spike times using the delta modulation encoding scheme58–60.
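The additive parametrization is what keeps the gradients simple: a delayed arrival time is just t + d, so its derivative with respect to the delay d is exactly 1, and any upstream loss gradient passes through unchanged via the chain rule. A minimal finite-difference check on a toy quadratic loss (our own example, not the DelGrad derivation itself):

```python
def delayed_time(t, d):
    return t + d            # additive delay: d(delayed_time)/dd = 1

def loss(t_out, target):
    return (t_out - target) ** 2

t, d, target, eps = 1.0, 0.5, 2.0, 1e-6
# numerical gradient of the loss with respect to the delay
num = (loss(delayed_time(t, d + eps), target)
       - loss(delayed_time(t, d - eps), target)) / (2 * eps)
# analytical gradient via the chain rule: dL/dd = 2 * (t + d - target) * 1
ana = 2 * (t + d - target)
```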
This work brings together all the aforementioned objectives: DelGrad presents an event-based framework for gradient-based co-training of delay parameters and weights, without any approximations, and which meets the typical demands and constraints of neuromorphic hardware, as demonstrated experimentally on an analog mixed-signal neuromorphic system. As such, it takes an important step towards fully exploiting the temporal nature of SNNs for memory- and power-efficient end-to-end event-based neuromorphic systems.
In this work, we have also compared the effect of dendritic, axonal and synaptic delays on the performance of SNNs on a representative task. The synaptic delays have the highest impact on increasing the expressivity of SNNs, compared to using only dendritic or axonal delays. However, from a hardware perspective, the addition of synaptic delays imposes a quadratic growth on the size and thus the on-chip area of the network, compared to a linear growth in the case of axonal and dendritic delays. In fact, we find that when comparing the performance for equal parameter counts, the gap between different types of delays vanishes while the superiority over weight-only training persists. As memory represents one of the most critical constraints on hardware, reducing on-chip memory is of utmost importance. In particular, this means that a redesign of a chip with fewer neurons but with an intrinsic delay mechanism, based on our findings, will save energy while maintaining expressivity and performance. Therefore, our work suggests that it might be practical to consider using dendritic or axonal delays in future hardware designs, combining favorable scaling and improved processing power.
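The scaling argument can be made concrete with a back-of-the-envelope count for a single fully connected layer with n_pre inputs and n_post outputs (a sketch of the counting, with labels of our own choosing):

```python
def param_counts(n_pre, n_post):
    """Parameter counts for one fully connected layer under the
    different delay placements discussed above."""
    weights = n_pre * n_post
    return {
        "weights only": weights,
        "with axonal delays": weights + n_pre,             # one per axon: linear
        "with dendritic delays": weights + n_post,         # one per dendrite: linear
        "with synaptic delays": weights + n_pre * n_post,  # one per synapse: quadratic
    }
```

For example, with 4 inputs and 30 hidden neurons, synaptic delays double the parameter count (120 to 240), while axonal or dendritic delays add only 4 or 30 parameters, respectively.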
DelGrad provides an advantage in terms of hardware mappability as it only requires recording the spike times from on-chip neurons. This is in contrast to other approaches19,22, which need access to the membrane potential of all neurons for surrogate gradient learning61,62. Such voltage-based plasticity requires additional components for voltage readout, communication and potentially analog-to-digital conversion. Furthermore, this information is much denser in time than the spikes themselves, imposing further stress on the overall communication bandwidth. In both chip-in-the-loop and on-chip training scenarios, these additional requirements ultimately translate to additional circuitry; not only does this increase the complexity of the chip design, but it also inherently reduces the maximum implementable network size for a chip of a given area. Moreover, if learning is to be implemented on-chip, this additional circuitry will negatively affect the device’s energy efficiency during training.
In this work, the YY dataset was used as a proof of concept and a first step to benchmark our approach. The YY dataset provides a problem that cannot be solved linearly and in which the information can be presented using a TTFS encoding, similar to previous work45. As indicated by our experiments in Fig. 4d, the performance boost provided by the inclusion of learnable delays increases when the temporal features of the data span larger time scales.
Therefore, the natural next step will be a more thorough benchmarking on larger datasets, in particular on data with explicit temporal components, such as the datasets of refs. 63,64. Especially for data provided by event-based sensors, longer time scales are required, and TTFS might reach its limits as a feasible coding scheme. Although our current software implementation only takes into account a single spike per neuron during training, this is not a limitation of our proposed mathematical framework and training scheme (see Section SI.D). Additionally, the extension to more complex spike timing codes can go hand in hand with a shift from a feed-forward to a recurrent network architecture.
Supplementary information
Acknowledgements
We want to thank the EIS-Lab, NeuroTMA, CompNeuro, and ElectronicVision(s) groups, in particular Yannik Stradmann, Robin Heinemann, Joscha Ilmberger and Eric Müller, for the continuing support. Additionally, we are grateful to the NNPC conference 2023, where this fruitful collaboration was initialized, as well as the CapoCaccia workshop, where this work was first presented and received much helpful feedback, especially from Paolo Gibertini and Maryada. We also thank Guillaume Bellec for critical comments on the preprint and Florent Draye for providing feedback on the mathematical proofs. The presented work has received funding from the Manfred Stärk Foundation (JG, MAP), the EC Horizon 2020 Framework Programme under grant agreement 945539 (HBP; JG, LK, SB, PL, JS, MAP) and Horizon Europe grant agreement 101147319 (EBRAINS 2.0; MAP), the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy EXC 2181/1-390900948 (Heidelberg STRUCTURES Excellence Cluster; JG, MAP), Swiss National Science Foundation Starting Grant Project UNITE (TMSGI2-211461; JW, LK, SB, MP), and the VolkswagenStiftung under grant number 9C840 (SB).
Author contributions
JG, JW, and LK jointly developed the theory, designed the experiments, implemented the code, executed simulation and hardware experiments; SB, PL, and JS contributed to the hardware experiments; MP and MAP supervised the project, contributed to the experiment design, and provided helpful guidance throughout; JG, JW, LK, MP, and MAP wrote and revised the manuscript.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Data availability
We used the Yin-Yang data set50; its code is available at https://github.com/lkriener/yin_yang_data_set.
Code availability
Code for the simulations is available at https://github.com/JulianGoeltz/fastAndDeep.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Julian Göltz, Jimmy Weber, Laura Kriener.
These authors jointly supervised this work: Melika Payvand, Mihai A. Petrovici.
Contributor Information
Julian Göltz, Email: julian.goeltz@kip.uni-heidelberg.de.
Jimmy Weber, Email: jimmy.weber@ini.uzh.ch.
Laura Kriener, Email: laurak@ini.uzh.ch.
Melika Payvand, Email: melika@ini.uzh.ch.
Mihai A. Petrovici, Email: mihai.petrovici@unibe.ch
The online version contains supplementary material available at 10.1038/s41467-025-63120-y.
References
- 1. Olshausen, B. A. & Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996).
- 2. Koch, C. & Segev, I. The role of single neurons in information processing. Nat. Neurosci. 3, 1171–1177 (2000).
- 3. Mead, C. Neuromorphic electronic systems. Proc. IEEE 78, 1629–1636 (1990).
- 4. Indiveri, G. & Liu, S.-C. Memory and information processing in neuromorphic systems. Proc. IEEE 103, 1379–1397 (2015).
- 5. Frenkel, C., Bol, D. & Indiveri, G. Bottom-up and top-down approaches for the design of neuromorphic processing systems: tradeoffs and synergies between natural and artificial intelligence. Preprint at 10.1109/JPROC.2023.3273520 (2023).
- 6. Furber, S. B., Galluppi, F., Temple, S. & Plana, L. A. The SpiNNaker project. Proc. IEEE 102, 652–665 (2014).
- 7. Billaudelle, S. et al. Versatile emulation of spiking neural networks on an accelerated neuromorphic substrate. In Proc. International Symposium on Circuits and Systems (ISCAS) (IEEE, 2020).
- 8. Bohte, S. M. The evidence for neural information processing with precise spike-times: a survey. Nat. Comput. 3, 195–206 (2004).
- 9. Yin, B., Corradi, F. & Bohté, S. M. Accurate and efficient time-domain classification with adaptive spiking recurrent neural networks. Nat. Mach. Intell. 3, 905–913 (2021).
- 10. Rao, A., Plank, P., Wild, A. & Maass, W. A long short-term memory for AI applications in spike-based neuromorphic hardware. Nat. Mach. Intell. 4, 467–479 (2022).
- 11. Nowotny, T., Turner, J. P. & Knight, J. C. Loss shaping enhances exact gradient learning with EventProp in spiking neural networks. Neuromorphic Comput. Eng. 5, 014001 (2025).
- 12. Bittar, A. & Garner, P. N. A surrogate gradient spiking baseline for speech command recognition. Front. Neurosci. 16, 865897 (2022).
- 13. Perez-Nieves, N., Leung, V. C. H., Dragotti, P. L. & Goodman, D. F. M. Neural heterogeneity promotes robust learning. Nat. Commun. 12, 5791 (2021).
- 14. Fang, W. et al. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proc. IEEE/CVF International Conference on Computer Vision, 2661–2671 (IEEE, 2021).
- 15. Moro, F., Aceituno, P. V., Kriener, L. & Payvand, M. The role of temporal hierarchy in spiking neural networks. Preprint at 10.48550/arXiv.2407.18838 (2024).
- 16. Bellec, G., Salaj, D., Subramoney, A., Legenstein, R. & Maass, W. Long short-term memory and learning-to-learn in networks of spiking neurons. In Proc. 32nd International Conference on Neural Information Processing Systems, 795–805 (Curran Associates Inc., 2018).
- 17. Hammouamri, I., Masquelier, T. & Wilson, D. G. Mitigating catastrophic forgetting in spiking neural networks through threshold modulation. Transactions on Machine Learning Research, https://openreview.net/forum?id=15SoThZmtU (2022).
- 18. D'Agostino, S. et al. DenRAM: neuromorphic dendritic architecture with RRAM for efficient temporal processing with delays. Nat. Commun. 15, 3446 (2024).
- 19. Shrestha, S. B. & Orchard, G. SLAYER: spike layer error reassignment in time. In Proc. 32nd International Conference on Neural Information Processing Systems, 1419–1428 (Curran Associates, Inc., 2018).
- 20. Maass, W. & Schmitt, M. On the complexity of learning for spiking neurons with temporal coding. Inf. Comput. 153, 26–46 (1999).
- 21. Izhikevich, E. M. Polychronization: computation with spikes. Neural Comput. 18, 245–282 (2006).
- 22. Hammouamri, I., Khalfaoui-Hassani, I. & Masquelier, T. Learning delays in spiking neural networks using dilated convolutions with learnable spacings. In The Twelfth International Conference on Learning Representations, https://openreview.net/forum?id=4r2ybzJnmN (2024).
- 23. Patiño-Saucedo, A. et al. Empirical study on the efficiency of spiking neural networks with axonal delays, and algorithm-hardware benchmarking. In Proc. International Symposium on Circuits and Systems (ISCAS), 1–5 (IEEE, 2023).
- 24. Bohte, S. M., Kok, J. N. & La Poutré, H. Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing 48, 17–37 (2002).
- 25. Gerstner, W., Kempter, R., van Hemmen, J. L. & Wagner, H. A neuronal learning rule for sub-millisecond temporal coding. Nature 383, 76–78 (1996).
- 26. Schuman, C. D., Mitchell, J. P., Patton, R. M., Potok, T. E. & Plank, J. S. Evolutionary optimization for neuromorphic systems. In Proc. Annual Neuro-Inspired Computational Elements Workshop (NICE), 1–9 (Association for Computing Machinery, 2020).
- 27. Neftci, E. O., Mostafa, H. & Zenke, F. Surrogate gradient learning in spiking neural networks: bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Process. Mag. 36, 51–63 (2019).
- 28. Sun, P., Chua, Y., Devos, P. & Botteldooren, D. Learnable axonal delay in spiking neural networks improves spoken word recognition. Front. Neurosci. 17, 1275944 (2023).
- 29. Gygax, J. & Zenke, F. Elucidating the theoretical underpinnings of surrogate gradient learning in spiking neural networks. Neural Comput. 37, 886–925 (2025).
- 30. Madhavan, A., Sherwood, T. & Strukov, D. Race logic: a hardware acceleration for dynamic programming algorithms. ACM SIGARCH Comput. Archit. News 42, 517–528 (2014).
- 31. Davies, M. et al. Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38, 82–99 (2018).
- 32. Madhavan, A., Daniels, M. W. & Stiles, M. D. Temporal state machines: using temporal memory to stitch time-based graph computations. ACM J. Emerg. Technol. Comput. Syst. 17, 1–27 (2021).
- 33. Merolla, P. A. et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345, 668–673 (2014).
- 34. Sheik, S., Chicca, E. & Indiveri, G. Exploiting device mismatch in neuromorphic VLSI systems to implement axonal delays. In Proc. International Joint Conference on Neural Networks (IJCNN), 1–6 (IEEE, 2012).
- 35. Wang, R., Jin, C., McEwan, A. & van Schaik, A. A programmable axonal propagation delay circuit for time-delay spiking neural networks. In Proc. International Symposium on Circuits and Systems (ISCAS), 869–872 (IEEE, 2011).
- 36. Huayaney, F. L. M., Nease, S. & Chicca, E. Learning in silicon beyond STDP: a neuromorphic implementation of multi-factor synaptic plasticity with calcium-based dynamics. IEEE Trans. Circuits Syst. I 63, 2189–2199 (2016).
- 37. Gerber, S., Steiner, M., Indiveri, G. & Donati, E. Neuromorphic implementation of ECG anomaly detection using delay chains. In Proc. Biomedical Circuits and Systems Conference (BioCAS), 369–373 (IEEE, 2022).
- 38. Richter, O. et al. DYNAP-SE2: a scalable multi-core dynamic neuromorphic asynchronous spiking neural network processor. Neuromorphic Comput. Eng. 4, 014003 (2024).
- 39. Lapicque, L. Recherches quantitatives sur l'excitation electrique des nerfs traitee comme une polarization. J. Physiol. Pathol. 9, 620–635 (1907).
- 40. Abbott, L. Lapicque's introduction of the integrate-and-fire model neuron (1907). Brain Res. Bull. 50, 303–304 (1999).
- 41. Linnainmaa, S. The Representation of the Cumulative Rounding Error of an Algorithm as a Taylor Expansion of the Local Rounding Errors. MSc thesis (in Finnish), 6–7 (University of Helsinki, 1970).
- 42. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
- 43. Zenke, F. & Ganguli, S. SuperSpike: supervised learning in multilayer spiking neural networks. Neural Comput. 30, 1514–1541 (2018).
- 44. Renner, A., Sheldon, F., Zlotnik, A., Tao, L. & Sornborger, A. The backpropagation algorithm implemented on spiking neuromorphic hardware. Nat. Commun. 15, 9691 (2024).
- 45. Göltz, J. et al. Fast and energy-efficient neuromorphic deep learning with first-spike times. Nat. Mach. Intell. 3, 823–835 (2021).
- 46. Wunderlich, T. C. & Pehle, C. Event-based backpropagation can compute exact gradients for spiking neural networks. Sci. Rep. 11, 12829 (2021).
- 47. Klos, C. & Memmesheimer, R.-M. Smooth exact gradient descent learning in spiking neural networks. Phys. Rev. Lett. 134, 027301 (2025).
- 48. Stanojevic, A. et al. High-performance deep spiking neural networks with 0.3 spikes per neuron. Nat. Commun. 15, 6793 (2024).
- 49. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037 (2019).
- 50. Kriener, L., Göltz, J. & Petrovici, M. A. The Yin-Yang dataset. In Proc. Neuro-Inspired Computational Elements Conference (NICE), 107–111 (Association for Computing Machinery, 2022).
- 51. Boahen, K. A. Communicating Neuronal Ensembles between Neuromorphic Chips, 229–259 (Kluwer Academic Publishers, 1998).
- 52. Pehle, C. et al. The BrainScaleS-2 accelerated neuromorphic system with hybrid plasticity. Front. Neurosci. 16, https://www.frontiersin.org/articles/10.3389/fnins.2022.795876 (2022).
- 53. Brette, R. & Gerstner, W. Adaptive exponential integrate-and-fire model as an effective description of neuronal activity. J. Neurophysiol. 94, 3637–3642 (2005).
- 54. Billaudelle, S., Weis, J., Dauer, P. & Schemmel, J. An accurate and flexible analog emulation of AdEx neuron dynamics in silicon. In Proc. 29th International Conference on Electronics, Circuits and Systems (ICECS), 1–4 (IEEE, 2022).
- 55. Schemmel, J., Kriener, L., Müller, P. & Meier, K. An accelerated analog neuromorphic hardware system emulating NMDA- and calcium-based non-linear dendrites. In Proc. International Joint Conference on Neural Networks (IJCNN), 2217–2226 (IEEE, 2017).
- 56. Habashy, K. G., Evans, B. D., Goodman, D. F. M. & Bowers, J. S. Adapting to time: why nature evolved a diverse set of neurons. PLoS Comput. Biol. 10.1371/journal.pcbi.1012673 (2024).
- 57. Schrauwen, B. & Van Campenhout, J. Extending SpikeProp. In Proc. International Joint Conference on Neural Networks (IJCNN) (IEEE, 2004).
- 58. Gallego, G. et al. Event-based vision: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44, 154–180 (2020).
- 59. van Schaik, A. & Liu, S.-C. AER EAR: a matched silicon cochlea pair with address event representation interface. In Proc. International Symposium on Circuits and Systems (ISCAS), 4213–4216 (IEEE, 2005).
- 60. Bartolozzi, C., Glover, A. & Donati, E. Neuromorphic sensing, perception and control for robotics. In Handbook of Neuroengineering, 1–31 (Springer, 2021).
- 61. Cramer, B. et al. Surrogate gradients for analog neuromorphic computing. Proc. Natl Acad. Sci. USA 119, e2109194119 (2022).
- 62. Göltz, J. et al. Gradient-based methods for spiking physical systems. In Proc. International Conference on Neuromorphic, Natural and Physical Computing (NNPC) (2023).
- 63. Cramer, B., Stradmann, Y., Schemmel, J. & Zenke, F. The Heidelberg spiking data sets for the systematic evaluation of spiking neural networks. IEEE Trans. Neural Netw. Learn. Syst. 33, 2744–2757 (2022).
- 64. Warden, P. Speech commands: a dataset for limited-vocabulary speech recognition. Preprint at 10.48550/arXiv.1804.03209 (2018).
- 65. Müller, E. et al. Extending BrainScaleS OS for BrainScaleS-2. Preprint at 10.48550/arXiv.2003.13750 (2020).