Abstract
The cerebellum has been considered to perform error-based supervised learning via long-term depression (LTD) at synapses between parallel fibers and Purkinje cells (PCs). Since the discovery of multiple forms of synaptic plasticity other than LTD, recent studies have suggested that synergistic plasticity mechanisms could enhance the learning capability of the cerebellum. Indeed, we have proposed a concept of cerebellar learning as a reinforcement learning (RL) machine. However, there is still a gap between the conceptual algorithm and its detailed implementation. To close this gap, in this research, we implemented a cerebellar spiking network as an RL model in continuous time and space, based on known anatomical properties of the cerebellum. We confirmed that our model successfully learned a state value and solved the mountain car task, a simple RL benchmark. Furthermore, our model demonstrated the ability to solve the delay eyeblink conditioning task using biologically plausible internal dynamics. Our research provides a solid foundation for a cerebellar RL theory that challenges the classical view of the cerebellum as primarily a supervised learning machine.
Keywords: cerebellum, reinforcement learning, spiking neural network, computational model, machine learning
Significance Statement.
The cerebellum has traditionally been understood as a supervised learning (SL) machine, yet its potential for reinforcement learning (RL) remains unexplored. This study presents the first cerebellar spiking network model capable of implementing RL through an actor-critic framework, grounded in known anatomical and functional properties. In this model, Purkinje cells and molecular layer interneurons act as actor and critic, respectively, without explicit temporal difference error computation. The model successfully performs both a standard RL task, the mountain car task, and a cerebellum-dependent motor learning task known as delay eyeblink conditioning. These results suggest that the cerebellum may integrate both SL and RL mechanisms, which may supersede the classical SL-centric view known as the Marr–Albus–Ito model.
Introduction
Theories of learning in the cerebellum have historically attracted considerable interest. One of the most widely known theories, the Marr–Albus–Ito model (1–3), proposes that the cerebellum acts as an error-based supervised learning (SL) machine (4) that modifies its output to reduce the discrepancy between actual and desired outputs. In this theory, climbing fibers (CFs) deliver error or teacher signals that can trigger long-term depression (LTD) at parallel fiber (PF)–Purkinje cell (PC) synapses (5). Since the discovery of LTD in the early 1980s (6), biological research has found many other forms of synaptic plasticity in the cerebellum (7). These findings have gradually extended the Marr–Albus–Ito theory (8–10). Furthermore, researchers have examined the learning theory using realistic spiking network models of the cerebellum (11–17). In particular, Hausknecht et al. (13) examined the learning capability of the cerebellum through a cerebellar spiking network modeled in an SL context. The network successfully performed SL and control tasks but failed to solve reinforcement learning and complex recognition tasks. Geminiani et al. (17) implemented a spiking network-based cerebellar model featuring bidirectional plasticity at PF–PC and PF–molecular layer interneuron (MLI) synapses across multiple microzones, with some microzones biased toward depression and others toward potentiation during motor control. Their results highlight the cerebellum's capacity for error correction and adaptive learning, particularly within the context of associative learning and motor control. Both computational studies modeled the cerebellum as a conventional associative learning machine and highlighted the cerebellar learning capability, especially in the SL context.
Another research direction adopts reinforcement learning (RL) theory (18) as an alternative to conventional SL. In an RL context, there is an "agent" and an "environment." The agent takes an action that affects the state of the environment. In response to the action, the environment sends a "reward" to the agent, which represents how good or bad the action was. The agent learns appropriate actions that maximize the expected future reward. One of the most notable differences between SL and RL is the feedback information from the environment. An SL machine receives errors or teacher signals representing optimal actions, whereas an RL machine receives only reward signals.
Swain et al. (19) reported that the cerebellum fulfills three key requirements to establish reinforcement learning: it receives sensory information about the external environment and the internal environment, selects a behavior to execute, and processes evaluative feedback on the success of that behavior. Building on this idea, Yamazaki and Lennon (20) proposed the concept that the cerebellum might perform RL. Specifically, CFs deliver reward information, PCs select actions, and MLIs evaluate the current state. Synaptic weights at PF–MLI as well as PF–PC synapses are modulated by the CF signals. The researchers interpreted this structure as an RL machine, specifically as an actor-critic model (18). However, there has been no detailed spike-based implementation for the cerebellar RL concept.
In this research, based on a continuous time actor-critic framework with spiking neurons (21), we implemented a cerebellar spiking network as an actor-critic model based on the known anatomical properties of the cerebellum. In particular, we modified the weight-update rule of the framework to fit the cerebellum. We evaluated the model using a simple machine learning benchmark task, known as the mountain car task (22). We also examined the behavior of the model using delay eyeblink conditioning (23), which is a standard cerebellum-dependent motor learning task. To our knowledge, the model presented here is the first cerebellar spiking network model that can act as an RL machine.
Results
Spike-based implementation of cerebellum-style RL
In this research, we implemented a cerebellar spiking network model as an RL machine (Fig. 1) while referring to the cerebellar RL concept (20) and the spiking actor-critic framework (21). Parallel fibers (PFs) excite stellate cells (SCs), basket cells (BCs), and Purkinje cells (PCs). SCs and BCs inhibit PCs. BCs and PCs inhibit each other, and both also have self-inhibitory connections. More specifically, BCs and PCs are divided into multiple groups, which correspond to action choices as described in the Actor and action selection section. BCs inhibit PCs in the same group and BCs in other groups, whereas PCs inhibit BCs in the same group and PCs in other groups. PCs also inhibit neurons in the deep cerebellar nuclei (DCN), which are also divided into multiple groups. CFs provide excitatory inputs to SCs and PCs. Because we used CF signals only to trigger synaptic plasticity, we did not implement explicit CF–SC and CF–PC connections.
Fig. 1.
Structure of the present cerebellar network implemented as an actor-critic model. The network consists of granule cells, molecular layer interneurons (SCs and BCs), PCs, DCN, and inferior olive. PFs deliver state information, CF delivers reward information, PCs serve as the actor, and SCs serve as the critic. Triangle-headed arrows and circle-headed arrows represent excitatory synapses and inhibitory synapses, respectively. Dashed lines represent excitatory signals from CF via glutamate spillover (63).
In the cerebellar spiking network model, we implemented an RL algorithm known as the actor-critic model (18) based on the previous spiking actor-critic framework (21). In RL, an agent in a given state at time t takes an action following a policy π. In response to the action, the environment changes its state and returns a reward to the agent. The agent learns to find the optimal policy that maximizes the expected future reward. Specifically, the agent of the actor-critic model consists of two modules: an actor and a critic. The actor calculates preferences over actions and decides on an action, while the critic computes the state value, which refers to the expected cumulative future reward. The temporal difference error (TD error) is calculated from the returned reward and the state value for updating the internal parameters of both the critic and the actor.
In our implementation, state information was transmitted through PFs to PCs, BCs, and SCs. The actor and the critic were PCs and SCs, respectively. Practically, the activity of PCs represented the avoidance of actions, while the activity of SCs represented the state value with the inverted sign, rather than the TD error represented by MLIs in a previous theoretical model (20). BCs contributed to action selection by modulating the activity of PCs (24, 25). Then, the most activated DCN group decided on the actual action. In response to the action, reward information was transmitted through CFs to PCs and SCs.
Representation of states and rewards
States were represented by PFs following the formulation in the classical Marr–Albus–Ito model (1–3). For simplicity, we arranged the PFs on a 2D grid plane composed of a number of small tiles (26). We named this plane the PF plane. Each tile had an individual receptive field that determined the PF's firing rate based on the observed state, as described in the Implementations section.
In the present study, rewards were actually treated as punishments, i.e. they were nonpositive. To represent this negative reward signal in the spike-based framework, CFs were modeled as a spike train scaled by a task-specific constant as follows:
(1)  r(t) = c_CF δ_CF(t),

where c_CF is a task-specific negative constant and δ_CF(t) represents the spike train of the CF. In response to the action, the environment modulates the firing rates of CFs.
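As a concrete illustration, the following C++ sketch shows one way such a reward channel can be realized in discrete simulation steps. The function and variable names (cfSpike, rewardFromCF, kReward) are our own placeholders, not identifiers from the released implementation.

```cpp
// Illustrative sketch of the CF reward channel (names are placeholders).
// The environment sets a CF firing rate; a Bernoulli draw per time step
// approximates a Poisson spike train; the reward is that spike train scaled
// by a task-specific negative constant, so r(t) <= 0.
#include <random>

bool cfSpike(double rateHz, double dtMs, std::mt19937 &rng) {
    std::bernoulli_distribution fire(rateHz * dtMs * 1e-3);
    return fire(rng);
}

double rewardFromCF(bool spiked, double kReward /* < 0, task-specific */) {
    return spiked ? kReward : 0.0;
}
```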
Critic
We designed the temporal activity of SCs to approximate the state value. Strictly speaking, the state value with the inverted sign was calculated by SCs as follows:
(2)
where ν is a scaling factor, and the other two symbols represent the temporal activity of SCs and the baseline of the state value, respectively. The baseline was set at ν times the average SC activity, which was calculated during the latter half of an episode without learning. The temporal activity of SCs was calculated as follows:
(3)
(4)
(5)
where the first two symbols represent the number of SCs and the temporal activity of the ith SC, respectively, and l is a label distinguishing the rise and decay components. The temporal activity of the ith SC rises with one time constant and decays to 0 with another. The spike train of the ith neuron is defined using the Dirac delta function and the fth spike time of the ith neuron.
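For illustration, a minimal C++ sketch of this read-out is given below, assuming a difference-of-exponentials filter for the rise and decay of each SC trace; the names and parameter values (tauRise, tauDecay, nu, baseline) are placeholders rather than those used in the simulations.

```cpp
// Illustrative read-out of the critic (placeholder names and parameters).
// Each SC spike is filtered with a rise and a decay time constant; the
// population average, scaled by nu and shifted by a baseline, yields the
// state value with the inverted sign, -V(t).
#include <vector>

struct SCTrace {
    double rise = 0.0, decay = 0.0;         // two exponential components
    double tauRise = 5.0, tauDecay = 50.0;  // ms (placeholders)
    void step(double dtMs, bool spiked) {
        rise  += dtMs * (-rise  / tauRise)  + (spiked ? 1.0 : 0.0);
        decay += dtMs * (-decay / tauDecay) + (spiked ? 1.0 : 0.0);
    }
    double value() const { return decay - rise; }  // rises, then decays to 0
};

double negStateValue(const std::vector<SCTrace> &scs, double nu, double baseline) {
    double sum = 0.0;
    for (const auto &s : scs) sum += s.value();
    return nu * sum / static_cast<double>(scs.size()) - baseline;  // -V(t)
}
```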
Actor and action selection
PCs served as the actor and determined the action taken (27). We interpreted the temporal activity of PCs as expressing the avoidance of certain actions. Specifically, the PCs in our model computed the avoidance of an action instead of the preference for an action. Therefore, the action taken was determined as the least-avoided action in the action space A at time t:
(6)  a(t) = argmin_{α∈A} P_α(t).
The avoidance P_α(t) of action α was represented by the activity of PCs in the corresponding group G_α as follows:
(7)  P_α(t) = (1/N_α) Σ_{i∈G_α} p_i(t),
where N_α is the number of neurons in the group G_α, and p_i(t) is the temporal activity of the ith PC, defined as follows:
(8)
(9)
where l, , and represent the label , the time constant for l, and the spike train of the ith PC, respectively. Practically, the actual action was generated at the downstream DCN (see Implementations section).
The spike activities of PCs were used for updating synaptic weights so as to reflect which action had been selected. It was vital that one PC group paused while the other PC groups remained active. We referred to this pausing behavior as a "dent," whereas activation of neurons in a specific area was referred to as a "bump" by Frémaux et al. (21). The inhibitory loop between BCs and PCs helped a single group pause selectively, forming a dent. Although the original spiking actor-critic framework supports a continuous action space, the present model only supports a discrete action space with two actions, to demonstrate that the present model can solve tasks in the simplest settings.
Weight-update rules
Our weight-update rule was based on TD long-term potentiation (LTP), which was proposed in a previous spiking actor-critic model (21). However, unlike the previous model, which calculated the time derivative of the state value directly to compute the TD error, our model did not calculate it, because no component in the molecular layer of our model explicitly handled the TD error.
To address this problem, we introduced an eligibility trace and transformed the weight-update rule. As a result, our weight-update rule for the state value was as follows:
(10)
(11)
(12)
(13)
where the listed symbols are the amount of synaptic change in the nth episode, the learning rate α, the negative reward, the eligibility trace, the reward discount time constant, the decay time constant of the eligibility trace, the forcing term of the eligibility trace, the postsynaptic spike train, the presynaptic spike train, the window function for spike-timing-dependent plasticity, and the decay time constant of the window function, respectively. Following the previous model (21), the eligibility term memorizes the history of inputs before the time of the last postsynaptic spike and is reset to 0 after the postsynaptic neuron spikes. The relationship of our weight-update rule to the original weight-update rule (21, 28) is described in the Relationship between our weight-update rule and the previous weight-update rule section of the Supplementary information. The synaptic weight was updated at the end of every episode with learning rate η and was bounded within a predefined range. Since SCs and CFs represented the state value with the inverted sign and the negative rewards, respectively, the equation for the PF–SC synapses was expressed as follows:
(14)
On the other hand, the weight-update rule for PF–PC synapses was expressed as follows:
(15)
The specific learning rates for PF–SC synapses and PF–PC synapses are denoted as and , respectively.
Specifically, given that the conjunctive activation of PFs and CFs induces LTP at PF–SC synapses (29–32), our SCs learned to approximate the state value with the inverted sign. LTD at PF–SC synapses was driven by PF firing and modulated by postsynaptic activation, resulting in bidirectional synaptic plasticity (29–32). At PF–PC synapses, the conjunctive activation of PFs and CFs induces LTD (7), while PF activation alone induces LTP. These forms of synaptic plasticity are also modulated by inhibition from SCs (33), and we assumed that PC activity contributes to the eligibility trace and to bidirectional synaptic plasticity (34, 35).
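The following C++ sketch illustrates the general idea of a reward-gated, eligibility-trace-based update for a single PF synapse. It is a simplified stand-in for Equations 10–15: it omits the PF-alone plasticity component and the modulation by SC inhibition, and the names, the unit-sized forcing term, and the sign convention are our own assumptions.

```cpp
// Illustrative reward-gated eligibility-trace update for one PF synapse
// (placeholder names; simplified relative to the model's Equations 10-15).
// The trace is driven by presynaptic spikes, decays with tauE, and is reset
// when the postsynaptic cell fires; CF-signaled (nonpositive) rewards convert
// the trace into a weight change that is applied at the end of the episode.
#include <algorithm>

struct PFSynapse {
    double w;            // synaptic weight
    double trace = 0.0;  // eligibility trace
    double dw = 0.0;     // change accumulated over the episode
};

void stepSynapse(PFSynapse &s, double dtMs, bool preSpike, bool postSpike,
                 double reward /* <= 0, nonzero only on CF spikes */,
                 double tauE,
                 double sign /* -1: PF-SC (conjunctive PF+CF potentiates),
                                +1: PF-PC (conjunctive PF+CF depresses) */) {
    s.trace += dtMs * (-s.trace / tauE);
    if (preSpike)  s.trace += 1.0;           // assumed unit forcing term
    if (postSpike) s.trace  = 0.0;           // reset after the post spike
    s.dw += sign * reward * s.trace * dtMs;  // reward-gated accumulation
}

void endOfEpisode(PFSynapse &s, double eta, double wMin, double wMax) {
    s.w = std::clamp(s.w + eta * s.dw, wMin, wMax);  // bounded update
    s.dw = 0.0;
}
```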
Additionally, we confirmed that our critic is capable of learning state values through a linear track task simulation (see the Simulation of linear track task section of the Supplementary information), which is a simplified version of the water-maze task used to isolate learning by the critic from that by the actor (21).
Simulation of mountain car task
In order to evaluate the learning capability of our model as an RL machine, we applied it to the mountain car task (36), a classic RL benchmark, and evaluated its performance over 10 independent runs, each learning for 1,000 episodes. In this task, an agent controls a car situated between two hills (Fig. 2A). The goal is to have the car climb the mountain to reach a flag positioned at the peak of the right hill. Due to the car's limited engine power and the hill's steepness, a direct ascent is not possible. The agent observes the position and the horizontal component of the car's velocity, and decides the direction (either left or right) in which to push the car with a constant force. In early episodes, the car was confined to moving at the bottom of the valley and failed to reach the goal (Fig. 2B). However, after hundreds of episodes, our model learned to utilize the slope for acceleration and eventually reached the goal (Fig. 2C). Specifically, we found that the trajectory involved the car ascending the right hill, then the left, and finally the right hill again. The state value with the inverted sign, which was approximated by SC activity, was noisy but mostly stayed around 0 at every state before learning (Fig. 2B). In a successful episode after learning (Fig. 2C), the value was high at the start and decreased over time as the car approached the goal. The moving average of the success rate over 10 runs progressively increased and stabilized at ∼80% after 600 episodes (Fig. 2D).
Fig. 2.
Simulation of the mountain car task. A) Schematic of the mountain car task. The horizontal position of the cart, its horizontal velocity, and the external force F applied to the cart by the agent are indicated. The flag marks the goal position. For clarity in the image, we adjusted the x-axis to start at 0 on the left edge. B, C) Comparison of behaviors in the mountain car task between (B) an early failure episode and (C) a later successful episode. Top panels show raster plots of SCs (0–323), BCs (324–647), and PCs (648–667); the numbers of these neurons are 324, 324, and 20, respectively. Middle panels show the trajectory of the state value with the inverted sign computed from the activity of SCs. Bottom panels show trajectories of the scaled states: the blue solid line and the orange dashed line represent the normalized position and the normalized velocity, respectively. The horizontal red solid line represents the goal position. When the position reaches the goal, the episode ends, and the simulation then continues for 100 ms (see the Pre- and postepisode processes section). D) Change of the success rate.
These results indicate that our model solved the mountain car task, and suggest that our model successfully acted as an RL machine.
Simulation of delay eyeblink conditioning task
To examine whether our cerebellar model shows internal dynamics consistent with the biological cerebellum, we conducted simulations of a standard cerebellum-dependent motor learning task known as the delay eyeblink conditioning task (23) over 10 runs, each learning for 500 episodes. In this task (Fig. 3A), an animal is presented with two types of stimuli: a conditioned stimulus (CS) and an unconditioned stimulus (US). The CS, a neutral stimulus such as a tone, initially does not trigger any response. On the other hand, the US, such as an air puff, naturally induces an unconditioned blink reflex in the animal. However, when the CS is repeatedly paired with the US, the animal begins to anticipate the air puff whenever the CS is presented. This anticipation leads the animal to blink in response to the CS alone, a learned behavior known as the conditioned response (CR). We modified the task to have a 2D state space consisting of time t and eyelid position, along with two actions: opening and closing the eyelid. The time axis ranges from 0 ms to 1,000 ms, corresponding to the duration of stimulus presentation, while the eyelid position is measured on a scale from fully closed (0) to fully open (1). The CS was presented from 0 ms to 1,000 ms as a time signal, and the US was introduced at 500 ms as a punishment if the eyelid was open, i.e. if the eyelid position was greater than 0.1.
Fig. 3.
Simulation of the delay eyeblink conditioning task. A) Schematic of the task. A neutral stimulus (CS) is presented, followed by an aversive stimulus (US) after a fixed interval. After repeated pairings, the agent closes its eyes (CR) in response to the CS alone. B, C) Population activity of PCs during (B) a failure episode and (C) a success episode. Top and bottom panels show the population activity of the antiopen and anticlose PC populations, respectively. Bin width is 10 ms. US onset is indicated by a vertical red line. The period during which the agent selects the eye-closing action is shown by a green shaded rectangle. D) Change of the success rate. E) The activity changes of a single PC in the anticlose group, with dots representing spikes and a red line indicating US onset. F) Change of the state value with the inverted sign, computed from the activity of SCs, across episodes. The color of each trace indicates the episode number, ranging from 0 to 500, as shown by the color bar. The black bold line represents the average over the last 10 episodes, the brown dashed line shows , and the vertical red line indicates US onset.
In early episodes (Fig. 3B), the agent preferred to open the eyelid, so it received the US as a negative reward. As learning progressed (Fig. 3C), the antiopen PC group became active, whereas the anticlose PC group paused as if the PCs expected the US onset. Thus, the eyelid started to close before the US onset. In all 10 runs, the agent successfully acquired the CR to the CS (Fig. 3D): closing its eyelid to avoid a punishment before the arrival of the US. Note that we defined a successful episode as one in which the eyelid was sufficiently closed to avoid the punishment at the US onset. When focusing on the activity changes of a single PC in the anticlose group (Fig. 3E), the neuron acquired a behavior of pausing before US onset. These behaviors were consistent with behavioral experiments on the delay eyeblink conditioning task (23).
We also analyzed the activity of SCs during the task (Fig. 3F). In early episodes, since the agent received the punishment as the US, the SC activity, representing the state value with the inverted sign, ramped up toward the arrival of the US. These results are not inconsistent with biological experiments (37, 38), which show increases in MLIs' firing frequency from the time of CS onset to the time of US onset. In later episodes, when the agent successfully closed the eyelid more than 80% of the time, the SC activity returned almost to baseline. This return resulted from the disappearance of CF signals due to the successful avoidance of punishments.
We also conducted a simulation of the same task without PF–SC synaptic plasticity. The activity of the anticlose PC group decreased as the time approached the US onset (Fig. 4A). However, the agent was not able to elicit a stable closing response in our settings. In a later episode, the time to close the eye was delayed (Fig. 4B) compared with that in the intact simulation (Fig. 3C), so the agent could not close the eye sufficiently.
Fig. 4.
Results of the delay eyeblink conditioning task without PF–SC synaptic plasticity. A) The activity changes of a single PC in the anticlose group. Dots represent spikes of the PC, and the vertical red line represents the time of US onset. B) Example population activity of (top) the antiopen PC population and (bottom) the anticlose PC population in a later episode. Bin width is 10 ms. Vertical red lines represent the time of US onset, while the green shaded rectangles show when the agent closes its eyelid around the US onset.
These results are also consistent with experimental results where MLI–PC inhibitions were impaired (39). Furthermore, these results suggested that our model successfully solved the eyeblink task with biologically plausible internal dynamics.
Discussion
In this study, we implemented a cerebellar spiking network model performing RL in an actor-critic manner. Specifically, we interpreted that PFs and CFs represent states and negative reward signals, respectively. SCs act as the critic and represent the state value with the inverted sign, while PCs act as the actor and represent the avoidance of certain actions. Our simulations confirmed the model's ability to solve a simple RL task and to reproduce a cerebellum-dependent motor learning task with biologically plausible internal dynamics.
In a standard RL paradigm, a value called the TD error or reward prediction error (RPE), which represents the difference between an actual reward and a predicted reward, plays an essential role. In the cerebellum, Ohmae and Medina (40) first reported RPE-like complex spike activities in PCs during eyeblink conditioning in mice. After this pioneering work, experimental studies in mice demonstrated that CFs carry multiplexed information including not just motor-related information but also reward-related information such as reward delivery, prediction, and omission (41–45). Hoang et al. (46) incorporated the idea of RPEs conveyed by CFs into Q-learning. While these studies interpreted RPEs as being provided by CFs, another hypothesis is that CFs convey negative reward information and MLIs compute the RPE. This hypothesis is more closely related to conventional motor adaptation studies of the cerebellum, because negative reward information could be regarded as a form of teaching signal. Experimental findings about the functional roles of MLIs beyond MLI–PC inhibition (24, 25, 32, 47–50), assistive roles of MLIs in cerebellar learning (37–39, 51), plasticity at PF–MLI synapses (29–31), and the modulatory role of MLIs in PF–PC synaptic plasticity (33) support this hypothesis. Based on these studies, Yamazaki and Lennon (20) interpreted MLIs as providing the RPE and proposed a conceptual cerebellar RL model in an actor-critic form.
The present study follows the latter hypothesis, because our model interpreted the SC population as a critic. On the other hand, by introducing our learning rule as in (14) and (15), our model does not need to calculate TD errors explicitly. Even without the explicit calculation, our model succeeded in solving the simple RL task and also suggested that impairments in PF–SC synaptic plasticity may affect the stability of movements acquired through an association task. With the new learning rule, SC population activity first gradually increased toward the onset of the CF signal (Figs. 3F and S1B), and then the SC activity returned to baseline once the optimal policy was acquired and no further punishment was expected (Fig. 3F). These observations imply that there is no way to distinguish the activities of sufficiently trained SCs from those of untrained SCs. To observe a learning-dependent transient increase or decrease of SC activities, one needs to track the activities of SCs persistently during the whole period of behavioral training. It is possible to track the same granule cells over several days of learning (52); similar attempts could be made for SCs.
Although the present study proposes that the cerebellum can act as an RL machine, this proposal does not necessarily deny the classical and standard view of the cerebellum as an SL machine. Rather, we consider that both SL and RL co-exist within the cerebellum. There are two possibilities for this co-existence. First, a single CF could deliver both evaluative and teaching signals, for RL and SL respectively, so that a single microzone can perform both RL and SL simultaneously. In this scenario, teaching signals mainly drive learning, while evaluative signals may play an assistive role. By combining teaching signals with evaluative signals, the cerebellum may learn appropriate actions more quickly than when using evaluative signals alone. Second, a single CF innervating a microzone could deliver either evaluative or teaching signals, so that a single microzone performs either RL or SL, and multiple microzones participate to perform both RL and SL simultaneously. In this scenario, distinct microzones may specialize in different learning strategies, some implementing RL and others SL, depending on their functional roles. In both scenarios, by combining SL and RL, the learning capability of the cerebellum can be enhanced for faster, smoother, and more complex motor and cognitive tasks such as real-time motor control and online learning (14, 53–55).
Our model still has several limitations. First, we limited actions to two discrete ones. In our model, PCs directly inhibited but indirectly disinhibited each other via BCs to form a single dent. In contrast, in the previous spiking actor-critic framework (21), actor neurons directly inhibit and excite each other to form a single bump and employ population vector coding. Therefore, for a broader action space, more precise parameter adjustment of the intra-layer synaptic connections of the actor in our model would be necessary to implement population vector coding in PCs. Another limitation is that the CF activity was modeled as a spike train. To represent negative rewards, only the firing rate of the CF was controlled in each task, within a biologically plausible range. This limitation may impair the learning performance of our model in more complex RL tasks. However, recent studies (56–58) have shown that CF signals can carry graded information. Adopting such mechanisms might enhance the learning performance of our model in those tasks.
If both the cerebellum and the basal ganglia perform RL, how do these regions cooperate for learning? One possible interpretation is hierarchical RL (HRL) (18) as discussed in our previous research (10). HRL considers two RL machines organized hierarchically. The higher RL machine breaks down a complex task into simpler, smaller subtasks. The lower RL machine tries to solve these subtasks one by one. Eventually, HRL involves creating a hierarchy of policies that makes the learning process more efficient and scalable (18, 59, 60). If the basal ganglia and the cerebellum cooperate to perform HRL, the resulting network would demonstrate powerful learning capabilities.
To our knowledge, our model is the first to implement general cerebellar RL with spiking neurons. Our research provides an implementation that supports cerebellar RL theory and challenges the traditional view of the cerebellum as primarily an SL machine. Furthermore, our interpretation of a potential cooperative role of the basal ganglia and the cerebellum as HRL sheds light on how multiple brain regions act together synergistically as a whole.
Materials and methods
Implementations
First, we discretized the state space and mapped it onto a PF plane with a task-specific grid size. A tile positioned at (x, y) on the PF plane had its own responsible state as follows:
(16)
(17)
where x and y are the axes of the PF plane, and the other two quantities represent the ranges of state x and y, respectively. When the agent observed a state, PFs on the tile at (x, y) fired as an inhomogeneous Poisson spike generator with a firing rate defined as follows:
(18)
where the two parameters are a scaling factor of the firing rate and a standard deviation, respectively. To maintain the total activity of all PFs when the agent is at the edge of the state space, we added "margin" tiles as an outer frame. Although we modeled PFs as Poisson spike generators for simplicity, more realistic implementations have been reported (61).
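The sketch below illustrates this encoding in C++; the tile's preferred state, peak rate, and tuning width are placeholders, and the per-step Bernoulli draw is a standard approximation of an inhomogeneous Poisson process.

```cpp
// Illustrative PF state encoding (placeholder names and parameters). Each
// tile has a preferred (x, y) state; its PFs fire as Poisson processes at a
// rate that decreases with the distance between the observed state and the
// tile's preferred state.
#include <cmath>
#include <random>

struct PFTile { double prefX, prefY; };  // preferred (normalized) state

double tileRateHz(const PFTile &t, double sx, double sy,
                  double peakHz, double sigma) {
    const double dx = sx - t.prefX, dy = sy - t.prefY;
    return peakHz * std::exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
}

bool pfSpike(double rateHz, double dtMs, std::mt19937 &rng) {
    std::bernoulli_distribution fire(rateHz * dtMs * 1e-3);
    return fire(rng);
}
```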
We used a leaky integrate-and-fire (LIF) model for SCs, BCs, and PCs. The LIF model was implemented as follows.
(19)
(20)
(21)
where , , t, , , , , and represent the membrane capacitance, the membrane potential of the ith neuron, time, the leak conductance, the resting potential, the synaptic current to the ith neuron, the external input current, and the noise current to the ith neuron, respectively. When the membrane potential of the ith neuron reaches the threshold potential , that time is denoted as , and the neuron fires the fth spike and resets its membrane potential to the reset potential . A synaptic current was defined as follows:
(22)
(23)
where c is a synapse label, , , , and are the scaling factor, synaptic weight between the jth presynaptic neuron and the ith postsynaptic neuron, decay time constant, and spike train of the jth presynaptic neuron, respectively. Whether it is an inhibitory or excitatory current is determined by the sign of the scaling factor. Noise current was modeled as follows:
(24)
where is a time constant, and ζ is an amplitude factor. is a uniform random variable in the range . Parameters of all neurons were set as listed in Table 1, and parameters of all synapses were set as listed in Table 2.
Table 1.
Neuron-specific parameters.
| Type | [pF] | [mS] | [mV] | [mV] | [mV] | [pA] | ζ [pA] |
|---|---|---|---|---|---|---|---|
| SC | 107 | 2.32 | | | | 30.0 | 5.0 |
| BC | 107 | 2.32 | | | | 30.0 | 1.0 |
| PC | 107 | 2.32 | | | | 30.0 | 5.0 |
These parameters were adjusted from those used in our previous model (15).
Table 2.
Synaptic parameters for each connection type.
| Connection type | Scaling factor | Weight | Decay time constant (ms) | Probability |
|---|---|---|---|---|
| PF–SC | 200.0 | – | 8.3 | 0.1 |
| PF–BC | 10.0 | 1.0 | 8.3 | 0.5 |
| PF–PC | 5.0 | – | 8.3 | 0.5 |
| SC–PC | | 0.03 | 10.0 | 0.5 |
| BC–BC | | 1.0 | 10.0 | 0.3 |
| BC–PC | | 1.0 | 10.0 | 0.3 |
| PC–BC | | 1.0 | 10.0 | 0.8 |
| PC–PC | | 1.0 | 5.0 | 1.0 |
A hyphen in a weight cell means it will be modified because of synaptic plasticity.
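For reference, the sketch below shows a forward-Euler step of such an LIF neuron with an exponentially decaying synaptic drive and a bounded noise current. It is an illustrative reading of the description above: the names and the structure of the noise update are our assumptions, while the actual parameters are those listed in Tables 1 and 2.

```cpp
// Illustrative forward-Euler LIF step with exponential synapses and noise
// (placeholder names; parameters correspond to those listed in Tables 1-2).
#include <random>

struct LIF {
    double Cm, gLeak, Eleak, theta, vReset, Iext, zeta;  // neuron parameters
    double v, Inoise = 0.0;
};

bool stepLIF(LIF &n, double Isyn, double dtMs, double tauNoiseMs,
             std::mt19937 &rng) {
    std::uniform_real_distribution<double> u(-1.0, 1.0);
    // Noise current relaxes toward zero and is driven by uniform noise.
    n.Inoise += dtMs * (-n.Inoise + n.zeta * u(rng)) / tauNoiseMs;
    // Membrane potential: leak plus synaptic, external, and noise currents.
    n.v += dtMs * (-n.gLeak * (n.v - n.Eleak) + Isyn + n.Iext + n.Inoise) / n.Cm;
    if (n.v >= n.theta) { n.v = n.vReset; return true; }  // spike and reset
    return false;
}

// Exponentially decaying synaptic drive from one presynaptic connection type.
struct ExpSynapse {
    double g, tauMs;     // scaling factor and decay time constant
    double trace = 0.0;  // filtered, weighted presynaptic spike train
    void step(double dtMs, double weightedSpikes) {
        trace += dtMs * (-trace / tauMs) + weightedSpikes;
    }
    double current() const { return g * trace; }
};
```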
As mentioned above, PFs were distributed on the 2D grid plane, and the grid size was determined by each task. Each tile contained 100 PFs. One SC was prepared per tile. Each SC could receive excitatory inputs from PFs located in tiles within a local neighborhood of its own position. The number of BCs was the same as that of SCs, and the number of PCs was 20. Both BCs and PCs could receive excitatory PF inputs from all tiles. The BC–BC, BC–PC, PC–BC, and PC–PC connections were constructed based on the grouping manner described in the Spike-based implementation of cerebellum-style RL section. The actual synapses were formed stochastically (Table 2). In this study, we divided BCs and PCs into two groups for all tasks. Additionally, for the sake of simplicity, we assumed that all CFs synchronized their activity so that our model had a single CF.
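The group-structured inhibitory wiring can be sketched as below; the modulo-based group assignment and the function name are illustrative assumptions, whereas the connection probabilities are those listed in Table 2.

```cpp
// Illustrative stochastic construction of group-structured connections
// (placeholder group assignment via modulo). BCs inhibit PCs of the same
// group and BCs of other groups; PCs inhibit BCs of the same group and PCs
// of other groups; each candidate synapse is drawn with the probability
// listed in Table 2.
#include <random>
#include <vector>

struct Conn { int pre, post; };

std::vector<Conn> connect(int nPre, int nPost, int nGroups, bool sameGroup,
                          double prob, std::mt19937 &rng) {
    std::bernoulli_distribution make(prob);
    std::vector<Conn> conns;
    for (int i = 0; i < nPre; ++i)
        for (int j = 0; j < nPost; ++j) {
            const bool same = (i % nGroups) == (j % nGroups);
            if (same == sameGroup && make(rng)) conns.push_back({i, j});
        }
    return conns;
}
// Example: BC->PC within the same group with probability 0.3, and PC->PC
// across different groups with probability 1.0 (cf. Table 2).
```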
Synaptic plasticity was applied to PF–SC and PF–PC synapses. The details of the weight-update rules are described in (14) and (15).
Finally, the outcomes of actions were generated by DCN cells, prepared in equal number to the PCs. Each DCN cell was inhibited by its counterpart PC. This configuration means that the activity of each DCN cell was modulated by the corresponding PC activity defined by Equation 8, as follows:
(25)
where the symbols represent the decay time constant, the baseline activity of the DCN cell, and a scaling factor ν, respectively.
Since PCs in group G_α represented the avoidance of action α and inhibited the DCN cells of the corresponding group, the activity of that DCN group represented the preference for action α as follows:
(26)  Q_α(t) = (1/M_α) Σ_{i∈D_α} d_i(t),
where M_α is the number of DCN cells in the group D_α corresponding to action α, and d_i(t) is the activity of the ith DCN cell in that group. Therefore, the most activated DCN group represented the most preferred action, so the final action was chosen as follows:
(27)  a(t) = argmax_{α∈A} Q_α(t).
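An illustrative C++ rendering of this selection stage is shown below; the leaky DCN dynamics and all names are assumptions that merely restate the qualitative description above (PC activity suppresses the paired DCN cell, and the most active DCN group wins).

```cpp
// Illustrative DCN-based action selection (placeholder names and dynamics).
// Each DCN cell relaxes toward a baseline and is suppressed by its paired PC;
// the action whose DCN group has the highest mean activity is selected.
#include <vector>

struct DCNCell {
    double act, baseline, tauMs, nu;
    void step(double dtMs, double pcActivity) {
        act += dtMs * ((baseline - act) / tauMs) - nu * pcActivity * dtMs;
        if (act < 0.0) act = 0.0;
    }
};

int selectAction(const std::vector<std::vector<DCNCell>> &groups) {
    int best = 0;
    double bestMean = -1.0;
    for (int a = 0; a < static_cast<int>(groups.size()); ++a) {
        double sum = 0.0;
        for (const auto &c : groups[a]) sum += c.act;
        const double mean = sum / static_cast<double>(groups[a].size());
        if (mean > bestMean) { bestMean = mean; best = a; }
    }
    return best;  // index of the most preferred action
}
```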
Task definitions
Pre- and postepisode processes
As a pre-episode process, at the start of each episode, a free run of 200 ms was carried out to discard initial transient activity that could affect learning. Then, at the end of each episode, the simulation continued for 100 ms with PF firing stopped, in order to update the synaptic weights. These pre- and postepisode processes were performed in all tasks.
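The episode structure can be summarized by the following C++ outline; the callback names stand in for the corresponding model and environment routines and are not taken from the released code.

```cpp
// Illustrative outline of one episode with the pre- and postepisode
// processes described above (callbacks stand in for the model routines).
#include <functional>

void runEpisode(double dtMs,
                const std::function<void(double, bool)> &stepNetwork,
                const std::function<bool(double)> &stepEnvAndCheckDone,
                const std::function<void()> &applyWeightUpdates) {
    // Pre-episode: 200 ms free run to discard initial transient activity.
    for (double t = 0.0; t < 200.0; t += dtMs) stepNetwork(dtMs, /*pfOn=*/true);
    // Episode proper: run until the environment signals termination.
    bool done = false;
    while (!done) {
        done = stepEnvAndCheckDone(dtMs);
        stepNetwork(dtMs, /*pfOn=*/true);
    }
    // Postepisode: 100 ms with PF firing stopped, then update the weights.
    for (double t = 0.0; t < 100.0; t += dtMs) stepNetwork(dtMs, /*pfOn=*/false);
    applyWeightUpdates();
}
```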
Mountain car task
The dynamics of the car were defined by the following differential equations:
(28)
(29)
where the parameters are a force amplitude, an action giving the force direction, and the gravity, respectively. The observation spaces of the velocity and position are bounded. When the car collided with the wall, its velocity was immediately reset to 0, without any negative reward associated with the collision. In the original settings, an agent receives a constant negative reward at every step. However, the maximum firing frequency of the CF, which is approximately 10 Hz (62), is not sufficient to represent this constant negative reward. Therefore, in this study, we modeled a new negative reward function for the CF: the CF fired when the car was far from the goal and its speed was slow. Thus, the firing rate of the CF was defined as follows:
(30)
where the parameters are the maximum firing rate of 5 Hz, the goal position, the left-side wall position, and the speed limit, respectively. The scaling constant in (1) was set to a task-specific negative value.
The car starts from the bottom of the valley with a velocity of 0 in every episode. The action represents the direction in which the agent pushes the car. An episode ends when the car reaches the goal or when the elapsed time exceeds 1,000 ms, with no special negative reward given.
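The environment can be sketched as follows; the cos(3x) gravity term follows the standard formulation of the task, the force, gravity, and bound constants are placeholders, and the CF-rate function only illustrates the idea that the CF fires more when the car is far from the goal and slow (the actual function is Equation 30).

```cpp
// Illustrative continuous-time mountain car environment (placeholder
// constants). The CF rate heuristic increases when the car is far from the
// goal and its speed is low, up to the maximum CF rate.
#include <algorithm>
#include <cmath>

struct Car { double x, v; };

void stepCar(Car &c, int action /* -1: push left, +1: push right */,
             double dt, double force, double gravity,
             double xMin, double xMax, double vMax) {
    c.v += dt * (action * force - gravity * std::cos(3.0 * c.x));
    c.v = std::clamp(c.v, -vMax, vMax);
    c.x += dt * c.v;
    if (c.x <= xMin) { c.x = xMin; c.v = 0.0; }  // wall: velocity reset to 0
    c.x = std::min(c.x, xMax);
}

double cfRateHz(const Car &c, double xGoal, double xWall,
                double vLimit, double maxRateHz /* 5 Hz */) {
    const double farFromGoal =
        std::clamp((xGoal - c.x) / (xGoal - xWall), 0.0, 1.0);
    const double slow = std::max(0.0, 1.0 - std::fabs(c.v) / vLimit);
    return maxRateHz * farFromGoal * slow;
}
```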
The grid size of the PF plane was . The parameters of the state value function, as defined in (2), were and . The reward discount time constant , the decay time constant of eligibility trace , and the decay time constant of the window function were set to , , and , respectively. Initial synaptic weights of PF–SC and PF–PC were 0.05 and 0.8, respectively. Both learning rates for the critic and the agent were 0.4.
We conducted 10 runs of 1,000 episodes each. We calculated the changes in success rate for each run using a moving average of results (success = 1, failure = 0) with a window size of 10 episodes, and then calculated the average over 10 runs (Fig. 2D).
Delay eyeblink conditioning task
The dynamics of the eyelid was modeled as follows:
(31)
where the variable represents the action the agent takes at time t. The CF was assumed to fire once when the air puff was applied to the eye and the eyelid was not closed sufficiently. Thus, the spike train of the CF was defined as follows:
(32)
where the parameter represents the US onset time. The scaling constant in (1) was set to a task-specific negative value.
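A compact C++ sketch of this environment is shown below; the eyelid speed constant and member names are our assumptions, while the 0.1 openness threshold and the 500 ms US onset follow the task description above.

```cpp
// Illustrative delay eyeblink conditioning environment (placeholder names;
// the eyelid speed constant is assumed). The CF fires once, at US onset,
// only if the eyelid is still open (position > 0.1).
struct EyeblinkEnv {
    double eyelid = 1.0;       // 1: fully open, 0: fully closed
    double tMs = 0.0;          // time since CS onset
    double usOnsetMs = 500.0;  // US (air puff) onset

    // action: +1 opens the eyelid, -1 closes it; returns true if the CF fires.
    bool step(int action, double dtMs) {
        const double speedPerMs = 0.005;  // assumed closing/opening speed
        eyelid += action * speedPerMs * dtMs;
        if (eyelid < 0.0) eyelid = 0.0;
        if (eyelid > 1.0) eyelid = 1.0;
        const bool atUS = (tMs < usOnsetMs) && (tMs + dtMs >= usOnsetMs);
        tMs += dtMs;
        return atUS && eyelid > 0.1;      // punishment: CF spike
    }
};
```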
The grid size of the PF plane was . The parameters of the state value function, as defined in (2), were and . The reward discount time constant, the decay time constant of the eligibility trace, and the decay time constant of the window function were set to 100, 20, and , respectively. To prioritize the "open" action, we initially set the PF–PC synaptic weights of the anticlose group to 0.8, while those of the antiopen group were set to 0.7. We conducted 10 runs of the simulation for 500 episodes with separate learning rates for the critic and the actor. We calculated the changes in success rate for each run using a moving average of results (success = 1, failure = 0) with a window size of 10 episodes, and then calculated the average over 10 runs (Fig. 3D).
The whole code of our model and all environments were written in C++. All ordinary differential equations were solved numerically with a forward Euler method with a temporal resolution of 1.0 ms.
Acknowledgments
We would like to thank Profs. Kazuo Kitamura, Jun Igarashi, Shogo Ohmae, and Kenji Doya for their fruitful discussions and helpful insights.
Contributor Information
Rin Kuriyama, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo 182-8585, Japan.
Hideyuki Yoshimura, Neural Computation Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa 904-0495, Japan.
Tadashi Yamazaki, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo 182-8585, Japan.
Supplementary Material
Supplementary material is available at PNAS Nexus online.
Funding
This work was supported by Japan Society for the Promotion of Science KAKENHI Grant numbers JP22KJ1372 and JP22H05161. Part of this study was supported by the Ministry of Education, Culture, Sports, Science and Technology Program for Promoting Researches on the Supercomputer Fugaku hp220162.
Author Contributions
T.Y. conceived and designed the research. R.K. and H.Y. contributed the formulation and the implementation of the algorithms. R.K. wrote the code, performed simulation, and analyzed the data. R.K. and T.Y. wrote the original draft. R.K., H.Y., and T.Y. discussed the draft and revised it. All authors contributed to the article and approved the submitted version.
Preprints
This manuscript was posted on a preprint: https://doi.org/10.1101/2024.06.23.600300.
Data Availability
The code used in this study, including the implementation of the cerebellar model and the environments, is available at the author’s GitHub repository: https://github.com/Rkuriyama/CeRL.
References
- 1. Marr D. 1969. A theory of cerebellar cortex. J Physiol. 202(2):437–470.
- 2. Ito M. 1970. Neurophysiological aspects of the cerebellar motor control system. Int J Neurol. 7(2):162–176.
- 3. Albus JS. 1971. A theory of cerebellar function. Math Biosci. 10(1-2):25–61.
- 4. Rosenblatt F. 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev. 65(6):386–408.
- 5. Ito M. 2001. Cerebellar long-term depression: characterization, signal transduction, and functional roles. Physiol Rev. 81(3):1143–1195.
- 6. Ito M, Kano M. 1982. Long-lasting depression of parallel fiber-Purkinje cell transmission induced by conjunctive stimulation of parallel fibers and climbing fibers in the cerebellar cortex. Neurosci Lett. 33(3):253–258.
- 7. D'Angelo E. 2014. The organization of plasticity in the cerebellar cortex: from synapses to control. Vol. 210. Progress in brain research. Elsevier, 31–58.
- 8. Fujita M. 2016. A theory of cerebellar cortex and adaptive motor control based on two types of universal function approximation capability. Neural Netw. 75(1):173–196.
- 9. Raymond JL, Medina JF. 2018. Computational principles of supervised learning in the cerebellum. Annu Rev Neurosci. 41(1):233–253.
- 10. Yamazaki T. 2021. Evolution of the Marr-Albus-Ito model. In: Mizusawa H, Kakei S, editors. Cerebellum as a CNS hub. Springer. p. 239–255.
- 11. Medina JF, Garcia KS, Nores WL, Taylor NM, Mauk MD. 2000. Timing mechanisms in the cerebellum: testing predictions of a large-scale computer simulation. J Neurosci. 20(14):5516–5525.
- 12. Gosui M, Yamazaki T. 2016. Real-world-time simulation of memory consolidation in a large-scale cerebellar model. Front Neuroanat. 10(24):21.
- 13. Hausknecht M, Li W-K, Mauk M, Stone P. 2017. Machine learning capabilities of a simulated cerebellum. IEEE Trans Neural Netw Learn Syst. 28(3):510–522.
- 14. Abadía I, Naveros F, Ros E, Carrillo RR, Luque NR. 2021. A cerebellar-based solution to the nondeterministic time delay problem in robotic control. Sci Robot. 6(58):eabf2756.
- 15. Kuriyama R, Casellato C, D'Angelo E, Yamazaki T. 2021. Real-time simulation of a cerebellar scaffold model on graphics processing units. Front Cell Neurosci. 15:623552.
- 16. Shinji Y, Okuno H, Hirata Y. 2024. Artificial cerebellum on FPGA: realistic real-time cerebellar spiking neural network model capable of real-world adaptive motor control. Front Neurosci. 18:1220908.
- 17. Geminiani A, et al. 2024. Mesoscale simulations predict the role of synergistic cerebellar plasticity during classical eyeblink conditioning. PLoS Comput Biol. 20(4):e1011277.
- 18. Sutton RS, Barto AG. 2018. Reinforcement learning: an introduction. 2nd ed. Adaptive computation and machine learning series. Cambridge (MA): The MIT Press.
- 19. Swain RA, Kerr AL, Thompson RF. 2011. The cerebellum: a neural system for the study of reinforcement learning. Front Behav Neurosci. 5:8.
- 20. Yamazaki T, Lennon W. 2019. Revisiting a theory of cerebellar cortex. Neurosci Res. 148(5):1–8.
- 21. Frémaux N, Sprekeler H, Gerstner W. 2013. Reinforcement learning using a continuous time actor-critic framework with spiking neurons. PLoS Comput Biol. 9(4):e1003024.
- 22. Moore AW. 1990. Efficient memory-based learning for robot control [PhD thesis]. University of Cambridge.
- 23. Jirenhed D-A, Bengtsson F, Hesslow G. 2007. Acquisition, extinction, and reacquisition of a cerebellar cortical memory trace. J Neurosci. 27(10):2493–2502.
- 24. Dizon MJ, Khodakhah K. 2011. The role of interneurons in shaping Purkinje cell responses in the cerebellar cortex. J Neurosci. 31(29):10463–10473.
- 25. Brown AM, et al. 2019. Molecular layer interneurons shape the spike activity of cerebellar Purkinje cells. Sci Rep. 9(1):1742.
- 26. Albus JS. 1975. A new approach to manipulator control: the cerebellar model articulation controller (CMAC). J Dyn Syst Meas Control. 97(3):220–227.
- 27. Ito M. 1984. The cerebellum and neural control. New York: Raven Press.
- 28. Doya K. 2000. Reinforcement learning in continuous time and space. Neural Comput. 12(1):219–245.
- 29. Jörntell H, Ekerot C-F. 2002. Reciprocal bidirectional plasticity of parallel fiber receptive fields in cerebellar Purkinje cells and their afferent interneurons. Neuron. 34(5):797–806.
- 30. Jörntell H, Ekerot C-F. 2003. Receptive field plasticity profoundly alters the cutaneous parallel fiber synaptic input to cerebellar interneurons in vivo. J Neurosci. 23(29):9620–9631.
- 31. Jirenhed D-A, Bengtsson F, Jörntell H. 2013. Parallel fiber and climbing fiber responses in rat cerebellar cortical neurons in vivo. Front Syst Neurosci. 7:16.
- 32. Liu SJ. 2013. Stellate cells: synaptic processing and plasticity. In: Manto M, Gruol DL, Schmahmann JD, Koibuchi N, Rossi F, editors. Handbook of the cerebellum and cerebellar disorders. Vol. 2. Springer. p. 809–828.
- 33. Rowan MJM, et al. 2018. Graded control of climbing-fiber-mediated plasticity and learning by inhibition in the cerebellum. Neuron. 99(5):999–1015.e6.
- 34. Bonnan A, Rowan MJM, Baker CA, McLean Bolton M, Christie JM. 2021. Autonomous Purkinje cell activation instructs bidirectional motor learning through evoked dendritic calcium signaling. Nat Commun. 12(1):2153.
- 35. Silva NT, Ramírez-Buriticá J, Pritchett DL, Carey MR. 2024. Climbing fibers provide essential instructive signals for associative learning. Nat Neurosci. 27(5):940–951.
- 36. Towers M, et al. 2023. Gymnasium. Zenodo. https://zenodo.org/record/8127025.
- 37. Ten Brinke MM, et al. 2015. Evolving models of pavlovian conditioning: cerebellar cortical dynamics in awake behaving mice. Cell Rep. 13(9):1977–1988.
- 38. Ma M, et al. 2020. Molecular layer interneurons in the cerebellum encode for valence in associative learning. Nat Commun. 11(1):4217.
- 39. Boele H-J, et al. 2018. Impact of parallel fiber to Purkinje cell long-term depression is unmasked in absence of inhibitory input. Sci Adv. 4(10):eaas9426.
- 40. Ohmae S, Medina JF. 2015. Climbing fibers encode a temporal-difference prediction error during cerebellar learning in mice. Nat Neurosci. 18(12):1798–1803.
- 41. Heffley W, et al. 2018. Coordinated cerebellar climbing fiber activity signals learned sensorimotor predictions. Nat Neurosci. 21(10):1431–1441.
- 42. Heffley W, Hull C. 2019. Classical conditioning drives learned reward prediction signals in climbing fibers across the lateral cerebellum. Elife. 8:e46764.
- 43. Kostadinov D, Beau M, Blanco-Pozo M, Häusser M. 2019. Predictive and reactive reward signals conveyed by climbing fiber inputs to cerebellar Purkinje cells. Nat Neurosci. 22(6):950–962.
- 44. Ikezoe K, et al. 2023. Cerebellar climbing fibers multiplex movement and reward signals during a voluntary movement task in mice. Commun Biol. 6(1):924.
- 45. Hoang H, et al. 2023. Dynamic organization of cerebellar climbing fiber response and synchrony in multiple functional components reduces dimensions for reinforcement learning. Elife. 12:e86340.
- 46. Hoang H, et al. 2025. Predictive reward-prediction errors of climbing fiber inputs integrate modular reinforcement learning with supervised learning. PLoS Comput Biol. 21(3):e1012899.
- 47. Wulff P, et al. 2009. Synaptic inhibition of Purkinje cells mediates consolidation of vestibulo-cerebellar motor learning. Nat Neurosci. 12(8):1042–1049.
- 48. Lee KH, et al. 2015. Circuit mechanisms underlying motor memory formation in the cerebellum. Neuron. 86(2):529–540.
- 49. Hirano T, Yamazaki Y, Nakamura Y. 2016. LTD, RP, and motor learning. Cerebellum. 15(1):51–53.
- 50. Jelitai M, Puggioni P, Ishikawa T, Rinaldi A, Duguid I. 2016. Dendritic excitation–inhibition balance shapes cerebellar output during motor behaviour. Nat Commun. 7(1):13722.
- 51. Johansson F, Jirenhed D-A, Rasmussen A, Zucca R, Hesslow G. 2014. Memory trace and timing mechanism localized to cerebellar Purkinje cells. Proc Natl Acad Sci U S A. 111(41):14930–14934.
- 52. Wagner MJ, Kim TH, Savall J, Schnitzer MJ, Luo L. 2017. Cerebellar granule cells encode the expectation of reward. Nature. 544(7648):96–100.
- 53. Antonietti A, et al. 2022. Brain-inspired spiking neural network controller for a neurorobotic whisker system. Front Neurorobot. 16:817948.
- 54. Kuniyoshi Y, et al. 2023. Embodied bidirectional simulation of a spiking cortico-basal ganglia-cerebellar-thalamic brain model and a mouse musculoskeletal body model distributed across computers including the supercomputer Fugaku. Front Neurorobot. 17:1269848.
- 55. Mitsuhashi T, Kuniyoshi Y, Ikezoe K, Kitamura K, Yamazaki T. 2026. A spiking network model of the cerebellum for predicting movements with diverse complex spikes. Neural Netw. 193(4):107962.
- 56. Najafi F, Medina JF. 2013. Beyond "all-or-nothing" climbing fibers: graded representation of teaching signals in Purkinje cells. Front Neural Circuits. 7:115.
- 57. Zang Y, De Schutter E. 2019. Climbing fibers provide graded error signals in cerebellar learning. Front Syst Neurosci. 13:46.
- 58. Rasmussen A. 2020. Graded error signals in eyeblink conditioning. Neurobiol Learn Mem. 170(6):107023.
- 59. Morimoto J, Doya K. 2001. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Rob Auton Syst. 36(1):37–51.
- 60. Kulkarni TD, Narasimhan KR, Saeedi A, Tenenbaum JB. 2016. Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In: NIPS'16. Red Hook (NY): Curran Associates Inc. p. 3682–3690.
- 61. Rössert C, Solinas S, D'Angelo E, Dean P, Porrill J. 2014. Model cerebellar granule cells can faithfully transmit modulated firing rate signals. Front Cell Neurosci. 8:304. doi: 10.3389/fncel.2014.00304.
- 62. De Zeeuw CI, et al. 1998. Microcircuitry and function of the inferior olive. Trends Neurosci. 21(9):391–400.
- 63. Szapiro G, Barbour B. 2007. Multiple climbing fibers signal to molecular layer interneurons exclusively via glutamate spillover. Nat Neurosci. 10(6):735–742.