Nature Communications. 2025 Dec 11;16:11081. doi: 10.1038/s41467-025-66081-4

Prefrontal cortex temporally multiplexes slow and fast dynamics in value learning and memory

Seyed-Reza Hashemirad 1, Mojtaba Abbaszadeh 1, Ali Ghazizadeh 2
PMCID: PMC12698692  PMID: 41381499

Abstract

Balancing stability and flexibility is a fundamental challenge in value-based learning: how does the brain maintain long-term value memories while adapting to new environmental contingencies? To address this, we propose a reinforcement learning model composed of two distinct processes with fast and slow dynamics for updating and forgetting object values. Using a combined theoretical and experimental approach in male macaque monkeys, we validate a key behavioral prediction of this two-rate system—spontaneous recovery of prior value memories following value reversal. At the neural level, we show that single neurons in the ventrolateral prefrontal cortex (vlPFC) temporally multiplex these dynamics, with distinct firing components reflecting fast and slow learning processes. Together, these findings suggest that reward learning and memory are supported by a two-rate system that enables both flexibility and stability, and identify the vlPFC as a critical neural substrate for this mechanism.

Subject terms: Decision, Cognitive control, Reward, Learning and memory


Neural mechanisms underlying interplay between stable and flexible value encoding are not fully understood. Here authors propose a two-rate reinforcement learning model where fast and slow processes resolve the trade-off between memory stability and adaptability. Prefrontal neurons multiplex both dynamics, providing a neural substrate for their interaction.

Introduction

Foods, currencies, friends, and romantic partners often maintain stable values over time, but object values can change, prompting one to reconsider old reward associations. Consequently, while the stable representation of values is shown to be essential for skillful interactions with objects, such as rapid orienting and ignoring distractors1,2, maintaining flexibility to adapt to new values would be crucial for survival. Nevertheless, the requirements for flexibility vs. stability are often at odds. On the one hand, the system must learn fast and forget fast if it wants to adapt flexibly to environmental and contextual changes. On the other hand, useful memories require slow forgetting to maintain learned values and slow learning to retain only associations that are consistent in time3–5. We refer to this tension between forming stable value memories and new learning, which requires flexibility, as the stability vs. flexibility dilemma. We posit that such tension challenges our current understanding of reinforcement learning (RL) mechanisms in which a single process is assumed to learn and maintain the value associations.

Previous studies have suggested distinct circuits underlying goal-directed (flexible) and habitual (stable) object-reward associations, especially involving the basal ganglia6–8. Specifically, within the macaque striatum, the caudate head (CDh) has been shown to mediate adaptability by fast value learning and fast forgetting, while the caudate tail (CDt) mediates stability by slow value learning and slow value forgetting9. CDh and CDt connect with relatively segregated cortical and subcortical areas. They are even innervated by separate groups of dopamine neurons in substantia nigra pars compacta (SNc), forming separate loops specialized in flexible vs. stable reward learning and memory10,11. However, the interaction between these two systems, particularly when conflicts between past and present reward associations arise, remains unknown. Among various cortical areas, the ventrolateral prefrontal cortex (vlPFC) is a potential candidate for handling the balance between stability and flexibility in value learning. vlPFC has long been recognized for its role in encoding short-term value associated with objects12–14 and, more recently, has been identified for its involvement in reward probability updating and reversal learning tasks15,16. Additionally, recent findings have uncovered its involvement in representing long-term memory of object values17. In addition to its multifaceted function, vlPFC is among the few areas that take part in both cortical-striatal-thalamocortical loops that control flexibility and stability in reward learning and memory18,19.

Thus, we hypothesized that this cortical area could be a natural hub for the interplay between stable and flexible value encoding. To address this hypothesis, single-unit responses and local field potential in vlPFC were recorded while macaque monkeys performed a paradigm in which stable object values underwent an abrupt reversal in their reward associations. Interestingly, we observed that not only was the behavioral output consistent with a two-rate model with slow and fast dynamics, but that both dynamics coexisted in the early and late neural responses within individual neurons in vlPFC, an effect that we refer to as the temporal multiplexing of the two processes.

Results

A value reversal paradigm was used to study the interaction between stable and flexible object reward associations. Two macaque monkeys (P and H) were initially trained to associate abstract fractal objects with preassigned reward contingencies for more than 5 days to create stable object-reward associations9,17,20. Fractals were trained in sets of eight, half of which were associated with high and the other half with low juice amount as reward (good and bad objects, respectively; Fig. 1a). The value training task consisted of forced trials, in which monkeys made a saccade to a single fractal object and received its corresponding reward, and interleaved choice trials between a good and a bad fractal to track the degree of value learning (Fig. 1b). The average choice performance was ~83% at the end of the first day and increased to ~98.5% after five sessions, indicating the emergence of robust, stable learning consistent with previous reports9,17,21. Subsequently, for a given set, the object reward associations underwent an abrupt reversal in one session such that now the good objects received low reward and bad objects received high reward (Fig. 1d). The monkeys’ choice performance following the sudden value reversal was initially poor but quickly adapted, such that they chose the new high-value objects after a few trials despite the over-training with the original contingencies (Fig. 1g). Then, the value reversal was followed by a period during which the system was allowed to forget learned values by the passage of time and with no additional value learning for those objects (Fig. 1d). This epoch is referred to as the error-free period, borrowing from a technique previously used to study motor learning and memory22. During this period, subjects had no expectation of reward, resulting in a zero reward prediction error and thus no new learning. 
This “error-free” epoch allowed us to isolate the system’s behavior in a forgetting-only regime, which best differentiated the predictions of single-rate versus two-rate models. Note that in this study and all figures, we refer to objects as good and bad based on their original overtrained associations with reward before value reversal.

Fig. 1. Value reversal paradigm, model predictions, and behavioral results.

a Sets of eight fractal objects: four high-valued (“good”) and four low-valued (“bad”). b Forced-choice (FC) value training: In forced trials a single fractal (good or bad) was associated with reward. In choice trials, animals selected between a good and bad object. c Value probes included passive viewing (PV) for neural recordings and free viewing (FV) for behavior. In PV, objects were shown sequentially in the neuron’s receptive field, and in FV, monkeys freely viewed good and bad objects presented simultaneously, both without reward. d Experimental design: Each fractal set was trained in FC task for >5 days, followed by PV and FV probes before reversal. Value reversal was performed via FC with reward contingencies reversed, then probed again with PV and FV immediately, 20 min, and 1 day later. e The task reward schedule, including initial value learning, value reversal and the error-free period. f Simulations of single-rate (left) and two-rate (right) reinforcement learning models under (e); two-rate model predicts spontaneous recovery during the error-free period. g Average performance of monkey P (top) and Monkey H (bottom) in choice trials of value reversal task (gray) and predicted choice rate by the best single-rate (orange) and two-rate (green) models. Here and thereafter, error bars indicate the standard error of mean (SEM). h Viewing time difference (good–bad) in the FV task across the four value probes, along with prediction of the best single-rate (orange) and two-rate (green) models. N = 41 and 45 sessions for monkeys P and H, respectively. Here and thereafter, box plots show the median (black) and mean (red) lines, boxes = interquartile range (IQR; 25th–75th percentiles), whiskers = extreme values within 1.5 × IQR, and points outside the whisker as small black circles. Inset bars show AIC and BIC for the best single- and two-rate models. 
Two-sided paired t tests, FDR-corrected (q < 0.05); uncorrected p-values for before, immediately after, 20 min after, and day after reversal: Monkey P: 5.73 × 10⁻¹³, 0.0112, 0.528, 0.0175; Monkey H: 1.71 × 10⁻¹², 1.99 × 10⁻⁵, 0.162, 0.00075 ( = : main effect of memory periods, *p < 0.05, **p < 0.01, ***p < 0.001, here and after).

Single and two-rate reinforcement learning models

The value learning processes in our reversal task can be modeled using a modified RL framework. In this framework, the learned values are updated from trial “n” to trial “n+1” based on two key components: the previous trial’s value modulated by a forgetting rate, and a reward prediction error (RPE). The RPE is calculated as the difference between the actual reward received by the monkey in trial “n” and the current predicted value, and subsequently, RPE is scaled by a learning rate (Eq. 1). This modified model allows us to account for how learned values are adjusted over time, taking into consideration both retention and adaptation to new rewards.

$$V(n+1) = \alpha\,V(n) + \beta\,e(n) \tag{1}$$

$V(n)$: Net value output of the model on trial $n$

$r(n)$: Reward received in trial $n$

$e(n)$: RPE on trial $n$, $e(n) = r(n) - V(n)$

$\beta$: Learning rate

$\alpha$: Retention factor
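The single-rate update in Eq. 1 can be sketched in a few lines of Python. This is a minimal illustration, not the fitting code used in the study: the function names, the parameter values, and the convention of marking error-free trials with `None` (so that e(n) = 0 and only the retention term acts) are all assumptions made for the example.

```python
def single_rate_step(v, reward, alpha, beta):
    """One update of Eq. 1: V(n+1) = alpha * V(n) + beta * e(n)."""
    rpe = reward - v  # e(n) = r(n) - V(n)
    return alpha * v + beta * rpe


def simulate_single_rate(rewards, alpha, beta, v0=0.0):
    """Return the value trajectory for a reward schedule.

    A reward of None marks an error-free trial (assumed convention):
    the RPE is taken to be zero, so the value only decays by alpha.
    """
    values = [v0]
    for r in rewards:
        v = values[-1]
        if r is None:
            values.append(alpha * v)
        else:
            values.append(single_rate_step(v, r, alpha, beta))
    return values


# Illustrative (not fitted) parameters: 200 rewarded trials, then 300 error-free trials.
N_TRAIN, N_FREE = 200, 300
rewards = [1.0] * N_TRAIN + [None] * N_FREE
trajectory = simulate_single_rate(rewards, alpha=0.99, beta=0.1)

learned_value = trajectory[N_TRAIN]  # approaches beta / (1 - alpha + beta)
final_value = trajectory[-1]         # decays toward zero without RPE
```

With a retention factor below 1, the learned value settles at β / (1 − α + β) rather than at the reward itself, and it decays geometrically toward zero once the RPE term is switched off.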

Nevertheless, previous literature on object value learning has revealed two parallel systems in the basal ganglia, one that has a slow learning rate and slow forgetting, and the other that has a high forgetting and a poor memory but a fast learning rate, which is particularly suited for accommodating rapid changes in value such as reversals9. The two-rate model can be modeled by a simple extension of the standard single-rate model by having two processes with different learning and forgetting rates. Given that slow and fast processes in the basal ganglia converge on the downstream neurons in the superior colliculus (SC)23, we proposed the behavioral output to be a simple sum of slow and fast processes (see “Methods” for details).

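$$
\begin{aligned}
v_f(n+1) &= \alpha_f\,v_f(n) + \beta_f\,e(n) \\
v_s(n+1) &= \alpha_s\,v_s(n) + \beta_s\,e(n) \\
V(n) &= v_f(n) + v_s(n) \\
\alpha_f &< \alpha_s, \quad \beta_f > \beta_s
\end{aligned}
\tag{2}
$$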

The spontaneous recovery following reversal and the two-rate model

Figure 1e, f shows the simulations of the full experimental paradigm and predicted responses in single-rate and two-rate models. During initial training, both models learn the values of stimuli, and the error converges to zero. In response to value reversal (a sudden rise in error), both single-rate and two-rate models predict similar patterns of output reversal, attempting to track the sudden change in values and a sharp increase in RPE. Consistently, both the single-rate (Eq. 1) and two-rate models (Eq. 2) can reproduce the observed pattern of choice performance during the reversal task (Fig. 1g), although for monkey H the two-rate model fit better even after accounting for the number of parameters using the Akaike (AIC) and Bayesian (BIC) information criteria (best two-rate model: AIC = −80.2, BIC = −71.7; best single-rate model: AIC = −72.1, BIC = −66.7).

Nevertheless, during this brief reversal learning, the slow and fast processes in the two-rate model show a very different pattern from that of the single-rate model. The slow process fails to adapt to this sudden change due to its low learning rate and remains biased toward the initial learning. In contrast, the fast process rapidly responds and accommodates the value reversal in such a manner that the net value output shifts toward the reversed values by the end of the reversal session. As a result, two- and single-rate models can make very different predictions in the error-free period. During the error-free period, the single-rate model predicts decay of learned values to zero. In contrast, the two-rate model predicts spontaneous recovery from the reversed values toward the initial values previously learned (Fig. 1f). This is because the fast process forgets rapidly during the error-free period. Thus, the summed output of both processes reverts to the initial learned values kept by the slow forgetting process.
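The spontaneous recovery described above can be reproduced with a short simulation of the two-rate model (Eq. 2). The sketch below is illustrative only: the parameter values, the trial counts, the coding of reward as +1 before and −1 after reversal for an originally good object, and the use of `None` to mark error-free trials are assumptions, not the values fitted to the monkeys.

```python
def simulate_two_rate(rewards, a_f, b_f, a_s, b_s):
    """Simulate Eq. 2. A reward of None marks an error-free trial, where e(n) = 0."""
    v_f, v_s = 0.0, 0.0
    fast, slow, net = [v_f], [v_s], [v_f + v_s]
    for r in rewards:
        # The shared RPE is computed from the summed output V = v_f + v_s.
        e = 0.0 if r is None else r - (v_f + v_s)
        v_f = a_f * v_f + b_f * e  # fast process: learns and forgets quickly
        v_s = a_s * v_s + b_s * e  # slow process: learns and forgets slowly
        fast.append(v_f)
        slow.append(v_s)
        net.append(v_f + v_s)
    return fast, slow, net


# Assumed schedule: long training, brief reversal, then an error-free period.
N_TRAIN, N_REV, N_FREE = 1000, 30, 100
schedule = [1.0] * N_TRAIN + [-1.0] * N_REV + [None] * N_FREE

# Assumed parameters satisfying a_f < a_s and b_f > b_s.
fast, slow, net = simulate_two_rate(schedule, a_f=0.9, b_f=0.2, a_s=0.999, b_s=0.01)

after_training = net[N_TRAIN]         # positive: original value learned
after_reversal = net[N_TRAIN + N_REV] # negative: fast process dominates
after_error_free = net[-1]            # positive again: spontaneous recovery
slow_after_reversal = slow[N_TRAIN + N_REV]
```

In this toy run the slow process never changes sign during the brief reversal, so once the fast process has decayed in the error-free period the net output returns to the sign of the original association.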

To test this prediction, the output of the RL model was probed at four time points: before the reversal, immediately after, 20 min after, and a day after the reversal. Note that special care must be taken so that the value probes minimally affect the learned values and do not cause additional learning. Thus, choice trials similar to those used in value training and reversal are not suited for value probes since they engage the RPE term due to the reward expectation and delivery for each object. Instead, learned values were probed using gaze bias in a free-viewing (FV) task (Fig. 1c, see “Methods”). Previous work has shown that FV gaze bias is a reliable probe of learned object values24. Notably, the FV task does not engage the RPE term, as there is no direct reward associated with viewing either of the objects; thus, there is no reward expectation. Furthermore, gaze bias is shown to remain unchanged across consecutive FV trials, which would otherwise have been reduced if there had been any extinction25.

Figure 1h shows gaze bias in the FV task at the four time points for both monkeys, along with predictions of single- and two-rate models (see “Methods”). As expected, before the reversal, both monkeys showed a clear bias toward viewing good fractals (monkey H: t (44) = 9.69, P < 0.0001, Cohen’s d = 1.42; monkey P: t (40) = 10.42, P < 0.0001, Cohen’s d = 1.6). Following value reversal, both monkeys’ gaze bias switched toward bad fractals (the new good; monkey H: t (44) = 4.77, P < 0.0001, Cohen’s d = 0.7; monkey P: t (40) = 2.66, P = 0.01, Cohen’s d = 0.4). 20 min after reversal, both monkeys gradually forgot the reversal and viewed good and bad objects almost equally (monkey H: t (37) = 1.42, P = 0.16, Cohen’s d = 0.22; monkey P: t (40) = 0.63, P = 0.52, Cohen’s d = 0.1). Remarkably, a day after the reversal, monkeys’ gaze bias showed a spontaneous recovery toward viewing initially good fractals (monkey H: t (41) = 3.64, P = 0.0008, Cohen’s d = 0.55; monkey P: t (40) = 2.47, P = 0.017, Cohen’s d = 0.37).

The two-rate model outperforms the single-rate model in explaining gaze bias at the four time points even after accounting for the difference in the number of parameters (Fig. S1, best two-rate model AIC < −29.24, best two-rate model BIC < −39.06, best single-rate model AIC < −18.41, best single-rate model BIC < −22.09, Fig. 1h inset, see “Methods”). This is because the single-rate model fails to explain the spontaneous recovery; instead, it predicts no bias the day after value reversal.
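For readers unfamiliar with how AIC and BIC penalize extra parameters, the following sketch computes both criteria for a least-squares fit. It assumes Gaussian residuals, which gives AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n); the exact likelihood used in the study is described in its Methods and may differ. The observed and predicted numbers below are made up purely for illustration and are not data from the paper.

```python
import math


def aic_bic_least_squares(observed, predicted, n_params):
    """AIC and BIC under an assumed Gaussian-residual (least-squares) likelihood."""
    n = len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    fit_term = n * math.log(rss / n)  # goodness-of-fit part shared by both criteria
    aic = fit_term + 2 * n_params
    bic = fit_term + n_params * math.log(n)
    return aic, bic


# Hypothetical gaze-bias values at the four probes (before, immediately after,
# 20 min after, day after reversal); invented for this example only.
observed = [0.30, -0.10, 0.02, 0.12]
single_rate_pred = [0.30, -0.10, -0.03, 0.00]  # no recovery on the last probe
two_rate_pred = [0.29, -0.09, 0.03, 0.11]      # captures the recovery

aic_single, bic_single = aic_bic_least_squares(observed, single_rate_pred, n_params=2)
aic_two, bic_two = aic_bic_least_squares(observed, two_rate_pred, n_params=4)
```

Lower values indicate a better trade-off between fit and complexity; in this toy case the two-rate predictions win on both criteria despite the two additional parameters.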

To evaluate the stability of our model fitting, we conducted both model recovery and parameter recovery analyses (see “Methods”). In the model recovery analysis, data were generated using single- and two-rate models to determine whether the generating model could be reliably identified. As shown in Fig. S14, while distinguishing between models within the single-rate and two-rate categories can be challenging in some cases, the distinction between the two categories is highly reliable. Furthermore, results show that our model selection procedure was conservative, favoring the null hypothesis (i.e., the single-rate model; Fig. S14b). This conservatism strengthens the credibility of our conclusion that the two-rate model provided the best fit for both monkeys.

In the parameter recovery analysis, we assessed whether the true parameters underlying the simulated behavior could be accurately recovered (Figs. S15, 16). Two experiments were conducted: the first used parameters close to those fitted to the behavior of the two monkeys, which would have higher biological relevance (Fig. S15), and the second explored a broader parameter range (Fig. S16). Both experiments revealed significant correlations between the recovered and true parameters (Spearman’s ρ > 0.21, p < 0.015), except for the forgetting rate of the fast system on the day following the reversal (ρ < −0.13, p > 0.13). This exception is consistent with expectations, as the fast system is largely extinguished by the day after the test, making its forgetting rate challenging to estimate accurately. Moreover, the average error across all parameter sets in both experiments was below approximately 12% (see “Methods”). These results suggest that, except for the forgetting rate of the day-after fast system, the overall parameter recovery was robust, further supporting the reliability of our model-fitting approach.
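The agreement between true and recovered parameters is summarized with Spearman's ρ, which is the Pearson correlation of the ranks. A minimal, dependency-free sketch is given below; the helper names and the five example parameter values are hypothetical and only serve to show the computation.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical true vs. recovered learning rates from five simulated fits.
true_params = [0.1, 0.2, 0.3, 0.4, 0.5]
recovered_params = [0.12, 0.18, 0.35, 0.33, 0.55]
rho = spearman_rho(true_params, recovered_params)
```

Here one adjacent pair is swapped in rank, so ρ = 1 − 6·2 / (5·24) = 0.9.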

Slow and fast processes are multiplexed in early and late firing components in vlPFC

To examine the neural representation of fast and slow processes, acute single-unit recordings from vlPFC (121 neurons, area 46v) were performed (Fig. S2). The value probe was done using a passive-viewing (PV) task at the same four time points (Fig. 1d). In the PV task, monkeys viewed good and bad fractals while fixating centrally without contingent reward (Fig. 1c). The value signal in vlPFC neurons is also known to remain unchanged across PV trials, similar to the FV task, as there is no contingent reward associated with the fractals. This makes it appropriate for probing value memories in the absence of RPE17. PV task is particularly suited for neural recording since the objects can be shown in the receptive field of a given neuron and are not contaminated with saccadic responses as animals maintain central fixation, allowing one to examine purely visually evoked value memories for each object.

Figure 2a shows an example neuron (Monkey H: Neuron # 27) with response time-locked to display onset across the four aforementioned time points. This is an example of a “good-preferring” neuron firing stronger in response to the good than the bad fractals. Prior to the switch, there was a clear difference in response of the vlPFC neuron to good vs. bad objects from 100–600 ms, signaling a well-established value memory in the PV task. Importantly, following value reversal, the neuron’s response was divided into two temporal components. The “early component” (~100–300 ms) of the neural response still showed higher firing to previously good compared to previously bad objects, consistent with their initial values. However, the “late component” (~300–600 ms) showed higher firing to the bad (new good) objects, consistent with object values after reversal. Interestingly, after 20 min, the late component started to fade away, while the early response was preserved. Since we were performing acute neural recording, a neuron could not be kept to test responses a day after reversal. To address this issue, we next showed the same neuron a different set of objects that had undergone value reversal a day earlier to test its day-long value memory. This technique was used previously to test long-term value memory in neurons17,21. Day-long memory showed a preserved preference for the initial values in the early component, while there was hardly a response difference in the late component. Figure 2b represents the population’s average response to preferred vs. non-preferred object values across value probes. Value preference (good-preferring [Gp] vs. bad-preferring [Bp]) of each neuron was determined using cross-validation based on responses before the value reversal (see “Methods”). The population average response clearly showed the presence of early (dark gray) and late (light gray) components. The differential response to preferred vs. 
non-preferred objects, henceforth referred to as the value signal, conspicuously demonstrated the dissociation of these two components after stimulus onset (Fig. 2b, bottom row). Prior to reversal, the neural response was significantly stronger for preferred values (positive value signal) in both early and late epochs, denoting the well-learned values (ts (120) > 5.3, ps < 0.0001). Following reversal learning, the early response got smaller (F (3, 442) = 17.65, p < 0.0001) but remained positive (t (120) = 6.32, p < 0.0001); however, the late response underwent a significant sign change to negative values (t (120) = 5.08, p < 0.0001) as predicted by the two-rate model (Fig. 1f, right). Subsequently, over time, the late component decayed toward zero from 20 min after (t (120) = 1.61, p = 0.1) to a day after the reversal (t (82) = 0.85, p = 0.3). On the other hand, the early component was relatively stable and remained significantly positive (after 20 min: t (120) = 5.5, p < 0.0001; next day: t (82) = 7.01, p < 0.0001) (for individual subject data see Fig. S3a, c). Importantly, we could predict the average vlPFC responses using the two-rate model and a one-to-one correspondence between early and late components with slow and fast processes, respectively (Fig. 2b, bottom row inset green bars). Similar early and late components in firing rate were observed during the actual reversal learning task (Fig. S4a). Note, however, that, compared to the PV task, the responses during the value reversal training are expected to have less power since they are changing over trials due to learning. Furthermore, the firing rate during value learning has both visual and saccadic components, which adds a confound to the interpretation of neural responses in the active task (Fig. S4b).

Fig. 2. Example neuron and average population response in value probes.

a The peri-stimulus time histograms (PSTH, first row) and raster plots (second and third rows) time-locked to the display onset of good (red) and bad (blue) objects are shown. The black vertical line indicates the object’s onset. b Average population PSTH for preferred and non-preferred values (top row) for n = 121 neurons before, immediately after, and 20 min after reversal, and n = 83 neurons the day-after reversal condition. Dark and light gray patches represent early (100–300 ms) and late (300–600 ms) epochs, respectively. The difference in firing rate to preferred vs. non-preferred values (value signal) during the experiment (bottom row). Shading represents the SEM here and after. Inset boxplots show the average value signal during the early and late epochs, along with predicted values from the slow and fast processes of the two-rate model. Data points beyond the whiskers were included in all analyses but are not shown for visualization. Two-sided paired t-tests with FDR correction; uncorrected p-values (Early epoch: p = 1.89 × 10⁻²⁰ (before), 4.53 × 10⁻⁹ (immediately after), 1.72 × 10⁻⁷ (20 min after), 5.89 × 10⁻¹⁰ (day after); Late epoch: p = 3.56 × 10⁻⁷ (before), 1.29 × 10⁻⁶ (immediately after), 0.111 (20 min after), 0.398 (day after)). c Neuron-wise scatterplot of neural value signal for early (left) and late (right) epochs and gaze bias following reversal training. Here and thereafter, all correlations are Spearman’s ρ, early epoch: ρ = 0.10, p = 0.25; late epoch: ρ = 0.28, p = 0.002. d Same format as (c) but for spontaneous recovery the day after the reversal: early epoch: ρ = 0.20, p = 0.04; late epoch: ρ = 0.17, p = 0.09.

According to our proposed two-rate model, viewing duration at any time point should be controlled by the sum of slow and fast processes. Therefore, one predicts that viewing duration should be correlated with the sum of early and late epochs in the firing rate value signal. Results show that the sum of both components in the firing rate shows a significant positive correlation with FV gaze bias both immediately after reversal and also a day later (Fig. S5a: reversal of gaze bias: Spearman’s ρ = 0.21, p = 0.023; spontaneous recovery: ρ = 0.20, p = 0.04).

At the next level, one may ask how individual early and late epochs (not just their sum) should correlate with viewing duration across sessions. Once again, the two-rate model provides certain expectations. Given the low learning and forgetting rates for the slow process and the brevity of the reversal period, the variability in parameters of the slow (early) process is expected not to show itself immediately following reversal; rather, the viewing duration immediately after reversal is expected to be more strongly correlated with the variability in the fast (late) process. This is indeed what we observed in component correlations immediately after the reversal (Fig. 2c slow: ρ = 0.10, p = 0.25; fast: ρ = 0.28, p = 0.002). However, for the day later time point, the variability in the slow system (in particular, the variability in the forgetting rate) has enough time to become apparent in the behavior. Thus, one predicts that the slow system will show a stronger correlation at this time point, during which spontaneous recovery is observed. Once again, the neural results are broadly consistent with this prediction (Fig. 2d slow: ρ = 0.2, p = 0.04, fast: ρ = 0.17, p = 0.09).

Furthermore, the presence of the unchanging early component in vlPFC firing raises the interesting possibility that the initial saccade in the FV task should be toward the initially good object even after the reversal. Indeed, this was found to be the case. In both monkeys, the first saccade bias remained toward the initial good objects (monkey H: t (44) > 2.97, p < 0.0048; monkey P: t (40) > 3.19, p < 0.0028, Fig. S6a) even though the overall viewing duration clearly switched toward the bad (new good) objects after the reversal. This suggests the early component, with slow learning/forgetting, underlies automatic and rapid attentional control toward objects. In contrast, the fast learning/forgetting process may be involved in the more deliberate, later components of attention.

Significantly, the two-rate model predicts that continuing the reversal should eventually cause even the slow system to reverse polarity. If the early epoch is indeed a readout of the slow system, one predicts that both epochs should become negative after extended reversal training. To test this hypothesis, we conducted a control experiment with one of the monkeys (monkey H). The experimental paradigm was the same, except the reversal training continued for 5 days instead of one session. In this case, results show a complete reversal in both early and late responses (Fig. S7a, t(33) > 4.06, p < 0.0002). The viewing duration bias also shifted to the bad (now high-valued) fractals (Fig. S7b top row, t(14) = 7.72, p < 0.0001). The extended reversal training also resulted in a biased first saccade toward bad objects (Fig. S7b bottom row, t(14) = 13.24, p < 0.0001). This phenomenon, which did not occur with only one reversal session (Fig. 2b, Fig. S6), supports our hypothesis that the early component is responsible for coding the slow system, which underlies quick, habitual, and automatic responses.

Next, we addressed whether the temporal multiplexing is observed at the level of individual neurons or is mainly a population effect. In the latter case, the early and late components can be segregated across different sub-populations. Furthermore, the stability of the early component may not be a feature of individual neurons but of the population (e.g., as one subpopulation loses the value coding, others gain it). These scenarios will have different implications for value coding in vlPFC and the representation of slow and fast processes in this region. Clearly, the example neuron in Fig. 2a supports the former possibility. To further address this question, the discriminability of good vs. bad objects (value signal) across value probes in the early and late component for all neurons was quantified using the area under the receiver operating characteristic curve (termed value AUC here; Fig. 3). Value AUCs significantly above 0.5 indicate higher firing for good compared to bad trials (Gp), and below 0.5 indicates higher firing for bad compared to good trials (Bp). Consistent with previous results of vlPFC, most neurons showed significantly stronger excitation to good objects initially before reversal (mean AUC = 0.64, t (120) = 15.4, p < 0.0001) in the marginal AUC distribution in the early epoch. They showed little change at various points afterward (mean AUC > 0.55, t (120) > 6.78, p < 0.0001). The marginal distribution before reversal in the late epoch was also significantly biased toward good preference (mean AUC = 0.56, t (120) = 7.02, p < 0.0001), but immediately after value reversal, the marginal AUC distribution underwent significant change shifting toward bad preference (mean AUC = 0.45, t (120) = 5.05, p < 0.0001). Notably, 20 min after reversal, there was no significant preference (mean AUC = 0.49, t (120) = 0.23, p = 0.81). 
Still, there was a marginally significant preference after a day (mean AUC = 0.51, t(82) = 1.99, p = 0.049), which did not remain significant after false discovery rate (FDR) correction for multiple comparisons (adjusted p = 0.056, q < 0.05).
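The value AUC used above has a simple probabilistic reading: it is the probability that a randomly chosen good-object trial evokes a higher firing rate than a randomly chosen bad-object trial, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and the firing rates are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def value_auc(good_rates, bad_rates):
    """Area under the ROC curve for discriminating good from bad trials.

    Equals P(good > bad) + 0.5 * P(good == bad) over all trial pairs.
    Values above 0.5 indicate a good-preferring response, below 0.5 a
    bad-preferring response.
    """
    wins = 0.0
    for g in good_rates:
        for b in bad_rates:
            if g > b:
                wins += 1.0
            elif g == b:
                wins += 0.5
    return wins / (len(good_rates) * len(bad_rates))


# Hypothetical per-trial firing rates (spikes/s) in one epoch for one neuron.
good_trials = [12, 15, 9, 14]
bad_trials = [8, 10, 7, 11]

auc_good_vs_bad = value_auc(good_trials, bad_trials)  # 14 of 16 pairs favor good
auc_swapped = value_auc(bad_trials, good_trials)      # mirror image around 0.5
```

Swapping the two trial sets reflects the AUC around 0.5, which is why good-preferring and bad-preferring neurons fall on opposite sides of the chance level.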

Fig. 3. Value responses in individual neurons in early and late epochs across value probes.

a Scatter plot of pairwise discriminability of good vs. bad objects in the early epoch (Value AUC) for immediately after (left), 20 min after (middle), and a day after (right) value reversal vs. before reversal along with marginal distributions. The magenta, cyan, and gray colors in the value AUC histograms show the good-preferred (GP, significant AUC > 0.5), the bad-preferred (BP, significant AUC < 0.5), and nonsignificant (NS) neurons, respectively (two-sided Wilcoxon rank-sum). Black dots: significant neurons in either axis; Gray dots: non-significant neurons. The solid black line indicates the linear fit (Deming regression). Average AUCs (marginal histograms) were computed across value probes and tested against the chance level (AUC = 0.5); two-sided paired t-tests with FDR correction; uncorrected p values: before = 4.43 × 10⁻³⁰, immediately after = 4.06 × 10⁻¹⁰, 20 min after = 4.65 × 10⁻¹⁰, day after = 6.35 × 10⁻⁸. Correlations between AUCs before and after reversal: pre vs. immediately after ρ = 0.48 (p = 2.29 × 10⁻⁸), vs. 20 min after ρ = 0.48 (p = 1.9 × 10⁻⁸), vs. day after ρ = 0.47 (p = 8.4 × 10⁻⁶). The inset squares show the percentage of neurons in each quadrant with grayscale color code. b Same as (a), but for the late epoch. Uncorrected p values: before = 1.67 × 10⁻¹⁰, immediately after = 1.55 × 10⁻⁶, 20 min after = 0.81, day after = 0.049 (not significant after FDR correction). Correlations between pre- and post-reversal AUCs: pre vs. immediately after ρ = 0.17 (p = 0.055), vs. 20 min after ρ = 0.27 (p = 0.0029), vs. day after ρ = 0.35 (p = 0.0013).
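To illustrate how an uncorrected p of 0.049 can lose significance after FDR correction, the sketch below applies the Benjamini-Hochberg step-up adjustment to the eight uncorrected p-values listed in this caption (four early-epoch and four late-epoch tests). Two things are assumptions here: that the correction is Benjamini-Hochberg, and that the eight tests were pooled into one family. The text does not state either.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Uncorrected p-values from the caption: before, immediately after,
# 20 min after, and day after reversal.
early = [4.43e-30, 4.06e-10, 4.65e-10, 6.35e-8]
late = [1.67e-10, 1.55e-6, 0.81, 0.049]

# Assumed family: all eight tests pooled.
adjusted = benjamini_hochberg(early + late)
late_day_after_adjusted = adjusted[7]  # 0.049 * 8 / 7
```

Under this pooling, the day-after late-epoch value becomes 0.049 × 8 / 7 = 0.056, which exceeds q = 0.05 and agrees numerically with the adjusted p reported in the text. That agreement is suggestive but does not by itself confirm how the tests were grouped.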

There was a significant positive correlation in value AUC between before reversal and all other memory conditions (Spearman’s ρs > 0.47, ps < 0.0001), indicating a relatively stable value memory in early components across individual neurons. The correlation in value AUC in the late epoch before reversal and after was, in general, weaker than in the early epoch and was not significant immediately after the reversal (immediately after: ρ = 0.17, p = 0.055, 20 min after: ρ = 0.27, p = 0.0029, day after: ρ = 0.35, p = 0.0013). The early and late epochs differed in these correlations across columns in Fig. 3, most clearly immediately after reversal (difference in “before” vs. “after reversal” correlations: p = 0.007, z = 2.69; “before” vs. “20 min after”: p = 0.058, z = 1.89; “before” vs. “day after”: p = 0.36, z = 0.91). Importantly, examination of the percentage of neurons with positive or negative value signals (Fig. 3a, b inset with <0.5 and >0.5 value AUCs, respectively) showed the stability of value coding in the early component (neurons in the first quadrant, χ² = 46, p < 0.0001) and value reversal in the late component (neurons in the fourth quadrant, χ² = 16, p < 0.0001) following reversal for a significantly large fraction of neurons. We also conducted this analysis across all pairs of time points separately for early (Fig. S8) and late (Fig. S9) components. As expected for the early component of firing, all pairwise correlations were highly significant and positive, being roughly unaffected by the reversal and forgetting periods, consistent with the early component representing a relatively stable process. On the other hand, the late component showed pairwise correlations that changed as a function of the temporal distance between each probe and the reversal task (i.e., before and after reversal), consistent with the late component representing a more dynamic and flexible process.

Population state of the PFC neurons is affected by value reversal beyond the value signal

Previous analysis showed that slow and fast coding processes in the early and late components were evident in a considerable percentage of neurons. However, some neurons still did not conform to the two-rate model. This suggests that other information may be encoded by these subpopulations. To better understand dominant information in the population beyond the value signal, we examined the population response dynamics by applying a modified principal component analysis, which we refer to as partial-PCA (see “Methods”). Partial-PCA extracts PCA components (in this case, the first two dimensions, PC 1 and 2), which capture the directions with the largest variance orthogonal to the value dimensions that align the most with the value signal present in the neurons (i.e., PCA is done after value dimension is partialled out). Figure 4 shows the first two PCs plotted against the value axis for early and late responses. As expected, the preferred and non-preferred objects were well-separated along the value axis. Consistent with the previous analysis, the polarity of pref/non-pref discrimination remained the same across the four time points in the early component but was reversed in the late component following value reversal. Interestingly, while PC1 showed a trajectory that traveled out and returned a day later to a similar position as before reversal, PC2 showed a sustained displacement from before reversal to a day later (resampling F (1398) > 19.7, p < 0.0001, MANOVA test). Additionally, there was a significant difference between before-learning and the next day’s neural states (resampling Hotelling’s T2 (298) > 36.8, p < 0.0001). These results indicate that apart from early and late components of the value signal across paradigm, the vlPFC population showed irreversible changes in its response due to experiencing value reversal for the fractals. 
Using demixed PCA26 also reveals one value-probe component that returns to its pre-reversal state and another that sustains a change even a day later, in addition to two value dimensions that roughly correspond to the early and late components (Fig. S10). Finally, standard PCA also gives qualitatively similar results, although, as expected, it is less tuned to the task-relevant dimensions (Fig. S11).

Fig. 4. 2D projection of population responses using partial-PCA across value probes.

a PC1 vs. value dimension (left) and PC2 vs. value dimension (right) for the early epoch. b same as (a), but for the late epoch.

The two-rate model is reflected in the vlPFC local field potential

Modulations in various frequency bands in local field potential (LFP) are believed to be indicative of local and global computations in neural networks that are relevant for dissociating inputs to and outputs from a region27–31. Figure 5a shows the time-frequency power modulations for good and bad objects and their difference (bottom row) in the PV task in the four time points and different frequency bands. Almost all major band powers, including high-gamma (60–200 Hz), low-gamma (30–60 Hz), beta (12–30 Hz), alpha (8–12 Hz), and theta (4–7 Hz), showed significant modulation in response to objects and value reversal, with many showing differences for good vs. bad object presentations. Among all frequencies, the power in high-gamma most closely resembled the pattern seen in the population average, with early and late components coexisting in its value signal (Fig. 5b). The early component in high-gamma value signal (power difference of good–bad) was significantly positive in all time points including immediately after reversal (t(50–67) > 3.39, ps < 0.001). In contrast, the positive value signal in the late epoch (t(67) = 7.63, p < 0.0001) became negative following reversal (t(67) = 4.03, p = 0.00014). Similar to the population firing rate, the negative value signal in the late component faded away 20 min after and one day later (t(50–67) < 1.85, p > 0.067) (for individual subject data see Fig. S3b, d). On the other hand, there was no evidence of time multiplexing in the alpha and theta bands as the value signal was similar in both early and late components after the reversal in alpha (t(67) = 0.27, p = 0.78) and theta bands (t(67) = 0.13, p = 0.89). However, both bands showed a strong spontaneous recovery in their value signal (Fig. S12a).
In both the alpha and theta bands, there was a positive value signal for good vs. bad objects (ts(67) > 3.04, ps < 0.003), which was attenuated after reversal (t(67) < 1.56, ps > 0.12) but reemerged 20 min after reversal for the theta band (t(67) > 2.78, ps < 0.006) and on the next day for both the alpha and theta bands (t(50) > 2.85, ps < 0.006). Interestingly, the value signal 20 min after and on the next day was almost as strong as the value signal before reversal in these two bands (post hoc Tukey HSD: ps > 0.94). The beta band did not show a significant value signal at any time point, suggesting that it may not play a role in fast or slow value memories (Fig. S12a).

Fig. 5. LFP response to good and bad objects across value probes and correlations with behavior.

a LFP power across time for 0–200 Hz frequencies for good (top row), bad (middle row), and the response difference to good vs bad (LFP value signal, bottom row) across the four time points. Color bar shows LFP power changes Z-normalized relative to the baseline (−300 to 0 ms). Y-axes are logarithmic. b Average high-gamma (60–200 Hz) band power in response to good (red) and bad (blue) fractals and the power difference (good–bad, gray) for n = 68 LFP sessions before, immediately after, and 20 min after reversal, and n = 51 sessions the day-after reversal condition. Inset boxplots show the mean power difference (good–bad) across timepoints for the early and late epochs. Data points beyond the whiskers were included in all analyses but are not shown for visualization. c Session-wise scatterplot of high-gamma band power difference (good–bad) for early (left) and late (right) epochs and gaze bias following reversal training: early epoch: ρ = 0.05, p = 0.66; late epoch: ρ = 0.38, p = 0.001. d same format as (c) but for the behavioral spontaneous recovery the day after the reversal: early epoch: ρ = 0.21, p = 0.06; late epoch: ρ = 0.36, p = 0.001.

Similar to the firing rate value signal, there was a trending positive correlation between gaze bias reversal and the sum of early and late components (Spearman’s ρ = 0.23, p = 0.052, Fig. S5b left) and a significant correlation with spontaneous recovery (ρ = 0.27, p = 0.015, Fig. S5b right). Also, similar to what was observed for the firing rates, there was a significant correlation between the late component, but not the early component, of the high-gamma band power value signal and the behavioral reversal of gaze bias in FV following reversal training (ρearly = 0.05, p = 0.66, ρlate = 0.38, p = 0.001, Fig. 5c). There was no significant correlation between behavioral reversal and either the late or the early response of any of the other frequency bands (only a weak negative correlation between behavioral reversal and the late epoch of the beta band, which became non-significant after FDR correction; ρ = –0.25, p = 0.03; but for others: ρ < 0.21, ps > 0.07, Fig. S13a). Moreover, a significant correlation was also identified between behavioral spontaneous recovery and the late component of the high-gamma band power value signal, with a trending correlation for the early component (ρearly = 0.21, p = 0.06, ρlate = 0.36, p = 0.001, Fig. 5d). Once again, there was no significant correlation between spontaneous recovery and any of the other frequency bands (ρs < 0.13, ps > 0.24, Fig. S13b). Finally, the high-gamma band value signal showed the strongest and most consistent correlation with the firing rate value signal (Fig. S12b). Interestingly, results showed decreased power correlations between the early component of high-gamma and alpha/theta bands following the value reversal. This may suggest a change in the routing of information subsequent to experiencing value reversal (Fig. S12c).

Discussion

Objects that keep their values stably and for a long time enable rapid and skillful reactions1,32. Nevertheless, changes in old values can happen, requiring flexibility to stop automatic reactions and the ability to engage in deliberate interaction based on new values. Here, we studied the interplay of slow and fast value learning processes that supported maintaining old values vs. adapting to new values in the environment, respectively. Importantly, we showed that a two-rate but not a single-rate model predicted relapse to old values after a period of value reversals, a prediction that was then confirmed in behavioral data (Fig. 1). We showed that both slow and fast learning processes were represented in vlPFC population and within individual neurons in the early and late components of their firing (Figs. 2, 3). The two-rate processes were also evident in various LFP frequency bands, in particular the high-gamma, alpha, and theta ranges, which showed the effect of reversal and spontaneous recovery to old values (Fig. 5). Finally, there was a significant correlation between value signal in neural firing and in LFP with both the degree of behavioral reversal and spontaneous recovery in free viewing (Figs. 2, 5), consistent with the role of vlPFC in the behaviors studied.

On the behavioral side, our results demonstrated a striking aspect of value adaptation, namely spontaneous recovery of attention to old values despite a period of complete value unlearning (Fig. 1h). We showed that such spontaneous recovery was diagnostic of a two-rate model rather than a classical single-rate reinforcement model (Fig. 1e, f, h). Such a spontaneous recovery can be a key factor in explaining why old habits die hard and why relapse to old maladaptive behaviors occurs, as seen in substance abuse disorders33,34. Such a two-rate model has already been established in motor learning and is shown to successfully explain a slew of phenomena, such as spontaneous recovery22, savings (where relearning happens faster than initial learning)35,36, and anterograde interference (when learning an initial adaptation slows the learning of an opposite adaptation)37,38. This suggests that a two-rate model might be a fundamental principle underlying adaptation across a wide range of domains, from value to sensorimotor learning. Furthermore, such a two-rate model can predict behaviors such as savings and anterograde interference in value learning, which have not been addressed to date.
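To make the contrast between the two model classes concrete, the following is a minimal simulation sketch. It assumes a linear two-state form borrowed from the motor-adaptation literature, in which each process retains a fraction A of its state and learns a fraction B of the shared prediction error. The function name `simulate`, the retention and learning-rate values, and the trial counts are illustrative assumptions and are not the fitted parameters of this study.

```python
import numpy as np

def simulate(rates, n_train=2000, n_reversal=80, n_rest=50):
    """Net learned value (good minus bad) across training, reversal, and rest.

    rates: list of (A, B) pairs, one per process (A = retention, B = learning rate).
    During rest there is no feedback, so each process only decays.
    """
    A = np.array([a for a, _ in rates], dtype=float)
    B = np.array([b for _, b in rates], dtype=float)
    x = np.zeros(len(rates))                      # state of each process
    targets = [1.0] * n_train + [-1.0] * n_reversal + [None] * n_rest
    net = []
    for r in targets:
        err = 0.0 if r is None else r - x.sum()   # shared prediction error
        x = A * x + B * err                       # retain, then update
        net.append(x.sum())
    return np.array(net)

# Hypothetical parameters: a fast, forgetful process plus a slow, stable one.
two_rate = simulate([(0.90, 0.20), (0.999, 0.005)])
# A single process with one intermediate rate.
single_rate = simulate([(0.99, 0.10)])

end_of_reversal = 2000 + 80 - 1
print("two-rate   :", two_rate[end_of_reversal], "->", two_rate[-1])
print("single-rate:", single_rate[end_of_reversal], "->", single_rate[-1])
```

Under these assumed parameters, the two-rate net value is negative at the end of reversal and becomes positive again during the rest period, because the fast process decays while the slow process still holds the original value. The single-rate value only decays toward zero and never changes sign.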

Importantly, neurons in vlPFC reflected signatures of both slow and fast processes in the early and late components of their firing rates, respectively. Late response suddenly emerged following value reversal and faded away after ~20 min, signaling flexible value coding, whereas the early component was relatively preserved, indicating stable coding. Interestingly, the early component of vlPFC was comparable to the quick value signal in the CDt, appearing ~100 ms after stimulus onset, consistent with stable value coding and a more automatic response to objects. On the other hand, the late response of vlPFC was similar to the value signal in CDh, with an onset of ~300 ms consistent with flexible value coding and a late and more deliberate response to objects9. This suggests that the late response in vlPFC might be influenced by or potentially originate from the CDh, although further studies are necessary to confirm this connection. In addition to the early and late response components, we found sustained changes in the population responses in vlPFC that started from the reversal and remained the next day. A similar phenomenon has recently been reported in motor learning, which involves a wash-out period39.

Our results also showed a significant value signal and signatures of the two-rate model in vlPFC LFP. In particular, we found both slow and fast processes to be encoded in high-gamma LFP in their early and late temporal components after object onset. This can be expected as the high-gamma power is believed to be linked to the spiking output of a region indicative of local computations supported by the interneuron network29,30. Thus, the high-gamma results are mainly confirmatory for the results seen in the population firing rates in the vlPFC region. However, there were some differences in the correlation of early and late epochs in firing rates and LFPs (Figs. 2d, 5d) with behavior, which warrant further investigation to determine which of the two signals is a better candidate for supporting the predictions of the two-rate model. On the other hand, there was no evidence of time multiplexing in the alpha and theta bands. However, both the alpha and theta bands showed a single component in the value signal that was attenuated following reversal but recovered afterward (Figs. 5 and S12). Low-frequency LFP signals can reflect synchronous modulation by the synaptic input29,30 or feedback to the input regions40. Interestingly, there was a decreased correlation between early components of high-gamma and alpha/theta band value signal following the reversal, which suggests the possibility that the early component of vlPFC neurons may have become more independent of inputs after established value expectations were violated during the reversal.

In summary, our results showed that value adaptation engages a two-rate model with slow and fast dynamics that can give rise to the relapse to old habits after a successful washout. This is consistent with mounting evidence showing that the process of extinction does not wipe old memories but creates new ones that temporarily mask the old41. Thus, the two-rate system provides a framework to understand relapse and extinction in value-based maladaptive behaviors. Remarkably, our study signifies the role of vlPFC in representing conflicting information by multiplexing stable and flexible values in early and late firing components. The two-rate system encoded by vlPFC neurons allows for capturing a broad spectrum of reward information, enabling both rapid behavioral adaptations and preserving long-term value memories. Studies on distributional RL have also shown that distinct neuronal processes may have different timescales in value learning and RPE. Together, these findings suggest that employing multiple learning rates might be the brain’s strategy for capturing the full spectrum of reward probabilities in different contexts42.

Furthermore, our findings are also consistent with previous studies that highlight the role of the vlPFC in updating reward probabilities, behavioral flexibility, and reversal learning. It has been shown that vlPFC is crucial for dynamically tracking outcome availability, probabilistic associations, and adjusting decisions based on probabilistic feedback, which is essential for adapting behavior in response to changing environments16. Lesions in this area have been shown to impair the ability to update reward contingencies and disrupt strategies like win-stay/lose-shift, which are vital for effective reversal learning15,16. vlPFC encodes changes in reward contingencies and facilitates adaptive responses by integrating both stable and flexible value signals across early and late components16. These findings also suggest vlPFC as a potential target for controlling relapse in maladaptive behaviors such as substance use disorders alongside conventional areas in the orbitofrontal cortex and ventral striatum43–46. Indeed, the vlPFC connectivity with basal ganglia structures forming a cortical-striatal-thalamocortical loop10,18,19 and other regions, such as the orbitofrontal cortex (OFC)14,47,48, anterior cingulate cortex (ACC)48,49, lateral intraparietal cortex (LIP)48,50, and the amygdala51, which mediate reward-based decisions and error monitoring52, with inferior temporal cortex for object discrimination and long-term value learning and memory24,48, with the pre-supplementary motor area, subthalamic nuclei and OFC for behavioral switching and reversal learning47,53–55 as well as directly with SC and frontal eye field (FEF) for saccade control18,48 situates vlPFC as a central hub to be informed of and manage conflicting reward histories, in particular when the interplay of both short-term and long-term values should be orchestrated for optimal decision making.
Future investigations should address the neural mechanisms that allow vlPFC to encode and multiplex early and late value components and whether each component can be independently manipulated.

Methods

Subjects and surgery

Two male adult rhesus macaque monkeys (Macaca mulatta) were utilized for this experiment (monkeys P and H, 7 and 12 years old, respectively). All animal care and experimental procedures were carried out in compliance with ethical guidelines established by the National Institutes of Health (NIH). They were approved by the local ethics committee at the Institute for Research in Fundamental Sciences (IPM) (protocol number 99/60/1/172). Prior to the experiment, both monkeys underwent surgery under general anesthesia for the head holder and recording chamber installation. The head holder was implanted at the midline, and a recording chamber was placed and tilted laterally over the right PFC for monkey P and the left PFC for monkey H. Following surgery, MRI imaging was performed to confirm the correct chamber position for both monkeys. Subsequently, monkeys were trained to learn the experimental tasks, after which a second surgery was performed for craniotomy over the PFC region of both monkeys. The neural recording was done through grids with 1-mm spacings placed over the chamber.

Recording localization

Ventrolateral PFC (vlPFC) localization was performed using T1- and T2-weighted MRI imaging (3T, Prisma Siemens). During imaging, the recording chambers of both monkeys were filled with Gadolinium as a contrast agent for enhanced imaging results. Subsequently, AFNI and ImageJ software were used to transform each monkey’s native space into the standard monkey atlas (NMT) to further verify the PFC location and to determine the vlPFC region accessible through the recording chamber of each monkey56 (Fig. S2). A total of 62 and 59 neurons were included in the analysis from area 46v ventral to the principal sulcus for monkeys P and H, respectively.

Stimuli

Fractal geometry objects were used as visual stimuli. Four-point-symmetrical polygons overlaid around a common center, with smaller polygons in the front, comprised a fractal. The size, edges, and color of each polygon were chosen randomly. Fractal diameters averaged 4° and were displayed on a CRT monitor. During the course of the experiment, each monkey was exposed to a substantial number of fractals (>500 fractals) in sets of eight for this task and previously published work57. For the current task, monkeys P and H were trained on 44 and 45 sets, respectively, which were used during the vlPFC recording and/or modeling work presented. Furthermore, the macaque illustration used in Fig. 1c was created with BioRender (Hashemi, R., 2025, https://BioRender.com/thgidps).

Task control and neural recording

All behavioral tasks and recordings were controlled by a customized software program written in C. Neural data acquisition and output control were performed using a Cerebus Blackrock Microsystem device (www.blackrockneurotech.com). During each experimental session, head-fixed monkeys sat in a primate chair and viewed visual stimuli on a 21-in. LG CRT monitor. Eyelink 1000 plus was used to track eye position during the experiment with a sampling rate of 1 kHz.

Single-unit activity was recorded with epoxy-coated tungsten electrodes (FHC, 200 μm thickness) and a laminar microelectrode array (MicroProbes, 16 channels, 125 μm thickness with a 250 μm-thick stainless steel body). For each recording session, the electrode was loaded into a sharpened stainless steel guide tube and then mounted on a Narishige (MO-97A, Japan) oil-driven micromanipulator. The dura was punctured by the guide tube, and the electrode was inserted into the brain using the aforementioned micromanipulator. A total of 44 FHC-electrode sessions were recorded for monkey P, and 35 FHC-electrode sessions and four laminar-recording sessions were recorded for monkey H. Additionally, 15 FHC electrode sessions from monkey H were recorded for the control experiment (Fig. S7).

The electrode’s electrical signal was amplified and filtered (1 Hz to 10 kHz) and subsequently digitized at 30 kHz. Blackrock online sorting with the hoops was used to isolate the unit spike shapes. The online peri-stimulus time histogram (PSTH) and raster plots of the selected unit were created using a custom code written in MATLAB 2018b (Mathworks Inc). All well-isolated and visually responsive neurons were recorded. For the actual analysis of each session, offline sorting with Plexon offline sorter was used afterward (Plexon, Dallas, TX, USA). In this analysis, a negative threshold of 3–3.5 standard deviations away from the median was applied to detect spikes. Then, using principal component analysis (PCA) and Plexon’s built-in template sorting algorithm, the spikes were isolated and clustered into different units. Units for which ≤0.5% of spikes fell within the 1.2 ms refractory period were considered single units. To select visually responsive neurons, custom-written MATLAB functions were used. In short, first, the average PSTH time-locked to display onset across all trials for each neuron was computed. Then, the PSTH was z-scored using the baseline from –200 to 0 ms relative to the object onset. Using MATLAB findpeaks, the first response peak after object onset was detected. The minimum threshold for peak height was 1.64, corresponding to a one-sided 95% confidence level. The first valley before the first detected peak was taken as the visual onset of the neuron. The onset of the value signal was also detected using this algorithm on the average value PSTH of individual neurons. All visually responsive neurons that maintained the visual response throughout the paradigm were included in this study.
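The onset-detection steps above can be sketched as follows. This is a minimal Python analogue that uses `scipy.signal.find_peaks` in place of MATLAB `findpeaks`; the function name `visual_onset`, its arguments, and the choice to search for the valley over the whole pre-peak trace are assumptions made for illustration, not the original analysis code.

```python
import numpy as np
from scipy.signal import find_peaks

def visual_onset(psth, t_ms, baseline=(-200, 0), min_height=1.64):
    """Return the estimated response onset (ms), or None if no peak passes threshold.

    psth : trial-averaged firing rate, one value per time bin
    t_ms : time of each bin relative to object onset (ms)
    """
    psth = np.asarray(psth, dtype=float)
    t_ms = np.asarray(t_ms)
    base = psth[(t_ms >= baseline[0]) & (t_ms < baseline[1])]
    z = (psth - base.mean()) / base.std()          # z-score against baseline

    post = np.flatnonzero(t_ms >= 0)               # bins after object onset
    peaks, _ = find_peaks(z[post], height=min_height)
    if peaks.size == 0:
        return None
    first_peak = post[peaks[0]]                    # index into the full trace

    # Valleys are peaks of the negated trace; take the last one before the peak.
    valleys, _ = find_peaks(-z[:first_peak])
    if valleys.size == 0:
        return None
    return t_ms[valleys[-1]]
```

Applied to the trial-averaged PSTH of a neuron, the function returns the time of the last local minimum preceding the first supra-threshold peak.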

Neural data analysis

Single-unit spike analysis

All neural responses were time-locked to object (visual stimuli) onset. The primary analysis epoch was from 100 to 600 ms after stimulus onset in the passive viewing task (see Experimental tasks), which was further divided into early (100–300 ms) and late (300–600 ms) epochs. The area under the receiver operating characteristic curve (AUC) was used to measure the discriminability of objects’ values based on the firing rate across trials in response to good and bad objects in the corresponding epochs.
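As an illustration of the epoch-wise value AUC, the sketch below computes firing rates in the early and late windows and obtains the AUC from the rank-sum (Mann-Whitney) statistic, which is equivalent to the area under the ROC curve. The function names and the input layout (spike times in ms relative to object onset) are assumptions for this example.

```python
import numpy as np
from scipy.stats import rankdata

# Analysis epochs in ms after object onset, as defined in the text.
EPOCHS_MS = {"early": (100, 300), "late": (300, 600)}

def firing_rate(spike_times_ms, window):
    """Firing rate (spikes/s) of one trial within a (start, stop) window in ms."""
    start, stop = window
    spikes = np.asarray(spike_times_ms, dtype=float)
    n_spikes = np.sum((spikes >= start) & (spikes < stop))
    return n_spikes / ((stop - start) / 1000.0)

def value_auc(good_rates, bad_rates):
    """AUC for discriminating good from bad trials (0.5 = chance).

    Values above 0.5 mean higher firing for good objects; ties receive
    average ranks, so identical distributions give exactly 0.5.
    """
    good = np.asarray(good_rates, dtype=float)
    bad = np.asarray(bad_rates, dtype=float)
    ranks = rankdata(np.concatenate([good, bad]))
    u = ranks[: good.size].sum() - good.size * (good.size + 1) / 2
    return u / (good.size * bad.size)
```

For one neuron, `value_auc` would be called once per epoch with the per-trial rates returned by `firing_rate` for good and bad object presentations.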

Additionally, in the population of neurons in the vlPFC that encode value, some neurons are “good-preferring” (i.e., they fire more strongly for high-valued fractals compared to low-valued ones), while some are “bad-preferring” (they respond more to low-valued fractals). To avoid canceling out the differences in the average responses across these neurons, we categorize responses as “preferred” vs. “non-preferred” before averaging across the population. The preferred object value for each neuron was determined using a cross-validation technique17. A deconvolution technique was also used to dissociate the visual and saccadic responses in the PSTH of the reversal training task (Fig. S3b)58.
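The reason for cross-validating the preferred value can be illustrated with a small sketch. If preference were assigned and measured on the same trials, even a neuron with no value coding would show a positive preferred minus non-preferred difference. The split used here (even vs. odd trials, averaged over both directions) and the function name are assumptions for illustration; the exact procedure is the one described in ref. 17.

```python
import numpy as np

def cross_validated_value_signal(good, bad):
    """Preferred minus non-preferred response with held-out preference assignment.

    good, bad : per-trial firing rates for good and bad objects.
    The sign of the preference is set on one half of the trials and the
    difference is measured on the other half; the two folds are averaged.
    """
    good = np.asarray(good, dtype=float)
    bad = np.asarray(bad, dtype=float)
    folds = []
    for train, test in ((0, 1), (1, 0)):           # even/odd trial halves
        prefers_good = good[train::2].mean() >= bad[train::2].mean()
        sign = 1.0 if prefers_good else -1.0
        folds.append(sign * (good[test::2].mean() - bad[test::2].mean()))
    return float(np.mean(folds))
```

With this convention, good-preferring and bad-preferring neurons both contribute positive values when their preference is consistent across halves, whereas neurons with unreliable preference average toward zero.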

Local Field Potential (LFP) analysis

For the LFP analysis, data were down-sampled to 1 kHz and band-pass filtered between 0.1 and 250 Hz with the EEGLAB (v2022.0) FIR filter. Line noise and its sub-harmonics were removed using notch filters. For each session, trials whose LFP amplitude range deviated by more than 3 standard deviations from the median were identified as noisy trials and removed in pre-processing. The time-frequency analysis was performed using a continuous wavelet transform (CWT) with a seven-cycle window. For this purpose, the Fourier transforms of the analytic Morlet wavelet and of the signal were calculated separately. The two Fourier transforms were then multiplied, and an inverse Fourier transform was applied to the product to obtain the CWT time-frequency results59. Zero-padding was performed to avoid edge artifacts. Subsequently, baseline normalization was performed on the resulting time–frequency power, using the z-scoring method with a baseline of –300 to 0 ms relative to object onset. In this study, five frequency bands were defined as theta (4–7 Hz), alpha (8–12 Hz), beta (12–30 Hz), low-gamma (30–60 Hz), and high-gamma (60–200 Hz).
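The frequency-domain wavelet convolution described above can be sketched as follows. This is a minimal NumPy version under stated assumptions: a complex Morlet wavelet with a Gaussian width of n_cycles / (2πf), a ±1 s wavelet support, and baseline statistics taken along the time axis of a single trace. The function names are hypothetical and the code is not the original EEGLAB/MATLAB pipeline.

```python
import numpy as np

# Frequency bands (Hz) as defined in the text.
BANDS_HZ = {"theta": (4, 7), "alpha": (8, 12), "beta": (12, 30),
            "low_gamma": (30, 60), "high_gamma": (60, 200)}

def morlet_power(lfp, fs, freqs, n_cycles=7):
    """Time-frequency power via FFT-based convolution with Morlet wavelets."""
    lfp = np.asarray(lfp, dtype=float)
    wt = np.arange(-fs, fs + 1) / fs               # wavelet time axis, -1 to 1 s
    n_conv = lfp.size + wt.size - 1                # zero-padded convolution length
    lfp_fft = np.fft.fft(lfp, n_conv)              # FFT of the (padded) signal
    half = wt.size // 2
    power = np.empty((len(freqs), lfp.size))
    for i, f in enumerate(freqs):
        sigma = n_cycles / (2 * np.pi * f)         # Gaussian width in seconds
        wavelet = np.exp(2j * np.pi * f * wt) * np.exp(-wt**2 / (2 * sigma**2))
        wavelet /= np.abs(wavelet).sum()           # unit-gain normalization
        conv = np.fft.ifft(lfp_fft * np.fft.fft(wavelet, n_conv))
        power[i] = np.abs(conv[half:half + lfp.size]) ** 2   # trim padding
    return power

def baseline_zscore(power, t_ms, baseline=(-300, 0)):
    """Z-score each frequency row against its own pre-stimulus baseline."""
    t_ms = np.asarray(t_ms)
    mask = (t_ms >= baseline[0]) & (t_ms < baseline[1])
    mu = power[:, mask].mean(axis=1, keepdims=True)
    sd = power[:, mask].std(axis=1, keepdims=True)
    return (power - mu) / sd
```

Band power can then be obtained by averaging the normalized rows whose frequencies fall inside each entry of `BANDS_HZ`.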

Population analysis

To explore population dynamic changes following reversal, PCA analysis, an unsupervised dimensionality reduction method, was used. The high-dimensional data was a matrix D with rows equal to the number of neurons and columns equal to two values (preferred and non-preferred) by four memory time points for each neuron, resulting in a total of eight columns. The average firing rate in 100–600 ms across all trials was used to fill the eight columns for each neuron. Subsequently, MATLAB pca function was utilized with singular value decomposition (SVD) algorithm to extract components with maximum explained variance. Eventually, the transformation vector obtained was used to project the average neural response in early and late components to PCA states separately, yielding early and late neural states. The three leading components were plotted (Fig. S11).
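A compact sketch of this projection step is given below. It assumes that the eight conditions are treated as observations and the neurons as variables, so that the principal axes live in neuron space, and that the axes fitted on the 100–600 ms responses are reused to project the early and late responses. The function names are hypothetical, and NumPy's SVD stands in for the MATLAB `pca` call.

```python
import numpy as np

def fit_pca_axes(D, n_components=3):
    """Fit principal axes in neuron space.

    D : (n_neurons, 8) matrix of mean rates, columns = value x time point.
    Returns the per-neuron mean across conditions and an
    (n_neurons, n_components) matrix of orthonormal axes.
    """
    X = np.asarray(D, dtype=float).T               # conditions x neurons
    mu = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, vt[:n_components].T

def project(responses, mu, axes):
    """Project an (n_neurons, n_conditions) response matrix onto fitted axes."""
    return (np.asarray(responses, dtype=float).T - mu) @ axes

# Example with random placeholder data (not real recordings).
rng = np.random.default_rng(0)
D_full = rng.normal(size=(20, 8))                  # 100-600 ms responses
D_early = rng.normal(size=(20, 8))                 # 100-300 ms responses
mu, axes = fit_pca_axes(D_full)
early_states = project(D_early, mu, axes)          # shape (8, 3)
```

Projecting early and late responses with the same axes keeps the two sets of neural states in a common coordinate system.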

The dimensions found by conventional PCA do not necessarily encode the value dimension but find directions with the most explained variance. To find the dimension that was most informative about value coding in the region, a partial-PCA (pPCA) method was developed and applied to the data. In this algorithm, we first identified the two projection axes that maximized the value signal in the neural data (since the initial value vector did not capture the late value signal reversal, we repeated this process two times).

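$$\text{find } P_{1\times n} \text{ such that } \frac{P^{T} D}{\lVert P \rVert} = B_{1\times 8} = [\beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \beta_6, \beta_7, \beta_8]$$

$$P = \underset{P}{\arg\max}\; (\beta_1 - \beta_2)^2 + (\beta_3 - \beta_4)^2 + (\beta_5 - \beta_6)^2 + (\beta_7 - \beta_8)^2$$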

Subsequently, conventional PCA was applied to the residuals to find the remaining directions that explained the most variability in the data. The residuals $\tilde{D}$ were calculated according to the following formula:

$$\tilde{D} = D - \frac{P P^{T} D}{\lVert P \rVert^{2}}$$

To better visualize the data, since the two value components were orthogonal to each other, we took the sum of the projected values and plotted them against the two leading residual PCs (Fig. 4). The pPCA results were similar to those obtained with the previously introduced demixed-PCA method by Kobak, Brendel26, which addressed a similar limitation in standard PCA (Fig. S10).
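One possible way to implement the pPCA steps is sketched below. It rests on two assumptions that are not stated in the text: the eight columns of D are ordered as (preferred, non-preferred) pairs for the four time points, and the maximization is solved in closed form, since the summed squared preferred minus non-preferred differences form a Rayleigh quotient whose maximizer is the leading left singular vector of D·C, where C is the pairwise contrast matrix. All function names are hypothetical.

```python
import numpy as np

def value_contrast(n_timepoints=4):
    """(2T x T) matrix whose k-th column is +1 on pref_k and -1 on nonpref_k."""
    C = np.zeros((2 * n_timepoints, n_timepoints))
    for k in range(n_timepoints):
        C[2 * k, k], C[2 * k + 1, k] = 1.0, -1.0
    return C

def value_axis(D, C):
    """Unit vector P maximizing sum_k (beta_{2k-1} - beta_{2k})^2."""
    u, _, _ = np.linalg.svd(D @ C, full_matrices=False)
    return u[:, 0]

def remove_axis(D, p):
    """Residuals after partialling out direction p: D - p p^T D / ||p||^2."""
    return D - np.outer(p, p) @ D / (p @ p)

def partial_pca(D, n_value_axes=2, n_pcs=2):
    """Return value axes, residual PCs (both in neuron space), and residuals."""
    D = np.asarray(D, dtype=float)
    C = value_contrast(D.shape[1] // 2)
    residual, axes = D.copy(), []
    for _ in range(n_value_axes):                  # repeat the search twice
        p = value_axis(residual, C)
        axes.append(p)
        residual = remove_axis(residual, p)
    X = residual.T - residual.T.mean(axis=0)       # conditions x neurons
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return np.array(axes).T, vt[:n_pcs].T, residual
```

Because the residuals have the value axes partialled out, the returned PCs are orthogonal to both value axes, so the value projection and the residual PCs can be plotted against each other without sharing variance.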

Experimental tasks

Value training: force-choice (FC) task

A biased reward association task was used to train object values in monkeys. Each training session was performed with one set of eight fractals (4 good/4 bad fractals). In each trial, initially, a white dot appeared on the center of the screen (2°) for monkeys to fixate. After maintaining the fixation for 200 ms, either a high or low-valued fractal object was displayed in one of eight peripheral locations at 9.5° eccentricity. Following a 400 ms overlap, the central dot disappeared, which was a cue for the monkey to make a saccade to the fractal. Monkeys were required to hold the gaze for 500 ± 100 ms to receive a small (low value) or large (high value) reward. A variable inter-trial interval of 1–1.5 s following reward delivery was initiated with a blank screen. A correct tone was played after a correct trial; however, a premature saccade or breaking fixation resulted in playing an error tone and restarting the trial. Each session consisted of 80 trials, including 64 force trials with the aforementioned design and 16 choice trials pseudo-randomly interleaved between force trials in such order that every five trials had one choice trial randomly presented in one of the five trials. Choice trials had the same structure as force trials except that one high-valued and one low-valued object appeared simultaneously on opposite sides of the screen, and the monkey had to make a choice. Monkeys’ choice rates in choice trials were used to measure objects’ value learning. Diluted apple juice (60% for monkey H and 50% for monkey P) was used as a reward. The high-to-low juice reward ratio was 3:1 in both monkeys.
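The interleaving of force and choice trials can be illustrated with a short sketch. It assumes that the session is built from 16 consecutive blocks of five trials, each containing exactly one choice trial at a random position, which yields the stated 64 force and 16 choice trials. The function name and the use of Python's `random` module are illustrative; the original task was controlled by custom C software.

```python
import random

def fc_trial_sequence(n_blocks=16, block_len=5, seed=None):
    """Return a list of 'force'/'choice' labels for one training session."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_blocks):
        choice_pos = rng.randrange(block_len)      # random slot within the block
        sequence.extend(
            "choice" if i == choice_pos else "force" for i in range(block_len)
        )
    return sequence

session = fc_trial_sequence(seed=1)
print(len(session), session.count("force"), session.count("choice"))
```

This design bounds the gap between successive choice trials, so value learning is sampled at a roughly even rate throughout the session.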

Receptive Field (RF) mapping

In this task, monkeys had to keep fixating on the central white dot while fractal objects were shown in one of the 32 locations spanning eight radial directions 45° apart and eccentricities from 0 to 11.2° in four steps pseudo-randomly in a manner that all eight fractals of a given set were presented once in each of the locations. Each session consisted of 64 trials. In each trial, four fractals were sequentially presented with 400 ms on and 400 ms off period. Animals were rewarded for fixating after every four object presentations with medium reward following an inter-trial interval of 1–1.5 s with a black screen. Subsequently, the online (PSTH) and raster plots of the selected unit time-locked to fractal presentation onset were created using a custom code written in MATLAB 2018b (MathWorks Inc) and average firing rate across trials in 100–400 ms window was computed for each location and the maximum response location was used as the receptive field of that neuron (RF-In area) for subsequent passive-viewing tasks in the recording session.

Neural value memory: passive viewing task

The passive-viewing (PV) task design followed the same structure as the RF mapping task, except that the objects were shown close to each neuron’s location of maximal visual response (RF-In). In the case of laminar recording involving more than one channel, the passive-viewing task was performed in all 32 mentioned locations. Subsequent analysis for receptive field mapping was carried out offline. Analogous to the online mapping, the average response across trials in the mentioned window for each location was computed, and the locations for which the average response exceeded 70% of the maximum response were considered RF-In locations and were included in further analyses.
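The two receptive-field criteria (maximum-response location for online mapping and the 70% threshold for offline mapping) can be expressed compactly. The sketch assumes a vector holding the trial-averaged response at each of the mapped locations; the function names are hypothetical.

```python
import numpy as np

def rf_center(mean_response):
    """Index of the location with the maximum average response (online rule)."""
    return int(np.argmax(mean_response))

def rf_in_locations(mean_response, fraction=0.7):
    """Indices of locations whose response exceeds a fraction of the maximum."""
    r = np.asarray(mean_response, dtype=float)
    return np.flatnonzero(r > fraction * r.max())

# Placeholder responses (spikes/s) at five locations, not real data.
responses = [1.0, 8.0, 10.0, 7.5, 6.0]
print(rf_center(responses), rf_in_locations(responses))
```

For the placeholder values, the maximum is at index 2 and the locations above 70% of that maximum are indices 1, 2, and 3.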

Behavioral value memory: free viewing task

Each FV session consisted of 20 trials with one set of fractals. In any given trial, four fractals (two good and two bad) were randomly chosen from the set and shown at the four corners of an imaginary diamond or square around the center (9.5° away from the display center). Fractals were displayed for 3 s, during which time the monkey was free to look at any objects or ignore them. There was no behavioral outcome in this period. Then, the fractals disappeared, and after a delay of 0.5 to 0.7 s, a white dot appeared at one of nine locations on the screen, chosen at random (center and eight radial locations). The monkey had to make a saccade to the dot to get a medium reward; however, this reward was not contingent on the free viewing period. After reward delivery, a black screen with an ITI of 1–1.5 s would ensue.

Subsequently, in offline analysis, gaze locations were analyzed using custom-written MATLAB functions to extract saccades and stationary periods. In the FV task, trials in which monkeys viewed the screen for less than 1 s were removed as a measure of lack of concentration. Then, the time of viewing the fractals as a behavioral measure of gaze bias was computed. Additionally, we calculated the probability of the first saccade to the initial good objects. The behavioral data for this task were from 44 sets of monkey P and 45 sets of monkey H.

Considering the structure of each task, it is worth noting the step-by-step training that the monkeys underwent from scratch to ensure proficiency in the tasks and prepare for the actual experiment. Initially, monkeys were trained on simple fixation tasks with increasing durations. Afterward, they learned to maintain central fixation while objects flashed in the periphery. At this stage, the monkeys were able to do receptive field (RF) mapping and Passive Viewing (PV) tasks. Once monkeys broke fixation on <5% of trials during PV, they were introduced to the FC task in which the central fixation was turned off after 400 ms of overlap, instructing the monkeys to make a saccade and hold the target. Monkeys were considered proficient in FC if their error rate was <5% (errors such as fixation break, lack of saccade, or failure to hold on target). In the final phase, the FV task was introduced, in which monkeys viewed four fractals on the screen, which were then replaced with a single fixation dot in the periphery. Monkeys learned they had to fixate and hold this dot to receive a medium reward. Monkeys were considered proficient in FV after at least ten days of exposure to this task and a <5% error rate (errors such as lack of saccade or failure to hold on the fixation dot). The initial training for each monkey before the actual paradigm and recording started was ~3 months.

Experimental design

For a given set of eight fractals, monkeys were initially trained to associate four fractals with high reward (good objects) and four fractals with low reward (bad objects) for more than five days to establish a well-learned memory of the objects’ values. Monkeys’ choice rate was used as a measure of value learning, and choice performance had to be >90% on two consecutive training days for a set to pass on to the recording session. In the recording session, after isolating a given unit, an RF mapping task was done to find the neuron’s receptive field. After that, passive-viewing and free-viewing tasks were performed to examine the established neural and behavioral value memory, respectively, before value reversal (1st time point). Subsequently, the monkey performed a value reversal task. The value reversal task had the same structure and number of trials as a regular FC block, except that the reward associations of good and bad objects were reversed (making a saccade to a good object resulted in low reward and vice versa). Each set was used only once in the reversal task. Monkeys learned to adapt quickly to this abrupt switch of values and reversed their choices toward the initially bad objects to get the maximum reward. Value learning following reversal was measured based on the monkey’s binary choice accuracy (i.e., choosing a high-reward object over a low-reward object). In some sessions, monkey P’s choice rate after reversal remained below 30%. These sessions were excluded from further analysis (3 out of 44 sessions in monkey P). All sessions for monkey H were included in the analysis, as the correct choice rate was above 30%. Reversal was followed by passive-viewing and free-viewing tasks immediately after (2nd time point), 15–20 min after (3rd time point), and a day after the reversal (4th time point) to probe changes in neural value memory and behavioral gaze bias, respectively.
Monkeys engaged in a search task with a different set of fractals in the intervening time between the 2nd and 3rd time points to keep them engaged with different objects57. Because we were conducting acute neural recordings and could not test the same neuron’s response a day after reversal (4th time point), we instead presented the neuron with a different set of fractals whose values had been reversed the previous day. We previously used a similar method to examine the longevity of value memory across many days in single neurons17. This method is especially suited to vlPFC because its neurons have been shown to lack significant object selectivity that could interfere with their value coding17.

Modeling

In this study, two models were used to outline the progression of value memory throughout the course of the paradigm: a single-rate model and a two-rate model. In both models, value learning in a particular trial is a function of the current value state and RPE experienced in that trial. The learning rule for these models is as follows:

Single-rate Model

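$$V(n+1) = \alpha\, V(n) + \beta\, e(n) \tag{1}$$

Double-rate Model

$$
\begin{aligned}
v_f(n+1) &= \alpha_f\, v_f(n) + \beta_f\, e(n) \\
v_s(n+1) &= \alpha_s\, v_s(n) + \beta_s\, e(n) \\
V &= v_f + v_s \\
\alpha_f &< \alpha_s, \quad \beta_f > \beta_s
\end{aligned} \tag{2}
$$

$V(n)$: net value output of the model on trial n

$v_f$, $v_s$: fast and slow states that contribute to the net value output

$e(n)$: error on trial n, $e(n) = r(n) - V(n)$

$r(n)$: reward received in trial n

$\beta$: learning rate

$\alpha$: retention factor (1 – forgetting rate)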

Here, n is the trial number. The error is the difference between the value predicted by the model and the reward the monkey actually receives. In the single-rate model, a single retention rate and a single learning rate were used, whereas the two-rate model consisted of two interacting processes, each with its own retention and learning rates (Eqs. 1 and 2). In the two-rate model, we assumed that one process has a higher learning rate and a lower retention rate (i.e., a higher forgetting rate), which we refer to as the fast process, and the other has a lower learning rate and a higher retention rate (i.e., a lower forgetting rate), which we refer to as the slow process. For simplicity, the same retention and learning rates were used for all objects for each animal.
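
The update rules in Eqs. 1 and 2 can be sketched as a short Python simulation. This is a minimal illustration, not the authors' MATLAB code: the function name `simulate`, the reward coding (+1 before and -1 after reversal), the trial counts, and all parameter values are assumptions chosen only to make the qualitative behavior visible.

```python
def simulate(alphas, betas, rewards):
    """Simulate a value model with one state per (alpha, beta) pair.

    alphas, betas: retention and learning rates, one pair per state.
    rewards: reward per trial; None marks an error-free trial (no RPE).
    Returns the net value V before each trial, plus the final value.
    """
    states = [0.0] * len(alphas)
    trace = []
    for r in rewards:
        v_net = sum(states)                    # V = v_f + v_s (or V alone)
        trace.append(v_net)
        err = 0.0 if r is None else r - v_net  # e(n) = r(n) - V(n)
        states = [a * s + b * err for a, b, s in zip(alphas, betas, states)]
    trace.append(sum(states))
    return trace


# Hypothetical schedule: long training, brief reversal, error-free period.
N_TRAIN, N_REV, N_FREE = 200, 20, 30
rewards = [1.0] * N_TRAIN + [-1.0] * N_REV + [None] * N_FREE

# Illustrative (not fitted) parameters.
single_rate = simulate([0.98], [0.2], rewards)
two_rate = simulate([0.6, 0.995], [0.4, 0.02], rewards)

end_rev = N_TRAIN + N_REV  # index of the net value right after reversal
for name, trace in [("single-rate", single_rate), ("two-rate", two_rate)]:
    print(name, round(trace[end_rev], 3), round(trace[end_rev + 10], 3))
```

With these illustrative values, the two-rate net value is negative right after reversal and returns to positive during the error-free period, because the fast state decays while the slow state still holds the original value. The single-rate value only decays toward zero without changing sign.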

In the recording session, the PV and FV tasks and the passage of time were modeled with the forgetting term only, as there was no RPE to drive learning (error-free period). The behavioral data used to fit the model in this period were the FV gaze bias between good and bad objects at the four time points where FV was done (before, immediately after, 20 min after, and a day after reversal). An exponential function with a scaling factor as an additional parameter was used to map values to gaze bias. The choice of an exponential function was based on previous graded-reward studies25,60.

$$\text{Time of viewing fractals} = e^{aV} \tag{3}$$

$a$: scaling factor

$V$: value predicted by the model

On the other hand, the reversal task was modeled using the full equation, which included both forgetting and learning terms. The behavioral data used to fit the model in this task were the choice trial performance. To compare the values predicted by the model with the observed choice rate, a sigmoid function was used to map value to choice rate, in accordance with previous studies, with the slope of the sigmoid as an additional parameter.

$$\text{Choice rate} = \frac{1}{1 + e^{-bV}} \tag{4}$$

$b$: sigmoid slope

$V$: value predicted by the model
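
The two mappings from model value to behavior (Eqs. 3 and 4) can be written as small Python helpers. This sketch assumes the conventional logistic form with a negative exponent for the sigmoid; the function names and example values are illustrative.

```python
import math


def viewing_time(value, a):
    """Eq. 3: exponential mapping from value to viewing time; a is the scaling factor."""
    return math.exp(a * value)


def choice_rate(value, b):
    """Eq. 4: sigmoid mapping from value to choice rate; b is the slope.

    The negative exponent is an assumption (standard logistic function).
    """
    return 1.0 / (1.0 + math.exp(-b * value))


# A value of zero gives chance-level choice and a baseline viewing time of 1.
print(choice_rate(0.0, b=2.0), viewing_time(0.0, a=1.5))
```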

Different combinations of parameters were introduced based on these two models. In one combination, we assumed different forgetting rates for each task used in the paradigm, namely, two-alternative FC reverse, passive viewing, free viewing, and irrelevant search tasks, along with an additional parameter for forgetting the next day. In another combination, we assumed one forgetting rate for active (FC reverse), one for passive tasks (passive-viewing, free-viewing, and irrelevant search tasks), and another for the next day.

Additionally, to account for the possibility that the good and bad objects could become linked in learning, such that changes in the value of a single good object may provoke changes in the value estimates of other unobserved good and/or bad objects, the value update rule took into account the covariance matrix (Σ, sigma) between object values, a normative approach suggested previously61.

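$$
\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix}, \qquad \hat{\Sigma} = \Sigma + W \tag{5}
$$

$$
e(n) = \frac{\hat{\Sigma}\, C}{C^{T}\, \Sigma\, C + \tau^{2}} \left( r(n) - C\, V(n) \right)
$$

$W$: process noise (here for simplicity considered 0)

$\tau$: measurement noise (here for simplicity considered 0)

$C$: the presented stimuli (either high-valued or low-valued)

$\Sigma$ (sigma): covariance of the two stimulus (good and bad) distributions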
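
A small Python sketch of the covariance-weighted error in Eq. 5 may help show how an update for the presented object spreads to the unseen one. It assumes a two-element value vector (good, bad), an indicator vector C for the presented stimulus, and W given as a matrix defaulting to zero. The function name `covariance_update` and the numbers are hypothetical.

```python
def covariance_update(sigma, c, values, reward, w=None, tau=0.0):
    """Return the per-object error term of Eq. 5.

    sigma: covariance matrix (list of lists); c: indicator of the presented
    stimulus; values: current value estimates; w: process noise matrix
    (assumed zero if omitted); tau: measurement noise.
    """
    n = len(c)
    if w is None:
        w = [[0.0] * n for _ in range(n)]
    sigma_hat = [[sigma[i][j] + w[i][j] for j in range(n)] for i in range(n)]
    # Numerator: Sigma_hat * C (a vector, one entry per object).
    numer = [sum(sigma_hat[i][j] * c[j] for j in range(n)) for i in range(n)]
    # Denominator: C^T * Sigma * C + tau^2 (a scalar).
    sigma_c = [sum(sigma[i][j] * c[j] for j in range(n)) for i in range(n)]
    denom = sum(c[i] * sigma_c[i] for i in range(n)) + tau ** 2
    # Prediction error for the presented stimulus: r(n) - C . V(n).
    rpe = reward - sum(c[i] * values[i] for i in range(n))
    return [numer[i] / denom * rpe for i in range(n)]


sigma = [[1.0, 0.5], [0.5, 1.0]]
errors = covariance_update(sigma, c=[1.0, 0.0], values=[0.8, 0.2], reward=0.0)
print(errors)
```

In this example only the first object is presented, yet the second object also receives half of the error because the off-diagonal covariance is 0.5. With a covariance of one, both objects would receive the same error.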

The model prediction evolved throughout the session as the monkey engaged in the experiment, but the ground truth for fitting the model consisted of the choice rate in 16 choice trials in the FC reversal task and 80 trials in four FV tasks (one before and three after the reversal). FV model fitting was performed using the difference in viewing good vs. bad objects.

The sum of squared errors (SSE) was used as the cost function to fit the model to the behavioral data. Given that the experiment comprised two distinct tasks, Forced Choice (FC) and FV, a multi-objective genetic algorithm (GA) was run through MATLAB’s “gamultiobj” function (Non-dominated Sorting Genetic Algorithm II, NSGA-II) with the hybrid “fgoalattain” method to simultaneously fit both the choice rate (FC) and the viewing time of fractals (FV). To make the SSEs of the two tasks comparable during model fitting, we converted the viewing bias in the FV task (measured in milliseconds) to a percentage of time spent viewing the fractals. As mentioned previously, the fractals were presented for 3 s in each trial; thus, we divided the predicted viewing time by this total presentation duration to obtain the viewing percentage. As a result, both the choice rate and the FV metric were scaled between 0 and 1 for each trial. After each model fit, the GA generated a set of Pareto-optimal solutions based on the data. To determine the best parameter set from these Pareto solutions, we selected the one that minimized the combined SSE for the FC (16 points) and FV (80 points) tasks.
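
The final selection step can be sketched in Python as follows. This is not the MATLAB `gamultiobj` pipeline itself; it only illustrates scaling viewing time by the 3 s presentation and picking one solution from an already computed Pareto set. Treating the "combined SSE" as a plain sum of the two SSEs is an assumption, and the helper names and example values are hypothetical.

```python
def viewing_fraction(viewing_ms, presentation_ms=3000.0):
    """Scale a viewing time in ms to a fraction of the 3 s presentation."""
    return viewing_ms / presentation_ms


def sse(predicted, observed):
    """Sum of squared errors between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def pick_from_pareto(solutions):
    """Pick the Pareto solution with the smallest summed SSE (assumed rule)."""
    return min(solutions, key=lambda s: s["sse_fc"] + s["sse_fv"])


# Hypothetical Pareto front: no solution is better on both objectives.
pareto = [
    {"params": "A", "sse_fc": 0.10, "sse_fv": 0.40},
    {"params": "B", "sse_fc": 0.20, "sse_fv": 0.15},
    {"params": "C", "sse_fc": 0.45, "sse_fv": 0.05},
]
best = pick_from_pareto(pareto)
print(best["params"], viewing_fraction(1500.0))
```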

Taken together, ten models were fit to behavioral data: five single-rate and five two-rate models.

The first single-rate model had five parameters: one for the learning rate in FC reverse, one for the slope of the sigmoid function, one forgetting rate shared by the active task (FC reverse, 80 trials) and the passive tasks (passive-viewing (64 trials), free-viewing (20 trials), and irrelevant search (240 trials) during the 20 min), one for the scaling factor of the exponential function, and finally one for the forgetting factor of the next day. The covariance of the two (good and bad) distributions (Σ, sigma) was set to one. In other words, the updated value of the seen fractal was applied to the unseen fractal. The six-parameter model had two parameters for learning and forgetting in FC reverse, one for the slope of the sigmoid function, one for forgetting in passive tasks (passive-viewing, free-viewing, and irrelevant search), one for the scaling factor, and finally one for the forgetting factor of the next day. Σ was again set to one. The third single-rate model had Σ, the covariance of the two distributions, as an extra parameter, for a total of seven parameters. The fourth single-rate model had different parameters for the forgetting rates in passive-viewing, free-viewing, and irrelevant search, for a total of eight parameters. The final single-rate model had the same parameters with Σ as an additional parameter, for a total of nine parameters.

The first two-rate model had eight parameters: two for the learning rates of the slow and fast processes in FC reverse, one for the slope of the sigmoid function, two forgetting rates (slow and fast) shared by active (FC reverse) and passive trials (passive-viewing, free-viewing, and irrelevant search), one for the scaling factor, and finally two for slow and fast forgetting of the next day. The ten-parameter two-rate model had four parameters for the learning and forgetting rates of the slow and fast processes in FC reverse, one for the slope of the sigmoid function, two forgetting rates (slow and fast) for all passive trials (passive-viewing, free-viewing, and irrelevant search), one for the scaling factor, and finally two for slow and fast forgetting of the next day. The third two-rate model had the same parameters with Σ as an extra parameter. The fourth two-rate model had different parameters for the forgetting rates of the slow and fast processes in passive-viewing, free-viewing, and irrelevant search, for a total of 14 parameters. Finally, the fifth model had the same parameters with Σ as an additional parameter, for a total of 15 parameters. The estimated parameters for each model are summarized in Supplementary Fig. S1. In the table of estimated parameters, we have shown the combined SSE values.

Model fitting for each monkey was done once across all pooled sessions, assuming consistent learning and forgetting rates. This approach, though simplified, is sufficient to differentiate between single-rate and two-rate models. With only 16 choice trials and 80 free-viewing trials per run, session-based fitting (involving up to 15 parameters) is unlikely to be reliable due to limited data. However, session-by-session fits for both models are also provided for comparison (Supplementary Tables S1–S4). These fits show that while mean values align with the cross-session results, they have high variance, likely due to measurement noise and session fluctuations in learning and forgetting rates as expected.

Model selection was performed based on information criteria methods exploiting both Akaike and Bayesian information criteria (AIC, BIC) obtained from the following formula using SSE:

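$$
\begin{aligned}
\mathrm{AIC} &= n \log(\mathrm{SSE}) - n \log(n) + 2\,(k+1) \\
\mathrm{BIC} &= n \log(\mathrm{SSE}) - n \log(n) + \log(n)\,(k+1)
\end{aligned} \tag{6}
$$

$n$: the number of observations (4 time points for FV)

$k$: number of parameters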
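
As a minimal Python sketch of the information criteria in Eq. 6, assuming the usual least-squares form in which the fit term equals n·log(SSE/n) with the natural logarithm and the penalty counts k + 1 parameters. The function name `aic_bic` and the example numbers are illustrative, not taken from the paper's code.

```python
import math


def aic_bic(sse, n, k):
    """Return (AIC, BIC) computed from a sum of squared errors.

    sse: sum of squared errors; n: number of observations;
    k: number of model parameters.
    """
    fit_term = n * math.log(sse) - n * math.log(n)  # equals n * log(sse / n)
    aic = fit_term + 2 * (k + 1)
    bic = fit_term + math.log(n) * (k + 1)
    return aic, bic


# Hypothetical example: 4 observations, SSE of 0.04, 5 parameters.
aic, bic = aic_bic(sse=0.04, n=4, k=5)
print(round(aic, 3), round(bic, 3))
```

Lower values indicate the preferred model under either criterion.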

Model selection was based on AIC and BIC for the free-viewing (FV) task, as our primary goal was to determine which model best captures the behavior during spontaneous recovery, where the single- and two-rate models are expected to make different predictions, whereas both models produce similar predictions during the reversal learning phase.

Model stability analysis

To evaluate the reliability of our model selection and parameter estimation, we conducted both model recovery and parameter recovery analyses. First, to define a realistic parameter space, we employed a bootstrap resampling method (with replacement) to sample behavioral data sessions, generating 50 surrogate datasets for each monkey. Due to computational constraints, it was not feasible to perform model recovery analysis across all possible combinations of our 10 × 10 models. Instead, we selected a representative subset of five models to provide sufficient coverage for assessing the stability of the model recovery process. The selected models included: (1) The 10-parameter two-rate model with shared parameters across passive tasks, (2) its corresponding 6-parameter single-rate model (collapsing fast and slow parameters), (3) the best-performing two-rate model with distinct parameters across tasks (14 parameters), (4) its corresponding single-rate model (8 parameters), and (5) the simplest single-rate model (5 parameters).

The surrogate datasets were fitted to these five models, and the resulting fits were used to calculate the mean, variance, and covariance of the parameters for each model. These statistics defined a multivariate normal distribution of parameters for each model. For the model recovery analysis, we generated 100 parameter sets for each model, sampled from the respective multivariate normal distributions. Using these parameter sets, we simulated value learning in the task according to the corresponding model. Behavioral datasets were generated by adding normal noise (mean = 0, standard deviation = 0.05) to the simulated values, which were then mapped to choice rates in the FC task and viewing durations in the FV task using Eqs. 3 and 4. This process was repeated 40 times for each parameter set, creating a 40-session behavioral dataset where the number of trials per session matched those of the actual experiment. Each simulated dataset was subsequently fitted to all five models, including the single-rate models (5, 6, and 8 parameters) and the two-rate models (10 and 14 parameters). Model selection was based on AIC values for the FV task, consistent with the model-fitting and selection procedure applied to the behavioral data. This analysis resulted in a confusion matrix that quantified the accuracy of model recovery after adjusting for differences in the number of parameters.
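
The tabulation step of the model recovery analysis can be sketched in Python. The sketch assumes each simulated dataset has already been fitted with every candidate model and yields one AIC per model; the winning model is the one with the lowest AIC. The model labels, the AIC values, and the function name `confusion_matrix` are hypothetical.

```python
def confusion_matrix(results, model_names):
    """Tabulate how often each generating model is recovered.

    results: list of (true_model, {model_name: aic}) pairs.
    Returns matrix[true][selected] as a proportion of that row.
    """
    counts = {t: {m: 0 for m in model_names} for t in model_names}
    for true_model, aics in results:
        winner = min(aics, key=aics.get)  # lowest AIC wins
        counts[true_model][winner] += 1
    matrix = {}
    for t in model_names:
        total = sum(counts[t].values())
        matrix[t] = {
            m: (counts[t][m] / total if total else 0.0) for m in model_names
        }
    return matrix


models = ["single-5", "two-10"]
results = [
    ("single-5", {"single-5": -12.0, "two-10": -9.5}),
    ("single-5", {"single-5": -8.0, "two-10": -8.4}),
    ("two-10", {"single-5": -3.0, "two-10": -11.0}),
    ("two-10", {"single-5": -4.1, "two-10": -10.2}),
]
cm = confusion_matrix(results, models)
print(cm)
```

A matrix with values near one on the diagonal indicates that the generating model is reliably identified.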

For the parameter recovery analysis, in the first step, we used 130 sets of parameters from the two-rate model (10-parameter model, 65 for each monkey) sampled from the aforementioned multivariate normal distribution to further explore the parameter space and fit the model using the two-rate framework. We then correlated the actual parameters with the fitted parameters (Spearman correlation). Furthermore, average error across all parameter sets was calculated (percentage Root Mean Squared Error [RMSE]) by dividing RMSE of each parameter by its range, then averaged across all parameters. Additionally, for each parameter, we calculated the difference between the ground truth values and the fitted parameters. This approach ensured that the parameter recovery analysis focused on biologically plausible parameter spaces, which were naturally constrained to narrower ranges.
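
The two recovery metrics described here, Spearman correlation and range-normalized percentage RMSE, can be sketched in plain Python. The sketch assumes the range of each parameter is taken from its ground-truth values and that ties receive average ranks; the function names and the toy data are illustrative.

```python
import math


def average_ranks(xs):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


def percent_rmse(true_sets, fitted_sets):
    """RMSE per parameter divided by its range, averaged, in percent."""
    n_params = len(true_sets[0])
    ratios = []
    for j in range(n_params):
        t = [s[j] for s in true_sets]
        f = [s[j] for s in fitted_sets]
        rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(t, f)) / len(t))
        ratios.append(rmse / (max(t) - min(t)))
    return 100.0 * sum(ratios) / n_params


# Toy data: two parameter sets with two parameters each.
true_sets = [[0.0, 10.0], [1.0, 20.0]]
fitted_sets = [[0.1, 11.0], [1.1, 21.0]]
err = percent_rmse(true_sets, fitted_sets)
rho = spearman([1, 2, 3, 4], [1, 4, 9, 16])
print(round(err, 6), rho)
```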

To evaluate parameter recovery across a broader range of parameters, we repeated the analysis by sampling parameters from a uniform distribution spanning a wider range. To create meaningful two-rate behavior and capture spontaneous recovery, the following constraints were applied to the parameter space:

  • The learning rate of the fast system was required to be at least 10% higher than that of the slow system.

  • The retention rate of the slow system was required to be at least 10% higher than that of the fast system.

  • The number of trials during reversal learning was varied between 5 and 80 trials, and the error-free period was set to a minimum of 5 trials. This was done to ensure that at least 15% of spontaneous recovery (defined as 15% of the initial value prior to reversal learning) was preserved.

These constraints ensured that the parameter sets could generate the desired behavior, including spontaneous recovery, while allowing the exploration of a broader range of values.
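
One way to draw parameter sets under the first two constraints is rejection sampling, sketched below in Python. The sketch reads "at least 10% higher" as a relative margin (a factor of 1.10), which is an assumption, as are the uniform (0, 1) ranges, the seed, and the function name. The check that at least 15% of spontaneous recovery is preserved is not implemented here.

```python
import random


def sample_two_rate_params(n_sets, seed=0, margin=1.10):
    """Draw two-rate parameter sets that satisfy the rate constraints.

    Rejection sampling from uniform(0, 1); a draw is kept only if the fast
    learning rate and the slow retention rate exceed their counterparts by
    the relative margin (assumed interpretation of "10% higher").
    """
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_sets:
        beta_f, beta_s = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
        alpha_f, alpha_s = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
        if beta_f >= margin * beta_s and alpha_s >= margin * alpha_f:
            accepted.append({
                "beta_f": beta_f, "beta_s": beta_s,
                "alpha_f": alpha_f, "alpha_s": alpha_s,
                # Number of reversal trials varied between 5 and 80.
                "n_reversal_trials": rng.randint(5, 80),
            })
    return accepted


param_sets = sample_two_rate_params(250)
print(len(param_sets), param_sets[0])
```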

We generated 250 datasets and fitted the 10-parameter two-rate model to each. The Spearman correlation, error percentage, and differences between ground truth and predicted parameters were subsequently calculated to assess the accuracy and robustness of the parameter recovery process.

Statistical analysis

A paired t-test was used to compare viewing bias in the FV task, preferred and non-preferred firing rates in neural response, as well as response to good vs. bad fractals in LFP power for the average early and late epochs separately. Additionally, we reported Cohen’s d as a measure of effect size for behavioral viewing bias. A one-sample t-test was used to compare the average marginal AUC at each time point with the baseline. All resulting statistics were corrected for multiple comparisons using the false discovery rate (FDR q < 0.05) method62.
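
For readers unfamiliar with FDR correction, the Benjamini-Hochberg step-up procedure can be sketched in Python as follows. This is a generic illustration with made-up p-values, assuming the standard Benjamini-Hochberg variant; it is not the MATLAB function cited above.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the null is rejected at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis whose rank is at most k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


p_vals = [0.041, 0.001, 0.039, 0.008, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_vals, q=0.05)
print(significant)
```

In this example only the two smallest p-values (0.001 and 0.008) survive correction, even though five values are below 0.05 uncorrected.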

A one-way repeated measures ANOVA test was utilized to examine the effect of memory time points on behavioral viewing bias, firing rate, LFP signal, and AUC analysis. Subsequent post hoc analysis used the Tukey-HSD test for multiple comparison corrections.

The Chi-squared test was used to compare the percentages of neurons in each quadrant in the squares of the AUC analysis. For correlation analysis, Spearman correlation (ρ) was used to better handle outliers. For comparison of correlation coefficients, the MATLAB function “corr_rtest”, which implements such comparisons using Fisher transformation, was used63.
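
Comparing two correlation coefficients with the Fisher transformation can be sketched in Python as below. The sketch assumes two independent samples and a two-sided test; it shows the standard approach and is not the code of the cited MATLAB function. The function name and the example inputs are illustrative.

```python
import math


def compare_correlations(ra, rb, na, nb):
    """Two-sided z-test for the difference between two correlations.

    ra, rb: correlation coefficients from independent samples;
    na, nb: sample sizes (each must exceed 3).
    Returns (z, p).
    """
    za, zb = math.atanh(ra), math.atanh(rb)           # Fisher z-transform
    se = math.sqrt(1.0 / (na - 3) + 1.0 / (nb - 3))   # standard error of za - zb
    z = (za - zb) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))            # equals 2 * (1 - Phi(|z|))
    return z, p


z_stat, p_val = compare_correlations(0.8, 0.2, 50, 50)
print(round(z_stat, 3), p_val)
```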

For population-level analysis with pPCA, we first conducted a resampling procedure. In each iteration, we randomly selected a dataset from the pool of neurons with replacement, ensuring that the dataset’s size matched the total number of neurons. pPCA was then applied to the selected dataset, and this procedure was repeated 100 times. Subsequently, a MANOVA test was performed, using residual PC1 and PC2 data from the 100 datasets to investigate the effect of memory time points on neural states. Finally, Hotelling’s T² test was used to compare the neural state between two memory time points64.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Acknowledgements

We thank M. Khorasani, A. Shabani, and G. Sadeghian for their technical assistance in monkey care and Dr. Narmashiri for his helpful cooperation. This work was supported by internal funds from the School of Cognitive Sciences, IPM made to the corresponding author.

Author contributions

A.G. conceived and designed the study. S.R.H. and M.A. implemented the design. S.R.H. collected the data with M.A.'s help. S.R.H. analyzed the data under A.G.'s supervision. A.G. and S.R.H. wrote the initial draft. All authors contributed to the final draft of the paper.

Peer review

Peer review information

Nature Communications thanks Bolton Chau (eRef), who co-reviewed with Jocelyn To (E.C.R.) and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Data availability

The datasets generated and analyzed during the current study have been deposited in Zenodo under the following accession: Hashemirad, S., Abbaszadeh, M., & Ghazizadeh, A. (2025). Prefrontal cortex temporally multiplexes slow and fast dynamics in value learning and memory (v1.0.0). Zenodo. 10.5281/zenodo.16752287 (ref. 65). This archive includes trial-level behavioral datasets and spike-sorted single-unit neural data, with both average and trial-wise peristimulus time histograms (PSTHs). Source data underlying each main and Supplementary Figure are provided as individual Excel files alongside the paper. Additional related datasets are available from the corresponding author upon request. Source data are provided with this paper.

Code availability

The MATLAB code (2021a and 2022a, Mathworks Inc.) used for behavioral analysis, model fitting, and neural data analysis, which produces the main findings of the paper (Figs. 1 and 2), has been deposited in Zenodo under the following accession: Reza Hashemi (2025). Hemi1997/Tworate_RL: Prefrontal Cortex Temporally Multiplexes Slow and Fast Dynamics in Value Learning and Memory (Code) (v1.0.0). Zenodo. 10.5281/zenodo.16751843 (ref. 66). Additional custom analysis scripts are available from the corresponding author upon request.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-025-66081-4.

References

  • 1.Ghazizadeh, A., Griggs, W. & Hikosaka, O. Object-finding skill created by repeated reward experience. J. Vis.16, 17–17 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Gottlieb, J. Attention, learning, and the value of information. Neuron76, 281–295 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Abraham, W. C. & Robins, A. Memory retention–the synaptic stability versus plasticity dilemma. Trends Neurosci.28, 73–78 (2005). [DOI] [PubMed] [Google Scholar]
  • 4.Liljenström, H. Neural stability and flexibility: a computational approach. Neuropsychopharmacology28, S64–S73 (2003). [DOI] [PubMed] [Google Scholar]
  • 5.Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B. & Dolan, R. J. Cortical substrates for exploratory decisions in humans. Nature441, 876–879 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Hikosaka, O., Ghazizadeh, A., Griggs, W. & Amita, H. Parallel basal ganglia circuits for decision making. J. Neural Transm.125, 515–529 (2018). [DOI] [PubMed] [Google Scholar]
  • 7.Alexander, G. E., DeLong, M. R. & Strick, P. L. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci.9, 357–381 (1986). [DOI] [PubMed] [Google Scholar]
  • 8.Yin, H. H. & Knowlton, B. J. The role of the basal ganglia in habit formation. Nat. Rev. Neurosci.7, 464–476 (2006). [DOI] [PubMed] [Google Scholar]
  • 9.Kim, H. F. & Hikosaka, O. Distinct basal ganglia circuits controlling behaviors guided by flexible and stable values. Neuron79, 1001–1010 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Griggs, W. S. et al. Flexible and stable value coding areas in caudate head and tail receive anatomically distinct cortical and subcortical inputs. Front. Neuroanat.11, 106 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Kim, H. F., Ghazizadeh, A. & Hikosaka, O. Separate groups of dopamine neurons innervate caudate head and tail encoding flexible and stable value memories. Front. Neuroanat.8, 120 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Miller, E. K. The prefontral cortex and cognitive control. Nat. Rev. Neurosci.1, 59–65 (2000). [DOI] [PubMed] [Google Scholar]
  • 13.Pasupathy, A. & Miller, E. K. Different time courses of learning-related activity in the prefrontal cortex and striatum. Nature433, 873–876 (2005). [DOI] [PubMed] [Google Scholar]
  • 14.Wallis, J. D. Neuronal mechanisms in prefrontal cortex underlying adaptive choice behavior. Ann. N.Y. Acad. Sci.1121, 447–460 (2007). [DOI] [PubMed] [Google Scholar]
  • 15.Chau, B. K. et al. Contrasting roles for orbitofrontal cortex and amygdala in credit assignment and learning in macaques. Neuron87, 1106–1118 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Rudebeck, P. H., Saunders, R. C., Lundgren, D. A. & Murray, E. A. Specialized representations of value in the orbital and ventrolateral prefrontal cortex: desirability versus availability of outcomes. Neuron95, 1208–1220.e1205 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Ghazizadeh, A., Hong, S. & Hikosaka, O. Prefrontal cortex represents long-term memory of object values for months. Curr. Biol.28, 2206–2217. e2205 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Borra, E., Gerbella, M., Rozzi, S. & Luppino, G. Projections from caudal ventrolateral prefrontal areas to brainstem preoculomotor structures and to basal ganglia and cerebellar oculomotor loops in the macaque. Cereb. Cortex25, 748–764 (2015). [DOI] [PubMed] [Google Scholar]
  • 19.Middleton, F. A. Basal-ganglia ‘projections’ to the prefrontal cortex of the primate. Cereb. Cortex12, 926–935 (2002). [DOI] [PubMed] [Google Scholar]
  • 20.Kim, H. F., Ghazizadeh, A. & Hikosaka, O. Dopamine neurons encoding long-term memory of object value for habitual behavior. Cell163, 1165–1175 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Yasuda, M., Yamamoto, S. & Hikosaka, O. Robust representation of stable object values in the oculomotor basal ganglia. J. Neurosci.32, 16917–16932 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Smith, M. A., Ghazizadeh, A. & Shadmehr, R. Interacting adaptive processes with different timescales underlie short-term motor learning. PLoS Biol.4, e179 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Yasuda, M. & Hikosaka, O. Functional territories in primate substantia nigra pars reticulata separately signaling stable and flexible values. J. Neurophysiol.113, 1681–1696 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Ghazizadeh, A., Griggs, W., Leopold, D. A. & Hikosaka, O. Temporal–prefrontal cortical network for discrimination of valuable objects in long-term memory. Proc. Natl. Acad. Sci.115, E2135–E2144 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Ghazizadeh, A., Griggs, W. & Hikosaka, O. Ecological origins of object salience: reward, uncertainty, aversiveness, and novelty. Front. Neurosci.10, 378 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Kobak, D. et al. Demixed principal component analysis of neural population data. elife5, e10989 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Buzsaki, G. & Draguhn, A. Neuronal oscillations in cortical networks. Science304, 1926–1929 (2004). [DOI] [PubMed] [Google Scholar]
  • 28.Palva, S. & Palva, J. M. Functional roles of alpha-band phase synchronization in local and large-scale cortical networks. Front. Psychol.2, 204 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Belitski, A. et al. Low-frequency local field potentials and spikes in primary visual cortex convey independent visual information. J. Neurosci.28, 5696–5709 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Khawaja, F. A., Tsui, J. M. & Pack, C. C. Pattern motion selectivity of spiking outputs and local field potentials in macaque visual cortex. J. Neurosci.29, 13702–13709 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Einevoll, G. T., Kayser, C., Logothetis, N. K. & Panzeri, S. Modelling and analysis of local field potentials for studying the function of cortical circuits. Nat. Rev. Neurosci.14, 770–785 (2013). [DOI] [PubMed] [Google Scholar]
  • 32.Hikosaka, O., Yamamoto, S., Yasuda, M. & Kim, H. F. Why skill matters. Trends Cogn. Sci.17, 434–441 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Brandon, T. H., Vidrine, J. I. & Litvin, E. B. Relapse and relapse prevention. Annu. Rev. Clin. Psychol.3, 257–284 (2007). [DOI] [PubMed] [Google Scholar]
  • 34.Hunt, W. A., Barnett, L. W. & Branch, L. G. Relapse rates in addiction programs. J. Clin. Psychol.27, 455–456 (1971). [DOI] [PubMed]
  • 35.Medina, J. F., Garcia, K. S. & Mauk, M. D. A mechanism for savings in the cerebellum. J. Neurosci.21, 4081–4089 (2001). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Coltman, S. K., Cashaback, J. G. A. & Gribble, P. L. Both fast and slow learning processes contribute to savings following sensorimotor adaptation. J. Neurophysiol.121, 1575–1583 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Brashers-Krug, T., Shadmehr, R. & Bizzi, E. Consolidation in human motor memory. Nature382, 252–255 (1996). [DOI] [PubMed] [Google Scholar]
  • 38.Thoroughman, K. A. & Shadmehr, R. Electromyographic correlates of learning an internal model of reaching movements. J. Neurosci.19, 8573–8588 (1999). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Sun, X. et al. Cortical preparatory activity indexes learned motor memories. Nature602, 274–279 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Van Kerkoerle, T. et al. Alpha and gamma oscillations characterize feedback and feedforward processing in monkey visual cortex. Proc. Natl. Acad. Sci.111, 14332–14341 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Quirk, G. J. & Mueller, D. Neural mechanisms of extinction learning and retrieval. Neuropsychopharmacology33, 56–72 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Muller, T. H. et al. Distributional reinforcement learning in prefrontal cortex. Nat. Neurosci.27, 403–408 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Dom, G., Sabbe, B., Hulstijn, W. & van Den Brink, W. Substance use disorders and the orbitofrontal cortex: Systematic review of behavioural decision-making and neuroimaging studies. Br. J. Psychiatry187, 209–220 (2005). [DOI] [PubMed] [Google Scholar]
  • 44.Schoenbaum, G. & Shaham, Y. The role of orbitofrontal cortex in drug addiction: a review of preclinical studies. Biol. Psychiatry63, 256–262 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Davidson, B. et al. Deep brain stimulation of the nucleus accumbens in the treatment of severe alcohol use disorder: a phase I pilot trial. Mol. Psychiatry 27, 3992–4000 (2022).
  • 46. Mani, P., Kelley, V. & Alexander, Y. Nucleus accumbens and its role in reward and emotional circuitry: a potential hot mess in substance use and emotional disorders. AIMS Neurosci. 4, 52–70 (2017).
  • 47. Wallis, J. D. Orbitofrontal cortex and its contribution to decision-making. Annu. Rev. Neurosci. 30, 31–56 (2007).
  • 48. Gerbella, M., Borra, E., Tonelli, S., Rozzi, S. & Luppino, G. Connectional heterogeneity of the ventral part of the macaque area 46. Cereb. Cortex 23, 967–987 (2013).
  • 49. Kennerley, S. W., Walton, M. E., Behrens, T. E., Buckley, M. J. & Rushworth, M. F. Optimal decision making and the anterior cingulate cortex. Nat. Neurosci. 9, 940–947 (2006).
  • 50. Seo, H., Barraclough, D. J. & Lee, D. Lateral intraparietal cortex and reinforcement learning during a mixed-strategy game. J. Neurosci. 29, 7278–7289 (2009).
  • 51. Paton, J. J., Belova, M. A., Morrison, S. E. & Salzman, C. D. The primate amygdala represents the positive and negative value of visual stimuli during learning. Nature 439, 865–870 (2006).
  • 52. Kim, C., Kroger, J. & Kim, J. A functional dissociation of conflict processing within anterior cingulate cortex. Nat. Preced. 1 (2008).
  • 53. Isoda, M. & Hikosaka, O. Switching from automatic to controlled action by monkey medial frontal cortex. Nat. Neurosci. 10, 240–248 (2007).
  • 54. Tabu, H., Mima, T., Aso, T., Takahashi, R. & Fukuyama, H. Functional relevance of pre-supplementary motor areas for the choice to stop during Stop signal task. Neurosci. Res. 70, 277–284 (2011).
  • 55. Aron, A. R. & Poldrack, R. A. Cortical and subcortical contributions to stop signal response inhibition: role of the subthalamic nucleus. J. Neurosci. 26, 2424–2433 (2006).
  • 56. Nadian, M. H., Farmani, S. & Ghazizadeh, A. A novel methodology for exact targeting of human and non-human primate brain structures and skull implants using atlas-based 3D reconstruction. J. Neurosci. Methods 391, 109851 (2023).
  • 57. Abbaszadeh, M., Panjehpour, A., Amin Alemohammad, S. M., Ghavampour, A. & Ghazizadeh, A. Prefrontal cortex encodes value pop-out in visual search. iScience 26, 107521 (2023).
  • 58. Ghazizadeh, A., Fields, H. L. & Ambroggi, F. Isolating event-related neuronal responses by deconvolution. J. Neurophysiol. 104, 1790–1802 (2010).
  • 59. Cohen, M. X. Analyzing Neural Time Series Data: Theory and Practice. MIT Press (2014).
  • 60. Ghazizadeh, A. & Hikosaka, O. Common coding of expected value and value uncertainty memories in the prefrontal cortex and basal ganglia output. Sci. Adv. 7, eabe0693 (2021).
  • 61. Dayan, P. & Yu, A. J. Uncertainty and learning. IETE J. Res. 49, 171–181 (2003).
  • 62. Groppe, D. fdr_bh. MATLAB Central File Exchange https://www.mathworks.com/matlabcentral/fileexchange/27418-fdr_bh (2025).
  • 63. Takeuchi, R. F. comparison test of two correlation coefficient: corr_rtest(ra, rb, na, nb). MATLAB Central File Exchange https://www.mathworks.com/matlabcentral/fileexchange/61398-comparison-test-of-two-correlation-coefficient-corr_rtest-ra-rb-na-nb (2025).
  • 64. Trujillo-Ortiz, A. HotellingT2. MATLAB Central File Exchange https://www.mathworks.com/matlabcentral/fileexchange/2844-hotellingt2 (2025).
  • 65. Hashemirad, S., Abbaszadeh, M. & Ghazizadeh, A. Prefrontal cortex temporally multiplexes slow and fast dynamics in value learning and memory. 10.1101/2024.02.02.578632 (2025).
  • 66. Hashemi, R. Hemi1997/Tworate_RL: prefrontal cortex temporally multiplexes slow and fast dynamics in value learning and memory (Code). 10.1101/2024.02.02.578632 (2025).

Associated Data

Supplementary Materials

Reporting Summary (107.9KB, pdf)
Source data (41.6MB, zip)

Data Availability Statement

The datasets generated and analyzed during the current study have been deposited in Zenodo under the following accession: Hashemirad, S., Abbaszadeh, M., & Ghazizadeh, A. (2025). Prefrontal cortex temporally multiplexes slow and fast dynamics in value learning and memory (v1.0.0). Zenodo. 10.5281/zenodo.16752287 (ref. 65). This archive includes trial-level behavioral datasets and spike-sorted single-unit neural data, with both average and trial-wise peristimulus time histograms (PSTHs). Source data underlying each main and Supplementary figure are provided with this paper as individual Excel files. Additional related datasets are available from the corresponding author upon request.

The MATLAB code (R2021a and R2022a, MathWorks Inc.) used for behavioral analysis, model fitting, and neural data analysis, which produces the main findings of the paper (Figs. 1 and 2), has been deposited in Zenodo under the following accession: Reza Hashemi (2025). Hemi1997/Tworate_RL: Prefrontal Cortex Temporally Multiplexes Slow and Fast Dynamics in Value Learning and Memory (Code) (v1.0.0). Zenodo. 10.5281/zenodo.16751843 (ref. 66). Additional custom analysis scripts are available from the corresponding author upon request.


Articles from Nature Communications are provided here courtesy of Nature Publishing Group
