PLOS Computational Biology. 2023 Jun 15;19(6):e1011206. doi: 10.1371/journal.pcbi.1011206

Opponent learning with different representations in the cortico-basal ganglia pathways can develop obsession-compulsion cycle

Reo Sato 1,#, Kanji Shimomura 1,#, Kenji Morita 1,2,*
Editor: Stefano Palminteri
PMCID: PMC10306209  PMID: 37319256

Abstract

Obsessive-compulsive disorder (OCD) has been suggested to be associated with impairment of model-based behavioral control. Meanwhile, recent work suggested a shorter memory trace for negative than positive prediction errors (PEs) in OCD. We explored relations between these two suggestions through computational modeling. Based on the properties of cortico-basal ganglia pathways, we modeled humans as an agent having a combination of a successor representation (SR)-based system that enables model-based-like control and an individual representation (IR)-based system that only hosts model-free control, with the two systems potentially learning from positive and negative PEs at different rates. We simulated the agent’s behavior in the environmental model used in the recent work that describes potential development of obsession-compulsion cycle. We found that the dual-system agent could develop enhanced obsession-compulsion cycle, similarly to the agent having memory trace imbalance in the recent work, if the SR- and IR-based systems learned mainly from positive and negative PEs, respectively. We then simulated the behavior of such an opponent SR+IR agent in the two-stage decision task, in comparison with the agent having only SR-based control. Fitting of the agents’ behavior by the model weighing model-based and model-free control developed in the original two-stage task study resulted in smaller weights of model-based control for the opponent SR+IR agent than for the SR-only agent. These results reconcile the previous suggestions about OCD, i.e., impaired model-based control and memory trace imbalance, raising a novel possibility that opponent learning in model(SR)-based and model-free controllers underlies obsession-compulsion. Our model cannot explain the behavior of OCD patients in punishment, rather than reward, contexts, but this limitation could be resolved if opponent SR+IR learning operates also in the recently revealed non-canonical cortico-basal ganglia-dopamine circuit for threat/aversiveness (rather than reward) reinforcement learning, and the aversive SR + appetitive IR agent could indeed develop obsession-compulsion if the environment is modeled differently.

Author summary

Obsessive-compulsive disorder (OCD) is one of the major psychiatric disorders, diagnosed in 2.5%-3% of the population, and is characterized by an enhanced cycle of obsessive thought, e.g., about whether the door was locked, and compulsive action, e.g., checking the door lock. It remains elusive why such an apparently maladaptive behavior could be enhanced. A prevailing theory proposes that humans use two control systems, a flexible yet costly goal-directed system and an inflexible yet costless habitual system, and that impairment of the goal-directed system leads to OCD. On the other hand, recent work proposed a new theory that a shorter memory trace for credit assignment of negative, as compared to positive, prediction errors can induce OCD. The relation between these two theories remains unclear. We show that opponent learning, from positive versus negative prediction errors, of a particular type of goal-directed(-like) system, suggested to be implemented in the brain, and a habitual system could exhibit an (apparent) overall decrease in goal-directed control and also develop an enhanced obsession-compulsion cycle similar to the one developed by memory-trace imbalance, thereby bridging the two theories. Such opponent learning of the two systems was actually suggested to be advantageous in certain dynamic environments, and could thus have been evolutionarily selected at the cost of possible development of OCD.

Introduction

Obsessive-compulsive disorder (OCD) is one of the major psychiatric disorders, diagnosed in 2.5%-3% of the population [1], and is characterized by an enhanced cycle of obsessive thought, e.g., about whether the door was locked, and compulsive action, e.g., checking the door lock. It remains elusive why such an apparently maladaptive behavior could be enhanced. Dual process theories and their advanced forms [2–4] suggest that humans use two control systems: a flexible yet costly goal-directed/model-based system, where the "model" refers to the internal model of the environment (state transitions and rewards/punishments), and an inflexible yet costless habitual/model-free system. A prevailing suggestion based on these theories is that impairment of the goal-directed/model-based system relates to OCD [1,5–7], potentially explaining its apparently irrational behavior. In the two-stage decision task with reward or no-reward outcomes [8], OCD patients showed impaired model-based choices [6,7]. Also, deficits in model-based control in the two-stage task were strongly associated with the symptom dimension "compulsive behavior and intrusive thought" in the general population [9].

Meanwhile, recent work [10] suggested that memory trace imbalance underlies OCD. Specifically, the authors constructed an environmental model that describes potential enhancement of obsession-compulsion cycle, and showed that such a cycle can be enhanced in an agent having a much shorter memory (eligibility) trace for negative than for positive reward prediction errors (RPEs). They then fitted behavioral choices in a delayed feedback task, showing that OCD patients were indeed better fitted with such imbalanced eligibility traces than healthy controls (HCs) were, although even some HCs tended to show a similar imbalance. Mechanisms of such memory trace imbalance, as well as its relation to the suggestion of impaired model-based control, remain unclear.

Whereas the abovementioned modeling in the recent work [10] examined an agent that had only model-free control, multiple studies have suggested that humans and other animals use both model-based and model-free control [4,8]. Model-free control has been suggested to be implemented in such a way that dopamine represents RPEs and dopamine-dependent plasticity of cortico-striatal synapses represents RPE-based updates of state/action values [11,12]. Neural implementation of model-based control has remained more elusive, but recent studies suggest that if states or actions are represented in a manner that contains information about action-state transitions, partially model-based-like control can be acquired through RPE-based updates and thereby through dopamine-dependent plasticity, similarly to model-free control [13]. As such a representation, successor representation (SR), in which a state/action is represented by the cumulative occupancies of its successor states/actions (Fig 1A) [14], has been suggested to be used in humans [15] and implemented in the brain [16–18].

Fig 1. The dual-system agent having a coupled successor-representation(SR)-based system and individual-representation(IR)-based system, adapted from the previous study [35].


(A) Schematic illustrations of SR (left) and IR (right). In SR, action A1 is represented by a set of "discounted cumulative occupancies" of its successor actions (including A1 itself), i.e., (temporally discounted) cumulative frequencies with which each successor action is taken, starting from A1, under a given policy in the environment. The widths of the arrows indicate the probabilities of state transitions (to state S1 or S2) and action selections (A2 or A3 at S1 and A4 or A5 at S2), and the darkness of each circle indicates the occupancy of each action. The value of A1 is given by the dot product of the vector of occupancies of each action (x1 x2 x3 x4 x5) and a weight vector (w1 w2 w3 w4 w5), which is updated by reward prediction errors (RPEs). By contrast, in IR, action A1 is represented just by itself, separately from other actions. The value of A1, Q(A1), is directly updated by RPEs. (B) Different cortical regions/populations having SR or IR may unevenly target/activate the direct and indirect pathways of the basal ganglia (BG), which have been suggested to be crucial for learning from positive and negative feedback, respectively. The line widths of the arrows indicate the suggested preferences of projections/activations described in the Introduction and [35]. (C) The dual-system agent incorporating the potentially uneven projections/activations from the cortical regions/populations having SR or IR to the BG pathways. Each of the SR-based system and the IR-based system develops its system-specific value of each action, and the average of the two system-specific values, named the integrated action value, is used for soft-max action selection and calculation of SARSA-type TD-RPEs. The TD-RPEs are used for updating the system-specific values (or more specifically, the IR-based system-specific values and the weights for the SR-based system-specific values), with the learning rates for the SR- and IR-based systems in the cases of positive (non-negative) and negative TD-RPEs, denoted as αSR+, αSR−, αIR+, and αIR−, allowed to take different values.
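For concreteness, the value computations in panel A can be written out as follows (our rendering of the caption's description; the full update rules are given in the Materials and Methods of [35]):

$$Q_{\mathrm{SR}}(A_1)=\sum_{i=1}^{5} x_i w_i,\qquad w_i \leftarrow w_i+\alpha\,\delta\,x_i \;(\text{SR}),\qquad Q(A_1)\leftarrow Q(A_1)+\alpha\,\delta \;(\text{IR}),$$

where δ is the TD-RPE and α is the sign-dependent learning rate (αSR+ or αIR+ for δ ≥ 0; αSR− or αIR− for δ < 0).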

Different cortical regions or neural populations may adopt different ways of representation, in particular, SR and the more classical individual (punctate) representation (IR), the latter of which can implement model-free control. These different cortical regions/populations may unevenly project to and/or activate the direct and indirect pathways of the basal ganglia [19–21], which have been suggested or implied to be crucial for learning from positive and negative feedback, respectively [22–27] (Fig 1B), and their computational roles have been studied [22,28–32]. Different circuits for positive and negative learning other than these pathways have also been suggested [33,34]. Given these, it is conceivable that an SR-based system and an IR-based system differentially learn from positive and negative RPEs in the brain (Fig 1C). Recent work [35] explored this possibility, and found that a combination of SR- and IR-based systems learning mainly from positive and negative RPEs, respectively (referred to as the appetitive SR + aversive IR agent), performed well in certain dynamic reward environments. This work further suggested that implementation of such a combination in the cortico-basal ganglia pathways seems in line with various anatomical and physiological findings, including activations indicative of SR in limbic or visual cortices [17,18] and preferential connections from limbic and visual cortices to the direct pathway of the basal ganglia [20,21].

Crucially, SR has a similarity to the eligibility trace (cf. its "forward view" [36]), while model-free learning without an eligibility trace is naturally described using IR. Given this correspondence, we expected that the appetitive SR + aversive IR agent could potentially show behavior similar to that of the agent having long / short eligibility traces for positive / negative RPEs in the abovementioned recent work [10], in particular, enhancement of obsession-compulsion cycle. Moreover, we also expected that fitting of the appetitive SR + aversive IR agent’s behavior could potentially result in smaller weights of model-based control than in the case of the agent with the SR-based system only in the two-stage decision task [8], whereas eligibility-trace imbalance per se would not bias the estimation of the degree of model-based control. Here we addressed these expectations. Psychologically, in the appetitive SR + aversive IR agent, reinforcing effects of the pleasantness of a relief from obsession (though not necessarily consciously perceived) operate more generally or widely (or over longer time scales, given the similarity between SR and the eligibility trace) than reinforcing effects of the unpleasantness of a stay at obsession. In other words, this agent is characterized by (over)generalization of the pleasantness of a relief from obsession.

Results

Environmental model and dual-system agent

We adopted the previously developed environmental model that describes possible enhancement of obsession-compulsion cycle [10] (Fig 2). There are two states, named the relief state and the obsession state. At the relief state, the agent can take one of two actions, named the "abnormal reaction" and the "other" action. Here it is presumed that occasionally some anxiety, e.g., about whether the door was locked, arises spontaneously (intrusive thought), and the agent can either get deeply into the anxiety ("abnormal reaction"), entering the obsession state, or just ignore or forget about it and do other things ("other" action), remaining in the relief state. Once the agent enters the obsession state, the agent again can take one of two actions. If the agent takes the "compulsion" action by paying a small cost, e.g., confirming the door lock, the agent can transition back to the relief state with a high probability. Alternatively, the agent can take the "other" action without any cost, but it makes the agent transition back to the relief state only with a small probability. Every stay at the obsession state imposes punishment (negative reward).

Fig 2. Environmental model describing possible development of obsession-compulsion cycle, adapted from the previous study [10].


There are two states: the relief state and the obsession state. At the relief state, the agent can take the "abnormal reaction" (to an intrusive thought, i.e., spontaneously arising anxiety, e.g., about the door lock), which induces a transition to the obsession state, or the "other" action (e.g., just ignoring or forgetting about the intrusive thought), with which the agent stays at the relief state. At the obsession state, the agent can take the "compulsion" action (e.g., confirming the door lock), which requires a small cost (0.01) but induces a transition back to the relief state with a high probability (50%), or the "other" action, which requires no cost but induces a transition back to the relief state only with a small probability (10%). Every stay at the obsession state imposes punishment (negative reward −1).

Because of this punishment and also the cost, the optimal value of the obsession state should be negative and lower than the optimal value of the relief state. Thus, the optimal value of the "abnormal reaction", which induces transition to the obsession state, should be negative and lower than the optimal value of the "other" action at the relief state. Therefore, normative reinforcement learning (RL) agents should learn a policy that minimizes selection of the "abnormal reaction", thereby never developing enhanced obsession-compulsion cycle. However, agents that deviate from normative RL, such as the one having different lengths of eligibility traces for positive versus negative RPEs examined in the previous work [10], may behave differently.
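For concreteness, the dynamics just described can be written as a tiny simulation routine. The following is a minimal sketch assuming the parameters given in the Fig 2 caption; the timing of the −1 punishment (charged here for each time step spent at the obsession state) is our reading of the text, and the state/action encoding is our own, not the authors' code.

```python
import numpy as np

RELIEF, OBSESSION = 0, 1
OTHER, SPECIAL = 0, 1  # SPECIAL = "abnormal reaction" at relief, "compulsion" at obsession

def step(state, action, rng):
    """Take one action; return (next_state, reward)."""
    if state == RELIEF:
        # "abnormal reaction" enters the obsession state; "other" stays, with no outcome
        return (OBSESSION if action == SPECIAL else RELIEF), 0.0
    # At the obsession state: -1 punishment for the stay; "compulsion" additionally
    # costs 0.01 and returns to relief with probability 0.5 ("other": probability 0.1)
    cost = 0.01 if action == SPECIAL else 0.0
    p_return = 0.5 if action == SPECIAL else 0.1
    next_state = RELIEF if rng.random() < p_return else OBSESSION
    return next_state, -1.0 - cost

rng = np.random.default_rng(0)
print(step(OBSESSION, SPECIAL, rng))  # e.g., attempt a compulsion
```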

We examined how the dual-system agent having coupled SR- and IR-based systems (Fig 1C), originally developed in [35], behaved in this environmental model. As mentioned in the Introduction (and also in [35]), this dual-system agent was motivated by the suggestions that humans/animals use both model-based (SR-based) and model-free (IR-based) controls, which may learn differently from positive and negative RPEs through uneven projections from different cortical populations hosting different representations (SR or IR) to the direct and indirect pathways of basal ganglia. In this dual-system agent, each system develops the system-specific value of each action, and the average of the two system-specific values, named the (integrated) action value, is used for soft-max action selection and calculation of RPEs. The RPEs are used for updating the system-specific values, with the learning rates for the SR- and IR-based systems in the cases of positive and negative RPEs, denoted as αSR+, αSR−, αIR+, and αIR−, allowed to take different values.
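A minimal sketch of one learning step of this agent may help fix ideas. The variable names (M for the SR matrix over actions, w for the weights of the SR-based system-specific values, q_ir for the IR-based system-specific values) are ours, the exact equations are given in the Materials and Methods of [35], and γ = 0.5 is assumed here for consistency with the initialization 1/(1−γ) = 2 mentioned in the Limitations.

```python
import numpy as np

def softmax_choice(q_int, beta, rng):
    """Soft-max action selection over integrated values with inverse temperature beta."""
    p = np.exp(beta * (q_int - q_int.max()))
    return rng.choice(len(q_int), p=p / p.sum())

def learn(M, w, q_ir, a, r, a_next, alpha, gamma=0.5):
    """One SARSA-type update with sign-dependent learning rates."""
    q_int = (M @ w + q_ir) / 2                    # integrated action values (average)
    delta = r + gamma * q_int[a_next] - q_int[a]  # SARSA-type TD-RPE
    a_sr = alpha['SR+'] if delta >= 0 else alpha['SR-']
    a_ir = alpha['IR+'] if delta >= 0 else alpha['IR-']
    w += a_sr * delta * M[a]   # SR system: update the weights of the occupancy features
    q_ir[a] += a_ir * delta    # IR system: update the action value directly
    return delta
```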

Behavior of the dual-system agent in the environmental model

First, we confirmed whether the agent that effectively had only the IR-based system with the same learning rates from positive and negative RPEs, i.e., the conventional model-free agent, behaved in the normative manner described above, by setting the learning rates of the SR-based system from positive and negative RPEs (αSR+ and αSR−) to 0 while αIR+ and αIR− were set to 0.1. Fig 3A-top shows an example of the time evolution of the difference in the (integrated) action values of "abnormal reaction" and "other" at the relief state (Qabnormal − Qother@relief). The difference became negative, indicating that the abnormal reaction became unlikely to be taken. Fig 3A-middle shows the moving average of the proportion that the agent was at the obsession state in the same example simulation, and Fig 3A-bottom shows the average of such a moving-average proportion across 100 simulations. The proportion became, and remained, low. It was thus confirmed that this agent minimized selection of the "abnormal reaction" at the relief state and never developed enhanced obsession-compulsion cycle. We next examined the case where the agent effectively had only the SR-based system with unbiased learning from positive and negative RPEs, by setting αIR+ and αIR− to 0 while αSR+ and αSR− were set to 0.1. As shown in Fig 3B, this agent also minimized selection of the "abnormal reaction". This is actually reasonable given that SR-based temporal-difference (TD) learning can generally approximate optimal value functions.

Fig 3. Behavior of different types of the dual-system agents.


(A) Behavior of the agent that effectively had IR-based system only ((αSR+, αSR−, αIR+, αIR−) = (0, 0, 0.1, 0.1)). Top: An example of the time evolution of the difference in the action values of the "abnormal reaction" and the "other" at the relief state (Qabnormal − Qother@relief). Middle: The moving average (over the past 100 time steps) of the proportion that the agent was at the obsession state in the same example simulation. Bottom: The average of the moving-average proportion of the obsession state across 100 simulations (black line), presented with ±SD (gray thin lines). (B) Behavior of the agent that effectively had SR-based system only ((αSR+, αSR−, αIR+, αIR−) = (0.1, 0.1, 0, 0)). (C) Behavior of the agent having IR- and SR-based systems, both of which learned equally from positive and negative RPEs ((αSR+, αSR−, αIR+, αIR−) = (0.05, 0.05, 0.05, 0.05)). (D) Behavior of the agent having appetitive SR- and aversive IR-based systems ((αSR+, αSR−, αIR+, αIR−) = (0.09, 0.01, 0.01, 0.09)). (E) Proportion of the obsession state during time-steps 4901~5000, averaged across 100 simulations, in various cases with different learning rates. The horizontal and vertical axes indicate the learning rates of the SR-based system from positive and negative RPEs (i.e., αSR+ and αSR−), respectively, while αSR+ + αIR+ and αSR− + αIR− (i.e., total learning rates from positive and negative RPEs, respectively) were kept constant at 0.1.

Next, we examined the behavior of the agent in which both the IR-based system and the SR-based system were effective and the learning rates were all equal for the two systems and for positive and negative RPEs (i.e., αIR+ = αIR− = αSR+ = αSR− = 0.05, referred to as the neutral SR+IR agent). As shown in Fig 3C, this agent also minimized selection of the "abnormal reaction". This may also be reasonable, as this agent combined a rigorous normative model-free TD learner and an approximation of a normative TD learner. We then examined the behavior of the agent having appetitive SR- and aversive IR-based systems (referred to as the appetitive SR + aversive IR agent) by setting αSR+ = αIR− = 0.09 and αSR− = αIR+ = 0.01 (Fig 3D). As shown in Fig 3D-top, different from the previous cases, the Qabnormal − Qother@relief difference did not smoothly decrease but continued to fluctuate and sometimes even became positive, indicating that this agent continued to take the "abnormal reaction" and transition to the obsession state rather frequently, as indeed appears in Fig 3D-middle and -bottom. In other words, obsession-compulsion cycle was often developed and enhanced in this agent, as we expected. We further examined cases with different learning rates while αSR+ + αIR+ and αSR− + αIR− (i.e., the total learning rates from positive and negative RPEs, respectively) were kept constant at 0.1, finding that agents with large αSR+ & small αIR+ and small αSR− & large αIR− (i.e., appetitive SR- and aversive IR-based systems) could develop enhanced obsession-compulsion cycle (Fig 3E). A sketch of this sweep is shown below.
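The learning-rate sweep of Fig 3E can be organized as follows; this is an illustrative sketch in which the grid resolution is our choice and the inner simulation loop is omitted.

```python
import numpy as np

# Vary the SR-based learning rates while the total rates stay fixed at 0.1.
for a_sr_pos in np.linspace(0.0, 0.1, 11):
    for a_sr_neg in np.linspace(0.0, 0.1, 11):
        alpha = {'SR+': a_sr_pos, 'IR+': 0.1 - a_sr_pos,   # total positive rate = 0.1
                 'SR-': a_sr_neg, 'IR-': 0.1 - a_sr_neg}   # total negative rate = 0.1
        # run 100 simulations with these rates and record the average proportion
        # of the obsession state during time steps 4901-5000 (one cell of Fig 3E)
```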

Modifications and variations in the models

We conducted simulations for longer times, and found that the enhanced obsession-compulsion cycle developed in the appetitive SR + aversive IR agent ultimately terminated (Fig 4A). We looked at the time evolution of the variables, and found that the IR-system-specific values, as well as the weights for the SR-system-specific values (which are calculated as dot products of the weights and the occupancies: Fig 1A), grew to large magnitudes, while the magnitudes of the integrated action values and the SR matrix remained small (Fig 4B). This value/weight growth is considered to be due to the opponent setting rather than to factors specific to SR, as value growth also occurred in the agent consisting of an appetitive IR-based system and an aversive IR-based system, for which obsession-compulsion cycle was not enhanced (Fig 4C). Returning to the appetitive SR + aversive IR agent, because the weights for the SR-system-specific values became so large, an update of the SR matrix could cause large changes in the SR-system-specific values and thereby changes in the integrated values, whose impact could be quite large given the small magnitudes of the integrated values. This presumably triggered the termination of enhanced obsession-compulsion cycle. Indeed, if the rate of the update of the SR matrix, which was so far fixed at 0.01, decreased over time and converged to 0, i.e., SR became rigid (cf. [37]), enhanced obsession-compulsion cycle appeared to persist, despite the growth of the system-specific weights/values (Fig 4D and 4E). Considering neural implementation, however, such an unbounded growth of the weights/values may be implausible. As an alternative modification to the model, we introduced a small decay (forgetting) (cf. [38,39]) to the original model with update of the SR matrix at the fixed rate 0.01. Specifically, the IR-system-specific values and the weights for the SR-system-specific values were assumed to decay at a constant rate (0.001) per time step. Then, the growth of these values and weights was bounded, and enhanced obsession-compulsion cycle persisted (Fig 4F–4H). Both modifications are sketched below.
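The two modifications can be sketched as follows, with the numerical schedules taken from the text and the Fig 4 caption (function and variable names are ours).

```python
import numpy as np

def sr_matrix_update(M, a, a_next, t, gamma=0.5, rigidify=False):
    """TD-style update of the SR matrix. With rigidify=True, the rate follows the
    decreasing schedule 0.01/(1 + t/1000) of Fig 4D and 4E, so SR gradually becomes
    rigid; otherwise it stays at the fixed rate 0.01 of the original agent."""
    lr = 0.01 / (1 + t / 1000) if rigidify else 0.01
    onehot = np.zeros(M.shape[0])
    onehot[a] = 1.0
    M[a] += lr * (onehot + gamma * M[a_next] - M[a])

def decay_values(w, q_ir, rate=0.001):
    """Constant-rate forgetting of values/weights per time step (Fig 4F-4H)."""
    w *= 1 - rate
    q_ir *= 1 - rate
```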

Fig 4. Behavior of the original and modified dual-system agents in the long run.


(A) Behavior of the original appetitive SR + aversive IR agent for 50000 time steps. The learning rates were set as (αSR+, αSR−, αIR+, αIR−) = (0.09, 0.01, 0.01, 0.09) (same for (A,B, D-F)). Arrangements are the same as those in Fig 3A–3D. (B) Integrated action values, system-specific values/weights, and SR matrix of the original appetitive SR + aversive IR agent. Top: Integrated value of each action averaged across 100 simulations (brown: "other" at the relief state, purple: "abnormal reaction", black: "other" at the obsession state, red: "compulsion"; same below). Middle: Weights for SR-based system-specific values (top four lines) and IR-based system-specific values (bottom four lines) averaged across 100 simulations. Bottom: SR matrix at time step 5000 (left) and 50000 (right) averaged across 100 simulations. Each row indicates the SR of each action shown in the left, with the darkness of the squares indicating the discounted cumulative occupancies of the actions shown in the bottom. (C) Behavior of the appetitive IR ((αIR+, αIR−) = (0.09, 0.01)) + aversive IR ((αIR+, αIR−) = (0.01, 0.09)) agent. The moving average of the proportion that the agent was at the obsession state (top) and the system-specific values of the compulsion (bottom) in a single simulation are shown. (D,E) Results for the appetitive SR + aversive IR agent with a modification, in which the rate of the update of SR matrix decreased over time according to 0.01/(1 + time-step/1000). (F,G) Results for the appetitive SR + aversive IR agent with another modification, in which the IR-based system-specific values and the weights for the SR-based system-specific values decayed at a constant rate (0.001 at each time step) while the rate of the update of SR matrix was fixed at 0.01 as in the original agent. (H) Average proportion of the obsession state during time-steps 49901~50000 for the modified agent with the decay of values/weights, in various cases with different learning rates. Notations are the same as those in Fig 3E. (I) The average of the moving-average proportion of the obsession state across 100 simulations (black line), presented with ±SD (gray thin lines), of the appetitive SR + aversive IR agent with the decay of values/weights, with the inverse temperature increased tenfold (top) or fifty-fold (bottom: −SD is mostly invisible).

The reason for the stable persistence of enhanced obsession-compulsion cycle in the appetitive SR + aversive IR agent can be understood by looking at the action values and the SR matrix during the enhanced cycle (Fig 4G). The "other" action at the relief state had a positive value, and the abnormal reaction had a lower but still positive value. These positive values are considered to be shaped through asymmetric SR-based updates by RPEs generated when the agent took an action at the obsession state. Specifically, a positive RPE upon transition to the relief state had large effects since αSR+ was large, whereas a negative RPE upon a stay at the obsession state had small effects since αSR− was small. Now, assume that the agent, developing enhanced obsession-compulsion cycle, is at the relief state and happens to take the "other" action repeatedly, without taking the abnormal reaction. This is in fact the optimal policy, and because no reward or punishment is given as long as the agent only takes the "other" action, the value of the "other" action should approach, through RPE-based updates, 0, which is equal to the optimal value. However, this value, 0, is lower than the positive value of the abnormal reaction (even though the latter slowly decays). Therefore, as the agent repeatedly takes the "other" action more and more, ironically the abnormal reaction becomes more and more likely to be chosen. This pulls the agent back to enhanced obsession-compulsion cycle, away from the optimal policy. Notably, because the value of the abnormal reaction was (positive but) lower than the value of the "other" action at the relief state as mentioned above, it seems possible that if the agent’s choice policy is nearly deterministic (i.e., the inverse temperature is quite high), the agent would rarely take the abnormal reaction and thereby rarely develop enhanced obsession-compulsion cycle. We examined this possibility. When the inverse temperature increased tenfold (from β = 10 to β = 100), development of enhanced obsession-compulsion cycle was on average fairly delayed (Fig 4I, top), and when it was further increased fivefold (β = 500), the enhanced cycle was on average not developed within the examined time steps (50000) (Fig 4I, bottom). These results support the abovementioned conjecture about the mechanism and suggest a possible effect of choice exploration/exploitation.

In the environmental model used so far (developed by [10]) (Fig 2), there was only a single "other" action at each state. It is conceivable to assume multiple (different) "other" actions (Fig 5A). We examined the behavior of the dual-system agent with decay in modified environments where there were two or eight "other" actions at each state. When there were two "other" actions at each state (Fig 5B and 5C), the appetitive SR + aversive IR agent still exhibited enhanced obsession-compulsion cycle, although the proportion of the obsession state decreased from the original case with a single "other" action. The value of the abnormal reaction in such an agent was around 0 (purple line in Fig 5B middle), and so the abovementioned mechanism for enhanced obsession-compulsion cycle would barely hold. However, when there were eight "other" actions at each state (Fig 5D and 5E), the value of the abnormal reaction became negative (purple line in Fig 5D middle) and obsession-compulsion cycle was hardly enhanced even in the appetitive SR + aversive IR agent. We further examined the cases where there were eight "other" actions at the relief state and one "other" action at the obsession state (Fig 5F, top) or one "other" action at the relief state and eight "other" actions at the obsession state (Fig 5F, bottom). Enhanced obsession-compulsion cycle was developed only in the latter case. These results suggest that the reduced representation of actions other than the abnormal reaction at the relief state in the original environmental model is essential for enhancement of obsession-compulsion cycle.

Fig 5. Behavior of the modified agent with the decay of values/weights in the modified environmental model, in which there were multiple "other" actions.

(A) Modified environmental model assuming multiple "other" actions.


(B) Behavior of the appetitive SR + aversive IR agent ((αSR+, αSR−, αIR+, αIR−) = (0.09, 0.01, 0.01, 0.09)) with the decay of values/weights in the environment with two "other" actions at each state. Top: The average of the moving-average proportion of the obsession state across 100 simulations (black line), presented with ±SD (gray thin lines). Middle: Integrated action value averaged across 100 simulations (brown: average of "other" actions at the relief state, purple: "abnormal reaction", black: average of "other" actions at the obsession state, red: "compulsion"). Bottom: SR matrices averaged across 100 simulations, in which each row indicates the SR of "abnormal reaction" and "compulsion" and the mean SR of "other" actions at each state shown in the left (i.e., averaged across "other" actions at the same states), with the darkness of the squares indicating the discounted cumulative occupancies of "abnormal reaction" and "compulsion" and the summed discounted cumulative occupancies of "other" actions at each state shown in the bottom (i.e., summed across "other" actions at the same states). (C) Average proportion of the obsession state during time-steps 49901~50000 for the agent with the decay of values/weights, in various cases with different learning rates, in the environment with two "other" actions at each state. The color bar is the same as the one in Fig 4H. (D,E) Results for the cases where there were eight "other" actions at each state in the environmental model. (F) Behavior (average ±SD of the moving-average proportion of the obsession state) of the appetitive SR + aversive IR agent with the decay of values/weights in the environment where there were eight "other" actions at the relief state and one "other" action at the obsession state (top) or one "other" action at the relief state and eight "other" actions at the obsession state (bottom).

Behavior of the dual-system agent in the two-stage decision task

We examined how the appetitive SR + aversive IR agent behaved in the two-stage decision task [8], in comparison to the agent with the SR-based system only (referred to as the SR-only agent), which would correspond to the healthy controls who had long memory traces for both positive and negative RPEs in the previous work [10]. Specifically, we simulated choice sequences of both agents 100 times each using different pseudorandom numbers, and fitted them by the RL model weighing model-based and model-free control developed in the original study [8] to estimate the parameter w that represents the degree of model-based control. As shown in the left panels of Fig 6A and 6B, the estimated w for the appetitive SR + aversive IR agent (Fig 6B) was generally smaller (as a distribution) than that for the SR-only agent (Fig 6A) (Wilcoxon rank sum test, p = 1.04×10−7). This reduction of w was rather prominent, even though half of the appetitive SR + aversive IR agent was an SR learner, and we think it could thus potentially explain the reduction of w observed in OCD patients [7]. Notably, the distribution of the estimated w for the SR-only agent was not very close to 1 but rather wide, and this might reflect the fact that our SR-based system incorporated PE-based updates rather than direct value calculation by multiplication of fixed SR features and the reward vector. In contrast, fitting of choices generated by the original RL model with a balanced eligibility trace (λ = 0.4 for both positive and negative RPEs) and those generated by a modified model with an imbalanced trace (λ = 0.4 and 0 for positive and negative RPEs, respectively), both of which had the same degree of model-based control (w = 0.5), did not result in smaller estimated w in the latter, as shown in the left panels of Fig 6C and 6D (Wilcoxon rank sum test, p = 0.5134). Therefore, our expectation was again verified, and this result indicates the possibility that the OCD patients who were suggested to have an imbalanced eligibility trace [10] might not actually have an imbalanced trace but instead have the opponent SR+IR structure, and might overlap with the previously examined OCD populations who showed impairment of model-based control [7].
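For reference, the first-stage valuation of the fitting model of [8] combines model-based and model-free values through the weight w. The sketch below is our rendering under the standard formulation of that model (variable names ours; common transitions occur with probability 0.7 in [8]).

```python
import numpy as np

def first_stage_values(q_mf, q2_best, P, w):
    """q_mf: model-free values of the two first-stage actions; q2_best: value of the
    best option at each second-stage state; P: transition probabilities from
    first-stage actions (rows) to second-stage states (columns)."""
    q_mb = P @ q2_best                 # model-based values via the transition structure
    return w * q_mb + (1 - w) * q_mf   # w: the estimated degree of model-based control

P = np.array([[0.7, 0.3],              # common transition probability 0.7
              [0.3, 0.7]])
print(first_stage_values(np.array([0.2, 0.5]), np.array([0.6, 0.1]), P, w=0.5))
```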

Fig 6. Behavior of the dual-system agents in the two-stage decision task, as compared to the agent with balanced or imbalanced memory trace.


(A,B) Comparison between the SR-only agent ((αSR+, αSR−, αIR+, αIR−) = (0.3, 0.3, 0, 0)) (A) and the appetitive SR + aversive IR agent ((αSR+, αSR−, αIR+, αIR−) = (0.27, 0.03, 0.03, 0.27)) (B). Left panels: Distributions of the estimated parameter representing the degree of model-based control (w) across 97 simulations (out of 100 simulations: see the Materials and Methods for details). Right panels: Proportion of stay choices (i.e., choosing the same option as the one at the previous trial) at the first stage depending on whether reward was obtained and whether a common or rare transition occurred at the previous trial. The bars indicate the average across 1000 simulations, and the error-bars indicate ±SEM. (C,D) Comparison between the RL model developed in the original two-stage task study [8] with a balanced eligibility trace (λ = 0.4) (C) and a modified model with an imbalanced eligibility trace ((λ1, λ2) = (0.4, 0) for positive and negative TD-RPEs, respectively) (D). The degree of model-based control (w) was set to 0.5 in the simulations of both models. Left panels: Distributions of the estimated w across 100 simulations. Right panels: Mean (±SEM) proportion of stay choices at the first stage across 1000 simulations. (E) Results for the neutral SR + IR agent ((αSR+, αSR−, αIR+, αIR−) = (0.15, 0.15, 0.15, 0.15)). Configurations are the same as those in (A-D), with the left panel showing the distribution of w across 97 out of 100 simulations.

We further conducted the fitting analysis for the neutral SR + IR agent. As shown in the left panel of Fig 6E, the estimated w was typically low, and its distribution did not drastically differ from that for the appetitive SR + aversive IR agent, although there was a tendency toward a difference (Wilcoxon rank sum test, p = 0.05857). Thus, from the estimation of the weight of model-based control (w) alone, we cannot say that the appetitive SR + aversive IR agent is a better model of OCD patients than the neutral SR + IR agent, although such a claim was supported by the abovementioned simulation results (Figs 3 and 4H).

We analyzed the choice pattern of each type of agent, focusing on how much the agent’s choices (apparently) took into account whether the state transition that occurred was of the common or the rare type. Specifically, we focused on the degree to which the agent made more "stay" choices (i.e., chose the same first-stage option) after reward was obtained in the case of a common transition as compared to a rare transition, and also made fewer stays after no reward was obtained in the case of a common, as compared to a rare, transition. The appetitive SR + aversive IR agent showed a smaller common-vs-rare difference in the "stay" proportion after no-reward than after reward (Fig 6B, right panel) (comparison of (P(Stay)|common − P(Stay)|rare)|reward and (P(Stay)|rare − P(Stay)|common)|no-reward, t(999) = 6.97, p = 5.93×10−12), as naturally expected from the combination of appetitive/aversive learning and SR/IR. In contrast, in the SR-only agent and the neutral SR+IR agent, as well as in the agents with balanced or imbalanced eligibility traces, the common-vs-rare difference in the "stay" proportion was similar after reward and after no-reward (Fig 6A, 6C, 6D and 6E right panels). Therefore, whether the appetitive SR + aversive IR agent or the neutral SR+IR agent better describes OCD could be distinguished by examining whether there exists such an asymmetry between after reward and after no-reward in the common-vs-rare difference in the "stay" proportion; this measure is sketched below.
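The asymmetry measure used in the comparison above can be written out explicitly; this is a sketch with our own naming of the per-trial records.

```python
import numpy as np

def stay_asymmetry(stay, prev_reward, prev_common):
    """stay, prev_reward, prev_common: boolean arrays over trials (previous-trial
    outcome and transition type). Returns a positive value when the common-vs-rare
    effect on staying is weaker after no-reward than after reward."""
    p = lambda r, c: stay[(prev_reward == r) & (prev_common == c)].mean()
    effect_reward = p(True, True) - p(True, False)      # (P(stay)|common - P(stay)|rare) | reward
    effect_noreward = p(False, False) - p(False, True)  # (P(stay)|rare - P(stay)|common) | no-reward
    return effect_reward - effect_noreward
```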

Behavior of the dual-system agents in the delayed feedback task

We examined whether the observed differential behaviors of OCD patients and HCs in a delayed feedback task [10] could be explained by our dual-system agents. In the task, a pair of stimuli were presented, and subjects were required to choose one of them. Some stimuli caused immediate positive or negative feedback (monetary gain or loss) in the same trial, while other stimuli caused delayed feedback three trials later. Compared to HCs, OCD patients showed intact learning of stimuli causing immediate feedback but impaired learning of stimuli with delayed feedback (Fig 3B, 3C of [10]), and this pattern was reproduced by models with balanced and imbalanced eligibility traces (Fig 3E, 3F of [10]). We simulated execution of this task by the dual-system agents, in which actions of choosing each stimulus were represented by estimated discounted cumulative occupancies of successor states defined based on the presented feedback (see the Materials and Methods for details). As shown in Fig 7A and 7B, the appetitive SR + aversive IR agent ((αSR+, αSR−, αIR+, αIR−) = (0.018, 0.002, 0.002, 0.018)), as compared to the SR-only agent ((αSR+, αSR−, αIR+, αIR−) = (0.02, 0.02, 0, 0)), showed a particular impairment in learning from delayed feedback, largely reproducing the pattern observed in the experiment [10].

Fig 7. Behavior of the dual-system agents in the delayed feedback task examined in [10].


(A) Total obtained outcome (feedback) from stimuli causing immediate (left) or delayed (right) feedback in each session, averaged across 1000 simulations, for the appetitive SR + aversive IR agent (red) and the SR-only agent (blue). The error-bars indicate ±SEM. (B) Total obtained outcome (feedback) in sessions 5 and 6, averaged across 1000 simulations, from stimuli with immediate or delayed feedback for the two types of agents. The error-bars indicate ±SEM. (C) Results of fitting of the choices of the two types of agents (47 and 45 out of 50 simulations each, conducted separately from those shown in (B)) by the separate eligibility trace actor-critic model considered in [10]. The horizontal and vertical axes indicate the estimated parameters ν+ and ν− (decay time scales of the traces for positive and negative RPEs, respectively).

We further fitted the choices of these two types of dual-system agents by the separate eligibility trace actor-critic model considered in [10]. As a result, the choices of the SR-only agent were in large part fitted by the model with large eligibility traces for both positive and negative RPEs, whereas the choices of the appetitive SR + aversive IR agent were in large part fitted by the model with a larger trace for positive than for negative RPEs (Fig 7C), as we expected from the similarity between the eligibility trace and SR. These results indicate that the observed behaviors of OCD patients and HCs in the delayed feedback task [10] could be explained by the appetitive SR + aversive IR agent and the SR-only agent, respectively. A simplified rendering of the separate-trace mechanism is sketched below.
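The separate-trace mechanism can be rendered in simplified form as follows; this is our sketch, in which per-step decay factors stand in for the decay time scales ν+ and ν− of [10], not the authors' exact actor-critic implementation.

```python
import numpy as np

def traces_step(trace_pos, trace_neg, feat, decay_pos, decay_neg):
    """Maintain two eligibility traces over action features with different decays."""
    trace_pos = decay_pos * trace_pos + feat  # trace consumed by positive RPEs
    trace_neg = decay_neg * trace_neg + feat  # trace consumed by negative RPEs
    return trace_pos, trace_neg

def assign_credit(values, delta, trace_pos, trace_neg, alpha=0.1):
    """Credit assignment uses the trace matching the sign of the RPE delta."""
    values += alpha * delta * (trace_pos if delta >= 0 else trace_neg)
    return values
```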

Discussion

We have verified our two expectations: i) the appetitive SR + aversive IR agent could develop enhanced obsession-compulsion cycle, similarly to the agent having long / short eligibility traces for positive / negative RPEs examined in [10], and ii) fitting of the appetitive SR + aversive IR agent’s behavior could result in smaller weights of model-based control than for the SR-only agent in the two-stage decision task, whereas eligibility-trace imbalance per se would not bias the estimation of the degree of model-based control. We have also shown that the appetitive SR + aversive IR agent and the SR-only agent could explain the behaviors of OCD patients and HCs, respectively, in the delayed feedback task in [10]. These results reconcile the recent suggestion of memory trace imbalance in OCD [10] with the long-standing suggestion that OCD is associated with impairment of model-based (goal-directed) control [1,5,7,9], raising the possibility that opponent learning in model-based (SR-based) and model-free (IR-based) controls underlies obsession-compulsion. Below we discuss implications, limitations, and predictions of the present study, as well as an alternative environmental model.

Implications of our results, in relation to other studies

As described in the Introduction, the opponent combination of appetitive SR- and aversive IR-based systems has recently been shown to perform well in certain dynamic environments [35]. Specifically, in a virtual reward navigation task in which the reward location dynamically changed, the appetitive SR + aversive IR agent outperformed the agents with other combinations. Presumable reasons include that IR-based aversive learning compensates for a potential weakness of SR-based learning from negative RPEs, indicated by the difficulty of learning with successor features (an extension of SR) upon drastic changes in the goal/policy [40]. Moreover, implementation of such an appetitive SR & aversive IR combination in the cortico-basal ganglia pathways appears consistent with various physiological and anatomical findings (cf. Fig 11 of [35]), including activations indicative of SR in limbic or visual cortices [17,18] and preferential connections from limbic and visual cortices to the direct pathway of the basal ganglia [20,21]. These previous suggestions, together with the present finding that the opponent SR+IR combination could develop enhanced obsession-compulsion cycle, potentially explain a mystery raised by the recent work [10]. Specifically, in that work, fitting of even HCs’ behavior tended to result in a shorter memory trace for negative (than positive) RPEs, and the authors suggested that this could indicate humans’ proneness to OCD, but why humans may have such a bias in the first place remained unexplained. We propose that the opponent SR+IR combination has been evolutionarily selected because it is advantageous in certain dynamic environments, at the cost of proneness to OCD.

We have shown that when the number of possible actions other than the abnormal reaction at the relief state was increased, obsession-compulsion cycle was not enhanced. This result suggests that a factor contributing to enhanced obsession-compulsion cycle is that alternative actions are represented in a dimension-reduced manner. Intuitively, this could correspond to a situation where a person concentrates on the emerging intrusive thoughts so much that s/he cannot think of a variety of other things. Previous modeling work [37] suggested that dimension-reduced SR could relate to addiction if it is rigid, i.e., not updated according to changes in the policy. Empirical work [41] has shown that compulsivity is associated with impairment of state-transition learning. In the present work, rigidity of SR was needed for stable retention of enhanced obsession-compulsion cycle in the original model without decay, although not in the modified model with decay. Taken together, dimension reduction and rigidity of state/action representation could generally relate to psychiatric disorders/symptoms.

Our model does not specifically describe how the SR- and IR-based systems are arbitrated or how they are modulated by medical treatments. Serotonin has been suggested to be related to the arbitration of model-based and model-free control [7,42], and recent work demonstrated that silencing of dorsal raphe serotonin neurons disrupted model-based control in outcome devaluation tasks [43]. Meanwhile, selective serotonin reuptake inhibitors (SSRIs) have been used for treatment of OCD, and one of the suggested mechanisms of their effectiveness is that SSRIs re-engage the brain system for model-based control [1]. Our modeling results indicate the possibility that OCD is linked to impaired learning of the SR-based (model-based) system particularly from negative RPEs, which could be caused by serotonin depletion and recovered by SSRIs. This possibility is potentially consistent with a previous result shown in [42] that diminishing serotonin signaling by tryptophan depletion appeared to impair model-based choice after no-reward but not after reward in the two-stage decision task (their Fig 2A, the left two panels), similarly to the pattern predicted by our model (Fig 6B, right panel).

We modeled the behavior of healthy subjects in the two-stage task by the SR-only agent rather than by the neutral SR+IR agent. This could be justified by the result that the choices of the neutral SR+IR agent were fitted with rather low weights of model-based control (w) while those of the SR-only agent were fitted with relatively large values of w (Fig 6A and 6E, left panels). We consider, however, that healthy people would not always use SR(model)-based learning but may also use IR-based learning, or even appetitive SR- and aversive IR-based learning when it is advantageous. Healthy people might be able to rely strongly on the SR(model)-based system when appropriate, while OCD patients might have an impairment in such an arbitration of behavioral control, potentially in line with a previous suggestion [44].

Previous studies using probabilistic reversal learning tasks reported an unchanged / increased learning rate for positive / negative RPEs in OCD patients [45] or an increased / decreased learning rate for positive / negative RPEs in youth with OCD [46], and a recent study using a probabilistic instrumental learning task reported an unchanged / decreased learning rate for positive / negative RPEs in OCD patients [47]. The reasons for these mixed results remain elusive, but our appetitive SR + aversive IR agent (a large / small learning rate for positive / negative RPEs in the SR-based system but the opposite pattern in the IR-based system) could potentially explain them. Moreover, the recent study [47] further showed that the decreased learning rate for negative RPEs in OCD was associated with attenuated representation of negative RPEs in the dorsomedial prefrontal cortex and the caudate. Given that the caudate (like the dorsomedial striatum in rodents) has been implicated in model-based control (while the putamen/dorsolateral striatum has been implicated in model-free control) [4,48], the attenuated representation of negative RPEs in the caudate in OCD can be consistent with our appetitive SR-based system having a smaller learning rate for negative RPEs.

Limitations

The previous studies examining the two-stage decision task in OCD subjects [7] or in subjects with tryptophan depletion [42] also examined a different, "punishment version" of the two-stage task, in which the reward and no-reward outcomes of the original task were replaced with no-punishment and punishment outcomes. Different from the results in the reward version of the task, in the punishment version, OCD subjects did not show impaired model-based choice [7] and tryptophan depletion rather promoted model-based choice [42]. The environmental model that we used (developed by [10]) contained only punishment and cost, and so it can be said to be more similar to the punishment version than to the reward version. Nonetheless, even if the environmental model was altered so that the punishment (−1) for a stay at the obsession state was replaced with no reward for a stay at the obsession state and reward (1) for all the other cases, the behavior of the dual-system agent without decay was largely preserved, provided that the IR-system-specific values and the weights for the SR-system-specific values were initialized to 1 + γ + γ² + … = 1/(1−γ) = 2 (i.e., γ = 0.5) instead of 0. But still, the lack of outcome-valence dependence is an important limitation of our model (we discuss an alternative environmental model below). It was suggested [42] that serotonin’s possible role in representing the average reward rate [49] may be related to such outcome valence-dependent differential effects of tryptophan depletion. Exploration of how our model can be integrated into the average reward RL framework (cf. [49]) is a possible future direction.

Our modeling results also appear inconsistent with the result of a study examining generalization in OCD patients [50]. Using a sensory preconditioning paradigm with monetary rewards and losses, this study reported that OCD patients generalized less from rewards, but not from losses, than healthy controls (HCs). Looking at their results, in fact the HCs did not on average show significant generalization from rewards but showed chance-level choices, while the OCD patients significantly avoided the stimulus associated with a second stimulus that was later associated with stochastic (risky) gain and preferred the alternative stimulus associated with a second stimulus later associated with neutral outcome (Fig 3A of [50]). In the second-stimulus-outcome (gain/loss/neutral) association phase of the task, participants were required to "bid" on the presented stimulus and received feedback regardless of their bidding decisions, whereas no feedback was given in the final choice phase of the task. Given these task properties and the observed choice patterns, we speculate that the task worked as a test for generalization of aversiveness associated with riskiness and/or the loss of opportunity that arose when participants did not "bid" on a reward-associated stimulus, and that the OCD patients showed such a generalization of aversiveness while the HCs did not. As SR-based learning is more generalizable than IR-based learning, the appetitive SR & aversive IR combination of our model predicts larger generalization from positive than from negative RPEs in OCD, apparently opposite to the abovementioned conjecture.

However, this contradiction could potentially be resolved by considering recent findings of a dopamine neuronal subpopulation representing threat prediction errors (PEs) rather than RPEs and of striatal subpopulations engaged in threat (rather than reward) learning/management, including the tail of the striatum (TS) and the substantia nigra pars lateralis (SNL) in mice [51–53], and also dopamine neuronal subpopulations potentially representing error signals for aversive outcomes [54] or reward omission [55]. Specifically, it is conceivable that the greater learning of the SR-based system from positive than from negative dopamine signals in OCD suggested by our model occurs not only in the canonical cortico-basal ganglia-dopamine circuit for reward RL but also in the circuit for RL of threat or aversiveness, potentially through common mechanisms for corticostriatal projection preferences. Then, the latter circuit, which is recruited in punishment contexts, should implement the aversive SR + appetitive IR agent (Fig 8) and show (over)generalization of threat or aversiveness, potentially in line with the abovementioned conjecture. The existence of such multiple opponent SR+IR circuits could further explain the previous mixed results regarding learning rate biases in OCD discussed above.

Fig 8. Schematic diagram of the proposed opponent SR + IR learning in multiple parallel cortico-basal ganglia (BG)-dopamine (DA) circuits.


It is hypothesized that there exist preferential connections from the cortical populations having SR and IR to the striatal direct and indirect pathway neurons expressing D1 and D2 receptors, respectively. Such preferential connections cause greater learning of the SR- and IR-based systems from positive and negative DA signals, respectively. This implements the appetitive SR + aversive IR agent in the canonical cortico-BG-DA circuit for reward reinforcement learning (RL), in which DA represents reward prediction error (RPE) (left part of this figure). In contrast, in the recently revealed cortico-BG-DA circuit for threat/aversiveness RL, in which DA represents threat/aversiveness PE, the same greater learning of the SR/IR-based systems from positive/negative DA signals implements the aversive SR + appetitive IR agent (right part of this figure).

From a more general viewpoint, a potential limitation of our model is that it may not explain a wide range of phenomena associated with OCD. In particular, it has been suggested that OCD is accompanied by various degrees of obsession and compulsion, and there can be pure compulsion (compulsive hoarding may be an example [1]) and pure obsession [56] (but see [1]). By contrast, the environmental model that we used, developed by [10], assumes the obsession-compulsion cycle as an inseparable core of OCD. The model in fact assumes that intrusive thoughts universally occur even in healthy people, and that what distinguishes OCD is the "abnormal reaction" to such thoughts. Once it enters the obsession state, even the SR-only agent is likely to choose the "compulsion", as it should have a larger value than the "other" action because the "other" action leads to punishment more frequently than the "compulsion". So in this model, OCD is not characterized by an increase in the value/choice-frequency of compulsion. The model could still be compatible with the existence of (apparently) pure compulsion if the "abnormal reaction/obsession" can be subconscious, and also with pure obsession if the "compulsion" is not a motor action but a thought (as implied in [1]), presumably depending on which part of the cortico-basal ganglia circuits has the opponent SR+IR structure. However, it has also been suggested that compulsion (rather than obsession or intrusive thoughts) is a key factor of OCD [1] and that obsession can emerge from compulsion through erroneous reverse inference [57]. Future studies are desired to test these different theories, or possibly to integrate them by elaborating the environmental model, which is currently rather simple (we discuss an alternative environmental model below). Given that environmental complexity was suggested [58] to have crucial effects in addiction, which also entails compulsion, future elaboration of models and exploration of parameters are highly desired.

Predictions, and comparisons with actual data

Our model predicts that OCD patients show a smaller common-vs-rare difference in the "stay" proportion after no-reward than after reward in the two-stage decision task (Fig 6B, right panel). This pattern looks similar to the pattern actually observed in the 48 OCD subjects reported in the previous study [7] (their Supplementary Fig 1), although not to the pattern observed in the 32 OCD subjects reported in their preceding study [6] (their Supplementary Fig S1). A statistical test of this specific prediction should be conducted in the future.

Looking at the choice pattern of OCD subjects in the punishment version of the two-stage task (Supplementary Fig 1 of [7]), OCD subjects apparently tended to show a smaller common-vs-rare difference in the "stay" proportion after no-punishment than after punishment. If punishment/no-punishment were equated with no-reward/reward considering the sign of RPEs, the direction of this asymmetry is opposite to that in the reward version of the task, and thus opposite to our model’s prediction. However, if the greater SR-based learning from positive dopamine signals suggested by our model occurs also in the circuit for RL of threat/aversiveness, which is recruited in punishment contexts, as discussed above (Fig 8), exactly this direction is predicted, as we confirmed by simulation of the aversive SR + appetitive IR agent (Fig 9A, bottom).

Fig 9. Behavior in the two-stage decision task: additional simulation result and results obtained by analysis of publicly available experimental data.


(A) Behavior of the aversive SR + appetitive IR agent ((αSR+, αSR−, αIR+, αIR−) = (0.03, 0.27, 0.27, 0.03)) in the punishment version of the two-stage task. Top: Distribution of the estimated parameter representing the degree of model-based control (w) (across 98 out of 100 simulations). Bottom: Mean (±SEM) proportion of stay choice at the first stage across 1000 simulations. (B) Mean (±SEM) proportion of stay choice at the first stage of the participants with high scores of OCI-R (≥ 40) in Experiment 1 (23 out of 548 participants) (top panel) and Experiment 2 (58 out of 1413 participants) (bottom panel) of [9], obtained by analysis of the data at https://osf.io/usdgt/. Cases where the "reward" at the previous trial was negative in the data file were omitted from the analysis.

The data of the study [9] that examined behavior in the two-stage task together with self-report assessment of OCD using the Obsessive-Compulsive Inventory—Revised (OCI-R) [59] in general populations is publicly available (https://osf.io/usdgt/). We extracted the data and analyzed the choices of participants with high OCI-R scores (≥ 40) in their Experiments 1 and 2 (23 out of 548 and 58 out of 1413 participants, respectively). The choices in Experiment 2 (Fig 9B, bottom) showed a tendency toward smaller common-vs-rare differences in the "stay" proportion after reward than after no-reward (comparison of (P(Stay)|common − P(Stay)|rare)|reward and (P(Stay)|rare − P(Stay)|common)|no-reward: t(57) = −1.99, p = 0.0516 (Experiment 2); t(22) = −1.31, p = 0.203 (Experiment 1)). This direction of asymmetry is the same as the direction apparently shown by OCD subjects in the punishment version, rather than the reward version, of the task [7]. We think this could potentially be explained as follows. In the study using general populations [9], "Participants were paid a base rate (Experiment 1: $2, Experiment 2: $2.50) in addition to a bonus based on their earnings during the reinforcement-learning task (In each experiment, M = $0.54, SD = 0.04)" (sentence extracted from page 13 of [9]). Given that completing the task (200 trials) would require a considerable amount of time, these payments were rather low. It therefore seems possible that the participants had expected a larger performance-dependent bonus and experienced every failure as a punishment, even though outcomes were nominally presented as reward vs no-reward. Future studies are desired to examine this possibility by manipulating the amount of payment. In terms of our model, the asymmetry in Experiment 2 can again be explained by the aversive SR + appetitive IR agent (Fig 9A, bottom).
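In terms of implementation, the comparison can be summarized by the following minimal MATLAB sketch; dRew and dNoRew are our own illustrative variable names (not from the actual analysis code), holding for each high-OCI-R participant the common-minus-rare difference in the stay proportion after reward and after no-reward, respectively (the study's statistics were computed in R; an equivalent MATLAB call is shown here for consistency with the other sketches below).

```matlab
% Hedged sketch: dRew(i) and dNoRew(i) hold (P(Stay)|common - P(Stay)|rare)
% after reward and after no-reward for participant i (computed as in the
% Materials and methods). The test compares (common - rare)|reward
% against (rare - common)|no-reward.
[~, p, ~, stats] = ttest(dRew, -dNoRew);   % paired t test across participants
```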

An alternative environmental model

The aversive SR + appetitive IR combination did not develop enhanced obsession-compulsion cycles in the environmental model that we used (proposed by [10]), as shown in the top-left corners of Figs 3E and 4H. If OCD patients have both the appetitive SR + aversive IR and the aversive SR + appetitive IR combinations in the parallel circuits for reward RL and threat RL, respectively, as proposed in Fig 8, the patients' obsession-compulsion could still potentially be explained by the former combination, which did develop enhanced obsession-compulsion cycles. However, a difficulty with this possibility is that it is unclear whether the circuit for reward RL, rather than that for threat RL, would be activated in the aversive situations in which patients develop obsession-compulsion. Even more generally, the idea that overgeneralization of (or a longer memory trace for) positive, rather than negative, feedback causes enhanced obsession-compulsion cycles in this environmental model may not be intuitively convincing, and may actually be inconsistent with (even opposite to) the previous work examining generalization in OCD patients [50], as discussed above.

A different possibility, then, is that the aversive SR + appetitive IR combination, in the threat RL circuit, could in fact develop obsession-compulsion if the environment is modeled differently. Indeed, in an alternative environmental model shown in Fig 10A, the aversive SR + appetitive IR combination developed intermittent bursts of repetitive obsession-compulsions, whereas the other combinations (appetitive SR + aversive IR, neutral SR + neutral IR, SR only, and IR only) rarely did so (Fig 10B and 10C). In this alternative environmental model, "compulsion" causes a stay at the obsession state with punishment whereas "depart" causes a transition to the relief state without cost or punishment, and so rational agents should learn to avoid "compulsion". Likewise, "intrusive" causes a transition to the obsession state with large punishment whereas "normal" causes a stay at the relief state without punishment, and so "intrusive" should normatively be minimized. However, the aversive SR + appetitive IR agent could overgeneralize the large punishing feedback upon "intrusive" (entering the obsession state) to preceding "depart" and "normal", deteriorating their values and increasing, in turn, the probabilities that "compulsion" and "intrusive" are chosen. Key differences from the original environmental model (Fig 2, [10]) are (i) punishment is given upon entering the obsession state, not only upon staying at it, (ii) "compulsion" could be repeated without returning to the relief state every time, and (iii) intrusive thought, rather than "abnormal reaction" to it, is modeled. Given these features, this alternative environmental model not only demonstrates that the aversive SR + appetitive IR combination could develop obsession-compulsion but also potentially overcomes several difficulties in the original environmental model discussed above.

Fig 10. Behavior of the dual-system agents in the alternative environmental model.

(A) Diagram of action-state transitions. At the relief state, the agent can have "normal (thought)" or "intrusive (thought)". Having "intrusive" causes transition to the obsession state, with punishment. At the obsession state, the agent can take "compulsion", which causes a stay at the obsession state with punishment, or "depart", which causes transition to the relief state. (B) Examples of the moving-average proportion of the obsession state (averaged over 100 time-steps, plotted every 100 time-steps) in the cases of the different types of agents (from top to bottom: aversive SR + appetitive IR (αSR+, αSR−, αIR+, αIR−) = (0.01, 0.09, 0.09, 0.01), appetitive SR + aversive IR (0.09, 0.01, 0.01, 0.09), neutral SR + neutral IR (0.05, 0.05, 0.05, 0.05), SR-only (0.1, 0.1, 0, 0) and IR-only (0, 0, 0.1, 0.1)). (C) Percentage of the period of repetitive obsession-compulsions, in which the moving-average proportion of the obsession state was ≥ 0.5, during time-steps 0~50000 in various cases with different learning rates, averaged across 100 simulations for each case. The horizontal and vertical axes indicate αSR+ and αSR−, respectively, while αSR+ + αIR+ and αSR− + αIR− were kept constant at 0.1 (in the same manner as in Figs 3E and 4H). (D-F) Percentage of the period of repetitive obsession-compulsions in the cases where the size of punishment upon staying at the obsession state (originally 0.2 in (C)) was changed to 0.1, 0.3, 0.4, or 0.5 (D), the inverse temperature (originally 10 in (C)) was changed to 5 or 20 (E), or the time discount factor (originally 0.8 in (C)) was changed to 0.7 or 0.9 (F).

This being said, however, consideration of such an alternative environmental model may be ad hoc, and the agents' behavioral patterns in this model depend on the parameters. Specifically, while moderately changing the size of the punishment upon staying at the obsession state (Fig 10D, top two panels) did not drastically change the patterns, further increasing the punishment upon stay (Fig 10D, bottom), changing the inverse temperature (Fig 10E), or changing the time discount factor (Fig 10F) caused significant changes: disappearance of (or a prominent decrease in) the intermittent bursts of repetitive obsession-compulsions in the aversive SR + appetitive IR combination and/or their appearance in other combinations. These results might rather indicate a fundamental limitation of modeling the environment of psychiatric disorders such as OCD with simple diagrams containing only a few states and actions (c.f., [58]). Nonetheless, we would argue that the parallel opponent SR+IR configurations, which are in line with the differential cortical targeting/activation of the direct and indirect basal-ganglia pathways and with the existence of parallel cortico-basal ganglia-dopamine circuits for reward RL and threat RL, could provide a novel, biologically grounded framework for integrating the different lines of behavioral findings on OCD: memory trace imbalance, impaired model-based control, and its valence dependence.

Materials and methods

Environmental model that describes possible enhancement of obsession-compulsion cycle

We adopted the environmental model developed by the previous study [10] that describes possible enhancement of the obsession-compulsion cycle (Fig 2). There are two states, the relief state and the obsession state, with two available actions at each state: the "abnormal reaction" and the "other" action at the relief state, and the "compulsion" and the "other" action at the obsession state. State transitions are determined as follows. At the relief state, taking the abnormal reaction always causes a transition to the obsession state, whereas taking the other action always causes a stay at the relief state. At the obsession state, taking the compulsion causes a transition to the relief state or a stay at the obsession state with 50% probability each, whereas taking the other action causes a transition to the relief state with 10% probability and a stay at the obsession state with 90% probability. Taking the compulsion at the obsession state requires a cost of 0.01 (i.e., negative reward −0.01), and every stay at the obsession state imposes a punishment of 1 (i.e., negative reward −1); no other reward is assumed. The parameter values regarding state transitions and rewards are the same as those used in the previous work [10]. We also examined modified versions of the environmental model, in which there were two or eight "other" actions at both or one of the states (Fig 5); the state-transition probabilities for these multiple "other" actions were set to the same values as for the "other" actions in the original model. We further examined an alternative environmental model, as depicted in Fig 10A.
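As a concrete illustration, one step of this environmental model can be written in MATLAB as follows; the function name and the numeric coding of states and actions are our own illustrative choices, not taken from the published code.

```matlab
function [nextState, reward] = envStep(state, action)
% One step of the environmental model of [10] (a minimal sketch).
% state: 1 = relief, 2 = obsession
% action: 1 = "abnormal reaction" (at relief) / "compulsion" (at obsession)
%         2 = "other"
reward = 0;
if state == 1
    if action == 1
        nextState = 2;              % abnormal reaction -> obsession
    else
        nextState = 1;              % other -> stay at relief
    end
else
    if action == 1
        reward = reward - 0.01;     % cost of compulsion
        if rand < 0.5
            nextState = 1;          % 50%: transition to relief
        else
            nextState = 2;          % 50%: stay at obsession
        end
    else
        if rand < 0.1
            nextState = 1;          % 10%: transition to relief
        else
            nextState = 2;          % 90%: stay at obsession
        end
    end
end
if state == 2 && nextState == 2
    reward = reward - 1;            % punishment upon staying at obsession
end
end
```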

Agent having coupled SR- and IR-based systems

We adopted the model agent developed by the recent work [35], which has coupled SR- and IR-based systems (Fig 1C), with the modification that SR of actions (c.f., [13]), rather than SR of states, was used in the present study because there were only two states in the environmental model. Each system developed its own system-specific value of each action, and their average, referred to as the (integrated) action value, was used for action selection and TD-RPE calculation.

Action selection was made in a soft-max manner. Specifically, when there were multiple available actions Ai (i = 1, …, n) with action values Q(Ai), Ai was selected with probability

exp(βQ(Ai)) / Σj exp(βQ(Aj)),

where β was the inverse temperature parameter representing the degree of choice exploitation over exploration and was set to 10 in the simulations with the abovementioned environmental models unless otherwise mentioned.
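As an illustration, such soft-max selection can be implemented as follows (a minimal sketch; the subtraction of max(Q) is our own numerical-stability addition and does not change the selection probabilities).

```matlab
function a = softmaxChoice(Q, beta)
% Choose action i with probability exp(beta*Q(i)) / sum_j exp(beta*Q(j)).
p = exp(beta * (Q(:) - max(Q(:))));   % unnormalized probabilities
p = p / sum(p);                       % normalize
a = find(rand < cumsum(p), 1);        % sample an action index according to p
end
```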

SARSA-type TD-RPE was used for the results presented in the figures. Specifically, when the agent took action At−1, paid cost c, received reward r, transitioned to the next state, and took the next action At, the SARSA-type TD-RPE was calculated as

δ = −c + r + γQ(At) − Q(At−1),

where γ was the time discount factor, set to 0.5 in the simulations with the abovementioned original environmental model (Fig 2) and its modified versions (Fig 5A), and to 0.8 (or 0.7 or 0.9) in the simulations with the alternative environmental model (Fig 10A). We also examined the cases with Q-learning-type TD-RPE for the original agent and environmental models, and the main properties of the agent's behavior shown in Fig 3 were not largely changed. Along with the TD-RPE, TD errors for the SR features were calculated:

Δ = et−1 + γMt − Mt−1,

where et−1 was a vector with one element per action, in which the element corresponding to At−1 was 1 and all the other elements were 0, and Mi (i = t−1 or t) was the Ai-corresponding row of the SR matrix M.

IR-based system-specific value of At−1, QIR(At−1), was updated based on the TD-RPE:

QIR(At−1) ← QIR(At−1) + αIR+δ   (when δ ≥ 0), or
QIR(At−1) ← QIR(At−1) + αIR−δ   (when δ < 0),

where αIR+ and αIR− were the learning rates in the IR-based system for positive (non-negative) and negative TD-RPEs, respectively. SR-based system-specific value of At−1, QSR(At−1), was given by the dot product of the SR feature vector for At−1, i.e., Mt−1, and the weight vector w, i.e.,

QSR(At−1) = w·Mt−1.

The weight w was updated based on the TD-RPE:

w ← w + αSR+δMt−1   (when δ ≥ 0), or
w ← w + αSR−δMt−1   (when δ < 0),

where αSR+ and αSR− were the learning rates in the SR-based system for positive (non-negative) and negative TD-RPEs, respectively. The SR feature vector for At−1, Mt−1, was updated based on the TD errors for SR features:

Mt−1 ← Mt−1 + αfeatureΔ,

where αfeature was the learning rate for feature update and was set to 0.01 in the simulations with the abovementioned environmental models except for the simulations shown in Fig 4D and 4E, in which αfeature decreased over time according to

0.01/(1+timestep/1000).

For the simulations shown in Figs 4F–4H, 5 and 10, the IR-system-specific values for all the actions and all the elements of the weight vector for the SR system-specific values (w) decayed at a constant rate (0.001), i.e., were multiplied by 0.999 at each time step.
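Putting these elements together, one time step of the coupled SR+IR agent's learning can be sketched in MATLAB as follows; all variable names are our own illustrative choices, and the block assumes that the chosen actions aPrev (At−1) and aNext (At), the cost c, the reward r, the discount factor gamma, and the learning rates are already in scope.

```matlab
% One learning step of the coupled SR+IR agent (a minimal sketch).
% QIR: nActions x 1 IR-based values; w: nActions x 1 SR weights;
% M: nActions x nActions SR matrix (rows = actions).
QSR = M * w;                                      % SR-based system-specific values
Q   = 0.5 * (QSR + QIR);                          % integrated action values (average)
delta = -c + r + gamma * Q(aNext) - Q(aPrev);     % SARSA-type TD-RPE
e = zeros(1, length(QIR)); e(aPrev) = 1;          % one-hot vector for the taken action
Delta = e + gamma * M(aNext, :) - M(aPrev, :);    % TD errors for SR features
if delta >= 0                                     % valence-dependent learning rates
    QIR(aPrev) = QIR(aPrev) + alphaIRp * delta;
    w = w + alphaSRp * delta * M(aPrev, :)';
else
    QIR(aPrev) = QIR(aPrev) + alphaIRm * delta;
    w = w + alphaSRm * delta * M(aPrev, :)';
end
M(aPrev, :) = M(aPrev, :) + alphaFeature * Delta; % SR feature update
QIR = 0.999 * QIR; w = 0.999 * w;                 % constant decay (Figs 4F-4H, 5 and 10)
```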

The agent was initially placed at the relief state. The IR-system-specific values for all the actions were initialized to 0, and the weight vector for the SR-system-specific values (w) was initialized to 0. The SR matrix M was initialized to the identity matrix, i.e., the feature of each action corresponding to that action itself was initialized to 1 and all the other features to 0. 100 simulations were conducted for each condition. Below are notes on our simulations using the original environmental model. The value of the time discount factor (γ = 0.5) was the same as that used in the previous work [10]. The ranges of the learning rates αIR+, αIR−, αSR+, and αSR− were also determined in reference to the value of the learning rate used in that study (0.1), although because the average (rather than the sum) of the SR- and IR-based system-specific values was used for action selection and TD-RPE calculation, there was effectively a twofold difference, and the effective learning rates for the SR-based system also varied, as mentioned below. The value of the inverse temperature used in most simulations (β = 10) was larger than the value used in the previous work (β = 1) [10]. If β was set to 1 in our original model without decay, the proportion of time that the agent was at the obsession state was rather high (≥ around 0.5 on average) for all of the combinations examined in Fig 3A–3D ((αSR+, αSR−, αIR+, αIR−) = (0, 0, 0.1, 0.1), (0.1, 0.1, 0, 0), (0.05, 0.05, 0.05, 0.05), and (0.09, 0.01, 0.01, 0.09)).

Notably, the abovementioned algorithm for TD learning in the SR-based system differed from the one described in [13], not only in the specific assumptions on the learning rates and the decay but also in that our algorithm lacked scaling of the SR feature vector by its squared norm upon updating the weights. Without this scaling, the learning rate for the weight update was effectively multiplied by the squared norm of the feature vector. In our simulations using the original environmental model, the elements of the feature vectors were upper-bounded by 1 + γ + γ² + … = 2 (as we set γ = 0.5), so the squared norm of the feature vector was upper-bounded by 2² = 4, and since we varied the learning rates for the weight update in the range of ≤ 0.1, the effective learning rates did not exceed 1.

In order to examine the reason for the value growth that occurred in the original appetitive SR + aversive IR agent, we also examined an agent consisting of an appetitive IR-based system ((αIR+, αIR−) = (0.09, 0.01)) and an aversive IR-based system ((αIR+, αIR−) = (0.01, 0.09)). As in the case of the SR + IR agent, the average of the two IR-based system-specific values, initialized to 0, was used for action selection and TD-RPE calculation.

Simulation and fitting of the two-stage decision task

We simulated the behavior of the agent having coupled SR- and IR-based systems in the two-stage decision task, consisting of 201 trials [8]. There were two choice options at the first stage, and selection of one of them led to either of two pairs of second-stage options with 70% and 30% probabilities (referred to as the common and rare transitions, respectively), whereas selection of the other first-stage option led to either of the two pairs with the opposite 30% and 70% probabilities. Selection of one of the second-stage options led to reward (size 1) or no reward. The probability of reward for each second-stage option was determined according to a Gaussian random walk with reflecting boundaries at 0.25 and 0.75, implemented such that a pseudo normal random number (mean 0, SD 0.025) was added to the probability at every trial and the probability was reflected at 0.25 and 0.75.
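A minimal sketch of this reflecting Gaussian random walk is given below; the initialization of the probabilities is our own assumption, as the description above does not specify it.

```matlab
% Reward probabilities of the four second-stage options (minimal sketch).
p = 0.25 + 0.5 * rand(4, 1);              % assumed: initialized uniformly in [0.25, 0.75]
for k = 1:201
    % ... simulate trial k using the current p ...
    p = p + 0.025 * randn(4, 1);          % pseudo normal step (mean 0, SD 0.025)
    lo = p < 0.25; p(lo) = 0.5 - p(lo);   % reflect at the lower boundary 0.25
    hi = p > 0.75; p(hi) = 1.5 - p(hi);   % reflect at the upper boundary 0.75
end
```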

At every trial, choice was made in the soft-max manner with the degree of exploitation over exploration (inverse temperature) β set to 5 at both stages. The SARSA-type TD-RPE for the first stage was calculated after the second-stage choice was determined:

δk(1) = 0 + γV(Ok(2)) − V(Ok(1)),

where Ok(1) and Ok(2) were the chosen options at the first and second stages of the k-th trial, respectively, and V(O) was the value of option O (i.e., the average of the SR- and IR-system-specific values). γ was the time discount factor, set to 1 (i.e., no temporal discounting). Then, after the reward (R = 1 or 0) was determined, the TD-RPE for the second stage was calculated:

δk(2) = R + 0 − V(Ok(2)),

where V(Ok(2)) reflected the δk(1)-based update of the weights of the SR system-specific values. The IR-based system-specific value of each first- and second-stage option was initialized to 0 and updated based on the TD-RPEs. The SR-based system had the SR of the first- and second-stage options, and the SR-based system-specific value was given by the dot product of the SR feature vector for the corresponding action and the weights, which were updated based on the TD-RPEs. The SR matrix was initialized to the one that assumed a random policy on the second-stage choice and incorporated the actual stage-transition probabilities:

[ 1    0    0.35  0.35  0.15  0.15
  0    1    0.15  0.15  0.35  0.35
  0    0    1     0     0     0
  0    0    0     1     0     0
  0    0    0     0     1     0
  0    0    0     0     0     1 ],

in which the first two rows/columns correspond to the two first-stage options and the subsequent four rows/columns correspond to the four (2×2) second-stage options. At every trial, after the stage transition occurred and the second-stage option was determined, the SR features for the chosen first-stage option were updated based on the TD errors for the SR features, with the learning rate set to 0.05.
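Written out explicitly in MATLAB, the initial SR matrix is as follows (e.g., 0.35 = 0.7 (common-transition probability) × 0.5 (assumed random second-stage choice), with γ = 1).

```matlab
% Initial SR matrix: rows/columns 1-2 are the two first-stage options,
% rows/columns 3-6 are the four second-stage options.
M0 = [1 0 0.35 0.35 0.15 0.15;
      0 1 0.15 0.15 0.35 0.35;
      0 0 1    0    0    0;
      0 0 0    1    0    0;
      0 0 0    0    1    0;
      0 0 0    0    0    1];
```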

We considered an agent that had appetitive SR- and aversive IR-based systems, an agent that effectively had an SR-based system only, and an agent that had neutral SR- and IR-based systems; their learning rates for positive and negative TD-RPEs were set to (αSR+, αSR−, αIR+, αIR−) = (0.27, 0.03, 0.03, 0.27), (0.3, 0.3, 0, 0), and (0.15, 0.15, 0.15, 0.15), respectively, common to both stages, and we conducted 100 simulations for each agent. We also conducted 100 simulations of the punishment version of the task, in which reward or no-reward (R = 1 or 0) in the original task was replaced with no-punishment or punishment (R = 0 or −1), for an agent that had aversive SR- and appetitive IR-based systems ((αSR+, αSR−, αIR+, αIR−) = (0.03, 0.27, 0.27, 0.03)).

Generated choice sequences were individually fitted by the RL model weighing model-based and model-free control described in the Supplemental material of the original study [8]. The RL model had seven free parameters: a1 and a2 (learning rates for the first and second stages), b1 and b2 (inverse temperatures for the first and second stages), λ (eligibility trace, representing the degree to which the model-free value of the first-stage option was updated based on the TD-RPE at the second stage), ρ (perseveration or switching at first-stage choices), and w (degree (weight) of model-based control, with the degree (weight) of model-free control set to (1−w)). Since a1, a2, λ, and w should normally be between 0 and 1, these variables were transformed as atanh(2x−1), where x represents a1, a2, λ, or w, in the fitting process so that the range of exploration was effectively bounded to [0, 1]. For each choice sequence, we explored the set of parameters that maximized the log likelihood [60] by using the "fminsearch" function of MATLAB, with the initial parameter values set as (a1, a2, b1, b2, λ, ρ) = (0.3, 0.3, 5, 5, 0.2, 0.2) and w = 0.2, 0.5, or 0.8 (i.e., these three cases were examined), and the 'MaxFunEvals' and 'MaxIter' options of this function both set to 10000. Among the three cases with different initial values of w, we selected the case that gave the maximum log likelihood, and if the "fminsearch" function terminated with "exitflag" 1 (meaning normal termination) in that selected case, the corresponding data (value of w) was included in the following analysis. This was the case for 97, 97, 97, and 98 out of 100 simulations for the abovementioned four types of agents, respectively. We compared the obtained distributions of w between the appetitive SR + aversive IR agent and the SR-only agent (originally planned), and also between the appetitive SR + aversive IR agent and the neutral SR + IR agent (additionally examined), using the Wilcoxon rank-sum test (Mann-Whitney U test).
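A minimal sketch of this fitting procedure follows; negLogLik is a hypothetical function name standing for the negative log likelihood of a choice sequence under the model of [8] (parameter order (a1, a2, b1, b2, λ, ρ, w)), and is not part of the original code.

```matlab
% Fitting one choice sequence with fminsearch (minimal sketch).
toRaw   = @(x) (tanh(x) + 1) / 2;   % inverse of the atanh(2x - 1) transform
toTrans = @(x) atanh(2*x - 1);
obj = @(q) negLogLik([toRaw(q(1)), toRaw(q(2)), q(3), q(4), ...
                      toRaw(q(5)), q(6), toRaw(q(7))], choiceData);
opts = optimset('MaxFunEvals', 10000, 'MaxIter', 10000);
bestNLL = inf;
for w0 = [0.2, 0.5, 0.8]            % three initial values of w
    q0 = [toTrans(0.3), toTrans(0.3), 5, 5, toTrans(0.2), 0.2, toTrans(w0)];
    [qHat, nll, exitflag] = fminsearch(obj, q0, opts);
    if nll < bestNLL
        bestNLL = nll; bestQ = qHat; bestFlag = exitflag;
    end
end
if bestFlag == 1                    % normal termination of the selected case
    wEst = toRaw(bestQ(7));         % estimated weight of model-based control
end
```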

We also simulated the behavior of the original RL model [8], as well as the behavior of a modified model with an imbalanced eligibility trace, in the original reward version of the two-stage decision task. For the original model, the parameters were set to (a1, a2, b1, b2, λ, ρ, w) = (0.3, 0.3, 5, 5, 0.4, 0.2, 0.5). The modified model had two parameters, λ1 and λ2, representing the eligibility traces for positive (non-negative) and negative TD-RPEs, respectively, instead of the single parameter λ of the original model; they were set to (λ1, λ2) = (0.4, 0) to represent an imbalanced trace, and the other parameters were set to the same values as for the original model. 100 simulations were conducted for each model, and the generated choice sequences were individually fitted by the original model in the same manner as for the agent with coupled SR- and IR-based systems described above. For each choice sequence, among the three cases of fitting with different initial values of w, we again selected the case that gave the maximum log likelihood; this time the "fminsearch" function terminated with "exitflag" 1 in the selected case for all 100 simulations for both models. We again compared the obtained distributions of w between the two models using the Wilcoxon rank-sum test.

Strictly speaking, our implementation of the RL model developed in [8] and of its variant, used for the fitting and simulations mentioned above, could be slightly different from the original one. Specifically, we could not find how the model-based system in the original study estimated the transition probabilities when the actual frequencies of past occurrences of the two transition types exactly matched (for example, when each type of transition had occurred in 10 of the initial 20 trials), and we assumed that the model-based system estimated them as 50% and 50% in such situations.

In addition to the analyses mentioned above, we separately conducted 1000 simulations for each of the six cases mentioned above, and analyzed the choice pattern, i.e., calculated the proportion of stay choices (choosing the same first-stage option as at the previous trial) depending on whether reward (or punishment) was obtained and whether a common or rare transition occurred at the previous trial (we increased the number of simulations to 1000 in order to estimate the average pattern more accurately).
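The stay-proportion computation can be sketched as follows; the variable names are our own, assuming per-trial column vectors of first-stage choices, outcomes, and transition types from one simulation.

```matlab
% Stay proportions by previous-trial outcome and transition (minimal sketch).
% choice1: first-stage choices; rewarded, common: logical vectors per trial.
stay    = [false; choice1(2:end) == choice1(1:end-1)]; % stayed vs previous trial
prevRew = [false; rewarded(1:end-1)];                  % previous-trial outcome
prevCom = [false; common(1:end-1)];                    % previous-trial transition
valid   = [false; true(numel(stay) - 1, 1)];           % exclude the first trial
pStay = @(r, c) mean(stay(valid & prevRew == r & prevCom == c));
P = [pStay(true,  true), pStay(true,  false);          % after reward: common, rare
     pStay(false, true), pStay(false, false)];         % after no-reward: common, rare
```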

Simulation and fitting of the delayed feedback task

We simulated execution of the delayed feedback task [10] by the dual-system agents. There were eight stimuli, which caused +40, +10, −10, or −40 immediate feedback (denoted as +40im, etc.) or +40, +10, −10, or −40 delayed feedback given three trials later (denoted as +40de, etc.). The task consisted of six sessions, and in each session, each of six pairs of stimuli ((+10im, +40im), (+10de, +40de), (−10im, −40im), (−10de, −40de), (+10im, +40de), (−10im, −40de)) was presented ten times and each of ten pairs ((+40im, +40de), (+10im, +10de), (−10im, −10de), (−40im, −40de), (+10de, +40im), (−10de, −40im), (+10im, −10im), (+40im, −40im), (+10de, −10de), (+40de, −40de)) was presented five times, in a pseudorandom order (in the same manner as in the original study [10]).

The IR-based system of the dual-system agent learned the system-specific values of the eight actions of choosing each of the eight stimuli, updating the value of the taken action by the RPE multiplied by the learning rate (αIR+ or αIR− for positive or negative RPE, respectively). The SR-based system represented each action by the estimated discounted cumulative occupancies of successor "states". Seven states were defined based on the presented (total) feedback at each trial: S1: +80; S2: +50, +40, or +30; S3: +20 or +10; S4: 0; S5: −10 or −20; S6: −30, −40, or −50; and S7: −80. At each trial k, the "trace" of each action Ai (i = 1, …, 8), T(Ai), was updated as

T(Ai) ← γT(Ai)       (if Ai was not taken at trial k), or
T(Ai) ← γT(Ai) + 1   (if Ai was taken at trial k),

where γ was the time discount factor, set to 0.75. Then, the column of the 8×7 SR matrix M that corresponded to the state at trial k (determined by the feedback presented at trial k), Sj, was updated to equal the traces of the actions:

Mij ← T(Ai)   (i = 1, …, 8),

after M was used for value calculation and value (weight) updating (mentioned below). The other columns of M remained unchanged. The SR-based system-specific value of Ai (i = 1, …, 8) was given by the dot product of the SR feature vector for Ai, i.e., the i-th row of M, M(i), and a weight vector w, i.e., w·M(i). The weight w was updated based on the action Ai taken at trial k and the RPE δ:

w ← w + αSR+δM(i)   (when δ ≥ 0), or
w ← w + αSR−δM(i)   (when δ < 0).

Integrated action values were calculated as the average of the IR- and SR-based system-specific action values. Choice between the two actions corresponding to the two stimuli presented at each trial was made in the soft-max manner with the degree of exploitation over exploration (inverse temperature) β set to 0.25. The RPE was calculated as the difference between the (total) feedback and the integrated value of the taken action. The IR-based system-specific values for all the actions and all the elements of the weight vector for the SR-based system-specific values (w) decayed at a constant rate (0.001), i.e., were multiplied by 0.999 at each trial. The traces for the actions T(Ai), the elements of the SR matrix M, and the elements of the weight vector w were all initialized to 0. The learning rates were set as (αSR+, αSR−, αIR+, αIR−) = (0.02, 0.02, 0, 0) for the SR-only agent and (0.018, 0.002, 0.002, 0.018) for the appetitive SR + aversive IR agent. We conducted 1000 simulations for each of the two types of agents, and calculated the averages and the SEMs.
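One trial of the SR-based system in this task can be sketched as follows; the variable names are our own, and the ordering (weight update before the trace/column update) follows the description above.

```matlab
% One trial of the SR-based system in the delayed feedback task (sketch).
% a: taken action (1..8); s: state (1..7) defined by the presented feedback;
% delta: RPE; T: 8x1 traces; M: 8x7 SR matrix; w: 7x1 weights.
if delta >= 0
    w = w + alphaSRp * delta * M(a, :)';   % weight update with the taken action's features
else
    w = w + alphaSRm * delta * M(a, :)';
end
w = 0.999 * w;                             % constant decay of the weights
T = 0.75 * T;                              % decay all action traces (gamma = 0.75)
T(a) = T(a) + 1;                           % increment the trace of the taken action
M(:, s) = T;                               % overwrite the column of the current state
```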

We also fitted the choices of these two types of dual-system agents (50 simulations each, conducted separately from the abovementioned 1000 simulations) with the separate-eligibility-trace actor-critic model considered in [10]. This model had four free parameters: a (learning rate), b (inverse temperature for choice), and ν+ and ν− (decay time scales of the traces for positive and negative RPEs, respectively). Since a, ν+, and ν− should normally be between 0 and 1, these variables were transformed as atanh(2x−1), where x represents a, ν+, or ν−, in the fitting process so that the range of exploration was effectively bounded to [0, 1]. For each choice sequence, we explored the set of parameters that maximized the log likelihood [60] by using the "fminsearch" function of MATLAB, with the initial parameter values set as (a, b) = (0.0011, 2.96), which were the medians of the estimated parameters of HCs in [10], and ν+, ν− = 0.1, 0.5, or 0.9 (i.e., 3 × 3 = 9 cases were examined), and the 'MaxFunEvals' and 'MaxIter' options of this function both set to 10000. Among the nine cases with different initial values of (ν+, ν−), we selected those in which the "fminsearch" function terminated with "exitflag" 1 (meaning normal termination); if there were any such cases (true for 47 and 45 simulations for the SR-only and the appetitive SR + aversive IR agent, respectively), we selected the case that gave the maximum log likelihood among them and plotted the corresponding values of ν+ and ν− in Fig 7C.

Software and statistics

Simulations and fitting were conducted using MATLAB, and its functions "rand", "randn", and "randperm" were used to implement probabilities and pseudorandom numbers. The standard deviation (SD) of simulation data was normalized by n, and the standard error of the mean (SEM) was approximately calculated as SD/√n. The Wilcoxon rank-sum test (Mann-Whitney U test), Shapiro-Wilk normality test, and paired t test were conducted using R (functions "wilcox.test", "shapiro.test", and "t.test").
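In MATLAB terms, the SD/SEM computation corresponds to the following two-line sketch (x stands for a vector of simulation values).

```matlab
sd  = std(x, 1);             % second argument 1: SD normalized by n (not n-1)
sem = sd / sqrt(numel(x));   % SEM approximated as SD/sqrt(n)
```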

Code availability

Codes to generate/reproduce the data presented in the figures are available at: https://github.com/kenjimoritagithub/sr101.

Supporting information

S1 Data. Codes to generate/reproduce the data presented in the figures.

Please read "readme.txt" for details.

(ZIP)

Data Availability

All relevant codes are within the Supporting Information files. Codes to generate/reproduce the data presented in the figures are available at: https://github.com/kenjimoritagithub/sr101

Funding Statement

KM was supported by Grant-in-Aid for Scientific Research (No. 20H05049 and 23H03295) of the Ministry of Education, Culture, Sports, Science and Technology in Japan (MEXT) (http://www.mext.go.jp/en/) and the Japan Society for the Promotion of Science (JSPS) (https://www.jsps.go.jp/english/) and the Naito Foundation (https://www.naito-f.or.jp/en/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Robbins TW, Vaghi MM, Banca P. Obsessive-Compulsive Disorder: Puzzles and Prospects. Neuron. 2019;102(1):27–47. doi: 10.1016/j.neuron.2019.01.046
2. Dickinson A, Balleine B. Motivational control of goal-directed action. Animal Learning and Behavior. 1994;22(1):1–18.
3. Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci. 2005;8(12):1704–11. doi: 10.1038/nn1560
4. Dolan RJ, Dayan P. Goals and habits in the brain. Neuron. 2013;80(2):312–25. doi: 10.1016/j.neuron.2013.09.007
5. Gillan CM, Papmeyer M, Morein-Zamir S, Sahakian BJ, Fineberg NA, Robbins TW, et al. Disruption in the balance between goal-directed behavior and habit learning in obsessive-compulsive disorder. Am J Psychiatry. 2011;168(7):718–26. doi: 10.1176/appi.ajp.2011.10071062
6. Voon V, Derbyshire K, Rück C, Irvine MA, Worbe Y, Enander J, et al. Disorders of compulsivity: a common bias towards learning habits. Mol Psychiatry. 2015;20(3):345–52. doi: 10.1038/mp.2014.44
7. Voon V, Baek K, Enander J, Worbe Y, Morris LS, Harrison NA, et al. Motivation and value influences in the relative balance of goal-directed and habitual behaviours in obsessive-compulsive disorder. Transl Psychiatry. 2015;5:e670. doi: 10.1038/tp.2015.165
8. Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. Model-based influences on humans' choices and striatal prediction errors. Neuron. 2011;69(6):1204–15. doi: 10.1016/j.neuron.2011.02.027
9. Gillan CM, Kosinski M, Whelan R, Phelps EA, Daw ND. Characterizing a psychiatric symptom dimension related to deficits in goal-directed control. Elife. 2016;5:e11305. doi: 10.7554/eLife.11305
10. Sakai Y, Sakai Y, Abe Y, Narumoto J, Tanaka SC. Memory trace imbalance in reinforcement and punishment systems can reinforce implicit choices leading to obsessive-compulsive behavior. Cell Rep. 2022;40(9):111275. doi: 10.1016/j.celrep.2022.111275
11. Montague PR, Dayan P, Sejnowski TJ. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci. 1996;16(5):1936–47. doi: 10.1523/JNEUROSCI.16-05-01936.1996
12. Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science. 1997;275(5306):1593–9. doi: 10.1126/science.275.5306.1593
13. Russek EM, Momennejad I, Botvinick MM, Gershman SJ, Daw ND. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLoS Comput Biol. 2017;13(9):e1005768. doi: 10.1371/journal.pcbi.1005768
14. Dayan P. Improving Generalization for Temporal Difference Learning: The Successor Representation. Neural Computation. 1993;5(4):613–24.
15. Momennejad I, Russek EM, Cheong JH, Botvinick MM, Daw ND, Gershman SJ. The successor representation in human reinforcement learning. Nat Hum Behav. 2017;1(9):680–92. doi: 10.1038/s41562-017-0180-8
16. Stachenfeld KL, Botvinick MM, Gershman SJ. The hippocampus as a predictive map. Nat Neurosci. 2017;20(11):1643–53. doi: 10.1038/nn.4650
17. Garvert MM, Dolan RJ, Behrens TE. A map of abstract relational knowledge in the human hippocampal-entorhinal cortex. Elife. 2017;6:e17086. doi: 10.7554/eLife.17086
18. Russek EM, Momennejad I, Botvinick MM, Gershman SJ, Daw ND. Neural evidence for the successor representation in choice evaluation. bioRxiv. 2021. doi: 10.1101/2021.08.29.458114
19. Lei W, Jiao Y, Del Mar N, Reiner A. Evidence for differential cortical input to direct pathway versus indirect pathway striatal projection neurons in rats. J Neurosci. 2004;24(38):8289–99. doi: 10.1523/JNEUROSCI.1990-04.2004
20. Wall NR, De La Parra M, Callaway EM, Kreitzer AC. Differential innervation of direct- and indirect-pathway striatal projection neurons. Neuron. 2013;79(2):347–60. doi: 10.1016/j.neuron.2013.05.014
21. Lu J, Cheng Y, Xie X, Woodson K, Bonifacio J, Disney E, et al. Whole-Brain Mapping of Direct Inputs to Dopamine D1 and D2 Receptor-Expressing Medium Spiny Neurons in the Posterior Dorsomedial Striatum. eNeuro. 2021;8(1):ENEURO.0348-20.2020. doi: 10.1523/ENEURO.0348-20.2020
22. Frank MJ, Seeberger LC, O'Reilly RC. By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science. 2004;306(5703):1940–3. doi: 10.1126/science.1102941
23. Hikida T, Kimura K, Wada N, Funabiki K, Nakanishi S. Distinct roles of synaptic transmission in direct and indirect striatal pathways to reward and aversive behavior. Neuron. 2010;66(6):896–907. doi: 10.1016/j.neuron.2010.05.011
24. Kravitz AV, Tye LD, Kreitzer AC. Distinct roles for direct and indirect pathway striatal neurons in reinforcement. Nat Neurosci. 2012;15(6):816–8. doi: 10.1038/nn.3100
25. Tai LH, Lee AM, Benavidez N, Bonci A, Wilbrecht L. Transient stimulation of distinct subpopulations of striatal neurons mimics changes in action value. Nat Neurosci. 2012;15(9):1281–9. doi: 10.1038/nn.3188
26. Iino Y, Sawada T, Yamaguchi K, Tajiri M, Ishii S, Kasai H, et al. Dopamine D2 receptors in discrimination learning and spine enlargement. Nature. 2020;579(7800):555–60. doi: 10.1038/s41586-020-2115-1
27. Lee SJ, Lodder B, Chen Y, Patriarchi T, Tian L, Sabatini BL. Cell-type-specific asynchronous modulation of PKA by dopamine in learning. Nature. 2021;590(7846):451–6. doi: 10.1038/s41586-020-03050-5
28. Brown JW, Bullock D, Grossberg S. How laminar frontal cortex and basal ganglia circuits interact to control planned and reactive saccades. Neural Netw. 2004;17(4):471–510. doi: 10.1016/j.neunet.2003.08.006
29. Silver MR, Grossberg S, Bullock D, Histed MH, Miller EK. A neural model of sequential movement planning and control of eye movements: Item-Order-Rank working memory and saccade selection by the supplementary eye fields. Neural Netw. 2012;26:29–58. doi: 10.1016/j.neunet.2011.10.004
30. Collins AG, Frank MJ. Opponent actor learning (OpAL): modeling interactive effects of striatal dopamine on reinforcement learning and choice incentive. Psychol Rev. 2014;121(3):337–66. doi: 10.1037/a0037015
31. Mikhael JG, Bogacz R. Learning Reward Uncertainty in the Basal Ganglia. PLoS Comput Biol. 2016;12(9):e1005062. doi: 10.1371/journal.pcbi.1005062
32. Morita K, Kawaguchi Y. A Dual Role Hypothesis of the Cortico-Basal-Ganglia Pathways: Opponency and Temporal Difference Through Dopamine and Adenosine. Front Neural Circuits. 2019;12:111. doi: 10.3389/fncir.2018.00111
33. Skvortsova V, Palminteri S, Pessiglione M. Learning to minimize efforts versus maximizing rewards: computational principles and neural correlates. J Neurosci. 2014;34(47):15621–30. doi: 10.1523/JNEUROSCI.1350-14.2014
34. Groman SM, Keistler C, Keip AJ, Hammarlund E, DiLeone RJ, Pittenger C, et al. Orbitofrontal Circuits Control Multiple Reinforcement-Learning Processes. Neuron. 2019;103(4):734–46.e3. doi: 10.1016/j.neuron.2019.05.042
35. Morita K, Shimomura K, Kawaguchi Y. Opponent learning with different representations in the cortico-basal ganglia circuits. eNeuro (accepted). 2022. doi: 10.1101/2021.10.29.466375 (bioRxiv)
36. Sutton RS, Barto AG. Reinforcement Learning: An Introduction (Second Edition). Cambridge, MA: MIT Press; 2018.
37. Shimomura K, Kato A, Morita K. Rigid reduced successor representation as a potential mechanism for addiction. Eur J Neurosci. 2021;53(11):3768–90. doi: 10.1111/ejn.15227
38. Morita K, Kato A. Striatal dopamine ramping may indicate flexible reinforcement learning with forgetting in the cortico-basal ganglia circuits. Front Neural Circuits. 2014;8:36. doi: 10.3389/fncir.2014.00036
39. Kato A, Morita K. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation. PLoS Comput Biol. 2016;12(10):e1005145. doi: 10.1371/journal.pcbi.1005145
40. Lehnert L, Tellex S, Littman ML. Advantages and Limitations of using Successor Features for Transfer in Reinforcement Learning. arXiv. 2017:arXiv:1708.00102v1.
41. Sharp PB, Dolan RJ, Eldar E. Disrupted state transition learning as a computational marker of compulsivity. Psychological Medicine. 2021:1–11. doi: 10.1017/S0033291721003846
42. Worbe Y, Palminteri S, Savulich G, Daw ND, Fernandez-Egea E, Robbins TW, et al. Valence-dependent influence of serotonin depletion on model-based choice strategy. Mol Psychiatry. 2016;21(5):624–9. doi: 10.1038/mp.2015.46
43. Ohmura Y, Iwami K, Chowdhury S, Sasamori H, Sugiura C, Bouchekioua Y, et al. Disruption of model-based decision making by silencing of serotonin neurons in the dorsal raphe nucleus. Curr Biol. 2021;31(11):2446–54.e5. doi: 10.1016/j.cub.2021.03.048
44. Gruner P, Anticevic A, Lee D, Pittenger C. Arbitration between Action Strategies in Obsessive-Compulsive Disorder. Neuroscientist. 2016;22(2):188–98. doi: 10.1177/1073858414568317
45. Kanen JW, Ersche KD, Fineberg NA, Robbins TW, Cardinal RN. Computational modelling reveals contrasting effects on reinforcement learning and cognitive flexibility in stimulant use disorder and obsessive-compulsive disorder: remediating effects of dopaminergic D2/3 receptor agents. Psychopharmacology (Berl). 2019;236(8):2337–58. doi: 10.1007/s00213-019-05325-w
46. Marzuki AA, Tomic I, Ip SHY, Gottwald J, Kanen JW, Kaser M, et al. Association of Environmental Uncertainty With Altered Decision-making and Learning Mechanisms in Youths With Obsessive-Compulsive Disorder. JAMA Netw Open. 2021;4(11):e2136195. doi: 10.1001/jamanetworkopen.2021.36195
47. Suzuki S, Zhang X, Dezfouli A, Braganza L, Fulcher BD, Parkes L, et al. Individuals with problem gambling and obsessive-compulsive disorder learn through distinct reinforcement mechanisms. PLoS Biol. 2023;21(3):e3002031. doi: 10.1371/journal.pbio.3002031
48. Balleine BW. Neural bases of food-seeking: affect, arousal and reward in corticostriatolimbic circuits. Physiol Behav. 2005;86(5):717–30. doi: 10.1016/j.physbeh.2005.08.061
49. Daw ND, Kakade S, Dayan P. Opponent interactions between serotonin and dopamine. Neural Netw. 2002;15(4–6):603–16. doi: 10.1016/s0893-6080(02)00052-7
50. Rouhani N, Wimmer GE, Schneier FR, Fyer AJ, Shohamy D, Simpson HB. Impaired generalization of reward but not loss in obsessive-compulsive disorder. Depress Anxiety. 2019;36(2):121–9. doi: 10.1002/da.22857
51. Menegas W, Akiti K, Amo R, Uchida N, Watabe-Uchida M. Dopamine neurons projecting to the posterior striatum reinforce avoidance of threatening stimuli. Nat Neurosci. 2018;21(10):1421–30. doi: 10.1038/s41593-018-0222-1
52. Akiti K, Tsutsui-Kimura I, Xie Y, Mathis A, Markowitz JE, Anyoha R, et al. Striatal dopamine explains novelty-induced behavioral dynamics and individual variability in threat prediction. Neuron. 2022;110(22):3789–804.e9. doi: 10.1016/j.neuron.2022.08.022
53. Tsutsui-Kimura I, Uchida N, Watabe-Uchida M. Dynamical management of potential threats regulated by dopamine and direct- and indirect-pathway neurons in the tail of the striatum. bioRxiv. 2022. doi: 10.1101/2022.02.05.479267
54. Matsumoto M, Hikosaka O. Two types of dopamine neuron distinctly convey positive and negative motivational signals. Nature. 2009;459(7248):837–41. doi: 10.1038/nature08028
55. Ishino S, Kamada T, Sarpong GA, Kitano J, Tsukasa R, Mukohira H, et al. Dopamine error signal to actively cope with lack of expected reward. Sci Adv. 2023;9(10):eade5420. doi: 10.1126/sciadv.ade5420
56. Clark DA, Guyitt BD. Pure Obsessions: Conceptual Misnomer or Clinical Anomaly? In: Abramowitz JS, McKay D, Taylor S, editors. Obsessive-Compulsive Disorder Subtypes and Spectrum Conditions. Elsevier; 2007. p. 53–75.
57. Gillan CM, Sahakian BJ. Which is the driver, the obsessions or the compulsions, in OCD? Neuropsychopharmacology. 2015;40(1):247–8. doi: 10.1038/npp.2014.201
58. Ognibene D, Fiore VG, Gu X. Addiction beyond pharmacological effects: The role of environment complexity and bounded rationality. Neural Netw. 2019;116:269–78. doi: 10.1016/j.neunet.2019.04.022
59. Foa EB, Huppert JD, Leiberg S, Langner R, Kichic R, Hajcak G, et al. The Obsessive-Compulsive Inventory: development and validation of a short version. Psychol Assess. 2002;14(4):485–96.
60. Daw ND. Trial-by-trial data analysis using computational models. In: Delgado M, Phelps EA, Robbins TW, editors. Decision Making, Affect, and Learning, Attention and Performance XXIII. Oxford: Oxford University Press; 2011.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011206.r001

Decision Letter 0

Daniele Marinazzo, Stefano Palminteri

26 Feb 2023

Dear Dr Morita, 

Thank you very much for submitting your manuscript "Opponent Learning with Different Representations in the Cortico-Basal Ganglia Pathways Can Develop Obsession-Compulsion Cycle" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

As you will see, the reviewers raised many major points, which we ask you to take carefully into account while revising the manuscript. I also encourage you to take these guidelines into account while preparing the rebuttal letter: https://psyarxiv.com/kyfus

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Stefano Palminteri

Academic Editor

PLOS Computational Biology

Daniele Marinazzo

Section Editor

PLOS Computational Biology

***********************

As you will see, the reviewers raised many major points, which we ask you to take carefully into account while revising the manuscript. We also suggest that, in the revised manuscript, you not separate the figures from the main text or the captions from the figures, as this makes the paper much more challenging to read through. I also encourage you to take these guidelines into account while preparing the rebuttal letter: https://psyarxiv.com/kyfus

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Here, the authors attempt to reconcile two current theories underlying obsessive-compulsion - the impairment of the model-based system with a shorter memory trace for negative (than positive) PE. They simulated a model with an agent containing an SR (model-based) and IR (model-free) system, and adjusted its learning based on negative and positive PE at different rates. They found that if the agent's SR system learned primarily from positive PEs, while the IR system learned primarily from negative PEs, the agent would result in an increased looping of the obsession-compulsion states. Further, they simulated the behavior of this (SR+IR) agent completing the two-step task, vs an agent that only has the SR system. They found that the SR+IR agent had lower model-based control than the SR agent. Overall, the authors suggest that the opponent learning of positive PEs for SR and negative PEs for IR may underlie obsession-compulsion.

I thank the authors for an interesting manuscript. I thought the work was quite intriguing, and is indeed a step towards delineating mechanisms underlying OCD, extending both the goal-directed theory and the memory trace theory. I also appreciate the many details that went into the work. I do think the manuscript could do with some tweaks to the grammar and overall rewriting to increase the readability. The results would benefit from better signposting on the motivation of the analyses; especially since the methods are at the end, more context is required. Finally, there are a lot of dense concepts which are dropped with minimal explanation and referred to back and forth, where further elaboration/explanation would help to hone in on the key messages put forward. More specific comments are as follows:

Major points

1) The introduction starts off very curt, stating the two theories under investigation of the paper, but does not really explain why the theories are related to OCD. How is impaired goal-directed control related to OCD symptoms? Perhaps some reference to dual system theories might help to set the scene. How does having a short memory trace for negative PEs also contribute to OCD? How does it affect learning and therefore behavior (e.g., lower punishment learning, therefore make more wrong actions)?

2) It would be more cohesive if SR and IR were introduced in parallel to model-based/model-free - currently there is a link between SR and model-based control, but IR link to model-free only comes into the last paragraph

3) "Moreover, we also expected that fitting of behavior of the agent with combined appetitive SR- and aversive IR-based systems could potentially result in smaller weights of model-based control than the agent with SR-based system only in the two-stage decision task" - isn't this automatically expected because SR, as the point was made above, is a model-based system, and thus would have greater model-based control vs a SR-IR (includes model-free) system?

4) It would help if Figure 1A included the outcomes, to contrast that the IR has a direct representation of A1 to final outcome.

5) "whereas eligibility-trace imbalance per se would not bias the estimation of the degree of model-based control." - so is the point that it requires both the model-based and model-free system, in combination with a eligibility-trace imbalance, to underlie OCD symptoms (vs just an imbalanced trace in the model-based system)? This seems to a main finding, and is not very well motivated in the introduction, so the analysis/results come as a surprise.

6) The description of the environmental model was fairly clear, although in its figure, the loop in the obsession state to itself is confusing because the "other" label is very close to it

7) Behavior of the dual-system agent in the environmental model - authors refer to Figure 3 indicating what the figure shows generally, but not its trends i.e. "Figure 3A-top shows an example of the time evolution of the difference in the (integrated) action values of "abnormal reaction" and "other" at the relief state (Qabnormal − Qother@relief)." but instead could be described as "proportion of obsession state stays low/below threshold, indicating that the agent minimized selection...etc, etc". I think it would help in actually describing the results in the text, because with the current descriptions it does not tell me anything except that I have to refer to the figures. Also the bottom Figure 3 y-axes should indicate it is the average of the middle one; it is a bit misleading to have them the same y axis label.

8) Similar comments as above to "Figure 3D show the results."

9) In Figure 6, y-axes are missing, and it would be nice to add a distribution curve since the authors are comparing distributions

10) "As shown in the left panels of Figure 6A,B, the distribution of the estimated w for the agent with appetitive SR- and aversive IR-based systems (Figure 6B) was smaller than that for the agent with SR-based system only (Figure 6A)." Consider rephrasing "smaller distribution"; does this refer to the variance/mean peak? Looks like Figure 6B ranges from 0 to 1, similar to Figure 6A.

11) "In contrast, fitting of choices generated by the original RL model with balanced eligibility trace (λ = 0.4 for both positive and negative RPEs) and those generated by a modified model with imbalance trace (λ = 0.4 and 0 for positive and negative RPEs, respectively), both of which had the same degree of model-based control (w = 0.5), did not result in smaller estimated w in the latter, as shown in the left panels of Figure 6C,D" - This analysis needs to be more clearly motivated. What is the point of testing the eligibility trace difference after the SR-IR system one, and what is expected from this? Also a bit unclear on if both models were specified with the same degree of w, why would it possibly result in differing w?

12) I wonder if it would be a fairer comparison to contrast the appetitive SR and aversive IR model with other versions of the SR-IR model (e.g., differing with pos/neg PE learning) like the models in the paragraph above. This could highlight the contribution of the appetitive/aversive component of the SR-IR model, which presumably is the main aspect that underlies the OC cycle.

13) "We further analyzed the choice pattern of each type of agent. The agent with appetitive SR and aversive IR-based systems made a model-based-type choice, taking into account whether the occurred state transition was the common type or rare type, after reward was obtained but much less so after reward was not obtained (Figure 6B, right panel). In contrast, the SR-only agent, as well as the agents with balanced or imbalanced eligibility trace, made a model-based-type choice regardless of whether reward was obtained at the previous trial (Figure 6A,C,D, right panels)." This paragraph requires rephrasing and actual explanation of what the pattern of choices mean. Model-based behavior dominates in Figures 6A, C and D vs the influence of model-free in Figure 6B- it is inaccurate to call it "making model based choice", because it is not about a single choice but stay probability over all trials.

14) Overall, the study presents some convincing evidence for the role of appetitive SR and aversive IR in underlying the OC cycle, particularly the demonstration of how the loop gets stuck in the obsession-compulsion states. However, I think there could be more comparisons among the models applied to the two-step task; the differences between the SR-only system and the appetitive SR/aversive IR system have many components that could contribute to the difference in behavior, and it would be clearer to delineate them.

15) There is some nuance in the expression of OCD that is not fully taken into account by the current model. OCD patients present varying levels of obsessions vs compulsions, e.g., purely obsessional/compulsive OCD, whereas the current model assumes a cycle from the obsessive state to the compulsive state (to resolve distress) in which both states are necessary. I wonder how these other presentations could be accounted for. Another point to raise is that there is a hypothesis borne specifically from the goal-directed work on compulsions (Gillan & Sahakian, Neuropsychopharmacology, 2015), suggesting that obsessions in OCD might instead arise as a result of compulsive behavior - what do the authors think of this in terms of their current work?

16) "Specifically, we could not find how the model-based system estimated the transition probabilities when the frequencies of past occurrence of both types of transitions exactly matched in the original study, and we assumed that the model-based system estimated them as 50% and 50% in such situations" - Not quite sure what the authors mean here - common vs rare transition is 70/30%, and so the occurrence of the transitions should match that as well? In terms of simulations, the transition matrix would start with 50% for all types of transitions and updates trial by trial based on the choices made in the task.

Minor points

17) The authors refer to 'agent' without an article (an/the) throughout the text; it reads a bit awkwardly without one.

18) "Specifically, modeling showed that obsession-compulsion cycle can be enhanced in agent having much shorter memory (eligibility) trace for negative than positive prediction errors (PEs), and fitting of behavioral choices suggested that OCD patients indeed had such an imbalance as compared to healthy controls (HCs), although even HCs tended to show shorter memory trace for negative than positive PEs." This sentence is way too long and is very confusing. Consider splitting it into more sentences.

19) "...this possibility explains why human is prone to OCD and even healthy people tended to show shorter memory trace for negative PEs." - should be 'humans are'. More broadly I think this these two conclusions can be phrased better.

20) "and fitted them by the RL model weighing model-based and model-free controls developed in the original study" - it should just be 'control' (controls can be misinterpreted as healthy controls)

Reviewer #2: The manuscript describes an interesting integrative model of OCD that has the potential to explain both clinical phenomena (persistent obsessions) and well-known two-stage task findings, based on the idea that patients' model-based system (or, more specifically, the SR) learns disproportionately more from positive RPEs, whereas their model-free system learns more from negative RPEs. I applaud the authors for the attempt to integrate previous computational models of OCD – the field is in desperate need of such endeavors. Furthermore, the model is sophisticated and appears to be biologically plausible.

That being said, I believe that several major conceptual, empirical, and style-related issues limit the potential contribution and impact of the manuscript.

First, the model is justified chiefly based on some overlapping neurobiological findings concerning different distinctions between the direct and indirect pathways of the basal ganglia. Although this seems reasonable, I believe the manuscript/model lacks sufficient intuitive psychological/clinical justification. What is the psychological interpretation of the model? Is it the idea that OCD patients are more likely to start obsessing because they essentially overweight the probability that they will be able to 'exit' the obsessing cycle via compulsions (such that the SR/MB value of entering it is not as negative as it could have been)? Whether this is the core idea or a different explanation would be more precise, I believe such an explanation must be repeatedly spelled out in the introduction and in the interpretation of the results. Insufficient interpretability is also evident, for example, regarding the finding that the alternative model with 8 'other' actions does not lead to obsessive phenomenology (also, can this result be attributed specifically to the existence of many 'other' actions starting from the relief state or, alternatively, the 'other' actions starting from the obsession state?).

In general, such interpretability is important for two main reasons. First, although this is a computational paper and a computational journal, I believe its conceptual assumptions/conclusions must be interpretable by OCD experts lacking the extensive technical background the current conceptualization requires. Second, spelling out the conceptual assumptions/conclusions will enhance the ability to evaluate the psychological plausibility of the model.

This raises a second issue that focuses on the specificity of the results to the specific environmental model depicted in Figure 2. A fundamental assumption (or maybe a result) of the model is that the mechanism of obsessive-compulsive behavior is related to a higher probability of having an 'abnormal reaction' or, in other words, a higher probability of experiencing obsessions. This is interesting but also somewhat controversial. Some classical models of OCD would argue that obsessions (or intrusive thoughts) occur universally and that the core problem in OCD is either a higher probability of staying in the obsessional state or, alternatively, a higher probability of using compulsions to exit it. Is there any evidence for such 'behaviors' in the model? For example, does the `OCD agent` use compulsions as a main policy (vs. 'other') to exit the obsessional state? If we examine the two agents (with/without OCD) only in cases/steps in which they have entered the obsessional state - is the OCD agent more likely to use compulsions to exit it? Furthermore, is the 'OCD agent' more prone to 'get stuck' in the obsession state, or is the general increase in obsession caused by intermittent obsession-compulsion cycles? More generally - how do excessive learning from positive SR RPEs and reduced learning from positive IR RPEs affect policies starting from the obsession state?

Third, an interesting result of the simulations is that the probability of experiencing an obsession, even in an `agent with OCD`, never surpasses 50%. This raises two questions. First – how plausible is this result? Oftentimes, OCD symptoms can become so dominant in a specific context that patients will tend to experience obsessions and emit a compulsion close to 100% of the times they find themselves in this context (e.g., while handwashing, or when leaving the house, etc.). Second – what does this entail regarding the theoretical role of the temperature parameter in the model? Isn't it the case that a nearly-deterministic model here will never move to the obsessional state?
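The point about the temperature can be made concrete with a small sketch (illustrative values only, not the paper's parameters): under a softmax policy, the probability of the low-valued "abnormal reaction" shrinks toward zero as the inverse temperature grows.

```python
import numpy as np

# Probability of the "abnormal reaction" under softmax, for illustrative
# action values (Q_abnormal = -1, Q_other = 0) and inverse temperature beta.
def p_abnormal(q_abnormal, q_other, beta):
    q = beta * np.array([q_abnormal, q_other])
    p = np.exp(q - q.max())  # subtract max for numerical stability
    return (p / p.sum())[0]

for beta in [0.5, 2.0, 10.0]:
    print(beta, p_abnormal(-1.0, 0.0, beta))
# ~0.38, ~0.12, ~0.00005: a near-deterministic agent almost never takes the
# abnormal reaction, consistent with the concern raised above.
```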

Fourth, I believe the ability of the model to provide an alternative explanation for the common finding of reduced model-based learning in the two-step task is intriguing and important. I also applaud the researchers for discussing the limitations of this alternative interpretation vis-à-vis other findings in this literature. I have two questions here. First, can the model also explain findings concerning model-based learning in simple structure learning tasks (e.g., Sharp, Dolan & Eldar, 2021, Psychological Medicine)? Second, it would help the paper dramatically if the authors were able to fit their model to actual two-step task data (some are publicly available, I believe, e.g., https://osf.io/usdgt/)

Finally, I have to say that I found the description of the models hard to follow. The methods section could benefit from a more detailed, step-by-step exposition of the models. From a purely stylistic point-of-view – it would be much easier to read and understand the equations if they were presented on separate lines, where all components are explained.

A few minor comments:

• The introduction could benefit if the relatively new ‘trace imbalance model’ were explained in more detail. I realize that it was developed and validated in a separate paper, but IMO, the readers should be able to understand it without having to read that other paper.

• Figure 2 could benefit if the relevant parameters controlling the probability of abnormal reaction/compulsion were explicitly stated (and thus linked to the equations).

• Some intuitive explanation regarding the 'weights' of the SR system mentioned in the results (and figures) could be helpful.

• On page 8, the authors say, "Now, assume that the agent, developing enhanced obsession-compulsion cycle, exists at the relief state and happens to take the "other" action repeatedly, without taking the abnormal reaction. It is in fact the optimal policy, and because no reward or punishment is given as long as the agent only takes the "other" action, the value of the "other" action should approach, through RPE-based update, to 0, which is equal to the optimal value. However, this value, 0, is lower than the positive value of the abnormal reaction (even though it slowly decays)". This is an interesting result. My main question is whether we could also predict the opposite in the case of an agent that chooses the abnormal reaction repeatedly?

• The idea that the learning rate biases in the SR/IR systems are ‘evolutionarily selected’ is interesting, but I couldn’t understand the exact proposed function of this setting. The authors cite a paper examining this in dynamic environments, but it would be helpful if they could briefly explain the logic here.

• In my opinion, the limitations section, and especially the second limitation, is too specific. I believe this section would benefit if it focused on more conceptual-level limitations (e.g., what clinical/psychological phenomena can the model not explain? To what extent are the simulation results dependent upon specific fixed parameters or settings of the model's environment?).

Reviewer #3: The paper by Sato and colleagues proposes a computational reinforcement learning model of obsessive-compulsive disorder symptoms. The model proposes a reinforcement learning agent that learns to predict action values from multiple representations: one over successor representations (SR) and one over individual representations (IR). The key feature of the model, which the paper proposes can underlie obsessive-compulsive symptoms, is that learners for the SR and IR representations differentially learn from positive and negative reward prediction errors. Specifically, the SR learns more from positive reward prediction errors, and the IR learns more from negative reward prediction errors.

The claim of the paper is that this model can link together two separate findings relating to decision-making and OCD symptoms. The first is the finding that individuals with OCD have higher eligibility traces for positive compared to negative learning. The authors account for this finding by showing that, like a model with imbalanced eligibility traces, their model can also generate obsessive-compulsive cycles in a toy two-state MDP example.

The second is that individuals with OCD show less model-based learning on the two-step task. This is accounted for by showing that their model demonstrates less model-based behavior than an agent that decides just using the SR.

Overall, I think the attempt to relate compulsive symptoms to the types of representations that individuals employ in reinforcement learning is commendable and could also be a fruitful approach. However, I have a number of critiques and suggestions for further work regarding the specific simulations presented in the paper.

1. Simulation of Sakai et al., 2022. The first part of the paper attempts to offer a re-interpretation of previous findings from Sakai et al., 2022, which argued that individuals with OCD have a higher eligibility trace for positive compared to negative prediction errors. The current paper argues that a model where an SR learner has a higher learning rate for positive prediction errors compared to negative prediction errors, and an IR learner has the opposite pattern, can generate equivalent behavior. This is demonstrated by simulating behavior in a toy MDP and showing the model can develop obsessive-compulsive cycles that were generated in Sakai et al., 2022.

I think that to make the argument that an SR + IR learner can account for the model of Sakai et al., 2022, it is also necessary to demonstrate that the new model can generate the behavior of OCD individuals (and variation between OCD and healthy controls) from the empirical study presented in Sakai et al., 2022, which actually provides evidence for the eligibility trace account. Can an SR and IR based learner with different learning rates for positive versus negative prediction errors explain how controls varied from individuals with OCD in the experiments that were run in Sakai et al., 2022? This should be demonstrated with simulation if the authors want to claim that their new model can capture what the previous model was suggested to capture.

2. Updates for SR system. I think something is going a bit wrong with how the SR learning is implemented, which is leading it to somewhat bizarre behavior in the simulation of the obsession-compulsion cycle. This is clear from the weights and value estimates that are being learned by the SR, plotted in Fig. 4B and D. In particular, it should not be possible for the value of any action to be above 0, since there are only negative rewards in the task.

My guess is that the cause of such divergent large weights is in how weights for the SR-based system are being updated, and that the representation is not being appropriately scaled for the update. This in turn is causing learning rates that are effectively above 1. Positive prediction errors with learning rates greater than 1 can cause the SR learner to form positive value estimates even when there are no positive rewards in the task.

To explain this more technically: the equation that is used to learn weights for the SR system is w_new <- w_old + M_transpose*delta*alpha, where M is the feature vector (a row vector here) for the state from which the action was taken, delta is the reward prediction error, and alpha is the learning rate. Typically, when implementing this update, it is necessary to scale the feature vector for the update. Specifically, M needs to be divided by its dot-product with itself (M*M_transpose). This is reported in the methods section of Russek et al., 2017, PLOS Computational Biology, which uses this update to simulate some aspects of MB behavior. Without scaling the feature vector, the learning rate basically loses meaning, and it can wind up effectively being above one.

To explain this, consider the usual TD update, for tabular TD:

V_new <- V_old + alpha*delta,

where alpha is the learning rate and delta is the prediction error. Here, alpha has an interpretation as the amount that V changes as a proportion of the prediction error:

(V_new - V_old) = alpha*delta.

For the linear-feature function approximation, applying the same idea, we have

(V_new - V_old) = M*w_new - M*w_old.

For the update, w_new <- w_old + M_transpose*delta*alpha, if M is not scaled appropriately, the net change to V after the update is:

(V_new - V_old) = M*w_new - M*w_old

= M(w_new - w_old) -- factoring out M

= M(w_old + M_transpose*delta*alpha - w_old) -- replacing w_new with the update rule

= M*M_transpose*delta*alpha -- simplifying and multiplying in M

So, effectively, the change in V is no longer determined by the learning rate alpha, but rather by M*M_transpose*alpha. Thus, for alpha to have the same meaning it has in tabular TD, the feature vector needs to be divided in the update by M*M_transpose.

Without this scaling, when components of M are above 1 (which is what happens in tasks where states can be visited multiple times, as in the OCD example), the effective learning rate can take on values greater than 1. This can cause the SR system's values to go above 0 in tasks where all rewards are negative. This also, I think, is what causes the weights to not converge, but rather keep growing over time.
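To make the proposed fix concrete, here is a minimal sketch (hypothetical variable names, not the authors' code) contrasting the unscaled update with one in which the feature vector is divided by its dot product with itself:

```python
import numpy as np

def sr_value_update(w, M, delta, alpha, scaled=True):
    """Update value weights w for SR feature row M, RPE delta, rate alpha."""
    if scaled:
        # Dividing by M @ M ensures the value of the current state moves by
        # exactly alpha * delta, as in tabular TD.
        return w + alpha * delta * M / (M @ M)
    # Unscaled: the effective learning rate is alpha * (M @ M), which can
    # exceed 1 when components of M are above 1 (revisited states).
    return w + alpha * delta * M

M = np.array([2.0, 0.5, 0.2])  # SR row for a state that is revisited often
w = np.zeros(3)
delta, alpha = -1.0, 0.5       # a negative RPE (punishment only), rate 0.5

print(M @ sr_value_update(w, M, delta, alpha, scaled=True))   # -0.5, as intended
print(M @ sr_value_update(w, M, delta, alpha, scaled=False))  # -2.145, overshoots
```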

It is possible that doing these sorts of non-scaled divergent updates might be a hypothesis for some aspects of compulsive behavior, but I think it is not the one that the authors intend to test in the paper. I think they should thus try to re-do the simulations with scaled updates.

Alternatively, if this scaling is not the reason for the SR weights to grow so divergently, I think it is important for the authors to discover why the SR behaves so erratically in this task. This is necessary to understand how the relevant obsessive-compulsive behavior is being produced.

3. Two-step task simulation. For the two-step task simulation, the authors compare their proposed model of compulsion (SR + IR with asymmetric updating) to a model which only uses the SR. They find that SR + IR with asymmetric updating produces less model-based behavior than SR alone.

This finding, however, is not surprising, and does not provide evidence that the proposed model (SR + IR with asymmetric updating) produces OCD behavioral patterns in the two-step task. Applied to the two-step task, the SR functions like a model-based learner and the IR functions like a model-free learner. The simulations thus effectively compare a system which combines MB (here SR) and MF (here IR) to a system that only uses MB (SR). The simulations thus don't test the need for asymmetric updating in producing OCD-like behavior (less MB). To make the argument that the proposed model generates observed OCD behavior, the authors need to compare SR and IR with asymmetric updating to SR and IR with neutral updating, and see whether the former produces less MB behavior, in a manner consistent with findings on behavior from OCD individuals. This is needed to validate the claim that asymmetric updating of the SR vs IR systems generates compulsive symptoms.

4. Possible additional empirical analysis. This is a suggestion that the authors do not need to complete to make a publishable paper, but I think it would make the paper much stronger, so they might consider it. The two-step task simulations have clear predictions of what compulsive behavior in the two-step task should look like (more MB behavior following rewards compared to non-rewards). This prediction could be tested in existing public datasets of the two-step task. Specifically, data from Gillan et al., 2016, which analyzed how two-step behavior varies with self-reported compulsivity symptoms, is publicly available at https://osf.io/usdgt/. It should be possible to re-analyze this data and determine whether OCD symptoms are associated with more asymmetric updating of an IR + SR learner (or, alternatively, whether there is more MB behavior following reward compared to non-reward events in individuals higher in self-reported OCD symptoms). Such an empirical result would make the argument in the paper much stronger.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No: I wasn't able to find the supplementary material (which should include code), so I answered 'no' to the question regarding code/data availability below. But perhaps this is just a technical issue in the system.

Reviewer #3: None

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011206.r003

Decision Letter 1

Daniele Marinazzo, Stefano Palminteri

8 May 2023

Dear Dr Morita, 

Thank you very much for submitting your manuscript "Opponent Learning with Different Representations in the Cortico-Basal Ganglia Pathways Can Develop Obsession-Compulsion Cycle" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

While all the reviewers appreciated your efforts and clarifications, two reviewers (1 and 3) still present significant and reasonable doubts concerning the potential impact and solidity of your arguments. So, unfortunately, we cannot accept the paper at this stage, and we have to ask you to revise it further while taking into account the remaining points of the reviewers (especially 1 and 3). We apologize for the quite long delay and the additional load of work that this will require, but it is in line with PLOS Computational Biology standards and needed to make sure that the paper delivers on its promises. We specifically stress the following points:

1) the extent to which the features of the model contribute to the model behavior and the specific pattern of interest (OCD cycles);

2) the stability of the results as a function of the choice of the parameters.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Stefano Palminteri

Academic Editor

PLOS Computational Biology

Daniele Marinazzo

Section Editor

PLOS Computational Biology

***********************


Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I thank the authors for their comprehensive revision - the introduction sets the context and background with clear motivations and hypotheses, and the results were also signposted and explained well. I have no further questions on the points that were raised previously, only minor questions/suggestions:

1. The addition of the inverse temperature analysis is quite interesting. I think there should be a little further elaboration of the role/effect of the temperature parameter in this context - does this mean individual differences in exploration/exploitation could predict whether the agent (and presumably, a human) would go into the obsessional state/develop OCD?

2. In the analysis of the choice patterns of the agents (Figure 6), I wonder if it would be helpful to report some statistics, either with the P(Stay) difference or the reward*transition effect from the LMM (i.e., Stay ~ Reward * Transition + (Reward * Transition + 1 | Subject)), as the differences in the stay probability graphs are quite small.
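For concreteness, a minimal sketch (synthetic data and hypothetical column names) of the suggested mixed model in Python; a logistic mixed model would suit the binary stay outcome better, but the linear version is shown because statsmodels supports it directly:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "subject": rng.integers(0, 20, n),
    "reward": rng.integers(0, 2, n),
    "transition": rng.integers(0, 2, n),  # 1 = common, 0 = rare
})
# Synthetic stay choices with a small reward x transition interaction
df["stay"] = (0.6 + 0.1 * df["reward"] * df["transition"]
              > rng.random(n)).astype(int)

m = smf.mixedlm("stay ~ reward * transition", df,
                groups=df["subject"], re_formula="~reward * transition")
res = m.fit()
print(res.summary())  # the reward:transition coefficient indexes MB control
```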

Reviewer #2: I appreciate the authors' considerable efforts in revising the manuscript, particularly the additional analyses examining the questions the other reviewers and I raised. Indeed, the additional analyses and revisions clarify and improve the manuscript.

However, these important clarifications and analyses also emphasize some considerable limitations of the proposed model, specifically with regard to its conceptual and empirical foundations.

From a conceptual point of view, the key 'psychological' mechanism through which the model explains increased obsessions seems to rely on a somewhat peculiar logic. As the authors now clarify, the model explains excessive 'entering into an obsessional state' as driven by overgeneralization of the pleasantness of relief from obsession. Such a motivation for obsessions seems strange, and can be equated, in a different context, to a model that chooses to put its hand in the fire because it 'overgeneralizes the pleasantness of eventually taking it out of the fire'.

Furthermore, as the authors now clarify, this behavior seems to depend on several seemingly arbitrary settings of the simulations. Specifically, the model's choice to enter the obsessive state depends on the lack of enough alternative actions and on sufficient stochasticity (i.e., temperature). Along the same lines, I would assume that if the value of the other options were to increase (i.e., if the model had additional, *rewarding enough* actions to choose from), this would also eliminate the obsessive behavior of the model. So, in other words, in the illustrative 'fire' context this model would 'randomly decide to put its hand in the fire because it doesn't have enough rewarding (interesting?) alternatives, and because the suffering entailed by this behavior is outweighed by the relief of eventually taking the hand out of the fire'.

Of course, the fact that this logic appears peculiar to me is not conclusive evidence against it. However, I believe the authors also do not provide sufficient empirical evidence for the model. Yes, the model can explain the results of the delayed feedback task, but so can the original 'eligibility trace' model. Yes, the model predicts reduced model-based behavior, but so does a 'neutral' SR-IR model (as correctly raised by Reviewer 3 and agreed to by the authors), and, as it seems from Figure 9, an aversive SR + appetitive IR model. This latter model also appears to better explain the data from Voon (ref 7 in the manuscript) and, after some speculation, the Gillan data (ref 9 in the manuscript). But while this reversed model can, it seems, explain empirical data (as a side note, I think fitting the model to the Gillan data would be a much stronger proof here), whether it would predict obsessive behavior in the environment the authors use is questionable and, at the very least, should be examined.

Reviewer #3: I commend the authors for having done substantial work in their revision. In regard to my specific critiques, I think the simulation of Sakai et al. is mostly compelling. I also think the authors have addressed my concern about why the SR weights diverge. However, my concerns over the two-step task simulations have increased.

The key claim of the paper is that OCD symptoms could be generated by a model combining SR and IR with asymmetric learning rates for positive and negative prediction errors, where the SR learns more from appetitive prediction errors and the IR learns more from aversive prediction errors. I commend the authors for looking at the Gillan et al., 2016 data to see to what extent this model is supported in two-step task data. I have some uncertainty about the test that was used to look for evidence that the SR and IR have different learning rates for positive and negative prediction errors. I think a more straightforward approach would be to fit the model to the task data, treating the learning rates for either system as free parameters, and then to analyze how those learning rates change as a function of self-reported OCD symptoms. This could potentially also support the suggestion that healthy participants are described by the SR alone (through model comparison). I found this surprising, since the SR is typically thought to stand in for the model-based system, but not the model-free system, and healthy participants in the task are described by a mixture of model-based and model-free systems. I think the ability of the SR in the simulations here to generate mixtures over MB and MF weightings might be due to the fact that it learns the transitions - so this would be similar to the MB system learning the transitions from experience. This is a reasonable hypothesis for what generates behavior that might look like model-free learning (low w), and this could potentially be supported by actual model-fitting.

More generally though, if the test used currently is correctly identifying learning rates for either system, it falsifies the model's predictions, instead showing that MB (or SR) behavior in OCD individuals is more influenced by negative than positive prediction errors relative to MF (or IR) behavior. The paper suggests an explanation for this. First, it argues that the Gillan et al. experiment, due to low pay, should be considered to be in the punishment rather than the reward domain. Second, it argues that the punishment domain might encourage opposite learning rate asymmetries - where the SR would learn more from negative errors and the IR would learn more from positive ones (aversive SR + appetitive IR).

I did not find this argument convincing. In particular, the simulations of obsessive-compulsive cycles (Fig. 2) take place in a punishment domain, yet despite this, the model used is appetitive SR + aversive IR. Additionally, the simulations demonstrate that aversive SR + appetitive IR in this domain (which, they claim, the punishment domain might encourage) does not generate obsessive-compulsive cycles. So, if I understood correctly, the model proposed to explain the two-step task behavior is in conflict with the model used to explain obsessive-compulsive cycles.

So, altogether, I'm somewhat uncertain about the extent to which the proposed model is supported as a model for OCD. The model of OCD decisions and variation from controls consists of two parts: 1) OCD participants use the IR component more than healthy controls do (both use an SR component), and 2) these components have different learning rates for positive and negative PEs, where the SR is appetitive and the IR is aversive. For the two-step task data, 1) is supported, but 2) is not. However, it's worth noting that 1) is not really a new prediction for this task. Because it is known that SR and IR can stand in for MB and MF, the new account is not really different from the standard account of this data, which is that there is a shift from MB to MF control in OCD.

In contrast, to generate the obsessive compulsive cycles, it is really 2) that is needed, but not 1). That is, the key feature needed to explain variation between healthy controls and OCD patients in developing OCD cycles is imbalanced learning rates, not differences in amount of IR in addition to SR. So, in this regard, if the point of the paper is to offer a new model of generation of obsessive compulsive cycles, I’m not sure that the two-step task data is really offering support to the key feature of that model that is needed to support these cycles.

Overall, I do think the basic observation that some phenomena (the simulations of obsessive-compulsive cycles and the data of Sakai et al.), which previously were argued to support a mechanism of imbalanced eligibility traces, could also be explained by a combined SR + IR learner with imbalanced learning rates is still interesting. But I'm not really satisfied with how the falsification of what I view as the key part of this model in the two-step task data is explained.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011206.r005

Decision Letter 2

Daniele Marinazzo, Stefano Palminteri

23 May 2023

Dear Dr Morita, 

We are pleased to inform you that your manuscript 'Opponent Learning with Different Representations in the Cortico-Basal Ganglia Pathways Can Develop Obsession-Compulsion Cycle' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Stefano Palminteri

Academic Editor

PLOS Computational Biology

Daniele Marinazzo

Section Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011206.r006

Acceptance letter

Daniele Marinazzo, Stefano Palminteri

5 Jun 2023

PCOMPBIOL-D-23-00023R2

Opponent Learning with Different Representations in the Cortico-Basal Ganglia Pathways Can Develop Obsession-Compulsion Cycle

Dear Dr Morita,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Timea Kemeri-Szekernyes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Data. Codes to generate/reproduce the data presented in the figures.

    Please read "readme.txt" for details.

    (ZIP)

    Attachment

    Submitted filename: ocd2_responses_to_review_comments_v50.pdf

    Attachment

    Submitted filename: ocd2rev2_responses_to_review_comments_v9.pdf

    Data Availability Statement

    All relevant codes are within the Supporting Information files. Codes to generate/reproduce the data presented in the figures are available at: https://github.com/kenjimoritagithub/sr101

