Skip to main content
eLife logoLink to eLife
. 2022 Jul 18;11:e67517. doi: 10.7554/eLife.67517

Stress diminishes outcome but enhances response representations during instrumental learning

Jacqueline Katharina Meier 1, Bernhard P Staresina 2, Lars Schwabe 1,
Editors: Mimi Liljeholm3, Floris P de Lange4
PMCID: PMC9355560  PMID: 35848803

Abstract

Stress may shift behavioural control from a goal-directed system that encodes action-outcome relationships to a habitual system that learns stimulus-response associations. Although this shift to habits is highly relevant for stress-related psychopathologies, limitations of existing behavioural paradigms hinder research from answering the fundamental question of whether the stress-induced bias to habits is due to reduced outcome processing or enhanced response processing at the time of stimulus presentation, or both. Here, we used EEG-based multivariate pattern analysis to decode neural outcome representations crucial for goal-directed control, as well as response representations during instrumental learning. We show that stress reduced outcome representations but enhanced response representations. Both were directly associated with a behavioural index of habitual responding. Furthermore, changes in outcome and response representations were uncorrelated, suggesting that these may reflect distinct processes. Our findings indicate that habitual behaviour under stress may be the result of both enhanced stimulus-response processing and diminished outcome processing.

Research organism: Human

Introduction

Adaptive behaviour in complex environments requires an intricate balance of deliberate action and efficient responding. For instance, in the supermarket faced with numerous products, it may certainly be helpful to weigh the pros and cons of a specific product before making a choice. Yet, thinking for hours about which toothpaste to buy may interfere with the goal of being home before dinner. The balance of thorough deliberation and efficiency is supported by at least two systems of behavioural control that operate in parallel: (i) a goal-directed system that learns when to perform the action required to achieve a desired outcome (in the form of stimulus [S] – response [R] – outcome [O] associations) and (ii) a habitual system that acquires S-R associations without any links to the response related outcome (Adams, 1982; Adams and Dickinson, 1981; Balleine and Dickinson, 1998). These two systems are known to rely on distinct neural circuits. While the goal-directed system primarily relies on the medial prefrontal and orbitofrontal cortex as well as the dorsomedial striatum (anterior caudate in humans; Balleine and O’Doherty, 2010), the habitual system depends on the dorsolateral striatum (posterior lateral putamen in humans; Balleine and Dickinson, 1998; Balleine and O’Doherty, 2010; Corbit and Balleine, 2003; Ostlund and Balleine, 2005; Tricomi et al., 2009; Valentin et al., 2007; Yin et al., 2006). Moreover, it is commonly assumed that the goal-directed system guides early learning, whereas the habitual system takes over once a behaviour has been frequently repeated. Therefore, while buying toothpaste for the first time should be a goal-directed action, this choice should become more habitual if we have bought the specific toothpaste several times before. Adaptive behaviour requires the capacity to flexibly switch back from habitual to goal-directed control in response to environmental changes (e.g. when the previously bought toothpaste is out of stock). Lacking this flexibility in behavioural control may be detrimental to mental health. In particular, overreliance on habitual responding has been linked to several mental disorders, including drug addiction, obsessive-compulsive disorder, schizophrenia, eating disorders, and depression (Everitt and Robbins, 2005; Voon et al., 2015; Griffiths et al., 2014; Robbins et al., 2012; Voon et al., 2015).

Accumulating evidence indicates that stressful events may tip the balance from goal-directed to habitual control. Specifically, stress (and major stress mediators) has been shown to induce a shift from goal-directed action to habitual responding (Braun and Hauber, 2013; Dias-Ferreira et al., 2009; Gourley et al., 2012; Schwabe et al., 2012; Schwabe and Wolf, 2009; Schwabe and Wolf, 2010; Schwabe and Wolf, 2013; Smeets et al., 2019; Soares et al., 2012). Beyond its crucial relevance for our understanding of behavioural control in general, this stress-induced shift towards the habitual system may be a driving force in mental disorders that are characterized by dysfunctional stress responses on the one hand and aberrant habitual control on the other (Adams et al., 2018; Goeders, 2004; Schwabe et al., 2011a). Although the stress-induced bias to habitual responding may have critical clinical implications, the exact mechanisms through which stress modulates the balance of goal-directed and habitual control are not fully understood. In particular, previous research has been unable to address the fundamental question of whether the stress-induced bias towards habitual behaviour is due to a downregulation of the goal-directed system or the enhancement of the habit system, or both.

Canonical assays for the assessment of goal-directed and habitual control do not allow a distinction between these alternatives. These paradigms are based on the key distinctive feature of goal-directed and habitual control, that is, only the goal-directed system is sensitive to changes in the motivational value of the outcome in absence of new experience with it (Adams, 1982; Dickinson and Balleine, 1994). Accordingly, classical paradigms have tested the behavioural sensitivity to either a devaluation of an outcome or to the degradation of the action-outcome contingency (Adams, 1982; Corbit and Balleine, 2003; Tanaka et al., 2008; Valentin et al., 2007; Yin et al., 2004). Although these elegant paradigms provide valuable insight into the mechanisms involved in behavioural control, they are unable to determine whether increased responding to devalued or degraded actions is due to reduced outcome or enhanced S-R processing, or both.

Here, we aimed to overcome these shortcomings of classical paradigms that are used to determine these modes of behavioural control and to examine whether stress leads to an upregulation of response processing at the time of stimulus presentation or a downregulation of outcome-related processing, or both. To these ends, we leveraged EEG in combination with multivariate pattern analysis (MVPA)-based decoding of neural representations. In the present study, we first exposed participants to a stress or control manipulation and then asked them to complete an instrumental learning task during which they could learn S-R(-O) associations. Crucially, we used image categories as response options (R) and outcome categories (O) that have a distinct neural signature and recorded EEG throughout the task. EEG-based classifiers (support vector machines, SVMs) were trained to distinguish between the R and O stimulus categories on a separate delayed-matching-to-sample (DMS) task. We then applied these classifiers to the instrumental learning task to decode the outcome representations relevant to the goal-directed system when participants saw the stimulus S and when they made the response R. Furthermore, we decoded response R representations when participants saw the stimulus S, pointing to existing S-R representations (which may, however, be relevant for both habitual S-R learning and goal-directed S-R-O learning). Using a similar decoding approach on functional magnetic resonance imaging (fMRI) MRI data, a previous study showed that brain regions implicated in goal-directed control contained information about outcomes and responses, whereas regions associated with habitual responding contained only information about responses (but not outcomes) at the time of stimulus presentation (McNamee et al., 2015).

Because previous rodent studies suggested that stress or stress hormone effects on the balance of goal-directed and habitual forms of learning are training-dependent (Dias-Ferreira et al., 2009; Packard, 1999; Siller-Pérez et al., 2017), we also assessed training-dependent dynamics in the stress effect on outcome and response processing. Although we did not aim to test overtraining-induced changes in the balance of outcome and response processing, for which the number of trials may have been too limited in the present study, we included transient outcome devaluations at the beginning, middle, and end of the instrumental learning task to assess whether stress effects on instrumental behaviour are training-dependent. These outcome devaluations were included in order to assess if and how the predicted changes in neural representations are linked to behavioural manifestations of stress-induced changes in behavioural control.

Results

The goal of this study was to elucidate the mechanisms underlying the impact of stress on the control of instrumental behaviour. Specifically, we aimed to leverage an EEG-based decoding approach to determine stress-induced changes in outcome and response representations that constitute key features that distinguish goal-directed action and habitual responding (Adams, 1982; Adams and Dickinson, 1981; Balleine and Dickinson, 1998; Balleine and O’Doherty, 2010). To this end, participants first underwent the Trier Social Stress Test (TSST; Kirschbaum et al., 1993), a mock job interview that represents a gold standard in experimental stress research (Allen et al., 2014), or a non-stressful control manipulation. Afterwards, participants completed a reinforcement learning task (Luque et al., 2017) that allowed us to probe the goal-directed vs. habitual control of behaviour. In this task, participants could learn S-R-O associations (Figure 1). Specifically, they acted as ‘space traders’ who traded two cards, represented by two distinct fractals (green vs. pink), with two alien tribes that were represented by distinct symbols (red vs. blue) in return for cards from two possible categories (objects vs. scenes). On each trial, participants first saw one of the two fractals (S). They were then shown representatives of the two alien tribes next to each other and asked to decide which alien to offer the fractal to (R). Finally, participants received feedback about whether the alien accepted the offer and traded one of the desired cards and how many respective points were earned (O). Importantly, one alien tribe accepted only one type of fractal and traded only one card category. Furthermore, one card category was worth more than the other (high-valued outcome, Ohigh, and low-valued outcome, Olow). Participants had to learn these associations using trial-by-trial feedback. Moreover, there was a response cost associated with each trade that was accepted by the alien.

Figure 1. Overview of the paradigm used to decode outcome and response representations.

Figure 1.

(A) Illustration of the goal-directed and habit system. While the goal-directed system encodes associations between stimulus (S), response (R), and outcome (O), the habit system acquires S-R associations independent of the outcome engendered by the response. In accordance with this, the goal-directed system relies on outcome representations, whereas the habitual system does not. In contrast, response representations during stimulus presentation may be relevant for both habitual S-R and goal-directed S-R-O processing. (B) S-R-O mappings in the reinforcement learning task and outcomes in trials in which either none of the possible outcomes were devalued (NoDev), the outcome with lower value was devalued (Dev Olow), or the outcome with the higher value was devalued (Dev Ohigh). (C) Schematic representation of the reinforcement learning task in which participants were trained on S-R-O sequences in a trial-by-trial manner. Using an EEG-based support vector machine (SVM), neural representations of the outcome stimuli (object vs. scene) were decoded during stimulus presentation and during response choice. Moreover, neural representations of the response options (blue vs. red alien) were decoded during stimulus presentation. During devaluation blocks, participants saw on the last screen of each trial ‘???’ instead of the outcome value. (D) The SVM was trained in an unrelated delayed-matching-to-sample task (maintenance phase) that required participants to keep stimuli in mind that belonged to categories used as outcomes or response options during the reinforcement learning task. (E) Timeline of the experiment.

This task can be solved by ‘goal-directed’ action-outcome (S-R-O) learning and by ‘habitual’ S-R learning. In order to reveal the mode of control at the behavioural level, we presented task blocks in which one of the outcomes was devalued (i.e. not worth any points but associated with a response cost, thus resulting in a negative outcome). If behaviour is goal-directed, it should be sensitive to the outcome-devaluation, and participants should avoid the devalued action. Conversely, if behaviour is habitual, it should be less sensitive to the outcome devaluation, and participants should be more prone to perform the frequently repeated but now devalued response. To assess whether the balance of goal-directed and habitual behaviour and its modulation by stress is training-dependent, we presented devaluation blocks early during training, after moderate training, and again after extended training at the end of the task.

Critically, we recorded EEG during task performance and used stimulus categories as R and O stimuli that are known to have distinct neural signatures (Bae and Luck, 2018; Cairney et al., 2018; Taghizadeh-Sarabi et al., 2015; Treder, 2020). We trained EEG-based multivariate classifiers in an unrelated DMS task to discriminate between stimulus categories that were used as S (blue vs. red symbols that differed also in shape and line orientation) and O (objects vs. scenes) (Figure 1). The DMS task was completed both before and after the reinforcement learning task, and the classifier was trained on the pooled trials of the two DMS sessions (thus ruling out any time-dependent biases of the classifiers). The trained classifiers were then applied to the reinforcement learning task to determine neural representations of R and O.

Response representations were decoded during the S presentation, whereas O representations were decoded during both the S presentation and participants’ choice (R). Outcome representations at the time of S presentation and R are indicative of goal-directed control. In contrast, response representations at the time of stimulus-representations may be relevant for both goal-directed S-R-O learning and habitual S-R learning.

In order to adequately assess whether increased responding to devalued actions is due to reduced goal-directed or enhanced habitual control, or both, we employed a variety of measures and statistical approaches. At the behavioural level, we applied a devaluation paradigm to examine whether stressed participants respond differently to transient outcome devaluations compared to control participants and to assess whether the predicted neural changes are linked to behavioural outcomes. Behavioural analyses focussed mainly on participants’ choice in valued and devalued trials. Based on previous animal studies (Dias-Ferreira et al., 2009), we additionally investigated whether the effects of stress on the mode of control would be training-dependent and therefore implemented outcome devaluation at the beginning, middle, and end of the instrumental task. At the physiological level, we analysed pulse, systolic and diastolic blood pressure as well as salivary cortisol concentrations at different time points during the experiment to assess for the effectiveness of the stress manipulation. At the neural level, we mainly focused on the decoding of outcome and response representations, which provide insights into the mechanisms through which stress affects sensitivity to outcome devaluation - the primary objective of the present study.

To further address the important question of whether response and outcome representations reflect signatures of distinct control systems, we additionally analysed the correlations between both neural representations. For this purpose, we used Bayesian correlational analyses. Depending on the magnitude of the Bayes factor (reflecting the likelihood ratio of the data under the alternative hypothesis and the data under the null hypothesis), the Bayesian approach can provide evidence in favour of the alternative hypothesis or evidence in favour of the null (Hoijtink, 2012; Kass and Raftery, 1995; Nuzzo, 2017). Thus, we utilized Bayesian analyses to provide clear evidence for or against the null hypothesis. To further assess the association between the behavioural data (i.e. classification accuracy) and the strength of the neural representation (i.e. classification accuracies), we computed Spearman correlations. In order to also analyse previously proposed ‘attentional habits’ (Luque et al., 2017), we analysed event-related potentials (ERPs).

Successful stress manipulation

Significant subjective and physiological changes in response to the TSST confirmed that stress was successfully induced. Compared to participants in the control group, participants exposed to the TSST experienced the treatment as significantly more stressful, difficult, and unpleasant than those in the control condition (all t56 >5.82, all p<0.001, all d>1.530, and all 95% CI=1.456–2.112; Table 1). At the physiological level, exposure to the TSST elicited significant increases in pulse and systolic and diastolic blood pressure (time point of measurement × group interaction, all F[4, 224]>13.55, all p<0.001, all ƞp²>0.195, and all 95% CI=0.100–0.425; Figure 2). As shown in Figure 2A–C, although both groups had comparable blood pressure and pulse before and after the TSST (all t56 <2.28, all pcorr>0.081, all d<0.600, and all 95% CI=0.069–0.702), participants in the stress group had significantly higher blood pressure and pulse during the experimental manipulation than those in the control group (all t56 >4.21, all pcorr<0.001, all d>1.107, and all 95% CI=0.549–2.098). Finally, salivary cortisol concentrations increased in response to the TSST but not after the control manipulation (time point of measurement × group interaction: F[3, 168]=6.69, p<0.001, ƞp²=0.107, and 95% CI=0.026–0.188). As shown in Figure 2D, participants in the stress and control groups had comparable cortisol concentrations at baseline (t56=1.16, pcorr=1, d=0.304, and 95% CI=−0.216–0.821). However, about 20 min after the treatment, when the reinforcement learning task started, cortisol levels were significantly higher in the stress group than in the control group (t56=2.74, pcorr = 0.032, d=0.720, and 95% CI=0.185–1.249). As expected, cortisol levels returned to the level of the control group by the end of the task (60 min: t56=0.50, pcorr = 1, d=0.130, and 95% CI=−0.386–0.645; 105 min: t56=0.42, pcorr=1, d=0.111, and 95% CI=−0.405–0.625).

Table 1. Subjective responses to the Trier Social Stress Test (TSST) or control manipulation.

Control Stress
M SEM M SEM
Subjective assessments
Stressfulness 16.79 3.45 62.33* 4.44
Unpleasantness 15.36 3.51 57.00* 5.90
Difficulty 14.29 3.35 51.33* 5.29

Subjective assessments were rated on a scale from 0 (‘not at all’) to 100 (‘very much’). *p<0.001, Bonferroni-corrected, significant group difference.

Figure 2. Physiological responses to the Trier Social Stress Test (TSST), proportion of responses for devalued outcomes, and outcome and response representations throughout the reinforcement learning task.

The exposure to the TSST, but not to the control manipulation, resulted in a significant increase in pulse (A), diastolic blood pressure (B), systolic blood pressure (C), and salivary cortisol (D). The grey bars denote the timing and duration of the treatment (TSST vs. control condition) and the respective reinforcement learning task. (E) Proportion of responses for devalued outcomes across the reinforcement learning task during Dev Ohigh blocks. As training proceeded, stressed participants increasingly selected actions that led to a devalued outcome. In addition, stressed participants responded significantly more often to the devalued action than non-stressed controls in the third devaluation block at the end of the task. Individual data points are shown in Figure 3. The data for Dev Olow and NoDev blocks is presented in Figure 3—figure supplement 1 and Figure 3—figure supplement 2, respectively. (F and G) Outcome representation during stimulus presentation and response choice. As training proceeded, the outcome representations decreased in the stress group, while there were no changes in the control group (blocks 1–4 vs. blocks 19–24). At the end of the learning task, outcome representations were significantly lower in stressed participants than in controls. (H) Response representations during stimulus presentation. Stressed participants showed significantly stronger response representations after extended training compared to the control group. Data represents means and error bars represent the SE of the mean. *** p<0.001, ** p<0.01, and * p<0.05, Bonferroni-corrected (group differences, corrected for all time points and blocks, respectively). ### p<0.001, ## p<0.01, Bonferroni-corrected (block differences, corrected for the number of blocks).

Figure 2.

Figure 2—figure supplement 1. Event-related potentials during Dev Ohigh blocks.

Figure 2—figure supplement 1.

(A) Mean centroparietal activity for devalued and valued stimuli during Dev Ohigh for control and stressed participants (baseline-corrected). The late component tended to be sensitive to the outcome devaluation in the control group but not in the stress group. The light-coloured background bar refers to the time range showing marginally significant group differences. (B) Mean occipital activity for devalued and valued stimuli during Dev Ohigh for control and stressed participants (baseline-corrected). The stimulus-locked P1 component was insensitive to outcome devaluation in both groups. Data represents means ± SEM.
Figure 2—figure supplement 2. Event-related potentials during Dev Olow blocks.

Figure 2—figure supplement 2.

(A) Mean centroparietal activity for devalued and valued stimuli during Dev Olow for control and stressed participants (baseline-corrected). (B) Mean occipital activity for devalued and valued stimuli during Dev Olow for control and stressed participants (baseline-corrected). Data represents means ± SEM.
Figure 2—figure supplement 3. Event-related potentials during NoDev blocks.

Figure 2—figure supplement 3.

(A) Mean centroparietal activity for Shigh and Slow during NoDev for control and stressed participants (baseline-corrected). (B) Mean occipital activity for Shigh and Slow during NoDev for control and stressed participants (baseline-corrected). Data represents means ± SEM.
Figure 2—figure supplement 4. Brain areas contributing the most to outcome and response decoding.

Figure 2—figure supplement 4.

Searchlight analysis of differences in object vs. scene (A) and blue vs. red symbol (B) processing during the delayed-matching-to-sample (DMS) task (chance performance was 52.85 and 53.90%, respectively), calculated using Wald intervals (Agresti and Caffo, 2000).

Stress renders behaviour less sensitive to outcome devaluation

Participants’ choice accuracy increased significantly throughout the task (F[2, 114] = 10.08, p<0.001, ƞp²=0.150, and 95% CI=0.042–0.261) and reached an average performance of 99% correct responses in blocks without devaluation (NoDev), indicating that participants learned the task very well. Both groups reached a performance plateau relatively quickly and at about the same time (Figure 3—figure supplement 2). Performance in NoDev blocks did further not differ between the control and stress groups, and the time course of learning was comparable in the two groups (time × group interaction: F[2, 112] = 2.44, p=0.092, ƞp²=0.042, and 95% CI=0–0.123; main effect group: F[1, 56] = 0.30, p=0.585, ƞp²=0.005, and 95% CI=0–0.096), suggesting that stress did not affect instrumental learning as such. Furthermore, during NoDev blocks, participants had a higher response accuracy for Shigh than Slow trials (t57=3.29, p=0.002, d=0.432, and 95% CI=0.161–0.699), suggesting that learning was modulated by the value of the outcome. In the control group, instrumental behaviour did not differ across the different devaluation blocks (F[2, 54] = 1.466, p=0.240, ƞp²=0.052, and 95% CI=0.013–0.049), indicating that the repeated devaluation phases as such did not result in increased sensitivity to the devaluation procedure.

During the reinforcement learning blocks, in which one of the outcomes was devalued (Dev), participants chose the action that was associated with a valued outcome significantly more often than the action that was associated with a devalued outcome (outcome devaluation × stimulus value interaction: F[2, 110] = 163.31, p<0.001, ƞp²=0.785, and 95% CI=0.664–0.789; valued vs. devalued during Dev Ohigh: t57=12.17, pcorr<0.001, d=1.589, and 95% CI=1.205–1.985; valued vs. devalued during Dev Olow: t56=13.49, pcorr<0.001, d=1.786, and 95% CI=1.363–2.203), providing further evidence of successful instrumental learning. The more pronounced valued vs. devalued difference for Olow (i.e. outcome devaluation × stimulus value interaction) persisted even when we analysed only the control group (outcome devaluation × stimulus value interaction: F[2, 52] = 70.601, p<0.001, ƞp²=0.731, and 95% CI=0.391–0.720; valued vs. devalued during Dev Ohigh: t27=8.482, pcorr<0.001, d=1.603, and 95% CI=1.032–2.16; valued vs. devalued during Dev Olow: t26=8.654, pcorr <0.001, d=1.665, and 95% CI=1.071–2.246). Importantly, however, there was also a significant outcome devaluation × stimulus value × block × group interaction (F[4, 220]=4.86, p<0.001, ƞp²=0.081, and 95% CI=0.016–0.143). Follow-up ANOVAs revealed that stressed participants increasingly selected actions that led to a devalued outcome during Dev Ohigh blocks at the end of the task (stimulus value × time × group interaction: F[2, 112] = 8.89, p<0.001, ƞp²=0.137, and 95% CI=0.033–0.247; time × group interaction for devalued stimuli: F[2, 112] = 9.09, p<0.001, ƞp²=0.140, and 95% CI=0.035–0.250; time × group interaction for valued stimuli: F[2, 112] = 2.19, p=0.116, ƞp²=0.038, and 95% CI=0–0.116; block 2 vs. block 3 for devalued stimuli in the stress group: t29=4.28, pcorr<0.001, d=0.781, and 95% CI=0.366–1.186; block 2 vs. block 3 for devalued stimuli in the control group: t27=1.10, pcorr=1, d=0.207, and 95% CI=−0.169–0.580), whereas there was no such effect during Dev Olow blocks (F[2, 110] = 2.39, p=0.097, ƞp²=0.042, and 95% CI=0–0.124), or during NoDev blocks (F[2, 112] = 0.41, p=0.667, ƞp²=0.007, and 95% CI=0–0.052).

As shown in Figure 2E and Figure 3, stressed participants responded significantly more often to the devalued action than the non-stressed controls did in the third devaluation block at the end of the task (t56=2.61, pcorr=0.036, d=0.685, 95% CI=0.152–1.213, stress vs. control during the first and second Dev Ohigh blocks: all t56 <0.57, all pcorr = 1, all d<0.149, all 95% CI=–0,472–0.664). Furthermore, while there was no evidence of an interaction of devaluation block number (1 vs. 2) and experimental treatment (F[1, 56] = 1.575, p=0.215, ƞp²=0.027, and 95% CI=−0.013–0.052) when analysing the changes from the first to the second devaluation block, we obtained a significant interaction between block (2 vs. 3) and treatment when analysing the changes from block 2 to block 3 (F[1, 56] = 13.589, p<0.001, ƞp²=0.195, and 95% CI=0.105–0.319). Moreover, follow-up tests revealed that in block 2, groups did not differ in responses for the devalued outcome (t56=0.165, p=0.870, Cohen’s d=0.043,–0.472 to 0.558).

Figure 3. Proportion of responses for devalued outcomes across the reinforcement learning task during Dev Ohigh blocks.

As training proceeded, stressed participants increasingly selected those actions that led to a devalued outcome (block 2 vs. block 3). In addition, stressed participants responded significantly more often to the devalued action than non-stressed controls in the third devaluation block at the end of the task (stress vs. control). Dots represent mean performance of individual participants. The data for Dev Olow and NoDev blocks is presented in Figure 3—figure supplement 1 and Figure 3—figure supplement 2. * p<0.001, Bonferroni-corrected (stress vs. control). # p<0.001, Bonferroni-corrected (vs. the respective other block).

Figure 3.

Figure 3—figure supplement 1. Proportion of responses for devalued outcome across the reinforcement learning task during Dev Olow blocks.

Figure 3—figure supplement 1.

Figure 3—figure supplement 2. Proportion of correct responses during NoDev blocks after low, moderate, and high training intensity.

Figure 3—figure supplement 2.

(A) Violin plots illustrate the proportion of correct responses in Ohigh trials during NoDev blocks after low (block 1), moderate (block 12), and high (block 24) training intensity (upper part). The line graph illustrates the proportion of correct responses in Ohigh trials during NoDev from block 1 to block 24. (B) Violin plots illustrate the proportion of correct responses in Olow trials during NoDev blocks after low (block1), moderate (block 12), and high (block 24) training intensity. The line graph illustrates the proportion of correct responses in Olow trials during NoDev from block 1 to block 24. In both violin, dots represent mean performance of individual participants. In both line graphs, error bars represent the SEM.

In summary, this data suggests that stress rendered behaviour less sensitive to the outcome devaluation in a training-dependent manner, which was particularly the case for the frequently repeated response to Ohigh.

Given that it is well-known that stress may disrupt memory retrieval (Gagnon and Wagner, 2016; de Quervain et al., 1998; Roozendaal, 2002), it might be argued that the increased responding to the devalued action in stressed participants was due to their difficulty in remembering which outcome was valued and which was devalued. We could, however, rule out this alternative. Each block of our task involved, in addition to reinforcement learning trials, so called consumption trials in which participants could freely choose between the two outcome categories, without any response costs. These trials served to assess whether participants were aware of the current value of the outcomes. Here, participants clearly preferred the high-valued cards over low-valued cards during NoDev blocks (F[1, 56] = 5382.91, p<0.001, ƞp²=0.990, and 95% CI=0.984–0.992) as well as the valued card over its devalued counterpart during Dev blocks (Dev Ohigh and Dev Olow: both F[1, 56] > 214.72, both p<0.001, both ƞp²<0.793, and both 95% CI=0.687–0.848; outcome devaluation × stimulus value interaction: F[2, 112] = 876.42, p<0.001, ƞp²=0.940, and 95% CI=0.919–0.952), irrespective of stress (outcome devaluation × stimulus value × group interaction: F[2, 112] = 0.05, p=0.956, ƞp²=0.001, and 95% CI=0–0.120). This finding demonstrates that participants in both groups were aware of the value of the card stimuli in a specific block but that responses of stressed participants were less guided by this knowledge about the value of the outcome engendered by the response.

Mean reaction times were significantly faster for high-valued stimuli than for low-valued stimuli during NoDev blocks (NoDev: t57=2.83, pcorr = 0.019, d=0.372, and 95% CI=0.104–0.637), whereas participants responded faster to low-valued stimuli than to high-valued stimuli during Dev blocks (Dev Olow: t56=5.73, pcorr<0.001, d=0.759, and 95% CI=0.462–1.052; Dev Ohigh: t57=4.41, pcorr<0.001, and d=0.579, 95% CI=0.298–0.855; outcome devaluation × stimulus value interaction: F[2, 110] = 28.14, p<0.001, ƞp²=0.338, and 95% CI=0.194–0.451). Stress did not influence participants’ reaction times (outcome devaluation × stimulus value × time × group: F[4, 220] = 0.85, p=0.494, ƞp²=0.015, and 95% CI=0–0.043; main effect group: F[1, 55] = 1.00, p=0.322, ƞp²=0.018, 95% CI=0–0.134).

Although our analyses of the neural data focussed mainly on the decoding of outcome and response representations, there is recent evidence suggesting that habitual and goal-directed processes might also be reflected in ERPs (Luque et al., 2017). The extent to which a reward related ERP is sensitive to an outcome devaluation is assumed to indicate the degree of habitual or goal-directed processing. Thus, we additionally analysed stress effects on related ERPs depending on outcome devaluation. Our data shows that the occipital stimulus-locked P1 component was insensitive to outcome devaluation (outcome devaluation × stimulus value interaction: F[2, 102]=0.63, p=0.536, ƞp²=0.012, and 95% CI=0–0.069; Figure 2—figure supplements 13), which might suggest the formation of an ‘attentional habit’ (Luque et al., 2017). However, the P1 was not modulated by reward value (F[1, 51] = 0.25, p=0.619, ƞp²=0.005, and 95% CI=0.002–0.009), which makes the interpretation of the insensitivity to the outcome devaluation difficult. The P1 component was also not modulated by stress (stimulus value × group: F[1, 51] = 0.13, p=0.723, ƞp²=0.002, and 95% CI=0–0.086; outcome devaluation × stimulus value × group: F[2, 102] = 0.11, p=0.900, ƞp²=0.002, and 95% CI=0–0.028). Moreover, we identified a late component that showed a non-significant trend towards sensitivity to the outcome devaluation during Dev Ohigh blocks in control participants (devalued vs. valued: t24=1.91, p=0.068, d=0.382, and 95% CI=−0.028–0.785) but not in stressed participants (devalued vs. valued: t27=1.57, p=0.127, d=0.297, and 95% CI=−0.084–0.673; outcome devaluation × stimulus value × group interaction: F[2, 102] = 5.20, pcorr=0.042, ƞp²=0.093, and 95% CI=0.008–0.199; stimulus value × group interaction: F[1, 51] = 6.05, p=0.017, ƞp²=0.106, and 95% CI=0.003–0.273; no such effect in NoDev and Dev Olow blocks: stimulus value × group interaction: both F[1, 51] < 1.44, both p>0.236, both ƞp²<0.027, and both 95% CI=0–0.159). This pattern of results suggests that stress interferes with a late ERP component that has been linked to goal-directed processing as it was particularly sensitive to the value of an outcome. However, similar to the behavioural response to a devalued action, the stress-induced decrease in the ‘outcome-sensitive’ ERP component leaves the question open as to whether this stress effect is due to changes in outcome or response processing.

Stress reduces outcome representations at the end of training

Thus far, our behavioural data showed that stress rendered behaviour less sensitive to a change in outcome value, which can be interpreted as decreased goal-directed or increased habitual behaviour. In addition, stress reduced electrophysiological late latency potentials that appeared to be sensitive to the value of an outcome (Luque et al., 2017).

In a next step, we leveraged an EEG-based decoding approach to address the primary objective of this study, that is, to probe the effect of stress on neural outcome and response representations. We trained an MVPA classifier (SVM) based on an independent dataset (DMS task, for details see Materials and methods) to discriminate between categories that were used during the reinforcement learning task as an outcome (object card vs. scene card). This classifier was then used to assess trial-by-trial changes in outcome representations throughout the reinforcement learning task. Goal-directed control should be reflected in high accuracy of outcome category classification during the presentation of the stimulus (S) as well as during the response choice (R).

We first analysed outcome representations during the presentation of the fractal stimulus S before participants had to make a choice. This analysis revealed a significant block × group interaction (F[3, 117] = 2.77, p=0.045, ƞp²=0.066, and 95% CI=0–0.149). As training proceeded, participants in the stress group showed a reduced outcome representation at the time of S presentation (first six vs. last six blocks: t22=3.59, pcorr=0.004, d=0.748, and 95% CI=0.277–1.206), whereas the outcome representation remained rather constant in participants of the control group (first six vs. last six blocks: t17=1.08, pcorr = 0.590, d=0.255, and 95% CI=−0.219–0.721; Figure 2F). In the last six blocks of the task, the classification accuracy for the outcome was significantly lower in the stress group compared to the control group (t39=3.13, pcorr=0.012, d=0.986, and 95% CI=0.326–1.635; lower training intensity [first 18 blocks of the task]: all t39<1.13, all pcorr=1, all d<0.355, and all 95% CI=−0.914–0.975; note that the overall pattern remains when trials are blocked differently, see Supplementary file 1A). Strikingly, the reduced outcome representation was significantly correlated with the reduced behavioural sensitivity to the outcome devaluation during Dev Ohigh blocks (Spearman’s ρ=–0.482, 95% CI=−0.688−0.205, and p=0.001; Figure 4).

Figure 4. Correlations of outcome and response representation during stimulus presentation with responses for devalued outcomes during Dev Ohigh blocks.

Figure 4.

(A) Decrease of outcome representation during stimulus presentation was significantly correlated with the reduced behavioural sensitivity to the outcome devaluation during Dev Ohigh blocks. (B) Increase in response representation was significantly correlated with an increase in response for devalued outcomes during Dev Ohigh blocks. Higher difference scores indicate higher decreases in outcome and response representation over time. Regression lines are added for visualization purpose, and the light-coloured background areas indicate its 95% CI.

Analysing the outcome representations at the time of choice between the two aliens revealed a very similar pattern; at the end of the task, participants in the stress group showed a decreased outcome representation at the time point of the choice (t22=2.94, pcorr=0.016, d=0.613, and 95% CI=0.161–1.054), whereas there was no such effect in the control group (t17=0.59, pcorr=1, d=0.138, and 95% CI=−0.328–0.600). In addition, stressed participants had reduced outcome representations relative to controls, reflected in a significantly reduced classification accuracy, during the response choice at the end of the reinforcement learning task (block × group interaction: F[3, 117] = 2.99, p=0.034, ƞp²=0.071, and 95% CI=0–0.156; stress vs. control, high training intensity: t39=2.75, pcorr=0.036, d=0.865, and 95% CI=0.214–1.506; lower training intensities: all t39<2.30, all pcorr>0.108, all d<0.725, and all 95% CI=−0.500–1.358; Figure 2G). Together, these results show that at the end of training, acute stress reduced the representation of action outcomes that are considered to be a hallmark of goal-directed control.

Stress boosts response representations at the end of training

While it is assumed that the outcome representation that is crucial for goal-directed S-R-O learning is reduced with increasing habitual behaviour control, response (R) representations at the time of stimulus (S) presentation may be involved in both goal-directed S-R-O and habitual S-R processing. Therefore, we trained another classifier to discriminate between categories that were used as response options during the reinforcement learning task (red vs. blue alien). This classifier was used to examine changes in response representation during the stimulus presentation (S) throughout the reinforcement learning task. For these response representations, a block × group ANOVA revealed a significant interaction effect (F[3, 147] = 5.82, p<0.001, ƞp²=0.106, and 95% CI=0.021–0.192). As shown in Figure 2H, participants in the stress group showed a stronger response representation that was reflected in higher classification accuracy for the response categories with increasing training intensity (first half vs. last half: t25=2.51, pcorr=0.038, d=0.491, and 95% CI=0.079–0.894), whereas there was even a decrease in the control group (first half vs. last half: t24=3.50, pcorr=0.004, d=0.701, and 95% CI=0.256–1.134). In the last six blocks of the reinforcement learning task, stressed participants had significantly higher response representations than participants in the control group (t49=2.75, pcorr=0.032, d=0.770, and 95% CI=0.197–1.336; lower training intensities: all t49<1.92, all pcorr>0.244, all d<0.537, and all 95% CI=0.025–1.094). Interestingly, this increase in response representations was significantly correlated with an increase in responses for devalued outcomes during Dev Ohigh blocks (Spearman’s ρ=0.320, 95% CI=0.049–0.547, and p=0.022; Figure 4). Thus, our MVPA results indicate that stress leads to an increased response representation at the time of stimulus presentation.

Importantly, when we grouped the classification data not in four blocks consisting of 144 trials in total (averaged over six successive blocks containing 24 reinforcement learning trials each) but in 2, 3, 6, or 12 blocks, the pattern of results for the neural outcome and response representations was largely comparable (Supplementary file 1A).

Outcome and response representations are uncorrelated

To test whether the observed opposing changes in outcome and response representations after stress reflected independent or linked neural representations, we analysed Bayesian correlation between the classification accuracies in order to explicitly test the evidence in favour of the null and alternative hypothesis, respectively. These analyses revealed moderate evidence for the null hypothesis that outcome representations, both at stimulus presentation and response selection, were uncorrelated with response representation at choice time (both Pearson’s |r|<0.165, both 95% CI=0.154–0.442, and both BF01 >3.092; Figure 5), suggesting that outcome representations and response representations may be independent of each other.

Figure 5. Bayesian correlations of outcome representation during stimulus presentation and response selection with response representation during stimulus presentation.

Figure 5.

(A) Outcome representation during stimulus presentation was not correlated with response representation during response selection. As visualized in the pie chart, the corresponding Bayes factor suggests that the observed data are 3.092 times more likely under the null hypothesis (H0) than under the alternative hypothesis (H1). (B) Outcome representation during response selection was not correlated with response representation during stimulus presentation. As visualized in the pie chart, the corresponding Bayes factor suggests that the observed data are 4.642 times more likely under the H0 than under the H1. Higher difference scores indicate larger decreases in outcome and response representation, respectively, over time. Regression lines are added for visualization purpose, and the light-coloured background areas indicate its 95% CI.

Control variables and performance in the DMS task

At the beginning of the experiment, stress and control groups did not differ in subjective mood (all t56<1.20, all p>0.235, all d<0.316, and all 95% CI=−0.326–0.833), subjective chronic stress (all t54<1.07, all p>0.290, all d<0.285, and all 95% CI=−0.677–0.811), depressive mood (t56=1.07, p=0.289, d=0.281, and 95% CI=−0.238–0.797), state, or trait anxiety (both t56<0.44, all p>0.663, both d<0.115, and both 95% CI=0.401–0.630; Supplementary file 1C). Behavioural performance in the DMS task, used to train the classifier, was, as expected, very high (average performance: 97.5% correct, SD = 0.054) and comparable between groups (t56=0.23, p=0.818, d=0.061, and 95% CI=−0.455–0.576). The average classification accuracy of the classifier was 72% (SD = 0.068) for the response categories (blue rectangular vs. red oval symbol) and 66% (SD = 0.046) for the outcome categories (object vs. scene image) and did not differ between the stress and control groups (both t51<0.89, both p>0.376, both d<0.246, and both 95% CI=−0.669–0.786).

Furthermore, we recorded eye-tracking data to control for potential group differences in saccades or eye blinks. These control analyses showed that there were no significant group differences in saccades or eye blinks across the task or trial type (outcome devaluation × stimulus value × time × group: F[4, 196] = 0.78, p=0.54, ƞp²=0.02, and 95% CI=0.002–0.008; outcome devaluation × stimulus value × group: F[2, 98] = 1.03, p=0.36, ƞp²=0.02, and 95% CI=0.005–0.020; see Supplementary file 1D).

Discussion

Previous research showed that stress favours habitual responding over goal-directed action (Braun and Hauber, 2013; Dias-Ferreira et al., 2009; Gourley et al., 2012; Schwabe et al., 2009; Schwabe et al., 2010; Schwabe et al., 2011b; Schwabe et al., 2012; Seehagen et al., 2015; Smeets et al., 2019; Smeets and Quaedflieg, 2019; Soares et al., 2012; Wirz et al., 2018). Although this stress-induced bias towards habitual behaviour has important implications for stress-related mental disorders (Adams et al., 2018; Goeders, 2004; Schwabe et al., 2011a), a fundamental question has remained elusive thus far: is the shift towards habitual behaviour under stress due to diminished goal-directed control or enhanced habitual control, or both? Canonical behavioural assays of the mode of behavioural control cannot distinguish between these alternatives. Here, we used EEG-based decoding of outcome and response representations - the key components of goal-directed and habitual processing - to provide evidence that acute stress results in a decrease of outcome-related processing that is critical for goal-directed control, and paralleled by an increase in response processing.

Our behavioural and ERP data corroborates previous reports of a stress-induced shift from goal-directed to habitual control (Braun and Hauber, 2013; Schwabe et al., 2011b; Smeets and Quaedflieg, 2019). Specifically, stressed participants showed increased responding to a devalued action after extended training, suggesting a reduced behavioural sensitivity to the outcome devaluation which indicates less goal-directed behaviour and more habitual responding (Adams and Dickinson, 1981). Recent evidence showed that goal-directed and habitual processing may also be reflected in ERPs that are either sensitive or insensitive to changes in the value of an action outcome (Luque et al., 2017). We observed here that stress appears to reduce the sensitivity of late potentials to the outcome devaluation, whereas stress left the occipital P1 component that was insensitive to the outcome devaluation unaffected and may thus be considered habitual (Luque et al., 2017). However, similar to the behavioural insensitivity to an outcome devaluation after stress, these ERPs cannot separate reduced goal-directed from increased habitual responding. To disentangle outcome-related processing that is critical for goal-directed learning and S-R processing, which may be relevant both for goal-directed and habitual learning, we used an MVPA-based decoding approach. Critically, our Bayesian analyses indicated that changes in outcome and response representations were uncorrelated, which may be taken as evidence that they do not reflect changes in a single system, but rather (at least partly) dissociable signatures of goal-directed and habitual processing.

We show that stress led to a reduction in outcome representations both at the time of stimulus presentation and at the time of action selection, as well as a parallel increase of response representations during stimulus presentation indicative of enhanced S-R associations (Figure 2F–H). Both the stress-induced reduction in outcome representations and the increase in response representations were directly correlated with the behavioural (in)sensitivity to the outcome devaluation (Figure 4). However, while outcome representations were negatively correlated with participants’ responding to the devalued action, there was a positive correlation for response representations which might lend further support to the view that these representations reflect distinct processes. Taken together, our results indicate that acute stress leads to enhanced S-R processing and impaired outcome-related processing. The latter is in line with evidence showing reduced orbitofrontal activity in the face of elevated stress hormones (Schwabe et al., 2012), which is accompanied by increased habitual processing. Based on previous pharmacological studies (Gourley et al., 2012; Schwabe et al., 2010; Schwabe et al., 2011b; Schwabe et al., 2012), we assume that these opposing effects of stress on outcome-related processing and S-R processing are based on the concerted action of glucocorticoids and noradrenaline on the neural substrates of goal-directed and habitual control.

Importantly, at the time when stress effects on devaluation sensitivity and outcome representation were observed, stress mediators were no longer elevated. However, it is important to note that stress effects do not necessarily terminate when acute levels of stress mediators returned to baseline. Specifically, several major stress mediators are known to have ‘after-effects’ that outlast acute elevations of these measures (Joëls and Baram, 2009). For example, glucocorticoids are assumed to act as a ‘two-stage rocket’, with rapid, non-genomic actions and delayed, genomic actions (Joëls et al., 2012). The latter genomic actions typically set in after acute glucocorticoid elevations vanished (de Kloet et al., 2008). Thus, the fact that autonomic measures and cortisol levels returned to baseline during the learning task does not imply that the stress system activation had been over at this point. Moreover, acutely elevated stress mediators may have affected early learning processes in a way that became apparent only as training progressed.

Importantly, both our behavioural and our neural decoding data showed that stress affected the balance of outcome-related and S-R processes in a training-dependent manner. Previous human studies could not distinguish between such early and late effects of stress because the mode of behavioural control was only assessed at the end of training (Schwabe and Wolf, 2009; Smeets et al., 2019). The present results compliment rodent data which shows that (chronic) stress effects on the control of instrumental learning are training-dependent (Dias-Ferreira et al., 2009). In general, it is commonly assumed that early training falls under goal-directed control, while extensive training results in habitual control (Adams, 1982; Dickinson et al., 1995; Tricomi et al., 2009). However, whether or not overtraining may induce habitual behaviour in humans is currently debated (de Wit et al., 2018), and our data cannot speak to this issue as training may have been too limited to result in overtraining-related habits (which might require thousands of responses; Tricomi et al., 2009). Thus, training-dependent effects do not necessarily imply overtraining effects. However, findings of ‘cognitive’ and ‘habitual’ forms of navigational learning in rats demonstrated that stress hormones may accelerate a shift from ‘cognitive’ to ‘habitual’ learning that would otherwise only occur after extended training (Packard, 1999; Siller-Pérez et al., 2017). Thus, it is tempting to hypothesize that a similar mechanism might be at work during instrumental learning in stressed humans. This conclusion, however, remains speculative as we did not observe a training-dependent shift towards habitual control in the control group, and this group even showed reduced response and increased outcome representations over time, which rather suggests increased goal-directed processing across the task. Future studies are required to test this hypothesis by using the present neural decoding approach in combination with an extensive training protocol and groups of participants that are exposed to stress at distinct stages of the learning process.

In animal studies, devaluation sensitivity is usually assessed by means of a reinforcement test, in which the devalued outcome is delivered contingent on the relevant response. This procedure allows to control, for example, for a general lack of attention. In the present human study, the feedback provided on each trial varied between blocks. During NoDev blocks, participants saw the value of the card that was gained before as well as the corresponding response costs. In Dev blocks, however, this information was masked for all trials. This procedure is comparable to previous studies in humans in which the devalued outcome was not presented during critical test trials (e.g. Schwabe and Wolf, 2009; Valentin et al., 2007). Importantly, however, we also included consumption trials in all of the blocks which enabled us to rule out unspecific factors, such as a general lack of attention, altered contingency knowledge, or response perseveration. In these consumption trials, participants clearly preferred the high-valued cards over low-valued cards during NoDev blocks as well as the valued card over its devalued counterpart during Dev blocks, irrespective of stress. This finding demonstrates that participants in both groups were aware of the value of the card stimuli in a specific block but that responses of stressed participants were less guided by this knowledge about the value of the outcome engendered by the response. In addition, the specific response pattern in the consumption trials during devalued blocks also rules out general attentional or motivational deficits.

It is important to note that participants received an error feedback in devalued trials when they chose the response option that was not associated with the now devalued outcome. Given that acute stress may increase sensitivity to social cues (Domes and Zimmer, 2019), one might argue that stressed participants continued to respond towards devalued outcomes in order to avoid being presented with the error-feedback screen. However, we consider this alternative to be unlikely. First, with respect to the neural outcome and response representations, these were analysed during NoDev blocks in which groups did not differ in their behavioural performance accuracy and consequently not in the frequency of error feedback. Furthermore, participants’ performance in devalued blocks was directly associated with the observed changes in neural outcomes and response representations during the NoDev blocks, which again, could not be biased by differences in error feedback processing.

In addition, stressed participants showed an increase in insensitivity to outcome in Dev Ohigh but not in Dev Olow trials. Moreover, we found that the devaluation effect for Ohigh stimuli was stronger compared to the effect for Olow stimuli. This difference remained even when we analysed only the control group, excluding the possibility that the difference between Olow and Ohigh was merely due to the fact that stress increased specifically the insensitivity to the devaluation of Ohigh. However, why may the devaluation effect be lower for Ohigh than for Olow and why may stress have affected primarily the devaluation of Ohigh? These results suggest a stronger habit formation for stimuli that were paired with high rewards. A potential explanation for this pattern takes the links between rewards, stimulus saliency, and strength of S-R associations into account; the initial association with high valued outcomes may have increased the salience of the respective stimuli, which in turn may have promoted the formation of S-R associations. These stronger S-R associations may have resulted in more habitual responses for the devalued outcomes.

Interestingly, participants either primarily selected their action based on the outcome value (i.e. by avoiding the devalued outcome and thus were goal-directed) or responded to the devalued outcome in the vast majority of the trials and thus behaved habitually. Thus, participants showed an ‘either-or’ pattern, and there seemed to be interindividual differences in the tendency to perform in a goal-directed vs. habitual manner. Interestingly, also among stressed participants, there were substantial individual differences in the propensity to shift from goal-directed towards habitual behaviour, indicating that there is no overall effect of stress on the control of behaviour but that individuals differ in their sensitivity in these stress effects. This raises the important question of what makes individuals more or less vulnerable to the effects of stress on instrumental control (and cognition in general). Previous work suggested that genetic variants related to major stress response systems (i.e. noradrenergic and glucocorticoid activity) may play an important role in which multiple learning systems are engaged after stress (Wirz et al., 2017; Wirz et al., 2018). Understanding which further individual factors contribute to the interindividual propensity to shift from goal-directed to habitual behaviour needs to be addressed by future research. Furthermore, the fact that our neural data were not bimodal suggests that changes in neural representations may not translate directly into behavioural changes. The observed significant correlations between neural representation and performance in devalued trials show that there is a link between the behavioural and neural representation level. However, the correlations were obviously far below 1. Compared to the behavioural level which included only discrete yes-no responses, the neural data may have been much more sensitive and able to capture more fine-grained changes. The different sensitivity of behavioural and neural data is interesting in itself and points to another important question for future research: how do neural representation changes and behavioural changes relate to each other? Is there a particular threshold at which a change in representation triggers behavioural changes?

While we assume that the opposing effects of stress on neural outcome and response representations were due to the action of major stress response systems, there might have been other factors that have contributed to the present pattern of results, such as motivational factors or fatigue at the end of the task. Although we cannot completely rule out these alternatives, we consider them rather unlikely in light of our data. First, if these factors were a result of the mere amount of training, they should have also occurred in the control group which was not the case. Even if a specific interaction with the stress manipulation is assumed, it is important to note that reaction times remained fast in stressed participants until the end of training, and the response accuracy in valued trials or consumption trials remained very high. Furthermore, the observed specificity of the stress effects which occurred selectively in devalued trials cannot - in our view - be explained by unspecific influences, such as lack of motivation or fatigue.

In the present study, stress was induced before learning and outcome devaluation. Thus, stress could have affected the acquisition or the expression of instrumental behaviour, or both. While several previous studies demonstrated that acute stress (or the administration of stress hormones) before learning may shift instrumental behaviour from goal-directed to habitual control (Braun and Hauber, 2013; Dias-Ferreira et al., 2009; Gourley et al., 2012; Hartogsveld et al., 2020; Schwabe et al., 2010; Schwabe et al., 2011b; Schwabe et al., 2012; Schwabe and Wolf, 2009; Soares et al., 2012), there is evidence suggesting that stress before a test of behavioural expression may have a similar impact, that is, stress may induce habitual responding even when stress left acquisition unaffected (Schwabe et al., 2011b; Schwabe and Wolf, 2010). The latter finding, however, does not rule out additional effects of stress on acquisition, and indeed the impact of stress appeared to be more pronounced when the stress exposure took place before learning (Schwabe and Wolf, 2010). The present study did not aim to distinguish between stress effects on acquisition, or expression of goal-directed vs. habitual behaviour, but focussed on the impact of stress of the control of instrumental behaviour. Thus, our findings do not allow us to distinguish between stress effects on acquisition vs. expression of instrumental behaviour.

Based on the associative-cybernetic model (Dickinson and Balleine, 1993), it could be predicted that the obtained pattern of increased outcome and decreased response representations even leads to reduced responding for devalued outcomes across training in controls. This may be because individuals need to encode a representation of the response being performed in order to attain the outcome for a response to be goal-directed. We did not observe such a decrease, which may be attributed to the overall relatively low rate of responses for devalued outcomes in control participants.

Goal-directed action and habits are commonly considered to be two sides of the same coin. If a behaviour is more goal-directed, it is automatically less habitual (and vice versa). Obviously, behaviour cannot be fully goal-directed and habitual at the same time according to canonical operational definitions (Adams, 1982; Dickinson and Balleine, 1994). However, behaviour may not necessarily always be either fully goal-directed or habitual, and there may be different degrees to which behaviour is under goal-directed or habitual control. Given that the two modes of behavioural control are subserved by distinct neural circuits (Balleine and O’Doherty, 2010), it should be possible to separate goal-directed and habitual contributions to learning at a specific point in time. Classical behavioural paradigms involving discrete responses, however, cannot disentangle goal-directed and habitual components in a specific pattern of responding (e.g. insensitivity to outcome devaluation). Furthermore, tests of related cognitive functions, such as inhibitory control, provide only indirect evidence if any on the balance of goal-directed and habitual processes. Recently, a free-operant model was proposed that allows a behavioural dissociation of goal-directed and habitual contributions to behaviour (Perez and Dickinson, 2020). Here, we used here an MVPA-based decoding approach that focussed on neural representations that are a hallmark feature of goal-directed and habitual control. Using this novel approach, we show that acute stress reduces outcome representations and, at the same time, increases response representations in healthy humans. These were both directly linked to the stress-induced increase in habitual responding, suggesting that stress might exert opposite effects on goal-directed and habitual processing that manifest in the dominance of habitual behaviour under stress.

Materials and methods

Participants and design

Sixty-two healthy volunteers participated in this experiment. This sample size was based on earlier studies on stress and mnemonic control in our lab (Schwabe and Wolf, 2012) and a priori power calculation using G*POWER 3 suggesting that this sample size would be sufficient to reveal a medium-sized effect in a mixed-design ANOVA with a power of 0.80. Exclusion criteria were checked in a standardized interview before participation and identified any current or chronic mental or physical disorders, medication intake, or drug abuse. Furthermore, smokers and women taking hormonal contraceptives were excluded from participation because previous studies revealed that smoking and hormonal contraceptive intake may alter the cortisol response to stress. In addition, participants were asked to refrain from food intake, caffeine, and physical activity for 2 hr before testing. Four participants had to be excluded from analysis due to medication intake shortly before participation - leaving a final sample of 58 participants (32 men, 26 women; age: M=29.53 years, SEM = 2.57 years). Participants received a monetary compensation of 30 € for participation, plus a performance-dependent compensation (2–5 €). All participants gave written informed consent before entering the study, which was approved by the local ethics committee. In a between-subjects design, participants were randomly assigned to the stress (15 men and 15 women) or control group (17 men and 11 women).

Stress and control manipulation

Participants in the stress condition underwent the TSST (Kirschbaum et al., 1993), a standardized stress protocol known to reliably elicit both subjective and physiological stress responses (Allen et al., 2014; Kirschbaum et al., 1993). Briefly, the TSST consisted of a mock job interview during which participants were asked to give a 5 min free speech about why they are the ideal candidate for a job tailored to their interests and a 5 min mental arithmetic task (counting backwards in steps of 17 from 2043 as fast and accurate as possible; upon a mistake they had to stop and start again from 2023). Both the free speech and the mental arithmetic task were performed in front of a cold and non-reinforcing panel of two experimenters (1 man and 1 woman) who were dressed in white coats and introduced as experts in ‘behavioural analysis’. Furthermore, participants were videotaped throughout the TSST and could see themselves on a large screen placed next to the panel. In the control condition, participants gave a 5 min speech about a topic of their choice (e.g. their last holiday) and performed a simple mental arithmetic task (counting in steps of two) for 5 min while being alone in the experimental room - no video recordings were taken. During the control condition, the experimenter waited in front of the door outside the room where he/she was able to hear whether the participants had complied with the instructions.

To assess the effectiveness of the stress manipulation, subjective and physiological measurements were taken at several time points across the experiment. More specifically, participants rated the stressfulness, difficulty, and unpleasantness of the previous experience immediately after the TSST or control manipulation on a scale from 0 (‘not at all’) to 100 (‘very much’). In addition, blood pressure and pulse were measured using an OMRON M400 device (OMRON, Inc, MI) before, during, and immediately after the TSST/control manipulation as well as 20, 60, and 105 min after the TSST/control manipulation. To quantify cortisol concentrations and elevations during the experiment, saliva samples were collected from participants using Salivette collection devices (Sarstedt, Germany) before, during, as well as 20, 60, and 105 min after the TSST/control manipulation. Saliva samples were stored at –20°C until the end of testing. At the end of data collection, we determined the free fraction of cortisol from the saliva samples using a commercially available luminescence assay (IBL, Germany).

Reinforcement learning task

In order to investigate goal-directed and habitual contributions to behaviour, we used a modification of a recently introduced reinforcement learning task (Luque et al., 2017). In this reinforcement learning task, participants played the role of space traders on a mission to trade fractal stimuli (either a pink or a green fractal) for playing cards (either scene or object playing cards) with aliens from two tribes (‘red alien tribe’ and a ‘blue alien tribe’). Each tribe traded only one type of playing card (i.e. either scene or object playing cards). One type of card (the high-value outcome, Ohigh, worth 100 points) was more valuable than the other one (the low-value outcome, Olow, worth 20 points). In addition, one tribe of aliens exchanged a playing card only for a specific fractal (e.g. the red alien tribe only traded scene playing cards and only for a pink fractal but not for a green fractal). If a fractal was given to the alien tribe, the exchange was rejected by the alien (e.g. if the participant wanted to trade the green fractal with the red alien). In that case, participants did not receive a playing card and hence did not gain any points. Importantly, participants had to pay 5 points for every successful exchange (response costs). If a playing card was given to the incorrect alien and the exchange was denied, there were no response costs. Participants were encouraged to earn as many points as possible. At the end of the experiment, participants received a monetary reward dependent on the number of points earned throughout the reinforcement task.

Because the critical difference between goal-directed action and habitual responding is dependent on whether behaviour is sensitive to an outcome devaluation or not (Adams and Dickinson, 1981), specific outcomes were devalued in some of the task blocks. Participants were told that both types of cards needed to be kept in different boxes (scene cards in a ‘scene box’ and objects cards in an ‘object box’) but that sometimes one box was full and needed to be emptied. In that case, participants were not able to store cards of the said category anymore. Thus, in these blocks, the cards were worthless regardless of their usual value (i.e. the outcome was devaluated).

The status of the boxes (available/not available) varied only between blocks. Participants were informed about the status of the box via a message on the screen at the beginning of each block. In blocks without outcome devaluation (NoDev), both types of cards could be kept in their respective boxes and had their usual value. In blocks with devaluation of the high-valued outcome (Dev Ohigh), Ohigh cards could not be stored and thus their new value was zero, whereas Olow cards could be stored and thus had still the usual value. In blocks with devaluation of the low-valued outcome (Dev Olow), Olow cards could not be saved and had no value anymore while Ohigh cards still had high value (Figure 1). Importantly, due to the response costs (5 points for each card exchange), giving a card to an alien that traded the devalued card category was detrimental and should be avoided if behaviour is under goal-directed control.

Reinforcement learning trials were composed of S-R-O sequences. At the beginning of each trial, a fixation cross was presented on the centre of the computer screen for a random duration between 800 and 1200 ms. Then, one of the two highly distinct fractals (high-value stimulus, Shigh, or low-value stimulus, Slow) was presented. After 2000 ms, pictures of two aliens were shown at either side of the screen. The two aliens represented two different response options (R1 and R2). Participants were instructed to choose via a button press as to which alien they wanted to give the fractal. After their response, the screen was cleared, and the outcome feedback appeared on the screen for 1000 ms. If participants chose the incorrect alien, the message ‘The exchange was denied. You don't receive a playing card!’ was presented. This error feedback was presented in both NoDev and Dev blocks. If they did not respond within 2 s, the message ‘Time out, please respond faster!’ was displayed. The specific stimuli used as Shigh and Slow, R1 and R2, and Ohigh and Olow were counterbalanced across participants. The position of the aliens (left/right) representing the response options was randomized across trials. The feedback provided on each trial varied between blocks. During NoDev blocks, participants saw the value of the card that was gained before, as well as the corresponding response costs (‘+100/–5’ for Ohigh and ‘+20/–5’ for Olow). In Dev blocks (Dev Ohigh or Dev Olow blocks), this information was masked for all trials. Before Dev blocks, participants were informed that the outcome feedback was unavailable during these blocks because of a solar interference (in devaluation blocks, participants saw on the last screen of each trial ‘???’ instead of value), but that this solar interference would not influence the value of the cards that they could gain.

In addition to reinforcement learning trials, we included consumption trials in each block to test whether participants understood the task structure. At the beginning of the experiment, participants were instructed that sometimes the aliens become distracted and participants could take one of the two types of cards without trading it for fractals, and therefore without paying response costs. In these consumption trials, the message ‘The aliens seem distracted …’ was shown for 1000 ms, followed by a countdown (from 3 to 1) in the centre of the screen (over 3 s). Then, one scene and one object card appeared. One on the left and the other on the right (positions were selected randomly for each consumption trial). Participants choose one card via button press, and their choice revealed whether they were aware of the value of the stimuli in the respective block (NoDev, Dev Ohigh, and Dev Olow). Outcome screens were identical to those of the reinforcement learning trials.

Participants completed 27 NoDev blocks, 3 Dev Ohigh blocks, and 3 Dev Olow blocks, with 27 trials per block: 12 learning trials with Shigh, 12 learning trials with Slow, and 3 consumption trials. Reinforcement learning trials within each block were presented randomly. Consumption trials were displayed at trial numbers 7, 14, and 21 within each block. The Ohigh was always devaluated during the 2nd, 16th, and 29th block (Dev Ohigh), whereas Olow was always devalued in the 3rd, 17th, and 30th block (Dev Olow). Thus, these Dev blocks were presented at the beginning, middle, and end of the training.

At the end of each block, participants saw how many cards they had gained in the previous block, the total point value of the cards earned in that block, and the total number of points they had gained until then. Then, the next block started with the information about which box(es) was/were available (i.e. whether any outcome was devalued) for the upcoming block.

Performance during the devalued trials revealed the degree to which instrumental behaviour was goal-directed or habitual. Goal-directed action is indicated by the formation of S-R-O associations. Thus, if participants used a goal-directed system, they should adapt their responses to the actual outcome value following outcome devaluation. During devalued trials, participants did not earn any points if they chose the response associated with the devalued outcome. Importantly, however, they had to pay response costs leading to a subtraction of points. Thus, goal-directed participants should choose the response option associated with a trade rejection. Under such circumstances, participants did not have to pay any response costs (Figure 1). In contrast, habitual behaviour is reflected in simpler S-R associations rendering instrumental behaviour insensitive to changes in outcome value. Hence, choosing the action associated with the devalued outcome (where no points could be earned and response costs had to be paid) indicated less goal-directed and more habitual behaviour.

DMS task

In order to analyse the neural representations of response options and action outcomes, we trained an EEG-based classifier (see below) on a DMS task. This task was presented before and after participants had completed the reinforcement learning task in order to avoid any time-dependent biases in the trained classifier. In each DMS task, participants completed 128 trials. Participants were presented four different target types: object cards, scene cards, blue, and red symbols. In addition to colour, symbols also differed in shape, line orientation, and line position (blue rectangles with left-oriented lines in the upper area vs. red ovals with right-oriented lines in the lower area). Pictures used as targets were selected randomly from a pool of 256 pictures (90×object cards, 90×scene cards, 48×blue symbols, and 48×red symbols) with the restriction that successive trials did not belong to the same category more than three consecutive times. The remaining pictures were used as targets during the second DMS task. On each trial, the target was shown for 2 s on the centre of a computer screen. Participants were asked to hold it in mind during a subsequent delay phase of 2 s during which they saw a blank screen. Then, a probe stimulus was presented (the target and a distractor belonging to the same category), and the participants had to indicate via button press which picture they saw before. The position of the target during the response choice (right vs. left) was randomized. Different stimuli were used in the second DMS task compared to the first one.

Eye-tracking

We used a desktop mounted eye-tracker (EyeLink 1000; SR-Research Ltd., Mississauga, Ontario, Canada) to record monocular eye movements from the left eye at a sampling rate of 500 Hz. We used custom scripts implemented in MATLAB (The Mathworks, Natick, MA) to extract the mean saccades and blink information depending on the stimulus value, outcome devaluation, and time (0–2000 ms around the stimulus onset). Data for two participants was missing due to failed use of the eye-tracker.

Control variables

In order to control for potential group differences in depressive mood, chronic stress, and anxiety, participants completed the Beck Depression Inventory (Beck et al., 1996), the Trier Inventory for the Assessment of Chronic Stress (Schulz and Schlotz, 1999), and State-Trait Anxiety Inventory (Spielberger, 1983) at the end of the experiment. In addition, participants completed a German mood questionnaire (MDBF; Steyer et al., 1994) that measures subjective feeling on three dimensions (elevated vs. depressed mood, wakefulness vs. sleepiness, and calmness vs. restlessness) at the beginning of the experiment.

Procedure

All testing took place in the afternoon and early evening between 1 pm and 8 pm. Participants were instructed to refrain from excessive exercise and food or caffeine intake for the 2 hr before testing. After participants’ arrival at the laboratory, EEG was prepared, blood pressure measurements were taken, and a first saliva sample was collected. Participants also completed the mood scale (Steyer et al., 1994). Then, participants performed the first DMS task. After completing this task, participants received written instructions about the reinforcement learning task. In order to further familiarize participants with the structure of this task, participants completed a 5 min short tutorial afterwards. Next, participants underwent either the TSST or the control manipulation. Immediately thereafter, subjective assessments of this manipulation and another saliva sample were collected, and blood pressure was measured once again. Next, participants were briefly reminded of the instructions for the reinforcement learning task they had received before. Twenty minutes after the TSST/control manipulation, when cortisol was expected to have reached peak levels (Kirschbaum et al., 1993), participants collected another saliva sample before the reinforcement learning task commenced. After the 15th block of the reinforcement task and after finishing the task, further saliva samples were collected, and blood pressure was measured again (~60 min and ~105 min after stress onset). Finally, participants performed the second DMS task.

Statistical analysis

Subjective and physiological stress responses were analysed by mixed-design ANOVAs with the within-subject factor time point of measurement and the between-subjects factor group (stress vs. control). Participants’ responses in the reinforcement learning task were subjected to mixed-design ANOVAs with the within-subject factor; stimulus type (Shigh and Slow), outcome devaluation (NoDev: 1st, 12th, and 28th block; Dev Ohigh: 2nd, 16th, and 29th block; Dev Olow: 3rd, 17th, and 30th block), time point (1st, 2nd, and 3rd), and the between-subject factor group. Significant interaction effects were followed by appropriate post hoc tests. All reported p values are two-tailed and were Bonferroni corrected (pcorr) when indicated. Statistical analyses were calculated using SPSS 25 (IBM SPSS Statistics) and JASP version 0.13.0.0 software (https://jasp-stats.org/).

For one participant, we obtained only invalid trials during the last Dev Olow block. Thus, this participant could not be included in analyses of Dev Olow. Furthermore, two participants did not complete the Trier Inventory for the Assessment of Chronic Stress (Schulz and Schlotz, 1999). Data of 5 participants had to be excluded from the EEG analysis because of technical failure during the EEG, leaving a sample of 53 participants (control: n=25; stress: n=28) for EEG analyses.

EEG recordings

During the DMS and reinforcement learning task, participants were seated approximately 80 cm from the computer screen in an electrically shielded and sound proof cabin. EEG was recorded using a 128-channel BioSemi ActiveTwo system (BioSemi, Amsterdam, The Netherlands) organized according to the 10–5 system digitized at 2024 Hz. Additional electrodes were placed at the left and right mastoids, approximately 1 cm above and below the orbital ridge of each eye and at the outer canthi of the eyes. The EEG data was online referenced to the BioSemi CMS-DRL (common mode sense-driven right leg) reference. Electrode impedances were kept below 30 kΩ. EEG was amplified with a low cut-off frequency of 0.53 Hz (=0.3 s time constant).

EEG analysis

Preprocessing

Preprocessing was performed offline using FieldTrip (Oostenveld et al., 2011) and EEGLAB (Delorme and Makeig, 2004) as well as custom scripts implemented and processed in MATLAB (The Mathworks, Natick, MA). The PREP pipeline procedure (Bigdely-Shamlo et al., 2015) was utilized to transform the channel EEG data using a robust average reference. In addition, bad channels were interpolated using the spherical option of EEGLAB eeg_interp function. Then, data was filtered with a high pass filter of 0.1 Hz and a low pass filter of 100 Hz and downsampled to 250 Hz. For MVPA, epochs from the DMS task (2000 ms relative to the delay onset) and from the reinforcement learning task (2000 ms relative to the onset of stimulus and response option presentation) were extracted. For ERP analysis, EEG data was segmented into epochs from –200 to 2000 ms around the stimulus onset and baseline-corrected by subtracting the average 200 ms prestimulus interval. In addition, blinks and eye movements were corrected by independent component analysis (infomax ICA, Noh et al., 2014). Using the automated procedure ADJUST (Mognon et al., 2011), ocular artefact-related components in EEG recordings were identified and subsequently removed. The ADJUST algorithm combines stereotypical artefact-specific spatial and temporal features to detect and differentiate artefact ICA components (Chaumon et al., 2015). For example, ADJUST computes the kurtosis of the event-related time course for frontal channels, since, for example, eye blinks are accompanied by abrupt amplitude jumps in frontal regions areas (stereotypical artefact-specific temporal feature). Additionally, ADJUST determines the spatial topography of the IC weights to compare the magnitude of the amplitudes between frontal and posterior areas (stereotypical artefact-specific spatial feature). Using the ADJUST procedure, on average 1.65 (SEM = 0.13) components per participant were detected and removed. Previous data shows that the automatic detection of artefact components using ADJUST leads to a comparable classification of artefact components that are afforded by manual classification by experiments (Mognon et al., 2011).

ERP analysis

Based on previous studies (Hickey et al., 2010; MacLean and Giesbrecht, 2015), we expected to find an effect of stimulus value (i.e. the difference between activity elicited by Shigh and Slow) in the occipital P1 component peaking within the time window from 75 to 200 ms relative to stimulus onset. Because all stimuli were presented centrally, P1 activity was analysed in the midline occipital electrode Oz. The P1 peak was defined as the largest positive peak between 75 and 200 ms after the stimulus onset at Oz (averaging across all conditions). In line with previous research, a time window of 70 ms around that peak was then selected for analysis (Luque et al., 2017). Because the P1 maximum amplitude was at 115 ms from stimulus onset across participants, the P1 magnitude for each condition was defined as the mean EEG signal across the 80–150 ms time window.

Based on previous research on the neural underpinnings of goal-directed action (Luque et al., 2017), we further assessed the effects of stress on brain activity over centroparietal regions at a later time window. To this end, ERP data from 400 to 700 ms was subdivided into six consecutive, non-overlapping time bins - with a duration of 50 ms each. The later ERP data were analysed using mixed-design ANOVAs with the within-subject factors; outcome devaluation (NoDev, Dev Ohigh, and Dev Olow) and stimulus value (Shigh and Slow) and the between-subject factor group. Significant outcome devaluation × stimulus value × group interaction effects were appropriately corrected for multiple comparisons.

MVPA training

The multivariate decoding analyses were implemented using the MVPA-Light toolbox (Treder, 2020). The classifier was trained within-subject, using a linear SVM on the preprocessed data of the DMS task (delay phase). All EEG channels were used as features. To improve the signal to noise ratio for MVPA, trials were averaged to pseudo trials (Isik et al., 2014). Each pseudo trial was an average of two trials. We performed two separate analyses corresponding to the following classes: object vs. scene (SVMobject/scene) and blue symbol vs. red symbol (SVMblue/red). To identify the optimal time window for decoding per participant, we implemented a sliding window averaging 100 ms with a step size of 10 ms. The SVMobject/scene and SVMblue/red with the highest performance were used to decode the neural outcome and response representation. Generalization of the classifier was evaluated using a leave-one-out procedure. If the classifier’s performance remains significantly above chance, it indicates that the EEG patterns contain class-specific information and that the class can be reliably decoded from the EEG data (Murphy et al., 2011). The chance level in a simple two-class paradigm is not exactly 50% but 50% with a CI at a certain alpha level. Therefore, we calculated this interval utilizing the Wald interval with adjustments for a small sample size (Agresti and Caffo, 2000; Müller-Putz et al., 2008). The threshold for chance performance was 63.59% for the classification of blue vs. red symbols and 60.11% for object vs. scene images. Participants with classification accuracy below chance in the DMS task were not included in subsequent analyses. During the classification of blue vs. red symbols, the highest performance of two participants did not exceed the threshold for chance performance. During the classification of blue vs. red scenes, classification accuracy of 13 participants was not significant. Hence, the sample for analysis of the outcome representation (based on SVMobject/scene) and response representation (based on SVMred/blue) reported in the main text was 41 (control: n=18; stress: n=23) and 51 (control: n=25; stress: n=26) participants. Importantly, we computed all analyses again including all participants regardless of significant classification scores. This additional analysis left our findings largely unchanged (Supplementary file 1B). Furthermore, the classification accuracy of the first and the second DMS task did not differ (blue/red symbol classification: F[1, 51] = 0.66, p=0.798, ƞp²=0.013, and 95% CI=0–0.127; DMS session × group interaction: F[1, 51] = 0.03, p=0.863, ƞp²=0.001, and 95% CI=0–0.024; object/scene classification: F[1, 51] = 2.87, p=0.096, ƞp²=0.053, and 95% CI=0–0.203; DMS session × group: F[1, 51] = 0.03, p=0.860, ƞp²=0.006, and 95% CI=0.003–0.011). Thus, we pooled the data from both DMS task sessions and trained an overall classifier to ensure that the classifier was not affected by any time-related biases, and that there is a sufficient number of trials to train a reliable classifier.

MVPA decoding

The SVMobject/scene and SVMblue/red trained on the independent DMS dataset were used to assess the respective outcome (object card vs. scene card) and response representation (blue alien vs. red alien) in the trial-by-trial reinforcement learning task during the NoDev blocks. Both classifiers were applied to the respective test data (with an epoch size of 2000 ms relative to stimulus onset) using an overlapping sliding window, with a time average of 100 ms and a step size of 10 ms. The maximum classification accuracy indicated the strength of outcome and response representation.

Goal-directed behaviour is characterized by an action-outcome (S-R-O) association, which should result in an anticipatory outcome representation during stimulus presentation and response choice. Hence, we applied the SVMobject/scene to the stimulus presentation and response choice phase of the reinforcement learning task in order to decode the outcome representation. With increasing habitual control of behaviour, outcome representations should be reduced. Response representations, may in turn be relevant for both habitual S-R learning and goal-directed S-R-O learning. Therefore, we applied the SVMblue/red to the stimulus presentation phase to decode the response representation. In retrospect, a search-light RSA(representational similarity analysis) asking which features contribute the most to decoding neural outcome and response representations revealed that at the time of average maximal decoding accuracy (about 200 ms after stimulus onset for both classifications), a parieto-occipital cluster was contributing the most to decoding outcome representations. In contrast, a centroparietal cluster was contributing the most to decoding response representations (Figure 2—figure supplement 4).

Maximal accuracy values were then averaged over six blocks to get a reliable indicator of classification accuracy. Statistical analyses of the EEG decoding data were performed at the group level averaging across individual decoding accuracies. Decoding accuracies were analysed using mixed-design ANOVAs with the within-subject factor block (1–6, 7–12, 13–18, and 19–24 blocks) and the between-subject factor group. In addition, we computed Spearman correlations between ∆ classification scores (averaged classification accuracy during the first six blocks minus the classification accuracy during the last six blocks) and ∆ responses for devalued outcomes for Dev Ohigh blocks (responses for devalued outcomes during the 1st devaluation block, i.e. the 2nd overall block, minus responses for devalued outcomes during the last devaluation block, i.e. the 29th overall block). Bayesian correlation analyses were conducted using JASP version 0.13 software (https://jasp-stats.org/) and the default Cauchy prior 0.707.

Searchlight approach

To provide more insight into which electrophysiological information contributed most to differentiating between the respective categories (red vs. blue and objects vs. scenes, respectively), we also performed searchlight analyses. These allowed us to determine those topographic features that discriminated most between the two sets of categories. We first pooled the segmented ERP data (0–2 s relative to stimulus onset) from all participants. Then, we averaged the time range of ±100 ms relative to the respective maximum decodability. For each searchlight analysis, a cluster was constituted by a centre electrode and all neighbouring electrodes within a radius of 4 cm. The average cluster size was 10.61 electrodes (SEM = 0.31). We used an SVM as classifier and calculated the Wald interval with adjustments for a small sample size to evaluate the accuracy values (Agresti and Caffo, 2000; Müller-Putz et al., 2008).

Acknowledgements

We thank Keyvan Khatiri, Charlotte Germer, Marian Wiskow, and Yichen Zhong for their assistance during data collection and Polina Perzich for critical proofreading of the manuscript. We further gratefully acknowledge the support of Carlo Hiller with programming of the software. This study was funded by the German Research Foundation (DFG, SCHW1357/23-1).

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Lars Schwabe, Email: lars.schwabe@uni-hamburg.de.

Mimi Liljeholm, University of California, Irvine, United States.

Floris P de Lange, Radboud University, Netherlands.

Funding Information

This paper was supported by the following grant:

  • Deutsche Forschungsgemeinschaft SCHW1357/23-1 to Lars Schwabe.

Additional information

Competing interests

No competing interests declared.

Author contributions

Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft.

Conceptualization, Funding acquisition, Methodology, Writing - review and editing.

Conceptualization, Resources, Supervision, Funding acquisition, Methodology, Writing - original draft, Project administration, Writing - review and editing.

Ethics

All participants provided informed consent before participation in the experiment. The experiment was performed in line with the Declaration of Helsinki and approved by the ethics committee of the Faculty of Psychology and Human Movement Sciences at the Universität Hamburg (2018_197_Schwabe).

Additional files

Supplementary file 1. Supplementary material.

(a) Classification results (block × group interaction) block for alternatively grouped data. (b) Classification results on outcome representations for N=53. (c) Control variables: subjective mood, depressive mood, anxiety, and chronic stress. (d) Eye-tracking data: mean number of blinks and saccades across time, for Shigh and Slow trials (N=51).

elife-67517-supp1.docx (27KB, docx)
Transparent reporting form

Data availability

Data reported in this manuscript are available from the website: https://github.com/08122019/From-goal-directed-action-to-habit, copy archived at swh:1:rev:d52c6db7505b203310ec623db4fc07fede10eb9d.

References

  1. Adams CD, Dickinson A. Instrumental responding following reinforcer devaluation. The Quarterly Journal of Experimental Psychology Section B. 1981;33:109–121. doi: 10.1080/14640748108400816. [DOI] [Google Scholar]
  2. Adams CD. Variations in the sensitivity of instrumental responding to reinforcer devaluation. The Quarterly Journal of Experimental Psychology Section B. 1982;34:77–98. doi: 10.1080/14640748208400878. [DOI] [Google Scholar]
  3. Adams TG, Kelmendi B, Brake CA, Gruner P, Badour CL, Pittenger C. The role of stress in the pathogenesis and maintenance of obsessive-compulsive disorder. Chronic Stress. 2018;2:2470547018758043. doi: 10.1177/2470547018758043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Agresti A, Caffo B. Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. The American Statistician. 2000;54:280. doi: 10.2307/2685779. [DOI] [Google Scholar]
  5. Allen AP, Kennedy PJ, Cryan JF, Dinan TG, Clarke G. Biological and psychological markers of stress in humans: focus on the Trier Social Stress Test. Neuroscience and Biobehavioral Reviews. 2014;38:94–124. doi: 10.1016/j.neubiorev.2013.11.005. [DOI] [PubMed] [Google Scholar]
  6. Bae GY, Luck SJ. Dissociable decoding of spatial attention and working memory from eeg oscillations and sustained potentials. The Journal of Neuroscience. 2018;38:409–422. doi: 10.1523/JNEUROSCI.2860-17.2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Balleine BW, Dickinson A. Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology. 1998;37:407–419. doi: 10.1016/s0028-3908(98)00033-1. [DOI] [PubMed] [Google Scholar]
  8. Balleine BW, O’Doherty JP. Human and rodent homologies in action control: corticostriatal determinants of goal-directed and habitual action. Neuropsychopharmacology. 2010;35:48–69. doi: 10.1038/npp.2009.131. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Beck AT, Steer RA, Brown GK. Manual for the Beck Depression Inventory-II. San Antonio: Psychological Corporation; 1996. [Google Scholar]
  10. Bigdely-Shamlo N, Mullen T, Kothe C, Su K-M, Robbins KA. The PREP pipeline: standardized preprocessing for large-scale EEG analysis. Frontiers in Neuroinformatics. 2015;9:16. doi: 10.3389/fninf.2015.00016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Braun S, Hauber W. Acute stressor effects on goal-directed action in rats. Learning & Memory. 2013;20:700–709. doi: 10.1101/lm.032987.113. [DOI] [PubMed] [Google Scholar]
  12. Cairney SA, Guttesen AÁV, El Marj N, Staresina BP. Memory consolidation is linked to spindle-mediated information processing during sleep. Current Biology. 2018;28:948–954. doi: 10.1016/j.cub.2018.01.087. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Chaumon M, Bishop DVM, Busch NA. A practical guide to the selection of independent components of the electroencephalogram for artifact correction. Journal of Neuroscience Methods. 2015;250:47–63. doi: 10.1016/j.jneumeth.2015.02.025. [DOI] [PubMed] [Google Scholar]
  14. Corbit LH, Balleine BW. The role of prelimbic cortex in instrumental conditioning. Behavioural Brain Research. 2003;146:145–157. doi: 10.1016/j.bbr.2003.09.023. [DOI] [PubMed] [Google Scholar]
  15. de Kloet ER, Karst H, Joëls M. Corticosteroid hormones in the central stress response: quick-and-slow. Frontiers in Neuroendocrinology. 2008;29:268–272. doi: 10.1016/j.yfrne.2007.10.002. [DOI] [PubMed] [Google Scholar]
  16. de Quervain DJ, Roozendaal B, McGaugh JL. Stress and glucocorticoids impair retrieval of long-term spatial memory. Nature. 1998;394:787–790. doi: 10.1038/29542. [DOI] [PubMed] [Google Scholar]
  17. de Wit S, Kindt M, Knot SL, Verhoeven AAC, Robbins TW, Gasull-Camos J, Evans M, Mirza H, Gillan CM. Shifting the balance between goals and habits: Five failures in experimental habit induction. Journal of Experimental Psychology. General. 2018;147:1043–1065. doi: 10.1037/xge0000402. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Delorme A, Makeig S. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. Journal of Neuroscience Methods. 2004;134:9–21. doi: 10.1016/j.jneumeth.2003.10.009. [DOI] [PubMed] [Google Scholar]
  19. Dias-Ferreira E, Sousa JC, Melo I, Morgado P, Mesquita AR, Cerqueira JJ, Costa RM, Sousa N. Chronic stress causes frontostriatal reorganization and affects decision-making. Science. 2009;325:621–625. doi: 10.1126/science.1171203. [DOI] [PubMed] [Google Scholar]
  20. Dickinson A, Balleine B. In: Spatial Representation: Problems in Philosophy and Psychology. Eilan N, McCarthy RA, Brewer B, editors. Blackwell Publishing; 1993. Actions and responses: The dual psychology of behaviour. [Google Scholar]
  21. Dickinson A, Balleine B. Motivational control of goal-directed action. Animal Learning & Behavior. 1994;22:1–18. doi: 10.3758/BF03199951. [DOI] [Google Scholar]
  22. Dickinson A, Balleine B, Watt A, Gonzalez F, Boakes RA. Motivational control after extended instrumental training. Animal Learning & Behavior. 1995;23:197–206. doi: 10.3758/BF03199935. [DOI] [Google Scholar]
  23. Domes G, Zimmer P. Acute stress enhances the sensitivity for facial emotions: a signal detection approach. Stress. 2019;22:455–460. doi: 10.1080/10253890.2019.1593366. [DOI] [PubMed] [Google Scholar]
  24. Everitt BJ, Robbins TW. Neural systems of reinforcement for drug addiction: from actions to habits to compulsion. Nature Neuroscience. 2005;8:1481–1489. doi: 10.1038/nn1579. [DOI] [PubMed] [Google Scholar]
  25. Gagnon SA, Wagner AD. Acute stress and episodic memory retrieval: neurobiological mechanisms and behavioral consequences. Annals of the New York Academy of Sciences. 2016;1369:55–75. doi: 10.1111/nyas.12996. [DOI] [PubMed] [Google Scholar]
  26. Goeders NE. Stress, motivation, and drug addiction. Current Directions in Psychological Science. 2004;13:33–35. doi: 10.1111/j.0963-7214.2004.01301009.x. [DOI] [Google Scholar]
  27. Gourley SL, Swanson AM, Jacobs AM, Howell JL, Mo M, Dileone RJ, Koleske AJ, Taylor JR. Action control is mediated by prefrontal BDNF and glucocorticoid receptor binding. PNAS. 2012;109:20714–20719. doi: 10.1073/pnas.1208342109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Griffiths KR, Morris RW, Balleine BW. Translational studies of goal-directed action as a framework for classifying deficits across psychiatric disorders. Frontiers in Systems Neuroscience. 2014;8:101. doi: 10.3389/fnsys.2014.00101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Hartogsveld B, van Ruitenbeek P, Quaedflieg CWEM, Smeets T. Balancing between goal-directed and habitual responding following acute stress. Experimental Psychology. 2020;67:99–111. doi: 10.1027/1618-3169/a000485. [DOI] [PubMed] [Google Scholar]
  30. Hickey C, Chelazzi L, Theeuwes J. Reward changes salience in human vision via the anterior cingulate. The Journal of Neuroscience. 2010;30:11096–11103. doi: 10.1523/JNEUROSCI.1026-10.2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Hoijtink H. Informative hypotheses: Theory and practice for behavioral and social scientists. CRC; Boca Raton: 2012. [Google Scholar]
  32. Isik L, Meyers EM, Leibo JZ, Poggio T. The dynamics of invariant object recognition in the human visual system. Journal of Neurophysiology. 2014;111:91–102. doi: 10.1152/jn.00394.2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Joëls M, Baram TZ. The neuro-symphony of stress. Nature Reviews. Neuroscience. 2009;10:459–466. doi: 10.1038/nrn2632. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Joëls M, Sarabdjitsingh RA, Karst H. Unraveling the time domains of corticosteroid hormone influences on brain activity: rapid, slow, and chronic modes. Pharmacological Reviews. 2012;64:901–938. doi: 10.1124/pr.112.005892. [DOI] [PubMed] [Google Scholar]
  35. Kass RE, Raftery AE. Bayes Factors. Journal of the American Statistical Association. 1995;90:773–795. doi: 10.1080/01621459.1995.10476572. [DOI] [Google Scholar]
  36. Kirschbaum C, Pirke KM, Hellhammer DH. The ’Trier Social Stress Test’--a tool for investigating psychobiological stress responses in a laboratory setting. Neuropsychobiology. 1993;28:76–81. doi: 10.1159/000119004. [DOI] [PubMed] [Google Scholar]
  37. Luque D, Beesley T, Morris RW, Jack BN, Griffiths O, Whitford TJ, Le Pelley ME. Goal-directed and habit-like modulations of stimulus processing during reinforcement learning. The Journal of Neuroscience. 2017;37:3009–3017. doi: 10.1523/JNEUROSCI.3205-16.2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. MacLean MH, Giesbrecht B. Neural evidence reveals the rapid effects of reward history on selective attention. Brain Research. 2015;1606:86–94. doi: 10.1016/j.brainres.2015.02.016. [DOI] [PubMed] [Google Scholar]
  39. McNamee D, Liljeholm M, Zika O, O’Doherty JP. Characterizing the associative content of brain structures involved in habitual and goal-directed actions in humans: A multivariate FMRI study. The Journal of Neuroscience. 2015;35:3764–3771. doi: 10.1523/JNEUROSCI.4677-14.2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Mognon A, Jovicich J, Bruzzone L, Buiatti M. ADJUST: An automatic EEG artifact detector based on the joint use of spatial and temporal features. Psychophysiology. 2011;48:229–240. doi: 10.1111/j.1469-8986.2010.01061.x. [DOI] [PubMed] [Google Scholar]
  41. Müller-Putz G, Scherer R, Brunner C, Leeb R, Pfurtscheller G. Better than random? A closer look on BCI results. International Journal of Bioelectromagnetism. 2008;10:52–55. [Google Scholar]
  42. Murphy B, Poesio M, Bovolo F, Bruzzone L, Dalponte M, Lakany H. EEG decoding of semantic category reveals distributed representations for single concepts. Brain and Language. 2011;117:12–22. doi: 10.1016/j.bandl.2010.09.013. [DOI] [PubMed] [Google Scholar]
  43. Noh E, Herzmann G, Curran T, de Sa VR. Using single-trial EEG to predict and analyze subsequent memory. NeuroImage. 2014;84:712–723. doi: 10.1016/j.neuroimage.2013.09.028. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Nuzzo RL. An introduction to bayesian data analysis for correlations. PM & R. 2017;9:1278–1282. doi: 10.1016/j.pmrj.2017.11.003. [DOI] [PubMed] [Google Scholar]
  45. Oostenveld R, Fries P, Maris E, Schoffelen JM. FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Computational Intelligence and Neuroscience. 2011;2011:156869. doi: 10.1155/2011/156869. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Ostlund SB, Balleine BW. Lesions of medial prefrontal cortex disrupt the acquisition but not the expression of goal-directed learning. The Journal of Neuroscience. 2005;25:7763–7770. doi: 10.1523/JNEUROSCI.1921-05.2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Packard MG. Glutamate infused posttraining into the hippocampus or caudate-putamen differentially strengthens place and response learning. PNAS. 1999;96:12881–12886. doi: 10.1073/pnas.96.22.12881. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Perez OD, Dickinson A. A theory of actions and habits: The interaction of rate correlation and contiguity systems in free-operant behavior. Psychological Review. 2020;127:945–971. doi: 10.1037/rev0000201. [DOI] [PubMed] [Google Scholar]
  49. Robbins TW, Gillan CM, Smith DG, de Wit S, Ersche KD. Neurocognitive endophenotypes of impulsivity and compulsivity: towards dimensional psychiatry. Trends in Cognitive Sciences. 2012;16:81–91. doi: 10.1016/j.tics.2011.11.009. [DOI] [PubMed] [Google Scholar]
  50. Roozendaal B. Stress and memory: opposing effects of glucocorticoids on memory consolidation and memory retrieval. Neurobiology of Learning and Memory. 2002;78:578–595. doi: 10.1006/nlme.2002.4080. [DOI] [PubMed] [Google Scholar]
  51. Schulz P, Schlotz W. The Trier Inventory for the Assessment of Chronic Stress (TICS): scale construction, statistical testing, and validation of the scale work overload. Diagnostica. 1999;45:8–19. doi: 10.1026//0012-1924.45.1.8. [DOI] [Google Scholar]
  52. Schwabe L, Oitzl MS, Richter S, Schächinger H. Modulation of spatial and stimulus-response learning strategies by exogenous cortisol in healthy young women. Psychoneuroendocrinology. 2009;34:358–366. doi: 10.1016/j.psyneuen.2008.09.018. [DOI] [PubMed] [Google Scholar]
  53. Schwabe L, Wolf OT. Stress prompts habit behavior in humans. The Journal of Neuroscience. 2009;29:7191–7198. doi: 10.1523/JNEUROSCI.0979-09.2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Schwabe L, Tegenthoff M, Höffken O, Wolf OT. Concurrent glucocorticoid and noradrenergic activity shifts instrumental behavior from goal-directed to habitual control. The Journal of Neuroscience. 2010;30:8190–8196. doi: 10.1523/JNEUROSCI.0734-10.2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Schwabe L, Wolf OT. Socially evaluated cold pressor stress after instrumental learning favors habits over goal-directed action. Psychoneuroendocrinology. 2010;35:977–986. doi: 10.1016/j.psyneuen.2009.12.010. [DOI] [PubMed] [Google Scholar]
  56. Schwabe L, Dickinson A, Wolf OT. Stress, habits, and drug addiction: a psychoneuroendocrinological perspective. Experimental and Clinical Psychopharmacology. 2011a;19:53–63. doi: 10.1037/a0022212. [DOI] [PubMed] [Google Scholar]
  57. Schwabe L, Höffken O, Tegenthoff M, Wolf OT. Preventing the stress-induced shift from goal-directed to habit action with a β-adrenergic antagonist. The Journal of Neuroscience. 2011b;31:17317–17325. doi: 10.1523/JNEUROSCI.3304-11.2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Schwabe L, Tegenthoff M, Höffken O, Wolf OT. Simultaneous glucocorticoid and noradrenergic activity disrupts the neural basis of goal-directed action in the human brain. The Journal of Neuroscience. 2012;32:10146–10155. doi: 10.1523/JNEUROSCI.1304-12.2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Schwabe L, Wolf OT. Stress modulates the engagement of multiple memory systems in classification learning. The Journal of Neuroscience. 2012;32:11042–11049. doi: 10.1523/JNEUROSCI.1484-12.2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Schwabe L, Wolf OT. Stress and multiple memory systems: from “thinking” to “doing.”. Trends in Cognitive Sciences. 2013;17:60–68. doi: 10.1016/j.tics.2012.12.001. [DOI] [PubMed] [Google Scholar]
  61. Seehagen S, Schneider S, Rudolph J, Ernst S, Zmyj N. Stress impairs cognitive flexibility in infants. PNAS. 2015;112:12882–12886. doi: 10.1073/pnas.1508345112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Siller-Pérez C, Serafín N, Prado-Alcalá RA, Roozendaal B, Quirarte GL. Glucocorticoid administration into the dorsolateral but not dorsomedial striatum accelerates the shift from a spatial toward procedural memory. Neurobiology of Learning and Memory. 2017;141:124–133. doi: 10.1016/j.nlm.2017.03.020. [DOI] [PubMed] [Google Scholar]
  63. Smeets T, Quaedflieg C. How stress and cognitive fatigue change the balancing between goal-directed and habitual behavior. Psychoneuroendocrinology. 2019;107:71. doi: 10.1016/j.psyneuen.2019.07.204. [DOI] [PubMed] [Google Scholar]
  64. Smeets T, van Ruitenbeek P, Hartogsveld B, Quaedflieg CWEM. Stress-induced reliance on habitual behavior is moderated by cortisol reactivity. Brain and Cognition. 2019;133:60–71. doi: 10.1016/j.bandc.2018.05.005. [DOI] [PubMed] [Google Scholar]
  65. Soares JM, Sampaio A, Ferreira LM, Santos NC, Marques F, Palha JA, Cerqueira JJ, Sousa N. Stress-induced changes in human decision-making are reversible. Translational Psychiatry. 2012;2:e131. doi: 10.1038/tp.2012.59. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Spielberger CD. Manual for the State-Trait Anxiety Inventory. Palo Alto, CA: Consulting Psychologists Pres; 1983. [Google Scholar]
  67. Steyer R, Schwenkmezger P, Notz P, Eid M. Testtheoretische analysen des mehrdimensionalen befindlichkeitsfragebogen (mdbf) Diagnostica. 1994;40:320–328. [Google Scholar]
  68. Taghizadeh-Sarabi M, Daliri MR, Niksirat KS. Decoding objects of basic categories from electroencephalographic signals using wavelet transform and support vector machines. Brain Topography. 2015;28:33–46. doi: 10.1007/s10548-014-0371-9. [DOI] [PubMed] [Google Scholar]
  69. Tanaka SC, Balleine BW, O’Doherty JP. Calculating consequences: brain systems that encode the causal effects of actions. The Journal of Neuroscience. 2008;28:6750–6755. doi: 10.1523/JNEUROSCI.1808-08.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Treder MS. Mvpa-light: A classification and regression toolbox for multi-dimensional data. Frontiers in Neuroscience. 2020;14:289. doi: 10.3389/fnins.2020.00289. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Tricomi E, Balleine BW, O’Doherty JP. A specific role for posterior dorsolateral striatum in human habit learning. The European Journal of Neuroscience. 2009;29:2225–2232. doi: 10.1111/j.1460-9568.2009.06796.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Valentin VV, Dickinson A, O’Doherty JP. Determining the neural substrates of goal-directed learning in the human brain. The Journal of Neuroscience. 2007;27:4019–4026. doi: 10.1523/JNEUROSCI.0564-07.2007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Voon V, Derbyshire K, Rück C, Irvine MA, Worbe Y, Enander J, Schreiber LRN, Gillan C, Fineberg NA, Sahakian BJ, Robbins TW, Harrison NA, Wood J, Daw ND, Dayan P, Grant JE, Bullmore ET. Disorders of compulsivity: a common bias towards learning habits. Molecular Psychiatry. 2015;20:345–352. doi: 10.1038/mp.2014.44. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Wirz L, Wacker J, Felten A, Reuter M, Schwabe L. A deletion variant of the α2b-adrenoceptor modulates the stress-induced shift from “cognitive” to “habit” memory. The Journal of Neuroscience. 2017;37:2149–2160. doi: 10.1523/JNEUROSCI.3507-16.2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Wirz L, Bogdanov M, Schwabe L. Habits under stress: mechanistic insights across different types of learning. Current Opinion in Behavioral Sciences. 2018;20:9–16. doi: 10.1016/j.cobeha.2017.08.009. [DOI] [Google Scholar]
  76. Yin HH, Knowlton BJ, Balleine BW. Lesions of dorsolateral striatum preserve outcome expectancy but disrupt habit formation in instrumental learning. The European Journal of Neuroscience. 2004;19:181–189. doi: 10.1111/j.1460-9568.2004.03095.x. [DOI] [PubMed] [Google Scholar]
  77. Yin HH, Knowlton BJ, Balleine BW. Inactivation of dorsolateral striatum enhances sensitivity to changes in the action-outcome contingency in instrumental conditioning. Behavioural Brain Research. 2006;166:189–196. doi: 10.1016/j.bbr.2005.07.012. [DOI] [PubMed] [Google Scholar]

Editor's evaluation

Mimi Liljeholm 1

The authors combined a cleverly designed behavioral task with EEG-based multivariate pattern analysis and acute stress induction to assess the neural representations mediating an influence of stress on the balance between goal-directed and habitual responding. They found that stress induction reduced neural outcome representations, and that this representational change correlated with the degree of habitual performance. The topic and approach should be of interest to a wide audience, ranging from clinicians to economists and neuroscientists.

Decision letter

Editor: Mimi Liljeholm1
Reviewed by: Mimi Liljeholm2

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Stress diminishes outcome but enhances response representations during instrumental learning" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Mimi Liljeholm as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Floris de Lange as the Senior Editor.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

(1) Several clarifications must be made to address the possibility that saccadic eye movements could be acting as a confound for some of the EEG decoding results, including an account of which features were most important for classification performance, and how eye-movements and blinks were dealt with in preprocessing.

(2) It appears that devaluation insensitivity, and reduced neural outcome representations, emerge at a point when stress measures no longer indicate a stress state. Please clarify how the time courses of these measures relate to the claims about an influence of stress on devaluation insensitivity.

(3) The proposed role of extended training is consistent with the fact that significant devaluation insensitivity and reduced outcome representations emerge in the last block in the stress group. However, the notion that habits develop with extended training seems inconsistent with the apparent decrease in response representation and increase in outcome representation across blocks in the control group, as well as the apparent increase in devaluation sensitivity in the stress group across blocks 1 and 2. These inconsistencies must be resolved and claims modified accordingly.

(4) An important aspect of assessing devaluation sensitivity in the animal literature is a reinforced test, in which the devalued outcome is delivered contingent on the relevant response – as a result, habitual animals rapidly reduce performance of the devalued action. Such tests rule out a general lack of attention, response perseveration etc. The absence of a reinforced test should be explicitly addressed and the implications discussed.

(5) It is unclear from Figure 1 whether the "Error" feedback screen was employed during devaluation blocks. If so, please address the potential influence of negative social feedback on avoidance of the non-devalued action, particularly in the group that had just been subjected to a social stress test.

(6) A range of statistical approaches are employed: please provide a section that details and justifies each.

Reviewer #1 (Recommendations for the authors):

(1) The devaluation, but not accuracy data, is clearly bimodal, in both the control and stress groups. I would like to see this discussed in more detail, in terms of consistency with physiological and neural measures, as well as in terms of broader theoretical implications.

(2) It looks like the response representation is actually decreasing across blocks in the control group. This seems inconsistent with the general argument that habits should develop as training progresses.

(3) "…while model-free learning can be present after only few trials, establishing habit behaviour requires extensive training." This statement is problematic for several reasons. First, it appears to be contradicted in the very next paragraph, where the authors acknowledge that the necessity and sufficiency of extensive training for habit expression in human subjects is controversial. Second, in terms of operational definitions, habits are instrumental responses that are insensitive to outcome devaluation, whether extensively trained or not. Likewise, theoretically, habits are usually attributed to the strength of the stimulus-response association, which may depend on repetition, but also on the magnitude of the reward, as well as, presumably, a range of other factors (e.g., stimulus salience).

(4) Given the schematic of the goal-directed system (top of Figure 1A), it's a bit strange that outcome and response representations would be uncorrelated.

(5) The violin plots are great for illustrating the bimodality in devaluation sensitivity, but it would also be useful to see all of these related variables plotted in the same way, preferably in a single figure. Please include line graphs showing the degree of responding for the devalued outcome, the decoding accuracy for responses and outcomes, and the physiological stress measures, across groups, in a way that allows for a visual, side-by-side, comparison.

(6) Please include graphics of devalued trials in Figure 1C, to illustrate the masking of outcomes and the nature of the error feedback on such trials.

(7) In the Results section, following the description of the task and procedures, but prior to reporting any results, please add a section detailing and justifying the range of statistical approaches. For example, some reported p-values for t-tests appear to be uncorrected (e.g., line 358), as are all p-values for correlation tests. Moreover, a Bayesian analysis is used to assess evidence against a correlation, but not evidence in favor of a correlation.

(8) The phrase "responses to devalued actions", employed throughout, is a bit confusing. I would change this to "responses for devalued outcomes".

(9) I'm guessing that Figure 3 is showing proportions, not percentages.

Reviewer #2 (Recommendations for the authors):

It was not clear to me what the main hypothesis of the study was. The authors seem to talk rather loosely about habits being developed after overtraining versus the mediating effect of stress on habits. The introduction should convey their main goal more clearly.

In my view, this task is not meant to test for overtraining effects as it includes repeated devaluation of the outcome within the same subject, which makes it very difficult to assess the extent to which overtraining -rather than simply lack of motivation at the end of the task or other factor- is explaining weaker devaluation effects late in training. This is a critique for any experiment looking at overtraining effects, bit it should be at least demonstrated that what is driving the effect found by the authors is indeed an effect of overtraining by looking at more than one extension of training. Did the authors test other extensions of training in pilot studies?

It does not seem to me that the task is relevant for the habit/goals distinction. The authors mention how the model-based/model-free distinction may not be interrogating the habit/goal distinction, but this task suffers from the same shortcomings. This is a problem in general with this type of experiments, in that a participant could behave in an S-R or R-O manner independently of the amount of training and switch back and forth between these two strategies throughout the task. The task is simply not well-suited for testing the habit/goal-directed distinction.

In addition, the task also employs multiple devaluation phases, which, if anything, should make participants more sensitive to the devaluation procedure. Was that the case?

It would be good to see the learning curves associated with each group of subjects. The authors' main finding is an effect that is only observed in the last block of training and only for one condition (when the outcome that was highly valued is devalued). Overtraining effects are only relevant after responding has plateaued. Inspecting if both groups have already asymptoted, or whether one did it before the other is an important point, I think.

The same authors have shown that stress manipulations after training render behavior insensitive to devaluation. This suggests that the effect of stress is on the performance of an action, not the learning process per se. How does this fit with their current results showing that stressing people before training changes devaluation sensitivity only in the last block of training? Why there is an extension of training effect in O representation for stressed participants if the effect is on performance?

The authors show that the representation of O for the controls increases with training (Figure 4) while the R representation decreases. Theoretically, these two observations should entail a weaker devaluation effect, as the subject needs to encode a representation of the response being performed in order to attain the outcome for a response to be goal-directed (see recent work from Bouton's lab and Dickinson and Balleine's associative-cybernetic model). Perhaps this should be discussed?

Related to this, the fact that a control group did not show evidence of a decrease in outcome representations at the end of training seems problematic, as their argument is based upon the notion that more extensive training makes responding habitual in the stress group but not in the control group. If stress impacts learning, it should there is no argument for the fact that overtraining in a control group does not change outcome representations.

Why is there a stronger effect of devaluation on the O-Low condition across the task? Why a less-valued outcome should be able to support this? (pp. 6, line 239)

The devaluation effect for the stress group in block 2 seems stronger than in block 1. If that is the case, then the overtraining argument becomes problematic, as it should supposedly lead to weaker effects of devaluation with training, that is, devaluation should be weaker in block 1 than block 2, and weaker in block 2 than block 3 (the latter contrast being the one they report as being statistically significant). Also, is the difference between stress and control in block 3 different to the difference between stress and control in block 2?

The theoretical aspect that the study aims go address is whether the devaluation effects are due to less goal-directed control or more habitual control. In their Discussion section, the authors argue that a response cannot be goal-directed and habitual at the same time. However, Perez and Dickinson (2020) have recently published a free-operant model where it is perfectly possible that a response is driven in equal proportions by each system. It is only in the case of discrete-trial procedures such as the present one and the 2-step task from the model-based/model-free tradition that the answer is difficult.

It seems puzzling that the control group shows decreased representations of the response with training. If anything, that should be maintained or augmented with training. This is another reason to question the overtraining argument of this paper.

Other points:

The analysis of response representations after extended training is based in comparisons of first half and last half of training, why that choice?

In the discussion, the authors mention that "habit behavior requires extensive training", but a recent preprint by Pool et al. (2021) shows that most participants can be habitual after limited training.

Pp 24. Line 522: "Furthermore, while model-free learnins is based on previously experienced outcomes, habit learning is defined as S-R learning without links to the outcome engendered by a response." I could not follow this idea, can you clarify?

During the presentation of the results, the authors show statistics for both the O-high and O-low conditions. However, the O-low condition did not add anything to the analysis, as it did not show any effects of devaluation. Therefore, for the sake of simplicity and the reader's joy, I'd simply leave that for the supplemental material and focus on the part of the experiment that is relevant for the point the authors are trying to make.

Table 1: The caption mentions that the scales ranged from 1-10, but the results show means higher than 10. Am I missing something here?

Pp. 6, line 134 "offer, trade" should apparently read "offer, or trade"

This approach was employed by McNamee and colleagues (2015) using fMRI. I think it should be cited.

pp. 15, line 308: "tended to be sensitive". I'd just say it was not significant.

pp. 18, line 370: P = 0.034, should be P_corr?

Pp 31, lines 683-684. Could you please clarify the trial structure and numbers? What do you mean by "27 trials each"?

Reviewer #3 (Recommendations for the authors):

1. I think that the manuscript would benefit from more of an effort to link the current results to other related work. For one, the authors briefly touch on how the results relate to the 'model-based and model-free' dichotomy described in other work, but mostly highlight how the operationalization of 'model-free behavior' differs from 'habits,' as described here. However, I think it is notable that the authors find that habitual behavior involves a reduction in prospective consideration of an outcome, similar to other work that has shown that model-based behavior increases prospective consideration of future states during decision-making (e.g. Doll et al., 2015). Do the authors believe that 'habit/goal' and 'model-free/model-based' decision-making strategies might share common mechanistic features?

2. I am somewhat concerned that saccadic eye movements could be acting as a confound for some of the EEG decoding results and would appreciate the authors clarifying a few points related to this and possibly doing some additional control analyses to rule out this possibility:

a. The authors mention that the symbols for the aliens differed by position – was this position fixed for each alien tribe, i.e. did the blue tribe always appear on the left for a particular participant? If so, this could drive anticipatory eye-movements in that direction.

b. It is not clear which electrodes are driving the change in outcome and response representation in the decoding analysis. An analysis examining which features are contributing most to successful decoding in the stimulus and response period would be particularly important to rule out the possibility that the main results of interest are not driven by anticipatory eye-movements, such as saccades in the response phase where two stimuli are placed side-by-side.

c. Relatedly, I would have appreciated more detail on the preprocessing pipeline used to clean-up the data, in particular how saccadic eye-movements and blinks were corrected, or removed, and if central fixation was enforced with eye-tracking or EOG.

3. Looking at the scatterplots of Figure 5A and 6, the values on the y-axis appear to be rather different. Where the control and stress groups are mostly separate on this dimension in Figure 5A, they are highly overlapping in Figure 6 such that there does not appear to be any group difference. Are these different data and if so what is the rationale for selecting one set of data for one analysis and a different set for the other?

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Stress diminishes outcome but enhances response representations during instrumental learning" for further consideration by eLife. Your article has now been evaluated by 2 peer reviewers, and by the Reviewing Editor, Mimi Liljeholm, under the supervision of Senior Editor Floris de Lange. The reviewers have opted to remain anonymous.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

(1) Reviewer 2 notes that enhanced response representations do not necessarily reflect an increased involvement of the habit system – this is particularly true if such representations are not negatively correlated with outcome representations, since the latter retrieves the former according to several associative accounts of goal-directed behavior. Thus, the stated assumption, that "habitual responding is reflected in enhanced response representations" is not valid. The reviewer suggests, and I agree, that claims based on this assumption should be removed or significantly toned down.

(2) Reviewer 3 notes that the new FRN analyses are not sufficiently contextualized, and may even be erroneous, and also that the previous request for more information regarding decoding features went unanswered.

(3) For my part, I am still confused about the distinction between model-free RL and habits. The authors state:

"It might be argued that an alternative way of disentangling goal-directed and habitual contributions to instrumental responding under stress would be through an analysis of model-based and model-free learning …"

Who would argue that? Why would that analysis be better? What exactly would evaluation of these models add here?

And then:

"For instance, while model-free learning can be present after only few trials, establishing habit behaviour relies on the strength of the stimulus-response association, which among other factors such as reward magnitude or stimulus saliency, has been suggested to depend on repetition Furthermore, while model-free learning is based on previously experienced outcomes, habit learning is defined as S-R learning without links to the outcome engendered by a response … In other words, whereas model-free learning is dependent on the expected reward and hence the outcome that follows a response, habit learning is operationally defined as being independent of the motivational value of the outcome

This is incorrect. The model-free Q-value *is* the S-R association: both depend on the reward prediction error and the learning rate, both reflect previous reinforcement, and in that sense an expectation of reward, both are void of sensory-specific outcome features, and thus insensitive to devaluation, and both can develop rapidly or not, depending on the learning rate. I suggest that the entire section on model-based and model-free learning is remove.

(4) Finally, the writing needs some attention. As an example, it is repeatedly stated that canonical behavioral assays can "hardly" disentangle habitual and goal-directed behavior. The term "hardly" means barely, as in "I can barely see that distant object", or alternatively, it can mean definitely not, as in "You can hardly expect me to …" I don't think either is the intended meaning here, and there are other problems throughout the manuscript. I strongly suggest proofing by a native English speaker.

Reviewer #2 (Recommendations for the authors):

The authors have made a good effort to respond to all of my concerns. Given that my initial concerns with the task were not well received, I will not insist at this point on the merits of this task to produce or investigate habitual control, so I will focus on the author's responses only and some theoretical points that I think should be re-written to improve the quality of the manuscript.

The authors insist that their aim *was not* to investigate overtraining, but the very definition of overtraining that one could infer from the habit literature is that it refers to the amount of training that produces habitual behaviour. They used devaluations across different extensions of training to investigate whether " stress leads to an increase in habitual processing, a decrease in goal-directed processing or both ", so I'm still not convinced with their assertion.

Related to this, they are strongly assuming that the decoding analysis will give insights as to the interaction and development of habitual and goal-directed (g-d), but you need a theory of habits and actions in order to test for that. In my view, what they are doing in this paper is more empirical than theoretical ("do outcome representations decrease with training?; do response representations increase with training?), and I'd suggest they delete any reference to what they believe is the interaction between the systems in this task – right now they are motivating the paper as providing insights into the question of whether habits are a consequence of increased habits or decreased g-d control. I personally see this as a more empirical than theoretical paper, and the current motivation is poorly justified from a theoretical perspective, I think. For example, they assume that R representations should be stronger the more active a habit is, but Bouton and his colleagues have demonstrated -at least behaviorally- that this not necessarily the case.

The authors state that "goal-directed but not habitual control relies on the motivational value of the outcome" (pp. 4). Without wanting to be pedantic, habits do depend on the motivational value of the outcome to develop (unless you take a Guthrian view of S-R contiguity being sufficient to produce them), so I'd rather rephrase this as "only the g-d system is sensitive to changes in the motivational value of the outcome in absence of new experience with it", or something along those lines. This should make it clear that it is the devaluation test what distinguishes between the two types of control.

McNamee and colleagues did a very similar paper using fMRI decoding, but the authors cite this paper briefly, without any reference to what the paper is about and what it found. I think that they should be more detailed about what these other authors did in that paper, discussing it in the introduction and motivating their experiment as an extension of the same idea using a different method (EEG).

Pp 10. "Degraded actions" suggests degraded action-outcome contingency. This is not the manipulation employed in the paper.

Again, without wanting to be pedantic, the authors state that: ""In other words, whereas model-free learning is dependent on the expected reward and hence the outcome that follows a response (Dayan and Berridge, 2014; O'Doherty et al., 2017), habit learning is operationally defined as being independent of the motivational value of the outcome (Adams, 1982; Adams and Dickinson, 1981; Balleine and Dickinson, 1998).", but algorithmic representations of habit learning (and pavlovian conditioning) since the time of Bush and Mosteller and before are based on the notion of prediction-error, that is, they are represented as model-free algorithms. The "expected reward" in MF algorithms is not a psychological, but statistical concept (it turns out that prediction-error learning can be taken as estimating the value of stimuli or states.) What is critical is that these algorithms are not sensitive to changes in outcome value unless the new value is re-experienced and updated, whereas model-based or g-d algorithms are.

Reviewer #3 (Recommendations for the authors):

The authors have addressed my main concerns about potential eye-movement artifacts acting as confounds in the decoding analysis, and mostly answered my questions about the preprocessing procedures for the EEG data. In the process of addressing many of the points raised during review, they had had to substantially tamp down some of their claims and provide more context, caveats and speculation regarding explanations for their observations. I don't think this is necessarily a bad thing, as the pattern of data rarely perfectly match our predictions and such deviations from expectation and theory are usually the grist of new research and investigation. Moreover, the main results and their interpretation hold up despite some points that remain a bit muddy (e.g. why the effects of stress occur later in the experiment rather than closer to the time of the stressful event). However, not all of my points were addressed in the initial round of reviews and I have some remaining questions – mostly arising from new analyses in the revision, and would also like additional clarity on some methodological points.

1. In responding to the other reviewers, the authors have added some new analyses of the feedback-related negativity to test if participants in the stress and control groups differed in how the processed error feedback. Interpretation of the null result here seems rather indirect, as it assumes that this ERP would change in response to differences in the sensitivity of participants to error feedback. I suspect that previous work has found such individual differences in groups with greater error sensitivity before, but that work is not cited here. It would be helpful to provide more background on what this ERP is thought to reflect to provide additional context to this result.

2. The FRN in Figure 2 —figure supplements 4-6 looks very different from what I expected, and what I believe, is conventional in the literature. The response on error trials is generally more positive than errors in the 200-300 ms post-feedback window that the authors focus on in this analysis. Generally, I believe this ERP is visible as a more negative deflection on error trials riding on a larger positive response to the feedback stimulus in this time window (e.g. Yeung et al. 2005, Cerebral Cortex). The baseline for error and correct trials also differ substantially in the Fz electrode – and differ quite a bit from zero. The unusual appearance of these ERPs make me somewhat concerned that there might be an error in the analysis.

3. This may have been understandably lost in the long list of comments in the last round of review, but the authors did not respond to my request that they provide more information about the features that contribute to the decoding of the outcome and response – i.e. the particular channels and time points that are contributing to the decoding accuracy measures. While they have convincingly argued that EOG artifacts are unlikely to drive their results, I think it would still be valuable to also see which features are contributing to most to decoding.

4. This is a more minor point, but it would be helpful to have more information about how ICA was used to remove motor and eye-movement activity E.g. how many components were removed, how were they identified and how did the authors verify success of this preprocessing. The current one sentence mention of ICA is not very illuminating about the specific approach used for this study.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Stress diminishes outcome but enhances response representations during instrumental learning" for further consideration by eLife. Your revised article has been evaluated by Floris de Lange (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

Reviewer 2 is still confused about the intended, claimed, and actual role of overtraining. I think the easiest way to deal with this is to not talk about "extended" training (what does that mean after all – "extended" could mean an hour, days, or months) but instead characterize effects as training-dependent based on devaluation at the beginning, middle or end of a task. You can be clear that you are assessing training-dependent dynamics, but that the extent of training here is quite limited compared to that in the overtraining literature. In other words, training-dependent effects do not necessarily imply overtraining effects – please make that distinction and you should be good.

Of even greater importance is the request by Reviewer 1, that you provide more details about the searchlight analysis.

As always, make sure to detail revisions or rebuttals for each reviewer comment listed below.

Reviewer #2 (Recommendations for the authors):

The authors have improved the manuscript. I think it's a much better version than the previous one. They have deleted all those paragraphs that made their arguments and motivation for the experiment confusing.

I'm still super confused about their argument that they are not testing for overtraining effects. I thought overtraining was by definition the amount of training that produced habits. Are they saying in their reply that the effect of stress speeds up habit formation? What is their view on this? If their aim was not to test for "training extension" effects, why are they doing two devaluation manipulations?

I don't think is enough to add this paragraph in light of my comment:

"Although we did not aim to test training-induced changes in the balance of outcome and response processing per se, we included transient outcome devaluations after initial, moderate, and extended training in the instrumental learning task."

And then, a few paragraphs later, saying the following:

"To assess whether the balance of goal-directed and habitual behaviour and its modulation by stress depends on the extent of training, we presented devaluation blocks early during training, after moderate training, and again after extended training at the end of the task."

And, having a section called "Stress boosts response representations after extended training."

As a reader, I'm extremely confused about these apparent contradictions. I'd love to see in the introduction a more precise paragraph where their expectations are clearly mentioned.

This comment has been carrying over since the first review when I made the following comment:

"It was not clear to me what the main hypothesis of the study was. The authors seem to talk rather loosely about habits being developed after overtraining versus the mediating effect of stress on habits. The introduction should convey their main goal more clearly."

Sorry if I'm being too stupid, but it's not clear to me why they are using training extension and stress to test devaluation sensitivity and outcome/response representations if their aim was not to overtrain participants.

Reviewer #3 (Recommendations for the authors):

The authors have satisfactorily responded to my previous major concerns. I only have one minor outstanding request: Please provide more information on how the searchlight analysis was carried out (e.g. how many electrodes were included in each 'searchlight', how electrodes were grouped together in this analysis).

eLife. 2022 Jul 18;11:e67517. doi: 10.7554/eLife.67517.sa2

Author response


Essential revisions:

(1) Several clarifications must be made to address the possibility that saccadic eye movements could be acting as a confound for some of the EEG decoding results, including an account of which features were most important for classification performance, and how eye-movements and blinks were dealt with in preprocessing.

We appreciate the opportunity to clarify this issue. First of all, the classifier applies those electrophysiological properties to the reinforcement learning task (RT) that had been identified in the delayed-matching-to-sample (DMS) task for the separation of object vs scene and blue vs red symbols. Artefacts that occur only during the reinforcement task will simply be ignored as they did not constitute discriminative features in the training data set and are therefore not part of the classifier (Grootswagers et al., 2017). Thus, artefacts would have affected the classification only when the same artefact occurred in both the DMS and RT, which is rather unlikely. Furthermore, the training DMS data were collected in two sessions, of which one was completed before participants underwent the stress induction and another after the completion of the RT. Importantly, the classifier performance did not differ between these DMS sessions and was not modulated by the experimental treatment. Consequently, we think that (i) group-specific artefacts and (ii) emotional and motivational confounds that could have arisen from the RT completion, such as fatigue, cannot account for our decoding results. Moreover, the positions of the symbols of the aliens on the computer screen were fully randomized. Thus, anticipatory eye movements, such as saccades during the response phase, could not drive the EEG findings in the stress group at the end of the task. Apart from the methodological principles of the EEG decoding and our specific task design, there is also evidence that classifiers have the capacity to ignore bad channels or suppress noise during training, making artefact correction, in general, less critical in decoding analyses (Grootswagers et al., 2017). Together, we think it is highly unlikely that artefacts related, for example to saccades or blinks, biased our decoding results.

For the ERP analysis, we removed blinks and eye movements via independent component analysis. We added this information now in the manuscript, please see page 45, lines 1024 to 1025:

“In addition, we removed muscular artefacts and corrected for blinks and eye movements via independent component analysis.”

Finally, we used an eyetracker during the EEG measurement to control specifically for eye movements and these data showed that there were no significant changes in saccades or blinks across the task or depending on the trial type, and no modulation thereof by the stress manipulation (outcome devaluation × stimulus value × time × group: F(4,196) = 0.78, P = 0.54, ƞp² = 0.02, 95% CI = 0.002 to 0.008; outcome devaluation × stimulus value × group: F(2,98) = 1.03, P = 0.36, ƞp² = 0.02, 95% CI = 0.005 to 0.020). We have added this information to the text. Please see page 14, lines 532 to 537:

“Furthermore, we recorded eye tacking data to control for potential group differences in saccades or eye blinks. These control analyses showed that there were no significant group differences in saccades or eye blinks across the task or depending on the trial type (outcome devaluation × stimulus value × time × group: F(4,196) = 0.78, P = 0.54, ƞp² = 0.02, 95% CI = 0.002 to 0.008; outcome devaluation × stimulus value × group: F(2,98) = 1.03, P = 0.36, ƞp² = 0.02, 95% CI = 0.005 to 0.020; see supplementary file 1D).”

Furthermore, we added a brief section on the eyetracking to the methods section. Please see page 42, lines 959 to 964:

“Eyetracking. We used a desktop mounted eye-tracker (EyeLink 1000; SR-Research Ltd., Mississauga, Ontario, Canada) to record monocular eye movements from the left eye at a sampling rate of 500Hz. We used custom scripts implemented in MATLAB (The Mathworks, Natick, MA) to extract the mean saccades and blink information depending on the stimulus value, outcome devaluation and time (zero to 2000 ms around the stimulus onset). Data for two participants were missing due to a failure of the used eye-tracker.”

(2) It appears that devaluation insensitivity, and reduced neural outcome representations, emerge at a point when stress measures no longer indicate a stress state. Please clarify how the time courses of these measures relate to the claims about an influence of stress on devaluation insensitivity.

It is important to note that stress effects are not necessarily over when acute levels of stress mediators returned to baseline. Specifically, several major stress mediators are known to have ‘after-effects’ that outlast acute elevations of these measures (Joëls and Baram, 2009). For example, glucocorticoids are assumed to act as a “two-stage rocket”, with rapid, non-genomic actions and delayed, genomic actions (Joëls et al., 2012). The latter genomic actions typically set in well after acute glucocorticoid elevations vanished (Kloet et al., 2008). Thus, the fact that autonomic measures and cortisol levels returned to baseline during the learning task, does not imply that the stress system activation had been over at this point. Moreover, acutely elevated stress mediators may have affected early learning processes in a way that became apparent only as training progressed.

We discuss now explicitly the fact that devaluation insensitivity, reduced neural outcome representations and increased neural response representations emerge at a point when stress parameters were not acutely elevated anymore and provide potential explanations for this pattern on page 27, lines 596 to 607:

“Importantly, at the time when stress effects on devaluation sensitivity and outcome representation were observed, stress mediators were no longer elevated anymore. However, it is important to note that stress effects are not necessarily over when acute levels of stress mediators returned to baseline. Specifically, several major stress mediators are known to have ‘after-effects’ that outlast acute elevations of these measures (Joëls and Baram, 2009). For example, glucocorticoids are assumed to act as a “two-stage rocket”, with rapid, non-genomic actions and delayed, genomic actions (Joëls et al., 2012). The latter genomic actions typically set in well after acute glucocorticoid elevations vanished (Kloet et al., 2008). Thus, the fact that autonomic measures and cortisol levels returned to baseline during the learning task, does not imply that the stress system activation had been over at this point. Moreover, acutely elevated stress mediators may have affected early learning processes in a way that became apparent only as training progressed.”

(3) The proposed role of extended training is consistent with the fact that significant devaluation insensitivity and reduced outcome representations emerge in the last block in the stress group. However, the notion that habits develop with extended training seems inconsistent with the apparent decrease in response representation and increase in outcome representation across blocks in the control group, as well as the apparent increase in devaluation sensitivity in the stress group across blocks 1 and 2. These inconsistencies must be resolved and claims modified accordingly.

In the stress group, we did not find reliable evidence for an increase of devaluation sensitivity from block 1 to block 2 (block × treatment interaction: F(1,56) = 1.575, P = 0.215, ƞp² = 0.027, 95% CI = -0.013 to 0.052). In addition, groups did not differ during this second block (t56 = 0.165, P = 0.870, Cohen’s d = 0.043, -0.472 to 0.558). We have added these statistics to the text, please see page 15, lines 327 to 334:

“Furthermore, while there was no evidence for an interaction of devaluation block number (1 vs. 2) and experimental treatment (F(1,56) = 1.575, P = 0.215, ƞp² = 0.027, 95% CI = -0.013 to 0.052) when analysing the changes from the first to the second devaluation block, we obtained a significant interaction between block (2 vs. 3) and treatment when analysing the changes from block 2 to block 3 (F(1,56) = 13.589, P < 0.001, ƞp² = 0.195, 95% CI = 0.105 to 0.319). Moreover, follow-up tests showed that after moderate training intensity (i.e. in block 2), groups did not differ in responses for the devalued outcome (t56 = 0.165, P = 0.870, Cohen’s d = 0.043, -0.472 to 0.558).”

However, we completely agree that the observed decrease in neural response representations and the increase in neural outcome representations in the control group speak against a training-induced bias towards habits (previously referred to as ‘overtraining effect’) in control participants. Importantly, however, we did not aim to assess an overtraining effect per se in the present study and this study was not designed to test for such a general overtraining effect (e.g., we might have needed an even longer or repeated training). Instead, we aimed to test how stress affects specific neural signatures (derived from EEG-based decoding) of goal directed and habit behaviour, respectively, and to what extent such stress effects may depend on the extent of training. The fact that there was no overtraining effect in the control group of the present study, does not question the training-dependency of the observed stress effect. Our idea that stress might accelerate a shift from goal-directed to habitual behaviour is based on previous rodent studies that did show that stress (hormones) accelerate a shift from ‘cognitive’ to ‘habitual’ responses that would otherwise only occur after extended training (Packard, 1999; Siller-Pérez et al., 2017). We agree that based on our own data, this idea remains rather speculative as we do not see evidence of a training-dependent bias towards habitual behavior/processing in control participants across the task.

We have now clarified the objectives of our study, emphasizing that this study did not aim to test the impact of training intensity on the control of instrumental behaviour per se. Please see page 5, lines 116 and 117:

“Although we did not aim to test training-induced changes in the balance of goal-directed and habitual processing per se […]”

Moreover, we state now explicitly that the idea that stress might accelerate a shift from goal directed to habitual control is based on respective animal data but that this idea remains speculative here as we did not find evidence for a training-related bias from goal-directed to habitual control in control participants (but even initial evidence against this notion). Please see page 30, lines 660 to 667:

“Findings on ‘cognitive’ and ‘habitual’ forms of navigational learning in rats, however, demonstrated that stress hormones may accelerate a shift from ‘cognitive’ to ‘habit’ learning that would otherwise only occur after extended training (Packard, 1999; Siller-Pérez et al., 2017). Thus, it is tempting to hypothesize that a similar mechanism might be at work during instrumental learning in stressed humans. This conclusion, however, remains rather speculative as we did not observe a training-dependent shift towards habits in the control group and this group even showed reduced response and increased outcome representations over time, which rather suggests increased goaldirected processing across the task.”

(4) An important aspect of assessing devaluation sensitivity in the animal literature is a reinforced test, in which the devalued outcome is delivered contingent on the relevant response – as a result, habitual animals rapidly reduce performance of the devalued action. Such tests rule out a general lack of attention, response perseveration etc. The absence of a reinforced test should be explicitly addressed and the implications discussed.

We agree that a reinforced test, in which the devalued outcome is delivered contingent on the relevant responses, is of special importance for the assessment of devaluation sensitivity and the exclusion of unspecific effects in animal studies. In the present human study, the feedback provided on each trial varied between blocks. During NoDev blocks, participants saw the value of the card that was gained before as well as the corresponding response costs. In Dev blocks, however, this information was masked for all trials. This procedure is comparable to previous studies in humans in which the devalued outcome was not presented during the critical test trials (e.g. Schwabe and Wolf, 2009; Valentin et al., 2007). Nevertheless, we agree that this kind of experience with the devalued outcome differs from procedures utilised in the animal literature. Importantly, however, we additionally included consumption trials in all of the blocks, which enabled us to rule out unspecific factors, such as a general lack of attention, altered contingency knowledge or response perseveration. In these consumption trials, participants clearly preferred the high-valued cards over low-valued cards during NoDev blocks as well as the valued card over its devalued counterpart during Dev blocks, irrespective of stress. This finding demonstrates that participants of both groups were aware of the value of the card stimuli in a specific block but that responses of stressed participants were less guided by this knowledge about the value of the outcome engendered by the response. The specific response pattern in the consumption trials during devalued blocks also rules out general attentional or motivational deficits.

We discuss this deviation of the present paradigm from the typical devaluation procedure in animal studies and its implications now on page 30, lines 671 to 688:

“In animal studies, devaluation sensitivity is usually assessed by means of a reinforcement test, in which the devalued outcome is delivered contingent on the relevant response. This procedure allows to control, for example, for a general lack of attention. In the present human study, the feedback provided on each trial varied between blocks. During NoDev blocks, participants saw the value of the card that was gained before as well as the corresponding response costs. In Dev blocks, however, this information was masked for all trials. This procedure is comparable to previous studies in humans in which the devalued outcome was not presented during the critical test trials (e.g. Schwabe and Wolf, 2009; Valentin et al., 2007). Importantly, however, we additionally included consumption trials in all of the blocks, which enabled us to rule out unspecific factors, such as a general lack of attention, altered contingency knowledge or response perseveration. In these consumption trials, participants clearly preferred the high-valued cards over low-valued cards during NoDev blocks as well as the valued card over its devalued counterpart during Dev blocks, irrespective of stress. This finding demonstrates that participants of both groups were aware of the value of the card stimuli in a specific block but that responses of stressed participants were less guided by this knowledge about the value of the outcome engendered by the response. In addition, the specific response pattern in the consumption trials during devalued blocks also rules out general attentional or motivational deficits.”

(5) It is unclear from Figure 1 whether the "Error" feedback screen was employed during devaluation blocks. If so, please address the potential influence of negative social feedback on avoidance of the non-devalued action, particularly in the group that had just been subjected to a social stress test.

The error-feedback screen was presented throughout the entire task (i.e. an error message appeared also during the devaluation blocks). We completely agree that this is an important aspect of our paradigm and we make this feature now more explicit, both in the text (please see page 40, lines 894 and 895) and in the revised Figure 1. While we cannot completely rule out that this form of feedback somehow interacted with our social stress manipulation, we consider it rather unlikely that the “error” feedback screen had a major impact on our results. With respect to our neural decoding data it is important to note that we decoded neural response and outcome representations only during NoDev blocks, in which “error” feedback was extremely rare (only 5.4 % of all NoDev trials) and its frequency did not differ between groups (t56 = 0.898, P = 0.373, d = 0.235, 95% CI = -0.282 to 0.752). With respect to our behavioural data, it is important to note that the responses for devalued outcomes were directly correlated with the neural response and outcome representations, respectively, which again could not be biased by differences in error feedback processing.

Moreover, in response to the reviewers’ comment, we analysed now participants’ error related negativity as a neural measure of (negative) feedback processing. If the social stress exerted its effect primarily via an avoidance of negative feedback, then stress should have also modulated negative feedback processing. Our analysis of the error-related potentials, however, showed no such effect. We have added this finding to our Results section. Please see also page 23, lines 508 to 518:

“No group differences in feedback processing

The error-feedback screen was presented throughout the entire task. Thus, an error message appeared also during the devaluation blocks. To test whether this form of feedback interacted with our social stress manipulation (e.g. that stressed participants continued in responding towards devalued outcomes in order to avoid the presentation of the error-feedback screen), we analysed participants’ error-related potentials as a neural measure of feedback processing. If the social stress exerted its effect primarily via an avoidance of negative feedback, then stress should have also modulated feedback processing. Our analysis of the error related potentials, however, showed no such effect (outcome devaluation × correctness × group for Cz, FCz and Fz: all F(2,54) < 1.176, all P > 0.316, all ƞp² < 0.042, all 95% CI = 0.010 to 0.039, Figure 2 —figure supplements 4-6).” Please see page 46, lines 1048 to 1054 for methodological details:

“Feedback related potentials. Artefact-free epochs were averaged participant- and conditionwise with respect of outcome devaluation block and correctness (noDev correct, noDev error, Dev Ohigh correct, Dev Ohigh error, Dev Olow correct and Dev Olow error). Thereafter, mean amplitudes were segmented from 200 to 300 ms relative to feedback onset at midline electrodes Cz, FCz and Fz (Hajcak et al., 2006; Pfabigan et al., 2011; Pfabigan et al., 2015). For statistical analysis, we applied ANOVAs with the within-subject factors outcome devaluation and correctness (correct and error) for Cz, FCz and Fz.”

Moreover, we have added a paragraph to the discussion in which we address the potential role of negative social feedback on the performance in devalued blocks. Please see page 31, lines 689 to 704:

“It is important to note that participants received an error feedback in devalued trials when they chose the response option that was not associated with the now devalued outcome. Given that acute stress may increase sensitivity to social cues (Domes and Zimmer, 2019), one might argue that stressed participants continued in responding towards devalued outcomes in order to avoid the presentation of the error-feedback screen. We consider this alternative, however, to be rather unlikely. First, with respect to the neural outcome and response representations, these were analysed during NoDev blocks, in which groups did not differ in their behavioural performance accuracy and consequently not in the frequency of error feedback. Furthermore, participants’ performance in devalued blocks was directly associated with the observed changes in neural representations during the NoDev blocks. Finally, if stress exerted its effect through an avoidance of negative feedback, then stress should have also altered neural feedback processing. To directly test this alternative, we analysed the error-related activity, a well-known neural correlate of (negative) feedback processing (Falkenstein et al., 2000; Holroyd and Coles, 2002). Our results, however, showed no group differences in the error-related potentials, which appears to speak against altered negative feedback processing as the major driving force of the obtained results.”

(6) A range of statistical approaches are employed: please provide a section that details and justifies each.

We appreciate the opportunity to describe and justify in more detail the range of statistical approaches that we employed in order to analyse the impact of acute stress on goal-directed and habitual processing during instrumental learning. As suggested, we have added a new paragraph in the Results section in which we justify the range of the different statistical approaches. Please see page 10, lines 198 to 227:

“In order to adequately assess whether increased responding to devalued or degraded actions is due to reduced goal-directed or enhanced habitual control (or both), we employed a variety of measures and statistical approaches. At the behavioural level, we applied a devaluation paradigm to examine whether stressed participants respond differently to transient outcome devaluations compared to control participants and to assess whether the predicted neural changes are linked to behavioural outcomes. Behavioural analyses focussed mainly on participants’ choice in valued and devalued trials. Based on previous animal studies (Dias-Ferreira et al., 2009), we additionally investigated whether the effects of stress on the control mode would depend on the amount of training and therefore implemented outcome devaluation after initial, moderate and extended training. At the physiological level, we analysed pulse, systolic and diastolic blood pressure as well as salivary cortisol concentrations at different time points across the experiment to assess the effectiveness of the stress manipulation. At the neural level, we mainly focused on the decoding of outcome and response representations, which provide insights into the mechanisms through which stress affects the sensitivity to outcome devaluation, the primary objective of the present study. Outcome representations are indicative for goal-directed processing and response representations rather point to habitual processing. To further address the important question whether response and outcome representations represent signatures of distinct control systems, we additionally analysed the correlations between both neural representations. For this purpose, we used Bayesian correlational analyses. Depending on the magnitude of the Bayes factor (reflecting the likelihood ratio of the data under the alternative hypothesis and the data under the null hypothesis), the Bayesian approach can provide evidence in favor of the alternative hypothesis or evidence in favor of the null (Hoijtink, 2012; Kass and Raftery, 1995; Nuzzo, 2017). Thus, we utilized Bayesian analyses to provide clear evidence for or against the null hypothesis. To further assess the association between the behavioural data (i.e. classification accuracy) and the strength of the neural representation (i.e. classification accuracies), we also computed Spearman correlations. To additionally analyse previously proposed “attentional habits” (Luque et al., 2017) on the one hand, and to address the alternative explanation of altered negative feedback processing on the other hand we analysed event-related potentials.”

Please see also page 48, lines 1105 to X1115:

“Statistical analyses of the EEG decoding data were performed at the group level averaging across individual decoding accuracies. Decoding accuracies were analysed by mixed-design ANOVAs with the within-subject factor block (1-6, 7-12, 13-18 and 19-24 block) and the between-subject factor group. In addition, we computed Spearman correlations between ∆ classification scores (averaged classification accuracy during the first six blocks minus the classification accuracy during the last six blocks) and ∆ responses for devalued outcomes for Dev Ohigh blocks responses for devalued outcomes during the first devaluation block, i.e. the 2nd overall block, minus responses for devalued outcomes during the last devaluation block, i.e. the 29th overall block. Bayesian correlation analyses were conducted using JASP version 0.13 software (https:/jasp-stats.org/) and the default Cauchy prior 0.707.”

Reviewer #1 (Recommendations for the authors):

(1) The devaluation, but not accuracy data, is clearly bimodal, in both the control and stress groups. I would like to see this discussed in more detail, in terms of consistency with physiological and neural measures, as well as in terms of broader theoretical implications.

We thank the reviewer for highlighting this important issue. Indeed, behavioural accuracy data are bimodally distributed. This suggests that participants either mainly selected their action based on the outcome value, i.e., did avoid the devalued outcome and thus were goal-directed, or responded to the devalued outcome in the vast majority of the trials and thus behaved habitually. Thus, participants seemed to show an “either-or” pattern and there seemed to be interindividual differences in the tendency to perform in a goal-directed vs. habitual manner. Interestingly, also among stressed participants, there were substantial individual differences in the propensity to shift from goal-directed towards habitual behaviour, indicating that there is no overall effect of stress on the control of behaviour but that individuals differ in their sensitivity to these stress effects. This raises the important question of what makes individuals more or less vulnerable to stress effects on instrumental control (and cognition in general). Previous work suggested that genetic variants related to major stress response systems (i.e. noradrenergic and glucocorticoid activity) may play an important role in how stress affects the engagement of multiple learning systems (Wirz et al., 2017; Wirz et al., 2018). Understanding which further individual factors contribute to the interindividual propensity to shift from goal-directed to habitual behaviour needs to be addressed by future research.

The fact that our neural data were not bimodal suggests that changes in the neural representation translate not directly into behavioural changes. The observed significant correlations between neural representation and performance in devalued trials show that there is a link. However, the correlations were obviously clearly below 1. Compared to the behavioural level which included only discrete yes-no responses, the neural data may have been much more sensitive and able to capture more fine-grained changes. The different sensitivity of behavioural and neural data is interesting in itself and points to another important question for future research: how do neural representation changes and behavioural changes relate to each other? Is there a particular threshold at which a change in representation at which behaviour changes?

As suggested by the reviewer, we make the bimodal distribution of the behavioural responses in the devaluation blocks more explicit now and discuss the theoretical implications of this bimodal distribution and its link to the neural data in a separate paragraph of the discussion now. Please see page 32, lines 708 to 742:

“Interestingly, participants either mainly selected their action based on the outcome value, i.e. did avoid the devalued outcome and thus were goal-directed value, or responded to the devalued outcome in the vast majority of the trials and thus behaved habitually. Thus, participants showed an “either-or” pattern and there seemed to be interindividual differences in the tendency to perform in a goal-directed vs. habitual manner. Interestingly, also among stressed participants, there are substantial individual differences in the propensity to shift from goal-directed towards habitual behaviour, indicating that there is no overall effect of stress on the control of behaviour but that individuals differ in their sensitivity in these stress effects. This raises the important question of what makes individuals more or less vulnerable to stress effects on instrumental control (and cognition in general). Previous work suggested that genetic variants related to major stress response systems (i.e. noradrenergic and glucocorticoid activity) may an important role in which of multiple learning systems is engaged after stress (Wirz et al., 2017; Wirz et al., 2018). Understanding which further individual factors contribute to the interindividual propensity to shift from goal-directed to habitual behaviour needs to be addressed by future research. Furthermore, the fact that our neural data were not bimodal suggests that changes in the neural representation may not translate directly into behavioural changes. The observed significant correlations between neural representation and performance in devalued trials show that there is a link between the behavioural and neural representation level. However, the correlations were obviously far below 1. Compared to the behavioural level which included only discrete yes-no responses, the neural data may have been much more sensitive and able to capture more fine-grained changes. The different sensitivity of behavioural and neural data is interesting in itself and points to another important question for future research: how do neural representation changes and behavioural changes relate to each other? Is there a particular threshold at which a change in representation triggers behavioural changes?”

(2) It looks like the response representation is actually decreasing across blocks in the control group. This seems inconsistent with the general argument that habits should develop as training progresses.

Participants in the control group showed indeed a reduced response representation in the second half of the task (t24 = 3.50, P = 0.002, d = 0.701, 95% CI = 0.256 to 1.134). We completely agree that this finding – as well as the fact that outcome representations did not decrease in controls across the task – speaks against a training-dependent shift towards habits. It is, however, important to note that we did not aim to assess an overtraining effect per se in the present study and that the study was not designed to test for such a general overtraining effect, but to test the impact of stress on the neural signature (identified vie EEG-based decoding) of goal-directed and habit processes, respectively, and to what extent this impact depended on the amount of training.

Our idea that stress might accelerate the shift from goal-directed to habit learning that would otherwise only occur after extended training is based on rodent studies showing exactly this pattern for stress hormone effects on the balance of ‘cognitive’ and ‘habitual’ learning (Packard, 1999; Siller-Pérez et al., 2017). We make now clearer that our argument is mainly based on these rodent data but remains speculative based on the present data as we did not see any training-dependent bias towards habit behaviour or processing in the control group across the task. In addition, we state now explicitly, that the pattern observed in the control group speaks against a training-dependent shift towards habits.

Please see page 30, lines 660 to 667:

“Findings on ‘cognitive’ and ‘habitual’ forms of navigational learning in rats, however, demonstrated that stress hormones may accelerate a shift from ‘cognitive’ to ‘habit’ learning that would otherwise only occur after extended training (Packard, 1999; Siller-Pérez et al., 2017). Thus, it is tempting to hypothesize that a similar mechanism might be at work during instrumental learning in stressed humans. This conclusion, however, remains rather speculative as we did not observe a training-dependent shift towards habits in the control group and this group even showed reduced response and increased outcome representations over time, which rather suggests increased goaldirected processing across the task.”

(3) "…while model-free learning can be present after only few trials, establishing habit behaviour requires extensive training." This statement is problematic for several reasons. First, it appears to be contradicted in the very next paragraph, where the authors acknowledge that the necessity and sufficiency of extensive training for habit expression in human subjects is controversial. Second, in terms of operational definitions, habits are instrumental responses that are insensitive to outcome devaluation, whether extensively trained or not. Likewise, theoretically, habits are usually attributed to the strength of the stimulus-response association, which may depend on repetition, but also on the magnitude of the reward, as well as, presumably, a range of other factors (e.g., stimulus salience).

We completely that this statement has been an oversimplification. “Extensive training” was too far-reaching and we further agree that repetition may be one factor among others that is important for the strength of the stimulus-response association. Therefore, we have rephrased this sentence accordingly (please see page 28, lines 633 to 637):

“For instance, while model-free learning can be present after only few trials, establishing habit behaviour relies on the strength of the stimulus-response association, which among other factors such as reward magnitude or stimulus saliency, has been suggested to depend on repetition (Adams, 1982; Dickinson et al., 1995; Tricomi et al., 2009; but see Wit et al., 2018).”

(4) Given the schematic of the goal-directed system (top of Figure 1A), it's a bit strange that outcome and response representations would be uncorrelated.

The goal-directed system might also involve a response representation at the time of stimulus presentation as it is supposed to learn S-R-O associations. However, outcome representations should prevail as action-outcome learning is at the heart of the goal-directed system and the stimulus association may play a rather minor role. Further the stimulus representation may serve a different purpose in the goal-directed system, where it is mainly used as an indicator of the correct action required to achieve a certain outcome, which is distinct from the prompting of a specific “reflexive” response in habitual S-R learning. The latter should be particularly distinct from the processes indicated by the outcome representation. Moreover, it is important to note that we decoded the response representation at the time point of the stimulus presentation and at this time point, the response representation is certainly more relevant for the habit system that relies on S-R associations. We aimed to indicate the differential relevance of the response representation for the goal-directed and habits systems by the different size of the respective bubbles. We have added now to the legend of Figure 1 that the larger bubbles are supposed to reflect the relevance of the respective representations for the two systems. Moreover, we mention now in the text that a stimulus representation might be involved in ‘goal-directed’ S-R-O learning and in ‘habitual’ S-R learning, however, presumably with a different function and a different relevance for the actual learning process. Please see page 8, lines 170 to 177:

“Note that the goal-directed system might also involve a response representation at the time of stimulus presentation as it is supposed to learn S-R-O associations. However, outcome representations should prevail as action-outcome learning is at the heart of the goal-directed system and the stimulus association may play a minor role. Further, the stimulus representation may serve a different purpose in the goal-directed system, where it is mainly used as an indicator of the correct action required to achieve a certain outcome, which is distinct from the prompting of a specific “reflexive” response in habitual S-R learning.”

(5) The violin plots are great for illustrating the bimodality in devaluation sensitivity, but it would also be useful to see all of these related variables plotted in the same way, preferably in a single figure. Please include line graphs showing the degree of responding for the devalued outcome, the decoding accuracy for responses and outcomes, and the physiological stress measures, across groups, in a way that allows for a visual, side-by-side, comparison.

We thank the reviewer for this constructive suggestion. We modified the former Figure 4 as suggested and illustrate the physiological stress measurements, the proportion of responses for devalued outcomes, and the decoding accuracy for the outcome representation during stimulus presentation and response choice as well as response representation during stimulus presentation in a single figure (now Figure 2); in addition to the violin plots (shown in Figure 3).

(6) Please include graphics of devalued trials in Figure 1C, to illustrate the masking of outcomes and the nature of the error feedback on such trials.

We have modified Figure 1C (and the respective legend) now to illustrate the masking of the outcomes and the nature of the error feedback in the devalued trials.

Please see page 9.

(7) In the Results section, following the description of the task and procedures, but prior to reporting any results, please add a section detailing and justifying the range of statistical approaches. For example, some reported p-values for t-tests appear to be uncorrected (e.g., line 358), as are all p-values for correlation tests. Moreover, a Bayesian analysis is used to assess evidence against a correlation, but not evidence in favor of a correlation.

First of all, we unintentionally reported one uncorrected p-value. We have corrected this mistake and present now the Bonferroni-corrected value, which does however not affect our conclusions (i.e. the significance remains). Please see page 20, lines 435 to 439:

“When we analysed the outcome representations at the time of the choice between the two aliens, a very similar pattern emerged: at the end of the task, participants in the stress group showed a decreased outcome representation at the time point of the choice (t22 = 2.94, Pcorr = 0.016, d = 0.613, 95% CI = 0.161 to 1.054), whereas there was no such effect in the control group (t17 = 0.59, Pcorr = 1, d = 0.138, 95% CI = -0.328 to 0.600).”

We conducted Bayesian correlation analyses between the classification accuracies to test whether outcome and response representations reflected independent or linked neural representations. Depending on the magnitude of the Bayes factor (reflecting the likelihood ratio of the data under the alternative hypothesis and the data under the null hypothesis) this approach can provide evidence in favor of the alternative hypothesis as well as evidence in favor of the null hypothesis (Hoijtink, 2012; Kass and Raftery, 1995; Nuzzo, 2017). In other words, this analysis does not selectively test for evidence against a correlation but may provide evidence for or against a correlation (our data provide evidence against a correlation).

In response to the reviewer’s suggestion, we further provide additional information detailing and justifying the range of statistical approaches. Please see page 10, lines 198 to 227:

“In order to adequately assess whether increased responding to devalued or degraded actions is due to reduced goal-directed or enhanced habitual control (or both), we employed a variety of measures and statistical approaches. At the behavioural level, we applied a devaluation paradigm to examine whether stressed participants respond differently to transient outcome devaluations compared to control participants and to assess whether the predicted neural changes are linked to behavioural outcomes. Behavioural analyses focussed mainly on participants’ choice in valued and devalued trials. Based on previous animal studies (Dias-Ferreira et al., 2009), we additionally investigated whether the effects of stress on the control mode would depend on the amount of training and therefore implemented outcome devaluation after initial, moderate and extended training. At the physiological level, we analysed pulse, systolic and diastolic blood pressure as well as salivary cortisol concentrations at different time points across the experiment to assess the effectiveness of the stress manipulation. At the neural level, we mainly focused on the decoding of outcome and response representations, which provide insights into the mechanisms through which stress affects the sensitivity to outcome devaluation, the primary objective of the present study. Outcome representations are indicative for goal-directed processing and response representations rather point to habitual processing. To further address the important question whether response and outcome representations represent signatures of distinct control systems, we additionally analysed the correlations between both neural representations. For this purpose, we used Bayesian correlational analyses. Depending on the magnitude of the Bayes factor (reflecting the likelihood ratio of the data under the alternative hypothesis and the data under the null hypothesis), the Bayesian approach can provide evidence in favor of the alternative hypothesis or evidence in favor of the null (Hoijtink, 2012; Kass and Raftery, 1995; Nuzzo, 2017). Thus, we utilized Bayesian analyses to provide clear evidence for or against the null hypothesis. To further assess the association between the behavioural data (i.e. classification accuracy) and the strength of the neural representation (i.e. classification accuracies), we also computed Spearman correlations. To additionally analyse previously proposed “attentional habits” (Luque et al., 2017) on the one hand, and to address the alternative explanation of altered negative feedback processing on the other hand we analysed event-related potentials.”

Please see also page 48, lines 1105 to 1115:

“Statistical analyses of the EEG decoding data were performed at the group level averaging across individual decoding accuracies. Decoding accuracies were analysed by mixed-design ANOVAs with the within-subject factor block (1-6, 7-12, 13-18 and 19-24 block) and the between-subject factor group. In addition, we computed Spearman correlations between ∆ classification scores (averaged classification accuracy during the first six blocks minus the classification accuracy during the last six blocks) and ∆ responses for devalued outcomes for Dev Ohigh blocks responses for devalued outcomes during the first devaluation block, i.e. the 2nd overall block, minus responses for devalued outcomes during the last devaluation block, i.e. the 29th overall block. Bayesian correlation analyses were conducted using JASP version 0.13 software (https:/jasp-stats.org/) and the default Cauchy prior 0.707.”

(8) The phrase "responses to devalued actions", employed throughout, is a bit confusing. I would change this to "responses for devalued outcomes".

We agree and have rephrased the referring phrase accordingly, throughout the manuscript. Please see, for instance, page 13, lines 265 and 266:

“Proportion of responses for devalued outcomes across the reinforcement learning task during Dev Ohigh blocks.”

Or page 21, line 450:

“Correlations of outcome and response representation during stimulus presentation with responses for devalued outcomes during Dev Ohigh blocks.”

(9) I'm guessing that Figure 3 is showing proportions, not percentages.

Thank you for noticing this mistake, which has now been corrected. Please see the new y-axis label of Figure 3 (page 16):

We changed the labelling of the Figure 3—figure supplement 1 and Figure 3—figure supplement 2 accordingly (please see Figure 3—figure supplements 1 and 2).

Reviewer #2 (Recommendations for the authors):

It was not clear to me what the main hypothesis of the study was. The authors seem to talk rather loosely about habits being developed after overtraining versus the mediating effect of stress on habits. The introduction should convey their main goal more clearly.

Thanks for the opportunity to clarify our main objective and hypothesis. Previous research showed that stress may induce a shift from goal-directed to habit behaviour, yet it remained unclear whether this is due to increased habit behaviour, decreased goal-directed behaviour or both. Existing tasks did not allow a distinction between these alternatives, nor a distinction between goal-directed and habitual contributions to behaviour in general. To overcome these methodological limitations and examine whether stress leads to an upregulation of habitual processing or a downregulation of goal-directed processing (or both), we used EEG-based decoding to identify neural signatures of the goal-directed and habit system, respectively, and their modulation by stress. Thus, our main goal was to investigate whether stress leads to an increase in habitual processing, a decrease in goal-directed processing or both – and to link these neural changes to participants’ actual responses for devalued outcomes. Based on previous rodent studies (Dias-Ferreira et al., 2009; Packard, 1999; Siller-Pérez et al., 2017), we further asked whether the stress effect on goal-directed and habitual processing, respectively, would depend on the extent of training. We did not aim, however, to assess overtraining effects per se (i.e. independent of stress). We have clarified the rationale of our study now. Please see page 5, lines 97 to 101:

“Here, we aimed to overcome these shortcomings of classical paradigms for the assessment of the mode of behavioural control and to examine whether stress leads to an upregulation of habitual processing or a downregulation of goal-directed processing (or both). To these ends, we leveraged electroencephalography (EEG) in combination with multivariate pattern analysis (MVPA)-based decoding of neural representations.”

Please see also page 5, lines 116 to 121:

“Because previous rodent studies suggested that stress or stress hormones modulate the balance of goal-directed and habitual forms of learning after more extended training (Dias-Ferreira et al., 2009; Packard, 1999; Siller-Pérez et al., 2017), we further asked whether the stress effect on goal-directed and habitual processing, respectively, would depend on the extent of training. Although we did not aim to test training-induced changes in the balance of goal-directed and habitual processing per se, we therefore included transient outcome devaluations after initial, moderate, and extended training in the instrumental learning task to assess if and how the predicted changes in neural representations are linked to behavioural manifestations of stress-induced changes in behavioural control.”

Furthermore, we state the main goal of this study also explicitly on page 6, lines 123 to 126:

“The goal of this study was to elucidate the mechanisms underlying the impact of stress on the control of instrumental behaviour. Specifically, we aimed to leverage an EEG-based decoding approach to determine stress-induced changes in outcome and response representations that constitute key distinguishing features of goal-directed action and habitual responding.”

In my view, this task is not meant to test for overtraining effects as it includes repeated devaluation of the outcome within the same subject, which makes it very difficult to assess the extent to which overtraining -rather than simply lack of motivation at the end of the task or other factor- is explaining weaker devaluation effects late in training. This is a critique for any experiment looking at overtraining effects, bit it should be at least demonstrated that what is driving the effect found by the authors is indeed an effect of overtraining by looking at more than one extension of training. Did the authors test other extensions of training in pilot studies?

We completely agree that this study was not designed to test overtraining effects and we did not intend to do so. We state this now more explicitly in the manuscript. Please see, for example, page 5, lines 116 and 117:

“Although we did not aim to test training-induced changes in the balance of goal-directed and habitual processing per se […]”

As we did not intend to test for overtraining effects per se, we did not test other extensions of training.

In terms of alternative explanations for the devaluation effects at the end of training (e.g. lack of motivation), we think that we can rule out many of these. First, participants’ reaction times became faster throughout the task (without a reduction in response accuracy in NoDev trials, thus arguing against an altered speed-accuracy tradeoff), which speaks against lack of motivation or general fatigue. Second, participants continued to perform the valued responses with high accuracy even after more extended training and without relevant changes across blocks. Moreover, if there were general motivation or fatigue effects, it would be difficult to explain why these should occur specifically during devaluation trials (note that even devaluation blocks of one outcome (high vs. low value) included valued trials of the respective other outcome). Third, we included consumption trials in all of the blocks and participants consistently choose the high valued response option throughout the reinforcement learning task, thus speaking again against unspecific effects related to the extent of training. Moreover, it might be difficult to explain the specific links that we see between response and outcome representations on the one hand and participants’ behaviour in the last devaluation block on the other hand with the suggested unspecific effects. We discuss this aspect now on page 33 line 743 to 754:

“While we assume that the opposite effects of stress on neural outcome and response representations were due to the action of major stress response systems, there may have been other factors that have contributed to the present pattern of results, such as motivational factors or fatigue at the end of the task. Although we cannot completely rule out these alternatives, in light of our data we consider them rather unlikely. First, if these factors were a result of the mere amount of training, they should have also occurred in the control group, which was not the case. Even if a specific interaction with the stress manipulation is assumed, it is important to note that reaction times remained fast in stressed participants until the end of training and the response accuracy in valued trials or consumption trials remained very high. Furthermore, the observed specificity of the stress effects, which occurred selectively in devalued trials, can in our view be hardly explained by unspecific influences, such as lack of motivation or fatigue.”

It does not seem to me that the task is relevant for the habit/goals distinction. The authors mention how the model-based/model-free distinction may not be interrogating the habit/goal distinction, but this task suffers from the same shortcomings. This is a problem in general with this type of experiments, in that a participant could behave in an S-R or R-O manner independently of the amount of training and switch back and forth between these two strategies throughout the task. The task is simply not well-suited for testing the habit/goal-directed distinction.

We completely agree that the present behavioural task – as well the existing previous tasks – cannot distinguish between the specific contributions of the goal-directed and habitual system, respectively. This is exactly why we used the specific EEG decoding approach, which helps us to identify specific signatures of the goal-directed and habitual system, respectively. Using EEG decoding, we can identify neural representations of the outcome during (stimulus presentation and) response selection, indicative of action-outcome learning, which is at the heart of the goal-directed system. Further, we can identify neural representations of the response during stimulus presentation, indicative of stimulus-response learning, which is at the heart of habit learning. Interestingly, our Bayesian correlation analysis provides evidence that these outcome and response representations are uncorrelated, suggesting that these may indeed reflect neural signatures of distinct processes. Furthermore, as theoretically predicted, response representations were positively correlated with participants’ responses for devalued outcomes, whereas outcome representations were negatively correlated with participants’ responses for devalued outcomes. Thus, using EEG-based decoding, we aimed to overcome shortcomings of existing behavioural paradigms (including the present behavioural task) in the distinction between specific contributions of the goal-directed and habit system, respectively. We have made this rationale for the present study explicit in the introduction of the manuscript. Please see page 5, lines 97 to 101:

“Here, we aimed to overcome these shortcomings of classical paradigms for the assessment of the mode of behavioural control and to examine whether stress leads to an upregulation of habitual processing or a downregulation of goal-directed processing (or both). To these ends, we leveraged electroencephalography (EEG) in combination with multivariate pattern analysis (MVPA)-based decoding of neural representations.”

Furthermore, we state repeatedly that the present behavioural task (same as other existing tasks) cannot distinguish between goal-directed and habitual contributions to behaviour. Please see, for example, page 5, lines 93 to 96:

“Although these elegant paradigms provided valuable insights into mechanisms involved in behavioural control, they cannot show whether increased responding to devalued or degraded actions is due to reduced goal-directed or enhanced habitual control (or both).”

Or, in the discussion on page 26, lines 562 to 565, directly related to the present task:

“However, similar to the behavioural insensitivity to an outcome devaluation after stress, these ERPs cannot separate reduced goal-directed from increased habitual responding. To disentangle goal-directed and habitual processing, we used an MVPA-based decoding approach.”

In addition, the task also employs multiple devaluation phases, which, if anything, should make participants more sensitive to the devaluation procedure. Was that the case?

We included multiple devaluation phases because previous rodent studies suggested that the stress-induced bias towards ‘habit’ behaviour occurs primarily after more extended training (Packard, 1999; Siller-Pérez et al., 2017). In response to this comment, we have tested whether the multiple devaluations per se made participants more sensitive to the devaluation procedure. We tested this idea by analysing potential changes in the responses across the repeated devaluation blocks in the control group. Importantly, we did not observe any changes (neither increases nor decreases in the proportion of responses for devalued outcomes) that would suggest an increased sensitivity to the devaluation in the responses in the repeated devaluation blocks in the control group (p = .240). This result speaks against the idea that the repeated devaluation per se would affect participants’ sensitivity to the devaluation procedure. For stressed participants, we observed even a reduced sensitivity to devaluation across training (in particular in the last block). Thus, if one would assume that the repeated devaluation would have led to an increased sensitivity to the devaluation, the observed stress effects would be even more remarkable.

We report now explicitly that behaviour did not differ across the repeated devaluation blocks in the control group, which speaks against any unspecific sensitization effects due to the repeated devaluation. Please see page 14, line 294 to line 297.

“In the control group, instrumental behaviour did not differ across the different devaluation blocks (F(2,54) = 1.466, P = 0.240, ƞp² = 0.052, 95% CI = 0.013 to 0.049), which indicates that the repeated devaluation phases per se did not result in an increased sensitivity to the devaluation procedure.”

It would be good to see the learning curves associated with each group of subjects. The authors' main finding is an effect that is only observed in the last block of training and only for one condition (when the outcome that was highly valued is devalued). Overtraining effects are only relevant after responding has plateaued. Inspecting if both groups have already asymptoted, or whether one did it before the other is an important point, I think.

We agree and present now additional line graphs in Supplementary file 2, as suggested. As can be seen in this graph, both groups reached a performance plateau relatively quickly and at about the same time. Performance in NoDev blocks did further not differ between the control and stress groups and the time course of learning was comparable in the two groups (time × group interaction: F(2,112) = 2.44, P = 0.092, ƞp² = 0.042, 95% CI = 0 to 0.123; main effect group: F(1,56) = 0.30, P = 0.585, ƞp² = 0.005, 95% CI = 0 to 0.096). We make this latter point clearer in the Results section and refer the reader to the line graphs in supplemental Figure 2. Please see page 14, lines 286 to 291:

“Both groups reached a performance plateau relatively quickly and at about the same time (Figure 3—figure supplement 1). Performance in NoDev blocks did further not differ between the control and stress groups and the time course of learning was comparable in the two groups (time × group interaction: F(2,112) = 2.44, P = 0.092, ƞp² = 0.042, 95% CI = 0 to 0.123; main effect group: F(1,56) = 0.30, P = 0.585, ƞp² = 0.005, 95% CI = 0 to 0.096), suggesting that stress did not affect instrumental learning per se.”

Please see also Figure 3—figure supplement 1:

The same authors have shown that stress manipulations after training render behavior insensitive to devaluation. This suggests that the effect of stress is on the performance of an action, not the learning process per se. How does this fit with their current results showing that stressing people before training changes devaluation sensitivity only in the last block of training? Why there is an extension of training effect in O representation for stressed participants if the effect is on performance?

This is a relevant point which might not have been sufficiently addressed in the previous version of the manuscript. There are by now several studies, from our lab as well as from others, showing that stress or stress hormone administration before learning may induce a bias from goal-directed to habit behaviour (Braun and Hauber, 2013; Dias-Ferreira et al., 2009; Gourley et al., 2012; Hartogsveld et al., 2020; Schwabe et al., 2010; Schwabe et al., 2011; Schwabe et al., 2012; Schwabe and Wolf, 2009; Soares et al., 2012). In one study to which the reviewer is referring to, we showed that stress leads to a comparable shift when stress is induced after learning and before an extinction test (Schwabe and Wolf, 2010). Although this finding shows that stress may interfere with behavioural expression if learning remained unaffected, this finding does not at all rule out that learning would not be affected. Moreover, it seemed that the shift towards habits was somewhat more pronounced when stress was induced before learning than when induced before the test (as discussed in Schwabe and Wolf, 2010). Thus, we do not see how our previous findings would be in conflict with the present findings. Nevertheless, we address this aspect now in the discussion, stating also explicitly that we were not able (and did not aim) to distinguish between stress effects on acquisition and expression of goal-directed vs. habitual processes. Please see page 34, lines 755 to 770:

“In the present study, stress was induced before learning and outcome devaluation. Thus, stress could have affected the acquisition or the expression of instrumental behaviour (or both). While several previous studies demonstrated that acute stress or the administration of stress hormones before learning may shift instrumental behaviour from goal-directed to habitual control (Braun and Hauber, 2013; Dias-Ferreira et al., 2009; Gourley et al., 2012; Hartogsveld et al., 2020; Schwabe et al., 2010; Schwabe et al., 2011; Schwabe et al., 2012; Schwabe and Wolf, 2009; Soares et al., 2012), there is evidence suggesting that stress before a test of behavioural expression may have a similar impact, i.e. stress may induce habitual responding even when stress left acquisition unaffected (Schwabe et al., 2011; Schwabe and Wolf, 2010). The latter finding, however, does not rule out additional effects of stress on acquisition and indeed the impact of stress appeared to be more pronounced when the stress exposure took place before learning (Schwabe and Wolf, 2010). The present study did not aim to distinguish between stress effects on acquisition or expression of goal-directed vs. habitual behaviour but focussed on the impact of stress of the control of instrumental behaviour per se. Our findings thus do not allow a distinction between stress effects on acquisition vs. expression of instrumental behaviour.”

The authors show that the representation of O for the controls increases with training (Figure 4) while the R representation decreases. Theoretically, these two observations should entail a weaker devaluation effect, as the subject needs to encode a representation of the response being performed in order to attain the outcome for a response to be goal-directed (see recent work from Bouton's lab and Dickinson and Balleine's associative-cybernetic model). Perhaps this should be discussed?

We completely agree that the observed decreases of response representations and increases in outcome representations that we observed in control participants should reduce a potential devaluation effect. However, we did not observe a decrease in devalued actions in controls which may be due to the fact that the proportion of responses for devalued outcomes was generally rather low in controls. We briefly discuss this point on page 34, lines 771 to 777:

“Based on the associative-cybernetic model (Dickinson and Balleine, 1993), it could be predicted that the obtained pattern with increased outcome and decreased response representations leads to even reduced responding for devalued outcomes across training in controls, as individuals need to encode a representation of the response being performed in order to attain the outcome for a response to be goal-directed. We did not observe such a decrease, which may be related to the overall relatively low rate of responses for devalued outcomes in control participants.”

Related to this, the fact that a control group did not show evidence of a decrease in outcome representations at the end of training seems problematic, as their argument is based upon the notion that more extensive training makes responding habitual in the stress group but not in the control group. If stress impacts learning, it should there is no argument for the fact that overtraining in a control group does not change outcome representations.

We completely agree that the lack of a decrease in outcome representations (paralleled by a decrease in response representations) across the training in the control group speaks against a practice-related shift towards habits, in the sense of an ‘overtraining effect’, in the present study. Again, our study was not designed to test an overtraining effect per se and we did not aim to do so. Instead, we aimed to test how stress affects specific neural signatures (derived from EEG-based decoding) of goal-directed and habit behaviour, respectively, and to what extent such stress effects may depend on the extent of training.

However, the fact that there was no overtraining effect in the control group of the present study does, in our view, not at all question the training-dependency of the observed stress effect. Our idea that stress might accelerate a shift from goal-directed to habitual behaviour is based on previous rodent studies that showed that stress or stress hormones accelerate a shift from ‘cognitive’ to ‘habitual’ responses that would otherwise only occur with extended training (Packard, 1999; Siller-Pérez et al., 2017). We agree that based on our own data, this idea remains speculative as we do not see evidence of a training-dependent bias towards habitual behavior/processing. We make this latter point clearer now. Please see page 30, lines 660 to 667:

“Findings on ‘cognitive’ and ‘habitual’ forms of navigational learning in rats, however, demonstrated that stress hormones may accelerate a shift from ‘cognitive’ to ‘habit’ learning that would otherwise only occur after extended training (Packard, 1999; Siller-Pérez et al., 2017). Thus, it is tempting to hypothesize that a similar mechanism might be at work during instrumental learning in stressed humans. This conclusion, however, remains rather speculative as we did not observe a training-dependent shift towards habits in the control group and this group even showed reduced response and increased outcome representations over time, which rather suggests increased goal-directed processing across the task.”

Why is there a stronger effect of devaluation on the O-Low condition across the task? Why a less-valued outcome should be able to support this? (pp. 6, line 239)

Overall, there was a highly significant devaluation effect for both low- and high-valued outcomes. The more pronounced valued vs. devalued difference for Olow per se (i.e. outcome devaluation × stimulus value interaction) remained even when we analysed only the control group (outcome devaluation × stimulus value interaction: F(2,52) = 70.601, P < 0.001, ƞp² = 0.731, 95% CI = 0.391 to 0.720; valued vs. devalued during Dev Ohigh: t27 = 8.482, Pcorr < 0.001, d = 1.603, 95% CI = 1.032 to 2.16; valued vs. devalued during Dev Olow: t26 = 8.654, Pcorr < 0.001, d = 1.665, 95% CI = 1.071 to 2.246), suggesting that it is not just due to the increased responding for devalued high-outcomes in the stress group. We added this information in the Results section, please see page 14, lines 304 to 310:

“The more pronounced valued vs. devalued difference for Olow per se (i.e. outcome devaluation × stimulus value interaction) remained even when we analysed only the control group (outcome devaluation × stimulus value interaction: F(2,52) = 70.601, P < 0.001, ƞp² = 0.731, 95% CI = 0.391 to 0.720; valued vs. devalued during Dev Ohigh: t27 = 8.482, Pcorr < 0.001, d = 1.603, 95% CI = 1.032 to 2.16; valued vs. devalued during Dev Olow: t26 = 8.654, Pcorr < 0.001, d = 1.665, 95% CI = 1.071 to 2.246).”

This observed pattern may be due to a higher habitualization for Ohigh stimuli. The association with a higher reward/outcome value may have led to a higher saliency of the stimuli, which may have resulted in steeper learning curves (as shown in our study) and promoted the formation of stimulus-response (S-R) associations. These stronger S-R associations may have resulted in more responses for the respective outcomes even when these were devalued. We address this potential explanation in the discussion now. Please see page 32, lines 705 to 717:

“In addition, stressed participants showed an increase of insensitivity to outcome in Dev Ohigh but not in dev Olow. Moreover, we found that the devaluation effect for Ohigh stimuli was stronger compared to the effect for Olow stimuli. This difference remained even when we analyzed only the control group, i.e. excluding the possibility that the difference between Olow and Ohigh was merely due to the fact that stress increased specifically the insensitivity to the devaluation of Ohigh. However, why may the devaluation effect be lower for Ohigh than for Olow and why may stress have affected primarily the devaluation of Ohigh? These results suggest a stronger habit formation for stimuli that were paired with high rewards. A potential explanation for this pattern takes the links between rewards, stimulus saliency and strength of stimulus-response associations into account: the initial association with high valued outcomes may have increased the salience of the respective stimuli, which in turn may have promoted the formation of stimulus-response associations. These stronger S-R associations, in turn, may have resulted in more habitual responses for the devalued outcomes.”

The devaluation effect for the stress group in block 2 seems stronger than in block 1. If that is the case, then the overtraining argument becomes problematic, as it should supposedly lead to weaker effects of devaluation with training, that is, devaluation should be weaker in block 1 than block 2, and weaker in block 2 than block 3 (the latter contrast being the one they report as being statistically significant). Also, is the difference between stress and control in block 3 different to the difference between stress and control in block 2?

We performed the suggested analyses and obtained no significant differences. There was no evidence for an interaction of block number (1 vs. 2) and experimental treatment (F(1,56) = 1.575, P = 0.215, ƞp² = 0.027, 95% CI = 0.013 to 0.052), which would have been indicative of the decrease from block 1 to 2 the reviewer was referring to. There was, however, a significant interaction between block (2 vs 3) and treatment when analysing the changes from block 2 to block 3 (F(1,56) = 13.589, P < 0.001, ƞp² = 0.195, 95% CI = 0.105 to 0.319). Follow-up tests showed that after moderate training intensity (i.e. in block 2), groups did not differ in responses for the devalued outcome (t56 = 0.165, P = 0.870, Cohen’s d = 0.043, -0.472 to 0.558). As reported in the Results section, after a higher number of repetitions (i.e. in devaluation block 3), stressed participants responded significantly more often for the devalued outcome than control participants. Thus, while there were no significant changes from block 1 to block 2, there was a significant increase in devalued responses in the stress group, but not in the control group, from block 2 to block 3. We focus now more explicitly on the (potential) changes from block 1 to 2 and block 2 to 3 on page 15, line 327 to 334:

“Furthermore, while there was no evidence for an interaction of devaluation block number (1 vs. 2) and experimental treatment (F(1,56) = 1.575, P = 0.215, ƞp² = 0.027, 95% CI = 0.013 to 0.052) when analysing the changes from the first to the second devaluation block, we obtained a significant interaction between block (2 vs. 3) and treatment when analysing the changes from block 2 to block 3 (F(1,56) = 13.589, P < 0.001, ƞp² = 0.195, 95% CI = 0.105 to 0.319). Moreover, follow-up tests showed that after moderate training intensity (i.e. in block 2), groups did not differ in responses for the devalued outcome (t56 = 0.165, P = 0.870, Cohen’s d = 0.043, -0.472 to 0.558).”

The theoretical aspect that the study aims go address is whether the devaluation effects are due to less goal-directed control or more habitual control. In their Discussion section, the authors argue that a response cannot be goal-directed and habitual at the same time. However, Perez and Dickinson (2020) have recently published a free-operant model where it is perfectly possible that a response is driven in equal proportions by each system. It is only in the case of discrete-trial procedures such as the present one and the 2-step task from the model-based/model-free tradition that the answer is difficult.

According to canonical operational definitions, instrumental behaviour can at least not be fully goal-directed and fully habitual at the same time (Adams, 1982; Dickinson and Balleine, 1994). However, we completely agree that behaviour may not always necessarily be either fully goal-directed or habitual and there may be different degrees to which behaviour is under goal-directed or habitual control (this assumption formed actually the basis of our study idea). We further agree that both the goal-directed and habit system may contribute to the probability of responding, as proposed by Perez and Dickinson (2020). Here, we used an MVPA-based decoding approach to overcome the methodological shortcomings of behavioural paradigms with discrete behavioural responses and to unravel to contributions of the goal-directed and habitual systems to behaviour. Our findings point to distinct (neural) contributions of these systems.

We refer now also to the recent work by Perez and Dickinson, 2020 when we discuss the dissociable contributions of goal-directed and habitual processes to behaviour. Please see page 35, line 786 to 788:

“Classical behavioural paradigms involving discrete responses, however, can hardly disentangle goal-directed and habitual components in a specific pattern of responding (e.g. insensitivity to outcome devaluation). […] Recently, a free-operant model was proposed that allows a behavioural dissociation of goal-directed and habitual contributions to behaviour (Perez and Dickinson, 2020).”

It seems puzzling that the control group shows decreased representations of the response with training. If anything, that should be maintained or augmented with training. This is another reason to question the overtraining argument of this paper.

We completely agree that the control group did not show signs of increased habit behaviour across the reinforcement learning task. Again, our study was not designed to test an overtraining effect per se and we did not aim to do so. Instead, we aimed to test how stress affects specific neural signatures (derived from EEG-based decoding) of goal-directed and habit behaviour, respectively, and to what extent such stress effects may depend on the extent of training. This rational of the study has now been made clearer in the introduction (please see page 5, lines 97 to 101).

“Here, we aimed to overcome these shortcomings of classical paradigms for the assessment of the mode of behavioural control and to examine whether stress leads to an upregulation of habitual processing or a downregulation of goal-directed processing (or both). To these ends, we leveraged electroencephalography (EEG) in combination with multivariate pattern analysis (MVPA)-based decoding of neural representations.”

Please see also page 5, lines 112 to 121:

“Because previous rodent studies suggested that stress or stress hormones modulate the balance of goal-directed and habitual forms of learning after more extended training (Dias-Ferreira et al., 2009; Packard, 1999; Siller-Pérez et al., 2017), we further asked whether the stress effect on goal-directed and habitual processing, respectively, would depend on the extent of training. Although we did not aim to test training-induced changes in the balance of goal-directed and habitual processing per se, we therefore included transient outcome devaluations after initial, moderate, and extended training in the instrumental learning task to assess if and how the predicted changes in neural representations are linked to behavioural manifestations of stress-induced changes in behavioural control.”

We referred to the decrease in the neural response representation across the task in controls on page 21, lines 467 to 472:

“As shown in Figure 2H, participants of the stress group showed a stronger response representation, reflected in a higher classification accuracy for the response categories, with increasing training intensity (first half vs. last half: t25 = 2.51, Pcorr = 0.038, d = 0.491, 95% CI = 0.079 to 0.894), whereas there was even a decrease in the control group (first half vs. last half: t24 = 3.50, Pcorr = 0.004, d = 0.701, 95% CI = 0.256 to 1.134).”

Please also see page 30, line 660 to 667 in the Discussion section:

“Findings on ‘cognitive’ and ‘habitual’ forms of navigational learning in rats, however, demonstrated that stress hormones may accelerate a shift from ‘cognitive’ to ‘habit’ learning that would otherwise only occur after extended training (Packard, 1999; Siller-Pérez et al., 2017). Thus, it is tempting to hypothesize that a similar mechanism might be at work during instrumental learning in stressed humans. This conclusion, however, remains rather speculative as we did not observe a training-dependent shift towards habits in the control group and this group even showed reduced response and increased outcome representations over time, which rather suggests increased goal-directed processing across the task.”

Other points:

The analysis of response representations after extended training is based in comparisons of first half and last half of training, why that choice?

We contrasted the first against the last half of the training to ensure a sufficient number of trials within the decoding approach. Importantly, however, when we grouped the accuracy neural representation accuracy data differently, the pattern of results was largely comparable.

The results of alternatively grouped data are presented in the Supplementary File 1A.

In the discussion, the authors mention that "habit behavior requires extensive training", but a recent preprint by Pool et al. (2021) shows that most participants can be habitual after limited training.

We agree that our initial wording was a bit too far-reaching and have rephrased this sentence, also based on a respective comment of reviewer #1 as follows (page 28, lines 633 to 637):

“For instance, while model-free learning can be present after only few trials, establishing habit behaviour relies on the strength of the stimulus-response association, which among other factors such as reward magnitude or stimulus saliency, has been suggested to depend on repetition (Adams, 1982; Dickinson et al., 1995; Tricomi et al., 2009; but see Wit et al., 2018).”

The recent manuscript by Pool et al. (2021) showed indeed that some participants perform habitually after ‘limited’ training. It is, however, important to note that this ‘limited’ training comprised already several hundred responses (12 blocks of 40 seconds and 3-4 responses per second). Thus, these findings are not in conflict with the idea that habit behaviour requires (among other factors) repetition, as we state now in the rephrased sentence.

Pp 24. Line 522: "Furthermore, while model-free learnins is based on previously experienced outcomes, habit learning is defined as S-R learning without links to the outcome engendered by a response." I could not follow this idea, can you clarify?

We thank the reviewer for the opportunity to clarify this aspect. Model-free learning is based on the value of rewards, i.e. it explicitly takes the value of an outcome into account. This is in contrast to habitual S-R-processes which are assumed to depend on the association between responses and antecedent stimuli regardless of the outcome that is engendered by the action. We added this information in the manuscript, please see page 29 lines 640 to 644:

“In other words, whereas model-free learning is dependent on the expected reward and hence the outcome that follows a response (Dayan and Berridge, 2014; O'Doherty et al., 2017), habit learning is operationally defined as being independent of the motivational value of the outcome (Adams, 1982; Adams and Dickinson, 1981; Balleine and Dickinson, 1998).”

During the presentation of the results, the authors show statistics for both the O-high and O-low conditions. However, the O-low condition did not add anything to the analysis, as it did not show any effects of devaluation. Therefore, for the sake of simplicity and the reader's joy, I'd simply leave that for the supplemental material and focus on the part of the experiment that is relevant for the point the authors are trying to make.

Please note that eLife does not allow supplementary text (only Supplementary files including figures). As we do think that the dependency of the effects on the value of the outcome (high vs. low) is relevant and for the sake of completeness, we would strongly prefer to leave the respective information in the (main) text – if both the reviewers and editors agree.

Table 1: The caption mentions that the scales ranged from 1-10, but the results show means higher than 10. Am I missing something here?

Thank you for noticing this mistake, which has now been corrected. The correct range is from 1-100. Please see page 12:

Pp. 6, line 134 "offer, trade" should apparently read "offer, or trade"

Thank you for noticing this mistake, is should actually read “and trade”. Please see page 6, line 138 to 140:

“Finally, participants received feedback about whether the alien accepted the offer and traded one of the desired cards, and how many points were earned (O).”

This approach was employed by McNamee and colleagues (2015) using fMRI. I think it should be cited.

We agree and cite this reference now on page 8, line 170:

“Outcome representations at the time of S presentation and R are indicative for goal-directed control. Habitual control, on the other hand, should be indicated by increased R representations at the time of stimulus (S) presentation (McNamee et al., 2015).”

pp. 15, line 308: "tended to be sensitive". I'd just say it was not significant.

We agree and refer to this result now as a non-significant trend. Please see page 18, lines 390 to 398:

“Moreover, we identified a late component that showed a non-significant trend towards sensitivity to the outcome devaluation during Dev Ohigh blocks in control participants (devalued vs. valued: t24 = 1.91, P = 0.068, d = 0.382, 95% CI = -0.028 to 0.785) but not in stressed participants (devalued vs. valued: t27 = 1.57, P = 0.127, d = 0.297, 95% CI = -0.084 to 0.673; outcome devaluation × stimulus value × group interaction: F(2,102) = 5.20, Pcorr = 0.042, ƞp² = 0.093, 95% CI = 0.008 to 0.199; stimulus value × group interaction: F(1,51) = 6.05, P = 0.017, ƞp² = 0.106, 95% CI = 0.003 to 0.273; no such effect in NoDev and Dev Olow blocks: stimulus value × group interaction: both F(1,51) < 1.44, both P > 0.236, both ƞp² < 0.027, both 95% CI = 0 to 0.159).”

pp. 18, line 370: P = 0.034, should be P_corr?

The mentioned statistic refers to an overall interaction effect, for which only one test was performed. Therefore, a correction for multiple testing is not required here.

Pp 31, lines 683-684. Could you please clarify the trial structure and numbers? What do you mean by "27 trials each"?

Participants completed 27 NoDev blocks, 3 Dev Ohigh blocks, and 3 Dev Olow blocks. Each block consisted of 27 trials (12 learning trials with Shigh, 12 learning trials with Slow, and three consumption trials). We clarified this in the text; please see page 71, lines 317 to 319:

“Participants completed 27 NoDev blocks, 3 Dev Ohigh blocks, and 3 Dev Olow blocks, with 27 trials per block: 12 learning trials with Shigh, 12 learning trials with Slow, and three consumption trials.”

Reviewer #3 (Recommendations for the authors):

1. I think that the manuscript would benefit from more of an effort to link the current results to other related work. For one, the authors briefly touch on how the results relate to the 'model-based and model-free' dichotomy described in other work, but mostly highlight how the operationalization of 'model-free behavior' differs from 'habits,' as described here. However, I think it is notable that the authors find that habitual behavior involves a reduction in prospective consideration of an outcome, similar to other work that has shown that model-based behavior increases prospective consideration of future states during decision-making (e.g. Doll et al., 2015). Do the authors believe that 'habit/goal' and 'model-free/model-based' decision-making strategies might share common mechanistic features?

Although we focussed in the manuscript mainly on differences between the two frameworks to point out that an analysis of model-free vs. model-based processes would be less helpful to address the key question of our study, we fully agree with the reviewer and we do think that habitual/goal-directed and model-free/model-based processes may share some mechanistic features. These commonalities relate to the consideration of an outcome but also to the underlying neural structures. We have added now that there are also several commonalities between the habit/goal-directed and model-free/model-based frameworks and we link our findings more to other work in this area.

Please see page 28, lines 614 to 628:

“There are indeed several commonalities between goal-directed vs. habitual processes on the one hand and model-based vs. model-free processes on the other hand and these processes may share some mechanistic features. For instance, both goal-directed and model-based processes involve a complex representation of the environment and allow for flexible, deliberate action compared to rather rigid model-free or habitual processes (Dolan and Dayan, 2013). Moreover, at a neural level both goal-directed and model-based behaviour are associated with overlapping prefrontal and medial temporal areas (Balleine and O'Doherty, 2010; Gläscher et al., 2010; Valentin et al., 2007). Further, there is evidence that also the balance of model-based and model-free processes can by modulated by stress (Cremer et al., 2021; Otto et al., 2013; Radenbach et al., 2015) and our finding that more habitual responses after stress correlated negatively with neural outcome representations appears to be generally in line with earlier findings showing that model-based behaviour increases prospective consideration of future states during decision-making (Doll et al., 2015).”

2. I am somewhat concerned that saccadic eye movements could be acting as a confound for some of the EEG decoding results and would appreciate the authors clarifying a few points related to this and possibly doing some additional control analyses to rule out this possibility:

a. The authors mention that the symbols for the aliens differed by position – was this position fixed for each alien tribe, i.e. did the blue tribe always appear on the left for a particular participant? If so, this could drive anticipatory eye-movements in that direction.

The positions of the symbols for the aliens on the computer screen were fully randomized. Thus, it is highly unlikely that anticipatory eye-movements could bias our decoding results. We have added this important information in the method section, please see page 40, lines 897 to 898:

“The position of the aliens (left/right) representing the response options was randomized across trials.”

b. It is not clear which electrodes are driving the change in outcome and response representation in the decoding analysis. An analysis examining which features are contributing most to successful decoding in the stimulus and response period would be particularly important to rule out the possibility that the main results of interest are not driven by anticipatory eye-movements, such as saccades in the response phase where two stimuli are placed side-by-side.

We appreciate the opportunity to clarify this issue. First of all, the classifier applies those electrophysiological properties for testing in the reinforcement learning task (RT) that had been identified in the delayed-matching-to-sample (DMS) task for the separation of object vs scene and blue vs red symbols. Artefacts that occur only during the reinforcement task will simply be ignored as they did not constitute discriminative features in the training data set (Grootswagers et al., 2017). Thus, artefacts could have affected the classifier only if the same artefacts were present during both the DMS task and the RT, which is rather unlikely. Moreover, while the classifier was trained on the epoched EEG data of the DMS delay phase, during which participants were constantly presented with a blanked screen, the trained classifier was applied to the EEG data during stimulus presentation and response choice in the RT. Thus, the visual input during the train and test data differed substantially, thus making, for example, similar saccades even more unlikely. Furthermore, the training DMS data were collected in two sessions, of which one was completed before participants underwent the stress induction and another after completion of the reinforcement learning task. Importantly, the classifier performance did not differ between both DMS sessions and was not modulated by the experimental treatment. Consequently, we think that (i) group-specific artefacts and (ii) emotional and motivational confounds that could have arisen from the RT completion such as fatigue cannot account for our decoding results. Moreover, the positions of the symbols for the aliens on the computer screen were fully randomized. Thus, we do not think that anticipatory eye movements, such as saccades during the response phase, could have had driven the EEG in the stress group at the end of the task. Apart from the methodological principles of the EEG decoding and our specific task design, there is also evidence that classifiers have the capacity to ignore bad channels or suppress noise during training, making artefact correction, in general, less critical in decoding analyses (Grootswagers et al., 2017). Together, we think it is highly unlikely that could have biased the decoding results.

In addition, we used eye-tracking during the EEG measurement to rule out biases due to eye movements. We have added now data showing that stress did not interact with outcome value, trial type or training intensity to alter blinks or saccades. Please see page 24, lines 532 to 537:

“Furthermore, we recorded eye-tracking data to control for potential group differences in saccades or eye blinks. These control analyses showed that there were no significant group differences in saccades or eye blinks across the task or depending on the trial type (outcome devaluation × stimulus value × time × group: F(4,196) = 0.78, P = 0.54, ƞp² = 0.02, 95% CI = 0.002 to 0.008; outcome devaluation × stimulus value × group: F(2,98) = 1.03, P = 0.36, ƞp² = 0.02, 95% CI = 0.005 to 0.020; see supplementary file 1D).”

Furthermore, we added a brief section on the eye-tracking to the methods section.

Please see page 42, lines 959 to 964:

“Eye-tracking. We used a desktop mounted eye-tracker (EyeLink 1000; SR-Research Ltd., Mississauga, Ontario, Canada) to record monocular eye movements from the left eyes at a sampling rate of 500Hz. We used custom scripts implemented in MATLAB (The Mathworks, Natick, MA) to extract the mean saccades and blink information depending on the stimulus value, outcome devaluation and time (zero to 2000 ms around the stimulus onset). Data for two participants were missing due to a failure of the used eye-tracker.”

c. Relatedly, I would have appreciated more detail on the preprocessing pipeline used to clean-up the data, in particular how saccadic eye-movements and blinks were corrected, or removed, and if central fixation was enforced with eye-tracking or EOG.

For both, the ERP and MVPA analysis, we first applied the PREP pipeline to transform the channel EEG data using a robust average reference. Furthermore, bad channels were interpolated and a high pass filter of 0.1 Hz and a low pass filter of 100 applied.

Similar to previous studies (e.g. Mai et al., 2019), we did not perform any artefact correction on the input data for the MVPA approach. It has been shown that classifiers are highly robust against movement and eye-blink artefacts (Grootswagers et al., 2017). As the classifier is able to ignore bad channels and noise in the training procedure, artefact correction is not an essential part of the decoding pre-processing pipeline (Grootswagers et al., 2017). Furthermore, participants completed two DSM sessions: the first took place after participants underwent the stress or control manipulation and the second DMS task was completed after the reinforcement task. Importantly, the classifier performance did not differ between both DMS sessions and was not modulated by the experimental treatment. Therefore, our experimental design in combination with methodological properties inherent in the classifier approach makes it rather unlikely that our decoding results could have been biased by artefacts.

For the ERP analysis, we removed muscular artefacts and corrected for blinks and eye movements via independent component analysis. We added this information now in the manuscript, please see page 45, lines 1028 to 1029:

“In addition, we removed muscular artefacts and corrected for blinks and eye movements via independent component analysis.”

3. Looking at the scatterplots of Figure 5A and 6, the values on the y-axis appear to be rather different. Where the control and stress groups are mostly separate on this dimension in Figure 5A, they are highly overlapping in Figure 6 such that there does not appear to be any group difference. Are these different data and if so what is the rationale for selecting one set of data for one analysis and a different set for the other?

For some data points in the previous Figures 5A and 6A, the colour-based group assignment was incorrect (i.e., some data points were coloured in blue instead of red). We apologize and have corrected these mistakes and present now the corrected colouring. Please see page 21.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

(1) Reviewer 2 notes that enhanced response representations do not necessarily reflect an increased involvement of the habit system – this is particularly true if such representations are not negatively correlated with outcome representations, since the latter retrieves the former according to several associative accounts of goal-directed behavior. Thus, the stated assumption, that "habitual responding is reflected in enhanced response representations" is not valid. The reviewer suggests, and I agree, that claims based on this assumption should be removed or significantly toned down.

We completely agree that, according to associative learning accounts, response representations may be relevant for both ‘goal-directed’ S-R-O learning and ‘habitual’ S-R learning. Therefore, the assumption that response representations are indicate for the habit system is not valid. On the other hand, our empirical data do, in our view, lend some support for the idea that outcome representations are indicative for the goal-directed system, whereas response representations are linked to the habit system: while outcome representations were negatively correlated with participants’ responding for devalued outcomes, response representations were positively correlated with responses for devalued outcomes. Nevertheless, we are now more careful in our wording and rephrased the relevant sentences throughout the manuscript (including the abstract) to (i) avoid the impression that we consider response representations as being directly indicative for the habit system and (ii) make clear that response representations may be relevant for both goal-directed and habit learning. Moreover, we refer now simply to outcome and response representations or processing, instead of goal-directed and habitual processing, to stick closer to what we actually measured, as suggested by the reviewer. Please see, for example, page 8, lines 175 to 177:

“Outcome representations at the time of S presentation and R are indicative of goal-directed control. In contrast, response representations at the time of stimulus-representations, may be relevant for both goal-directed S-R-O learning and habitual S-R learning.”

Or on page 21, lines 458 to 461:

“While it is assumed that the outcome representation that is crucial for goal-directed S-R-O learning is reduced with increasing habitual behaviour control, response (R) representations at the time of stimulus (S) presentation may be involved in both goal-directed S-R-O and habitual S-R processing.”

Or on page 24, lines 538 to 539:

“…to provide evidence that acute stress results in a decrease of outcome-related processing that is critical for goal-directed control, and paralleled by an increase in response processing.”

(2) Reviewer 3 notes that the new FRN analyses are not sufficiently contextualized, and may even be erroneous, and also that the previous request for more information regarding decoding features went unanswered.

We have checked the FRN analyses but decided – based on the comments of reviewer 3 – to remove the FRN from this revised manuscript because this analysis appeared to be (a) only indirectly relevant to the present study and (b) potentially unreliable given the low number of errors.

We provide now information on the features that contributed the most to decoding the responses and outcomes. A parieto-occipital cluster was contributing the most to decoding outcome representations and a centro-parietal cluster was contributing the most to decoding response representations. For more details, please see our responses to the respective comments below.

(3) For my part, I am still confused about the distinction between model-free RL and habits. The authors state:

"It might be argued that an alternative way of disentangling goal-directed and habitual contributions to instrumental responding under stress would be through an analysis of model-based and model-free learning …"

Who would argue that? Why would that analysis be better? What exactly would evaluation of these models add here?

And then:

"For instance, while model-free learning can be present after only few trials, establishing habit behaviour relies on the strength of the stimulus-response association, which among other factors such as reward magnitude or stimulus saliency, has been suggested to depend on repetition Furthermore, while model-free learning is based on previously experienced outcomes, habit learning is defined as S-R learning without links to the outcome engendered by a response … In other words, whereas model-free learning is dependent on the expected reward and hence the outcome that follows a response, habit learning is operationally defined as being independent of the motivational value of the outcome

This is incorrect. The model-free Q-value *is* the S-R association: both depend on the reward prediction error and the learning rate, both reflect previous reinforcement, and in that sense an expectation of reward, both are void of sensory-specific outcome features, and thus insensitive to devaluation, and both can develop rapidly or not, depending on the learning rate. I suggest that the entire section on model-based and model-free learning is remove.

As the section on model-based and model-free learning is only indirectly relevant for the current study and seemed to be less helpful, we decided to remove this entire section, as suggested by the Reviewing Editor.

(4) Finally, the writing needs some attention. As an example, it is repeatedly stated that canonical behavioral assays can "hardly" disentangle habitual and goal-directed behavior. The term "hardly" means barely, as in "I can barely see that distant object", or alternatively, it can mean definitely not, as in "You can hardly expect me to …" I don't think either is the intended meaning here, and there are other problems throughout the manuscript. I strongly suggest proofing by a native English speaker.

We went over the manuscript again, focussing in particular on language aspects.

Furthermore, our revised manuscript has now been proofread by a native English speaker. We hope that the problems the reviewer was referring to have now been corrected.

Reviewer #2 (Recommendations for the authors):

The authors have made a good effort to respond to all of my concerns. Given that my initial concerns with the task were not well received, I will not insist at this point on the merits of this task to produce or investigate habitual control, so I will focus on the author's responses only and some theoretical points that I think should be re-written to improve the quality of the manuscript.

The authors insist that their aim *was not* to investigate overtraining, but the very definition of overtraining that one could infer from the habit literature is that it refers to the amount of training that produces habitual behaviour. They used devaluations across different extensions of training to investigate whether " stress leads to an increase in habitual processing, a decrease in goal-directed processing or both ", so I'm still not convinced with their assertion.

We used a first devaluation block very early in training (i.e. after just one block of training) and another one at the end of the training session. However, the training as a whole took place in a single session that included about 300 responses per stimulus. Although this trial number allows an analysis of the goal-directed vs. habitual control of behaviour (see also Valentin et al., 2007 or Wit et al., 2009 for paradigms including even less trials), compared to studies that were specifically interested in overtraining effects, this trial number is rather low. For instance, Tricomi et al. (2009) used a three-day training protocol that involved in total more than 10.000 responses (12 sessions of 8 minutes each, with on average 2 responses per second). Similar protocols, including significantly more responses than in the present task, were employed in recent studies that specifically aimed to test for overtraining effects (Pool et al., 2022; Wit et al., 2018). Thus, we do not think that the number of trials realized in the present study was sufficient to test for potential overtraining effects, which, again, were not a focus of the present study.

We elaborate on this aspect on page 27, lines 596 to 600:

“However, whether or not overtraining may induce habitual behaviour in humans is currently debated (Wit et al., 2018) and our data cannot speak to this issue as training may have been too limited to result in overtraining-related habits (which might require thousands of responses; Tricomi et al., 2009).”

Related to this, they are strongly assuming that the decoding analysis will give insights as to the interaction and development of habitual and goal-directed (g-d), but you need a theory of habits and actions in order to test for that. In my view, what they are doing in this paper is more empirical than theoretical ("do outcome representations decrease with training?; do response representations increase with training?), and I'd suggest they delete any reference to what they believe is the interaction between the systems in this task – right now they are motivating the paper as providing insights into the question of whether habits are a consequence of increased habits or decreased g-d control. I personally see this as a more empirical than theoretical paper, and the current motivation is poorly justified from a theoretical perspective, I think. For example, they assume that R representations should be stronger the more active a habit is, but Bouton and his colleagues have demonstrated -at least behaviorally- that this not necessarily the case.

Although our data do show that outcome representations correlate negatively with responding for devalued actions, whereas response representations correlate positively with this behavioural readout of habitual behaviour, we completely agree that, from a theoretical point, it is not valid to take response representations as a direct indication of the habit system (as response representations are relevant for both S-R-O and S-R learning). We are now more careful in our wording and removed passages which indicated that response representations would be a direct read-out of the habit system. Instead, we emphasize now repeatedly that response representations are relevant both for goal-directed S-R-O and habitual S-R learning. As suggested by the reviewer, we stick now primarily to what we have measured and simply refer to outcome and response representations (or processing), respectively.

Please see, for example, page 8, lines 175 to 177:

“Outcome representations at the time of S presentation and R are indicative of goal-directed control. In contrast, response representations at the time of stimulus-representations, may be relevant for both goal-directed S-R-O learning and habitual S-R learning.”

Or on page 21, lines 458 to 461:

“While it is assumed that the outcome representation that is crucial for goal-directed S-R-O learning is reduced with increasing habitual behaviour control, response (R) representations at the time of stimulus (S) presentation may be involved in both goal-directed S-R-O and habitual S-R processing.”

Or on page 24, lines 538 to 539:

“…provide evidence that acute stress results in a decrease of outcome-related processing that is critical for goal-directed control, and paralleled by an increase in response processing.”

The authors state that "goal-directed but not habitual control relies on the motivational value of the outcome" (pp. 4). Without wanting to be pedantic, habits do depend on the motivational value of the outcome to develop (unless you take a Guthrian view of S-R contiguity being sufficient to produce them), so I'd rather rephrase this as "only the g-d system is sensitive to changes in the motivational value of the outcome in absence of new experience with it", or something along those lines. This should make it clear that it is the devaluation test what distinguishes between the two types of control.

We agree and have rephrase this sentence as suggested.

Please see page 4, lines 89 to 92:

“These paradigms are based on the key distinctive feature of goal-directed and habitual control, i.e. that only the goal-directed system is sensitive to changes in the motivational value of the outcome in absence of new experience with it (Adams, 1982; Dickinson and Balleine, 1994).”

McNamee and colleagues did a very similar paper using fMRI decoding, but the authors cite this paper briefly, without any reference to what the paper is about and what it found. I think that they should be more detailed about what these other authors did in that paper, discussing it in the introduction and motivating their experiment as an extension of the same idea using a different method (EEG).

We agree and refer to the study by McNamee et al. (2015) and its findings in more detail. Please see pages 5 and 6, lines 115 to 119:

“Using a similar decoding approach on fMRI data, a previous study showed that brain regions implicated in goal-directed control contained information about outcomes and responses, whereas regions associated with habitual responding contained only information about responses (but not outcomes) at the time of stimulus presentation (McNamee et al., 2015).”

Pp 10. "Degraded actions" suggests degraded action-outcome contingency. This is not the manipulation employed in the paper.

We completely agree and have corrected this mistake. The sentence reads now “In order to adequately assess whether increased responding to devalued actions is due to reduced goal-directed or enhanced habitual control (or both),…” (please see page 8, lines 178 to 179).

Again, without wanting to be pedantic, the authors state that: ""In other words, whereas model-free learning is dependent on the expected reward and hence the outcome that follows a response (Dayan and Berridge, 2014; O'Doherty et al., 2017), habit learning is operationally defined as being independent of the motivational value of the outcome (Adams, 1982; Adams and Dickinson, 1981; Balleine and Dickinson, 1998).", but algorithmic representations of habit learning (and pavlovian conditioning) since the time of Bush and Mosteller and before are based on the notion of prediction-error, that is, they are represented as model-free algorithms. The "expected reward" in MF algorithms is not a psychological, but statistical concept (it turns out that prediction-error learning can be taken as estimating the value of stimuli or states.) What is critical is that these algorithms are not sensitive to changes in outcome value unless the new value is re-experienced and updated, whereas model-based or g-d algorithms are.

We agree. However, as suggested by the Reviewing Editor, we have now decided to remove the entire section on model-based and model-free learning, as this section was only indirectly relevant to the present study and seemed to be rather confusing.

Reviewer #3 (Recommendations for the authors):

The authors have addressed my main concerns about potential eye-movement artifacts acting as confounds in the decoding analysis, and mostly answered my questions about the preprocessing procedures for the EEG data. In the process of addressing many of the points raised during review, they had had to substantially tamp down some of their claims and provide more context, caveats and speculation regarding explanations for their observations. I don't think this is necessarily a bad thing, as the pattern of data rarely perfectly match our predictions and such deviations from expectation and theory are usually the grist of new research and investigation. Moreover, the main results and their interpretation hold up despite some points that remain a bit muddy (e.g. why the effects of stress occur later in the experiment rather than closer to the time of the stressful event). However, not all of my points were addressed in the initial round of reviews and I have some remaining questions – mostly arising from new analyses in the revision, and would also like additional clarity on some methodological points.

1. In responding to the other reviewers, the authors have added some new analyses of the feedback-related negativity to test if participants in the stress and control groups differed in how the processed error feedback. Interpretation of the null result here seems rather indirect, as it assumes that this ERP would change in response to differences in the sensitivity of participants to error feedback. I suspect that previous work has found such individual differences in groups with greater error sensitivity before, but that work is not cited here. It would be helpful to provide more background on what this ERP is thought to reflect to provide additional context to this result.

We agree that the interpretation of the null result is rather indirect here. Even more importantly, however, participants made substantially fewer incorrect than correct trials, which may render the FRN analysis less reliable and may have actually resulted in the rather unusual pattern of the FRN to which the reviewer is referring in his/her next comment. We therefore decided to remove this analysis from the manuscript.

Please note that we added the FRN analysis in the previous round of revision in response to a reviewer comment asking us to address the potential influence of negative social feedback on avoidance of the non-devalued action. We address this issue explicitly and argue that it is in our view rather unlikely that the negative feedback had a major impact on our results. These arguments remain and, as mentioned by the reviewer, the FRN would have been only of indirect relevance here. Please see page 28, line 629 to page 29, line 639:

“It is important to note that participants received an error feedback in devalued trials when they chose the response option that was not associated with the now devalued outcome. Given that acute stress may increase sensitivity to social cues (Domes and Zimmer, 2019), one might argue that stressed participants continued to respond towards devalued outcomes in order to avoid being presented with the error-feedback screen. However, we consider this alternative to be unlikely. First, with respect to the neural outcome and response representations, these were analysed during NoDev blocks in which groups did not differ in their behavioural performance accuracy, and consequently not in the frequency of error feedback. Furthermore, participants’ performance in devalued blocks was directly associated with the observed changes in neural outcomes and response representations during the NoDev blocks, which again, could not be biased by differences in error feedback processing.”

2. The FRN in Figure 2 —figure supplements 4-6 looks very different from what I expected, and what I believe, is conventional in the literature. The response on error trials is generally more positive than errors in the 200-300 ms post-feedback window that the authors focus on in this analysis. Generally, I believe this ERP is visible as a more negative deflection on error trials riding on a larger positive response to the feedback stimulus in this time window (e.g. Yeung et al. 2005, Cerebral Cortex). The baseline for error and correct trials also differ substantially in the Fz electrode – and differ quite a bit from zero. The unusual appearance of these ERPs make me somewhat concerned that there might be an error in the analysis.

We agree that the pattern of results differs from conventional findings. Therefore, we have re-examined our analyses. We obtained a minor mistake in the baseline correction. After correcting this mistake, we obtained plausible baseline values for all three electrodes. However, the general pattern of the FRN remained unchanged and critically distinct from the typically observed FRN (e.g. Yeung et al., 2005). In retrospect, this finding may be owing to the generally very low number of incorrect compared to correct trials (e.g., 99% correct trials in the NoDev blocks), which may render the FRN in the present study less reliable. In light of concerns of reliability of the FRN, we decided to remove the FRN from the present manuscript, which may have been only of indirect value for our interpretation anyways (please see our response to the previous comment).

Please note that we took the minor mistake in the analysis of the FRN as an occasion to critically check all of our analysis, which showed however no further mistakes.

3. This may have been understandably lost in the long list of comments in the last round of review, but the authors did not respond to my request that they provide more information about the features that contribute to the decoding of the outcome and response – i.e. the particular channels and time points that are contributing to the decoding accuracy measures. While they have convincingly argued that EOG artifacts are unlikely to drive their results, I think it would still be valuable to also see which features are contributing to most to decoding.

We apologize for missing this aspect in our previous revision. We have now analysed which features are contributing the most to outcome and response decoding, respectively. More specifically, we have run a search-light RSA asking which features are most informative. These analyses revealed that a frontal-left cluster and a left parieto-occipital cluster were contributing the most to decoding outcome representations, whereas a centro-parietal cluster regions was contributing the most to decoding response representations. We have added this information to the manuscript on page 46, lines 1044 to 1049:

“In retrospect, a search-light RSA asking which features contribute the most to decoding neural outcome and response representations, revealed that at the time of average maximal decoding accuracy (about 200 ms after stimulus onset for both classifications), a parieto-occipital cluster was contributing the most to decoding outcome representations. In contrast, a centro-parietal cluster was contributing the most to decoding response representations (Figure 2 —figure supplement 4).”

We also added more information on the searchlight procedure in the methods section. Please see page 47, lines 1062 to 1068.

“Searchlight approach. To provide more insight into which electrophysiological information contributed most to differentiating between the respective categories (red vs. blue and objects vs. scenes, respectively), we also performed a searchlight analysis. This allowed us to determine the topographic features that discriminated most between the two categories. In order to implement the searchlight approach, a SVM was used. Again, we calculated the Wald interval with adjustments for a small sample size (Agresti and Caffo, 2000; Müller-Putz et al., 2008).”

4. This is a more minor point, but it would be helpful to have more information about how ICA was used to remove motor and eye-movement activity E.g. how many components were removed, how were they identified and how did the authors verify success of this preprocessing. The current one sentence mention of ICA is not very illuminating about the specific approach used for this study.

We agree and elaborate now on the ICA. Please see page 42, line 963 to page 43, line 977:

“In addition, blinks and eye movements were corrected by independent component analysis (infomax ICA, Noh et al., 2014). Using the automated procedure ADJUST (Mognon et al., 2011), ocular artifact-related components in EEG recordings we identified and subsequently removed. The ADJUST algorithm combines stereotypical artifact-specific spatial and temporal features to detect and differentiate artifact ICA components (Chaumon et al., 2015). For example, ADJUST computes the kurtosis of the event-related time course for frontal channels, since, for example, eye blinks are accompanied by abrupt amplitude jumps in frontal regions areas (stereotypical artifact-specific temporal feature). Additionally, ADJUST determines the spatial topography of the IC weights to compare the magnitude of the amplitudes between frontal and posterior areas (stereotypical artifact-specific spatial feature). Using the ADJUST procedure, on average 1.65 (SEM = 0.13) components per participant were detected and removed. Previous data shows that the automatic detection of artifact components using ADJUST, leads to a comparable classification of artifact components that is afforded by manual classification by experiments (Mognon et al., 2011).”

References

Adams CD. 1982. Variations in the sensitivity of instrumental responding to reinforcer devaluation. The Quarterly Journal of Experimental Psychology Section B 34:77–98.

doi: 10.1080/14640748208400878.

Agresti A, Caffo B. 2000. Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. American Statistician 54:280. doi: 10.2307/2685779.

Chaumon M, Bishop DVM, Busch NA. 2015. A practical guide to the selection of independent components of the electroencephalogram for artifact correction. Journal of Neuroscience Methods 250:47–63. doi: 10.1016/j.jneumeth.2015.02.025.

Dickinson A, Balleine B. 1994. Motivational control of goal-directed action. Animal Learning & Behavior 22:1–18. doi: 10.3758/BF03199951.

McNamee D, Liljeholm M, Zika O, O'Doherty JP. 2015. Characterizing the associative content of brain structures involved in habitual and goal-directed actions in humans: A multivariate FMRI study. Journal of Neuroscience 35:3764–3771. doi: 10.1523/JNEUROSCI.4677-14.2015.

Mognon A, Jovicich J, Bruzzone L, Buiatti M. 2011. ADJUST: An automatic EEG artifact detector based on the joint use of spatial and temporal features. Psychophysiology 48:229–240. doi: 10.1111/j.1469-8986.2010.01061.x.

Müller-Putz G, Scherer R, Brunner C, Leeb R, Pfurtscheller G. 2008. Better than random? A closer look on BCI results. International Journal of Bioelectromagnetism 10:52–55.

Noh E, Herzmann G, Curran T, Sa VR de. 2014. Using single-trial EEG to predict and analyze subsequent memory. NeuroImage 84:712–723. doi: 10.1016/j.neuroimage.2013.09.028.

Pool ER, Gera R, Fransen A, Perez OD, Cremer A, Aleksic M, Tanwisuth S, Quail S, Ceceli AO, Manfredi DA, Nave G, Tricomi E, Balleine B, Schonberg T, Schwabe L, O'Doherty JP. 2022. Determining the effects of training duration on the behavioral expression of habitual control in humans: a multilaboratory investigation. Learning & Memory 29:16–28. doi: 10.1101/lm.053413.121.

Tricomi E, Balleine BW, O'Doherty JP. 2009. A specific role for posterior dorsolateral striatum in human habit learning. European Journal of Neuroscience 29:2225–2232. doi: 10.1111/j.14609568.2009.06796.x.

Valentin VV, Dickinson A, O'Doherty JP. 2007. Determining the neural substrates of goal-directed learning in the human brain. Journal of Neuroscience 27:4019–4026.

doi: 10.1523/JNEUROSCI.0564-07.2007.

Wit S de, Corlett PR, Aitken MR, Dickinson A, Fletcher PC. 2009. Differential engagement of the ventromedial prefrontal cortex by goal-directed and habitual behavior toward food pictures in humans. Journal of Neuroscience 29:11330–11338. doi: 10.1523/JNEUROSCI.1639-09.2009.

Wit S de, Kindt M, Knot SL, Verhoeven AAC, Robbins TW, Gasull-Camos J, Evans M, Mirza H, Gillan CM. 2018. Shifting the balance between goals and habits: five failures in experimental habit induction. Journal of Experimental Psychology: General 147:1043–1065. doi: 10.1037/xge0000402.

Yeung N, Holroyd CB, Cohen JD. 2005. ERP correlates of feedback and reward processing in the presence and absence of response choice. Cerebral cortex (New York, N.Y.: 1991) 15:535–544. doi: 10.1093/cercor/bhh153.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Reviewer 2 is still confused about the intended, claimed, and actual role of overtraining. I think the easiest way to deal with this is to not talk about "extended" training (what does that mean after all – "extended" could mean an hour, days, or months) but instead characterize effects as training-dependent based on devaluation at the beginning, middle or end of a task. You can be clear that you are assessing training-dependent dynamics, but that the extent of training here is quite limited compared to that in the overtraining literature. In other words, training-dependent effects do not necessarily imply overtraining effects – please make that distinction and you should be good.

We thank the Editor for the very helpful advice. We now make the distinction between training-dependent effects and overtraining effects clearer and also omit the term “extended” training, as suggested. Instead, we talk now about effects at the beginning, middle or end of the task. Moreover, we make the distinction between training-dependent effects and overtraining effects explicit, as suggested.

Of even greater importance is the request by Reviewer 1, that you provide more details about the searchlight analysis.

We now provide more details on the searchlight analysis, as requested by the Reviewer. For details, please see our response to the Reviewer below.

Reviewer #2 (Recommendations for the authors):

The authors have improved the manuscript. I think it's a much better version than the previous one. They have deleted all those paragraphs that made their arguments and motivation for the experiment confusing.

I'm still super confused about their argument that they are not testing for overtraining effects. I thought overtraining was by definition the amount of training that produced habits. Are they saying in their reply that the effect of stress speeds up habit formation? What is their view on this? If their aim was not to test for "training extension" effects, why are they doing two devaluation manipulations?

I don't think is enough to add this paragraph in light of my comment:

"Although we did not aim to test training-induced changes in the balance of outcome and response processing per se, we included transient outcome devaluations after initial, moderate, and extended training in the instrumental learning task."

And then, a few paragraphs later, saying the following:

"To assess whether the balance of goal-directed and habitual behaviour and its modulation by stress depends on the extent of training, we presented devaluation blocks early during training, after moderate training, and again after extended training at the end of the task."

And, having a section called "Stress boosts response representations after extended training."

As a reader, I'm extremely confused about these apparent contradictions. I'd love to see in the introduction a more precise paragraph where their expectations are clearly mentioned.

This comment has been carrying over since the first review when I made the following comment:

"It was not clear to me what the main hypothesis of the study was. The authors seem to talk rather loosely about habits being developed after overtraining versus the mediating effect of stress on habits. The introduction should convey their main goal more clearly."

Sorry if I'm being too stupid, but it's not clear to me why they are using training extension and stress to test devaluation sensitivity and outcome/response representations if their aim was not to overtrain participants.

We included devaluations tests at the beginning, middle and end of the task to assess whether stress effects on the control of instrumental behaviour are training-dependent, as suggested by earlier rodent studies. It is, however, important to distinguish between training-dependent stress effects and overtraining effects per se. As suggested by the editors “training-dependent effects do not necessarily imply overtraining effects”. In order to make this aspect clearer and avoid confusion, we do not speak of “moderate” or “extended” training any more. Moreover, we state that we are interested in training-dependent dynamics of stress effects and explicitly distinguish training-dependent effects of stress from overtraining effects.

Please see, for example, page 6, lines 120 to 130:

Because previous rodent studies suggested that stress or stress hormone effects on the balance of goal-directed and habitual forms of learning are training-dependent (Dias-Ferreira et al., 2019; Packard, 1999; Siller-Pérez et al., 2017), we also assessed training-dependent dynamics in the stress effect on outcome and response processing. Although we did not aim to test overtraining-induced changes in the balance of outcome and response processing, for which the number of trials may have been too limited in the present study, we included transient outcome devaluations at the beginning, middle and end of the instrumental learning task to assess whether stress effects on instrumental behaviour are training-dependent.

Please see also the changes of the subheadings on pages 19 and 21:

“Stress reduces outcome representations at the end of training”

“Stress boosts response representations at the end of training”

And on page 27, line 203:

“Thus, training-dependent effects do not necessarily imply overtraining effects.”

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Supplementary file 1. Supplementary material.

    (a) Classification results (block × group interaction) block for alternatively grouped data. (b) Classification results on outcome representations for N=53. (c) Control variables: subjective mood, depressive mood, anxiety, and chronic stress. (d) Eye-tracking data: mean number of blinks and saccades across time, for Shigh and Slow trials (N=51).

    elife-67517-supp1.docx (27KB, docx)
    Transparent reporting form

    Data Availability Statement

    Data reported in this manuscript are available from the website: https://github.com/08122019/From-goal-directed-action-to-habit, copy archived at swh:1:rev:d52c6db7505b203310ec623db4fc07fede10eb9d.


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd

    RESOURCES