Published in final edited form as: J Cogn Neurosci. 2021 Dec 6;34(1):79–107. doi: 10.1162/jocn_a_01780

Learning at variable attentional load requires cooperation of working memory, meta-learning and attention-augmented reinforcement learning

Thilo Womelsdorf 1, Marcus R Watson 2, Paul Tiesinga 3
PMCID: PMC9830786  NIHMSID: NIHMS1858121  PMID: 34813644

Abstract

Flexible learning of changing reward contingencies can be realized with different strategies. A fast learning strategy involves using working memory of recently rewarded objects to guide choices. A slower learning strategy uses prediction errors to gradually update value expectations to improve choices. How the fast and slow strategies work together in scenarios with real-world stimulus complexity is not well known. Here, we aim to disentangle their relative contributions in rhesus monkeys while they learned the relevance of object features at variable attentional load. We found that learning behavior across six monkeys is consistently best predicted with a model combining (i) fast working memory, (ii) slower reinforcement learning from differently weighted positive and negative prediction errors, (iii) selective suppression of non-chosen feature values, and (iv) a meta-learning mechanism that enhances exploration rates based on a memory trace of recent errors. The optimal model parameter settings suggest that these mechanisms cooperate differently at low and high attentional loads. While working memory was essential for efficient learning at lower attentional loads, enhanced weighting of negative prediction errors and meta-learning were essential for efficient learning at higher attentional loads. Together, these findings pinpoint a canonical set of learning mechanisms and suggest how they may cooperate when subjects flexibly adjust to environments with variable real-world attentional demands.

Keywords: Cognitive flexibility, exploration-exploitation tradeoff, feature-based attention, nonhuman primate, reward prediction error

Introduction

Cognitive flexibility is realized through multiple mechanisms (Dajani & Uddin, 2015), including recognizing that environmental demands change, the rapid updating of expectations and the shifting of response strategies away from irrelevant towards newly relevant information. The combination of these processes is a computational challenge as they operate on different time scales ranging from slow integration of reward histories to faster updating of expected values given immediate reward experiences (Botvinick et al., 2019). How fast and slow learning processes cooperate to bring about efficient learning is not well understood.

Fast adaptation to changing reward contingencies depends on a fast learning mechanism. Previous studies suggest that such fast learning can be realized by different strategies. One strategy involves memorizing successful experiences in working memory (WM) and guiding future choices to those objects that have the highest expected reward value in working memory (Alexander & Brown, 2015; Alexander & Womelsdorf, 2021; A. G. Collins, Brown, Gold, Waltz, & Frank, 2014; A. G. Collins & Frank, 2012; McDougle & Collins, 2020; Viejo, Girard, Procyk, & Khamassi, 2018). This WM strategy is similar to recent ‘episodic’ learning models that store instances of episodes as a means to increase learning speed when similar episodes are encountered (Botvinick et al., 2019; Gershman & Daw, 2017).

A second fast learning mechanism uses an attentional strategy that enhances learning from those experiences that were selectively attended (Niv et al., 2015; Oemisch et al., 2019; Rombouts, Bohte, & Roelfsema, 2015). The advantage of this strategy is efficient sampling of values when there are many alternatives or uncertain reward feedback (Farashahi, Rowe, Aslami, Lee, & Soltani, 2017; Kruschke, 2011; Leong, Radulescu, Daniel, DeWoskin, & Niv, 2017). Empirically, such an attentional mechanism accounts for learning values of objects and features within complex multidimensional stimulus spaces (Hassani et al., 2017; Leong et al., 2017; Niv et al., 2015; Wilson & Niv, 2011). In these multidimensional spaces, learning from sampling all possible object instances can be impractical and slows down learning to a greater extent than what is observed in humans and monkeys (Farashahi, Rowe, et al., 2017; Oemisch et al., 2019). Instead, learners appear to speed up learning by learning more strongly from objects that are attended and actively chosen, while penalizing features associated with non-chosen objects (Hassani et al., 2017; Leong et al., 2017; Niv et al., 2015; Oemisch et al., 2019; Wilson & Niv, 2011).

In addition to WM and attention-based strategies, various findings indicate that learning can be critically enhanced by selectively increasing the rate of exploration during difficult or volatile learning stages (Khamassi, Quilodran, Enel, Dominey, & Procyk, 2015; Soltani & Izquierdo, 2019). Such a meta-learning strategy, for example, increases the rate of exploring options as opposed to exploiting previously learned value estimates (Tomov, Truong, Hundia, & Gershman, 2020). This and other meta-learning approaches have been successfully used to account for learning rewarded object locations in monkeys (Khamassi et al., 2015) and for speeding up learning of multi-arm bandit problems (Wang et al., 2018).

There is evidence for all three proposed strategies in learning, but only a few empirical studies have characterized the contribution of different learning strategies. Thus, it is unknown whether working memory, attention-augmented reinforcement learning (RL), and meta-learning approaches are all used during learning in differently complex environments and whether they cooperate differently at low and high learning difficulty.

To address this issue, we set out to test and disentangle the specific contributions of various computational mechanisms for flexibly learning the relevance of visual object features. We trained six monkeys to learn the reward value of object features in environments with varying numbers of irrelevant, distracting feature dimensions. By increasing the number of distracting features we increased attentional load, which resulted in successively slower learning behavior. We found that across monkeys, learning speed was best predicted by a computational RL model that combines working memory, attention-augmented RL, a separate learning rate for erroneous choices, and meta-learning. The optimal model parameter settings, which account for a significant fraction of the observed choices, suggest that the contributions of these individual learning mechanisms varied systematically with attentional load. WM contributed to learning speed particularly at low and medium loads, enhanced weighting of negative prediction errors contributed at medium and high loads, meta-learning contributed maximally at high load, while selective decay of non-attended feature values was an essential learning mechanism across all attentional loads.

Methods

Experimental Design.

Six male macaque monkeys, aged 6–9 years and weighing 8.5–14.4 kg, performed the experiments. All animal and experimental procedures were in accordance with the National Institutes of Health Guide for the Care and Use of Laboratory Animals and the Society for Neuroscience Guidelines and Policies, and were approved by the Vanderbilt University Institutional Animal Care and Use Committee.

The experiment was controlled by USE (Unified Suite for Experiments), which uses the Unity 3D game engine for behavioral control and visual display (Watson, Voloh, Thomas, Hasan, & Womelsdorf, 2019). Four animals performed the experiment in a cage-based touchscreen Kiosk Testing Station described in (Womelsdorf et al., 2021), while two animals performed the experiment in a sound-attenuating experimental booth. All experiments used 3-dimensionally rendered objects, so-called Quaddles (Watson, Voloh, Naghizadeh, & Womelsdorf, 2019), that were defined by their body shape, arm style, surface pattern, and color (Fig. 1A). We used up to nine possible body shapes, six possible colors, eleven possible arm types, and nine possible surface patterns as feature values. The six colors were equidistant within the perceptually defined CIELAB color space. Objects extended ~3 cm on the screen, corresponding to ~2.5° of visual angle, and were presented either on a 24” BenQ monitor or an Elo 2094L 19.5” LCD touchscreen running at a 60 Hz refresh rate with 1920 × 1080 pixel resolution.

Figure 1. Task paradigm and feature space.


(A) The task used 3D-rendered ‘Quaddle’ objects that varied in color, pattern, shape, and arm style. The features grey color, straight arms, and spherical body shape were never rewarded in any of the experiments and therefore constitute ‘neutral’ features. (B) For successive blocks of 30–60 trials a single feature was rewarded. (C) The attentional load conditions differed in the number of non-neutral feature dimensions that varied across trials in a block. Blocks with 1-D, 2-D, and 3-D objects contained stimuli varying in 1, 2, and 3 feature dimensions, respectively. (D) Trials were initiated by touching or fixating a central stimulus. Three objects were shown at random locations and subjects had to choose one by either touching it (four monkeys) or fixating it (two monkeys) for ≥0.7 sec. Visual feedback indicated correct (yellow) vs. error (cyan, not shown) outcomes. Fluid reward followed correct outcomes. (E) Sequences of three example trials for a block with 1-D objects (upper row, shape varied), 2-D objects (middle row, color and arms varied), and 3-D objects (bottom row, body shape, arms, and color varied). (F) Same as E but for an object set varying surface pattern, arms, and color.

Task paradigm.

Animals performed a feature-reward learning task that required learning through trial-and-error which feature of multidimensional objects is associated with reward. The rewarded feature, i.e. the feature-reward rule, stayed constant for blocks of 35–60 trials and then switched randomly to another feature (Fig. 1B). Individual trials (Fig. 1D) were initiated by either touching a central blue square (four monkeys) or fixating the blue square for 0.5 sec (two monkeys). Following a 0.3 sec delay, three objects were presented at the corners of a virtual square grid spanning 15 cm on the screen (~24°). The animals had up to 5 sec to choose one object by touching it for 0.1 sec (four monkeys) or maintaining gaze on it for 0.7 sec (two monkeys). Following the choice of an object, visual feedback was provided as a colored disk behind the selected object (yellow/grey for rewarded/non-rewarded choices, respectively) concomitant with auditory feedback (low/high pitched sound for non-rewarded/rewarded choices, respectively). Choices of the object with the rewarded feature resulted in fluid reward 0.3 sec after the onset of the visual and auditory feedback.

For each learning block a unique set of objects was selected that varied in one, two, or three feature dimensions from trial to trial. The non-varying features were always either a spherical body shape, straight arms with blunt endings, grey color, or a uniform surface. These feature values were never associated with reward during the experiment and thus represent reward-neutral features. These neutral features defined a neutral object to which we added either one, two, or three non-neutral feature values, rendering the objects 1-, 2-, and 3-dimensional (Fig. 1C). For blocks with objects that varied in one feature dimension (1D attentional load condition), three feature values from that dimension were chosen at the beginning of the block (e.g. body shapes that were oblong, pyramidal, and cubic). One of these features was associated with reward while the two remaining features were not reward-associated and thus served as distracting features. Within individual trials the three objects never had the same feature values in this dimension, as illustrated for three successive example trials in Fig. 1E,F (upper row). The feature values of the unused dimensions were those of the neutral object in all trials of that block. For blocks with objects varying in two feature dimensions (2D attentional load condition), a set of three feature values per dimension was selected, yielding nine unique objects combining these features. Only one of the features was associated with reward, while the other two feature values of that dimension and the feature values of the other dimension were not linked to reward. Fig. 1E,F (middle row) illustrates three example trials of these blocks. For blocks with objects varying in three feature dimensions (3D attentional load condition), three feature values per dimension were selected so that the three presented objects always had different features of that dimension, which led to twenty-seven unique objects combining these features. Again, only one feature was associated with reward in a given block while all other feature values were not linked to reward.

Blocks with objects that varied in 1, 2 and 3 feature dimensions constitute 1D, 2D, and 3D attentional load conditions because they vary the number of feature dimensions that define the search space when learning which feature is rewarded. The specific dimension, feature value, and the dimensionality of the learning problem varied pseudo-randomly from block to block. During individual experimental sessions, monkeys performed up to 30 learning blocks.

Gaze control.

For two animals gaze was monitored with a Tobii Spectrum binocular infrared eye-tracker with a 600 Hz sampling rate. For these animals the experimental session began with a 9-point eye-tracker calibration routine, and object fixations were later reconstructed using a robust gaze classification algorithm described in (Voloh, Watson, Koenig, & Womelsdorf, 2020).

Statistical analysis.

All analysis was performed with custom MATLAB code (Mathworks, Natick, MA). Significance tests control for the false discovery rate (FDR) with an alpha value of 0.05 to account for multiple comparisons (Benjamini & Hochberg, 1995).

General formulation of Rescorla-Wagner reinforcement learning models.

The value of feature i in trial t, before the outcome is known, is denoted by V^F_{i,t}. The superscript F stands for feature, in order to distinguish it from the value of an object that will be introduced in the next section. The new value V^F_{i,t+1}, available for decisions on the next trial, depends on which features were present in the chosen object at trial t, and whether this choice was rewarded (R_t = 1) or not (R_t = 0). The values of features that were present in non-chosen objects, and of features that could appear in the course of the session but were not present on the current trial, decay with a parameter ω^{RL}, where the superscript RL denotes that this decay belongs to the reinforcement learning component of the model, as opposed to the decay of the working memory component introduced below. The features that were present in the chosen and rewarded object increase in value, because the reward prediction error R_t − V^F_{i,t} is positive, whereas when the chosen object was not rewarded, the value decays. We have summarized these update rules in the following equations:

V^F_{i,t+1} = V^F_{i,t} + \eta_t \, f^{A,V}_{i,t} \, (R_t - V^F_{i,t})         features of the chosen object (eq. 3)
V^F_{i,t+1} = (1 - \omega^{RL}_{nc}) \, V^F_{i,t}        features of non-chosen objects (eq. 4)
V^F_{i,t+1} = (1 - \omega^{RL}_{np}) \, V^F_{i,t}       non-presented features (eq. 5)

The factor f^{A,V}_{i,t} is explained further down. We have indicated a trial-dependence in the gain η and allow the decay parameter ω to depend on whether the feature was present in a non-chosen object (nc) or whether it was part of the stimulus set of the session but not presented (np) in the current trial. It further carries a superscript RL to indicate it is part of the reinforcement learning formulation rather than working memory (superscript WM). The setting of these parameters depends on the specific model version. In the base RL model there is no feature-value decay, ω_{nc,t} = ω_{np,t} = 0, and the gain is constant and equal to η. In the next model, ‘RL gain and loss’, the gain depends on whether the choice was rewarded (gain) or not rewarded (loss), η_t = η_{Gain} R_t + η_{Loss} (1 − R_t), which introduces two new parameters η_{Gain} and η_{Loss} for rewarded and non-rewarded choices, respectively.
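To make the update concrete, the following Python sketch implements eqs. 3–5 under the ‘RL gain and loss’ variant. It is an illustration rather than the authors’ MATLAB code; the function and parameter names (e.g. update_feature_values, eta_gain) are our own.

```python
# Sketch of the feature-value updates (eqs. 3-5), assuming feature values are
# stored in a NumPy array and objects are lists of their feature indices.
import numpy as np

def update_feature_values(V, chosen, presented, reward,
                          eta_gain=0.2, eta_loss=0.4,
                          omega_nc=0.1, omega_np=0.1, attn_gain=None):
    """V: array of feature values; chosen/presented: lists of feature indices."""
    V = V.copy()
    eta = eta_gain if reward == 1 else eta_loss        # 'RL gain and loss' variant
    gain = np.ones_like(V) if attn_gain is None else attn_gain
    for i in chosen:                                   # eq. 3: features of the chosen object
        V[i] += eta * gain[i] * (reward - V[i])
    for i in set(presented) - set(chosen):             # eq. 4: features of non-chosen objects
        V[i] *= (1.0 - omega_nc)
    for i in set(range(len(V))) - set(presented):      # eq. 5: non-presented features
        V[i] *= (1.0 - omega_np)
    return V
```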

In most models the decay for non-chosen and non-presented features was equal, ω_{nc,t} = ω_{np,t} = ω^{RL}, introducing only a single additional parameter. In the so-called hybrid models, we add a feature-dimension gain factor f^{A,V}_{i,t}, which reflects attention to a particular dimension. It is calculated using a Bayesian model (see below) and is indicated by a superscript V because it affects the value update. When this factor is absent, the model acts as if information about the role of a certain dimension in the acquisition of the reward is not available. The choice probability p^{RL}_{i,t} for object i at trial t is determined using a softmax function:

p^{RL}_{i,t} = \frac{\exp\left(\beta_t \sum_{j \in O_i} f^{A,CP}_{j,t} V^F_{j,t}\right)}{\sum_k \exp\left(\beta_t \sum_{j \in O_k} f^{A,CP}_{j,t} V^F_{j,t}\right)}    (eq. 6)

The sum in the exponent of the preceding expression is over the features j that are part of object i, which defines the set O_i. The factor β_t in the exponent determines the extent to which the subject exploits, i.e. systematically chooses the object with the highest compound value (reflected in a large β), or explores, i.e. makes choices independent of the compound value (reflected in a small β). In most model versions β did not change over trials (parameter β^{RL}), while in the meta-learning models with adaptive exploration its value was adaptive, reflecting the history of reward outcomes, and thus trial dependent (see the following subsection). The factor f^{A,CP}_{i,t} is a feature-dimension gain factor that acts only in the choice probability; it reflects that some dimensions do not contribute to the choice probability when they are deemed irrelevant (not attended).
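As an illustration of eq. 6, a minimal Python sketch of the object-level softmax is given below, assuming feature values are stored in a NumPy array and each object is represented by the indices of its features; all names are illustrative.

```python
# Sketch of the softmax choice probability over objects (eq. 6): the value of an
# object is the (attention-weighted) sum of its feature values.
import numpy as np

def choice_probabilities(V, objects, beta, attn_cp=None):
    """objects: list of lists of feature indices; V: array of feature values."""
    attn_cp = np.ones_like(V) if attn_cp is None else attn_cp
    compound = np.array([np.sum(attn_cp[obj] * V[obj]) for obj in objects])
    z = beta * compound
    z -= z.max()                      # numerical stabilization (does not change the result)
    p = np.exp(z)
    return p / p.sum()
```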

Adaptive Exploration.

For models with adaptive βt values we follow the model of (Khamassi, Enel, Dominey, & Procyk, 2013), which involves determining an error trace:

\beta^*_{t+1} = \beta^*_t + \alpha^{+} \max(\delta_t, 0) + \alpha^{-} \min(\delta_t, 0)    (eq. 7)

where the min and max functions are used to select the negative and positive part, respectively, of an estimate of the reward prediction error,

\delta_t = R_t - \frac{1}{\#(j \in O_i)} \sum_{j \in O_i} V^F_{j,t}    (eq. 8)

This is a different form of the PE than above, because here we need to consider all features in a chosen object, rather than each feature separately. The error trace is translated into an actual β value using

\beta_t = \frac{\beta_m}{1 + \exp\left(\omega_1 (\beta^*_t - \omega_2)\right)}    (eq. 9)

This adaptive component replaces one parameter with five new parameters: α⁺, α⁻, β_m, ω_1, and ω_2. In most models we fixed four of them to values that came out of pilot parameter explorations, α⁺ = −0.6, α⁻ = −0.4, ω_1 = −6, ω_2 = 0.5, and varied β_m and sometimes also α⁻.
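A hedged Python sketch of eqs. 7–9 follows; it uses the fixed pilot values quoted above, and the helper names (update_error_trace, beta_from_trace) are ours, not the authors’.

```python
# Sketch of the adaptive-exploration mechanism: an error trace beta* is nudged by
# the positive and negative parts of the prediction error (eq. 7) and squashed
# into the trial-wise softmax beta (eq. 9).
import numpy as np

ALPHA_POS, ALPHA_NEG = -0.6, -0.4      # fixed in most models (pilot explorations)
OMEGA1, OMEGA2 = -6.0, 0.5

def object_prediction_error(reward, V, chosen_features):
    # eq. 8: PE relative to the mean value of the chosen object's features
    return reward - np.mean(V[chosen_features])

def update_error_trace(beta_star, delta):
    # eq. 7: positive and negative PE parts are weighted separately
    return beta_star + ALPHA_POS * max(delta, 0.0) + ALPHA_NEG * min(delta, 0.0)

def beta_from_trace(beta_star, beta_m):
    # eq. 9: translate the error trace into an exploration/exploitation parameter
    return beta_m / (1.0 + np.exp(OMEGA1 * (beta_star - OMEGA2)))
```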

Attentional dimension weight.

The attentional gain factors (f^{A,V}_{i,t} and f^{A,CP}_{i,t}) use a Bayesian estimate of what the target feature f is, and hence what the relevant feature dimension is, and weigh the contribution of each feature value according to whether it is part of the target dimension (Hassani et al., 2017; Niv et al., 2015; Oemisch et al., 2019). From the target feature probability p(f | D_{1:t}) (see Equations 5 to 7 in (Hassani et al., 2017) for the derivation of the equation we use to update this probability from trial to trial) we can obtain a target dimension probability by summing over all the feature values f(d) that belong to a particular dimension d,

p^D_{d,t} = p(d \mid \mathcal{D}_{1:t}) = \sum_{f \in f(d)} p(f \mid \mathcal{D}_{1:t})    (eq. 10)

This is turned into a feature gain

\phi^A_{d,t} = \frac{(p^D_{d,t})^{\alpha}}{\sum_e (p^D_{e,t})^{\alpha}}    (eq. 11)

which weighs feature values in each object according to their dimension d(f). For an object i this yields V_{i,t} = \sum_{j \in O_i} \phi^A_{d(j),t} V^F_{j,t}, and we incorporate it as a feature-dependent factor f^A_{i,t} = \phi^A_{d(i),t} in the relevant expressions (eqs. 3 and 6, for the value update and the choice probability), indicated with the additional superscripts V and CP, respectively.
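The sketch below illustrates eqs. 10–11 in Python, assuming the posterior over target features p(f | D_1:t) has already been computed (its update follows Hassani et al., 2017, and is not reproduced here); the name dimension_gain and the array layout are our own assumptions.

```python
# Sketch of the attentional dimension weight: dimension probabilities are obtained
# by summing feature probabilities (eq. 10), sharpened with exponent alpha and
# normalized (eq. 11), then mapped back to a per-feature gain.
import numpy as np

def dimension_gain(p_feature, dim_of_feature, alpha):
    """p_feature: posterior over features; dim_of_feature: dimension id per feature."""
    dims = np.unique(dim_of_feature)                         # sorted dimension ids
    p_dim = np.array([p_feature[dim_of_feature == d].sum() for d in dims])   # eq. 10
    phi = p_dim ** alpha
    phi /= phi.sum()                                         # eq. 11
    # each feature inherits the gain of the dimension it belongs to
    return phi[np.searchsorted(dims, dim_of_feature)]
```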

Stickiness in choice probability.

Stickiness in choosing objects refers to choosing the object whose feature values overlap with the previously chosen one and represents perseveration (Balcarras, Ardid, Kaping, Everling, & Womelsdorf, 2016). It is implemented by making the choice probability dependent on whether a feature of the previously chosen object is present in an object on the current trial.

p^S_i = \frac{\exp\left(\beta_t \sum_{j \in O_i} f^{A,CP}_{j,t} V^F_{j,t}\right) + \Delta_{t-1,i}}{\sum_k \left[\exp\left(\beta_t \sum_{j \in O_k} f^{A,CP}_{j,t} V^F_{j,t}\right) + \Delta_{t-1,k}\right]}    (eq. 12)

Here Δ_{t−1,i} is equal to e^γ − 1 when object i presented on trial t contains at least one feature that was also present in the chosen object on the previous trial (t − 1), and zero otherwise. By subtracting one, we ensure that when γ = 0 there is no stickiness contribution to the choice. In our setup it is possible that more than one of the current objects contains features that were present in the previously chosen object.
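A minimal Python sketch of eq. 12 (without the attentional gain, for brevity); the additive bonus and the zero default for non-overlapping objects follow the description above, and all names are illustrative.

```python
# Sketch of the stickiness term: objects sharing a feature with the previously
# chosen object receive an additive bonus exp(gamma) - 1 before normalization.
import numpy as np

def sticky_choice_probabilities(V, objects, beta, prev_chosen_features, gamma):
    compound = np.array([V[obj].sum() for obj in objects])
    bonus = np.array([(np.exp(gamma) - 1.0)
                      if set(obj) & set(prev_chosen_features) else 0.0
                      for obj in objects])
    numer = np.exp(beta * compound) + bonus
    return numer / numer.sum()
```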

Combined working memory / reinforcement learning models.

Working memory models are formulated in terms of the value V^{WM}_{i,t} of an object i, irrespective of what features are present in it (A. G. Collins & Frank, 2012). These values are initialized to a non-informative value of 1/n_o, where n_o is the number of objects. When each of the objects has this value there is no preference for choosing one above the other. When an object is chosen on trial t, its value is set to V^{WM}_{i,t+1} = 1 when rewarded, whereas it is reset to the original value V^{WM}_{i,t+1} = 1/n_o when the choice was not rewarded. All other values decay towards the original value with a decay parameter ω^{WM}:

V^{WM}_{i,t+1} = V^{WM}_{i,t} - \omega^{WM} \left(V^{WM}_{i,t} - \frac{1}{n_o}\right)    (eq. 13)

The values are then directly used in the choice probabilities (also denoted pChoice):

p^{WM}_{i,t} = \frac{\exp(\beta^{WM} V^{WM}_{i,t})}{\sum_j \exp(\beta^{WM} V^{WM}_{j,t})}    (eq. 14)

This component mechanism thus introduces two new parameters, a decay parameter ω^{WM} and the softmax parameter β^{WM}, which are separately varied in the fitting procedure.
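The following Python sketch illustrates the working-memory component (eqs. 13–14) as described above; the function names and the use of object indices are our own simplifications.

```python
# Sketch of the WM module: rewarded chosen objects are set to 1, unrewarded chosen
# objects reset to 1/n_o, and all other values decay back towards 1/n_o.
import numpy as np

def update_wm(V_wm, chosen_obj, reward, omega_wm):
    n_o = len(V_wm)
    V_wm = V_wm - omega_wm * (V_wm - 1.0 / n_o)   # eq. 13 decay (chosen object overwritten below)
    V_wm[chosen_obj] = 1.0 if reward == 1 else 1.0 / n_o
    return V_wm

def wm_choice_probabilities(V_wm, presented_objs, beta_wm):
    z = beta_wm * V_wm[presented_objs]            # eq. 14: softmax over presented objects
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()
```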

Integrating choice probabilities.

In the most comprehensive models, choices are determined by a weighted combination of the choice probabilities derived from the RL and WM components, referred to as p^T_{i,t} (T stands for total),

p^T_{i,t} = w_t \, p^{WM}_{i,t} + (1 - w_t) \, p^{RL}_{i,t}    (eq. 15)

A larger w_t means more weight for the WM predictions in the total choice probability. The update of w_t reflects the value of the choice probability for the choice made and the capacity limitations of the working memory:

w_{t+1} = \frac{w_t \, p^{WMC}_t}{w_t \, p^{WMC}_t + (1 - w_t) \, p^{RLC}_t}    (eq. 16)

where

p^{RLC}_t = p^{RL}_{a(t),t} \, r_t + (1 - p^{RL}_{a(t),t})(1 - r_t).

This expression selects between two possible values for p^{RLC}_t depending on whether r_t = 1 or 0. Here a(t) is the index of the object chosen on trial t. In addition,

p^{WMC}_t = \alpha \left(p^{WM}_{a(t),t} \, r_t + (1 - p^{WM}_{a(t),t})(1 - r_t)\right) + (1 - \alpha) \frac{1}{n_o}    (eq. 17)

where α = min(1, C^{WM}/n_S), C^{WM} is the working memory capacity, essentially the number of objects about which information can be accessed, and n_S is the number of objects that can be presented during the task. It is determined as the number of objects whose value V^{WM}_{i,t} exceeds 1/n_o by a margin of 0.01. When n_S is much larger than C^{WM}, the information in p^{WM}, which is unlimited in capacity but decays with time, cannot be read out; instead p^{WMC}_t = 1/n_o. Hence, when p^{RL}_{a(t),t} exceeds 1/n_o it will win the competition for influence and reduce w_t towards zero, and with that the influence of WM via p^{WM}_{i,t}.
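To make the integration concrete, here is a Python sketch of eqs. 15–17; the function signature and variable names are ours, and the capacity-dependent discount follows the description above.

```python
# Sketch of the WM/RL mixture: the behavioral choice probability is a weighted sum
# of the two modules (eq. 15), and the weight w is updated like a responsibility,
# with WM reliability discounted by its capacity (eqs. 16-17).
def combine_and_update(p_wm, p_rl, w, chosen_idx, reward, c_wm, n_s, n_o):
    p_total = w * p_wm + (1.0 - w) * p_rl                            # eq. 15
    r = 1.0 if reward == 1 else 0.0
    p_rl_c = p_rl[chosen_idx] * r + (1.0 - p_rl[chosen_idx]) * (1.0 - r)
    alpha = min(1.0, c_wm / n_s)                                     # capacity discount
    p_wm_c = (alpha * (p_wm[chosen_idx] * r + (1.0 - p_wm[chosen_idx]) * (1.0 - r))
              + (1.0 - alpha) * (1.0 / n_o))                         # eq. 17
    w_new = (w * p_wm_c) / (w * p_wm_c + (1.0 - w) * p_rl_c)         # eq. 16
    return p_total, w_new
```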

Posterior predictive checks and model identifiability.

Our overall focus was to evaluate how fitted parameters vary with task condition and across subjects. For this it is necessary that the parameters have a clear meaning, that they are reproducible (model identifiability), and that the objective function value correlates with the degree to which model choices match the subjects’ choices (a kind of posterior predictive test). To assess this we performed a number of validation analyses on the model that we had found to best fit the subjects’ choices (the top-ranked model in Fig. 3A, model 1, Table 1). This model was characterized by eight fitting parameters, which could present a challenge to fitting procedures.

Figure 3. Rank ordering of models with different combinations of mechanisms.


(A) Models (rows) using different combinations of model mechanisms (columns) are rank ordered according to their Bayesian Information Criterion (BIC). The top-ranked model combined four mechanisms that are highlighted in red: decay of non-chosen features, working memory, adaptive exploration rate, and a separate learning gain for errors (losses). The 2nd, 3rd, and 4th ranked models are denoted with cyan, green, and yellow bars. A thick horizontal bar indicates that the model mechanism was used in that model. The 26th ranked model was the base RL model that used only a beta softmax parameter and a learning rate. Right: model rank on average (1st column) and for each individual monkey (columns 2 to 7). See Table 1 for the same table in numerical format with additional information about the number of free parameters for each model. (B) After subtracting the BIC of the 1st ranked model, the normalized BICs for each monkey confirm that the top-ranked model has low BIC values for every monkey. (C) Average behavioral learning curves for the individual monkeys (left) and the simulated choice probabilities of the top-ranked model for each monkey. The simulated learning curves are similar to the monkey learning curves, providing face validity for the model.

Table 1.

Overview of the parameters used in the models that were evaluated and ranked according to BIC, down to the base RL model (which is the model ranked 26th). See Figure 3 of the main text for a graphical illustration of the model rank ordering.

Model Rank | RL gain (η) | RL decay (ω) | RL softmax (β) | RL atn (α) | WM decay (ω) | WM softmax (β) | WM capacity (C^WM) | stickiness (γ) | #para
1 | η_Gain, η_Loss | ω^RL | β_m, α⁺ (ω_1, ω_2 fix) | - | ω^WM | β^WM | C^WM | - | 8
2 | η_Gain, η_Loss | ω^RL | β_m (α⁺, ω_1, ω_2 fix) | - | ω^WM | β^WM | C^WM | γ | 8
3 | η_Gain, η_Loss | ω^RL | β^RL | - | ω^WM | β^WM | C^WM | γ | 8
4 | η_Gain, η_Loss | ω^RL | β_m (α⁺, ω_1, ω_2 fix) | - | ω^WM | β^WM | C^WM | - | 7
5 | η_Gain, η_Loss | ω^RL | β^RL | - | ω^WM | β^WM | C^WM | - | 7
6 | η | ω^RL | β_m (α⁺, ω_1, ω_2 fix) | - | ω^WM | β^WM | C^WM | - | 7
7 | η | ω^RL | β^RL | - | ω^WM | β^WM | C^WM | - | 6
8 | η | ω^RL | β_m (α⁺, ω_1, ω_2 fix) | - | ω^WM | β^WM | C^WM | - | 6
9 | η | ω^RL | β_m, α⁺ (ω_1, ω_2 fix) | - | ω^WM | β^WM | C^WM | γ | 7
10 | η_Gain, η_Loss | ω^RL | β_m, α⁺ (ω_1, ω_2 fix) | - | - | - | - | - | 5
11 | η_Gain, η_Loss | ω^RL | β_m (α⁺, ω_1, ω_2 fix) | - | - | - | - | γ | 5
12 | η_Gain, η_Loss | ω^RL | β^RL | - | - | - | - | γ | 5
13 | η | ω^RL | β_m (α⁺, ω_1, ω_2 fix) | - | - | - | - | - | 3
14 | η | ω^RL | β^RL | - | - | - | - | γ | 4
15 | η | ω^RL | β^RL | - | - | - | - | - | 3
16 | η | ω^RL | β^RL | α | ω^WM | β^WM | C^WM | - | 7
17 | η_Gain, η_Loss | ω^RL | β^RL | - | - | - | - | - | 4
18 | η_Gain, η_Loss | ω^RL | β^RL | α | ω^WM | β^WM | C^WM | - | 8
19 | η | - | β_m (α⁺, ω_1, ω_2 fix) | - | ω^WM | β^WM | C^WM | - | 5
20 | η | ω^RL | β^RL | α | ω^WM | β^WM | C^WM | γ | 8
21 | η | - | β^RL | - | ω^WM | β^WM | C^WM | - | 5
22 | η | - | β^RL | - | ω^WM | β^WM | C^WM | γ | 6
23 | η_Gain, η_Loss | - | β^RL | - | ω^WM | β^WM | C^WM | - | 6
24 | η | - | β^RL | - | - | - | - | γ | 3
25 | η_Gain, η_Loss | - | β^RL | - | - | - | - | - | 3
26 | η | - | β^RL | - | - | - | - | - | 2

Our objective function was the negative log likelihood (NLL), which was minimized through a call to the MATLAB function fminsearch followed by a call to fmincon. We did this for multiple different initial conditions and found a few distinct solutions, each converged to from multiple different initial values and each corresponding to slightly different values of the objective function. These differences were typically smaller than the differences between different models. This shows that the algorithm can get stuck in local minima. Note that this did not occur for models with fewer parameters.
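For readers who want to reproduce the general scheme outside MATLAB, a rough Python analogue of this two-stage optimization (a derivative-free search refined by a bounded optimizer, restarted from several initial conditions) might look as follows; it is not the authors’ code and the function names are placeholders.

```python
# Sketch of a two-stage NLL minimization with restarts, loosely mirroring
# fminsearch (Nelder-Mead) followed by fmincon (here: bounded L-BFGS-B).
from scipy.optimize import minimize

def fit_model(nll, x0_list, bounds):
    """nll: function mapping a parameter vector to the negative log likelihood."""
    best = None
    for x0 in x0_list:
        stage1 = minimize(nll, x0, method="Nelder-Mead")
        stage2 = minimize(nll, stage1.x, method="L-BFGS-B", bounds=bounds)
        if best is None or stage2.fun < best.fun:
            best = stage2
    return best
```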

We evaluated the behavioral performance associated with each of the resulting parameter sets by generating choice sequences, sampling randomly according to the model-generated choice probabilities and repeating the sampling multiple times for each unique parameter set derived from an initial condition (see Appendix; Appendix Fig. 1). The posterior predictive performance was characterized by the mean reward and the overlap between the model’s and the subject’s choices, quantified as the fraction of common choices. These performance measures are different from the NLL value that we minimized: for the NLL we took the subject’s choices and the received rewards to update the (feature) values in the model, determined the model’s choice probability for each trial, and summed the negative logs of these probabilities into the objective function. In contrast, here we generate choices, often different from those of the subject, because we sample from the model’s choice probabilities at the fitted parameter values, updated trial-to-trial with the model’s own choices and the corresponding rewards according to the task rule. The resulting choice-reward sequence is also different from the measured one, but it is the one used to update the model’s feature values across trials. The model is therefore likely to generate different choices across blocks than the subject, even when using the exact same stimulus sequence. The model can be considered good when the experimentally observed sequence cannot be distinguished from the distribution of model-generated choices; this can occur for rather low values of overlap, when the choice probabilities are much smaller than one.
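A bare-bones Python sketch of this posterior predictive check, assuming a hypothetical choice_prob_fn that returns the model’s choice probabilities on each trial given the simulated history:

```python
# Sketch of the posterior predictive check: sample choices from the model's own
# choice probabilities (so its values are updated with its own choices) and
# measure the fraction of trials on which model and subject chose the same object.
import numpy as np

def simulate_overlap(choice_prob_fn, n_trials, subject_choices, rng=None):
    """choice_prob_fn(t, history) -> probabilities over the presented objects on trial t."""
    rng = np.random.default_rng() if rng is None else rng
    history, model_choices = [], []
    for t in range(n_trials):
        p = choice_prob_fn(t, history)
        choice = rng.choice(len(p), p=p)
        history.append(choice)                 # feeds back into the model's updates
        model_choices.append(choice)
    return np.mean(np.array(model_choices) == np.array(subject_choices))
```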

We find that the lower the NLL, the higher the overlap in choices (Appendix Fig. 1A). This means that the NLL is a useful indicator of the quality of the fit. For the exploratory model runs the overlap in choices is low, but it does exceed 0.5. When we consider the same distribution of choices across options but made randomly, which we accomplished by randomly permuting the model’s choices, the overlap is reduced by about 0.04 (8%, Appendix Fig. 1B), ending up below 0.5. Note that this is higher than expected based on purely random choices with equal probability for each option, since both the subject and the model make the correct choice more often than chance and we labeled the correct choice as option 1. The upshot is that a lower NLL leads to more overlap in choices and is therefore a good basis for comparing models.

A typical issue when fitting functions with many parameters is shallow optima, in which similar values of the objective function are found when covarying two or more parameters. This could affect the identifiability of the model, since a given parameter setting would generate model choices that would be optimally fit by quite different parameter values. As mentioned in the preceding text, we ran the optimization procedure for multiple initial conditions and then generated multiple choice sequences for each of them. We fitted each of these sequences again. This gave us a multi-dimensional distribution of parameter values (Appendix Fig. 1C). We found that three parameters displayed a larger range of values than the rest. Specifically, for some initial conditions they produced outlier values that were close to the upper-bound constraints we imposed on the optimization with fmincon; these outlier values were also present in the fits of the model-generated sequences. The most affected was the beta parameter of the working memory component (β^WM), as well as the working memory capacity (C^WM), and to a lesser extent the (adaptive) beta parameter of the RL component (β_m). The underlying cause was the covariation between C^WM and β^WM, which we observed as a clear correlation between these two fitting parameters across the refitted parameter values (Appendix Fig. 1D). We addressed this effect by using a cross-validation procedure (see below) to deal with the outlier parameter values.

Model comparison and cross-validation.

To compare models, we calculated the log likelihood (normalized by the number of choices) of each model fit to the choices of the monkeys and computed the Bayesian Information Criterion (BIC) for each model, which penalizes models according to their number of free parameters. We rank ordered the models by BIC to identify the model most predictive of the monkeys’ choices. We used a cross-validation procedure to verify that the best-fitting models do not overfit the data. For the cross-validation we evaluated how well a model predicted, in terms of the NLL, the subjects’ choices in held-out (test) learning blocks when the model parameters were fit on the remaining (training) blocks. We repeated the cross-validation 50 times and used the average parameter values across these 50 cross-validation runs to simulate the choices of the monkey. For each cross-validation run we cut the entire data set at two randomly chosen blocks, yielding three parts. The two largest parts were assigned as training and test set. We did this to keep the trials in the same order as the monkey performed them, because the memory-dependent effects in the model (and presumably in the monkey) extend beyond block boundaries. This differs from the standard procedure, in which blocks would be randomly assigned to test and training sets, breaking the block ordering that is important for the model. In general, the cross-validation results were qualitatively similar to the results obtained by optimizing on the entire dataset and gave a near-identical rank ordering of the models with identical top-ranked models. This finding rules out overfitting.
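A small Python sketch of the order-preserving split described above; the two cut points and the assignment of the two largest contiguous parts to training and test sets follow the text, while the function name and return convention are ours.

```python
# Sketch of the block-preserving cross-validation split: cut the ordered block
# sequence at two random positions and use the two largest contiguous pieces as
# training and test sets, keeping trial order intact.
import numpy as np

def split_blocks(n_blocks, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    cut1, cut2 = sorted(rng.choice(np.arange(1, n_blocks), size=2, replace=False))
    parts = [np.arange(0, cut1), np.arange(cut1, cut2), np.arange(cut2, n_blocks)]
    parts.sort(key=len, reverse=True)          # largest parts first
    train, test = parts[0], parts[1]
    return train, test
```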

Relation of model parameter values and behavior.

To test how each parameter of the best-fitting model related to learning and performance levels across the attentional load conditions, we constructed linear mixed effects (LME) models (Pinheiro & Bates, 1996). The models predict the learning speed LS (corresponding to the number of trials to criterion performance) or the plateau accuracy AC (percent correct over trials after the criterion was reached) based on the individual model parameter values (Par1, Par2, … Parn) and the attentional load condition (with 3 levels for the 1D, 2D, and 3D load conditions). All models used the factor Monkey (each of six animals) as a random effect to control for individual variation. For the best-fitting model this LME had the form

LS (or Accuracy) = Par_1 (β^{WM}) + Par_2 (C^{WM}) + Par_3 (ω^{WM}, WM decay) + Par_4 (β_m, adaptive exploration) + Par_5 (α⁺, amplitude of adaptive exploration) + Par_6 (ω^{RL}) + Par_7 (η_{Gain}) + Par_8 (η_{Loss}) + AttLoad + (1 | Monkey) + b + ε    (eq. 18)

In a second linear mixed effects analysis we tested which learning parameter values of the best-fitting model are able to predict fast versus slow learners. To test this, we ranked subjects by their learning speed (their average trials-to-criterion) for each attentional load condition. We then tested whether this rank ordering (learner rank) was accounted for by the individual model parameter values. Using the parameter values of the best-fitting model we tested an LME of the form:

Learner rank = Par_1 (β^{WM}) + Par_2 (C^{WM}) + Par_3 (ω^{WM}, WM decay) + Par_4 (β_m, adaptive exploration) + Par_5 (α⁺, amplitude of adaptive exploration) + Par_6 (ω^{RL}) + Par_7 (η_{Gain}) + Par_8 (η_{Loss}) + (1 | AttLoad) + b + ε    (eq. 19)

All inference statistics of the linear mixed effects models account for multiple comparisons by adjusting the significance level according to a false discovery rate of p = 0.05.
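As an illustration of eq. 18, a Python/statsmodels sketch of such a mixed model is shown below; the data frame and column names are hypothetical stand-ins for the fitted parameter values, load condition, and monkey identity (the original analysis was performed in MATLAB).

```python
# Sketch of the LME in eq. 18: learning speed predicted from the fitted model
# parameters and attentional load, with a random intercept per monkey.
import statsmodels.formula.api as smf

def fit_lme(df):
    """df: pandas DataFrame with one row per monkey x load condition and columns
    'learning_speed', 'beta_wm', 'c_wm', 'omega_wm', 'beta_m', 'alpha_plus',
    'omega_rl', 'eta_gain', 'eta_loss', 'att_load', 'monkey' (names are ours)."""
    model = smf.mixedlm(
        "learning_speed ~ beta_wm + c_wm + omega_wm + beta_m + alpha_plus"
        " + omega_rl + eta_gain + eta_loss + att_load",
        data=df, groups=df["monkey"])           # random intercept per monkey
    return model.fit()
```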

Results

Behavioral performance.

We measured how six monkeys learned the relevance of object features in a learning task while varying the number of reward-irrelevant, distracting feature dimensions of these objects from one to three. On each trial subjects chose one of three objects and either did or did not receive reward, in order to learn by trial-and-error which object feature predicted reward. The rewarded feature could be any one of 37 possible feature values from 4 different feature dimensions (color, shape, pattern, and arms) of multidimensional Quaddle objects (Fig. 1A) (Watson, Voloh, Naghizadeh, et al., 2019). The rewarded feature, i.e. the reward rule, stayed constant within blocks of 35–60 trials (Fig. 1B). Learning blocks varied in the number of non-rewarded, distracting features (Fig. 1C). Subjects had 5 sec to choose an object, which triggered correct or error feedback in the form of a yellow or cyan halo around the chosen object, respectively (Fig. 1D). The first of three experimental conditions was labeled 1-dimensional attentional load because all the distractor features were from the same dimension as the target feature (e.g. different body shapes, see examples in the top row of Fig. 1E,F). At 2-dimensional attentional load, features of a second dimension varied in addition to features from the target feature dimension (e.g. objects varied in body shape and surface pattern). At 3-dimensional attentional load, object features varied along three dimensions (e.g. body shape, surface pattern, and arm style) (bottom row in Fig. 1E,F).

Six monkeys performed a total of 989 learning blocks, completing on average 55/56/54 (SE: 4.4/4.3/4.2; range 41–72) learning blocks for the 1D, 2D, and 3D attentional load conditions, respectively. The number of trials in a block needed to learn the relevant feature, i.e. to reach 75% criterion performance, increased for the 1D, 2D, and 3D attentional load conditions from on average 6.5 to 13.5 and 20.8 trials (SEs: 4.2 / 8.3 / 6.9) (Kruskal-Wallis test, p = 0.0152, ranks: 4.8, 10.2, 13.6) (Fig. 2A,B). Learning speed did not differ when the rewarded feature in a block was of the same or of a different dimension as the rewarded feature in the immediately preceding block (intra- versus extradimensional block transitions; Wilcoxon rank-sum test, p = 0.699, ranksum = 36) (Fig. 2C).

Figure 2. Learning performance.


(A) Average learning curves across six monkeys for the 1D, 2D, and 3D load conditions. (B) Learning curves for each monkey (colors) for 1D, 2D, and 3D load (low-to-high color saturation). All monkeys showed the fastest learning in the low load condition and the slowest learning in the high load condition. Curves are smoothed with a 5-trial forward-looking window. (C) Left: the average trials-to-criterion (75% accuracy over 10 consecutive trials) for low to high attentional load (x-axis) for blocks in which the target feature was of the same dimension - intra-dimensional (ID) - or of a different dimension - extra-dimensional (ED) - as in the preceding block. Right: average number of trials-to-criterion across load conditions. Grey lines denote individual monkeys. Errors are SE. (D) Red denotes the average trials-to-criterion for blocks in which the target feature was novel (not shown in the previous block) or was previously a learned distractor. Blue denotes the condition in which a distractor feature was either novel (not shown in the previous block) or part of the target in the previous block. When distractors were previously targets, learning was slower. (E) Latent inhibition of distractors (red) and target perseveration (blue) at low, medium, and high load. Errors are SE.

Flexible learning can be influenced by target- and distractor-history effects (Banaie Boroujeni, Watson, & Womelsdorf, 2020; Chelazzi, Marini, Pascucci, & Turatto, 2019; Failing & Theeuwes, 2018; Le Pelley, Pearson, Griffiths, & Beesley, 2015; Rusz, Le Pelley, Kompier, Mait, & Bijleveld, 2020), which may vary with attentional load. We tested this by first evaluating the presence of latent inhibition, which refers to slower learning of a newly rewarded target feature when that feature was a (learned) distractor in the preceding block, compared to when the target feature was not shown in the previous block. We did not, however, find a latent inhibition effect (paired signed rank test: p = 0.156, signed rank = 3; Fig. 2D, left). A second history effect is perseveration, i.e. continued choosing of the feature that was the target in the previous block. We quantified this target perseveration by comparing learning in blocks in which a previously learned target feature became a distractor to learning in blocks in which distractor features were new. We found that target perseveration significantly slowed down learning (paired signed rank test: p = 0.0312, signed rank = 0; Fig. 2D, right) and that this slowing was significantly more pronounced in the high (3D) than in the low (1D) attentional load condition (paired signed rank test: p = 0.0312, signed rank = 0; Fig. 2E). These learning history effects suggest that learned target features had a significant influence on future learning in our task, particularly at high attentional load, while learned distractors had only marginal or no effects on subsequent learning.

Multi-component modeling of flexible learning of feature values.

To discern specific mechanisms underlying flexible feature-value learning in individual monkeys we fit a series of reinforcement learning (RL) models to their behavioral choices (see Materials and Methods). These models formalize individual cognitive learning mechanisms and allowed us to characterize their role in accounting for behavioral learning at varying attentional load. We started with the classical Rescorla-Wagner reinforcement learner, which uses two key mechanisms: (i) the updating of value expectations of features V^F on every trial t by weighting reward prediction errors (PEs) with a learning gain η, V^F_{i,t+1} = V^F_{i,t} + η (R_t − V^F_{i,t}) (with reward R_t = 1 for a rewarded choice and zero otherwise), and (ii) the probabilistic (‘softmax’) choice of an object O given the sum of the expected values of its constituent features V_i, pChoice^{RL}_i = exp(β^{RL} Σ_{j∈O_i} V^F_j) / Σ_j exp(β^{RL} Σ_{k∈O_j} V^F_k) (Sutton & Barto, 2018). These two mechanisms incorporate two learning parameters: the weighting of prediction error (PE) information by η (often called the learning rate), and the degree to which subjects explore or exploit learned values, represented by β^{RL}, which is small or close to zero when exploring and larger when exploiting.

We augmented the Rescorla-Wagner learning model with up to seven additional mechanisms to predict the monkeys’ choices (Table 1; see Discussion and Appendix for results with non-Rescorla-Wagner models such as attentional switching and hypothesis testing models). The first of these mechanisms enhanced the relative expected values of the chosen object’s features by decaying the feature values of non-chosen objects. This selective decay improved the prediction of choices in reversal learning and probabilistic multidimensional feature learning tasks (Hassani et al., 2017; Niv et al., 2015; Oemisch et al., 2019; Radulescu, Daniel, & Niv, 2016; Wilson & Niv, 2011). It is implemented as a decay ω^{RL} of the feature values V_i of non-chosen features, which enhances the relative value estimate of chosen (and hence attended) features for the next trial t:

V^F_{i,t+1} = (1 - \omega^{RL}) \, V^F_{i,t}    (Eq. 1: decay of non-chosen feature values).

As a second mechanism we considered a working memory (WM) process that uploads the identity of rewarded objects into a short-term memory. Such a WM can improve learning of multiple stimulus-response mappings (A. G. Collins & Frank, 2012; A. G. E. Collins, Ciullo, Frank, & Badre, 2017) and multiple reward locations (Viejo et al., 2018; Viejo, Khamassi, Brovelli, & Girard, 2015). Similar to (A. G. Collins & Frank, 2012), we uploaded the value of an object in WM (V^{WM}_i) when it was chosen and rewarded and decayed its value with a time constant 1/ω^{WM}. WM proposes a choice using its own choice probability pChoice^{WM}, which competes with the pChoice^{RL} from the reinforcement learning component of the model. The actual behavioral choice is based on the weighted sum of the choice probabilities of the WM and RL components, w·pChoice^{WM} + (1 − w)·pChoice^{RL}. A weight of w > 0.5 would reflect that the WM content dominates the choice, which would be the case when the WM capacity can maintain values for sufficiently many objects before they fade away (see Methods). This WM module reflects a fast ‘one-shot’ learning mechanism for choosing the recently rewarded object.

As a third mechanism we implemented a meta-learning process that adaptively increases the rate of exploration (the β parameter of the standard RL formulation) when errors accumulate. Similar to (Khamassi et al., 2013), the mechanism uses an error trace β*_t, which increases when a choice was not rewarded, by an amount proportional to the negative PE for that choice weighted by a gain parameter α⁻, and decreases after correct trials proportional to the positive PE weighted by a gain parameter α⁺ (Khamassi et al., 2013):

\beta^*_{t+1} = \beta^*_t + \alpha^{+} \max(\delta_t, 0) + \alpha^{-} \min(\delta_t, 0)    (Eq. 2: adjustment of the exploration rate),

where the PE is given by δ_t = R_t − V̄_t, with V̄_t reflecting the mean of the feature values of the chosen object. The error trace contains a record of recent reward performance and was transformed into a beta parameter for the softmax choice according to β^{RL}_t = β_max / (1 + exp(ω_1 (β*_t − ω_2))) (Khamassi et al., 2013). Transiently increasing the exploration rate increases the chances of finding relevant object features when there are no reliable, learned values to guide the choice and there are multiple possible feature dimensions that could be valuable. We kept α⁺ = −0.6, α⁻ = −0.4, ω_1 = −6, and ω_2 = 0.5 fixed, and varied β_max and, in some cases, α⁻ as well, resulting in a fourth model mechanism that could underlie flexible feature learning under increasing attentional load.

We tested three other neurobiologically plausible candidate mechanisms that played important roles in prior learning studies. A fifth mechanism implemented choice stickiness to account for perseverative (repetitive) choices independent of value estimates (Badre, Doll, Long, & Frank, 2012; Balcarras et al., 2016). A sixth mechanism implemented an ‘attentional’ dimension weight during value updates, multiplying feature values by the reward likelihood of the feature dimension they belong to (Leong et al., 2017; Oemisch et al., 2019). Finally, as a seventh mechanism we separately modelled the weighting of negative PEs after error outcomes, η_Loss, and the weighting of positive PEs after correct outcomes, η_Gain, to allow separate learning speeds for avoiding objects that did not lead to reward (after negative feedback) and for facilitating choices of objects that led to reward (after positive feedback) (Caze & van der Meer, 2013; Frank, Moustafa, Haughey, Curran, & Hutchison, 2007; Frank, Seeberger, & O’Reilly R, 2004; Kahnt et al., 2009; Lefebvre, Lebreton, Meyniel, Bourgeois-Gironde, & Palminteri, 2017; Taswell, Costa, Murray, & Averbeck, 2018; van den Bos, Cohen, Kahnt, & Crone, 2012). We constructed models that combined two, three, or four of these mechanisms, leading to models with two to eight free parameters (see Methods). Each model was fitted to the behavioral data separately for each attentional load condition and each individual monkey. We calculated the Bayesian Information Criterion (BIC) to rank order the models according to how well they predicted the actual learning behavior given their number of free parameters.
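A minimal Python sketch of this ranking step (BIC = k·ln(n) + 2·NLL, lower is better); the tuple layout of the inputs is an illustrative assumption.

```python
# Sketch of BIC-based model ranking from per-model negative log likelihoods.
import numpy as np

def bic(nll, n_params, n_choices):
    return n_params * np.log(n_choices) + 2.0 * nll

def rank_models(results):
    """results: list of (name, nll, n_params, n_choices) tuples."""
    scored = [(name, bic(nll, k, n)) for name, nll, k, n in results]
    return sorted(scored, key=lambda x: x[1])   # lowest BIC first
```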

Working memory, adaptive exploration and decaying distractor values supplement reinforcement learning.

We found that across monkeys and attentional load conditions, the RL model that best predicted the monkeys’ choices during learning had four non-standard components: (i) working memory, (ii) non-chosen value decay, (iii) an adaptive exploration rate, and (iv) a separate gain for negative PEs (η_Loss) (Fig. 3). This model had the lowest Bayesian Information Criterion on average across all six monkeys and was ranked 1st for three individual monkeys (monkeys 1, 2, and 3; Fig. 3A,B; Table 1 shows the complete list of free model parameters for the rank-ordered models). The overall best-ranked model ranked 4th, 5th, and 10th for the other three monkeys (Fig. 3A). The top-ranked models for these three monkeys had three of the four mechanisms in common with the overall best-fit model and differed from it in only one parameter. In other words, the learning performance of the monkeys was best fit with a model that incorporated selective forgetting (feature value decay) (for all 6 monkeys), a separate WM component (with three free parameters) (for all 6 monkeys), an adaptive exploration rate with or without a free parameter (for all 6 monkeys), and a separate learning rate for error outcomes (for 5 of 6 monkeys) (Fig. 3A). The one monkey (monkey 4) whose top-ranked model did not use a learning rate for error outcomes instead had the choice stickiness mechanism as part of its best-fit model (the overall 7th model in Fig. 3A) and included the learning rate for errors in its 4th-best-ranked model (Fig. 3A). Together, these findings identify a ‘family’ of good learning mechanisms across monkeys. Within this family, four cognitive learning mechanisms most consistently contributed to predicting learning performance. One additional mechanism (choice stickiness) played an important role in one of the six animals but was not needed to account for the behavior of the other five animals. Simulating the monkeys’ choices with the overall best-fitting model showed that it reproduced well the variable learning curves obtained from the monkeys (Fig. 3C).

To discern how the individual mechanisms of the most predictive model contributed to learning at low, medium, and high attentional load, we simulated the choice probabilities of this full model, as well as of partial models that implemented only individual mechanisms of the full model, separately for each load condition (Fig. 4A,B). The simulations used the parameter values of the overall best-fit model. This analysis confirmed that the best-fit full model predicted the animals’ choices most closely in all load conditions, showing a difference between the model choice probabilities and the monkeys’ choice accuracy of only ~7% across all three attentional load conditions (Fig. 4C). The reduced (partial) model that performed similarly well across all attentional loads used the decay of non-chosen features (ω^{RL}) (ranked 17th among all models, Fig. 4C, 3A). All other partial models performed differently at low and high attentional loads. The partial model with only the working memory component (with ω^{WM}) predicted choices well for the 1D and 2D load conditions but showed sharply reduced predictability of choices in the 3D load condition (Fig. 4C). The partial model with the adaptive exploration rate (β*) worsened choice probability for the low load condition relative to the standard RL model, but improved predictions for the 2D load condition (Fig. 4C). Similarly, the partial model with the separate weighting of negative prediction errors (η_Loss, ranked 27th, see Fig. 3A) showed overall better choice probabilities than the standard RL model (ranked 28th), but still failed to predict 12–18% of the monkeys’ choices when used as the only non-standard RL mechanism (Fig. 4C). These results highlight that the selective forgetting of non-chosen values, formalized as the decay of non-chosen features (ω^{RL}), was the only parameter that was similarly important across all attentional load conditions. All other cognitive learning mechanisms had functional roles that varied with attentional load.

Figure 4. Choice probabilities of monkeys and models at three different loads.


(A) Average choice accuracy of monkeys (grey) and choice probabilities of six models. The top-ranked model (red) combines WM with RL, selective suppression of non-chosen values, a separate learning gain for negative RPEs, and adaptive exploration rates. The base RL model (green) only contained a softmax beta parameter and a single learning rate. The other models each add a single mechanism to this base model to isolate its contribution to accounting for the choice patterns of the monkeys. Columns show, from left to right, the results for the low, medium, and high load conditions and for their average. (B) The ratio of monkey accuracy to model choice probability shows that in all load conditions the top-ranked model predicts monkey choices consistently better than models with a single added mechanism. (C) Average difference between model predictions (choice probability) and monkeys’ choices (proportion correct) at low to high loads for different models. Error bars are SE.

To understand why WM was only beneficial at low and medium attentional load, but detrimental at high attentional load, we visualized the choice probabilities that the WM module of the full model generated for different objects. We contrasted these WM choice probabilities with the choice probabilities for different stimuli of the RL module and of the combined WM+RL model (Fig. 5A). Following a block switch, the WM module uploaded an object as soon as it was rewarded and maintained that rewarded object in memory over only a few trials. When the rewarded object was encountered again before its memory decayed, it guided the choice of that object beyond what the RL module would have suggested (evident in trial six in Fig. 5A-C). This WM contribution is beneficial when the same object instance re-occurs within a few trials, which happened more frequently at low and medium attentional load, but only rarely at high load. In the high load condition it was the RL component that faithfully tracked the choice probability of the monkey, while the WM representation of recently rewarded objects was non-informative, because (1) it can only make a small contribution when the number of stimuli in a block is much larger than its capacity, and because (2) it does not remember rewarded objects long enough to still be available when those objects are presented again.

Figure 5. Contribution of working memory, reinforcement learning and adaptive exploration to learning behavior.


(A) Choice probabilities of the RL component of the top-ranked model for an example block, calculated for the objects with the new target feature (blue), the previous block’s target feature (red), and other target features (yellow). Purple plus signs show which object was chosen. (B) Same format and same example block as in A, but for choice probabilities calculated for objects within the working memory module of the model. Choice probabilities of the WM and the RL components are integrated to reach a final behavioral choice. (C) Same as A and B but after combining the WM and RL components in the full model. Choices closely follow the RL component, but when the WM representation is recently updated its high choice probability influences the combined, final choice probability, as evident in trials 6 and 7 in this example block. (D) The trace of non-rewarded (error) trials for three example blocks with low, medium, and high load. The trace peaks immediately after the block switch and then declines in all conditions, remaining non-zero for the medium and high load conditions. (E) The same example blocks as in D. The adaptive exploration rate (y-axis) is higher (lower beta values) when the error trace is high during early trials in a block.

While the WM contribution declined with load, the ability to flexibly adjust exploration rates became more important with high load as is evidenced by improved choice probabilities at high load (Fig. 4C). This flexible meta-learning parameter used the trace of recent errors to increase exploration (reflected in lower beta parameter values). Such increases in exploration facilitate disengaging from previously relevant targets after the first errors following the block switch, even when there are no other competitive features in terms of value, because the mechanism enhances exploring objects with previously non-chosen features. Our results suggest that such an adjustment of exploration can reduce the decline in performance at high attentional load (Fig. 4C), i.e. when subjects have to balance exploring the increased number of features with acting based on already gained target information (Fig. 5D,E).

The relative contribution of model mechanisms for learning (exploration) and plateau performance (exploitation).

The relative contributions of individual model mechanisms at different attentional loads can be inferred from their load-specific parameter values, i.e. the values that best predicted the monkeys’ learning when fitted to the learning performance at each load separately (Fig. 6). WM content was maintained longer (lower ω^{WM} values) for learning at 1D and 2D than at 3D load, and showed the fastest decay (higher ω^{WM} values) at 3D load, signifying that WM representations stopped contributing to learning at the highest load (Fig. 6C). When attentional load increased, the models showed a gradual decrease of the weighting of positive PEs (η_Gain, from ~0.2 to 0.1) and of the weighting of negative PEs (η_Loss, from ~0.9 to 0.4) (Fig. 6E,G). A potential explanation for the decrease in η_Gain is that with more distracting features more trials are needed to determine which feature is the target, which can be achieved with slower updating. The decay of non-chosen feature values (ω^{RL}) was weaker with increased load across monkeys, indicating a longer retention of the values of non-chosen objects (Fig. 6F), which reflects protecting the target value when it is not part of the currently chosen (but unrewarded) object, an event that occurs more often at high loads. Adaptive exploration rates (β_m) increased on average from low to medium and high load (more negative values), signifying increased exploration after errors at these higher attentional loads.

Figure 6. Model parameter values at different attentional loads.

Figure 6.

The average parameter values (black) of the top-ranked model (y-axis) plotted against the number of distracting feature dimensions for the WM parameters (A-C) and the RL parameters (D-H). Individual monkeys are in colors. Error bars indicate SE.

The parameter variations at increasing load could relate either to the learning speed or the plateau performance differences at different loads. To quantify their relative contributions, we used linear mixed effects modeling to quantify how a model-independent estimate of learning speed (number of trials to reach criterion performance), and plateau accuracy (proportion of correct trials after learning criterion was reached) was predicted by the model parameters of the best-fit model. We found learning speed was significantly predicted by three parameters (Fig. 7A). Learning occurred significantly earlier (i) with larger prediction error weighting for rewarded trials (ηGain, t-stat = −3.39, p = 0.0096, FDR controlled at alpha = 0.05), with higher prediction error weight for unrewarded trials (ηLoss, t-stat = −4.66, p = 0.0016, FDR controlled at alpha 0.05), and (iii) with larger adaptive change of exploration as captured in the meta-learning parameter βm (t-stat = −5.78, p = 0.00041, FDR controlled at alpha 0.05) (for scatterplot overviews, see Fig. 7C). The remaining parameters were not significantly predicting learning when the false discovery rate was controlled at alpha 0.05 (all not sign.: βWM:t = −2.11, CWM : t= 0.33, ωWN:t = −2.7, ωRL:t = 0.41, α+:t=1.84).

Figure 7. Model parameter values underlying learning speed and plateau performance levels.

Figure 7.

(A) The t values of the linear effects analysis for each parameter value of the best fitting model for predicting the average trials to criterion (learning speed). Stars denote FDR corrected significance at p < 0.05. Negative values denote that higher parameter values associate with faster learning. (B) Same format as A for the linear mixed effects model predicting plateau performance accuracy with model parameter values. (C,D) Scatterplots of the RL parameter values of the top-ranked model plotted against the average learning speed (C) and the average plateau performance (D). The grey line is the linear regression whose r and p values are given above each plot. Each dot is the average result from one monkey in either the 1D (red), 2D (blue) or 3D (green) condition.

In contrast to the learning speed, the plateau performance level was not significantly predicted by a single parameter when controlling for the false discovery rate at alpha 0.05 (βWM: t–stat = 0.97, CWM: t–stat= −2.99, ωWM:t = −0.11, βm, t-stat = −1.06, ωRL: t=2.32, ηGain, t-stat = −0.33, ηLoss t-stat = −1.34, α+: t=0.24). With a more lenient false discovery rate of alpha = 0.2, the plateau performance was significantly predicted by parameter values of working memory capacity (CWM) and selective decay of non-chosen values (ωRL) (Fig. 7B,D). These results indicate a trend for better performance with less reliance on working memory, but with a stronger selective forgetting (decay) of values for features that were not part of chosen objects (modulated via ωRL).

Model parameter values distinguish fast and slow learners.

We next tested which model parameters distinguished good and bad learners across attentional load conditions by sorting subjects according to their learning speed, i.e. their average number of trials to reach criterion) and predicting the rank order of fast to slow learners based on the parameter values of the best-fitting model (Fig. 8A). We found that the variations of five model parameters had a significant main effect of the linear mixed effects model (Fig. 8B). Faster learners, requiring fewer trials to learn, retained a higher working memory capacity (CWM: t= −3.12, p=0.0124, Fig. 8C), a lower average exploration rate (βm: t= −5.33, p=0.0005, Fig. 8D), a larger learning rate for positive outcomes (ηGain, t-stat = −3.06, p = 0.0137, Fig. 8E), a larger learning rate for negative outcomes (ηLoss, t-stat = −4.08, p = 0.0028, Fig. 8F), and a larger variation (i.e. a larger amplitude of changes) of exploration rates (α+, t-stat = −3.6, p = 0.0137, Fig. 8G). These findings illustrate, that good and bad learners are distinguished not by differences of a single learning mechanism but by applying a learning strategy that utilizes working memory, flexibly adapts exploration rates, and show enhanced learning rates for both correct and error outcomes (Fig. 8).

Figure 8. Model mechanisms distinguishing slow and fast learners.

Figure 8.

(A) The average learning speed (the trials to reach criterion, y-axis) plotted against the individual monkeys ordered from the fastest to the slowest learner. (B) T values of the linear effects analysis that tested how the rank-ordering of monkeys learning speeds (ranks #1 to #6) are accounted for by model parameter values. Stars denote FDR corrected significance at p < 0.05. (C-G) The parameter values (y-axis) of the best fit model plotted against the rank ordering of learners (x-axis). The parameters shown had a significant main effect to account for the rank ordering of learners as shown in (B).

Discussion

We found that learning feature values under increasing attentional load is accounted for by a reinforcement learning framework that incorporates four non-standard RL mechanisms: (i) a value-decrementing mechanism that selectively reduces the feature values associated with the non-chosen object, (ii) a separate working memory module that retains representations of rewarded objects over a few trials, (iii) separate gains for enhancing values after positive prediction errors and for suppressing values after negative prediction errors, and (iv) a meta-learning component that adjusts exploration levels according to an ongoing error trace. When these four mechanisms were combined the learning behavior of monkeys was better accounted for than when using fewer or different sets of mechanisms. Critically, the same set of mechanisms were similarly important for all six animals (Fig. 3), suggesting they constitute a canonical set of mechanisms underlying flexible learning and adjustment of behavior. Although subjects varied in how these mechanisms were weighted (Fig. 6) those with faster learning and hence higher cognitive flexibility were distinguished by stronger weighting of positive and negative prediction errors, higher working memory capacity, as well as an overall lower exploration rate but with enhanced meta-adjustment rates of the exploration rate during periods of high error rates. Taken together these results document a formally defined set of mechanisms that support flexible learning of feature relevance under variable attentional load. It is important to note that the optimal model parameter settings do not perfectly account for all the observed choices, for instance, the observed learning curves in Figure 3C and 4A lie above those generated by the models. It is therefore possible that an additional model mechanism exists that closes this predictive gap, which could potentially interfere with, for instance, the interaction between working memory and RL based learning during varying attentional load, hence, potentially changing the suggested role of the mechanisms. Further research is therefore necessary to identify additional model mechanisms to exclude this possibility. In addition, these suggested mechanisms serve as a starting point for electrophysiological experiments in which specific brain areas are targeted by perturbative approaches in order to causally establish the role of model mechanisms in behavior. In the following, we further discuss our results from this viewpoint, including the evaluation of other model mechanisms we considered.

Selective value enhancement is a key mechanism to cope with high attentional load.

One key finding was that only one non-standard RL mechanism, the decay of values of non-chosen features (ωRL), contributed similarly to learning across all attentional load conditions (Fig. 4C). This finding highlights the importance of this mechanism and supports previous studies that used a similar decay of non-chosen features to account for learning in multidimensional environments with deterministic or probabilistic reward schedules (Hassani et al., 2017; Niv et al., 2015; Oemisch et al., 2019; Radulescu et al., 2016; Wilson & Niv, 2011). The working principle of this mechanism is a push-pull effect on the expected values of encountered features and thus resembles a selective attention phenomenon (when emphasizing the ‘pushing’ of values) of chosen and attended objects), or a ‘selective forgetting’ phenomenon (when emphasizing the ‘pulling’ down of values of nonchosen object features). When a feature is chosen (or attended), its value is updated and contributes to the next choice, while the value of a feature that is not chosen (not attended) is selectively suppressed and contributes less to the next choice. A process with a similar effect has been described in the associability literature whereby the exposure to stimuli without directed attention to it causes a reduction in effective salience of that stimulus. Such reduced effective salience reduces its associability and can cause the latent inhibition of non-chosen stimulus features for learning (Donegan, 1981; Esber & Haselgrove, 2011; Hall & Pearce, 1979) or the slowing of responses to those stimuli (also called negative priming) (Lavie & Fox, 2000). The effect is consistent with a plasticity process that selectively tags synapses of those neuronal connections that represent chosen objects in order to enable their plasticity while preventing (or disabling) plasticity of non-tagged synapses processing non-chosen objects (Roelfsema & Holtmaat, 2018; Rombouts et al., 2015). In computational models such a synaptic tag is activated by feedback connections from motor circuits that carry information about what subjects looked at or manually chose (Rombouts et al., 2015). Accordingly, only chosen objects are updated, resembling how ωRL implements increasing values for chosen objects when rewarded and the passive decay of values of nonchosen objects. Consistent with this interpretation ωRL was significantly positively correlated with overall learning speed (Fig. 7C) although it did not predict learning speed in an LME that controlled for the specific monkey identity as random effect (Fig. 7A). At low attentional load, high ωRL values reflect the fast forgetting of non-chosen stimuli, while at high attentional load ωRL adjusted to lower values which is slowing down the forgetting of values associated with nonchosen objects (Fig. 5F). The lowering of the ωRL decay at high load implies that values of all stimulus features are retained in the form of an implicit choice-history trace. Consistent with this finding, various studies have reported that prefrontal cortex areas contain neurons representing values of unchosen objects and unattended features of objects (Boorman, Behrens, Woolrich, & Rushworth, 2009; Westendorff, Kaping, Everling, & Womelsdorf, 2016). Our results demonstrate that at high attentional load, the ability of subjects to retain the value history of those nonchosen stimulus features is a critical factor for fast learning and good performance levels (Fig. 7A).

Working memory supports learning together with reinforcement learning.

Our study provides empirical evidence that learning the relevance of visual features leverages a fast working memory mechanism in parallel with a slower reinforcement learning of values. This finding empirically documents the existence of parallel (WM and RL) choice systems, each contributing to the monkey’s choice in individual trials to optimize outcomes. The existence of such parallel choice and learning systems for learning fast and slow has a long history in the decision-making literature (Balleine, 2019; Poldrack & Packard, 2003; van der Meer, Kurth-Nelson, & Redish, 2012). For example, WM has been considered to be the key component for a rule-based learning system that uses a memory of recent rewards to decide to stay with or switch response strategies (Worthy, Otto, & Maddox, 2012). A separate learning system is associative and implicitly integrates experiences over longer time periods (Poldrack & Packard, 2003), which in our model corresponds to the reinforcement learning module.

The WM mechanisms we adopted for the feature learning task is similar to WM mechanisms that contributed in previous studies to the learning of strategies of a matching pennies game (Seo, Cai, Donahue, & Lee, 2014), the learning of hierarchical task structures (Alexander & Brown, 2015; A. G. Collins & Frank, 2012, 2013), or the flexible learning of reward locations (Viejo et al., 2018; Viejo et al., 2015). Our study adds to these prior studies by documenting that the benefit of WM is restricted to tasks with low and medium attentional load (Fig. 4). The failure of working memory to contribute to learning at higher load likely reflects an inherent limit in working memory capacity. Beyond an interpretation that WM capacity limits are reached at higher load, WM is functionally predominantly used to facilitate processing of actively processed items as opposed to inhibiting the processing of items stored in working memory (Noonan, Crittenden, Jensen, & Stokes, 2018). In other words, a useful working memory is rarely filled with distracting, non-relevant information that a subject avoids. In our task, high distractor load would thus overwhelm the working memory store with information about non-rewarded objects whose active use would not lead to reward. Consequently, the model – and the subject whose choices the model predicts – downregulated the importance of WM at high attentional load, relying instead on slower reinforcement learning mechanism to cope with the task.

Separate learning rates promote avoiding choosing objects resulting in worse-than-expected outcomes.

We found that separating learning from positive and negative prediction errors improved model predictions of learning across attentional loads (Fig. 4) by allowing a ~3-fold larger learning rate for negative than positive outcomes (Fig. 6E vs. 6G). Thus, monkeys were biased to learn faster to avoid objects with worse-than-expected feature values than to stay with choosing objects with better-than-expected feature values. A related finding is the observation of larger learning rates for losses than gains for monkeys performing a simpler object-reward association task (Taswell et al., 2018). In our task, such a stronger weighting of erroneous outcomes seems particularly adaptive because the trial outcomes were deterministic, rather than probabilistic, and thus a lack of reward provided certain information that the chosen features were part of the distracting feature set. Experiencing an omission of reward can therefore immediately inform subjects that feature values of the chosen object should be suppressed as much as possible to avoid choosing it again. This interpretation is consistent with recent computational insights that the main effect of having separate learning rates for positive and negative outcomes is to maximize the contrast between available values for optimized future choices (Caze & van der Meer, 2013). According to this rationale, larger learning rates for negative outcomes in our task promote the switching away from choosing objects with similar features again in the future. We should note that in studies with uncertain (probabilistic) reward associations that cause low reward rates, the overweighting of negative outcomes would be non-adaptive as it would promote frequent switching of choices which is suboptimal in these probabilistic environments (Caze & van der Meer, 2013). These considerations can also explain why multiple prior studies with probabilistic reward schedules report of an overweighting of positive over negative prediction errors which in their tasks promoted staying with and prevent switching from recent choices (Frank et al., 2007; Frank et al., 2004; Kahnt et al., 2009; Lefebvre et al., 2017; van den Bos et al., 2012).

The separation of two learning rates also demonstrates that our task involves two distinct learning systems for updating values after experiencing nonrewarded and rewarded choice outcomes. Neurobiologically, this finding is consistent with studies of lesioned macaques reporting that learning from aversive outcomes is more rapid than from positive outcomes and that this rapid learning is realized by fast learning rates in the amygdala as opposed to slower learning rates for better-than-expected outcomes that is closely associated with the ventral striatum (Averbeck, 2017; Namburi et al., 2015; Taswell et al., 2018). Our finding of considerably higher (faster) learning rates for negative than positive prediction errors is consistent with this view of a fast versus a slow RL updating system in the amygdala and the ventral striatum, respectively. The importance of these learning systems for cognitive flexibility is evident by acknowledging that learning rates from both, positive and negative outcomes, distinguished good and bad learners (Fig. 8), which supports reports that better and worse learning human subjects differ prominently in their strength of prediction error updating signals (Klein et al., 2007; Krugel, Biele, Mohr, Li, & Heekeren, 2009; Schonberg, Daw, Joel, & O’Doherty, 2007).

Adaptive exploration contributes to learning at high attentional load.

We found that adaptive increases of exploration during the learning period contributed to improved learning at high load (Fig. 3). Adapting the rate of exploration over exploitation reflects a meta-learning strategy that changes the learning process itself by adaptively enhancing searching for new choice options irrespective of already acquired expected values (Doya, 2002). Our finding critically extends insights that adaptive learning rates are critically important to cope with uncertain environments (Farashahi, Donahue, et al., 2017; Soltani & Izquierdo, 2019) to target uncertainty imposed by increased distractor load. In earlier studies, reward uncertainty was estimated to adjust learning rates in tasks with varying volatility (Farashahi, Donahue, et al., 2017), changing outcome probabilities when predicting sequences of numbers (Nassar, Wilson, Heasly, & Gold, 2010), sharp transitions of exploratory search for reward rules and exploitation of those rules (Khamassi et al., 2015), probabilistic reward schedules during reversal learning (Krugel et al., 2009), or the compensation for error in multi-joint motor learning (Schweighofer & Arbib, 1998). A commonality of these prior meta-learning studies is a relatively high level of uncertainty about the source of reward or error outcomes. In our task, the uncertainty about the target feature systematically increased with the number of distracting features. As a consequence of enhanced uncertainty, subjects utilized a learning mechanism that increased randomly exploring new choice options when non-rewarded choices accumulated, and to reduce exploring alternative choices when choices began to lead to reward outcomes. Such balancing of exploration and exploitation can be achieved by using a memory of recent reward history to adjust undirected vigilance (Dehaene, Kerszberg, & Changeux, 1998; Khamassi et al., 2013), or other forms of exploratory strategies (Tomov et al., 2020).

Limitations and scope of our findings.

Our study found that four non-standard learning mechanisms contributed to explaining learning at low and high attentional load across six monkeys. This is a major novel finding that motivates identifying the neural basis of each of these mechanisms and how they cooperate during learning. However, these results should not be considered a conclusive list of learning mechanisms and various limitations of our approach should be considered. Firstly, we cannot rule out that other mechanisms are used beyond those considered (see next subsection). Secondly, our conclusions are based on ranking different models according to how well (in terms of likelihood) they predict choices. We quantified this for each of six monkeys separately with the Bayesian Information Criterion (BIC) which is well established and penalizes the model predictability by the number of parameters used for the prediction (Wagenmakers & Farrell, 2004). We could have chosen other means to rank order models by, e.g. combining the BIC with measures of explained variance of choices (pseudo R2) into a compound goodness-of-fit score (Balcarras et al., 2016), or by performing a model recovery analysis using all possible models (Wilson & Collins, 2019). Such more extensive model comparisons will be justified when a study considers multiple alternative, mutually exclusive, and possibly more contentious, newly devised learning mechanisms that do not apparently contribute to enhancing likelihood in terms of BIC. Here, we documented, as reported in the methods section, that the models used were identifiable to a sufficient degree so that the different parameters could be compared and that the models yielded significant differences in the objective function so that the performance of the models could be adequately compared (Appendix).

Consideration of alternative mechanisms.

In our study only two main mechanisms were tested that were not consistently contributing to enhancing the BIC (choice stickiness and Bayesian dimension weighting). These two processes might play a role in other tasks that we did not consider. Our study should therefore not be considered to exclude learning mechanisms, but rather to provide strong positive empirical evidence for including those four mechanisms that we found to consistently contribute to successful cognitive flexibility in our task.

Beyond the mechanisms that we tested explicitly in our study, we explored alternative models that have been described in the literature and which we found not to be competitive in our dataset. These models were almost exclusively formulated in terms of values, either of the object or feature. One of the omitted models was object-based RL learning (not to be confused with object-based working memory, which represented fast, one-shot learning). For our data this model did not yield competitive accounts for the choices. It is important to mention this model since it relates to the investigation in (Farashahi, Rowe, et al., 2017), which compares feature with object-based probabilistic RL learning.

A second set of models we did not explicitly consider in this report but tested on the dataset are models that assume learning involves subjects to test hypotheses of reward rules causing fast and discrete switches of attention and behavioral choices when a hypothesis is refuted. For example, some studies report that the identification of the correct target occurs suddenly during the block and from then on results in choices that are rewarded (Papachristos & Gallistel, 2006). From block to block this sudden onset occurs at different trials relative to the target switch so that averaging accuracy across blocks can give the wrong impression of a smooth learning curve. The rapid switching is consistent with the subject holding a single hypothesis at a time about what the target feature is and switching when the choices based on this hypothesis are not rewarded. Such a model is referred to as a serial hypothesis-testing model as suggested by Niv and colleagues (Radulescu, 2020; Radulescu, Niv, & Ballard, 2019). We implemented a version of this model and explored the learning behavior in the generative version of the model and fitted it to a subset of behavioral data using the likelihood formulation of the model (see Appendix, Appendix Fig. 2). On each trial there is a single inferred target feature, a switch to a new (randomly chosen) feature is made when the number of rewards in the prior τ trials is below a certain threshold (this is a parameter in the model). At the phenomenological level the model reproduces the decrease in learning speed with higher attentional load (Appendix Fig. 2DF and Appendix Fig. 3BD, for attentional load 1,2 and 3, respectively). For our fits we varied the memory duration τ and found that a memory of 3 trials in the past was optimal, nevertheless the resulting performance (NLL) did not match our best model (model 1, Table 1). We expect that the model will be interesting for future studies in which probabilistic rewards are considered for which memory should help distinguishing between unrewarded correct choices and unrewarded incorrect choices for switching to a new hypothesis.

A more complicated version of the hypothesis switching approach is based on change-point detection (Adams & MacKay, 2007), in the sense that more probability distributions need to be updated from trial to trial (hence be represented somewhere in the brain if the model to has to have a mechanistic interpretation). A Bayesian inference model with change-point detection was described in (Wilson & Niv, 2011) section 2.3.2. In this model the key quantity is the probability pf that feature f is the target. This distribution converges quickly towards a situation in which all the probability weight is on one feature, in essence representing that the subject has a single hypothesis for the target feature. A change-point indicates that the current hypothesis does not predict rewarded choices any more, in which case pf has to be reset, typically to a uniform distribution. The model builds, based on the stimulus-choice-reward sequence, a probability distribution of the possible change-points and maintains the corresponding feature distribution pf conditioned on each possible change point in the past. We had earlier simulated a Bayesian inference model without change-point detection, which led to non-competitive values for NLL, because it learned too fast compared to the subject (note that we had a deterministic reward rule). We had included this model e.g. in (Oemisch et al., 2019). Similarly, the model with change-point detection also had a higher NLL than our best model and the optimal fit parameters corresponded to an unreasonably high switch rate h. Hence, we did not enter these types of models in our comparison.

An alternative to value-based learning is Q-learning in which for each state s of the environment and inferred feature f, an action value Q(c,f,s) for choice (action) c is learned from the past choices and rewards. An example of such a model, relevant to our data, was proposed in Kour and Morris (Kour & Morris, 2019). An additional advantage was that this model was fit by an Expectation-Maximization procedure (i.e. Baum-Welch method, see (Murphy, 2012)) since that procedure provides a way to reconstruct the hypothesis on each trial. We however found that the sheer number of parameters represented by the Q-values (which grow with the product of the number of states times number of features times number of objects, which our case is 36 × 3 × 3, 216 × 6 ×3, and 1296 × 9 × 3 for 1,2 and 3 non-neutral features respectively) made fitting difficult for the number of blocks that we had experimentally available.

In other published models the specific target of updates for unchosen features is a free parameter (Barraclough, Conroy, & Lee, 2004; Ito & Doya, 2009). These models are based on Q-learning hence suffer in our case from the aforementioned issue of a large number of parameters necessary for the action values. This was not the case in the cited references since there was only one state in combination with two possible choices (L and R) that each have a different reward probability associated with them. A direct comparison based on our data was therefore not feasible. Nevertheless, (Ito & Doya, 2009) discuss four different update rules, which determine how the action values of chosen and non-chosen options are updated when there was a reward or when the reward was omitted. These update choices play similar roles as our learning rate parameters ηLoss and ηGain play for the update of the chosen option when not-rewarded/rewarded, respectively and ωncRL for the decay of the non-chosen option). In conclusion, our evaluated models are similar in spirit to the ones in (Ito & Doya, 2009), however, rather than based on action values for choices, we update feature values with similar different alternative update models.

Other published models incorporated longer-timescale perseveration to account for choice behavior (Akaishi, Umeda, Nagase, & Sakai, 2014; Miller, Shenhav, & Ludvig, 2019). For example, the task in (Akaishi et al., 2014) involved two choices, but models for this choice were formulated solely in terms of the probability of making the first choice, which is obtained by transforming a decision variable by a sigmoid. Different choices for functional dependence of the decision variable on past choices and current and past stimulus features were made. For instance, it could depend on the prior choice, the current and previous contrast, sometimes in a nonlinear combination, where the previous contrast gain-modulated the effect of the current contrast. We have incorporated the choice stickiness of the previous choice in a similar fashion (following (Balcarras et al., 2016)), but found that for the current task stickiness on average did not improve the prediction of choices and was not part of the top-ranked model in five of six monkeys (Fig. 3A). Akaishi and colleagues also evaluated replacing variables by predictions for the stimulus contrast and choice (Akaishi et al., 2014). These predictions were updated using the prediction error similar to the feature values in our reinforcement learning models, hence this mimics Q-learning. Apart from the fact we did not have a contrast variable in the current behavioral data, we did not include such prediction variables in our models, primarily because choices depend on the stimulus configuration, hence we would have to consider a different set of action variable for each stimulus configuration, again yielding the aforementioned state-space size problems.

Taken together, our brief discussion of previous work covers models that we either implemented in a pilot stage but found not to account well for the choices in our task, or that were a different type of model that was based on action value formulation that was feasible in the original works, but not for our behavioral data. The latter incorporated effects of perseveration and different reward dependent gains and choice-dependent decay that we had already incorporated in our feature value updates and evaluated as part of our set of models. Taken together, we believe that our comparison covered a sufficiently diverse set of models to draw conclusions about how behavioral strategies change with attentional load. When the task is extended to include probabilistic rewards as well as multi-stage setups then action-value based models may need to be used instead of the feature value based models we focused on here.

Conclusion.

In summary, our study documents that a standard reinforcement learning modeling approach does not capture the cognitive processes needed to solve feature-based learning. By formalizing the subcomponent processes needed to augment standard (Rescorla Wagner) RL modeling we provide strong empirical evidence for the recently proposed ‘EF-RL’ framework that describes how executive functions (EF) augment RL mechanism during cognitive tasks (Rmus, McDougle, & Collins, 2020). The framework asserts that RL mechanisms are central for learning a policy to address task challenges, but that attention-, action-, and higher-order expectations are integral for shaping these policies (Rmus et al., 2020). In our study these ‘EF’ functions included (i) working memory, (ii) adaptive exploration, (iii) a separate learning gain for erroneous performance, and (iv) an attentional mechanism for forgetting nonchosen values. As outlined in Fig. 9 these three mechanisms leverage distinct learning signals, updating values based directly on outcomes (WM), on prediction errors (RL-based decay of nonchosen values), or on a continuous error history trace (meta-learning based adaptive exploration). As a consequence, these three learning mechanisms operate in parallel and influence choices to variable degrees across different load conditions, for instance, learning fast versus slow (WM versus meta-learning versus RL) and adapting optimally to low versus high attentional load (WM versus meta-learning). Our study documents that these mechanisms operate in parallel when monkeys learn the relevance of features of multidimensional objects, providing a starting point to identify how diverse neural systems integrate these mechanisms during cognitive flexible behavior (Womelsdorf & Everling, 2015).

Figure 9. Characteristics of the working memory, reinforcement learning and meta-learning components.

Figure 9.

The model components differ in the teaching signals that trigger adjustment (upper row), in the learning speed, i.e. in how fast they affect behavior (middle row), and in how important they are to contribute to learning at increasing attentional load (bottom row).

Acknowledgements:

The authors would like to thank Shelby Volden and Seth König for help with animal training and data collection, and Seth König for preparing the data for analysis and for introducing the concept of a ‘neutral’ Quaddle stimulus. Research reported in this publication was supported by the National Institute of Mental Health of the National Institutes of Health under Award Number R01MH123687. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Appendix: Identifiability of model 1

We evaluated to what extent the fitting of model 1 (Table 1) was affected by local minima by starting the fitting procedure from different initial conditions (n=10). We found that a number of distinct parameter sets were reached multiple times. The objective function, NLL, was different for these solutions, but the differences were small and typically less than the difference between different models considered here. For each of the initial conditions, we generated 40 behavioral sequences, determined the NLL, reward outcome and match between subject and model choices (Figure 1AB). Here the match was quantified as the fraction of common choices. The overall conclusion is that the better the NLL (that is a lower value) the more overlap there was between the choices of the model and those of the subject, this was in addition indexed as a higher average reward. We also refitted these sequences and analyzed the variability in parameter values so obtained, represented as boxplots in Figure 1C. We found that the variability in the β parameters for both the working memory and reinforcement learning component was high, which was reflected in the negative correlation between them (Figure 1E). In addition, there was a strong anticorrelation between βWM and CWM (Figure 1D). This suggests that outlier values of the β variables should be removed. This can be achieved by putting priors on the parameters and incorporate those in the objective function, or using a cross-validation strategy to extract the robust parameter sets.

Figure 1. Model 1 identifiability.

Figure 1.

The experimental behavioral data was fitted 10 times starting from different initial conditions for the parameters. For each of the fit parameters the generative model was run 40 times, which yielded 400 values for the NLL, the average reward and the match between subject choices. The models were refitted which yielded 400 parameter sets. (A) Reward outcome versus NLL: a lower (better) NLL also gives a higher average reward. (B) Match between model and subject choices versus NLL. Each model choice sequence was represented by a line, the bottom point is the match for randomized model choices, whereas the top point is that for the actual model choices. The lower the NLL the better the prediction of choices by the model is. (C) Boxplot summarizing the 400 values of each parameter, the red line indicates the median, the box contains the interquartile range (25 to 75th percentile), the points beyond the whiskers are outliers. For βWM and βm, there are a significant number of outliers. The variables (D) βWM and CWM and (E) βWM and βm are negatively correlated.

Serial hypothesis testing models

Generative model

The state of the subject is described by a latent variable z, in our case, this is the current hypothesis about what the target feature f is. There are multiple dimensions d, within each dimension there are multiple feature values fd, the number of which vary with dimension. For instance for the dimension color, there are feature values red and green, whereas for dimension shape there is a different number of features. Each dimension has a neutral feature value that can appear in multiple objects, when the corresponding dimension cannot provide the target feature that predicts reward. The key variable is the target feature ft during trial t, determined by the environment, and the set of features that are active, fa, i.e. they are all the non-neutral features that are presented during a block of trials and that can be the target feature. We use functions to switch between dimension d (color), feature value within dimension fd (red) and overall feature value f (index for color-red), specifically D(f)=df, Fd(f)=fd, f=F(d,fd), these are easily implemented as a lookup in the appropriate matrix. The primary reason is that unless you use attention to a particular dimension, the overall feature value can be used as the main variable in updates, and then it also represents the values z can take.

The state is an allowed combination of no objects, the objects can share the same feature value for a dimension when it is the neutral value, but they have to have different ones for non-neutral features. The state of the environment is summarized by the matrix S(i,j,s), here i=1,…,no indicates the object, j=1,…,nf the feature, s=1,…, ns the state index. When in trial t the state of the environment is st, it means that the objects are given in terms of their features by S(i,j,st), i.e. S(i,j,st)=1 when object i in state st contains feature j. Taken together this means that the environment can be completely described by state st (ns=36, 216 or 1296, respectively, for 1,2 or 3 active dimensions with 3 non-neutral features used in each dimension) and the current target feature ft (taking values between 1 and nf=3, 6 and 9, respectively, for the aforementioned attentional loads). In our setup each st is chosen randomly from amongst the ns available states, where ft switches randomly from trial to trial from feature j to i, according to the switching matrix

pijswf=(1h)δij+hnf1(1δij),

where δij is the Kronecker delta, equal to one when i=j and zero otherwise. Hence there is a probability h for a switch to a random other feature. The potential value (possible reward) of each state is the same, hence, the subject cannot change future values of a state by a particular choice. This makes this system different from the typical one studied using q-learning.

The additional variables on each trial are subject-inferred target zt, choice ct and the resulting reward rt. The choice ct is for the object that contains the inferred target, hence, for which S(ct,zt,st)=1, with probability pc (exploitation); and with 1-pc a random choice is made (exploration). This parameter in some sense plays the role of the β in the softmax choice probability of our feature value models (Table 1). The probability for obtaining a reward on trial t, rt=1, is pr when the choice contains the target, that is, S(ct,ft,st)=1, and pnr, otherwise.

In the experimental study we used a deterministic reward scheme, pr=1, pnr=0. Taken together, using x=S(ct,ft,st) as shorthand, then the probability for reward on trial t is p=pr x + pnr(1-x). Note that ct and rt can only influence the future inferred target zt+1.

The probability for the new inferred target being zt+1=i when zt=j is determined by the following switching matrix

pijswz=(1htz)δij+htznf1(1δij),

with the switch probability

htz=111+eκ(r¯tθ).

Note that the sigmoid in this expression in fact represents a stay probability, but we need to enter the switch probability in the switch matrix, hence the subtraction from one in the formula. The r-t is the weighted average of rewards across the past τ trials,  r-t=i=0τ-1wirt-i, here we explored two choices: wi = 2τi−1 or wi = 1. Hence, either the more recent reward counts more or all past rewards are counted equally.

The algorithm to create the behavioral model data therefore has the following steps:

  • Initialize. Set z1, f1, s1 (drawn uniformly from amongst all feasible values).

  • Make choice ct. Choose whether to exploit (probability pc) or explore (1-pc), for the former choose the ct that satisfies S(ct,zt,st)=1, for the latter choose ct according to the uniform distribution.

  • Determine reward rt. Set rt=1 with probability p=pr x + pnr(1-x), x=S(ct,ft,st).

  • Update inferred target zt. Draw zt+1=i according to pi,ztswz using
    htz=1-1e-κ(r-t-θ).
  • Update target feature ft. Draw ft+1 =i according to pi,ftswf.

  • Update state (objects) st. Draw st+1 uniformly from amongst the allowed states.

The parameters are exploit probability pc, reward probability for correct choice pr and for incorrect choice pnr, target switching rate h, memory duration τ, sharpness of switch function κ and the threshold θ.

Likelihood model for behavioral observations

The generative model produces a sequence (st,ct,rt), the actual target ft is also available but should not be used (since this is contained in the reward that the subject receives), whereas latent variable zt is hidden. During the fitting procedure we fixed pr and pnr, although in principle the subject may not know them, and we assumed a deterministic choice pc=1. In addition, the allowed states according to which the sequence was generated is also given, this means a fixed attentional load is assumed and provided. Hence, the free parameters are κ and θ which can be optimized for a relevant range of memory durations τ. We start the procedure by setting the inferred target to a specific value zt. Then based on the reward history a switch probability htz and the resulting switch vector pi,zt (given by pi,ztswz, i runs along allowed feature values) is formed. The vector S(ct+1,i,st+1) indicates which features were present in the chosen object on the next trial, hence the product of the switching probability to i and whether the choice contained i, measures how well the new inferred target predicts the choice. We choose as log likelihood LL=log(iS(ct+1,i,st+1)pi,zt) and add these across trials. We also need to update the inferred target, we choose the zt+1=i where i maximizes S(ct+1,i,st+1)pi,zt. This procedure recovers the correct parameters when fed the model generated choice-state-reward sequence.

In order to fit the experimental data, which contains blocks with varying attentional load, we add a running average across presented features in order to construct the correct set of active feature values fa for the switching functions.

Serial hypothesis testing based update with feature-specific switching

The preceding model switched when there were too many unrewarded trials in its past, but it did not use that information to switch to a specific feature. In a different version of the model the inferred target τ trials back was used as the initial condition for a Bayesian update that integrated the choice-state-reward sequence up to and including the current trial. Specifically, set pif=δi,zt-τ+0.01, for i=1,…,nf, as starting point of the iteration. We add a small nonzero probability for all other features, since otherwise, with the multiplicative updates used here, there can never appear any nonzero probabilities for other features than the starting one. The update for each trial is, using temporary variables x, y, and u for ease of presentation, and v as temporary trial index: xi=δi,cv and yij = S(i,j,sv), for i=1,..,no, and j=1,…,nf, whereas uj = ∑i(prxiyij + pnr (1 − xi)(1 − yij). The update then becomes pjf=(rvuj+(1rv)(1uj)pjf , which is applied starting from the pf for tτ (i.e. with peak at the inferred target), for v = tτ + 1 up to t. The resulting pf is then normalized into a probability distribution. The zt+1 is then drawn randomly according to this distribution pf. The only change compared to the generative algorithm presented before is in this update of the internal variable.

The likelihood model is changed along similar lines. In that case the choice and reward are given hence we need to consider S(i,ct+1,st+1)pif, which measures the likelihood that the choice is made according to the updated pf, which integrates the past τ trials. The new inferred target zt+1 is given by the feature i that maximizes this likelihood. The contribution to the objective function for trial t+1, is given by the log likelihood LL=log(iS(ct+1,i,st+1)pif).

Examples

In Figure 2 we show the predictions of the generative model with random switching between features. A higher attentional load results in slower learning and a lower asymptotic performance (typically the experiment stops at 30, so asymptotic performance means performance towards the end of the block. In Figure 3 we show the predictions for feature based switching. The key feature is that integrating over one previous trial is not enough to reach perfect performance for even the lowest attentional load, whereas for load 3 none of the delays τ up to 9 is sufficient.

Figure 2. Behavioral data for a serial hypotheses testing model with pr=0.99, pnr=0, pc=1, h=0.01 with recent rewards more heavily weighted in the switching function.

Figure 2.

(A) There are ns=36 different states, which are randomly sampled across trials. (B) The target feature ft is generated from a random switching process with hazard rate h=0.01, the target feature zt is inferred by the model based on previous observations of choices and reward. (C) The choice as a function of trial index, rewarded choices are in red, the unrewarded in blue. These examples are generated for τ = 3, κ = 5.71, θ = 1.75 and attentional load 1. (D-F) Choice accuracy curves for attentional load equal to (D) 1, (E) 2, (F) 3, for five different values of τ = 1,⋯,5 as indicated in the legend. The corresponding (κ, τ) values are (40, 0.25), (13.33, 0.75), (5.71, 1.75), (2.67, 3.75), (1.3, 7.75). Higher attentional load leads to slower learning and lower asymptotic choice accuracy. For these setting there is little advantage of a longer memory (τ), for attentional load 1, longer τ prolongs the learning period.

Figure 3. Behavioral data for a serial hypotheses testing model with feature based switching, pr=0.99, pnr=0, pc=1, h=0.01.

Figure 3.

(A) The target feature ft is generated from a random switching process with hazard rate h=0.01, the target feature zt is inferred by the model based on previous observations of choices and reward. This data is for τ = 3 and attentional load 1. (B-D) Choice accuracy curves when the attentional load is (B) 1, (C) 2, (D) 3, for five different values of τ = 1,3, ⋯, 9 as indicated in the legend. Higher attentional load leads to slower learning and lower asymptotic choice accuracy. Longer τ prolongs the learning period.

Footnotes

Conflict of Interest Statement: The authors declare no competing financial interests.

Data and code accessibility

Data and computational modeling code for reproducing the results of the best fitting model (Fig. 4) is available on a github link that is activated upon publication of this manuscript.

References

  1. Adams RP, & MacKay DJ (2007). Bayesian online changepoint detection. arXiv:0710.3742. [Google Scholar]
  2. Akaishi R, Umeda K, Nagase A, & Sakai K (2014). Autonomous mechanism of internal choice estimate underlies decision inertia. Neuron, 81(1), 195–206. [DOI] [PubMed] [Google Scholar]
  3. Alexander WH, & Brown JW (2015). Hierarchical Error Representation: A Computational Model of Anterior Cingulate and Dorsolateral Prefrontal Cortex. Neural Comput, 27(11), 2354–2410. [DOI] [PubMed] [Google Scholar]
  4. Alexander WH, & Womelsdorf T (2021). Interactions of Medial and Lateral Prefrontal Cortex in Hierarchical Predictive Coding. Front Comput Neurosci, 15, 605271. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Averbeck BB (2017). Amygdala and ventral striatum population codes implement multiple learning rates for reinforcement learning. IEEE Symposium Series on Computational Intelligence (SSCI), 1–5. [Google Scholar]
  6. Badre D, Doll BB, Long NM, & Frank MJ (2012). Rostrolateral prefrontal cortex and individual differences in uncertainty-driven exploration. Neuron, 73(3), 595–607. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Balcarras M, Ardid S, Kaping D, Everling S, & Womelsdorf T (2016). Attentional Selection Can Be Predicted by Reinforcement Learning of Task-relevant Stimulus Features Weighted by Value-independent Stickiness. J Cogn Neurosci, 28(2), 333–349. [DOI] [PubMed] [Google Scholar]
  8. Balleine BW (2019). The Meaning of Behavior: Discriminating Reflex and Volition in the Brain. Neuron, 104(1), 47–62. [DOI] [PubMed] [Google Scholar]
  9. Banaie Boroujeni K, Watson MR, & Womelsdorf T (2020). Gains and Losses Differentially Regulate Learning at Low and High Attentional Load. bioRxiv, 1–38. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Barraclough DJ, Conroy ML, & Lee D (2004). Prefrontal cortex and decision making in a mixed-strategy game. Nat Neurosci, 7(4), 404–410. [DOI] [PubMed] [Google Scholar]
  11. Benjamini Y, & Hochberg Y (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B: Methodological, 57, 289–300. [Google Scholar]
  12. Boorman ED, Behrens TE, Woolrich MW, & Rushworth MF (2009). How green is the grass on the other side? Frontopolar cortex and the evidence in favor of alternative courses of action. Neuron, 62(5), 733–743. [DOI] [PubMed] [Google Scholar]
  13. Botvinick M, Ritter S, Wang JX, Kurth-Nelson Z, Blundell C, & Hassabis D (2019). Reinforcement Learning, Fast and Slow. Trends Cogn Sci, 23(5), 408–422. [DOI] [PubMed] [Google Scholar]
  14. Caze RD, & van der Meer MA (2013). Adaptive properties of differential learning rates for positive and negative outcomes. Biol Cybern, 107(6), 711–719. [DOI] [PubMed] [Google Scholar]
  15. Chelazzi L, Marini F, Pascucci D, & Turatto M (2019). Getting rid of visual distractors: the why, when, how, and where. Curr Opin Psychol, 29, 135–147. [DOI] [PubMed] [Google Scholar]
  16. Collins AG, Brown JK, Gold JM, Waltz JA, & Frank MJ (2014). Working memory contributions to reinforcement learning impairments in schizophrenia. J Neurosci, 34(41), 13747–13756. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Collins AG, & Frank MJ (2012). How much of reinforcement learning is working memory, not reinforcement learning? A behavioral, computational, and neurogenetic analysis. Eur J Neurosci, 35(7), 1024–1035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Collins AG, & Frank MJ (2013). Cognitive control over learning: creating, clustering, and generalizing task-set structure. Psychol Rev, 120(1), 190–229. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Collins AGE, Ciullo B, Frank MJ, & Badre D (2017). Working Memory Load Strengthens Reward Prediction Errors. J Neurosci, 37(16), 4332–4342. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Dajani DR, & Uddin LQ (2015). Demystifying cognitive flexibility: Implications for clinical and developmental neuroscience. Trends Neurosci, 38(9), 571–578. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Dehaene S, Kerszberg M, & Changeux JP (1998). A neuronal model of a global workspace in effortful cognitive tasks. Proc Natl Acad Sci U S A, 95(24), 14529–14534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Donegan NH (1981). Priming-produced facilitation or diminution of responding to a Pavlovian unconditioned stimulus. J Exp Psychol Anim Behav Process, 7(4), 295–312. [PubMed] [Google Scholar]
  23. Doya K (2002). Metalearning and neuromodulation. Neural Netw, 15(4–6), 495–506. [DOI] [PubMed] [Google Scholar]
  24. Esber GR, & Haselgrove M (2011). Reconciling the influence of predictiveness and uncertainty on stimulus salience: a model of attention in associative learning. Proc Biol Sci, 278(1718), 2553–2561. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Failing M, & Theeuwes J (2018). Selection history: How reward modulates selectivity of visual attention. Psychon Bull Rev, 25(2), 514–538. [DOI] [PMC free article] [PubMed] [Google Scholar]
26. Farashahi S, Donahue CH, Khorsand P, Seo H, Lee D, & Soltani A (2017). Metaplasticity as a Neural Substrate for Adaptive Learning and Choice under Uncertainty. Neuron, 94(2), 401–414 e406.
27. Farashahi S, Rowe K, Aslami Z, Lee D, & Soltani A (2017). Feature-based learning improves adaptability without compromising precision. Nat Commun, 8(1), 1768.
28. Frank MJ, Moustafa AA, Haughey HM, Curran T, & Hutchison KE (2007). Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proc Natl Acad Sci U S A, 104(41), 16311–16316.
29. Frank MJ, Seeberger LC, & O’Reilly RC (2004). By carrot or by stick: cognitive reinforcement learning in parkinsonism. Science, 306(5703), 1940–1943.
30. Gershman SJ, & Daw ND (2017). Reinforcement Learning and Episodic Memory in Humans and Animals: An Integrative Framework. Annu Rev Psychol, 68, 101–128.
31. Hall G, & Pearce JM (1979). Latent inhibition of a CS during CS-US pairings. J Exp Psychol Anim Behav Process, 5(1), 31–42.
32. Hassani SA, Oemisch M, Balcarras M, Westendorff S, Ardid S, van der Meer MA, et al. (2017). A computational psychiatry approach identifies how alpha-2A noradrenergic agonist Guanfacine affects feature-based reinforcement learning in the macaque. Sci Rep, 7, 40606.
33. Ito M, & Doya K (2009). Validation of decision-making models and analysis of decision variables in the rat basal ganglia. J Neurosci, 29(31), 9861–9874.
34. Kahnt T, Park SQ, Cohen MX, Beck A, Heinz A, & Wrase J (2009). Dorsal striatal-midbrain connectivity in humans predicts how reinforcements are used to guide decisions. Journal of Cognitive Neuroscience, 21, 1332–1345.
35. Khamassi M, Enel P, Dominey PF, & Procyk E (2013). Medial prefrontal cortex and the adaptive regulation of reinforcement learning parameters. Prog Brain Res, 202, 441–464.
36. Khamassi M, Quilodran R, Enel P, Dominey PF, & Procyk E (2015). Behavioral Regulation and the Modulation of Information Coding in the Lateral Prefrontal and Cingulate Cortex. Cereb Cortex, 25(9), 3197–3218.
37. Klein TA, Neumann J, Reuter M, Hennig J, von Cramon DY, & Ullsperger M (2007). Genetically determined differences in learning from errors. Science, 318(5856), 1642–1645.
38. Kour G, & Morris G (2019). Estimating Attentional Set-Shifting Dynamics in Varying Contextual Bandits. bioRxiv, 1–15.
39. Krugel LK, Biele G, Mohr PN, Li SC, & Heekeren HR (2009). Genetic variation in dopaminergic neuromodulation influences the ability to rapidly and flexibly adapt decisions. Proc Natl Acad Sci U S A, 106(42), 17951–17956.
40. Kruschke JK (2011). Models of attentional learning. In Pothos EM & Wills AJ (Eds.), Formal Approaches in Categorization (pp. 120–152). Cambridge University Press.
41. Lavie N, & Fox E (2000). The role of perceptual load in negative priming. J Exp Psychol Hum Percept Perform, 26(3), 1038–1052.
42. Le Pelley ME, Pearson D, Griffiths O, & Beesley T (2015). When goals conflict with values: counterproductive attentional and oculomotor capture by reward-related stimuli. J Exp Psychol Gen, 144(1), 158–171.
43. Lefebvre G, Lebreton M, Meyniel F, Bourgeois-Gironde S, & Palminteri S (2017). Behavioural and neural characterization of optimistic reinforcement learning. Nature Human Behaviour, 1(4), 1–9.
44. Leong YC, Radulescu A, Daniel R, DeWoskin V, & Niv Y (2017). Dynamic Interaction between Reinforcement Learning and Attention in Multidimensional Environments. Neuron, 93(2), 451–463.
45. McDougle SD, & Collins AGE (2020). Modeling the influence of working memory, reinforcement, and action uncertainty on reaction time and choice during instrumental learning. Psychon Bull Rev.
46. Miller KJ, Shenhav A, & Ludvig EA (2019). Habits without values. Psychol Rev, 126(2), 292–311.
47. Murphy KP (2012). Machine learning: A probabilistic perspective. MIT Press.
48. Namburi P, Beyeler A, Yorozu S, Calhoon GG, Halbert SA, Wichmann R, et al. (2015). A circuit mechanism for differentiating positive and negative associations. Nature, 520(7549), 675–678.
49. Nassar MR, Wilson RC, Heasly B, & Gold JI (2010). An approximately Bayesian delta-rule model explains the dynamics of belief updating in a changing environment. J Neurosci, 30(37), 12366–12378.
50. Niv Y, Daniel R, Geana A, Gershman SJ, Leong YC, Radulescu A, et al. (2015). Reinforcement learning in multidimensional environments relies on attention mechanisms. J Neurosci, 35(21), 8145–8157.
51. Noonan MP, Crittenden BM, Jensen O, & Stokes MG (2018). Selective inhibition of distracting input. Behav Brain Res, 355, 36–47.
52. Oemisch M, Westendorff S, Azimi M, Hassani SA, Ardid S, Tiesinga P, et al. (2019). Feature-specific prediction errors and surprise across macaque fronto-striatal circuits. Nat Commun, 10(1), 176.
53. Papachristos EB, & Gallistel CR (2006). Autoshaped head poking in the mouse: a quantitative analysis of the learning curve. J Exp Anal Behav, 85(3), 293–308.
54. Pinheiro JC, & Bates DM (1996). Unconstrained Parametrizations for Variance-Covariance Matrices. Statistics and Computing, 6, 289–296.
55. Poldrack RA, & Packard MG (2003). Competition among multiple memory systems: converging evidence from animal and human brain studies. Neuropsychologia, 41(3), 245–251.
56. Radulescu A (2020). Computational Mechanisms of Selective Attention during Reinforcement Learning. Princeton University.
57. Radulescu A, Daniel R, & Niv Y (2016). The effects of aging on the interaction between reinforcement learning and attention. Psychol Aging, 31(7), 747–757.
58. Radulescu A, Niv Y, & Ballard I (2019). Holistic Reinforcement Learning: The Role of Structure and Attention. Trends Cogn Sci, 23(4), 278–292.
59. Rmus M, McDougle SD, & Collins AGE (2020). The Role of Executive Function in Shaping Reinforcement Learning. PsyArXiv, 1–13.
60. Roelfsema PR, & Holtmaat A (2018). Control of synaptic plasticity in deep cortical networks. Nat Rev Neurosci, 19(3), 166–180.
61. Rombouts JO, Bohte SM, & Roelfsema PR (2015). How attention can create synaptic tags for the learning of working memories in sequential tasks. PLoS Comput Biol, 11(3), e1004060.
62. Rusz D, Le Pelley M, Kompier MAJ, Mait L, & Bijleveld E (2020). Reward-driven distraction: A meta-analysis. PsyArXiv, 10.31234/osf.io/82csm.
63. Schonberg T, Daw ND, Joel D, & O’Doherty JP (2007). Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making. J Neurosci, 27(47), 12860–12867.
64. Schweighofer N, & Arbib MA (1998). A model of cerebellar metaplasticity. Learn Mem, 4(5), 421–428.
65. Seo H, Cai X, Donahue CH, & Lee D (2014). Neural correlates of strategic reasoning during competitive games. Science, 346(6207), 340–343.
66. Soltani A, & Izquierdo A (2019). Adaptive learning under expected and unexpected uncertainty. Nat Rev Neurosci, 20(10), 635–644.
67. Sutton RS, & Barto AG (2018). Reinforcement learning: An introduction (2nd ed.). Cambridge, MA: MIT Press.
68. Taswell CA, Costa VD, Murray EA, & Averbeck BB (2018). Ventral striatum’s role in learning from gains and losses. Proc Natl Acad Sci U S A, 115(52), E12398–E12406.
69. Tomov MS, Truong VQ, Hundia RA, & Gershman SJ (2020). Dissociable neural correlates of uncertainty underlie different exploration strategies. Nat Commun, 11(1), 2371.
70. van den Bos W, Cohen MX, Kahnt T, & Crone EA (2012). Striatum-medial prefrontal cortex connectivity predicts developmental changes in reinforcement learning. Cereb Cortex, 22(6), 1247–1255.
71. van der Meer M, Kurth-Nelson Z, & Redish AD (2012). Information processing in decision-making systems. Neuroscientist, 18(4), 342–359.
72. Viejo G, Girard B, Procyk E, & Khamassi M (2018). Adaptive coordination of working-memory and reinforcement learning in non-human primates performing a trial-and-error problem solving task. Behav Brain Res, 355, 76–89.
73. Viejo G, Khamassi M, Brovelli A, & Girard B (2015). Modeling choice and reaction time during arbitrary visuomotor learning through the coordination of adaptive working memory and reinforcement learning. Front Behav Neurosci, 9, 225.
74. Voloh B, Watson MR, Koenig S, & Womelsdorf T (2020). MAD saccade: statistically robust saccade threshold estimation via the median absolute deviation. Journal of Eye Movement Research, 12(8).
75. Wagenmakers EJ, & Farrell S (2004). AIC model selection using Akaike weights. Psychon Bull Rev, 11(1), 192–196.
76. Wang JX, Kurth-Nelson Z, Kumaran D, Tirumala D, Soyer H, Leibo JZ, et al. (2018). Prefrontal cortex as a meta-reinforcement learning system. Nat Neurosci, 21(6), 860–868.
77. Watson MR, Voloh B, Naghizadeh M, & Womelsdorf T (2019). Quaddles: A multidimensional 3-D object set with parametrically controlled and customizable features. Behav Res Methods, 51(6), 2522–2532.
78. Watson MR, Voloh B, Thomas C, Hasan A, & Womelsdorf T (2019). USE: An integrative suite for temporally-precise psychophysical experiments in virtual environments for human, nonhuman, and artificially intelligent agents. J Neurosci Methods, 326, 108374.
79. Westendorff S, Kaping D, Everling S, & Womelsdorf T (2016). Prefrontal and anterior cingulate cortex neurons encode attentional targets even when they do not apparently bias behavior. J Neurophysiol, 116(2), 796–811.
80. Wilson RC, & Collins AG (2019). Ten simple rules for the computational modeling of behavioral data. eLife, 8.
81. Wilson RC, & Niv Y (2011). Inferring relevance in a changing world. Front Hum Neurosci, 5, 189.
82. Womelsdorf T, & Everling S (2015). Long-Range Attention Networks: Circuit Motifs Underlying Endogenously Controlled Stimulus Selection. Trends Neurosci, 38(11), 682–700.
83. Womelsdorf T, Thomas C, Neumann A, Watson MR, Banaie Boroujeni K, Hassani SA, et al. (2021). A Kiosk Station for the Assessment of Multiple Cognitive Domains and Cognitive Enrichment of Monkeys. Frontiers in Behavioral Neuroscience, 15, 1–13.
84. Worthy DA, Otto AR, & Maddox WT (2012). Working-memory load and temporal myopia in dynamic decision making. J Exp Psychol Learn Mem Cogn, 38(6), 1640–1658.

Data Availability Statement

The data and computational modeling code for reproducing the results of the best-fitting model (Fig. 4) are available via a GitHub link that will be activated upon publication of this manuscript.