Abstract
Learning and generalization in spatial domains is often thought to rely on a “cognitive map”, representing relationships between spatial locations. Recent research suggests that this same neural machinery is also recruited for reasoning about more abstract, conceptual forms of knowledge. Yet to what extent do spatial and conceptual reasoning share common computational principles, and what are the implications for behavior? Using a within-subject design, we studied how participants used spatial or conceptual distances to generalize and search for correlated rewards in successive multi-armed bandit tasks. Participant behavior indicated sensitivity to both spatial and conceptual distance, and was best captured by a Bayesian model that formalized distance-dependent generalization and uncertainty-guided exploration as Gaussian Process regression with a radial basis function kernel. The same Gaussian Process model best captured human search decisions and judgments in both domains and could simulate realistic learning curves, with parameter estimates indicating equivalent levels of generalization in the spatial and conceptual tasks. At the same time, we also found characteristic differences between domains. Relative to the spatial domain, participants showed reduced levels of uncertainty-directed exploration and increased levels of random exploration in the conceptual domain. Participants also displayed a one-directional transfer effect, where experience in the spatial task boosted performance in the conceptual task, but not vice versa. While confidence judgments indicated that participants were sensitive to the uncertainty of their knowledge in both tasks, they did not or could not leverage their estimates of uncertainty to guide exploration in the conceptual task. These results support the notion that value-guided learning and generalization recruit cognitive-map-dependent computational mechanisms in spatial and conceptual domains. Yet both behavioral and model-based analyses suggest domain-specific differences in how these representations map onto actions.
Author summary
There is a resurgence of interest in “cognitive maps” based on recent evidence that the hippocampal-entorhinal system encodes both spatial and non-spatial relational information, with far-reaching implications for human behavior. Yet little is known about the commonalities and differences in the computational principles underlying human learning and decision making in spatial and non-spatial domains. We use a within-subject design to examine how humans search for either spatially or conceptually correlated rewards. Using a Bayesian learning model, we find evidence for the same computational mechanisms of generalization across domains. While participants were sensitive to expected rewards and uncertainty in both tasks, how they leveraged this knowledge to guide exploration differed: participants displayed less uncertainty-directed and more random exploration in the conceptual domain. Moreover, experience with the spatial task improved conceptual performance, but not vice versa. These results provide important insights about the degree of overlap between spatial and conceptual cognition.
Introduction
Thinking spatially is intuitive. We remember things in terms of places [1–3], describe the world using spatial metaphors [4, 5], and commonly use concepts like “space” or “distance” in mathematical descriptions of abstract phenomena. In line with these observations, previous theories have argued that reasoning about abstract conceptual information follows the same computational principles as spatial reasoning [6–8]. This has recently gained new support from neuroscientific evidence suggesting that common neural substrates are the basis for knowledge representation across domains [9–13].
One important implication of these accounts is that reinforcement learning [14] in non-spatial domains may rely on a map-like organization of information, supported by the computation of distances or similarities between experiences. These representations of distance facilitate generalization, allowing for predictions about novel stimuli based on their similarity to previous experiences. Here, we ask to what extent the search for rewards depends on the same distance-dependent generalization across two different domains—one defined by spatial location and the other by the abstract features of a Gabor patch—despite potential differences in how the stimuli and their similarities are processed.
We formalize a computational model that incorporates distance-dependent generalization and test it in a within-subject experiment, where either spatial features or abstract conceptual features are predictive of rewards. This allows us to study the extent to which the same organizational structure of cognitive representations is used in both domains, based on examining the downstream behavioral implications for learning, decision making, and exploration.
Whereas early psychological theories described reinforcement learning as merely developing an association between stimuli, responses, and rewards [15–17], more recent studies have recognized that the structure of representations plays an important role in making value-based decisions [11, 18] and is particularly important for knowing how to generalize from limited data to novel situations [19, 20]. This idea dates back to Tolman, who famously argued that both rats and humans extract a “cognitive map” of the environment [21]. This cognitive map encodes relationships between experiences or options, such as the distances between locations in space [22], and—crucially—facilitates flexible planning and generalization. While cognitive maps were first identified as representations of physical spaces, Tolman hypothesized that similar principles may underlie the organization of knowledge in broader and more complex cognitive domains [21].
As was the case with Tolman, neuroscientific evidence for a cognitive map was initially found in the spatial domain, in particular, with the discovery of spatially selective place cells in the hippocampus [23, 24] and entorhinal grid cells that fire along a spatial hexagonal lattice [25]. Together with a variety of other specialized cell types that encode spatial orientation [26, 27], boundaries [28, 29], and distances to objects [30], this hippocampal-entorhinal machinery is often considered to provide a cognitive map facilitating navigation and self-location. Yet more recent evidence has shown that the same neural mechanisms are also active when reasoning about more abstract, conceptual relationships [31–36], characterized by arbitrary feature dimensions [37] or temporal relationships [38, 39]. For example, using a technique developed to detect spatial hexagonal grid-like codes in fMRI signals [40], Constantinescu et al. found that human participants displayed a pattern of activity in the entorhinal cortex consistent with mental travel through a 2D coordinate system defined by the length of a bird’s legs and neck [9]. Similarly, the same entorhinal-hippocampal system has also been found to reflect the graph structure underlying sequences of stimuli [10] or the structure of social networks [41], and even to replay non-spatial representations in the sequential order that characterized a previous decision-making task [42]. At the same time, much evidence indicates that cognitive map-related representations are not limited to medial temporal areas, but also include ventral and orbital medial prefrontal areas [9, 11, 40, 43–45]. Relatedly, a study by Kahnt and Tobler [46] using uni-dimensional variations of Gabor stimuli showed that the generalization of rewards was modulated by dopaminergic activity in the hippocampus, indicating a role of non-spatial distance representations in reinforcement learning.
Based on these findings, we asked whether learning and searching for rewards in spatial and conceptual domains is governed by similar computational principles. Using a within-subject design comparing spatial and non-spatial reward learning, we tested whether participants used perceptual similarities in the same way as spatial distances to generalize from previous experiences and inform the exploration of novel options. To ensure commensurate stimulus discriminability between domains, participants completed a training phase where they were required to reach the same level of proficiency in correctly matching a series of target stimuli (see Methods; Fig 1c). In both domains, rewards were correlated (see S2 Fig), such that nearby or similar options tended to yield similar rewards. To model how participants generalize and explore using either perceptual similarities or spatial distances, we used Gaussian Process (GP) regression [47, 48] as a Bayesian model of generalization based on the principle of function learning. The Bayesian predictions of the GP model generalize to novel options using a common notion of similarity across domains, and provide estimates of expected reward and uncertainty. We tested out-of-sample predictions of the GP model against a Bayesian learner that incorporates uncertainty-guided exploration but no generalization, and investigated differences in parameters governing value-based decision making and uncertainty-directed exploration [49–51].
Participant performance was correlated across tasks and was best captured by the GP model in both domains. We were also able to reliably predict participant judgments about unobserved options based on parameters estimated from the bandit task. Whereas the model parameters indicated similar levels of generalization in both domains, we found lower levels of directed exploration in the conceptual domain, where participants instead showed increased levels of random exploration. Moreover, we also observed an asymmetric task order effect, where performing the spatial task first boosted performance on the conceptual task but not vice versa. These findings provide a clearer picture of both the commonalities and differences in how people reason about and represent both spatial and abstract phenomena in complex reinforcement learning tasks.
Results
129 participants searched for rewards in two successive multi-armed bandit tasks (Fig 1). The spatial task was represented as an 8 × 8 grid, where participants used the arrow keys to move a highlighted square to one of the 64 locations, with each location representing one option (i.e., arm of the bandit). The conceptual task was represented using Gabor patches, where a single patch was displayed on the screen and the arrow keys changed the tilt and stripe frequency (each having 8 discrete values; see S1 Fig), providing a non-spatial domain where similarities are relatively well defined. Each of the 64 options in both tasks produced normally distributed rewards, where the means of each option were correlated, such that similar locations or Gabor patches with similar stripes and tilts yielded similar rewards (S2 Fig), thus providing traction for similarity-guided generalization and search. The strength of reward correlations was manipulated between subjects, with one half assigned to smooth environments (with higher reward correlations) and the other assigned to rough environments (with lower reward correlations). Importantly, both classes of environments had the same expectation of rewards across options.
The spatial and conceptual tasks were performed in counterbalanced order, with each task consisting of an initial training phase (see Methods) and then 10 rounds of bandits. Each round had a different reward distribution (drawn without replacement from the assigned class of environments), and participants were given 20 choices to acquire as many points as possible (later converted to monetary rewards). The search horizon was much smaller than the total number of options, and therefore induced an explore-exploit dilemma and motivated the need for generalization and efficient exploration. The last round of each task was a “bonus round”, where after 15 choices, participants were shown 10 unobserved options (selected at random) and asked to make judgments about the expected reward and their level of confidence (i.e., uncertainty about the expected rewards). These judgments were used to validate the internal belief representations of our models. All data and code, including interactive notebooks containing all analyses in the paper, are publicly available at https://github.com/charleywu/cognitivemaps.
Computational models of learning, generalization, and search
Multi-armed bandit problems [52, 53] are a prominent framework for studying learning, where various reinforcement learning (RL) models [14] are used to model the learning of reward valuations and to predict behavior. A common element of most RL models is some form of prediction-error learning [54, 55], where model predictions are updated based on the difference between the predicted and experienced outcome. One classic example of learning from prediction errors is the Rescorla-Wagner [55] model, in which the expected reward V(⋅) of each bandit is described as a linear combination of weights wt and a one-hot stimulus vector xt representing the current state st:
(1) \( V(\mathbf{x}_t) = \mathbf{w}_t^\top \mathbf{x}_t \)
(2) \( \mathbf{w}_{t+1} = \mathbf{w}_t + \eta \, \delta_t \, \mathbf{x}_t \)
Learning occurs by updating the weights w as a function of the prediction error δt = rt − V(xt), where rt is the observed reward, V(xt) is the reward expectation, and 0 < η ≤ 1 is the learning rate parameter. In our task, we used a Bayesian Mean Tracker (BMT) as a Bayesian variant of the Rescorla-Wagner model [55, 56]. Rather than making point estimates of reward, the BMT makes independent and normally distributed predictions for each state si,t, which are characterized by a mean m and variance v and updated on each trial t via the delta rule (see Methods for details).
Generalization using Gaussian process regression
Yet, an essential aspect of human cognition is the ability to generalize from limited experiences to novel options. Rather than learning independent reward representations for each state, we adopt a function learning approach to generalization [19, 57], where continuous functions represent candidate hypotheses about the world, mapping the space of possible options to some outcome value. For example, a function can map how pressure on the gas pedal is related to the acceleration of a car, or how different amounts of water and fertilizer influence the growth rate of a plant. Crucially, the learned mapping provides estimates even for outcomes that have not been observed, by interpolating or extrapolating from previous experiences.
While the literature on how humans explicitly learn functions extends back to the 1960s [58], more recent approaches have proposed Gaussian Process (GP) regression [47] as a candidate model of human function learning [59–61]. GPs unite previous proposals of rule-based [62, i.e., learning the weights of a particular parametric function] and exemplar-based theories [63, i.e., neural networks predicting similar inputs will produce similar outputs], while also predicting the perceived difficulty of learning different functions [64] and explaining biases in how people extrapolate from limited data [59].
Formally, a GP defines a multivariate-normal distribution P(f) over possible value functions f(s) that map inputs s to outputs y = f(s):
(3) \( f \sim \mathcal{GP}\left(m(\mathbf{s}), k(\mathbf{s}, \mathbf{s}')\right) \)
The GP is fully defined by the mean function m(s), which is frequently set to 0 for convenience without loss of generality [47], and the kernel function k(s, s′), which encodes prior assumptions (or inductive biases) about the underlying function. Here we use the radial basis function (RBF) kernel:
(4) \( k(\mathbf{s}, \mathbf{s}') = \exp\left(-\frac{\lVert \mathbf{s} - \mathbf{s}' \rVert^2}{2\lambda^2}\right) \)
encoding similarity as a smoothly decaying function of the squared Euclidean distance between stimuli s and s′, measured either in spatial or conceptual distance. The length-scale parameter λ encodes the rate of decay, where larger values correspond to broader generalization over larger distances.
Given a set of observations of previously selected states and their associated rewards, the GP makes normally distributed posterior predictions for any novel stimulus s⋆, defined in terms of a posterior mean and variance:
(5) \( m_t(\mathbf{s}_\star) = \mathbf{k}_\star^\top \left(\mathbf{K}_t + \sigma_\varepsilon^2 \mathbf{I}\right)^{-1} \mathbf{y}_t \)
(6) \( v_t(\mathbf{s}_\star) = k(\mathbf{s}_\star, \mathbf{s}_\star) - \mathbf{k}_\star^\top \left(\mathbf{K}_t + \sigma_\varepsilon^2 \mathbf{I}\right)^{-1} \mathbf{k}_\star \)
where \( \mathbf{k}_\star = [k(\mathbf{s}_1, \mathbf{s}_\star), \ldots, k(\mathbf{s}_t, \mathbf{s}_\star)]^\top \) collects the similarities between the novel input and the previously observed inputs, \( \mathbf{K}_t \) is the kernel matrix evaluated at the observed inputs, \( \sigma_\varepsilon^2 \) is the noise variance, and \( \mathbf{y}_t \) is the vector of observed rewards.
The posterior mean corresponds to the expected value of s⋆ while the posterior variance captures the underlying uncertainty in the prediction. Note that the posterior mean can also be rewritten as a similarity-weighted sum:
(7) \( m_t(\mathbf{s}_\star) = \sum_{i=1}^{t} w_i \, k(\mathbf{s}_i, \mathbf{s}_\star) \)
where each si is a previously observed input and the weights are collected in the vector \( \mathbf{w} = \left(\mathbf{K}_t + \sigma_\varepsilon^2 \mathbf{I}\right)^{-1} \mathbf{y}_t \). Intuitively, this means that GP regression is equivalent to a linearly weighted sum, but uses basis functions k(⋅, ⋅) that project the inputs into a feature space, instead of the discrete state vectors. To generate new predictions, every observed reward yi in yt is weighted by the similarity of the associated state si to the candidate state s⋆ based on the kernel similarity. This similarity-weighted sum (Eq 7) is equivalent to an RBF network [65], which has featured prominently in machine learning approaches to value function approximation [14] and as a theory of the neural architecture of human generalization [66] in vision and motor control.
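To make these computations concrete, below is a minimal Python sketch of GP regression with the RBF kernel (Eqs 4–7). The grid layout matches the 8 × 8 search space of the task, but the observations, length-scale, and noise variance are purely illustrative.

```python
import numpy as np

def rbf_kernel(S1, S2, lam):
    """Eq 4: k(s, s') = exp(-||s - s'||^2 / (2 * lam^2)), with length-scale lam (lambda)."""
    sq_dists = np.sum((S1[:, None, :] - S2[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * lam ** 2))

def gp_posterior(S_obs, y_obs, S_star, lam=2.0, noise_var=1.0):
    """Posterior mean (Eq 5) and variance (Eq 6) at novel inputs S_star."""
    K = rbf_kernel(S_obs, S_obs, lam) + noise_var * np.eye(len(S_obs))
    k_star = rbf_kernel(S_obs, S_star, lam)    # similarities to past observations
    w = np.linalg.solve(K, y_obs)              # weight vector of Eq 7
    mean = k_star.T @ w                        # similarity-weighted sum (Eq 7)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)  # k(s*, s*) = 1
    return mean, var

# Usage: predict rewards for all 64 options from three hypothetical observations.
grid = np.array([(x, y) for x in range(8) for y in range(8)], dtype=float)
S_obs = np.array([[0.0, 0.0], [3.0, 4.0], [7.0, 7.0]])
y_obs = np.array([20.0, 65.0, 40.0])
m, v = gp_posterior(S_obs, y_obs, grid)
```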
Uncertainty-directed exploration
In order to transform the Bayesian reward predictions of the BMT and GP models into predictions about participant choices, we use upper confidence bound (UCB) sampling together with a softmax choice rule as a combined model of both directed and random exploration [19, 50, 51].
UCB sampling uses a simple weighted sum of expected reward and uncertainty:
(8) \( q(\mathbf{s}) = m(\mathbf{s}) + \beta \sqrt{v(\mathbf{s})} \)
to compute a value q for each option s, where the exploration bonus β determines how to trade off exploring highly uncertain options against exploiting options with high expected rewards. This simple heuristic—although myopic—produces highly efficient learning by preferentially guiding exploration towards uncertain yet promising options, making it one of the few algorithms with known performance bounds in Bayesian optimization [67]. Recent studies have provided converging evidence for directed exploration in human behavior across a number of domains [19, 50, 68–70].
The UCB values are then put into a softmax choice rule:
(9) \( P(\mathbf{s}) = \frac{\exp\left(q(\mathbf{s})/\tau\right)}{\sum_{\mathbf{s}'} \exp\left(q(\mathbf{s}')/\tau\right)} \)
where the temperature parameter τ controls the amount of random exploration. Higher temperatures lead to more random choice predictions, with τ → ∞ converging on uniform sampling. Lower temperatures make more precise predictions, with τ → 0 converging on an argmax choice rule. Taken together, the exploration bonus β and temperature τ parameters estimated on participant data allow us to assess the relative contributions of directed and undirected exploration, respectively.
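Continuing the sketch above, the full choice model takes only a few more lines; the values of β and τ below are illustrative, not participant estimates.

```python
def ucb(mean, var, beta):
    """Eq 8: q(s) = m(s) + beta * sqrt(v(s))."""
    return mean + beta * np.sqrt(var)

def softmax(q, tau):
    """Eq 9, with max-subtraction for numerical stability."""
    p = np.exp((q - np.max(q)) / tau)
    return p / p.sum()

# Usage: choice probabilities over all 64 options, given the GP posterior above.
choice_probs = softmax(ucb(m, v, beta=0.5), tau=0.1)
```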
Behavioral results
After training, participants were highly proficient in discriminating the stimuli, achieving at least 90% accuracy in both domains (see S3 Fig). Participants were also successful in both bandit tasks, achieving much higher rewards than chance in both conceptual (one-sample t-test: t(128) = 24.6, p < .001, d = 2.2, BF > 100) and spatial tasks (t(128) = 34.6, p < .001, d = 3.0, BF > 100; Fig 2a; see Methods for further details about statistics). In addition, participants could also leverage environmental structure in both domains. Using a two-way mixed ANOVA, we found that both environment (smooth vs. rough: F(1, 127) = 9.4, p = .003, η2 = .05, BF = 13) and task (spatial vs. conceptual: F(1, 127) = 35.8, p < .001, η2 = .06, BF > 100) influenced performance. The stronger reward correlations present in smooth environments facilitated higher performance (two sample t-test: t(127) = 3.1, p = .003, d = 0.5, BF = 12), even though both environments had the same expected reward.
While performance was strongly correlated between the spatial and conceptual tasks (Pearson’s r = .53, p < .001, BF > 100; Fig 2b), participants performed systematically better in the spatial version (paired t-test: t(128) = 6.0, p < .001, d = 0.5, BF > 100). This difference in task performance can largely be explained by a one-directional transfer effect (Fig 2c). Participants performed better on the conceptual task after having experienced the spatial task (t(127) = 2.8, p = .006, d = 0.5, BF = 6.4). This was not the case for the spatial task, where performance did not differ whether performed first or second (t(127) = − 1.7, p = .096, d = 0.3, BF = .67). Thus, experience with spatial search boosted performance on conceptual search, but not vice versa.
Participants learned effectively within each round and obtained higher rewards with each successive choice (Pearson correlation between reward and trial: r = .88, p < .001, BF > 100; Fig 2d). We also found evidence for learning across rounds in the spatial task (Pearson correlation between reward and round: r = .91, p < .001, BF = 15), but not in the conceptual task (r = .58, p = .104, BF = 1.5).
Patterns of search also differed across domains. Comparing the average Manhattan distance between consecutive choices in a two-way mixed ANOVA showed an influence of task (within: F(1, 127) = 13.8, p < .001, η2 = .02, BF = 67) but not environment (between: F(1, 127) = 0.12, p = .73, η2 = .001, BF = 0.25, Fig 2e). This reflected that participants searched in smaller step sizes in the spatial task (t(128) = − 3.7, p < .001, d = 0.3, BF = 59), corresponding to a more local search strategy, but did not adapt their search distance to the environment. Note that each trial began with a randomly sampled initial stimulus, such that participants did not begin near their previous selection (see Methods). The bias towards local search (one-sample t-test comparing search distance against chance: t(128) = − 16.3, p < .001, d = 1.4, BF > 100) is therefore not a side effect of the task characteristics, but both purposeful and effortful (see S4 Fig for additional analysis of search trajectories).
Participants also adapted their search patterns based on reward values (Fig 2f), where lower rewards predicted a larger search distance on the next trial (correlation between previous reward and search distance: r = − .66, p < .001, BF > 100). We analyzed this relationship using a Bayesian mixed-effects regression, where we found previous reward value to be a reliable predictor of search distance (bprevReward = − 0.06, 95% HPD: [−0.06, −0.06]; see S1 Table), while treating participants as random effects. This provides initial evidence for generalization-like behavior, where participants actively avoided areas with poor rewards and stayed near areas with rich rewards.
In summary, we found correlated performance across tasks, but also differences in both performance and patterns of search. Participants were boosted by a one-directional transfer effect, where experience with the spatial task improved performance on the conceptual task, but not the other way around. In addition, participants made larger jumps between choices in the conceptual task and searched more locally in the spatial task. However, participants adapted these patterns in both domains in response to reward values, where lower rewards predicted a larger jump to the next choice.
Modeling results
To better understand how participants navigated the spatial and conceptual tasks, we used computational models to predict participant choices and judgments. Both GP and BMT models implement directed and undirected exploration using the UCB exploration bonus β and softmax temperature τ as free parameters. The models differed in terms of learning, where the GP generalized to novel options, using the length-scale parameter λ to modulate the extent of generalization over spatial or conceptual distances, while the BMT learned the rewards of each option independently (see Methods).
Both models were estimated using leave-one-round-out cross-validation, where we compared goodness of fit using out-of-sample prediction accuracy, described using a pseudo-R2 (Fig 3a). The differences between models were reliable and meaningful, with the GP model making better predictions than the BMT in both the conceptual (t(128) = 3.9, p < .001, d = 0.06, BF > 100) and spatial tasks (t(128) = 4.3, p < .001, d = 0.1, BF > 100). In total, the GP model best predicted 85 participants in the conceptual task and 93 participants in the spatial task (out of 129 in total). Comparing this same out-of-sample prediction accuracy using a Bayesian model selection framework [71, 72] confirmed that the GP had the highest posterior probability (corrected for chance) of being the best model in both tasks (protected exceedance probability; conceptual: pxp(GP) = .997; spatial: pxp(GP) = 1.000; Fig 3b). The superiority of the GP model suggests that generalization to novel options via the use of structural information played a guiding role in how participants searched for rewards (see S6 Fig for additional analyses).
Learning curves
To confirm that the GP model indeed captured learning behavior better in both tasks, we simulated learning curves from each model using participant parameter estimates (Fig 3c; see Methods). The GP model achieved human-like performance in all tasks and environments (comparing aggregate GP and human learning curves: conceptual MSE = 17.7; spatial MSE = 16.6), whereas BMT learning curves were substantially less similar (conceptual MSE = 150.6; spatial MSE = 330.7). In addition, the GP captured the same qualitative differences between domains and environments as our human participants (Fig 3d), with better performance in the spatial than the conceptual task, and in smooth than in rough environments. These patterns were not present in the BMT or random simulations.
Parameter estimates
To understand how generalization and exploration differed between domains, Fig 3e compares the estimated model parameters from the conceptual and spatial tasks. The GP model had three free parameters: the extent of generalization (λ) of the RBF kernel, the exploration bonus (β) of UCB sampling, and the temperature (τ) of the softmax choice rule (see S9 Fig for BMT parameters). Note that the exploration bonus captures exploration directed towards uncertainty, whereas temperature captures random, undirected exploration, which have been shown to be distinct and recoverable parameters [19, 70].
We do not find reliable differences in λ estimates across tasks (Wilcoxon signed-rank test: Z = − 1.2, p = .115, r = − .11, BF = .13). In all cases, we observed lower levels of generalization relative to the true generative model of the underlying reward distributions (λrough = 2, λsmooth = 4; min-BF = 1456), replicating previous findings [19] that found undergeneralization to be largely beneficial in similar settings. Generalization was anecdotally correlated across tasks (Kendall rank correlation: rτ = .13, p = .028, BF = 1.3), providing weak evidence that participants tended to generalize similarly across domains.
Whereas generalization was similar between tasks, there were intriguing differences in exploration. We found substantially lower exploration bonuses (β) in the conceptual task (Z = − 5.0, p < .001, r = − .44, BF > 100), indicating a large reduction of directed exploration, relative to the spatial task. At the same time, there was an increase in temperature (τ) in the conceptual task (Z = 6.9, p < .001, r = − .61, BF > 100), corresponding to an increase in random, undirected exploration. These domain-specific differences in β and τ were not influenced by task order or environment (two-way ANOVA: all p > .05, BF < 1). Despite these differences, we find some evidence of correlations across tasks for directed exploration (rτ = .18, p = .002, BF = 13) and substantial evidence for correlations between random exploration across domains (rτ = .43, p < .001, BF > 100).
Thus, participants displayed similar and somewhat correlated levels of generalization in both tasks, but with markedly different patterns of exploration. Whereas participants engaged in typical levels of directed exploration in the spatial domain (replicating previous studies; [19, 70]), they displayed reduced levels of directed exploration in the conceptual task, substituting instead an increase in undirected exploration. Again, this is not due to a lack of effort, because participants made longer search trajectories in the conceptual domain (see S4a Fig). Rather, this indicates a meaningful difference in how people represent or reason about spatial and conceptual domains in order to decide which are the most promising options to explore.
Bonus round
In order to further validate our behavioral and modeling results, we analyzed participants’ judgments of expected rewards and perceived confidence for 10 unobserved options they were shown during the final “bonus” round of each task (see Methods and Fig 1c). Participants made equally accurate judgments in both tasks (comparing mean absolute error: t(128) = − 0.2, p = .827, d = 0.02, BF = .10; Fig 4a), which were far better than chance (conceptual: t(128) = − 9.2, p < .001, d = 0.8, BF > 100; spatial: t(128) = − 8.4, p < .001, d = 0.7, BF > 100) and correlated between tasks (r = .27, p = .002, BF = 20). Judgment errors were also correlated with performance in the bandit task (r = − .45, p < .001, BF > 100), such that participants who earned higher rewards also made more accurate judgments.
Participants were equally confident in both domains (t(128) = − 0.8, p = .452, d = 0.04, BF = .13; Fig 4b), with correlated confidence across tasks (r = .79, p < .001, BF > 100), suggesting some participants were consistently more confident than others. Ironically, more confident participants also had larger judgment errors (r = .31, p < .001, BF = 91) and performed worse in the bandit task (r = − .28, p = .001, BF = 28).
Using parameter estimates from the search task (excluding the entire bonus round), we computed model predictions for each of the bonus round judgments as an out-of-task prediction analysis. Whereas the BMT invariably made the same predictions for all unobserved options since it does not generalize (Fig 4c), the GP predictions were correlated with participant judgments in both the conceptual task (single-sample t-test of z-transformed individual correlation coefficients against μ = 0: t(128) = 11.0, p < .001, d = 1.0, BF > 100) and the spatial task (t(128) = 11.0, p < .001, d = 1.0, BF > 100). This correspondence between human judgments and model predictions was also confirmed using a Bayesian mixed effects model, where we again treated participants as random effects (bparticipantJudgment = .82, 95% HPD: [0.75, 0.89]; see S2 Table for details).
Not only was the GP able to predict judgments about expected reward, but it also captured confidence ratings. Fig 4d shows how the highest confidence ratings corresponded to the lowest uncertainty estimates made by the GP model. This effect was also found in the raw data, where we again used a Bayesian mixed effects model to regress participant confidence judgments onto the GP uncertainty predictions (bparticipantJudgment = − 0.02, 95% HPD: [-0.02, -0.01]; see S2 Table).
Thus, participant search behavior was consistent with our GP model, and we were also able to make accurate out-of-task predictions about both expected reward and confidence judgments using parameters estimated from the search task. These predictions validate the internal learning model of the GP, since reward predictions depend only on the generalization parameter λ. Altogether, our results suggest domain differences were not due to differences in how participants computed or represented expected reward and uncertainty, since they were equally good at judging their uncertainty in the bonus rounds of both domains. Rather, these diverging patterns of search arose from differences in exploration, where participants substantially reduced their level of exploration directed towards uncertain options in the conceptual domain.
Discussion
Previous theories of cognitive maps [21, 32–34] have argued that reasoning in abstract domains follows similar computational principles as in spatial domains, for instance, sharing a common approach to computing similarities between experiences. These accounts imply that the shared notion of similarity should influence how people generalize from past outcomes, and also how they balance between sampling new and informative options as opposed to exploiting known options with high expected rewards.
Here, we investigated to what extent learning and searching for rewards are governed by similar computational principles in spatial and conceptual domains. Using a within-subject design, we studied participant behavior in both spatially and conceptually correlated reward environments. Comparing different computational models of learning and exploration, we found that a Gaussian Process (GP) model that incorporated distance-based generalization, and hence a cognitive map of similarities, best predicted participants’ behavior in both domains, with parameter estimates indicating equivalent levels of generalization. Using these parameters, our model was able to simulate human-like learning curves and make accurate out-of-task predictions about participant reward estimations and confidence ratings in a final bonus round. This model-based evidence for similar distance-based decision making in both domains was also in line with our behavioral results. Performance was correlated across domains and benefited from higher outcome correlations between similar bandit options (i.e., smooth vs. rough). Subsequent choices tended to be more local than expected by chance, and similar options were more likely to be chosen after a high-reward than a low-reward outcome.
In addition to revealing similarities, our modeling and behavioral analyses provided a diagnostic lens into differences between spatial and conceptual domains. Whereas we found similar levels of generalization in both tasks, patterns of exploration were substantially different. Although participants showed clear signs of directed exploration (i.e., seeking out more uncertain options) in the spatial domain, this was notably reduced in the conceptual task. However, as if in compensation, participants increased their random exploration in the conceptual task. This implies a reliable shift in sampling strategies but not in generalization. Thus, even though the computational principles underpinning reasoning in both domains are indeed similar, how these computations are mapped onto actions can vary substantially. Moreover, participants obtained more rewards and sampled more locally in the spatial domain. We also found a one-directional transfer effect, where experience with the spatial task boosted performance on the conceptual task, but not vice versa. These findings shed new light onto the computational mechanisms of generalization and decision making, suggesting a universality of generalization and a situation-specific adaptation of decision-making policies.
Related work
Our findings also contribute to a number of other cognitive models and theories. According to the successor representation (SR; [18]) framework, hippocampal cognitive maps reflect predictions of expected future state occupancy [73–75]. This provides a similarity metric based on transition dynamics, where an analytic method for computing the SR in closed form is to assume random transitions through the state space. This assumption of a random policy produces a nearly identical similarity metric as the RBF kernel [76], with exact equivalencies in certain cases [77].
However, the SR can also be learned online using the Temporal-Difference learning algorithm, leading to asymmetric representations of distance that are skewed by the distance of travel [73, 78]. Recent work building on Kohonen maps has also suggested that the distribution of the experienced stimuli in feature space will have implications for the activation profiles of grid cells and the resulting cognitive map [79].
In our current study, we have focused on the simplifying case of a cognitive map learned through a random policy. This context was induced by having stimuli uniformly distributed over the search space and using a training phase involving extensive and random trajectories over the search space (i.e., matching random targets from random starting points). While this assumption is not always met in real-life domains, it provides a useful starting point and allows us to directly compare behavior in spatial and conceptual domains.
Previous work has also investigated transfer across domains [80], where inferences about the transition structure in one task can be generalized to other tasks. Whereas we used identical transition structures in both tasks, we nevertheless found asymmetric transfer between domains. A key question underlying the nature of transfer is the remapping of representations [81, 82], which can be framed as a hidden state-space inference problem. Different levels of prior experience with the spatial and conceptual stimuli could give rise to different preferences for reuse of task structure as opposed to learning a novel structure. This may be a potential source of the asymmetric transfer we measured in task performance.
Additionally, clustering methods (e.g., [79]) can also provide local approximations of GP inference by making predictions about novel options based on the mean of a local cluster. For instance, a related reward-learning task on graph structures [76] found that a k-nearest neighbors model provided a surprisingly effective heuristic for capturing aspects of human judgments and decisions. However, a crucial limitation of any clustering model is that it would be incapable of learning and extrapolating directional trends, which are a key feature of human function learning [59, 60]. Alternatively, clustering could also play a role in approximate GP inference [83], by breaking up the inference problem into smaller chunks or by considering only a subset of inputs. Future work should explore the question of how human inference scales with the complexity of the data.
Lastly, the question of “how the cognitive map is learned” is distinct from the question of “how the cognitive map is used”. Here, we have focused on the latter, and used the RBF kernel to provide a map based on the assumption of random transitions, similar to a random-policy implementation of the SR. While both the SR and GP provide a theory of how people utilize a cognitive map for performing predictive inferences, only the GP provides a theory about representations of uncertainty via Bayesian predictions of reward. These representations of uncertainty are a key feature that sets the GP apart from the SR. Psychologically, GP uncertainty estimates systematically capture participant confidence judgments and provide the basis for uncertainty-directed exploration. This distinction may also be central to the different patterns of search we observed in spatial and non-spatial domains, where a reduction in uncertainty-directed exploration may also reflect computational differences in the structure of inference. However, the exact nature of these representations remains an open question for future neuroimaging research.
Future directions
Several questions about the link between cognitive maps across domains remain unanswered by our current study and are open for future investigations. Why did we find differences in exploration across domains, even though the tasks were designed to be as equivalent as possible, including requiring commensurate stimulus discriminability during the pre-task training phase? Currently, our model can capture but not fully explain these differences in search behavior, since it treats both domains as equivalent generalization and exploration problems.
One possible explanation is a different representation of spatial and non-spatial information, or different computations acting on those representations. Recent experimental work has demonstrated that representations of spatial and non-spatial domains may be processed within the same neural systems [9, 12], suggesting representational similarities. But in our study, it remains possible that different patterns of exploration could instead result from a different visual presentation of information in the spatial and the non-spatial task. It is, for example, conceivable that exploration in a (spatially or non-spatially) structured environment depends on the transparency of the structure in the stimulus material, or the alignment of the input modality. In our case, the spatial structure was embedded in the stimulus itself, whereas the conceptual structure was not. Additionally, the arrow key inputs may have been more intuitive for manipulating the spatial stimuli. While generalization could be observed in both situations, directed exploration might require more explicitly accessible information about structural relationships or be facilitated by more intuitively mappable inputs. Previous work used a task where both spatial and conceptual features were simultaneously presented [84, i.e., conceptual stimuli were shuffled and arranged on a grid], yet only spatial or only conceptual features predicted rewards. However, differences in the saliency of spatial and conceptual features meant participants were highly influenced by spatial features, even when they were irrelevant. The present study was designed to overcome these issues by presenting only task-specific features, yet future work should address the computational features that allow humans to leverage structured knowledge of the environment to guide exploration. There is also a wide range of alternative non-spatial stimuli that we have not tested (for instance auditory [12] or linguistic stimuli [85, 86]), which could be considered more “conceptual” than our Gabor stimuli or may be more familiar to participants. Thus, it is an open empirical question to determine the limits to which spatial and different kinds of conceptual stimuli can be described using the same computational framework.
Our model also does not account for attentional mechanisms [87] or working memory constraints [88, 89], which may play a crucial role in influencing how people integrate information differently across domains [90]. To ask whether feature integration is different between domains, we implemented a variant of our GP model using a Shepard kernel [65], which used an additional free parameter estimating the level of integration between the two feature dimensions (S10 Fig). This model did not reveal strong differences in feature integration, yet replicated our main findings with respect to changes in exploration. Additional analyses showed asymmetries in attention to different feature dimensions, an effect modulated by task order (S4d–S4f Fig). Task order also modulated performance differences between domains, which only appeared when the conceptual task was performed before the spatial task (Fig 2c). Experience with the spatial task may have facilitated a more homogeneous mapping of the conceptual stimuli into a 2D similarity space, which in turn facilitated better performance. This asymmetric transfer may support the argument that spatial representations have been “exapted” to other more abstract domains [6–8]. For example, experience of different resource distributions in a spatial search task was found to influence behavior in a word generation task, where participants exposed to sparser rewards in space generated sparser semantic clusters of words [91]. Thus, while both spatial and conceptual knowledge are capable of being organized into a common map-like representation, there may be domain differences in terms of the ease of learning such a map and asymmetries in the transfer of knowledge. Future research should investigate this phenomenon with alternative models that make stronger assumptions about representational differences across domains.
We also found no differences in predictions and uncertainty estimates about unseen options in the bonus round. This means that participants generalized and managed to track the uncertainties of unobserved options similarly in both domains, yet did not or could not leverage their representations of uncertainty for performing directed exploration as effectively in the conceptual task. Alternatively, differences in random exploration could also arise from limited computational precision during the learning of action values [92]. Thus, the change in random exploration we observed may be due to different computational demands across domains. Similar shifts towards random exploration have also been observed under direct cognitive-load manipulations, such as adding working memory load [93] or limiting the available decision time [94].
Finally, our current experiment only examined similarities between spatial and conceptual domains when the underlying structure was the same in both tasks. Future studies could expand this approach across different domains, such as logical rule learning, numerical comparisons, or semantic similarities. Additionally, structure learned in one domain could transfer to slightly changed structures within the same domain, or even to entirely different domains with different structures. A truly all-encompassing model of generalization should capture transfer across both domains and structural changes. Even though several recent studies have advanced our understanding of how people transfer knowledge across graph structures [80], state similarities in multi-task reinforcement learning [95], and target hypotheses supporting generalization [90], whether or not all of these recruit the same computational principles and neural machinery remains to be seen.
Conclusion
We used a rich experimental paradigm to study how people generalize and explore both spatially and conceptually correlated reward environments. While people employed similar principles of generalization in both domains, we found a substantial shift in exploration, from more uncertainty-directed exploration in the spatial task to more random exploration in the conceptual domain. These results enrich our understanding of the principles connecting generalization and search across different domains and pave the way for future cognitive and neuroscientific investigations.
Methods
Participants and design
140 participants were recruited through Amazon Mechanical Turk (requiring a 95% approval rate and 100 previously approved HITs) for a two-part experiment, where only those who had completed part one were invited back for part two. In total, 129 participants completed both parts and were included in the analyses (55 female; mean age = 35, SD = 9.5). Participants were paid $4.00 for each part of the experiment, with those completing both parts being paid an additional performance-contingent bonus of up to $10.00. Participants earned $15.6 ± 1.0 and spent 54 ± 19 minutes completing both parts. There was an average gap of 18 ± 8.5 hours between the two parts of the experiment.
We varied the task order between subjects, with participants completing the spatial and conceptual task in counterbalanced order in separate sessions. We also varied between subjects the extent of reward correlations in the search space by randomly assigning participants to one of two different classes of environments (smooth vs. rough), with smooth environments corresponding to stronger correlations, and the same environment class used for both tasks (see below).
Ethics statement
The study was approved by the ethics committee of the Max Planck Institute for Human Development (A 2019/27) and all participants gave written informed consent.
Materials and procedure
Each session consisted of a training phase, the main search task, and a bonus round. At the beginning of each session, participants were required to complete a training task to familiarize themselves with the stimuli (spatial or conceptual), the inputs (arrow keys and spacebar), and the search space (8 × 8 feature space). Participants were shown a series of randomly selected targets and were instructed to use the arrow keys to modify a single selected stimulus (i.e., adjusting the stripe frequency and angle of a Gabor patch or moving the location of a spatial selector, Fig 1c) in order to match a target stimulus displayed below. The target stayed visible during the trial and did not have to be held in memory. The space bar was used to make a selection, and feedback was provided for 800 ms (correct or incorrect). Participants were required to complete at least 32 training trials and were allowed to proceed to the main task once they had achieved at least 90% accuracy on a run of 10 trials (i.e., 9 out of 10). See S3 Fig for analysis of the training data.
After completing the training, participants were shown instructions for the main search task and had to complete three comprehension questions (S11 and S12 Figs) to ensure full understanding of the task. Specifically, the questions were designed to ensure participants understood that the spatial or conceptual features predicted reward. Each search task comprised 10 rounds of 20 trials each, with a different reward function sampled without replacement from the set of assigned environments. The reward function specified how rewards mapped onto either the spatial or conceptual features, where participants were told that options with either similar spatial features (Spatial task) [19, 96] or similar conceptual features (Conceptual task) [20, 57] would yield similar rewards. Participants were instructed to accumulate as many points as possible, which were later converted into monetary payoffs.
The tenth round of each session was a “bonus round”, with additional instructions shown at the beginning of the round. The round began as usual, but after 15 choices, participants were asked to make judgments about the expected rewards (input range: [1, 100]) and their level of confidence (Likert scale from least to most confident: [0, 10]) for 10 unrevealed targets. These targets were uniformly sampled from the set of unselected options during the current round. After the 10 judgments, participants were asked to make a forced choice between the 10 options. The reward for the selected option was displayed and the round continued as normal. All behavioral and computational modeling analyses exclude the last round, except for the analysis of the bonus round judgments.
Spatial and conceptual search tasks
Participants used the arrow keys to either move a highlighted selector in the spatial task or change the features (tilt and stripe frequency) of the Gabor stimuli in the conceptual task (S1 Fig). On each round, participants were given 20 trials to acquire as many cumulative rewards as possible. A selection was made by pressing the space bar, and then participants were given feedback about the reward for 800 ms, with the chosen option and reward value added to the history at the bottom of the screen. At the beginning of each trial, the starting position of the spatial selector or the displayed conceptual stimulus was randomly sampled from a uniform distribution. Each reward observation included normally distributed noise, \( \varepsilon \sim \mathcal{N}(0, 1) \), where the rewards for each round were scaled to a uniformly sampled maximum value in the range of 80 to 95, so that the value of the global optimum in each round could not easily be guessed.
Participants were given feedback about their performance at the end of each round in terms of the ratio of their average reward to the global maximum, expressed as a percentage (e.g., “You have earned 80% of the maximum reward you could have earned on this round”). The performance bonus (up to $10.00) was calculated based on the cumulative performance of each round and across both tasks.
Bonus round judgments
In both tasks the last round was a “bonus round”, which solicited judgments about the expected reward and their level of confidence for 10 unrevealed options. Participants were informed that the goal of the task remained the same (maximize cumulative rewards), but that after 15 selections, they would be asked to provide judgments about 10 randomly selected options, which had not yet been explored. Judgments about expected rewards were elicited using a slider from 1 to 100 (in increments of 1), while judgments about confidence were elicited using a slider from 0 to 10 (in increments of 1), with the endpoints labeled ‘Least confident’ and ‘Most confident’. After providing the 10 judgments, participants were asked to select one of the options they just rated, and subsequently completed the round like all others.
Environments
All environments were sampled from a GP prior parameterized with a radial basis function (RBF) kernel (Eq 4), where the length-scale parameter (λ) determines the rate at which the correlations of rewards decay over (spatial or conceptual) distance. Higher λ-values correspond to stronger correlations. We generated 40 samples of each type of environment, using λrough = 2 and λsmooth = 4, which were sampled without replacement and used as the underlying reward function in each task (S2 Fig). Environment type was manipulated between subjects, with the same environment type used in both conceptual and spatial tasks.
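For illustration, such environments can be generated as in the following Python sketch, which reuses the rbf_kernel function from the earlier sketch; the jitter term and the seed are implementation details, and the raw samples would still need to be rescaled to the reward range used in the task.

```python
import numpy as np

def sample_environment(lam, size=8, seed=None):
    """Draw one reward function over a size x size grid from a GP prior with an RBF kernel."""
    rng = np.random.default_rng(seed)
    grid = np.array([(x, y) for x in range(size) for y in range(size)], dtype=float)
    K = rbf_kernel(grid, grid, lam) + 1e-8 * np.eye(size * size)  # jitter for stability
    return rng.multivariate_normal(np.zeros(size * size), K).reshape(size, size)

rough_env = sample_environment(lam=2.0)   # lambda_rough = 2
smooth_env = sample_environment(lam=4.0)  # lambda_smooth = 4
```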
Models
Bayesian mean tracker
The Bayesian Mean Tracker (BMT) is a simple but widely-applied associative learning model [69, 97, 98], which is a special case of the Kalman Filter with time-invariant reward distributions. The BMT can also be interpreted as a Bayesian variant of the Rescorla-Wagner model [56], making predictions about the rewards of each option j in the form of a normally distributed posterior:
(10) \( p(\mu_j \mid \mathcal{D}_t) = \mathcal{N}\left(m_{j,t}, v_{j,t}\right) \)
where \( \mathcal{D}_t \) denotes the observations up to trial t.
The posterior mean mj,t and variance vj,t are updated iteratively using a delta-rule update based on the observed reward yt when option j is selected at trial t:
(11) \( m_{j,t} = m_{j,t-1} + \delta_{j,t} \, G_{j,t} \left(y_t - m_{j,t-1}\right) \)
(12) \( v_{j,t} = \left(1 - \delta_{j,t} \, G_{j,t}\right) v_{j,t-1} \)
where δj,t = 1 if option j was chosen on trial t, and 0 otherwise. Rather than having a fixed learning rate, the BMT scales updates based on the Kalman Gain Gj,t, which is defined as:
(13) \( G_{j,t} = \frac{v_{j,t-1}}{v_{j,t-1} + \theta_\varepsilon^2} \)
where \( \theta_\varepsilon^2 \) is the error variance, which is estimated as a free parameter. Intuitively, the estimated mean of the chosen option mj,t is updated based on the prediction error yt − mj,t−1 and scaled by the Kalman Gain Gj,t (Eq 11). At the same time, the estimated variance vj,t is reduced by a factor of 1 − Gj,t, which is in the range [0, 1] (Eq 12). The error variance can be interpreted as an inverse sensitivity, where smaller values result in more substantial updates to the mean mj,t and larger reductions of uncertainty vj,t.
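A minimal Python sketch of these updates is given below; the prior mean, prior variance, and error variance are illustrative placeholders rather than fitted values.

```python
import numpy as np

def bmt_update(m, v, j, y, theta_eps_sq=1.0):
    """Update posterior mean m and variance v of chosen option j after reward y (Eqs 11-13)."""
    G = v[j] / (v[j] + theta_eps_sq)   # Kalman gain (Eq 13)
    m, v = m.copy(), v.copy()
    m[j] += G * (y - m[j])             # delta-rule update of the mean (Eq 11)
    v[j] *= (1.0 - G)                  # reduction of uncertainty (Eq 12)
    return m, v

m = np.full(64, 50.0)    # prior means over the 64 options
v = np.full(64, 500.0)   # prior variances
m, v = bmt_update(m, v, j=10, y=72.0)
```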
Model cross-validation
As with the behavioral analyses, we omit the 10th “bonus round” in our model cross-validation. For each of the other nine rounds, we use cross validation to iteratively hold out a single round as a test set, and compute the maximum likelihood estimate using differential evolution [99] on the remaining eight rounds. Model comparisons use the summed out-of-sample prediction error on the test set, defined in terms of log loss (i.e., negative log likelihood).
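A sketch of this procedure is given below, where model_nll is a hypothetical function returning the summed negative log likelihood of a parameter vector on a set of rounds.

```python
from scipy.optimize import differential_evolution

def cross_validate(rounds, bounds, model_nll):
    """Leave-one-round-out CV: fit on eight rounds, score log loss on the held-out round."""
    test_loss = 0.0
    for held_out in range(len(rounds)):
        train = [r for i, r in enumerate(rounds) if i != held_out]
        fit = differential_evolution(lambda theta: model_nll(theta, train), bounds)
        test_loss += model_nll(fit.x, [rounds[held_out]])  # out-of-sample log loss
    return test_loss
```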
Predictive accuracy
As an intuitive statistic for goodness of fit, we report predictive accuracy as a pseudo-R2:
(14) \( R^2 = 1 - \frac{\log \mathcal{L}(\mathcal{M}_k)}{\log \mathcal{L}(\mathcal{M}_{\text{rand}})} \)
comparing the out-of-sample log loss of a given model Mk against a random model Mrand. R2 = 0 indicates chance performance, while R2 = 1 is a theoretically perfect model.
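With a uniform random baseline, this statistic reduces to the following sketch, where the number of options (64) matches the present task.

```python
import numpy as np

def pseudo_r2(nll_model, n_trials, n_options=64):
    """Eq 14 using negative log likelihoods; the random model chooses uniformly."""
    nll_random = n_trials * np.log(n_options)  # log loss of uniform random choice
    return 1.0 - nll_model / nll_random
```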
Protected exceedance probability
The protected exceedance probability (pxp) is defined in terms of a Bayesian model selection framework for group studies [71, 72]. Intuitively, it can be described as a random-effects analysis, where models are treated as random effects and are allowed to differ between subjects. Inspired by a Pólya urn model, we can imagine a population containing K different types of models (i.e., people best described by each model), much like an urn containing different colored marbles. If we assume that there is a fixed but unknown distribution of models in the population, what is the probability of each model being more frequent in the population than all other models under consideration?
This is modelled hierarchically, using variational Bayes to estimate the parameters of a Dirichlet distribution describing the posterior probabilities of each model P(mk|y) given the data y. The exceedance probability is thus defined as the posterior probability that the frequency of a model is larger than all other models under consideration:
(15) \( \varphi_k = p\left(r_k > r_{k'} \, \forall k' \neq k \mid \mathbf{y}\right) \)
where \( r_k \) denotes the frequency of model k in the population.
This approach is extended in [72] by correcting for chance, based on the Bayesian Omnibus Risk (BOR), which is the posterior probability that all model frequencies are equal:
(16) \( \text{pxp}_k = (1 - \text{BOR}) \, \varphi_k + \frac{\text{BOR}}{K} \)
This produces the protected exceedance probability (pxp) reported throughout this article, which we computed using the implementation available at https://github.com/sjgershm/mfit/blob/master/bms.m.
Simulated learning curves
We simulated each model by sampling (with replacement) from the set of cross-validated participant parameter estimates, and performing search on a simulated bandit task. We performed 10,000 simulations for each combination of model, environment, and domain (spatial vs. conceptual).
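The following sketch shows what one simulated round might look like, reusing the GP, UCB, and softmax functions defined above; in the actual analysis, the parameters (λ, β, τ) would be sampled from the participant estimates rather than passed as fixed values.

```python
import numpy as np

def simulate_round(env, lam, beta, tau, n_trials=20, noise_sd=1.0, seed=None):
    """Simulate one 20-trial bandit round with GP-UCB search on a reward environment."""
    rng = np.random.default_rng(seed)
    grid = np.array([(x, y) for x in range(8) for y in range(8)], dtype=float)
    S_obs, y_obs, rewards = [], [], []
    for t in range(n_trials):
        if S_obs:
            m, v = gp_posterior(np.array(S_obs), np.array(y_obs), grid, lam)
        else:
            m, v = np.zeros(64), np.ones(64)   # GP prior before any observations
        p = softmax(ucb(m, v, beta), tau)
        choice = rng.choice(64, p=p)
        reward = env.flatten()[choice] + rng.normal(0.0, noise_sd)
        S_obs.append(grid[choice]); y_obs.append(reward); rewards.append(reward)
    return np.array(rewards)   # the learning curve of this simulated round
```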
Bonus round predictions
Bonus round predictions used each participant’s estimated parameters to predict their judgments about expected reward and confidence. Because rewards in each round were randomly scaled to a different global maximum, we also rescaled the model predictions to align them with the observed rewards and participant judgments.
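One simple form such a rescaling can take is a linear map from the model's predictive range onto the observed reward range (a sketch under our own assumptions; the exact scheme is in the analysis code linked under Data Availability):

```r
# Linearly map model predictions onto the range of the observed rewards,
# so predictions and participant judgments share a common scale (illustrative).
rescale_preds <- function(pred, observed) {
  (pred - min(pred)) / (max(pred) - min(pred)) *
    (max(observed) - min(observed)) + min(observed)
}

rescale_preds(pred = c(0.1, 0.5, 0.9), observed = c(10, 55, 82))  # toy values
```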
Statistical tests
Comparisons
We report both frequentist and Bayesian statistics. Frequentist tests are reported as Student’s t-tests (specified as either paired or independent) for parametric comparisons, while the Mann-Whitney U test or the Wilcoxon signed-rank test is used for non-parametric comparisons (for independent or paired samples, respectively). Each of these tests is accompanied by a Bayes factor (BF) quantifying the relative evidence the data provide in favor of the alternative hypothesis (HA) over the null (H0), which we interpret following [100].
Parametric comparisons are tested using the default two-sided Bayesian t-test for either independent or dependent samples, where both use a Jeffreys-Zellner-Siow prior with its scale set to √2/2, as suggested by [101]. All statistical tests are non-directional, as defined by a symmetric prior (unless otherwise indicated).
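Tests of this kind are implemented in the BayesFactor R package; a sketch with simulated data, where rscale = sqrt(2)/2 matches the default scale above:

```r
library(BayesFactor)

# Default two-sided JZS Bayesian t-test for paired samples,
# with the Cauchy prior scale set to sqrt(2)/2.
x <- rnorm(30, mean = 0.5)  # simulated paired observations
y <- rnorm(30, mean = 0.3)
ttestBF(x, y, paired = TRUE, rscale = sqrt(2)/2)
```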
Non-parametric comparisons are tested using either the frequentist Mann-Whitney U test for independent samples or the Wilcoxon signed-rank test for paired samples. In both cases, the Bayesian test is based on performing posterior inference over the test statistic (Kendall’s rτ for the Mann-Whitney U test and the standardized effect size for the Wilcoxon signed-rank test) and assigning a prior using parametric yoking [102]. This leads to a posterior distribution for Kendall’s rτ or the standardized effect size r, which yields an interpretable Bayes factor via the Savage-Dickey density ratio test. The null hypothesis posits that parameters do not differ between the two groups, while the alternative hypothesis posits an effect and assigns the effect size a Cauchy distribution with the scale parameter set to 1/√2.
Correlations
For testing linear correlations with Pearson’s r, the Bayesian test is based on Jeffreys’s [103] test for linear correlation and assumes a shifted, scaled beta prior distribution for r, with the scale parameter set to 1 [104].
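This test is also available in the BayesFactor R package; a sketch with simulated data, using the prior width given above:

```r
library(BayesFactor)

# Bayesian test for a linear correlation, with a stretched beta prior
# on the population correlation and width (rscale) set to 1.
a <- rnorm(50)
b <- 0.4 * a + rnorm(50)  # simulated correlated variable
correlationBF(a, b, rscale = 1)
```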
For testing rank correlations with Kendall’s tau, the Bayesian test is based on parametric yoking to define a prior over the test statistic [105], with Bayesian inference yielding a posterior distribution for rτ. The Savage-Dickey density ratio test is then used to produce an interpretable Bayes factor.
ANOVA
We use a two-way mixed-design analysis of variance (ANOVA) to compare means, with a fixed-effects factor (smooth vs. rough environments) as a between-subjects variable and a random-effects factor (conceptual vs. spatial) as a within-subjects variable. To compute the Bayes factor, we assume independent g-priors [106] for each effect size, where each g-value is drawn from an inverse chi-square prior with a single degree of freedom, and we assume a Jeffreys prior on the aggregate mean and scale factor. Following [107], we compute the Bayes factor by integrating the likelihoods with respect to the priors on the parameters, using Monte Carlo sampling to approximate the g-priors. The Bayes factor reported in the text can be interpreted as the log-odds of the model relative to an intercept-only null model.
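These default g-prior Bayes factors [107] are implemented in the BayesFactor R package; a sketch with simulated data and invented variable names:

```r
library(BayesFactor)

# Simulated mixed design: environment varies between subjects,
# domain varies within subjects.
df <- expand.grid(id = factor(1:40), domain = factor(c("conceptual", "spatial")))
df$environment <- factor(ifelse(as.integer(df$id) <= 20, "smooth", "rough"))
df$score <- rnorm(nrow(df), mean = 50, sd = 10)

# Default ANOVA Bayes factors with g-priors; subject enters as a random factor.
anovaBF(score ~ environment * domain + id, data = df, whichRandom = "id")
```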
Mixed effects regression
Mixed effects regressions are performed in a Bayesian framework with brms [108] using MCMC methods (No-U-Turn sampling [109] with the proposal acceptance probability set to .99). In all models, we use a maximal random effects structure [110] and treat participants as a random intercept. Following [111], we use generic weakly informative priors for all regression parameters (Eqs 17–19).
All models were estimated over four chains of 4000 iterations, with a burn-in period of 1000 samples.
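For illustration, a brms model of this general form might be specified as follows (a sketch: the data and variable names are invented, and the normal(0, 1) prior on the coefficients is our assumption rather than a reproduction of Eqs 17–19):

```r
library(brms)

# Simulated data; the variables are invented for illustration only.
df <- data.frame(id = factor(rep(1:20, each = 10)),
                 distance = runif(200),
                 domain = sample(c("spatial", "conceptual"), 200, replace = TRUE),
                 judgment = rnorm(200, mean = 50, sd = 10))

# Maximal random effects structure with participants as a grouping factor;
# the prior on the regression coefficients is an assumed weakly informative one.
m <- brm(judgment ~ distance * domain + (1 + distance * domain | id),
         data = df,
         prior = set_prior("normal(0, 1)", class = "b"),
         chains = 4, iter = 4000, warmup = 1000,
         control = list(adapt_delta = 0.99))
```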
Acknowledgments
We thank Daniel Reznik, Nicholas Franklin, Samuel Gershman, Christian Doeller, and Fiery Cushman for helpful discussions.
Data Availability
All data and analysis code are available from https://github.com/charleywu/cognitivemaps.
Funding Statement
ES is supported by the Max Planck Society and the Jacobs Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. James W. The Principles of Psychology. New York: Dover; 1890.
- 2. Yates FA. Art of Memory. Routledge; 2013.
- 3. Dresler M, Shirer WR, Konrad BN, Müller NC, Wagner IC, Fernández G, et al. Mnemonic training reshapes brain networks to support superior memory. Neuron. 2017;93:1227–1235. doi:10.1016/j.neuron.2017.02.003.
- 4. Landau B, Jackendoff R. Whence and whither in spatial language and spatial cognition? Behavioral and Brain Sciences. 1993;16:255–265. doi:10.1017/S0140525X00029927.
- 5. Lakoff G, Johnson M. Metaphors We Live By. University of Chicago Press; 2008.
- 6. Todd PM, Hills TT, Robbins TW. Cognitive search: Evolution, algorithms, and the brain. MIT Press; 2012.
- 7. Hills TT, Todd PM, Goldstone RL. Search in external and internal spaces: Evidence for generalized cognitive search processes. Psychological Science. 2008;19:802–808. doi:10.1111/j.1467-9280.2008.02160.x.
- 8. Hills TT. Animal foraging and the evolution of goal-directed cognition. Cognitive Science. 2006;30:3–41. doi:10.1207/s15516709cog0000_50.
- 9. Constantinescu AO, O’Reilly JX, Behrens TE. Organizing conceptual knowledge in humans with a gridlike code. Science. 2016;352:1464–1468. doi:10.1126/science.aaf0941.
- 10. Garvert MM, Dolan RJ, Behrens TE. A map of abstract relational knowledge in the human hippocampal–entorhinal cortex. eLife. 2017;6:e17086. doi:10.7554/eLife.17086.
- 11. Schuck NW, Cai MB, Wilson RC, Niv Y. Human orbitofrontal cortex represents a cognitive map of state space. Neuron. 2016;91:1402–1412. doi:10.1016/j.neuron.2016.08.019.
- 12. Aronov D, Nevers R, Tank DW. Mapping of a non-spatial dimension by the hippocampal–entorhinal circuit. Nature. 2017;543(7647):719. doi:10.1038/nature21692.
- 13. Solomon EA, Lega BC, Sperling MR, Kahana MJ. Hippocampal theta codes for distances in semantic and temporal spaces. Proceedings of the National Academy of Sciences. 2019;116(48):24343–24352. doi:10.1073/pnas.1906729116.
- 14. Sutton RS, Barto AG. Reinforcement learning: An introduction. Cambridge: MIT Press; 1998.
- 15. Thorndike EL. Animal intelligence: An experimental study of the associative processes in animals. The Psychological Review: Monograph Supplements. 1898;2(4):i.
- 16. Pavlov IP. Conditional reflexes: an investigation of the physiological activity of the cerebral cortex. Oxford University Press; 1927.
- 17. Skinner BF. The behavior of organisms: An experimental analysis. New York: Appleton-Century; 1938.
- 18. Dayan P. Improving generalization for temporal difference learning: The successor representation. Neural Computation. 1993;5:613–624. doi:10.1162/neco.1993.5.4.613.
- 19. Wu CM, Schulz E, Speekenbrink M, Nelson JD, Meder B. Generalization guides human exploration in vast decision spaces. Nature Human Behaviour. 2018;2:915–924. doi:10.1038/s41562-018-0467-4.
- 20. Stojić H, Schulz E, Analytis PP, Speekenbrink M. It’s new, but is it good? How generalization and uncertainty guide the exploration of novel options. Journal of Experimental Psychology: General. 2020.
- 21. Tolman EC. Cognitive maps in rats and men. Psychological Review. 1948;55:189–208. doi:10.1037/h0061626.
- 22. Thorndyke PW. Distance estimation from cognitive maps. Cognitive Psychology. 1981;13(4):526–550. doi:10.1016/0010-0285(81)90019-0.
- 23. O’Keefe J, Dostrovsky J. The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain Research. 1971.
- 24. O’Keefe J. A review of the hippocampal place cells. Progress in Neurobiology. 1979;13(4):419–439. doi:10.1016/0301-0082(79)90005-4.
- 25. Hafting T, Fyhn M, Molden S, Moser MB, Moser EI. Microstructure of a spatial map in the entorhinal cortex. Nature. 2005;436(7052):801. doi:10.1038/nature03721.
- 26. Taube JS, Muller RU, Ranck JB. Head-direction cells recorded from the postsubiculum in freely moving rats. I. Description and quantitative analysis. Journal of Neuroscience. 1990;10(2):420–435. doi:10.1523/JNEUROSCI.10-02-00420.1990.
- 27. Taube JS. Head direction cells and the neurophysiological basis for a sense of direction. Progress in Neurobiology. 1998;55(3):225–256. doi:10.1016/S0301-0082(98)00004-5.
- 28. Lever C, Burton S, Jeewajee A, O’Keefe J, Burgess N. Boundary vector cells in the subiculum of the hippocampal formation. Journal of Neuroscience. 2009;29(31):9771–9777. doi:10.1523/JNEUROSCI.1319-09.2009.
- 29. Solstad T, Boccara CN, Kropff E, Moser MB, Moser EI. Representation of geometric borders in the entorhinal cortex. Science. 2008;322(5909):1865–1868. doi:10.1126/science.1166466.
- 30. Høydal ØA, Skytøen ER, Andersson SO, Moser MB, Moser EI. Object-vector coding in the medial entorhinal cortex. Nature. 2019;568(7752):400. doi:10.1038/s41586-019-1077-7.
- 31. Epstein RA, Patai EZ, Julian JB, Spiers HJ. The cognitive map in humans: spatial navigation and beyond. Nature Neuroscience. 2017;20(11):1504. doi:10.1038/nn.4656.
- 32. Behrens TE, Muller TH, Whittington JC, Mark S, Baram AB, Stachenfeld KL, et al. What Is a Cognitive Map? Organizing Knowledge for Flexible Behavior. Neuron. 2018;100(2):490–509. doi:10.1016/j.neuron.2018.10.002.
- 33. Kaplan R, Schuck NW, Doeller CF. The role of mental maps in decision-making. Trends in Neurosciences. 2017;40:256–259. doi:10.1016/j.tins.2017.03.002.
- 34. Bellmund JL, Gärdenfors P, Moser EI, Doeller CF. Navigating cognition: Spatial codes for human thinking. Science. 2018;362(6415):eaat6766. doi:10.1126/science.aat6766.
- 35. Eichenbaum H. Hippocampus: remembering the choices. Neuron. 2013;77(6):999–1001. doi:10.1016/j.neuron.2013.02.034.
- 36. Spiers HJ. The Hippocampal Cognitive Map: One Space or Many? Trends in Cognitive Sciences. 2020. doi:10.1016/j.tics.2019.12.013.
- 37. Schiller D, Eichenbaum H, Buffalo EA, Davachi L, Foster DJ, Leutgeb S, et al. Memory and space: towards an understanding of the cognitive map. Journal of Neuroscience. 2015;35(41):13904–13911. doi:10.1523/JNEUROSCI.2618-15.2015.
- 38. Kraus BJ, Robinson RJ II, White JA, Eichenbaum H, Hasselmo ME. Hippocampal “time cells”: time versus path integration. Neuron. 2013;78(6):1090–1101. doi:10.1016/j.neuron.2013.04.015.
- 39. MacDonald CJ, Carrow S, Place R, Eichenbaum H. Distinct hippocampal time cell sequences represent odor memories in immobilized rats. Journal of Neuroscience. 2013;33(36):14607–14616. doi:10.1523/JNEUROSCI.1537-13.2013.
- 40. Doeller CF, Barry C, Burgess N. Evidence for grid cells in a human memory network. Nature. 2010;463(7281):657. doi:10.1038/nature08704.
- 41. Tavares RM, Mendelsohn A, Grossman Y, Williams CH, Shapiro M, Trope Y, et al. A map for social navigation in the human brain. Neuron. 2015;87(1):231–243. doi:10.1016/j.neuron.2015.06.011.
- 42. Schuck NW, Niv Y. Sequential replay of nonspatial task states in the human hippocampus. Science. 2019;364(6447):eaaw5181. doi:10.1126/science.aaw5181.
- 43. Jacobs J, Weidemann CT, Miller JF, Solway A, Burke JF, Wei XX, et al. Direct recordings of grid-like neuronal activity in human spatial navigation. Nature Neuroscience. 2013;16(9):1188. doi:10.1038/nn.3466.
- 44. Schuck NW, Wilson R, Niv Y. A state representation for reinforcement learning and decision-making in the orbitofrontal cortex. In: Goal-Directed Decision Making. Elsevier; 2018. p. 259–278.
- 45. Niv Y. Learning task-state representations. Nature Neuroscience. 2019;22(10):1544–1553. doi:10.1038/s41593-019-0470-8.
- 46. Kahnt T, Tobler PN. Dopamine regulates stimulus generalization in the human hippocampus. eLife. 2016;5:e12678. doi:10.7554/eLife.12678.
- 47. Rasmussen C, Williams C. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press; 2006.
- 48. Schulz E, Speekenbrink M, Krause A. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. bioRxiv. 2017.
- 49. Auer P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research. 2002;3(Nov):397–422.
- 50. Wilson RC, Geana A, White JM, Ludvig EA, Cohen JD. Humans use directed and random exploration to solve the explore–exploit dilemma. Journal of Experimental Psychology: General. 2014;143:155–164.
- 51. Schulz E, Gershman SJ. The algorithmic architecture of exploration in the human brain. Current Opinion in Neurobiology. 2019;55:7–14. doi:10.1016/j.conb.2018.11.003.
- 52. Steyvers M, Lee MD, Wagenmakers EJ. A Bayesian analysis of human decision-making on bandit problems. Journal of Mathematical Psychology. 2009;53:168–179. doi:10.1016/j.jmp.2008.11.002.
- 53. Acuna D, Schrater P. Bayesian modeling of human sequential decision-making on the multi-armed bandit problem. In: Proceedings of the 30th Annual Conference of the Cognitive Science Society. vol. 100. Washington, DC: Cognitive Science Society; 2008. p. 200–300.
- 54. Bush RR, Mosteller F. A mathematical model for simple learning. Psychological Review. 1951;58:313. doi:10.1037/h0054388.
- 55. Rescorla RA, Wagner AR. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory. 1972;2:64–99.
- 56. Gershman SJ. A unifying probabilistic view of associative learning. PLoS Computational Biology. 2015;11(11):e1004567. doi:10.1371/journal.pcbi.1004567.
- 57. Schulz E, Konstantinidis E, Speekenbrink M. Putting bandits into context: How function learning supports decision making. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2018;44(6):927.
- 58. Carroll JD. Functional learning: The learning of continuous functional mappings relating stimulus and response continua. ETS Research Bulletin Series. 1963;1963:i–144.
- 59. Lucas CG, Griffiths TL, Williams JJ, Kalish ML. A rational model of function learning. Psychonomic Bulletin & Review. 2015;22(5):1193–1215. doi:10.3758/s13423-015-0808-5.
- 60. Griffiths TL, Lucas C, Williams J, Kalish ML. Modeling human function learning with Gaussian processes. In: Advances in Neural Information Processing Systems; 2009. p. 553–560.
- 61. Schulz E, Tenenbaum JB, Duvenaud D, Speekenbrink M, Gershman SJ. Compositional inductive biases in function learning. Cognitive Psychology. 2017;99:44–79. doi:10.1016/j.cogpsych.2017.11.002.
- 62. Koh K, Meyer DE. Function learning: Induction of continuous stimulus-response relations. Journal of Experimental Psychology: Learning, Memory, and Cognition. 1991;17:811.
- 63. Busemeyer JR, Byun E, DeLosh EL, McDaniel MA. Learning functional relations based on experience with input-output pairs by humans and artificial neural networks. In: Lamberts K, Shanks D, editors. Concepts and Categories. Cambridge: MIT Press; 1997. p. 405–437.
- 64. Schulz E, Tenenbaum JB, Reshef DN, Speekenbrink M, Gershman S. Assessing the Perceived Predictability of Functions. In: Proceedings of the 37th Annual Meeting of the Cognitive Science Society. Cognitive Science Society; 2015. p. 2116–2121.
- 65. Jäkel F, Schölkopf B, Wichmann FA. Similarity, kernels, and the triangle inequality. Journal of Mathematical Psychology. 2008;52:297–303. doi:10.1016/j.jmp.2008.03.001.
- 66. Poggio T, Bizzi E. Generalization in vision and motor control. Nature. 2004;431:768–774. doi:10.1038/nature03014.
- 67. Srinivas N, Krause A, Kakade SM, Seeger M. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In: Proceedings of the 27th International Conference on Machine Learning (ICML 2010); 2010. p. 1015–1022.
- 68. Gershman SJ. Deconstructing the human algorithms for exploration. Cognition. 2018;173:34–42. doi:10.1016/j.cognition.2017.12.014.
- 69. Speekenbrink M, Konstantinidis E. Uncertainty and exploration in a restless bandit problem. Topics in Cognitive Science. 2015;7:351–367. doi:10.1111/tops.12145.
- 70. Schulz E, Wu CM, Ruggeri A, Meder B. Searching for rewards like a child means less generalization and more directed exploration. Psychological Science. 2019. doi:10.1177/0956797619863663.
- 71. Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ. Bayesian model selection for group studies. NeuroImage. 2009;46:1004–1017. doi:10.1016/j.neuroimage.2009.03.025.
- 72. Rigoux L, Stephan KE, Friston KJ, Daunizeau J. Bayesian model selection for group studies—revisited. NeuroImage. 2014;84:971–985. doi:10.1016/j.neuroimage.2013.08.065.
- 73. Stachenfeld KL, Botvinick MM, Gershman SJ. The hippocampus as a predictive map. Nature Neuroscience. 2017;20:1643. doi:10.1038/nn.4650.
- 74. Russek EM, Momennejad I, Botvinick MM, Gershman SJ, Daw ND. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLoS Computational Biology. 2017;13(9):e1005768. doi:10.1371/journal.pcbi.1005768.
- 75. Bellmund JL, De Cothi W, Ruiter TA, Nau M, Barry C, Doeller CF. Deforming the metric of cognitive maps distorts memory. Nature Human Behaviour. 2020;4(2):177–188. doi:10.1038/s41562-019-0767-3.
- 76. Wu CM, Schulz E, Gershman SJ. Inference and search on graph-structured spaces. bioRxiv. 2020.
- 77. Machado MC, Rosenbaum C, Guo X, Liu M, Tesauro G, Campbell M. Eigenoption Discovery through the Deep Successor Representation. In: Proceedings of the International Conference on Learning Representations (ICLR); 2018.
- 78. Mehta MR, Quirk MC, Wilson MA. Experience-dependent asymmetric shape of hippocampal receptive fields. Neuron. 2000;25(3):707–715. doi:10.1016/S0896-6273(00)81072-7.
- 79. Mok RM, Love BC. A non-spatial account of place and grid cells based on clustering models of concept learning. Nature Communications. 2019;10(1):1–9.
- 80. Mark S, Moran R, Parr T, Kennerley S, Behrens T. Transferring structural knowledge across cognitive maps in humans and models. bioRxiv. 2019.
- 81. Sanders H, Wilson MA, Gershman SJ. Hippocampal Remapping as Hidden State Inference. bioRxiv. 2019.
- 82. Whittington JC, Muller TH, Mark S, Chen G, Barry C, Burgess N, et al. The Tolman-Eichenbaum Machine: Unifying space and relational memory through generalisation in the hippocampal formation. bioRxiv. 2019;770495.
- 83. Liu H, Ong YS, Shen X, Cai J. When Gaussian process meets big data: A review of scalable GPs. IEEE Transactions on Neural Networks and Learning Systems. 2020.
- 84. Wu CM, Schulz E, Garvert MM, Meder B, Schuck NW. Connecting conceptual and spatial search via a model of generalization. In: Rogers TT, Rau M, Zhu X, Kalish CW, editors. Proceedings of the 40th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society; 2018. p. 1183–1188.
- 85. Abbott JT, Austerweil JL, Griffiths TL. Random walks on semantic networks can resemble optimal foraging. Psychological Review. 2015;122(3):558–569. doi:10.1037/a0038693.
- 86. Hills TT, Jones MN, Todd PM. Optimal foraging in semantic memory. Psychological Review. 2012;119(2):431. doi:10.1037/a0027373.
- 87. Radulescu A, Niv Y, Ballard I. Holistic Reinforcement Learning: The Role of Structure and Attention. Trends in Cognitive Sciences. 2019;23:278–292. doi:10.1016/j.tics.2019.01.010.
- 88. Collins AG, Frank MJ. Within- and across-trial dynamics of human EEG reveal cooperative interplay between reinforcement learning and working memory. Proceedings of the National Academy of Sciences. 2018;115:2502–2507. doi:10.1073/pnas.1720963115.
- 89. Ohl S, Rolfs M. Saccadic selection of stabilized items in visuospatial working memory. Consciousness and Cognition. 2018;64:32–44. doi:10.1016/j.concog.2018.06.016.
- 90. Austerweil JL, Sanborn S, Griffiths TL. Learning How to Generalize. Cognitive Science. 2019;43(8). doi:10.1111/cogs.12777.
- 91. Hills TT, Todd PM, Goldstone RL. The central executive as a search process: Priming exploration and exploitation across domains. Journal of Experimental Psychology: General. 2010;139(4):590. doi:10.1037/a0020666.
- 92. Findling C, Skvortsova V, Dromnelle R, Palminteri S, Wyart V. Computational noise in reward-guided learning drives behavioral variability in volatile environments. Nature Neuroscience. 2019;22(12):2066–2077. doi:10.1038/s41593-019-0518-9.
- 93. Cogliati Dezza I, Cleeremans A, Alexander W. Should we control? The interplay between cognitive control and information integration in the resolution of the exploration-exploitation dilemma. Journal of Experimental Psychology: General. 2019.
- 94. Wu CM, Schulz E, Gerbaulet K, Pleskac TJ, Speekenbrink M. Under pressure: The influence of time limits on human exploration. In: Goel AK, Seifert CM, Freksa C, editors. Proceedings of the 41st Annual Conference of the Cognitive Science Society. Montreal, QB: Cognitive Science Society; 2019. p. 1219–1225.
- 95. Tomov M, Schulz E, Gershman SJ. Multi-Task Reinforcement Learning in Humans. bioRxiv. 2019;815332.
- 96. Wu CM, Schulz E, Speekenbrink M, Nelson JD, Meder B. Mapping the unknown: The spatially correlated multi-armed bandit. In: Proceedings of the 39th Annual Meeting of the Cognitive Science Society; 2017. p. 1357–1362.
- 97. Courville AC, Daw ND. The rat as particle filter. In: Advances in Neural Information Processing Systems; 2008. p. 369–376.
- 98. Navarro DJ, Tran P, Baz N. Aversion to option loss in a restless bandit task. Computational Brain & Behavior. 2018;1(2):151–164. doi:10.1007/s42113-018-0010-8.
- 99. Mullen K, Ardia D, Gil DL, Windover D, Cline J. DEoptim: An R package for global optimization by differential evolution. Journal of Statistical Software. 2011;40(6):1–26.
- 100. Jeffreys H. The Theory of Probability. OUP Oxford; 1998.
- 101. Rouder JN, Speckman PL, Sun D, Morey RD, Iverson G. Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review. 2009;16:225–237.
- 102. van Doorn J, Ly A, Marsman M, Wagenmakers EJ. Bayesian Latent-Normal Inference for the Rank Sum Test, the Signed Rank Test, and Spearman’s ρ. arXiv preprint arXiv:1712.06941. 2017.
- 103. Jeffreys H. The Theory of Probability. Oxford, UK: Oxford University Press; 1961.
- 104. Ly A, Verhagen J, Wagenmakers EJ. Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. Journal of Mathematical Psychology. 2016;72:19–32. doi:10.1016/j.jmp.2015.06.004.
- 105. van Doorn J, Ly A, Marsman M, Wagenmakers EJ. Bayesian inference for Kendall’s rank correlation coefficient. The American Statistician. 2018;72:303–308. doi:10.1080/00031305.2016.1264998.
- 106. Zellner A, Siow A. Posterior odds ratios for selected regression hypotheses. In: Bernardo JM, Lindley DV, Smith AFM, editors. Bayesian Statistics: Proceedings of the First International Meeting held in Valencia (Spain). University of Valencia; 1980. p. 585–603.
- 107. Rouder JN, Morey RD, Speckman PL, Province JM. Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology. 2012;56:356–374. doi:10.1016/j.jmp.2012.08.001.
- 108. Bürkner PC. brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software. 2017;80(1):1–28.
- 109. Hoffman MD, Gelman A. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research. 2014;15:1593–1623.
- 110. Barr DJ, Levy R, Scheepers C, Tily HJ. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language. 2013;68(3):255–278. doi:10.1016/j.jml.2012.11.001.
- 111. Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press; 2006.
- 112. Austerweil J, Griffiths T. Learning hypothesis spaces and dimensions through concept learning. In: Proceedings of the Annual Meeting of the Cognitive Science Society. vol. 32; 2010.