PLOS Computational Biology. 2022 Nov 2;18(11):e1010664. doi: 10.1371/journal.pcbi.1010664

Training diversity promotes absolute-value-guided choice

Levi Solomyak 1,*, Paul B Sharp 1, Eran Eldar 1
Editor: Lusha Zhu
PMCID: PMC9678339  PMID: 36322560

Abstract

Many decision-making studies have demonstrated that humans learn either expected values or relative preferences among choice options, yet little is known about what environmental conditions promote one strategy over the other. Here, we test the novel hypothesis that humans adapt the degree to which they form absolute values to the diversity of the learning environment. Since absolute values generalize better to new sets of options, we predicted that the more options a person learns about, the more likely they would be to form absolute values. To test this, we designed a multi-day learning experiment comprising twenty learning sessions in which subjects chose among pairs of images, each associated with a different probability of reward. We assessed the degree to which subjects formed absolute values and relative preferences by asking them to choose between images they learned about in separate sessions. We found that concurrently learning about more images within a session enhanced absolute-value, and suppressed relative-preference, learning. Conversely, cumulatively pitting each image against a larger number of other images across multiple sessions did not impact the form of learning. These results show that the way humans encode preferences is adapted to the diversity of experiences offered by the immediate learning context.

Author summary

Learning relative preferences between a pair of options is effective in guiding choice between them, but might lead to error in choosing between options that have not been paired against each other even if we know each option well. This problem of generalizing relative preferences to novel decision contexts increases as the number of options gets larger, since the more options there are the more likely we are to encounter choices between new sets of options. To solve this problem, people may learn the expected reward associated with each individual option—that is, its ‘absolute value’, by means of which any pair of options can be compared. Thus, we hypothesized that the more options a person learns about, the more likely they would be to form absolute values as opposed to relative preferences. We constructed a novel multi-day reward learning experiment to specifically test this hypothesis. We found that concurrently learning about more images indeed enhances absolute-value learning and suppresses relative-preference learning. The findings clarify what learning conditions promote the formation of generalizable preferences that can help reach optimal decisions across different contexts, an ability that is vital in the real world where experience is limited and fragmented across multiple continuously shifting contexts.

Introduction

A large body of decision-making research suggests that humans learn from experience the expected value of different available options (henceforth referred to as these options’ absolute values; [1–5]). That is, as people try out different options and observe their reward outcomes, it is thought that they track the average reward associated with each option, and this allows them to choose options with higher expected value. The main benefit of this so-called value learning is that it makes it easy to choose from any set of options, including options that have not been previously considered in relation to one another, simply by comparing their absolute values. In this sense, absolute values act as a “common currency” that serves to generalize preferences across contexts that offer different combinations of options [6–8]. Evidence that value learning is implemented by the brain emerged from early foundational work on primates indicating that brainstem dopaminergic neurons instantiate prediction errors (differences between actual and expected reward) that are well suited for algorithmic implementations of value learning [9]. Since then, many human brain imaging studies have shown that activation in the orbitofrontal cortex (OFC) and other regions is correlated with expected value during reward learning and other types of economic decision-making tasks [5,10,11].

New evidence, however, has called into question whether humans indeed learn absolute expected values or may instead be learning relative preferences among limited sets of options. Two recent studies showed that people’s choices reflect relative preferences because when they are rewarded for choosing one out of two options, they do not only form a preference in favor of the option they chose, but also a preference against the option they did not choose [12,13]. Neural data reveal a similar picture. Neural firing in areas considered to encode value, such as the OFC and the striatum, has been found to encode normalized values that, in fact, have no absolute meaning and can only be interpreted as relative preferences compared to other options sampled in the same context [14,15]. Such relative preference encoding is evident even when each individual option is encountered separately [16,17]. These studies, among others [18–25], have led researchers to propose alternative models of learning, according to which humans learn preferences between options without encoding the absolute value of each option [7,8,26–29].

Here, we test a novel hypothesis that humans flexibly adapt the degree to which they form absolute expected values and relative preferences based on the opportunities and incentives afforded by the environment. It is well established, across a wide range of machine learning applications, that learning environments that provide a more diverse set of learning exemplars aid generalization of learned information to new input patterns and unfamiliar contexts [30,31]. In the case of learning to maximize reward, the set of exemplars corresponds with the set of possible options, and learning about a broader set of options could make it more clearly evident that the value of an option does not depend on the other options it is pitted against—that is, that each option has an absolute value. Additionally, the broader the set of options, the greater is the space of possible choice sets (i.e., a choice set is a set of simultaneously available options from which one chooses), some of which have yet to be encountered. The prospect of having to choose among novel sets of previously encountered options makes it worthwhile to form preferences that can be used to choose among such sets, which is precisely what absolute values are best suited for. By contrast, relative preferences produce suboptimal choices among unfamiliar choice sets, since they only encode how valuable options were relative to the other options they were previously pitted against [12,27].

These considerations suggest at least two types of training diversity may support and incentivize value learning. The first type of diversity relates to how many options a person learns about concurrently within a given learning session (henceforth, concurrent diversity), whereas the second type of diversity is the number of alternative options a given option is cumulatively pitted against, across all learning sessions (henceforth, cumulative diversity). These two types of diversity are dissociable since an option can be learned concurrently with fewer other options, yet across multiple learning sessions it may be cumulatively pitted against more other options.

To test the impact of concurrent and cumulative diversity on the formation of absolute values, we designed a novel multi-day reward learning experiment comprising twenty learning sessions. In each session, subjects’ goal was to maximize their reward by choosing among pairs of images, each of which was associated with a fixed probability of reward. The probabilities were not told to subjects and could only be learned through trial and error, by observing the reward outcomes of chosen images. To manipulate concurrent diversity, we varied how many images subjects learned about concurrently in each learning session. To manipulate cumulative diversity, we varied the total number of other images each image was pitted against over two separate learning sessions. Critically, the multi-session design allowed us to assess the formation of absolute values by asking subjects to choose between images that were never directly paired together during learning. To enhance the distinction between absolute values and relative preferences, we had images with the same reward probability learned against other images with either mostly lower or mostly higher reward probabilities. An absolute value learner would have no preference among these images, whereas a relative preference learner would prefer the option that ranked higher in its original learning context.

Results

27 subjects (ages 20 to 30; Mean = 24 ±.5) completed two learning sessions a day of a reward learning task over a period of ten days (Fig 1). In low-concurrent-diversity sessions, subjects learned about three images at a time, whereas in high-concurrent-diversity sessions, subjects learned about six images at a time. Every image appeared in two learning sessions, but low-cumulative-diversity images were pitted against the same images in both sessions whereas high-cumulative-diversity images were pitted against different images.

Fig 1. Experimental design.

Fig 1

(A) Reward learning game. On each trial, subjects were asked to choose one of two circular images. Following their choice, subjects either received or did not receive a reward of 1 coin based on a fixed reward probability associated with the chosen image (0, 1/3, 2/3, or 1). Each game consisted of 48 such ‘learning’ trials, interleaved with 24 ‘testing’ trials wherein subjects chose between images about which they learned in prior sessions. Outcomes were not revealed in testing trials to prevent further learning. Every image first appeared in 64 learning trials over two sessions before subjects were tested on it. ITI: inter-trial-interval. (B) Daily schedule. Each day, subjects performed two experimental sessions on a specially designed mobile phone app [32], one in the morning (on average, at 8:56 am, and no earlier than 6:00 am) and one in the evening (on average, at 6:12 pm, and no earlier than 4:00 pm). In each session, subjects played two games in which they learned about a total of six images. (C) Experimental conditions. Four experimental conditions were implemented along 10 days of learning, each lasting two to three days. Each condition is illustrated via a representative selection of six pairs of images subjects chose between in two different sessions. Within the low concurrent diversity condition (left two columns), three images were learned over the span of each game, whereas in the high concurrent diversity condition (right two columns), six images were learned over the span of two games. Thus, in both conditions, each image appeared in 32 learning trials per session. In the low-cumulative-diversity conditions, each image was pitted against the same two images in two consecutive learning sessions, whereas in the high-cumulative-diversity conditions, images were pitted against different images (sometimes in two nonconsecutive sessions; see Methods for details). Three days of training, as opposed to two, were required for high-cumulative-diversity conditions so that each image could be pitted against different images in its two learning sessions. Conditions were randomly ordered, and equal in terms of average reward probability and number of images learned per session. Testing trials involved choosing between two images the subject already chose between during learning (Learned-pair trials, 25% of testing trials) or novel pairings of images learned separately (Novel-pair trials, 75% of testing trials). On average, pairs were tested 27 ±.8 times.

Subjects formed preferences in favor of more rewarding images

To validate our task, we first determined whether subjects successfully learned to choose images associated with higher reward probabilities. Subjects indeed tended to choose the more rewarding image out of each pair, doing so with 86% (SEM ±1%) accuracy during learning (Fig 2A; chance performance = 50%). As can be expected, subjects’ performance was lower in the conditions that required learning about more images (i.e., high concurrent diversity; Fig 2B, left panel) or about more pairs of images (i.e., high cumulative diversity; Fig 2B, right panel). However, subjects performed considerably above chance in all conditions, and showed gradual improvement with each new image as they tried out choosing it and observed its outcomes (Fig 2C). These results confirm that the task was effective in getting subjects to form preferences among images based on how often each image was rewarded.

Fig 2. Overall performance. n = 27 subjects.

Fig 2

A) Choice accuracy as a function of trial type and condition. A choice was considered accurate if the subject selected the image with the higher reward probability. Subjects performed significantly above chance (50%) in all trial types and conditions (CI = [.834,.879]). The plot shows total (vertical lines) and interquartile (boxes) ranges and medians (horizontal lines). Also shown are mean accuracies predicted by a computational model that was fitted to subjects’ choices (circles; see details under Computational Formalization below). B) Effect of training diversity on accuracy during learning. Accuracy was higher in sessions with low (91% SEM ±1%) compared to high (87% SEM ±1%) concurrent diversity (pcorrected = .048, bootstrap test), and trended higher in sessions with low (90% SEM ±1%) compared to high (88% SEM ±1%) cumulative diversity (pcorrected = .078, bootstrap test). The plot shows individual subject accuracy (circles), group distributions of accuracy levels (violin), group means (thick lines) and standard errors (gray shading). C) Learning curves for each experimental condition. Accuracy in trials involving a given image as a function of how many trials the image previously appeared in. A drop-off in accuracy can be observed for high-cumulative-diversity images (dark) at the beginning of the second session, because these images were then pitted against new images. The plot shows group means (circles), standard errors (vertical lines), local polynomial regression lines ([33]; curves) and confidence intervals (shading). Given that performance was always above chance (50%), y-axes in panels B and C focus on this range.

Concurrent diversity increased generalization

A hallmark of value learning is the ability to generalize learned preferences to novel settings. To test generalization, we had subjects choose between images that had not been previously pitted against each other (‘novel pair’ testing trials). We compared subjects’ accuracy on these trials to accuracy in choosing between images that subjects had encountered during learning (‘learned pair’ testing trials). Novel and learned pairs involved the same images and presented subjects with similarly difficult choices, in the sense that the time elapsed since learning was roughly the same (novel pairs = 1.88 ±.08 days, learned pairs = 1.75 ±.08 days following learning), as was the average difference in reward rate between the two images that made up a pair (reward rate difference: Δnovel-pair = 48.4% ±2%, Δlearned-pair = 49.1% ±3%). However, only novel pairs presented subjects with choices between images learned in different games with different sets of other images. This presents no challenge to an absolute value learner, but a relative preference learner might end up choosing an image with a lower expected value simply because it was learned against worse images (and thus acquired a higher relative value). For this reason, a pure absolute value learner can be expected to perform equally well in choosing between novel and learned pairs, whereas a relative preference learner should perform worse in choosing between novel pairs.

We found that subjects successfully chose the image with the higher reward probability in 83% of novel-pair trials (SEM ±2%; chance performance = 50%). This level of accuracy, however, was significantly lower than the accuracy subjects demonstrated on learned-pair trials (Mean = 87% SEM ±3%; bootstrap p = .03). Subjects thus generalized their preferences well, but did not do so perfectly.

We therefore asked whether success in generalization was affected by the diversity of learning experiences. To quantify generalization, we computed the drop-off in accuracy from learned-pair to novel-pair testing trials. Since we found no interaction between the effects of concurrent and cumulative diversity (p = .86 bootstrap test), we separately examined each while marginalizing over the other. Strikingly, we found that accuracy did not significantly differ between novel-pair and learned-pair trials for images learned in conditions of high concurrent diversity (Fig 3; Mean = -1% ±2%; p = .68, bootstrap test). Conversely, for low-concurrent-diversity images, subjects performed substantially worse in choosing between novel pairs (Mean = -7% ±2%; pcorrected = .004 bootstrap test). This difference between low and high concurrent diversity was neither due to a difference in learned-pair trials nor in novel-pair trials (S1 Table), but specifically reflected the drop-off in accuracy between them (pcorrected = .015 bootstrap test).
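For concreteness, the drop-off measure and its bootstrap test can be computed along the lines of the following R sketch; the data frame and column names (test_trials, subject, pair_type, correct) are illustrative assumptions rather than the analysis code used for the study.

```r
# Minimal sketch (R): per-subject drop-off in accuracy from learned-pair to
# novel-pair testing trials, and a simple bootstrap test of the group mean.
# The data frame test_trials and its columns are assumed names, not the study's code.
library(dplyr)
library(tidyr)

dropoff <- test_trials %>%
  group_by(subject, pair_type) %>%                   # pair_type: "learned" or "novel"
  summarise(acc = mean(correct), .groups = "drop") %>%
  pivot_wider(names_from = pair_type, values_from = acc) %>%
  mutate(drop = novel - learned)                     # negative = worse generalization

# Bootstrap the group mean drop-off and compute a two-sided p-value against zero
boot_means <- replicate(10000, mean(sample(dropoff$drop, replace = TRUE)))
p_boot <- 2 * min(mean(boot_means >= 0), mean(boot_means <= 0))
```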

Fig 3. Generalization performance.

Fig 3

n = 27 subjects. Drop-off in accuracy in novel-pair compared to learned-pair testing trials, as a function of training diversity. Concurrent diversity had a significant effect on this measure of generalization (pcorrected = .004, bootstrap test) whereas cumulative diversity did not significantly affect it (pcorrected = .29, bootstrap test). The plot shows individual subject accuracy (circles), group distributions of accuracy levels (violin), group means (thick lines) and standard errors (gray shading).

In contrast, cumulative diversity did not impact the performance drop-off from learned-pair to novel-pair trials (-3% ±2% vs -4.8% ±2%, pcorrected = .29, bootstrap test). This was despite the fact that pairs of images were encountered during learning half as many times in conditions of high cumulative diversity. For this reason, we expected that learned-pair performance would be compromised by cumulative diversity, and this was indeed the case (as evidenced by comparing accuracy on learned-pair trials to the level of accuracy achieved in the last 5 trials of learning; Meanlow = -2.3% SEM ±1%, Meanhigh = -4.2% ±1%, p = .03 bootstrap test). However, accuracy on novel-pair trials was similarly compromised by cumulative diversity (Meanlow = -5.6% ±2%, Meanhigh = -7.9% ±2%, p = .026 bootstrap test), and thus no benefit to generalization was observed.

These results show that increasing the number of options about which a person concurrently learns improves their ability to generalize their learned preferences to novel choice sets.

Concurrent diversity reduces influence of other options’ outcomes

The observed improvement in generalization suggested that concurrent diversity enhanced absolute value learning. To further investigate this possibility, we examined another key consequence of absolute value learning, namely, that the preferences it forms depend only on the available images’ prior outcomes. By contrast, relative preferences also account for the outcomes of other images against which the presently available images were pitted during learning. These latter outcomes determined how each of the available images ranked compared to other images during learning. Because it accounts for these outcomes, a relative preference learner should show no preference between images that ranked similarly during learning even if their absolute values differ, but should favor similarly rewarded images that ranked higher in their original learning context (i.e., a rank bias).

In examining choices between similarly ranked images with different absolute values, we found that concurrent diversity improved subjects’ accuracy (S1 Fig; Mean_high = .83±.02; Mean_low = .78±.02; pcorrected = .035 bootstrap test), as consistent with a shift towards value learning. By contrast, cumulative diversity impaired performance on such trials (Mean_low = .82±.02 Mean_high = .77±.02 pcorrected = .02 bootstrap test), as consistent with its general detrimental effect on overall performance.

Visualizing choices between similarly rewarded images (i.e. less than 10% difference in reward rate) showed that subjects preferred images that ranked higher during learning (i.e., that were pitted against images with lower reward probabilities) under low concurrent diversity (Meanlow = 63% SEM ±6%, with 50% representing no preference between images) but not under high concurrent diversity (Meanhigh = 49% SEM ±5%; Fig 4A). This difference between conditions trended towards significance (p = .06 permutation test) and was not evident as a function of cumulative diversity (Mean low = 57% ±7%, Mean high = 48% ±8%).

Fig 4. Effect of other options’ outcomes.

Fig 4

A) Rank bias as a function of training diversity. Y-axis shows the percent of trials involving similarly rewarded images in which subjects chose the image that ranked higher during learning. A subject’s image rankings were based on how many times the subject chose each image over the other images it was pitted against (best, second-best, or worst, see Methods for further details). The plot depicts individual subjects’ choices (circles), group distributions (violin), group means (thick lines) and standard errors (gray shading). B) Effect of reward histories on choice. The plot shows the log odds effect on choice of three types of reward. Own: differences in reward history of currently available images. Current alternative: differences in reward history when rejecting one available image in favor of the other available image. Other: differences in reward history when rejecting one of the available images in favor of any other image.

The above measure of rank bias, however, is limited in both sensitivity and validity. First, it ignores the random differences that inevitably exist in the actual outcomes of similarly ranked images. Second, it does not utilize the information that exists in subjects’ choices between differently ranked images. Third, it is confounded by the fact that higher ranking images were chosen more during learning, since they were pitted against less rewarding images. This latter confound is important because it means that more outcomes were observed for higher-ranking images, which may have allowed subjects to develop greater confidence regarding their value.

To address these challenges, we used a Bayesian logistic mixed model that predicted subjects’ choices based on differences between the currently available images’ own reward history (βOwn; i.e., proportion rewarded), the number of times each image was chosen (βTimes Chosen; i.e., sampling bias), and the reward history from trials in which the currently available images were rejected in favor of other images (Fig 4B). The latter was split into two regressors, one for prior outcomes of rejecting a presently available image in favor of the other presently available image (βCurrent alternative) and one for rejecting it in favor of any other image (βOther). Importantly, only the latter regressor unequivocally captures the effect of other images’ outcomes that are irrelevant for maximizing absolute expected value. To determine whether the effects of different types of outcomes were modulated by training diversity, we included regressors for concurrent and cumulative diversity and interactions between both types of diversity and each type of reward history.

The results confirmed that in addition to the strong impact of an image’s own reward history (βOwn = 1.97 [1.63, 2.31]), preference for an image was inversely influenced by the outcomes of both the current alternative (βCurrent alternative = -.43 [-.64, -.21]) and of the other images it had previously been pitted against (βOther = -.53 [-.75, -.32]). Thus, the more subjects were rewarded when not choosing an image, the less likely they were to prefer it on subsequent trials. Most importantly, the influence of other images’ reward history was reduced by high concurrent diversity (βOther×Concurrent = .39 CI = [.30, .49]). No additional interactions were found (S2 Table), except for an interaction of cumulative diversity with the impact of an image’s own outcomes, as consistent with cumulative diversity’s general detrimental effect on performance. Thus, concurrent, but not cumulative, diversity reduced the influence of other options’ rewards that are irrelevant for inferring absolute value.

Finally, to ask whether the effect of concurrent diversity was indeed specific to trials that exclusively probed absolute value, we examined subjects’ accuracy in novel-pair testing trials wherein one of the images had both higher absolute and higher relative value than the other image. Choosing correctly on such trials does not require absolute values. As expected, we found no significant effect of concurrent diversity in these trials (pcorrected = .22). Here, too, we found a trend for subjects to perform worse on images learned in high, as compared to low, cumulative diversity (pcorrected = .052 bootstrap test). Taken together, the results indicate that concurrent diversity specifically improved accuracy on trials that required absolute values to choose correctly.

Computational formalization of value and preference learning

Our results evidenced signs of both absolute value and relative preference learning. On one hand, subjects successfully learned reward maximizing choices and generalized well to novel choice sets, as consistent with absolute value learning. On the other hand, subjects performed still better at choosing among familiar choice sets, and they preferred images that were relatively more valuable in their original learning context, as consistent with relative preference learning. Critically, learning about more images concurrently diminished or even eliminated the signs of relative preference learning.

We next tested whether this set of results can be coherently explained as reflecting the operation of two learning processes—absolute value and relative preference learning—the balance between which changes as a function of concurrent diversity. To do this, we fitted subjects’ choices during both learning and testing with a computational model that combines value and preference learning (as proposed by [26]). We then examined the best-fitting values of the model’s parameters to determine the degree to which value and preference learning were each employed in each experimental condition.

To formalize absolute value learning, the model represents subject beliefs about the absolute values of images as beta distributions, defined by two parameters ai and bi. This beta distribution represents the reward probability that is believed to be associated with each image given the outcomes obtained for choosing it. Thus, ai and bi accumulate the number of times that choice of image i was rewarded and not rewarded:

$a_i \leftarrow \gamma a_i + O$  (1)
$b_i \leftarrow \gamma b_i + (1 - O)$  (2)

where O = 1 if the choice was rewarded and O = 0 if it was not. Here, γ serves as a leak parameter allowing for the possibility that more recent outcomes have a greater impact on subjects’ beliefs (γ = 1 entails that outcomes are equally integrated, whereas γ<1 entails overweighting of recent outcomes). The decision variable provided by this form of learning is the absolute value (V) of image i, which is estimated as the expected value of the image’s beta distribution:

$V(i) = \frac{a_i}{a_i + b_i}$  (3)

To isolate the key computation distinguishing relative preference learning from value learning, we use the same learning rules defined above to generate a relative preference W(i,j), for image i over image j, that accounts for all outcomes observed for choosing between the two images. When applying Eqs 1 and 2 for preference learning, O = 1 if image i was chosen and rewarded or if image j was chosen and not rewarded, and O = 0 if image j was chosen and rewarded or if image i was chosen and not rewarded. To enable preference learning to exhibit preferences among previously unencountered pairs of images, a general relative preference W(i) was computed for each image i by accumulating, in the same fashion, the outcomes of choosing image i or the other image, across all learning trials involving image i. Accounting for the outcomes of the other images distinguishes the relative preference W(i) from the absolute value V(i).

When facing a choice between image i and image j, the probability that the model will choose either image is computed based on a weighted sum of the images’ absolute value and relative preference:

$P(\mathrm{choice}=i) = \frac{e^{\beta_{\mathrm{value}} V(i) + \beta_{\mathrm{preference}} W'(i)}}{e^{\beta_{\mathrm{value}} V(i) + \beta_{\mathrm{preference}} W'(i)} + e^{\beta_{\mathrm{value}} V(j) + \beta_{\mathrm{preference}} W'(j)}}$  (4)

where W′(i) is itself a weighted sum of W(i) and W(i,j), with a free parameter controlling their relative weights. Importantly, βvalue and βpreference are distinct inverse temperature parameters, the respective magnitudes of which determine the degree to which absolute values and relative preferences influence choice. Thus, the values of these parameters that best fit subjects’ choices can be used to quantify the degree to which value and preference learning manifested in each experimental condition [34,35]. To this end, we allowed the two inverse temperature parameters to vary as a function of concurrent and cumulative diversity, for either learning or testing trials.
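For concreteness, the learning and choice rules of Eqs 1–4 can be sketched in R as follows. The parameter values, the uniform initialization of the counts, and the exact form of the weighted sum defining W′(i) are illustrative assumptions rather than the fitted implementation used for the study.

```r
# Minimal sketch (R) of the combined value + preference learner (Eqs 1-4).
# gamma, beta_value, beta_pref and w_pair are illustrative values, not fitted parameters.
gamma      <- 0.95   # leak; gamma < 1 overweights recent outcomes
beta_value <- 4      # inverse temperature weighting absolute values
beta_pref  <- 2      # inverse temperature weighting relative preferences
w_pair     <- 0.5    # assumed weight of W(i,j) relative to W(i) in W'(i)

leaky_add <- function(x, o) gamma * x + o   # Eqs 1-2, applied to a single counter

# Value learning: update the chosen image's counts from its own outcome (O = reward)
update_value <- function(val, chosen, reward) {
  val$a[chosen] <- leaky_add(val$a[chosen], reward)
  val$b[chosen] <- leaky_add(val$b[chosen], 1 - reward)
  val
}
V <- function(val, i) val$a[i] / (val$a[i] + val$b[i])   # Eq 3

# Preference learning: same update rule, but the outcome is recoded relative to image i
# (O = 1 if i was chosen and rewarded, or if the alternative was chosen and unrewarded)
pref_outcome <- function(i, chosen, reward) as.numeric((chosen == i) == (reward == 1))

# W'(i): assumed form of the weighted sum of pair-specific and general preferences
W_prime <- function(W_pair_ij, W_gen_i) w_pair * W_pair_ij + (1 - w_pair) * W_gen_i

# Choice rule (Eq 4): softmax over weighted sums of value and preference
p_choose_i <- function(Vi, Vj, Wi, Wj) {
  ui <- beta_value * Vi + beta_pref * Wi
  uj <- beta_value * Vj + beta_pref * Wj
  exp(ui) / (exp(ui) + exp(uj))
}

# Example: uniform prior counts a = b = 1 (an assumption) for six images
val <- list(a = rep(1, 6), b = rep(1, 6))
val <- update_value(val, chosen = 2, reward = 1)
V(val, 2)   # updated absolute value estimate for image 2
```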

Subjects combine value and preference learning

To determine whether a combination of value and preference learning was needed to explain subjects’ choices, we compared the full model to two sub-models, one that only learns absolute values (βpreference = 0) and one that only learns relative preferences (βvalue = 0), as well as to a number of additional alternative learning models (see Methods). We found that the full model accounted for subjects’ choices across both learning and testing trials significantly better than the alternative models (Fig 5 and S5 Table). Moreover, only the full model was able to recreate in simulation all the behavioral findings, including generalization performance, effect on choice of outcomes for other images, and rank bias (see S2 Fig and Falsification of alternative models in Methods).

Fig 5. n = 27 subjects.

Fig 5

A) Model comparison. Comparison of the combined Preference+Value model to models that learn only absolute values or relative preferences. The models are compared by means of the Bayesian Information Criterion (BIC; [36]). Lower BIC values indicate a more parsimonious model fit. B) Modeled Utilization of Value and Preference Learning. Individual subject model parameter fits showing the effects of concurrent and cumulative diversity on the degree to which preference and value learning manifested in subjects’ choices. Each dot represents a subject. Dashed lines mark where utilization of the form of learning is equal for low and high diversity.

Concurrent diversity enhances value learning

Examining the values of the parameters that best-fitted subjects’ choices across all trials showed that value learning generally predominated over relative preference learning (βvalue = 4 ±.28 vs βpreference = 1.88 ±.22), as befitting a task that involves frequent choices between options from different learning contexts. Importantly, however, preference learning manifested to a greater extent in conditions of low concurrent diversity (low concurrent: βpreference = 2.6 ±.29; high concurrent: βpreference = 1.44 ±.21; p<.001 permutation test), whereas value learning manifested to a greater extent in conditions of high concurrent diversity (low concurrent: βvalue = 3.84±.31; high concurrent: βvalue = 4.33±.2 p<.001 permutation test).

In contrast to concurrent diversity, cumulative diversity inhibited value learning (low cumulative: βvalue = 4.7 ±.5; high cumulative: βvalue = 3.5 ±.4; p < .001 permutation test) and had no significant impact on preference learning (low cumulative: βpreference = 1.89 ±.13; high cumulative: βpreference = 1.92 ±.1; p = .32 permutation test). These results indicate that concurrently learning about a broader set of options enhances the use of absolute values for making choices.

Testing alternative interpretations

A necessary consequence of varying concurrent diversity is an extension of the duration of learning, since twice as many trials are required to learn about twice as many images. This raises the possibility that it was simply the duration of learning, and not the diversity of learning exemplars, that shifted learning from relative preferences to absolute values. To test this interpretation, we implemented a variant of our model where a shift towards absolute-value learning progresses gradually during learning irrespective of concurrent diversity. We compared this model to a matched implementation of a concurrent diversity effect, that is, where a gradually developing shift progresses towards value learning under high concurrent diversity but towards preference learning under low concurrent diversity. Model comparison favored the latter model over all other models (ΔBIC +741). This result indicates that shifts between preference and value learning indeed developed gradually. Most importantly, though, this result confirms again that the direction of the shift depended on concurrent diversity.

A second necessary consequence of concurrently learning about more images is that consecutive presentations of an image will be separated by more intervening trials. Larger separation might itself affect the predominant form of learning, either because values and preferences decay during the intervening trials at different rates, or because a larger separation between outcomes affects the degree to which subjects overweight recent outcomes in forming values and preferences. To test the first possibility, we modified our model so as to allow values and preferences to decay during the intervening trials (via the leak parameters γvalue, γpreference). This model fitted the data worse (ΔBIC +220), ruling out decay during intervening trials. To test the second possibility, we modified our model so that the overweighting of recent outcomes (also controlled by the leak parameters) could vary as a function of diversity conditions. This model too fitted the data substantially worse (ΔBIC + 350).

Finally, we examined an alternative hypothesis that participants make use of the transitivity of relative preference, thereby inferring a global rank of items without learning the expected value of each item. However, even among pairs of images with no transitive relation between them, subjects were significantly above chance in selecting the higher value image (Mean = .82±.1). Moreover, the effects of concurrent diversity on generalization were significant within this subset of trials as well (pcorrected = .003 bootstrap test).

Thus, neither the duration of learning, nor the presumed effects of interleaving trials, nor transitive inferences offer a successful alternative explanation for the enhancement of absolute value learning by high concurrent diversity of learning exemplars.

Discussion

We found that increasing the number of options a person concurrently learns about shapes reward learning in several ways. It first reduces performance during learning, but then leads to more successful generalization, removes a bias in favor of options that ranked higher during learning, and generally decreases the degree to which preference for an option is influenced by presently irrelevant options’ outcomes. Computational modeling shows that all of these effects are coherently explained by a shift away from relative preference and towards absolute value learning. These findings offer a meaningful extension of previous demonstrations of absolute value [1,2,5] and relative preference [12,13,27] learning in humans, by identifying key conditions under which the latter is diminished in favor of the former, namely, conditions of high concurrent training diversity.

The enhancement of absolute values and inhibition of relative preferences that we observed can best be understood in light of past suggestions that encoding context-specific information aids performance as long as the agent remains within the learning context, but is ill suited for generalizing policies to other learning contexts [37,38]. Relative preference learning is inherently specific to the learning context and impairs generalization to novel choice sets. Our findings show that such context-specific learning is promoted by a learning experience that limits the possibility of encountering novel choice sets, specifically, by reducing the number of options. In this sense, the shift between preference and value learning in our experiment can be thought of as a rational adaptation. This perspective is supported by a recent finding that value learning is enhanced by expectations of having to choose between options from different learning contexts [39]. Here, though, we demonstrate that absolute value learning can be enhanced even absent a direct manipulation of the need to choose between options from different learning contexts. Increasing the number of options is sufficient for this purpose. Conversely, with a low number of options, relative-preference learning remains clearly evident despite subjects being aware of the need to choose across contexts.

Our findings agree with prior work showing that emphasizing comparisons between a limited number of specific images, for instance by repeatedly presenting subjects with a choice between the same two options and providing reward information about the foregone option, promotes learning of relative values [27]. However, the process by which the formation of relative values in those experiments has so far been explained, namely normalization to the range of outcomes experienced during learning, cannot explain relative preference learning in our experiment. This is because the range of outcomes in our experiment was the same in all learning sessions. By contrast, the model we proposed here for relative preference learning may coherently account for both our findings and the findings that had previously been attributed to normalization.

Though both concurrent and cumulative diversity increased task difficulty, as evidenced by poorer performance during learning, cumulative diversity did not have the effect of improving generalization. This result has two key implications. First, it contradicts previous suggestions that it is task difficulty per se that promotes absolute value learning [27]. Second, it suggests that the formation of absolute values is not promoted by the global diversity of learning exemplars encountered during the entire course of learning, but rather, by the local diversity that characterizes the immediate learning context.

We successfully ruled out several alternative explanations for the finding that concurrent diversity promotes absolute value learning, including some possible effects of increased duration of learning and greater separation between consecutive choices of an image, both of which are direct consequences of concurrently learning about more images. Another important consequence of such learning, which we have not addressed here, is increased working memory load [40]. Future experiments could disentangle the effects of number of images and working memory load by introducing unrelated tasks during learning, so as to increase working memory load without changing the number of images about which subjects concurrently learn.

Another open question concerns the full functional relationship between concurrent diversity and absolute value learning. Our model, which was tested on three (low diversity) or six (high diversity) concurrently learned images, does not allow us to extrapolate to learning with other set sizes. Clarifying the full functional relationship between diversity and value learning can be aided by extending the current experimental approach to testing additional levels of diversity, as well as by further developing a mechanistic understanding of how diversity promotes value learning.

Several studies have investigated the neural basis of value [5,9,10,11] and preference [12,14,15] learning in isolation, and the potential instantiation of relative preferences via sampling from memory during choice [41–45]. However, it is not yet known how the brain arbitrates between preference and value learning. One relevant line of work comprises studies on how concurrent diversity influences the brain regions recruited for learning [40]. Though this work has not examined absolute values and relative preferences, it has shown that increasing the number of items people concurrently learn about strengthens activation in a striatal-frontoparietal network implicated in value learning. Future studies could investigate the involvement of this network and other regions in arbitrating between value and preference learning as environmental conditions change.

Conclusion

Our findings contribute to the ongoing debate concerning the extent to which people learn absolute values versus relative preferences. We show that absolute-value learning depends on a characteristic of the immediate learning context, namely, the diversity of learning experiences it offers. We find that increased diversity, despite impairing performance in the short term, has the effect of enhancing learning of absolute values which generalize well to novel contexts. Such generalization is essential for making decisions in real life where our experiences are inevitably fragmented across many different contexts.

Contact for resource sharing

Further information and requests for resources or raw data should be directed to and will be fulfilled by the Lead Contact, Levi Solomyak (levi.solomyak@mail.huji.ac.il).

Methods

Ethics statement

The experimental protocol was approved by the Hebrew University local research ethics committee, and written informed consent was obtained from all subjects.

Subjects

27 human subjects (14 male, 13 female), aged 20 to 30 (Mean = 24 SEM ±.5), completed the experiment, which consisted of 3556 trials [46]. Given the size of the dataset obtained for each subject (an order of magnitude greater than in typical learning experiments) and the effect size found in similar prior literature [27], we expected that a meaningful finding would manifest as at least a large effect (Cohen’s D = .8; [47]). We thus selected a sample size that would provide at least 80% power of detecting such an effect (i.e., n ≥ 26). The experiment was discontinued midway for 3 additional subjects due to failure to complete learning sessions or evidence of random choosing. Subjects were recruited from a subject pool at the Hebrew University of Jerusalem as well as from the Jerusalem area. Before being accepted to the study, each subject was queried regarding each of the study’s inclusion and exclusion criteria. Inclusion criteria included fluent Hebrew or English and possession of an Android smartphone that could connect to wearable sensors via Bluetooth Low Energy. Exclusion criteria included age (younger than 18 or older than 40), impaired color discrimination, use of psychoactive substances (e.g., psychiatric medications), and current neurological or psychiatric illness. Subjects were paid 40 Israeli Shekels (ILS) per day for participation and 0.25 ILS for each coin they collected in the experimental task, which together added up to an average sum of 964 ±42 ILS over the entire duration of the study.

Subjects who missed two sessions of the experiment or who displayed patterns of making random choices were automatically excluded from the study. Random choosing was indicated by chance-level performance or reaction times below 1000 ms, which our previous experience [32] suggested is consistent with inattentive performance.

Experimental design

To test for value and preference learning, we had subjects perform a trial-and-error learning task over a period of 10 days. On each trial, subjects chose one of two available images, and then collected a coin reward with a probability associated with the chosen image. Each game included 48 such learning trials involving a set of 3 images with reward probabilities of either {0, .33, .66} or {.33, .66, 1}. These probabilities were never revealed to the subjects. Subjects were only instructed that each image was associated with a fixed probability of reward. Subjects played four games a day, two in a morning session and another two in an evening session. Over a total of 20 sessions, subjects learned about 60 unique images, each appearing in 64 learning trials over two sessions.
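For illustration, the trial structure and reward generation of a single game can be simulated along the following lines; this is an R sketch under the stated design, with the subject's choice replaced by a random placeholder, and is not the actual task code.

```r
# Minimal sketch (R) of one learning game: 48 trials, three images with fixed
# reward probabilities, each trial randomly pairing two of the three images.
# Choices are random placeholders; this is not the task code itself.
set.seed(1)
probs    <- c(0, 1/3, 2/3)   # low reward context; the high context used {1/3, 2/3, 1}
n_trials <- 48

one_game <- t(replicate(n_trials, {
  pair   <- sample(1:3, 2)               # two of the three images shown on this trial
  chosen <- sample(pair, 1)              # placeholder for the subject's choice
  reward <- rbinom(1, 1, probs[chosen])  # coin delivered with the chosen image's probability
  c(option1 = pair[1], option2 = pair[2], chosen = chosen, reward = reward)
}))
head(one_game)
```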

To assess whether concurrent or cumulative diversity promotes absolute value learning, we tested subjects on four experimental conditions involving either low or high levels of each type of diversity. Task conditions were randomly ordered across days in order to avoid confounds related to fatigue or gradual improvement in learning strategy. Images learned in low cumulative diversity conditions were learned over the span of two consecutive days. To satisfy the constraints of high cumulative diversity concerning which images are pitted against which, high cumulative diversity conditions spanned three days (see below). All four conditions yielded the same expected payout, since the average reward probability associated with images within each condition was .5.

To enhance the distinction between absolute values and relative preferences, we had images with the same absolute value (i.e., equal reward probability) learned against other images with mostly lower reward probabilities (i.e., in games where the probabilities were {0, .33, .66}; low reward context) or mostly higher reward probabilities (i.e., in games where the probabilities were {.33, .66, 1}; high reward context). Within each day, an equal number of images were learned in the high and low reward contexts.

Concurrent diversity

Low: In these conditions, the two games within each learning session were independent of one another, each involving a distinct set of three images, with each trial randomly pairing two of the three images. Thus, the number of images subjects had to concurrently track in these conditions was limited to three.

High: Each session involved a set of six images, all of which were encountered in both games. Consequently, in these conditions subjects had to concurrently track six images. To equalize low and high concurrent diversity conditions in terms of the number of images each image was pitted against within a game (two), as well as in terms of the total number of pairs of images between which subjects chose within each session, the six images formed only six different pairs. To enhance the impact of high concurrent diversity, unlike in the low concurrent diversity condition, images that were pitted against each other never had a common image that they were both pitted against (Table 1).

Table 1. Example image pairings in conditions of low and high concurrent diversity and reward context.

A) Concurrent diversity: Low; Reward context: Low
Stimulus | Expected value | Pitted against | Optimal choice frequency
1 | 0 | 2 and 3 | Never
2 | .33 | 1 and 3 | Half (over 1)
3 | .66 | 1 and 2 | Always

Cumulative diversity

Low: Every image was pitted against the same two other images in two consecutive learning sessions. Thus, in total, subjects chose between each pair of images 32 times. In these conditions, subjects learned about twelve images in the span of two days.

High: Every image was pitted against two different pairs of images in two different learning sessions (i.e., against a total of four other images). Thus, over the same number of trials involving the same number of images, subjects encountered twice as many image pairs compared to the low cumulative diversity condition. Correspondingly, subjects chose between each pair of images 16 times, half as many as under low cumulative diversity. To ensure that the opportunity to learn relative preferences was not hindered by a change in reward context midway through learning, reward context was always the same (i.e., either high or low) in both learning sessions of a given image. Implementing these criteria made it impossible to have subjects learn about twelve images in the span of two days, and thus, we had subjects learn about 18 images across the span of 3 days (Table 2). Furthermore, pairing each image with different images in two sessions meant that, for some stimuli, the two sessions could not be consecutive. Thus, 58.33% of high-cumulative-diversity images were learned over the course of two days.

Table 2. Example arrangement of images across days in conditions of high cumulative diversity.

Each number corresponds to an image subjects learned about. The arrangement ensured that every image would be pitted against four distinct images across two learning sessions. Brackets group images that were pitted against each other. A) Low concurrent. Each game consisted of three images, each pitted against the two other images. B) High concurrent. Each game consisted of six images, with each image pitted against two other images.

A) Low concurrent
Day 1 | Day 2 | Day 3
Morning session
Game 1: {1,2,3} | Game 1: {2,5,6} | Game 1: {14,15,18}
Game 2: {7,8,9} | Game 2: {8,11,12} | Game 2: {13,16,17}
Evening session
Game 1: {1,4,5} | Game 1: {3,4,6} | Game 1: {13,14,17}
Game 2: {7,10,11} | Game 2: {9,10,12} | Game 2: {15,16,18}

This extension of the learning period conferred a small benefit to accuracy in testing trials, as shown by a logistic regression on the number of days learning spanned (log odds accuracy improvement = .12, CI = [.05, .19]). Reassuringly, this incidental effect ran counter to the overall effect of high cumulative diversity, which was to impair testing performance. Thus, it did not change the interpretation of the main results.

To assess the formation of absolute values, we had subjects choose between images about which they had learned in two previous sessions (‘testing’ trials). Throughout the entire course of learning, such testing trials were interleaved with the learning trials (every 3rd trial, 24 testing trials total). Reward feedback was not shown on testing trials, but subjects were informed in advance that these trials would be rewarded with the same probabilities with which they were rewarded during learning. This reward was factored into the final bonus subjects received for their performance. Testing trials always presented a choice between images learned within the same condition. However, some of these trials presented images that subjects had already chosen between during learning (‘Learned Pair Trials’; 25% of testing trials), whereas other trials presented a choice between images about which subjects learned in separate games (‘Novel Pair trials’; 75% of testing trials). Half of novel-pair trials were designed to assess how well subjects performed in general. These trials thus presented a choice between two images, one of which was preferable to the other both in terms of reward probability and in terms of how it ranked in reward probability compared to the other images it was learned with. The other half of novel-pair trials were designed to distinguish between value and preference learning. Thus, half of these (25% of all novel-pair trials) presented a choice between images with the same expected value but that ranked differently in their original learning context, whereas the other half presented a choice between images with the same relative rank but different values. Pairs that satisfied these criteria were selected at random.

Within every session, we tested only the latest condition for which at least two learning sessions were completed. This meant that on most days only one condition was tested. However, equalizing the total number of testing trials across conditions required that there be on average 2 days (range 1–3) in which the morning and evening sessions tested different learning conditions. The variation between subjects emerged because of Sabbath observance, which resulted in some subjects completing only the morning session on Fridays (since sundown prevented the completion of the second session) and either subsequently continuing the following evening (Saturday evening) or the following day (Sunday morning).

In the morning session of the final day of the experiment (day 11), to ensure that there were sufficient testing trials of images learned on days 9 and 10, subjects were presented with testing trials from the last learned condition. In the afternoon session, subjects were presented with testing trials that spanned across conditions. However, not all subjects performed these afternoon trials, and some performed them incompletely. Therefore, the data from the afternoon trials were not included in the main analyses.

Mobile platform

To test learning across multiple well-separated sessions, we modified an app developed by Eldar et al [32] for Android smartphones using the Android Studio programming environment (Google, Mountain View, CA). The app asks users to perform experimental tasks according to a predetermined schedule. Additional features of the app not relevant for the present work include probing of changes in subjects’ mental state, including regular mood self-report questionnaires and life events and activities logging, and recording of electroencephalographic (EEG) and heart rate signals derived from wearable sensors connected using Bluetooth. All behavioral and physiological data are saved locally on the phone as SQLite databases (The SQLite Consortium), which are regularly uploaded via the phone’s data connection to a dedicated cloud.

Daily schedule

Subjects first visited the lab to receive instructions, test the app on their phones, and try out the experimental task (see Initial lab visit section below). Starting from the next day, subjects performed two experimental sessions a day, one in the morning and one in the evening, over a period of 10 consecutive days, followed by a rest day and a final day of testing. Each session began with a 5-minute heart rate measurement during which subjects were asked to remain seated. Following this, subjects put on the EEG sensor and played two games of the experimental task. The app allowed subjects to start the morning session at either 6 AM, 7 AM, 8 AM, or 9 AM, as best fitted the subject’s daily schedule, and the evening session 8 hours following this time. Subjects were allowed to adjust the timing of the sessions according to their daily schedule but were required to ensure a gap of at least 6 hours between successive sessions. On average, subjects performed the morning session at 8:56 AM (SD ±40 min) and the evening session at 6:12 PM (SD ±32 min). Subjects who were religiously observant were allowed to suspend the experiment due to holiday observance as long as they resumed it the following day. Twenty-five out of twenty-seven subjects took a holiday break, but only six of those subjects took the break during learning about specific images (i.e., they had only completed one of two learning sessions with those images). We evaluated whether these breaks resulted in an accuracy drop-off for these images relative to other images within the same condition for which learning was uninterrupted, but found no significant drop-off (Meanwith break = .87 ±.04, Meanwithout break = .84 ±.03; p = .196 bootstrap test). As part of a larger data collection effort, subjects were also asked to report their mood prior to playing each game as well as twice more throughout the day.

Materials

The experiment involved 60 images, which were abstract patterns collected from various internet sources. To ensure that images were sufficiently distinguishable from one another we ran a structural similarity analysis that assesses the visual impact of three characteristics of images: luminance, contrast, and structure [48]. We considered as sufficiently distinguishable images with a similarity index of at most .6, and this was verified by visual inspection.
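For reference, the structural similarity index of [48] combines luminance, contrast, and structure terms. A simplified, single-window version of the standard formula can be sketched in R as follows; this is an illustration, not the analysis script used for the stimuli.

```r
# Minimal sketch (R): simplified global structural similarity index (SSIM) [48]
# for two grayscale image matrices x and y with pixel intensities in [0, 1].
# The published index is usually computed over local windows and averaged;
# this single-window version is a simplification for illustration.
ssim <- function(x, y, k1 = 0.01, k2 = 0.03, L = 1) {
  c1 <- (k1 * L)^2; c2 <- (k2 * L)^2
  mx <- mean(x); my <- mean(y)
  vx <- var(as.vector(x)); vy <- var(as.vector(y))
  cxy <- cov(as.vector(x), as.vector(y))
  ((2 * mx * my + c1) * (2 * cxy + c2)) /
    ((mx^2 + my^2 + c1) * (vx + vy + c2))
}
# Image pairs with an index of at most .6 were considered sufficiently distinguishable.
```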

Statistical analyses

All non-modeling statistical analyses were performed in R using RStudio. Statistical tests were carried out using the bootstrap method with the “simpleboot” package. Correction for multiple comparisons across the two types of diversity was carried out using the Benjamini–Hochberg procedure.
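As a rough illustration of this pipeline, a bootstrap comparison followed by Benjamini–Hochberg correction might look like the sketch below; the accuracy vectors and the set of p-values being corrected are assumptions for illustration rather than the study's analysis code.

```r
# Minimal sketch (R): bootstrap test with the "simpleboot" package and
# Benjamini-Hochberg correction across the two diversity types.
# acc_low / acc_high are assumed per-subject accuracy vectors, not the study's objects.
library(simpleboot)

boot_obj <- one.boot(acc_low - acc_high, FUN = mean, R = 10000)   # bootstrap the mean paired difference
p_raw    <- 2 * min(mean(boot_obj$t >= 0), mean(boot_obj$t <= 0)) # two-sided p against zero

# Correct for multiple comparisons across concurrent and cumulative diversity
# (p_concurrent and p_cumulative are assumed to hold the two raw p-values)
p_corrected <- p.adjust(c(concurrent = p_concurrent, cumulative = p_cumulative),
                        method = "BH")
```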

Regression analyses

Regression analyses were performed using the “brms” package, which performs approximate Bayesian inference using Hamiltonian Monte Carlo sampling. We used default priors and sampled two chains of 10000 samples each, of which 1000 samples per chain were used as warm-up. To ensure convergence, we required an effective sample size of at least 10000 and an R-hat statistic of at most 1.01 for all regression coefficients. To evaluate an effect of interest, we report the median of the posterior samples of the relevant regression coefficient and their 95% highest density interval (HDI). A reliable relationship is said to exist between a predictor and an outcome if the 95% HDI excludes zero.
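
A minimal sketch of such a model fit with brms is shown below. The simulated data frame, its column names, and the simplified formula are hypothetical placeholders; the full predictor set of the actual model is specified in the following sections.

```r
# Bayesian logistic mixed model in brms with the sampler settings described
# above (default priors; two chains of 10,000 samples, 1,000 of them warm-up).
library(brms)

set.seed(1)
choice_data <- data.frame(                       # hypothetical stand-in data
  subject     = factor(rep(1:27, each = 40)),
  reward_diff = runif(27 * 40, -1, 1),
  concurrent  = rep(c(-0.5, 0.5), length.out = 27 * 40)
)
choice_data$chose_a <- rbinom(nrow(choice_data), 1, plogis(2 * choice_data$reward_diff))

fit <- brm(
  chose_a ~ reward_diff * concurrent + (1 + reward_diff * concurrent | subject),
  data   = choice_data,
  family = bernoulli(link = "logit"),
  chains = 2, iter = 10000, warmup = 1000
)

summary(fit)   # check that Rhat <= 1.01 and effective sample sizes are sufficient
# 95% highest density intervals can then be obtained from the posterior draws,
# e.g. with bayestestR::hdi(fit, ci = .95)
```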

Examining rank bias

To determine whether subjects preferred images ranked higher in the original learning context, we calculated each subject’s ranking of each image based on how many times they chose the image relative to the other images it was pitted against (best, second-best, or worst). If the difference in choice frequency between two images that were pitted against each other was below 10%, indicating no established ranking between them, then the pair was excluded from the analysis (8% of trials). These rankings were then averaged across the two learning sessions in which an image was learned about to generate an overall rank for the image. Finally, we tested for a ranking bias by examining subjects’ choices between differently ranked but similarly rewarded images (defined as less than a 10% difference in percent rewarded outcomes). Bias was defined as a tendency to choose the image that had been ranked higher during learning.
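
A minimal R sketch of the ranking step is given below; the data frame layout and its toy contents are hypothetical. In the full analysis, ranks would then be averaged over the two learning sessions and the bias computed on similarly rewarded testing pairs.

```r
# Rank each image within a learning set by how often it was chosen on the
# learning trials in which it appeared.
library(dplyr)
library(tidyr)

learning_trials <- data.frame(                  # hypothetical toy data
  img_a  = c("im1", "im1", "im2"),
  img_b  = c("im2", "im3", "im3"),
  chosen = c("im1", "im1", "im2")
)

image_ranks <- learning_trials %>%
  pivot_longer(c(img_a, img_b), values_to = "image") %>%
  group_by(image) %>%
  summarise(freq_chosen = mean(image == chosen), .groups = "drop") %>%
  arrange(desc(freq_chosen)) %>%
  mutate(rank = row_number())                   # 1 = best, 2 = second-best, 3 = worst

# Pairs whose choice-frequency difference is below 10% would be excluded from
# the bias analysis, since no ranking is established between them.
```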

Examining the influence of other options’ outcomes

To determine whether training diversity modulated the influence on choice of the past outcomes of images that are not currently presented, we used a Bayesian logistic mixed model predicting subjects’ choices. The predictors are described below for a choice between image A and image B:

  1. Own—Reward history of the currently available images in the current pair, computed as the proportion of rewarded trials when option A was chosen versus any image minus the proportion of rewarded trials when option B was chosen versus any image.

  2. Current Alternative—Reward history of the currently available images on the previous trials in which the same two images were pitted against each other, computed as the proportion of rewarded trials when option A was rejected in favor of option B minus the proportion of rewarded trials when option B was rejected in favor of option A.

  3. Other—Reward history of choosing against the currently available images, computed as the proportion of rewarded trials when option A was rejected in favor of any image other than B minus the proportion of rewarded trials when option B was rejected in favor of any image other than A.

  4. Times Chosen—Computed as the number of trials in which option A was selected out of all trials that included option A minus the number of trials when option B was selected out of all trials that included option B.

  5. Concurrent—concurrent diversity condition.

  6. Cumulative—cumulative diversity condition.

  7. All two-way interactions between each type of reward history and each type of diversity condition: Concurrent×Own, Concurrent×Current Alternative, Concurrent×Other, Cumulative×Own, Cumulative×Current Alternative, Cumulative×Other.

To provide a concrete example, consider the following five-trial segment for which we calculate the corresponding value of each regressor:

  • Trial 1: A vs B, A is selected and rewarded

  • Trial 2: B vs C, B is selected and is not rewarded

  • Trial 3: B vs C, C is selected and rewarded

  • Trial 4: A vs C, A is selected and is not rewarded

  • Trial 5 (current trial): A vs C

“Own” is calculated as the proportion of times A was rewarded when A was chosen (Trials 1 and 4, for a total of A_own = 1/2) minus the proportion of times image C was rewarded when chosen (Trial 3; C_own = 1/1). Thus, Own = 1/2 - 1 = -1/2.

“Current alternative” is calculated as the proportion of times the subject was rewarded when they rejected image A in favor of image C (no such trial exists, so A_current alternative is set to the null value of 1/2), minus the proportion of times the subject was rewarded when they rejected image C in favor of image A (Trial 4; C_current alternative = 0). Thus, Current Alternative = 1/2 - 0 = 1/2.

“Other” is calculated as the proportion of times the subject was rewarded when they rejected image A in favor of any image other than C (no such trials exist, so A_other = 1/2) minus the proportion of times the subject was rewarded when they rejected image C in favor of any image other than A. In our case, image C was rejected in favor of image B (Trial 2) and image B was not rewarded (0/1), so Other = 1/2 - 0 = 1/2.

The probability of choosing image A over image B was thus modelled as:

P(choice = A) = σ(β0 + βOwn·Own_AB + βCurrent Alternative·Current Alternative_AB + βOther·Other_AB + βTimes Chosen·Times Chosen_AB + βConcurrent·Concurrent + βCumulative·Cumulative + βConcurrent×Own·(Concurrent × Own_AB) + βConcurrent×Current Alternative·(Concurrent × Current Alternative_AB) + βConcurrent×Other·(Concurrent × Other_AB) + βCumulative×Own·(Cumulative × Own_AB) + βCumulative×Current Alternative·(Cumulative × Current Alternative_AB) + βCumulative×Other·(Cumulative × Other_AB)) (5)

where σ represents the logistic function. To account for between-subject variation, we included random intercepts as well as random slopes for all predictors.
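
The following minimal R sketch reproduces the worked example above, computing the three reward-history regressors for the current A-versus-C trial from the four preceding trials; the helper returns the null value of 1/2 when no qualifying trial exists, as described above.

```r
# Reward-history regressors (Own, Current Alternative, Other) for a choice
# between images a and b, given the history of preceding trials.
history <- data.frame(
  left   = c("A", "B", "B", "A"),
  right  = c("B", "C", "C", "C"),
  chosen = c("A", "B", "C", "A"),
  reward = c(1,   0,   1,   0)
)

proportion <- function(x) if (length(x) == 0) 0.5 else mean(x)   # 1/2 if no qualifying trial

regressors <- function(h, a, b) {
  rejected <- ifelse(h$chosen == h$left, h$right, h$left)         # the unchosen image on each trial
  own     <- proportion(h$reward[h$chosen == a]) -
             proportion(h$reward[h$chosen == b])
  cur_alt <- proportion(h$reward[rejected == a & h$chosen == b]) -
             proportion(h$reward[rejected == b & h$chosen == a])
  other   <- proportion(h$reward[rejected == a & h$chosen != b]) -
             proportion(h$reward[rejected == b & h$chosen != a])
  c(Own = own, CurrentAlternative = cur_alt, Other = other)
}

regressors(history, "A", "C")   # Own = -0.5, CurrentAlternative = 0.5, Other = 0.5
```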

Computational formalization

Whereas the main components of the computational model are described in the main text, here we detail precisely how preference and value learning were influenced by the diversity conditions. On each trial, per-subject βpreference-baseline and βvalue-baseline parameters were modulated by the following main-effect and interaction parameters.

Main effects

β(low/high concurrent) and β(low/high cumulative) are ratios that represent the impact of concurrent and cumulative diversity during learning on choice.

β(learning/testing) is a ratio that represents how the influence of prior outcomes on choice differs between learning and testing trials.

Interaction terms

β(low/high concurrent × value/preference) and β(low/high cumulative × value/preference) are ratios that represent the impact of training diversity on the relative influence of value and preference learning on choice. The more either ratio diverges from 1, the greater the impact diversity has on the balance between value and preference learning.

β(low/high concurrent × learning/testing) and β(low/high cumulative × learning/testing) are ratios that represent the impact of training diversity on the relative influence of prior outcomes on choices in learning compared to testing trials; the more either ratio diverges from 1, the greater this impact.

β(value/preference × learning/testing) is a ratio that represents how the relative influence of value and preference learning on choice differs between learning and testing trials. The more this ratio diverges from 1, the more preference and value learning are differentiated, in the sense that one algorithm influences choices more during learning trials and the other influences choices more during testing trials.

Using these main-effect and interaction parameters, βpreference and βvalue can be computed for each trial type. Thus, for example, in a learning trial of low concurrent but high cumulative diversity, we can calculate the inverse temperatures for the preference and value algorithms as follows:

βpreference = βpreference-baseline × β(low/high concurrent) × β(learning/testing) ÷ β(low/high cumulative) × β(low/high cumulative × value/preference) ÷ β(value/preference × learning/testing) ÷ β(low/high cumulative × learning/testing) ÷ β(low/high concurrent × value/preference) × β(low/high concurrent × learning/testing)
βvalue = βvalue-baseline × β(low/high concurrent) ÷ β(low/high cumulative) × β(learning/testing) ÷ β(low/high cumulative × learning/testing) ÷ β(low/high cumulative × value/preference) × β(low/high concurrent × learning/testing) × β(low/high concurrent × value/preference) × β(value/preference × learning/testing) (6)
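
A minimal R sketch of one possible reading of this parameterization is given below. It assumes that the 'low', 'learning', and 'value' levels of each factor enter as multiplications and their opposite levels as divisions, with each interaction term's direction given by the product of the two factors' signs; the parameter values are arbitrary illustrations, not fitted estimates.

```r
# Trial-specific inverse temperatures composed multiplicatively from baseline
# and ratio parameters, as in Eq 6.
params <- list(
  preference_baseline = 2.0, value_baseline = 4.0,
  concurrent       = 1.2,   # low/high concurrent
  cumulative       = 1.1,   # low/high cumulative
  learning_testing = 1.3,   # learning/testing
  concurrent_x_vp  = 0.8,   # low/high concurrent x value/preference
  cumulative_x_vp  = 1.1,   # low/high cumulative x value/preference
  concurrent_x_lt  = 1.05,  # low/high concurrent x learning/testing
  cumulative_x_lt  = 0.95,  # low/high cumulative x learning/testing
  vp_x_lt          = 1.2    # value/preference x learning/testing
)

inverse_temp <- function(p, low_concurrent, low_cumulative, learning, value) {
  s_con <- if (low_concurrent) 1 else -1   # +1 = multiply, -1 = divide
  s_cum <- if (low_cumulative) 1 else -1
  s_lt  <- if (learning)       1 else -1
  s_vp  <- if (value)          1 else -1
  baseline <- if (value) p$value_baseline else p$preference_baseline
  baseline *
    p$concurrent^s_con * p$cumulative^s_cum * p$learning_testing^s_lt *
    p$concurrent_x_vp^(s_con * s_vp) * p$cumulative_x_vp^(s_cum * s_vp) *
    p$concurrent_x_lt^(s_con * s_lt) * p$cumulative_x_lt^(s_cum * s_lt) *
    p$vp_x_lt^(s_vp * s_lt)
}

# Learning trial with low concurrent but high cumulative diversity (the example above)
beta_preference <- inverse_temp(params, low_concurrent = TRUE, low_cumulative = FALSE,
                                learning = TRUE, value = FALSE)
beta_value      <- inverse_temp(params, low_concurrent = TRUE, low_cumulative = FALSE,
                                learning = TRUE, value = TRUE)
```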

Alternative models

To identify the computations that guided subjects’ choices, we compared the model presented in the main text to several variations of this model in terms of how well each fitted subjects’ choices. These included a model that only learns absolute values (βpreference = 0), a model that only learns relative preferences (βvalue = 0), a model with leak parameters (γpreference and γvalue) that vary across conditions (BIC +240), and a non-Bayesian learning model that, instead of beta distributions, learns expected values and relative preferences using a Rescorla-Wagner update rule with fixed learning rates. This latter model includes all of the main-effect and interaction parameters of our winning model, with learning rates αvalue and αpreference replacing the leak parameters γvalue and γpreference ([49]; BIC +2370 relative to the winning model).
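
For concreteness, a minimal R sketch of the Rescorla-Wagner update used by this non-Bayesian variant is shown below; the learning rate and outcome sequence are illustrative values only.

```r
# Fixed-learning-rate prediction-error update [49]: the expected value is
# nudged toward each observed outcome.
rw_update <- function(v, outcome, alpha) v + alpha * (outcome - v)

v <- 0.5                         # initial expected value of an image
outcomes <- c(1, 0, 1, 1, 0)     # example sequence of rewarded (1) / unrewarded (0) choices
for (o in outcomes) v <- rw_update(v, o, alpha = 0.2)
v
```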

To test for a gradual shift towards absolute-value encoding as a function of time (see Ruling Out Alternative Explanations), we implemented a model that scales the inverse temperature parameter for value learning, βvalue, by e^(τ·trial(i,t)), and the inverse temperature parameter for preference learning, βpreference, by 1/e^(τ·trial(i,t)). Here τ is a free parameter controlling the degree of the shift, and trial(i,t) is the number of trials that elapsed since image i was chosen at trial t. These scaled inverse temperatures apply specifically to the outcome obtained on that trial.

To additionally determine whether our data might be better accounted for by prior work suggesting that humans gradually shift towards value learning in similar RL experiments [39], we implemented an alternative model which assumes that the shift towards absolute-value encoding grows with the number of trials that elapsed since the beginning of the experiment. This model, too, did not fit the data as well as the original model, which did not assume a continuous gradual shift (BIC +353), likely because subjects were made aware from the beginning of the experiment that they would need to generalize learned values.

Finally, we tested a beta-binomial model that allowed for asymmetries in learning from rewarded and non-rewarded outcomes. This model indeed improved the fit to the data but did not alter any of the main findings (S3 Fig).

Model fitting

We fit model parameters to subjects’ choices using an iterative hierarchical importance-sampling approach [32] implemented in MATLAB. We first used 2.5 × 10^5 random parameter settings drawn from predefined group-level distributions to compute the likelihood of observing subjects’ choices given each setting. We approximated posterior estimates of the group-level prior distributions for each of our parameters by resampling the parameter values with likelihoods as weights, and then re-fit the data based on the updated priors. These steps were repeated iteratively until model evidence ceased to increase. To derive the best-fitting parameters for each individual subject, we computed a weighted mean of the final batch of parameter settings, in which each setting was weighted by the likelihood it assigned to that subject’s choices.
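
A minimal R sketch of one iteration of this procedure, reduced to a single subject and a single parameter, is given below; the toy choice data, the logistic choice likelihood, and the lognormal group-level prior are placeholders for the actual learning model.

```r
# One iteration of importance sampling: draw parameter settings from the
# group-level prior, weight them by the likelihood of the subject's choices,
# and resample to update the prior / derive per-subject estimates.
set.seed(1)
value_diff <- runif(100, -0.5, 0.5)                     # stand-in for trial-wise value differences
choices    <- rbinom(100, 1, plogis(4 * value_diff))    # stand-in for one subject's choices

log_lik <- function(beta) sum(dbinom(choices, 1, plogis(beta * value_diff), log = TRUE))

n_settings <- 1e4                                       # the actual procedure used 2.5 x 10^5
candidate  <- rlnorm(n_settings, meanlog = 0, sdlog = 1)  # draws from the group-level prior
log_w      <- sapply(candidate, log_lik)
w          <- exp(log_w - max(log_w))                   # likelihood weights (stabilized)

# Resample settings in proportion to their likelihood to update the group-level
# prior; in the full procedure this is repeated until model evidence plateaus.
posterior_draws <- sample(candidate, n_settings, replace = TRUE, prob = w)

# Per-subject point estimate: likelihood-weighted mean of the final settings.
beta_hat <- sum(w * candidate) / sum(w)
```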

Parameter initialization

Across both models, βpreference-baseline and βvalue-baseline were initialized by sampling from a gamma distribution (k = 1, θ = 1), leak parameters (γpreference and γvalue) were initialized by sampling from a beta distribution (α = 9, β = 1), and all other parameters were initialized by sampling from a lognormal distribution (μ = 0, σ = 1).
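
For reference, these initialization distributions correspond to the following R calls; the number of sampled settings is illustrative.

```r
n <- 1e4
beta_baseline <- rgamma(n, shape = 1, scale = 1)    # beta_preference-baseline, beta_value-baseline
gamma_leak    <- rbeta(n, shape1 = 9, shape2 = 1)   # leak parameters gamma_preference, gamma_value
other_params  <- rlnorm(n, meanlog = 0, sdlog = 1)  # all remaining parameters
```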

Model comparison

For each model we estimated the optimal parameters by likelihood maximization. We then applied the Bayesian Information Criterion (BIC) to compare the goodness of fit and parsimony of each model. A so-called ‘integrative BIC’ [50] can be computed as follows: BIC = -2 ln L + k ln n, where L is the evidence in favor of each model, estimated as the mean likelihood of the model given random parameter settings drawn from the fitted group-level priors, k is the number of fitted group-level parameters and n is the number of subject choices used to compute the likelihood. This method has shown high reliability and efficacy in detecting differences within and between subjects [50–53]. We validated the model comparison procedure by simulating data using each model and using the model comparison procedure to recover the correct model (S3 Table). To validate the BIC model comparison results, we also performed model comparison using the Akaike information criterion (AIC) (S5 Table).
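
A minimal R sketch of this computation is given below; the log model evidence, parameter count, and number of choices are illustrative numbers only.

```r
# Integrative BIC [50]: -2 ln L + k ln n, where ln L is the log of the mean
# likelihood under random draws from the fitted group-level priors.
integrative_bic <- function(log_evidence, k, n) -2 * log_evidence + k * log(n)

integrative_bic(log_evidence = -6200, k = 12, n = 27 * 960)
```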

Statistical tests of parameter fits

Statistical significance of each interaction parameter was measured using a two-tailed permutation test. First, we calculated the mean of the log-transformed parameter fits across the 27 subjects to generate a summary statistic of how much the parameter deviates from 1. A mean of zero indicates that the parameter of interest does not scale the inverse temperature in either direction, whereas a mean significantly different from zero indicates that the condition modulates the inverse temperature parameters in favor of either absolute values or relative preferences.

Thus, for each parameter of interest, we generated a null distribution composed of 1000 random permutations of the data, randomly shuffling the condition of interest (e.g., whether the subject is in low or high concurrent diversity, which corresponds to inverting the impact of the parameter). We then applied the full model fitting procedure to each permuted data set and computed the p value by comparing the actual parameter fit to the distribution of parameter fits for the permuted data. We validated our parameter fits through simulating data using the best fitting parameters for each subject and then recovering those parameters (S4 Table).
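
A minimal R sketch of the permutation logic is given below, simplified so that the summary statistic is recomputed directly from hypothetical per-subject log parameter fits rather than by re-running the full model-fitting procedure, as was done in the actual analysis.

```r
# Two-tailed permutation test on the mean log parameter fit across subjects.
set.seed(1)
log_fit  <- rnorm(27, mean = .15, sd = .3)     # placeholder: log of one interaction parameter
observed <- mean(log_fit)

null_dist <- replicate(1000, {
  flip <- sample(c(-1, 1), 27, replace = TRUE) # shuffling the condition label inverts the
  mean(flip * log_fit)                         # parameter's impact, i.e., flips its log
})

p_value <- mean(abs(null_dist) >= abs(observed))
```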

Falsification of alternative models

In addition to model comparison, we examined whether alternative models generate predictions that are falsified by the data. Simulations from a preference-only model could not account for subjects’ ability to generalize at a high level across conditions (posterior model prediction: mean = .66±.1 vs. real data: mean = .83±.2), whereas a value-only model could not account for the difference in accuracy between learned-pair and novel-pair trials (MeanΔconcurrent low = 0±1, MeanΔconcurrent high = 0±.1; p = .55, bootstrap). Furthermore, the value-only model was unable to account for the effects of other images’ outcomes on image choices: the effect of other images’ outcomes was not significant (CI = [-5.31, 4.84]), nor was it modulated by concurrent diversity (CI = [-4.45, 5.76]). Thus, neither value learning nor preference learning alone could account for subjects’ behavior.

Dryad DOI

https://doi.org/10.5061/dryad.1rn8pk0xr [46]

Supporting information

S1 Table. Performance on testing trials.

(DOCX)

S2 Table. Bayesian logistic regression of subjects’ choices on reward history, training diversity, and past choice.

(DOCX)

S3 Table. Model validation.

10 full experimental datasets were simulated using each model. Rows indicate the model used to simulate data and columns indicate the model recovered from the data using the model comparison procedure.

(DOCX)

S4 Table. Validation of Parameter Recovery.

We validated our parameter fits by simulating data using the best-fitting parameters for each subject and then recovering those parameters. The correlation between simulated and recovered parameters was at least .74 for all parameters of interest that capture the effects of the experimental conditions, and at least .51 for all other parameters.

(DOCX)

S5 Table. Model comparison using AIC.

To validate the BIC model comparison results, we also performed model comparison using the Akaike information criterion (AIC).

(DOCX)

S1 Fig. Choice accuracy on similarly ranked images with different expected values across diversity conditions.

Concurrent diversity improved subjects’ performance (Mean_high = .83±.02; Mean_low = .78±.02; pcorrected = .035, bootstrap test). By contrast, cumulative diversity impaired performance on such trials (Mean_low = .82±.02; Mean_high = .77±.02; pcorrected = .04, bootstrap test), consistent with its general effect on overall performance. The plot shows individual subject accuracy (circles), group distributions of accuracy levels (violin), group means (thick lines) and standard errors (gray shading).

(TIF)

S2 Fig. Data simulated using the combined value and preference learning model demonstrates all key behavioral findings.

To determine whether the model successfully captured individual differences in our experiment, we examined how parameter fits correlated with model-agnostic measures of behavior. As expected, we found that βvalue was significantly correlated with generalization performance (r = .7) while βpreference correlated with our measure of rank bias (r = .5). We then validated the best-fitting model thoroughly by simulating, for each subject, 1000 data sets using their best fitting parameters and analyzing the simulated data in the same fashion in which we analyzed the real data. This procedure showed the model uniquely accounted for all of our behavioral findings (Fig 1) (A) In learning trials, performance is better in conditions of low concurrent (Meanlow = 88% ±1% vs Meanhigh = 85%±1) and cumulative (Meanlow = 85%±1% vs Meanhigh = 82%±1%) diversity. (B) Concurrent but not cumulative diversity leads to better generalization (Concurrent: Meanlow = −7.8%±1% vs Meanhigh = −1.2%±1 pcorrected < .001 bootstrap test; Cumulative Meanlow = −4%±1% vs Meanhigh = −6%±1 pcorrected = .23). (C) Concurrent but not cumulative diversity diminishes ranking bias (Concurrent: Meanlow = 59% ±1% vs Meanhigh = 53%±1, p = .01, Cumulative: Meanlow = 56% ±1% vs Meanhigh = 54%±1). (D) The simulated choices show that preference for an image is inversely influenced by the outcomes of both the current alternative (βCurrent alternative = -.46, CI = [-.56, -.35]) and of the other images it had previously been pitted against (βOther = -.42, CI = [-.54, -.30]). Furthermore, the influence of other images’ reward history is reduced by high concurrent diversity (βconcurrent×other = .31, CI = [.22, .40]).

(TIF)

S3 Fig. Model fits of the winning model with asymmetric learning rates.

To account for previous findings of asymmetries in learning from positive versus negative reward prediction errors [54], we implemented a beta-binomial model with asymmetric update rates. This modification did not alter any of the main findings. Namely, preference learning manifested to a greater extent in conditions of low concurrent diversity (low concurrent: βpreference = 3.07±.21; high concurrent: βpreference = 1.76±.12; p < .001, permutation test) whereas value learning manifested to a greater extent in conditions of high concurrent diversity (low concurrent: βvalue = 4.04±.18; high concurrent: βvalue = 4.54±.14; p < .001, permutation test). Furthermore, as in our winning model, cumulative diversity inhibited value learning (low cumulative: βvalue = 4.99±.3; high cumulative: βvalue = 3.78±.3; p < .001, permutation test) and had no significant impact on preference learning (low cumulative: βpreference = 1.60±.12; high cumulative: βpreference = 1.87±.14; p = .11, permutation test).

(TIF)

Acknowledgments

We thank Elisa Milwer and Alina Ryabtev for help with the data collection process.

Data Availability

All human data are freely available as a Dryad dataset, https://doi.org/10.5061/dryad.1rn8pk0xr. All code written in support of this publication is publicly available at https://github.com/lsolomyak/training_diversity_promotes_absolute_value_guided_choice

Funding Statement

This work has been made possible by NIH grants R01MH124092 and R01MH125564 (to E.E), an ISF grant 1094/20 (to E.E), and the US-Israel BSF grant 2019802 (to E.E.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Kable JW, Glimcher PW. The neurobiology of decision: consensus and controversy. Neuron. 2009;63(6):733–45. doi: 10.1016/j.neuron.2009.09.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Montague PR, Berns GS. Neural economics and the biological substrates of valuation. Neuron. 2002. Oct 10;36(2):265–84. doi: 10.1016/s0896-6273(02)00974-1 . [DOI] [PubMed] [Google Scholar]
  • 3.Padoa-Schioppa C: Range-adapting representation of economic value in the orbitofrontal cortex. J Neurosci 2011, 29:14004–14014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.O’Doherty JP. The problem with value. Neurosci Biobehav Rev. 2014;43:259–68. doi: 10.1016/j.neubiorev.2014.03.027 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Levy DJ, Glimcher PW. The root of all value: a neural common currency for choice. Curr Opin Neurobiol. 2012;(6)1027–38. doi: 10.1016/j.conb.2012.06.001 ; [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Mongillo G., Shteingart H., Loewenstein Y., The Misbehavior of Reinforcement Learning. Proceedings of the IEEE 2014, vol. 102, no. 4, pp. 528–541 [Google Scholar]
  • 7.Hayden BY, Niv Y. The case against economic values in the orbitofrontal cortex (or anywhere else in the brain). Behavioral Neuroscience. 2021;135: 192–201. doi: 10.1037/bne0000448 [DOI] [PubMed] [Google Scholar]
  • 8.Bennett D., Niv Y., Langdon A. Value-free reinforcement learning: Policy optimization as a minimal model of operant behavior. Current Opinion in Behavioral Sciences 2021; 41, 114–121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Schultz W, Dayan P, Montague PR. A neural substrate of prediction and reward. Science. 1997;275(5306):1593–9. doi: 10.1126/science.275.5306.1593 . [DOI] [PubMed] [Google Scholar]
  • 10.O’Reilly RC. Unraveling the Mysteries of Motivation. Trends Cogn Sci. 2020. (6):425–434. doi: 10.1016/j.tics.2020.03.001 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Bartra O, McGuire JT, Kable JW. The valuation system: a coordinate-based meta-analysis of BOLD fMRI experiments examining neural correlates of subjective value. Neuroimage. 2013;76:412–27. doi: 10.1016/j.neuroimage.2013.02.063 ; [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Li J, Daw ND: Signals in human striatum are appropriate for policy update rather than value prediction. J Neurosci 2011, 31:5504–5511. doi: 10.1523/JNEUROSCI.6316-10.2011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Biderman N, Shohamy D. Memory and decision making interact to shape the value of unchosen options. Nat Commun. 2021;12(1):4648. doi: 10.1038/s41467-021-24907-x ; PMCID: PMC8324852. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Zimmermann J, Glimcher PW, Louie K. Multiple timescales of normalized value coding underlie adaptive choice behavior. Nat Commun. 2018;9(1):3206. doi: 10.1038/s41467-018-05507-8 ; [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Padoa Schioppa C. Range-Adapting Representation of Economic Value in the Orbitofrontal Cortex. The Journal of neuroscience 2009. 29. 14004–14. doi: 10.1523/JNEUROSCI.3751-09.2009 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Azab H., Hayden B. Y. Correlates of decisional dynamics in the dorsal anterior cingulate cortex. PLoS biology 2017, 15(11), e2003091 doi: 10.1371/journal.pbio.2003091 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Hunt LT, Malalasekera WMN, de Berker AO, Miranda B, Farmer SF, Behrens TEJ, Kennerley SW. Triple dissociation of attention and decision computations across prefrontal cortex. Nat Neurosci. 2018;21(10):1471–1481. doi: 10.1038/s41593-018-0239-5 ; [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Hayes WM, Wedell DH. Reinforcement learning in and out of context: The effects of attentional focus. J Exp Psychol Learn Mem Cogn. 2022. doi: 10.1037/xlm0001145 . [DOI] [PubMed] [Google Scholar]
  • 19.Vlaev I, Chater N, Stewart N, Brown GD. Does the brain calculate value? Trends Cogn Sci. 2011;15(11):546–54. doi: 10.1016/j.tics.2011.09.008 . [DOI] [PubMed] [Google Scholar]
  • 20.Gigerenzer G., Gaissmaier W. Heuristic decision making. Annual review of psychology 2011, 62, 451–482. doi: 10.1146/annurev-psych-120709-145346 [DOI] [PubMed] [Google Scholar]
  • 21.Lichtenstein S., Slovic P. The construction of preference. Cambridge University Press; 2006. [Google Scholar]
  • 22.Khaw MW, Glimcher PW, Louie K. Normalized value coding explains dynamic adaptation in the human valuation process. Proc Natl Acad Sci U S A. 2017;114(48):12696–12701. doi: 10.1073/pnas.1715293114 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Klein T., Ullsperger M., Jocham G. Learning relative values in the striatum induces violations of normative decision making. Nat Commun 2017. doi: 10.1038/ncomms16033 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Nieuwenhuis S, Heslenfeld DJ, von Geusau NJ, Mars RB, Holroyd CB, Yeung N. Activity in human reward-sensitive brain areas is strongly context dependent. Neuroimage. 2005; 25(4):1302–9. doi: 10.1016/j.neuroimage.2004.12.043 . [DOI] [PubMed] [Google Scholar]
  • 25.Otto A. R., Vassena E. It’s all relative: Reward-induced cognitive control modulation depends on context. Journal of Experimental Psychology: 2021. General, 150(2), 306–313. doi: 10.1037/xge0000842 [DOI] [PubMed] [Google Scholar]
  • 26.Bavard S, Lebreton M, Khamassi M, Coricelli G, Palminteri S: Reference-point centering and range-adaptation enhance human reinforcement learning at the cost of irrational preferences. Nature Communications 2018, 9:1–12 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Bavard S, Rustichini A, Palminteri S. Two sides of the same coin: Beneficial and detrimental consequences of range adaptation in human reinforcement learning. Sci Adv. 2021. doi: 10.1126/sciadv.abe0340 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Palminteri S., Lebreton M. Context-dependent Outcome Encoding in Human Reinforcement Learning. PsyArXiv, June 2021. [Google Scholar]
  • 29.Soltani A, De Martino B, Camerer C: A range-normalization model of context-dependent choice: a new model and evidence. PLoS Comput Biol 2012, 8:e1002607. doi: 10.1371/journal.pcbi.1002607 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Gong Z., Ping H., Weidong W. Diversity in Machine Learning. IEEE Access 2018 arXiv:1807.01477 [Google Scholar]
  • 31.Lee JC, Lovibond PF, Hayes BK. Evidential diversity increases generalization in predictive learning. Q J Exp Psychol. 2019;72(11):2647–2657. doi: 10.1177/1747021819857065 . [DOI] [PubMed] [Google Scholar]
  • 32.Eldar E, Roth C, Dayan P, Dolan RJ. Decodability of Reward Learning Signals Predicts Mood Fluctuations. Curr Biol. 2018. 1433–1439 doi: 10.1016/j.cub.2018.03.038 ; [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Cleveland, William S Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association 1979. 74 (368): 829–836. doi: 10.2307/2286407.JSTOR2286407 [DOI] [Google Scholar]
  • 34.Findling C, Skvortsova V, Dromnelle R, Palminteri S, Wyart V. Computational noise in reward-guided learning drives behavioral variability in volatile environments. Nat Neurosci. 2019;22(12):2066–2077. doi: 10.1038/s41593-019-0518-9 . [DOI] [PubMed] [Google Scholar]
  • 35.Daw ND, O’Doherty JP, Dayan P, Seymour B, Dolan RJ. Cortical substrates for exploratory decisions in humans. Nature. 2006. Jun 15;441(7095):876–9. doi: 10.1038/nature04766 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Kass Robert E., Raftery Adrian E. “Bayes Factors.” Journal of the American Statistical Association, vol. 90, no. 430, 1995, pp. 773–95. JSTOR, 10.2307/2291091. [DOI] [Google Scholar]
  • 37.Polania R, Woodford M, Ruff CC: Efficient coding of subjective value. Nat Neurosci 2019, 22:134–142 doi: 10.1038/s41593-018-0292-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Hunter L, Daw ND. Context-sensitive valuation and learning. Current Opinion in Behavioral Sciences 2021. 41: 122–12 doi: 10.1016/j.cobeha.2021.05.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Juechems K., Altun T., Hira R., Jarvstad A. Human value learning and representation reflects rational adaption to task demands. Nat Hum Behav. 2022;6(9):1268–1279. doi: 10.1038/s41562-022-01360-4 [DOI] [PubMed] [Google Scholar]
  • 40.Collins AGE, Ciullo B, Frank MJ, Badre D. Working Memory Load Strengthens Reward Prediction Errors. J Neurosci. 2017;37(16):4332–4342. doi: 10.1523/JNEUROSCI.2700-16.2017 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Bornstein AM, Khaw MW, Shohamy D, Daw ND. Reminders of past choices bias decisions for reward in humans. Nat Commun. 2017. Jun 27;8:15958. doi: 10.1038/ncomms15958 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Bornstein AM, Norman KA. Reinstated episodic context guides sampling-based decisions for reward. Nat Neurosci. 2017. Jul;20(7):997–1003. doi: 10.1038/nn.4573 . [DOI] [PubMed] [Google Scholar]
  • 43.Bhui R, Gershman SJ. Decision by sampling implements efficient coding of psychoeconomic functions. Psychol Rev. 2018. Nov;125(6):985–1001. doi: 10.1037/rev0000123 . [DOI] [PubMed] [Google Scholar]
  • 44.Ronayne D., Brown G. D. Multi-attribute decision by sampling: An account of the attraction, compromise and similarity effects. Journal of Mathematical Psychology 2017, 81, 11–27 doi: 10.1016/j.jmp.2017.08.005 [DOI] [Google Scholar]
  • 45.Gershman S. J., Daw N. D. Reinforcement learning and episodic memory in humans and animals: an integrative framework. Annual review of psychology 2017., 68, 101–128. doi: 10.1146/annurev-psych-122414-033625 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Solomyak Levi; Eldar Eran; Sharp Paul (2022), Training diversity promotes absolute value guided choice, Dryad, Dataset, 10.5061/dryad.1rn8pk0xr [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Cohen J. Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers; 1988. [Google Scholar]
  • 48.Zhou W., Bovik A. C., Sheikh H. R., Simoncelli E. P. "Image Quality Assessment: From Error Visibility to Structural Similarity." IEEE Transactions on Image Processing. Vol. 13, Issue 4, April 2004, pp. 600–612. doi: 10.1109/tip.2003.819861 [DOI] [PubMed] [Google Scholar]
  • 49.Rescorla R. A., Wagner A. R. A Theory of Pavlovian Conditioning: Variations in the Effectiveness of Reinforcement and Nonreinforcement. In Black A. H., & Prokasy W. F. (Eds.), Classical Conditioning II: Current Research and Theory 1972. pp. 64–99. [Google Scholar]
  • 50.Huys QJ, Pizzagalli DA, Bogdan R, Dayan P. Mapping anhedonia onto reinforcement learning: a behavioural meta-analysis. Biol Mood Anxiety Disord. 2013. 19;3(1):12. doi: 10.1186/2045-5380-3-12 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Eldar E, Hauser TU, Dayan P, Dolan RJ. Striatal structure and function predict individual biases in learning to avoid pain. Proc Natl Acad Sci USA. 2016;113(17):4812–7. doi: 10.1073/pnas.1519829113 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Eldar E, Lièvre G, Dayan P, Dolan RJ. The roles of online and offline replay in planning. Elife. 2020; 9:e56911. doi: 10.7554/eLife.56911 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Sharp PB, Russek EM, Huys QJM, Dolan RJ, Eldar E. Humans perseverate on punishment avoidance goals in multigoal reinforcement learning. Elife. 2022. Feb 24;11:e74402. doi: 10.7554/eLife.74402 Erratum in: Elife. 2022 Oct 10;11: ; PMCID: PMC8912924. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Ciranka S., Linde-Domingo J., Padezhki I. et al. Asymmetric reinforcement learning facilitates human inference of transitive relations. Nat Hum Behav 2022. 6, 555–564 doi: 10.1038/s41562-021-01263-w [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010664.r001

Decision Letter 0

Samuel J Gershman, Lusha Zhu

22 Jul 2022

Dear Mr. Solomyak,

Thank you very much for submitting your manuscript "Training diversity promotes absolute-value-guided choice" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by four independent reviewers.  In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Lusha Zhu, Ph.D.

Associate Editor

PLOS Computational Biology

Samuel Gershman

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: Thank you for the opportunity to review this manuscript by Solomyak and colleagues, which reports the results of an interesting longitudinal experiment studying the factors that influence absolute-value-guided choice versus relative-value-guided choice. This topic speaks to a recent debate in the reinforcement learning literature, which distinguishes between (absolute-)value-based learning models, in which participants are assumed to learn a representation of the discounted cumulative expected reward associated with different options, versus relative-value (or preference-based) learning models, in which participants learn a representation of the relative goodness of different choice options, but not their absolute value. This manuscript argues that one situational factor that promotes learning of absolute values is 'concurrent diversity', operationalized as the number of different alternatives that a choice option is paired with during the learning phase.

Strengths of the paper include its careful longitudinal research design and its use of sophisticated modelling analyses to get at the aspects of the data that address its research question. I have several questions about the manuscript that are mostly about ascertaining that the statistical analyses do indeed provide an appropriate degree of support for the manuscript's conclusions.

- One of the primary model-agnostic statistics in support of the manuscript's conclusions (reported on Page 14, lines 176-182) is that at the test phase, participants preferred similarly rewarding higher-ranked images from low-concurrent-diversity conditions (Mean = 63%, p = .006) but not from the high-concurrent-diversity condition (Mean = 49%, p = .24). First of all, these numbers should be tested against one another to ensure that this difference is statistically significant. Secondly, my impression is that this test does not account for the actual reward rates of the different stimuli (i.e., the higher-ranked image might have a reward rate of 65% and the lower-ranked image might have a reward rate of 56%). It is therefore crucial to control for the actual reward rate differences of the two stimuli in the test phase in this analysis, and to show that the difference between these two conditions remains statistically significant once the actual reward rate differences of the two stimuli are accounted for.

- The beta-binomial model implicitly assumes that participants learn equivalently from gains and losses (Equation 1 and Equation 2). Other beta-binomial models are possible which allow for asymmetries in the learning process (e.g., the count of reward outcomes may be incremented at a different rate to the count of non-reward outcomes). Given the numerous studies showing asymmetries in learning from positive versus negative reward prediction errors (including in a transfer learning context; see Ciranka et al., Nature Human Behaviour, 2022), it could be useful to include such an asymmetry in the learning models.

- In the 'Testing Alternative Interpretations' section on Page 21, the manuscript tests whether there is a shift in value learning over the course of learning by means of a variant model in which there is a difference in the absolute value learning parameter only in the second half of training. This stepwise change model seems to me to be not giving a fair shot to this alternative explanation - what if the change is gradual, rather than occurring all of a sudden half way through the task? I suggest that a stronger test of this alternative explanation would be if the beta_preference and beta_value parameters were permitted to vary linearly across the course of training (with an intercept and slope fit to the data) rather than assumed to be constant or varying in a stepwise fashion.

Reviewer #2: Solomyak and colleagues investigated the environmental conditions that promote the learning of absolute value or relative preferences among choice options, both of which were common strategies in humans’ decision-making. The article investigated four learning environments and tested the participants’ performance in both learnt and novel pairs. Interestingly, the article found that training diversity promoted absolute-value-guided choice. However, a few major issues listed here needed to be addressed.

A major concern was whether the results of the current study could be explained by a third hypothesis: ranked preference. Specifically, the participants might learn the relative preference but also make use of the transitivity of relative preferences. That is, participants might learn the rank of items, but not necessarily the absolute value of each item. For example, if participants learnt A was better than B and B was better than C in the training phases, then participants would easily infer A was better than C, even in situations where the participants did not learn the value of A and C directly. Whether there is evidence supporting the absolute value strategies that cannot be explained by this hypothesis?

A second concern was the contributing factor to the generalization effect. The article found the generalization effect by calculating the accuracy differences between the novel pairs and the learnt pairs (Figure 3). It was not clear whether the generalization effect was caused by the higher performance in the novel pairs or lower performance in the learnt pairs. It might be helpful to do further analysis to confirm the contributing factor to the generalization effect and report the effect size in all four conditions to exclude the possibilities that only some of the conditions contribute the effect.

Minor Concerns:

It is not clear whether statistical tests in this article had been corrected for multiple comparisons (e.g., The test in Figure 3 needed to be corrected, since the same data was used to test the two hypotheses, concurrent diversity and cumulative diversity).

There are some typos in the manuscript: (i) In the caption of Figure 2, the t value was missing, “t(51.8)”; (ii) Figure citations need to be consistent (e.g., Line 113, “Figure 2b” to Figure 2B); (iii) Figure 4 was not cited in the main text.

Some preprint reference has been published and might be updated. (e.g., “Hayden B, Niv Y: The case against economic values in the brain. PsyArXiv 2020”, now is Hayden BY, Niv Y. The case against economic values in the orbitofrontal cortex (or anywhere else in the brain). Behavioral Neuroscience. 2021;135: 192–201. doi:10.1037/bne0000448)

Reviewer #3: The paper poses an excellent question and will be a good addition to the literature once the authors expand and clarify the methods. The analyses also need to be strengthened. At the moment, it is unfortunately difficult to ascertain the particulars of the experimental procedure, which somewhat reduces the interpretability of the results. This is particularly worrying as we (this review is a joint work) work in the field of relative/absolute value learning.

Please see our detailed comments below:

Major comments:

Task Description

Options/conditions

1/ It is not very clear what options participants saw when and what was their visual representation. It would be helpful to have a figure similar to Figure 1C showing the option combinations for each day, condition, session and game, explicitly stating both the outcome probability of each option as well as the image (or image id) shown to the participant. Also note that Figure 1B&C suggests a different set-up of your task than I believe was the case. Namely, it suggests that in the low concurrent condition, games in the same session used the same options (and images) and that options/images were repeated across the conditions.

Experimental testing

2/ Very little is said about the particulars of the testing trials. Can you please include more details? Here are some of the questions we had:

What was the ratio of learnt vs novel pairs?

How many repeats were there per pair?

Was day 11, the only day you tested images from multiple conditions or was this also the case on previous days?

Were the paired options originally from the same day/condition or did you combine options across day/conditions?

How did you come up with the novel pairs? Did you have any criteria when drawing the novel pairs or were the pairings random, or exhaustive (all possible combinations)?

While feedback was not shown for the testing trials, were the testing trials rewarded? And if so how?

Also there is some information that you mention in the figure legend but not in the methods themselves. Could you please put all relevant info in the methods even if it means repeating it? E.g. Figure 1C - There were 24 testing trials per game

Data analysis

3/ Additional Analysis required

It would be great if you could split novel pairs testing into groups based on the relative and absolute value prediction and show the bias separately for each group. In particular the following grouping will be of particular interest:

- A group with options that have the same absolute value but differ in the relative value (I believe you already isolated those and plotted them in Figure 4A?)

- A group with pairs where the relative value is the same (or very similar) for both options but their absolute value is different, e.g. options like 100% vs 66% from low reward context and 66% from high reward context vs 33% from low reward context

- A group where one of the options has both higher relative and absolute value, e.g. 100% vs 33% low reward context. If I understood your set-up correctly, this should be the majority of novel pairs.

Potentially, also a group where the option with the higher absolute value has lower relative value (though I don’t think there was any pair like this in your task. The reason for this is that while all 4 groups are informative of participants ability to generalise, only those in group 1,2 & 4 are actually informative of their preference for absolute or relative encoding.

4/ Did you observe any difference in responses to the testing trials across the span of the experiment? Did participants become more absolute encoders over time as they learnt that they will be tested on novel pairs (as they did in Juechems 2021)

Sanity checks:

5/ In choice sets where the 100% option was present, how often participants explored the alternative option.

6/ What about the simulations the alternative models? Are they falsified by the data?

Missing details

7/ (lines 182 - 184) to prove your point, you should test that preference for higher ranked images in the low concurrent condition was significantly higher than in the high concurrent condition (and the same for the cumulative condition). If one group is significantly different from a reference value while the other group is not, it does not mean that the group difference is significantly different

8/ (lines 147 - 154). How did you test this?

Previous work and position of the paper within the literature.

9/ The authors should refer to Hayes and Wedell (“in and out of context”; Journal of Experimental Psychology).

10/ The authors should mention within lines 51-52 “authors have proposed models of” a reference to Palminteri and Lebreton 2021, where most of the relative value learning models are summarized.

11/ Even more importantly, the proposed winning model is a variant of the model proposed by Bavard et al. (2018; Nature Communications; HYBRID), however somehow surprisingly this is not acknowledged and the paper is not cited.

12/ Bavard et al. (2021; Science Advances) propose a model to explain why context effects (i.e., deviation from absolute values) are stronger in blocked compared to interleaved designs (the idea is global versus local context values). Is this idea relevant here?

Minor comments:

- The fraction of your code that is publicly available is written in MATLAB as opposed to R you mentioned in the paper. - - Did you do part of modelling in MATLAB? If so, can you specify what optimizer (and other functions) you used?

- The distinction you make between concurrent and cumulative conditions was not very clear from the introduction. Although, Figure 1 somewhat clarified the issue, rephrasing the relevant lines in the introduction (72-75) would be beneficial, alternatively adding an example might also help.

- Figure 1A - there is no ITI in the picture, I believe that a game had 48 trials instead of the 72 mentioned in the figure.

- 109-111 - This is calculated on data from the learning trials right? I.e. trials with feedback

- 129 - not sure I understand what a proportion reward difference is in this case. Can you please elaborate?

- 179 - p=.006 is missing a period before the 00

- 184 - If you stated in the previous section that cumulative condition had no effect on absolute vs relative learning. Why is it suddenly problematic that you do not find any effect of the cumulative condition on a related measure .

- 268 - Is the βvalueV(i) + βpreferenceW’(i) ever over 1?

- 290-291 - β preference is stated twice with different results. I assume one of them was supposed to be β value?

- 279 - Did you fit the model only on the learning data or also on the testing data?

- 327- Did you get the same results with AIC?

- Supplementary Table 2 - Low Concurrent - Day 2 - Game 1 - Wasn't this supposed to be {4,5,6} and {10,11,12}? as per the previous schedule?

- 539 - Did you observe differences in performance in the test trials during the regular 10-day schedule and during day 11 (i.e. after one day break?)

- 549 - How many people took the break for holiday observance, did it fall within or between condition and did it affect the performance in any way?

- 878 - Can you include the model you used to calculate the rank bias?

- 635 - choice difference?

- 644 - Could you make the gap between the first fraction and the second fraction larger? On a first read it looks like a single fraction.

- 653 - Can you include a full description of the RW model you used, including free parameters?

Did you also use R for the model fitting. IF you which package?

Reviewer #4: Solomyak et al present an interesting and ambitious experiment testing how two different sources of diversity, concurrent and cumulative, push participants to learn the expected or relative value of a choice over 10 days. They found that concurrent, but not cumulative, diversity helps participants encode the expected versus relative value of an option by enabling better generalization when making new choices between previously learned options. Nevertheless, this manuscript could substantially benefit from major revisions. Namely, it is challenging to adequately assess the results and their interpretation when the task details are unclear.

Major points

1. I appreciate that this is a complicated task to describe, but it is currently very difficult to understand task details and conditions, and in some cases, task visualization appears to conflict with task description:

— For example, (line 482) in the low concurrent diversity condition, two games within each learning sessions are said to each have a distinct set of three images, yet in the experimental design figure (figure 1), it appears that they share the same three images. While I understand that the figure may not describe exact task conditions, it is extremely confusing and should mirror the actual task as much as possible. From the figure, one would infer that some options (i.e., orange and green) are sampled every day, across conditions and much more frequently than other options (i.e., hot pink and light green). From the methods, it appears that there were 60 stimuli, did participants learn all 60? How many times did they get to see each stimulus? Over how many days was each stimulus presented? I know the latter differs on whether cumulative diversity was high or low, but even in the table example on page 32, it appears that some stimuli are repeated across days, and some are not - how was this determined? These task details need to be made much clearer and in the main manuscript (the experimental design figure should also include this information).

— (line 508) It is mentioned that for high cumulative diversity, the reward context was the same in both learning sessions of a given image. Was this true of other conditions? What was the high/low reward context breakdown by condition/day? (could also be noted in experimental design).

2. From what I understand, testing either occurred interleaved during learning or during the last testing day. How were the novel pairs chosen? It is mentioned that the stimuli were paired with stimuli from prior sessions (‘two sessions’ back), but was this restricted to the same condition? Additionally, how were the novel pairs chosen on the last day? I see they were drawn from different conditions, did this lead to different choice behavior than during interleaved testing? How did older versus newer options fare on Day 11 testing? (i.e., how did long-term memory influence choice?).

3. To rule out the potential confound that a difference in the duration of learning is modulating the high versus low concurrent condition effects, authors tested a model where the shift towards absolute value learning only occurred in the second half of high-concurrent sessions. Since this model fit worse, they ruled out the confound. While I agree with this approach and interpretation, it belies an interesting assumption that more learning should lead to more ‘absolute’ versus relative preference. From what I can glean from the manuscript, the full model doesn’t seem to incorporate a dynamic “shift” from relative to absolute values, right? I say this because of the large literature on habits showing that, conversely, it could be expected that more learning leads to stronger habits/relative preference. Perhaps the potential/speculated dynamic between expected and relative value learning could be expanded upon in the discussion. Alternatively, if there is a way to test a shift from relative to absolute value learning in the current data, that should be examined.

4. A fascinating tension in the concurrent condition is that while (as expected) learning is worse in the high condition, generalization is better. Is this observed at an individual-difference level? Are worse learners better able to generalize? It could be an interesting consequence of this trade-off.

5. To differentiate independent versus relative value and given the literature on learning ‘extreme’ values within a learning context, were there test trials between the ‘extreme’ options in the two reward contexts? i.e., comparing choice between (1) stimulus associated with 0 reward (low-reward context) and stimulus associated with 0.33 probability reward in the high-reward context and (2) stimulus associated with 0.66 reward probability in low-reward context versus the 100% probability reward stimulus (high-reward context). If so, is there a reliable difference in choice for these pairs in the different conditions?

Minor points

1. Equations 6 and 7 are difficult to read. For equation 6, you may want to have consistent notation with the paragraph above (a/b versus i/j), and I can’t parse equation 7, there seem to be missing/inconsistent notation of operations.

2. It appears that only some of the effects of the Bayesian logistic mixed model (equation 6) are reported, it would be helpful to have a full table or report of the results.

3. The description of “own” versus “current alternative” and “other” isn’t clear or very intuitive. I think it would help to have a clear example of the three calculations and their distinctions.

4. The language describing “absolute” values could be slightly misleading since “absolute” is often used to mean unsigned (i.e., a value of 5 and -5 both have an absolute value of 5). I don’t think you need to change this language throughout the paper, but comparing “expected (or ‘independent’) value” and “relative value” may be more straightforward.

5. (line 82) It is first said that consecutive sessions were separated by 12 hours on average, but later (line 547) it appears this difference was (at least 6) and closer to 9 hours, and the average hours are also different in the experimental design figure description (6:56 versus 6:12).

6. I think there is a typo in Table 1 C) for stimulus 5 - it says it’s pitted against 3 and 7, but stimulus 7 isn’t included in that set.

7. (line 65) “Unfamiliar sets of familiar options” may be better understood as “New or novel sets” of learned or experienced options.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: No: there is discrepancy between what they say in the manuscript (R) and what is shared (Matlab) that has to be clarified.

Reviewer #4: None

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology, see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010664.r003

Decision Letter 1

Samuel J Gershman, Lusha Zhu

18 Oct 2022

Dear Mr. Solomyak,

We are pleased to inform you that your manuscript 'Training diversity promotes absolute-value-guided choice' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Lusha Zhu, Ph.D.

Academic Editor

PLOS Computational Biology

Samuel Gershman

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: I thank the authors for their thoughtful response, which has addressed my concerns regarding this manuscript.

Reviewer #2: The authors addressed all of my concerns. I have no further issues and appreciate that the authors performed additional analyses to convey their points well.

Reviewer #3: After careful consideration, I believe the authors successfully addressed my concerns.

Reviewer #4: The authors provided a clear and thorough response to my comments, no further suggestions.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No: I could not find an active link

Reviewer #4: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010664.r004

Acceptance letter

Samuel J Gershman, Lusha Zhu

25 Oct 2022

PCOMPBIOL-D-22-00824R1

Training diversity promotes absolute-value-guided choice

Dear Dr Solomyak,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom | ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Performance on testing trials.

    (DOCX)

    S2 Table. Bayesian logistic regression of subjects’ choices on reward history, training diversity, and past choice.

    (DOCX)

    S3 Table. Model validation.

    Ten full experimental datasets were simulated using each model. Rows indicate the model used to simulate the data and columns indicate the model recovered from the data using the model comparison procedure. An illustrative code sketch of this recovery check follows below.

    (DOCX)
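
For readers who want to reproduce this kind of model-recovery check, a minimal sketch is given below. It is not the authors' code (which is available in the linked repository): the two toy models (a fixed-bias chooser and a Rescorla-Wagner softmax learner), the trial counts, reward probabilities, and grid-search fitting are all assumptions chosen only to illustrate the logic of simulating from each model and tabulating which model a BIC comparison recovers.

```python
# Illustrative model-recovery (confusion-matrix) check in the spirit of S3 Table.
# Two toy models of binary choice are simulated; each simulated dataset is then fit
# with both models, and the model with the lowest BIC is counted as "recovered".
import numpy as np

rng = np.random.default_rng(0)
N_TRIALS = 200          # trials per simulated dataset (assumption)
N_SIMS = 10             # simulated datasets per generating model, as in S3 Table
REWARD_P = (0.8, 0.2)   # reward probabilities of the two options (assumption)

def simulate_bias(p):
    """Toy model 1: choose option 0 with a fixed probability p on every trial."""
    choices = (rng.random(N_TRIALS) >= p).astype(int)
    rewards = (rng.random(N_TRIALS) < np.take(REWARD_P, choices)).astype(float)
    return choices, rewards

def simulate_rw(alpha, beta):
    """Toy model 2: Rescorla-Wagner learning with softmax choice on a 2-armed bandit."""
    q = np.zeros(2)
    choices, rewards = np.zeros(N_TRIALS, int), np.zeros(N_TRIALS)
    for t in range(N_TRIALS):
        p0 = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))
        c = 0 if rng.random() < p0 else 1
        r = float(rng.random() < REWARD_P[c])
        q[c] += alpha * (r - q[c])
        choices[t], rewards[t] = c, r
    return choices, rewards

def nll_bias(params, choices, rewards):
    p = params[0]
    return -np.sum(np.where(choices == 0, np.log(p), np.log(1.0 - p)))

def nll_rw(params, choices, rewards):
    alpha, beta = params
    q, nll = np.zeros(2), 0.0
    for c, r in zip(choices, rewards):
        p0 = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))
        nll -= np.log(p0 if c == 0 else 1.0 - p0)
        q[int(c)] += alpha * (r - q[int(c)])
    return nll

def fit_grid(nll_fn, grids, choices, rewards):
    """Crude grid-search maximum likelihood; returns the smallest negative log-likelihood."""
    candidates = np.array(np.meshgrid(*grids)).T.reshape(-1, len(grids))
    return min(nll_fn(params, choices, rewards) for params in candidates)

MODELS = {  # name -> (simulator with random true parameters, likelihood, parameter grids)
    "bias": (lambda: simulate_bias(rng.uniform(0.2, 0.8)), nll_bias,
             [np.linspace(0.05, 0.95, 19)]),
    "rw": (lambda: simulate_rw(rng.uniform(0.1, 0.6), rng.uniform(1.0, 8.0)), nll_rw,
           [np.linspace(0.05, 0.95, 10), np.linspace(0.5, 10.0, 10)]),
}

confusion = {gen: {fit: 0 for fit in MODELS} for gen in MODELS}
for gen_name, (simulate, _, _) in MODELS.items():
    for _ in range(N_SIMS):
        choices, rewards = simulate()
        bics = {}
        for fit_name, (_, nll_fn, grids) in MODELS.items():
            nll = fit_grid(nll_fn, grids, choices, rewards)
            bics[fit_name] = len(grids) * np.log(N_TRIALS) + 2.0 * nll  # BIC = k*ln(N) + 2*NLL
        confusion[gen_name][min(bics, key=bics.get)] += 1

print(confusion)  # rows: generating model; columns: recovered (best-BIC) model
```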

    S4 Table. Validation of Parameter Recovery.

    We validated our parameter fits by simulating data using the best-fitting parameters for each subject and then recovering those parameters. The correlation between simulated and recovered parameters was at least .74 for all parameters of interest that capture the effects of the experimental conditions, and at least .51 for all other parameters. An illustrative sketch of such a parameter-recovery check follows below.

    (DOCX)
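
A minimal sketch of the parameter-recovery logic under assumed settings: each simulated "subject" is assigned known parameters, data are simulated from a stand-in Rescorla-Wagner softmax model, parameters are re-estimated by maximum likelihood over a grid, and the correlation between true and recovered values is reported. The model, grids, and trial counts are illustrative assumptions, not the authors' implementation.

```python
# Illustrative parameter-recovery check in the spirit of S4 Table.
import numpy as np

rng = np.random.default_rng(1)
N_SUBJECTS, N_TRIALS = 30, 300          # assumptions for the sketch
REWARD_P = (0.8, 0.2)
ALPHA_GRID = np.linspace(0.05, 0.95, 19)
BETA_GRID = np.linspace(0.5, 10.0, 20)

def simulate(alpha, beta):
    """Simulate one subject: Rescorla-Wagner softmax learner on a 2-armed bandit."""
    q = np.zeros(2)
    choices, rewards = np.zeros(N_TRIALS, int), np.zeros(N_TRIALS)
    for t in range(N_TRIALS):
        p0 = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))
        c = 0 if rng.random() < p0 else 1
        r = float(rng.random() < REWARD_P[c])
        q[c] += alpha * (r - q[c])
        choices[t], rewards[t] = c, r
    return choices, rewards

def neg_log_lik(alpha, beta, choices, rewards):
    """Negative log-likelihood of the observed choices under given parameters."""
    q, nll = np.zeros(2), 0.0
    for c, r in zip(choices, rewards):
        p0 = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))
        nll -= np.log(p0 if c == 0 else 1.0 - p0)
        q[int(c)] += alpha * (r - q[int(c)])
    return nll

true_params, recovered_params = [], []
for _ in range(N_SUBJECTS):
    alpha, beta = rng.uniform(0.1, 0.7), rng.uniform(1.0, 8.0)   # known "true" parameters
    choices, rewards = simulate(alpha, beta)
    # Exhaustive grid search for the maximum-likelihood estimate.
    nlls = np.array([[neg_log_lik(a, b, choices, rewards) for b in BETA_GRID]
                     for a in ALPHA_GRID])
    i, j = np.unravel_index(np.argmin(nlls), nlls.shape)
    true_params.append((alpha, beta))
    recovered_params.append((ALPHA_GRID[i], BETA_GRID[j]))

true_params, recovered_params = np.array(true_params), np.array(recovered_params)
for k, name in enumerate(["alpha", "beta"]):
    r = np.corrcoef(true_params[:, k], recovered_params[:, k])[0, 1]
    print(f"recovery correlation for {name}: r = {r:.2f}")
```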

    S5 Table. Model comparison using AIC.

    To validate the BIC model comparison results, we also performed model comparison using the Akaike information criterion (AIC).

    (DOCX)
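
Both criteria are computed from a model's maximized likelihood but penalize the number of free parameters differently (BIC's penalty grows with the number of observations). A minimal sketch with placeholder numbers, not the study's actual fits:

```python
import numpy as np

def aic(neg_log_lik, n_params):
    """Akaike information criterion: AIC = 2k - 2*ln(L)."""
    return 2 * n_params + 2 * neg_log_lik

def bic(neg_log_lik, n_params, n_obs):
    """Bayesian information criterion: BIC = k*ln(N) - 2*ln(L)."""
    return n_params * np.log(n_obs) + 2 * neg_log_lik

# Example: two hypothetical fits to the same 4000 choices (placeholder numbers).
fits = {"value + preference": (2100.0, 6), "value only": (2160.0, 4)}  # (NLL, k)
for name, (nll, k) in fits.items():
    print(f"{name}: AIC = {aic(nll, k):.1f}, BIC = {bic(nll, k, 4000):.1f}")
```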

    S1 Fig. Choice accuracy on similarly ranked images with different expected values across diversity conditions.

    Concurrent diversity improved subjects’ performance (Mean_high = .83±.02, Mean_low = .78±.02, p_corrected = .035, bootstrap test). By contrast, cumulative diversity impaired performance on such trials (Mean_low = .82±.02, Mean_high = .77±.02, p_corrected = .04, bootstrap test), consistent with its general effect on overall performance. The plot shows individual subject accuracy (circles), group distributions of accuracy levels (violins), group means (thick lines), and standard errors (gray shading). A generic sketch of this type of bootstrap comparison follows below.

    (TIF)
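
The bootstrap comparisons reported here can be illustrated with a generic paired bootstrap over subjects: resample subjects with replacement, recompute the mean within-subject difference, and derive a two-sided p-value. The sketch below uses simulated accuracies and a simple Bonferroni-style correction; the authors' exact resampling scheme and correction may differ.

```python
# Generic paired bootstrap test for a within-subject difference in accuracy.
import numpy as np

rng = np.random.default_rng(2)
n_subjects, n_boot = 40, 10000

# Simulated per-subject accuracies in the two conditions (placeholder values).
acc_high = rng.normal(0.83, 0.08, n_subjects).clip(0, 1)
acc_low = rng.normal(0.78, 0.08, n_subjects).clip(0, 1)
diff = acc_high - acc_low   # within-subject difference

# Resample subjects with replacement and record the mean difference each time.
boot_means = np.array([rng.choice(diff, size=n_subjects, replace=True).mean()
                       for _ in range(n_boot)])

# Two-sided p-value: how often the resampled mean falls on the "wrong" side of zero.
p = 2 * min((boot_means <= 0).mean(), (boot_means >= 0).mean())
p_corrected = min(1.0, 2 * p)   # e.g., Bonferroni correction across two comparisons
print(f"mean difference = {diff.mean():.3f}, p = {p:.4f}, p_corrected = {p_corrected:.4f}")
```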

    S2 Fig. Data simulated using the combined value and preference learning model demonstrates all key behavioral findings.

    To determine whether the model successfully captured individual differences in our experiment, we examined how parameter fits correlated with model-agnostic measures of behavior. As expected, we found that βvalue was significantly correlated with generalization performance (r = .7), whereas βpreference correlated with our measure of rank bias (r = .5). We then validated the best-fitting model thoroughly by simulating, for each subject, 1000 datasets using their best-fitting parameters and analyzing the simulated data in the same fashion as the real data (a generic skeleton of this simulate-and-reanalyze procedure is sketched below). This procedure showed that the model uniquely accounted for all of our behavioral findings (Fig 1). (A) In learning trials, performance is better in conditions of low concurrent (Meanlow = 88%±1% vs Meanhigh = 85%±1%) and low cumulative (Meanlow = 85%±1% vs Meanhigh = 82%±1%) diversity. (B) Concurrent but not cumulative diversity leads to better generalization (Concurrent: Meanlow = −7.8%±1% vs Meanhigh = −1.2%±1%, pcorrected < .001, bootstrap test; Cumulative: Meanlow = −4%±1% vs Meanhigh = −6%±1%, pcorrected = .23). (C) Concurrent but not cumulative diversity diminishes ranking bias (Concurrent: Meanlow = 59%±1% vs Meanhigh = 53%±1%, p = .01; Cumulative: Meanlow = 56%±1% vs Meanhigh = 54%±1%). (D) The simulated choices show that preference for an image is inversely influenced by the outcomes of both the current alternative (βCurrent alternative = −.46, CI = [−.56, −.35]) and the other images it had previously been pitted against (βOther = −.42, CI = [−.54, −.30]). Furthermore, the influence of other images’ reward history is reduced by high concurrent diversity (βconcurrent×other = .31, CI = [.22, .40]).

    (TIF)
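
A skeleton of the simulate-and-reanalyze procedure described above, under assumed settings: simulate many datasets per subject from that subject's best-fitting parameters and then apply the same summary analyses used on the real data. The generative model and summary statistic below are stand-ins, not the authors' combined value-and-preference model.

```python
import numpy as np

rng = np.random.default_rng(3)
N_SUBJECTS = 20           # placeholder; the study had more subjects
N_SIMS_PER_SUBJECT = 100  # 1000 in the paper; fewer here to keep the sketch quick

def simulate_subject(alpha, beta, n_trials=200, reward_p=(0.8, 0.2)):
    """Stand-in generative model: Rescorla-Wagner softmax learner on a 2-armed bandit.
    Returns a single summary statistic (choice accuracy) for the simulated session."""
    q = np.zeros(2)
    correct = 0
    for _ in range(n_trials):
        p0 = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))
        c = 0 if rng.random() < p0 else 1
        r = float(rng.random() < reward_p[c])
        q[c] += alpha * (r - q[c])
        correct += (c == 0)           # option 0 is the objectively better option here
    return correct / n_trials

# Best-fitting parameters per subject would normally come from the model-fitting step;
# here they are random placeholders.
fitted_params = [(rng.uniform(0.1, 0.6), rng.uniform(1.0, 6.0)) for _ in range(N_SUBJECTS)]

# Simulate many sessions per subject and summarize, exactly as one would the real data.
simulated_accuracy = np.array([
    np.mean([simulate_subject(a, b) for _ in range(N_SIMS_PER_SUBJECT)])
    for a, b in fitted_params
])
print("group-level simulated accuracy:", simulated_accuracy.mean().round(3))
# The same group-level analyses applied to the real data (condition contrasts, the
# regression of choice on reward history, etc.) would then be run on these simulated
# datasets to check that the fitted model reproduces the key behavioral effects.
```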

    S3 Fig. Model fits of the winning model with asymmetric learning rates.

    To account for previous findings of asymmetries in learning from positive versus negative reward prediction errors [54], we implemented a beta-binomial model with asymmetric update rates (the asymmetric-update idea is sketched below). This modification did not alter any of the main findings. Namely, preference learning manifested to a greater extent in conditions of low concurrent diversity (low concurrent: βpreference = 3.07±.21; high concurrent: βpreference = 1.76±.12; p < .001, permutation test), whereas value learning manifested to a greater extent in conditions of high concurrent diversity (low concurrent: βvalue = 4.04±.18; high concurrent: βvalue = 4.54±.14; p < .001, permutation test). Furthermore, as in our winning model, cumulative diversity inhibited value learning (low cumulative: βvalue = 4.99±.3; high cumulative: βvalue = 3.78±.3; p < .001, permutation test) and had no significant impact on preference learning (low cumulative: βpreference = 1.60±.12; high cumulative: βpreference = 1.87±.14; p = .11, permutation test).

    (TIF)
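
The asymmetric-update idea can be illustrated with a generic sign-dependent learning rate: positive and negative prediction errors update the estimate at different rates. This is a minimal Rescorla-Wagner-style sketch of the concept, not the authors' beta-binomial formulation, and the learning rates and reward probability are arbitrary.

```python
import numpy as np

def update_value(value, reward, alpha_pos=0.3, alpha_neg=0.1):
    """Update a value estimate with a learning rate that depends on the sign of the
    prediction error (larger for better-than-expected outcomes in this example)."""
    prediction_error = reward - value
    alpha = alpha_pos if prediction_error > 0 else alpha_neg
    return value + alpha * prediction_error

# Example: learn about an option rewarded on 70% of trials.
rng = np.random.default_rng(4)
value = 0.5
for _ in range(100):
    value = update_value(value, float(rng.random() < 0.7))
print(f"learned value after 100 trials: {value:.2f}")
# Tends to settle above .7 because alpha_pos > alpha_neg (an "optimistic" asymmetry).
```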

    Attachment

    Submitted filename: Responses_to_reviewers.docx

    Data Availability Statement

    All human data are freely available as a Dryad dataset: https://doi.org/10.5061/dryad.1rn8pk0xr. All code written in support of this publication is publicly available at https://github.com/lsolomyak/training_diversity_promotes_absolute_value_guided_choice


    Articles from PLOS Computational Biology are provided here courtesy of PLOS
