Proceedings of the National Academy of Sciences of the United States of America. 2012 May 16;109(22):8776–8779. doi: 10.1073/pnas.1205131109

Mice take calculated risks

Aaron Kheifets, C. R. Gallistel
PMCID: PMC3365199  PMID: 22592792

Abstract

Animals successfully navigate the world despite having only incomplete information about behaviorally important contingencies. It is an open question to what degree this behavior is driven by estimates of stochastic parameters (brain-constructed models of the experienced world) and to what degree it is directed by reinforcement-driven processes that optimize behavior in the limit without estimating stochastic parameters (model-free adaptation processes, such as associative learning). We find that mice adjust their behavior in response to a change in probability more quickly and abruptly than can be explained by differential reinforcement. Our results imply that mice represent probabilities and perform calculations over them to optimize their behavior, even when the optimization produces negligible material gain.

Keywords: reinforcement learning, decision under uncertainty, model-based control, probability estimation, timing


A central question in cognitive neuroscience is the extent to which computed representations of the experienced environment direct behavior. Many contemporary models of adjustment to changing risks assume computationally simpler event-by-event value-updating processes (1–3). The advantage of these models is that a computationally simple process does in time adjust behavior more or less optimally to changes in, for example, the probability of reinforcement. The advantage of computationally more complex model-based decision making is the rapidity and accuracy with which behavior can adjust to changing probabilities (4, 5). Recent studies with humans (6) suggest that the rapidity of adjustment to changes in the payoff matrix in a choice task requires a representation of risk factors (probabilities and variabilities). We report an analogous result with mice when they adjust to a shift in relative risk. In our experiment, a change in the relative frequencies of short and long trials changes the optimal temporal decision criterion in an interval-timing task. We find that mice make the behavioral adjustment to the change in relative risk about as soon and as abruptly as is in principle possible. Our results imply that neural mechanisms for estimating probabilities and calculating relative risk are phylogenetically ancient, which means that they may be investigated by genetic and other invasive procedures in genetically manipulated mice and other laboratory animals.

The hopper-switch task (7) has been used to show approximately optimal risk adjustment in mice (8). The task requires the mouse to decide on each trial whether to leave a short-latency feeding location (hopper) for a long-latency one, based on its estimate of how much time has elapsed since the trial began (Fig. 1). If the computer has chosen a short trial, the mouse gets a pellet in the short-latency hopper in response to the first poke there at or after 3 s. Failure to obtain a pellet there once 3 s have elapsed implies that the computer has chosen a long trial. On a long trial, a pellet is delivered into the other hopper in response to the first poke there at or after 9 s. The computer chooses short or long trials at random, with the probability of choosing a short trial denoted pS. At the beginning of a trial, the mouse does not know which type of trial it will be.

Fig. 1.

The experimental environment. In the switch task, a trial proceeds as follows: 1: Light in the trial-initiation hopper signals that the mouse may initiate a trial. 2: The mouse approaches and pokes into the trial-initiation hopper, extinguishing the light there and turning on the lights in the two feeding hoppers (trial onset). 3: The mouse goes to the short-latency hopper and pokes into it. 4: If, after 3 s have elapsed since trial onset, poking in the short-latency hopper does not deliver a pellet, the mouse switches to the long-latency hopper, where it gets a pellet in response to the first poke at or after 9 s since trial onset. Lights in both feeding hoppers extinguish either at pellet delivery or when an erroneously timed poke occurs. Short trials last about 3 s and long trials about 9 s, whether reinforced or not: if the mouse is poking in the short hopper at the end of a 3-s trial, it gets a pellet and the trial ends; if it is poking in the 9-s hopper, it does not get a pellet and the trial ends at 3 s. Similarly, long trials end at 9 s: if the mouse is poking in the 9-s hopper, it gets a pellet; if in the 3-s hopper, it does not. A switch latency is the latency of the last poke in the short hopper before the mouse switches to the long hopper. Only the switch latencies from long trials are analyzed.

The rational strategy given these contingencies is to poke in the short hopper at the start of each trial, then switch to the long hopper if and when 3 s or more elapse without a pellet being delivered in response to a poke into the short hopper. The mice learn to do this action within a few daily sessions. When the mice have learned the action, they get a pellet on almost every trial. However, the mice sometimes switch too soon (after fewer than 3 s) or too late (after more than 9 s), and they sometimes go directly to the long hopper. Whether these errors cost them a pellet or not depends on which kind of trial the computer has chosen. Thus, the risk of one kind of error or the other depends on pS, the probability that the current trial is a short one, and on the subject’s timing variability, how precisely it can judge when 3 s have elapsed.

When pS changes, altering the relative risks from the two kinds of error, mice shift the distribution of their switch latencies by an approximately optimal amount (8) (Fig. 2). The question we pose is, what kind of process underlies this behavioral shift in response to a change in the relative frequencies of the short and long trials?

Fig. 2.

Scatter plot of switch latencies on long trials (small black circles). Red marks at the top and bottom of each plot indicate long and short trials, respectively. Vertical lines mark session boundaries. Thick blue horizontal lines are the medians of the distributions. When the relative density of the red marks at top and bottom changes (that is, when the relative frequency of long and short trials changes), the distribution of black circles shifts away from the increased red density (increased risk). Large red ovals indicate unilateral feeder malfunctions.

On the reinforcement-learning hypothesis, the shift in the distribution of switch latencies is driven by a change in the relative frequency with which different switch latencies are reinforced. When the short-latency trials are frequent, the probability of a long-latency switch being reinforced increases and the probability of a short-latency switch decreases, and vice versa when long-latency trials are frequent. With this hypothesis, a slow trial-and-error hill-climbing process leads in time to a new stable distribution that maximizes the reinforcement probability. In this reinforcement-learning hypothesis, the mouse makes no estimate of the relative frequency of the short and long trials, nor of its own timing variability, and it does not calculate an optimal target. The mouse simply discovers by trial and error the target switch latency that maximizes the rate of reinforcement.
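
The contrast between the two hypotheses can be made concrete in code. The following is a minimal sketch of the kind of trial-and-error hill-climbing process the reinforcement-learning hypothesis envisions; the block size, step size, and Gaussian timing-noise model are illustrative assumptions of ours, not a description of the authors' procedure or of any specific published algorithm.

```python
import numpy as np

def run_trial(target, sigma, p_short, rng):
    """Simulate one trial: return 1 if a pellet is earned, else 0.

    The realized switch time is the target perturbed by Gaussian timing noise.
    A short trial (probability p_short) pays off if the mouse is still at the
    short hopper at 3 s (switch > 3); a long trial pays off if the switch
    happens before 9 s (switch < 9).
    """
    switch = rng.normal(target, sigma)
    if rng.random() < p_short:
        return int(switch > 3.0)
    return int(switch < 9.0)

def hill_climb(n_trials, target=6.0, sigma=1.0, p_short=0.5, step=0.3):
    """Trial-and-error adjustment of the target switch latency.

    Every block of trials, the target is nudged up or down at random; the
    perturbation is kept only if it yields at least as many pellets as the
    previous block. No probability is ever estimated.
    """
    rng = np.random.default_rng(1)
    block = 50
    best_rate = np.mean([run_trial(target, sigma, p_short, rng) for _ in range(block)])
    history = [target]
    for _ in range(n_trials // block):
        candidate = target + rng.choice([-step, step])
        rate = np.mean([run_trial(candidate, sigma, p_short, rng) for _ in range(block)])
        if rate >= best_rate:
            target, best_rate = candidate, rate
        history.append(target)
    return history

print(hill_climb(2000, p_short=0.25)[-1])  # drifts only slowly toward the optimum
```

The point of the sketch is that a process of this kind can move its target only after differential reinforcement has accumulated, which is why it predicts gradual rather than step-like adjustment.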

On what we call the model-based hypothesis, the mouse brain estimates pS, it estimates its timing variability (σ), that is, its uncertainty about how much time has elapsed, and it calculates from its estimates of these stochastic parameters the optimal target switch time (Fig. S1). On this hypothesis, there will be a step-change in the distribution of switch latencies as soon as the mouse detects the change in the relative frequency of the short and long trials, estimates the new relative frequency, and calculates the new optimum. Such a step-change in the distribution of switch latencies can occur before the change in the relative frequency of the trial types has produced any change in the rate of reinforcement. On the reinforcement-learning hypothesis, this cannot happen, because the change in behavior is driven by a change in the relative rates at which short and long switches are reinforced.
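
By contrast, a model-based computation needs only the current estimates of pS and of timing variability. The following is a minimal sketch of how an optimal target could be computed from those two estimates; the Gaussian timing noise whose SD scales with the target (a scalar-timing assumption), the coefficient of variation, and the function names are ours, added for illustration.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def expected_gain(target, p_short, cv=0.2, short=3.0, long=9.0):
    """Expected probability of earning a pellet for a given target switch time.

    Timing noise is taken to be Gaussian with SD proportional to the target
    (an illustrative scalar-timing assumption). A short trial pays off if the
    realized switch time exceeds 3 s; a long trial pays off if the switch
    occurs before 9 s.
    """
    sigma = cv * target
    p_win_short = 1.0 - norm.cdf(short, loc=target, scale=sigma)  # still at short hopper at 3 s
    p_win_long = norm.cdf(long, loc=target, scale=sigma)          # already at long hopper by 9 s
    return p_short * p_win_short + (1.0 - p_short) * p_win_long

def optimal_target(p_short, cv=0.2):
    """Target switch time that maximizes expected gain, found numerically."""
    res = minimize_scalar(lambda t: -expected_gain(t, p_short, cv),
                          bounds=(3.0, 9.0), method="bounded")
    return res.x

for p in (0.1, 0.5, 0.9):
    print(p, round(optimal_target(p), 2))  # the target shifts away from the riskier error
```

Once the new pS estimate is in hand, this calculation yields the new target immediately, which is what allows the distribution of switch latencies to shift in a single step.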

Results

As may be seen from the raw data on switch latencies plotted in Fig. 2, when we changed the relative frequency of the short and long trials, the mice adjusted the distribution of their switch latencies so quickly and so abruptly that the adjustments appear step-like.

Regardless of the relative frequency of the short and long trials, the distribution of switch latencies was well described by an ExpGauss mixture distribution. One component of the mixture was a short-latency exponential, the expectation of which did not vary with the relative frequency of short and long trials. The switches contributing to this component of the mixture appeared to be impulsive, not timed. The other component was a Gaussian, the mean of which varied with the relative frequency of the short and long trials. The ExpGauss mixture distribution has four parameters: the expectation of the exponential, the mean and SD of the Gaussian (μ and σ), and a proportion parameter that gives the mixing proportion of the two components. We take the mean of the Gaussian component to be an estimate of the target switch time on those trials when the mouse timed its switch.
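
For concreteness, the mixture density and a maximum-likelihood fit might be sketched as follows; the parameter names (tau, mu, sigma, q), starting values, and bounds are illustrative assumptions of ours, not the values or fitting routine used in the analysis.

```python
import numpy as np
from scipy.stats import expon, norm
from scipy.optimize import minimize

def expgauss_mixture_loglik(params, latencies):
    """Log-likelihood of the four-parameter exponential-Gaussian mixture.

    params = (tau, mu, sigma, q): expectation of the exponential component,
    mean and SD of the Gaussian component, and the mixing proportion q of
    impulsive (exponential) switches.
    """
    tau, mu, sigma, q = params
    dens = q * expon.pdf(latencies, scale=tau) + (1 - q) * norm.pdf(latencies, mu, sigma)
    return np.sum(np.log(np.clip(dens, 1e-300, None)))

def fit_expgauss(latencies):
    """Maximum-likelihood fit by direct numerical optimization (illustrative)."""
    x0 = np.array([1.0, np.median(latencies), np.std(latencies), 0.1])
    bounds = [(0.05, 10.0), (3.0, 9.0), (0.1, 5.0), (0.0, 0.5)]
    res = minimize(lambda p: -expgauss_mixture_loglik(p, latencies), x0, bounds=bounds)
    return res.x

# Synthetic check: mostly timed switches around 6 s plus a few impulsive ones.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(6.0, 0.8, 450), rng.exponential(1.0, 50)])
print(np.round(fit_expgauss(data), 2))
```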

We modeled the trial-by-trial trajectory of the behavioral shifts from one ExpGauss distribution to another as a piece-wise linear change in θ, the parameter vector of the ExpGauss distribution, from its prechange value θb to its postchange value θa (Fig. 3A). Our transition function has two parameters: the number of trials before the start (S) of the behavioral shift and the number of trials for it to go to completion once started (D, for duration):

$$\theta_T \;=\; \theta_b \;+\; f(T)\,(\theta_a - \theta_b), \qquad f(T) \;=\; \begin{cases} 0, & T \le S \\ \dfrac{T-S}{D}, & S < T < S + D \\ 1, & T \ge S + D \end{cases}$$

Fig. 3.

(A) A transition function, with parameters S (start) and D (duration), describes the trial-by-trial change in the parameter vector, θ, of the ExpGauss mixture. Trial 0 is the last trial before the change in pS. (B) Cumulative distribution of S estimates. (C) Cumulative distribution of Bayes factors favoring D = 1 (a step switch) vs. 1 ≤ D ≤ 100 (a more gradual switch).

where f(T) is the fraction of the transition completed as of trial T (Fig. 3A). We then computed the marginal-likelihood functions for S and D to find their plausible values.
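
A sketch of that computation, reusing the mixture density above and assuming flat priors over a brute-force grid of candidate S and D values (the grid ranges and helper names are ours):

```python
import numpy as np
from scipy.stats import expon, norm

def expgauss_pdf(x, theta):
    """Density of the exponential-Gaussian mixture with theta = (tau, mu, sigma, q)."""
    tau, mu, sigma, q = theta
    return q * expon.pdf(x, scale=tau) + (1 - q) * norm.pdf(x, mu, sigma)

def transition_loglik(S, D, latencies, theta_b, theta_a):
    """Log-likelihood of the switch latencies under a piece-wise linear transition.

    Trial 0 is the last trial before the change in pS; f gives the fraction of
    the transition completed on each subsequent trial.
    """
    T = np.arange(1, len(latencies) + 1)
    f = np.clip((T - S) / D, 0.0, 1.0)
    logp = 0.0
    for x, frac in zip(latencies, f):
        theta = (1 - frac) * np.asarray(theta_b) + frac * np.asarray(theta_a)
        logp += np.log(max(expgauss_pdf(x, theta), 1e-300))
    return logp

def marginal_likelihoods(latencies, theta_b, theta_a, S_max=60, D_max=100):
    """Relative marginal likelihoods of S and D on a grid, under flat priors."""
    L = np.array([[transition_loglik(S, D, latencies, theta_b, theta_a)
                   for D in range(1, D_max + 1)] for S in range(0, S_max + 1)])
    L = np.exp(L - L.max())               # rescale for numerical stability
    return L.sum(axis=1), L.sum(axis=0)   # marginal over D -> S; marginal over S -> D
```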

Fig. 3B plots the cumulative distribution of the expectations of the marginal-likelihood functions for S, the start parameter. These expectations are estimates of how many trials it takes a mouse to detect a substantial change in pS and begin to adjust its switch latencies. The median estimate for the start of a transition is 10 trials.

The median expectation of the marginal-likelihood function for D, the duration of a transition, was 19 trials, with a range from 3 to 105. However, in all 14 transitions, the likelihood at D = 1 was higher than the mean likelihood over the range 1 ≤ D ≤ 100. Thus, in every case, the transition data favored the step hypothesis (D = 1) over the alternative hypothesis that the transition lasted somewhere between 1 and 100 trials (see Fig. S2 for the marginal-likelihood functions). The cumulative distribution of the Bayes factors (the odds in favor of D = 1) is in Fig. 3C. The combined odds in favor of a step transition (the product of the Bayes factors) are enormous.
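
Under a flat prior on D over 1 ≤ D ≤ 100, this Bayes factor reduces to the likelihood at D = 1 divided by the average likelihood over the whole range. A one-line sketch, applied to the D marginal from the previous sketch (the function name is ours):

```python
def bayes_factor_step(marginal_D):
    """Odds favoring a one-trial (step) transition over 1 <= D <= 100.

    marginal_D[k] is the (relative) marginal likelihood of D = k + 1, as
    returned by marginal_likelihoods above; dividing the value at D = 1 by the
    mean over the range gives the Bayes factor for the step model against the
    flat-prior alternative.
    """
    return marginal_D[0] / marginal_D[:100].mean()
```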

Our key findings are thus: (i) it takes mice about 10 trials to detect a substantial change in the relative frequency of the short and long trials and to compute an estimate of the new pS from the data since the estimated change; (ii) the change in target switch latency, and hence in the distribution of switch latencies, is accomplished in a single trial. The step change in the distribution implies that the prechange estimate of pS is replaced by a postchange estimate, based on postchange data that do not overlap the prechange data.

To constrain the space of possible calculations that could underlie this behavior, we considered two basic kinds of algorithms: reinforcement-driven algorithms that optimize behavior without modeling the experienced world, and model-driven algorithms that estimate stochastic parameters of the experienced world (probabilities and variabilities) and calculate an optimal change in the decision criterion (the target switch time) given those estimates. Reinforcement-learning algorithms are often used in machine learning when constructing a model of the environment in which the machine must act is thought to be too complex. The assumption that reinforcement shapes behavior without the brain's constructing a representation of the experienced world has a long history in psychology and in behavioral and cognitive neuroscience (9–13). In reinforcement learning, behavioral change is driven by changes in the rate of reinforcement that a pattern of behavior (a "strategy") produces. In the present case, this translates to changes in the average number of pellets obtained per trial for a given target switch latency. The target switch latency is the strategy, and varying it varies the rate of reinforcement.

In line with the findings of ref. 8, we found that subjects missed very few reinforcements. Each panel in Fig. 4 plots two functions in the plane of the parameters of the Gaussian distribution (σ and μ). The gain contours (red curves) delimit the expected percentage of reinforced trials for combinations of values of these parameters. When the (μ, σ) estimates for the Gaussian component of a subject's switch-latency distribution are below the 99% gain contour, the subject gets pellets on more than 99% of the trials; when they are above a given contour, the subject gets pellets on a smaller percentage of the trials.
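
A gain contour is a level set of the expected-gain surface over the (σ, μ) plane. A minimal sketch of how such a surface could be computed, ignoring the impulsive (exponential) switches and using illustrative grid ranges of our own choosing:

```python
import numpy as np
from scipy.stats import norm

def gain(mu, sigma, p_short, short=3.0, long=9.0):
    """Expected fraction of reinforced trials for a Gaussian switch-latency
    distribution with mean mu and SD sigma (impulsive switches ignored)."""
    p_win_short = 1.0 - norm.cdf(short, loc=mu, scale=sigma)
    p_win_long = norm.cdf(long, loc=mu, scale=sigma)
    return p_short * p_win_short + (1.0 - p_short) * p_win_long

# Gain surface on a (sigma, mu) grid; the 99% gain contour of Fig. 4 corresponds
# to the level set gain = 0.99 of a surface like this one.
sigmas = np.linspace(0.2, 2.5, 120)
mus = np.linspace(3.0, 9.0, 200)
surface = gain(mus[None, :], sigmas[:, None], p_short=0.25)
inside_99 = surface >= 0.99   # parameter combinations that reinforce >99% of trials
print(inside_99.sum(), "grid points lie under the 99% gain contour")
```

The flatness of this surface near its maximum is what makes the gain hill shallow, and hence what makes hill-climbing a slow route to the optimum.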

Fig. 4.

Parameters of the Gaussian switch-latency distributions (black concentric circles) in relation to gain contours (red inverted-U curves). A gain contour traces the upper limit on the combinations of variability (σ) and mean (μ) consistent with an expected percentage of reinforced trials. The small concentric circles are the confidence limits on our estimates of the mean and SD of the Gaussian component of the subject's switch-latency distribution. The circles consistently fall near the mode of the 99% gain contour. Note the change in gain contours with changes in pS. Note also the shallowness of the gain hill: the shallower the gain hill, the more slowly a hill-climbing algorithm can find its summit (that is, converge on the optimal strategy). Note finally that the confidence limits under different pS conditions do not overlap. Thus, the changes in pS produce statistically significant changes in the mean switch latency, as is also evident in the raw data (Fig. 2).

The joint confidence limits around our estimates of the parameters of the Gaussian component of the switch-latency distributions are the small concentric circles in Fig. 4. We assume that subjects cannot control their timing variability, that is, the location of the concentric circles along the σ axis. Subjects determine the location along the μ axis when they calculate or otherwise arrive at a target switch latency. As may be seen in Fig. 4, the observed locations are generally close to the mode of the 99% gain contour (close to the optimal location), even though substantial deviations from that mode would have little impact on gain (rate of reinforcement). The gain contours shift when pS changes. The optimal target latency depends on pS and on a subject's timing variability. Sensitivity to one's own variability has also been demonstrated in humans (4, 14, 15).

We examined whether the rate of reinforcement could drive these behavioral shifts by calculating the number of pellets missed before the change in behavior went to completion. In estimating when the change went to completion, we used the expected duration of a transition rather than the maximally likely duration, which, as already noted, was always 1. Over 30% of the behavioral shifts went to completion before a single pellet was missed (because of switching too soon or too late). Moreover, the pattern and number of reinforcements missed after the behavioral shift were indistinguishable from the pattern and number missed over the same span of trials before the shift. Thus, our data rule out reinforcement learning as an explanation for the changes in the distribution of switch latencies. The change in the distribution of switch latencies must be driven by a change in the estimate of pS, based on the subject's experience over several trials.

Finally, we asked whether the adjustments in the estimates of pS were a consequence of trial-by-trial probability tracking (leaving later after experiencing one or two short trials and sooner after experiencing one or two long trials). Given the abruptness of the changes, the resulting estimates of pS would have to be highly sensitive to the most recent two or three trials. However, the distributions of switch latencies conditioned on the types of the n most recent trials (for n = 1, 3, and 5) did not differ; that is, the distribution of switch latencies following three short trials did not differ from the distribution following three long trials. This finding implies that the abrupt behavioral shifts resulted from a change-detecting mechanism that parsed the experienced sequence into trials before a change and trials after a change, with the new estimate of pS based only on the relatively few trials between the estimated change point and the trial on which the change was perceived. In arguing for a model-based adjustment algorithm, we mean only that the algorithm operates on estimates of probabilities and change points. Whether it is necessary to assume that brains, even human brains, make such estimates has long been a question of moment in psychology, cognitive science, and neuroscience.
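
A change-detecting parser of this kind can be sketched as a simple Bernoulli change-point detector; the likelihood-ratio test, the evidence threshold, and the minimum segment length below are our illustrative choices, not the authors' algorithm.

```python
import numpy as np

def detect_change(trials, prior_odds=1.0, threshold=20.0):
    """Minimal Bernoulli change-point detector (illustrative only).

    trials is a 0/1 sequence (1 = short trial). For each putative change point
    c, compare the likelihood of a model with one p before c and another p
    after c (maximum-likelihood plug-in) against a single-p model. Once the
    odds exceed a threshold, return the best change point and the new pS
    estimated from the post-change trials only.
    """
    x = np.asarray(trials, dtype=float)
    n = len(x)

    def bernoulli_loglik(seg):
        k, m = seg.sum(), len(seg)
        p = np.clip(k / m, 1e-6, 1 - 1e-6)
        return k * np.log(p) + (m - k) * np.log(1 - p)

    null = bernoulli_loglik(x)
    best_c, best_gain = None, -np.inf
    for c in range(5, n - 5):                     # require a few trials on each side
        g = bernoulli_loglik(x[:c]) + bernoulli_loglik(x[c:]) - null
        if g > best_gain:
            best_c, best_gain = c, g
    if np.exp(best_gain) * prior_odds > threshold:
        return best_c, x[best_c:].mean()          # new pS from post-change trials only
    return None, x.mean()

rng = np.random.default_rng(0)
seq = np.concatenate([rng.random(60) < 0.25, rng.random(15) < 0.75]).astype(int)
print(detect_change(seq))   # locates the change and re-estimates pS from the tail
```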

In a subsequent experiment with six new mice, we made the change approximately 40 trials into a session rather than at the beginning of a new session. We observed similarly abrupt changes. The marginal-likelihood functions from this experiment are in Fig. S3.

We conclude that mice can accurately represent sequential Bernoulli probability (pS), quickly perceive a substantial change in this probability, locate the point in the past at which the change occurred, estimate the new probability on the basis of the postchange trials, represent the uncertainty in their endogenous estimate of time elapsed in a trial, and correctly compute a new approximately optimal target switch latency.

This simple paradigm may be used with genetically manipulated mice, opening up the possibility of using genetic and molecular biological methods to discover the neurobiological mechanisms that make these evolutionarily ancient, life-sustaining computations possible.

Materials and Methods

Protocol.

Subjects lived 24/7 in the environment shown in Fig. 1, obtaining all their food by performing the tasks. For a complete description of this behavior-testing environment and the associated software, see ref. 16. The five mice, of the C57BL/6J strain obtained from Jackson Laboratories, were ∼100 d old at the start of the experiment. In session 1 (1 d), the task was a concurrent variable-interval (VI) matching protocol. In session 2 (1 d), it was a two-latency, two-hopper, autoshaped hopper-entry protocol: steps 1 and 2 were as in Fig. 1, except that at step 2 (trial onset), the light came on in one or the other of the two feeding hoppers and a pellet was delivered at the short or long latency, regardless of the mouse's behavior. This procedure taught the mice the feeding latency associated with each hopper. In the switch task, the interval from the end of a trial to the next illumination of the trial-initiation hopper was exponentially distributed with an expectation of 60 s. The pellets were 20-mg Purina grain-based pellets from WF Fisher and Son, Inc. The sole condition on pellet release was that the first poke after the computer-chosen feeding latency be in the correct hopper (the short hopper on short trials and the long hopper on long trials). Any behavior pattern that met this criterion was reinforced (for example, going directly to the long hopper on a long trial, or switching from the short hopper to the long and then back to the short within 3 s on a short trial). This protocol was approved by the Institutional Animal Care and Use Committee of Rutgers University.

Transition Analysis.

We assume that all four parameters follow the same two-parameter linear transition trajectory (Fig. 3A). Thus, our statistical model for a transition has 10 parameters: 4 for the distribution before the transition (θb), 4 for the distribution after the transition (θa), and the transition parameters S and D (Fig. 3A). We used Metropolis–Hastings Markov chain Monte Carlo methods to sample from the posterior distribution to estimate values for these 10 parameters. The estimates for θb and θa were essentially the same as the maximum-likelihood estimates. To check the validity of the marginal-likelihood functions for S and D obtained from Markov chain Monte Carlo sampling, we also computed them numerically after fixing θb and θa at their maximum-likelihood values.
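
The following is a minimal random-walk Metropolis–Hastings sketch of the kind of sampler involved; the Gaussian proposal, its scale, and the flat priors are our simplifications, and the log posterior referred to in the docstring would be built from a transition likelihood like the one sketched in Results.

```python
import numpy as np

def metropolis_hastings(log_post, x0, n_samples=20000, prop_sd=0.1, seed=0):
    """Random-walk Metropolis-Hastings sampler (illustrative).

    log_post maps a parameter vector to its unnormalized log posterior; for the
    transition model this would combine the transition log-likelihood with
    flat-prior bounds on the 10 parameters (theta_b, theta_a, S, D). Marginal
    posteriors for S and D are then read off as histograms of the corresponding
    coordinates of the samples.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    samples = np.empty((n_samples, len(x)))
    for i in range(n_samples):
        proposal = x + rng.normal(0.0, prop_sd, size=len(x))
        lp_new = log_post(proposal)
        if np.log(rng.random()) < lp_new - lp:   # accept with probability min(1, ratio)
            x, lp = proposal, lp_new
        samples[i] = x
    return samples

# Toy check on a 2-D standard-normal posterior.
draws = metropolis_hastings(lambda v: -0.5 * np.sum(v**2), x0=[3.0, -3.0])
print(draws[5000:].mean(axis=0))   # roughly centered on (0, 0)
```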

Supplementary Material

Supporting Information

Acknowledgments

We thank Joseph Negron for help in running the experiment and Jacob Feldman for valuable comments on an early draft. This research was funded by Grant R01MH077027 from the National Institute of Mental Health, entitled "Cognitive Phenotyping in the Mouse."

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1205131109/-/DCSupplemental.

References

1. Corrado GS, Sugrue LP, Seung HS, Newsome WT. Linear-nonlinear-Poisson models of primate choice dynamics. J Exp Anal Behav. 2005;84:581–617. doi: 10.1901/jeab.2005.23-05.
2. Dayan P, Daw ND. Decision theory, reinforcement learning, and the brain. Cogn Affect Behav Neurosci. 2008;8:429–453. doi: 10.3758/CABN.8.4.429.
3. Schultz W. Behavioral theories and the neurophysiology of reward. Annu Rev Psychol. 2006;57:87–115. doi: 10.1146/annurev.psych.56.091103.070229.
4. Trommershäuser J, Maloney LT, Landy MS. Decision making, movement planning and statistical decision theory. Trends Cogn Sci. 2008;12:291–297. doi: 10.1016/j.tics.2008.04.010.
5. Daw ND, Niv Y, Dayan P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci. 2005;8:1704–1711. doi: 10.1038/nn1560.
6. Stüttgen MC, Yildiz A, Güntürkün O. Adaptive criterion setting in perceptual decision making. J Exp Anal Behav. 2011;96:155–176. doi: 10.1901/jeab.2011.96-155.
7. Fetterman JG, Killeen PR. Categorical scaling of time: Implications for clock-counter models. J Exp Psychol Anim Behav Process. 1995;21:43–63.
8. Balci F, Freestone D, Gallistel CR. Risk assessment in man and mouse. Proc Natl Acad Sci USA. 2009;106:2459–2463. doi: 10.1073/pnas.0812709106.
9. Hull CL. Principles of Behavior. New York: Appleton-Century-Crofts; 1943.
10. Skinner BF. The Behavior of Organisms. New York: Appleton-Century-Crofts; 1938.
11. Skinner BF. Selection by consequences. Science. 1981;213:501–504. doi: 10.1126/science.7244649.
12. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 1998.
13. Rangel A, Camerer C, Montague PR. A framework for studying the neurobiology of value-based decision making. Nat Rev Neurosci. 2008;9:545–556. doi: 10.1038/nrn2357.
14. Trommershäuser J, Maloney LT, Landy MS. Statistical decision theory and the selection of rapid, goal-directed movements. J Opt Soc Am A Opt Image Sci Vis. 2003;20:1419–1433. doi: 10.1364/josaa.20.001419.
15. Behrens TE, Woolrich MW, Walton ME, Rushworth MF. Learning the value of information in an uncertain world. Nat Neurosci. 2007;10:1214–1221. doi: 10.1038/nn1954.
16. Gallistel CR, et al. Screening for learning and memory mutations: A new approach. Acta Psychologica Sinica. 2010;42(1):138–158. doi: 10.3724/SP.J.1041.2010.00138.
