Author manuscript; available in PMC: 2025 Feb 1.
Published in final edited form as: Psychol Methods. 2023 Mar 27;30(1):128–154. doi: 10.1037/met0000554

Troubleshooting Bayesian cognitive models: A tutorial with matstanlib

Beth Baribault 1, Anne GE Collins 1
PMCID: PMC10522800  NIHMSID: NIHMS1856163  PMID: 36972080

Abstract

Using Bayesian methods to apply computational models of cognitive processes, or Bayesian cognitive modeling, is an important new trend in psychological research. The rise of Bayesian cognitive modeling has been accelerated by the introduction of software such as Stan and PyMC3 that efficiently automates the Markov chain Monte Carlo (MCMC) sampling used for Bayesian model fitting. Unfortunately, Bayesian cognitive models can struggle to pass the computational checks required of all Bayesian models. If any failures are left undetected, inferences about cognition based on model output may be biased or incorrect. As such, Bayesian cognitive models almost always require troubleshooting before being used for inference. Here, we present a deep treatment of the diagnostic checks and procedures that are critical for effective troubleshooting, but are often left underspecified by tutorial papers. After a conceptual introduction to Bayesian cognitive modeling and MCMC sampling, we outline the diagnostic metrics, procedures, and plots necessary to identify problems in model output with an emphasis on how these requirements have recently been improved. Throughout, we explain how the most commonly encountered problems may be remedied with specific, practical solutions. We also introduce matstanlib, our MATLAB modeling support library, and demonstrate how it facilitates troubleshooting of an example hierarchical Bayesian model of reinforcement learning implemented in Stan. With this comprehensive guide to techniques for detecting, identifying, and overcoming problems in fitting Bayesian cognitive models, psychologists across subfields can more confidently build and use Bayesian cognitive models.

All code is freely available from github.com/baribault/matstanlib.

Keywords: cognitive modeling, Bayesian methods, computational models


The Bayesian revolution of the last few decades (S. P. Brooks, 2003) has enabled a much larger pool of psychologists than ever before to apply Bayesian methods in their work (van de Schoot et al., 2017). Thanks to tutorial books and papers targeted at psychologists (e.g., Rouder et al., 2009; Kruschke, 2014), it is no longer rare to see Bayesian hypothesis tests and Bayesian linear models reported in psychological research. However, Bayesian data analysis is not the only approach to using Bayesian methods in psychological research: In this paper, we will discuss a different approach, Bayesian cognitive modeling, in which Bayesian methods are used to implement cognitive process models (Lee & Wagenmakers, 2014; not to be confused with Bayesian models of mind1). Process models are increasingly being used (Jarecki et al., 2020) to provide formal, testable accounts of the possible psychological mechanisms underlying observed behavior (Navarro, 2021). Using hierarchical Bayesian methods for cognitive modeling confers many benefits, such as the ability to quantify uncertainty in parameter estimates while simultaneously accounting for individual differences and other meaningful structures directly in a model (Lee, 2011). Bayesian cognitive modeling is a principled and coherent approach to quantitative evaluation of psychological theory.

While using Bayesian methods for cognitive modeling had long been the province of mathematical psychologists, as it required comfort with mathematical statistics and statistical programming (Gilks et al., 1995; Gelman et al., 2013; for an example, see Rouder & Lu, 2005), this has changed with the maturation of software that automates the Markov chain Monte Carlo (MCMC) methods used for Bayesian model fitting2 (such as JAGS, Plummer, 2003; and Stan, Gelman et al., 2015) and software that likewise automates Bayesian model specification (for linear models, Bürkner, 2017; and for select cognitive models, Ahn et al., 2017). These developments have made Bayesian cognitive modeling newly accessible to psychologists in many subfields, including cognitive psychologists (e.g., Donkin et al., 2016; Navarro et al., 2016; Annis & Palmeri, 2019), cognitive neuroscientists (e.g., Wiecki et al., 2013; Blanchard & Gershman, 2018; Nunez et al., 2019; Peters & D’Esposito, 2020), clinical psychologists (e.g., Matzke et al., 2017; Haines et al., 2020; Addicott et al., 2021; Brown et al., 2021), and social psychologists (e.g., Hütter & Klauer, 2016; Johnson et al., 2018; Golubickis et al., 2018).

However, while linear statistical models (e.g., multilevel regression models) can be relatively easily implemented in a Bayesian framework, Bayesian implementations of cognitive models tend to require more careful testing and tweaking before they may be confidently applied to data. This is because most Bayesian cognitive models have characteristics that are known to pose challenges for Bayesian model fitting. In order to quantitatively express cognitive mechanisms, Bayesian cognitive models often require complicated, nonlinear likelihood functions, and to incorporate relevant domain knowledge, non-conjugate priors over restricted domains are often used. Correlations among model parameters are closer to the rule than the exception (e.g., Krefeld-Schwalb et al., 2021). Hierarchical model structures are common, if not universally encouraged (Boehm et al., 2018), as they allow group- and individual-level effects to be accounted for simultaneously (Lee, 2011; Scheibehenne & Pachur, 2015). These features all tend to produce posterior geometries that are challenging for MCMC algorithms to navigate, and therefore heighten the risk of computational failures. If active steps are not taken to conduct computational checks for such failures (as well as consistency checks of other model assumptions, including parameter recoverability), then any inference based on the model output risks being fundamentally flawed.

As such, the ability to detect, diagnose, and remedy problems (via procedures which we collectively call troubleshooting) is essential for practitioners of Bayesian cognitive modeling. Unfortunately, troubleshooting seems to be a blind spot in the didactic literature on Bayesian methods aimed at cognitive scientists. Of these past tutorial papers and books, those that introduce the core concepts of Bayesian data analysis often do not cover Bayesian cognitive modeling (e.g., Etz & Vandekerckhove, 2018; Kruschke, 2014). Those that focus on Bayesian cognitive model design (Lee, 2008; Lee & Wagenmakers, 2014; Matzke et al., 2018; Shiffrin et al., 2008) and development (Annis & Palmeri, 2018; Greene & Rhodes, 2020; Lee & Vanpaemel, 2018; Rouder & Lu, 2005; Schad et al., 2021; Shiffrin et al., 2008; Vanpaemel, 2010) tend to underspecify the model-checking steps required before a model may be used for inference.3 This is complicated by the fact that model-checking requirements have recently been improved, such that failure modes that researchers were not previously able to detect may now be reliably exposed.

Specifically, recent advances in Bayesian statistical practice have amended and broadened the suite of diagnostic checks of Bayesian model output that are deemed necessary. Consider for a moment that the most familiar convergence diagnostic, R̂, has been required to be ≤ 1.1 since the 1990s (Gelman & Rubin, 1992; Gelman et al., 1995). In just the past couple of years, the computation of R̂ has been made markedly more conservative, and now R̂ values must meet the far more stringent criterion of being ≤ 1.01 (Gelman et al., 2020; Vehtari et al., 2021). As we will discuss in detail here, the work of Vehtari and colleagues has mandated other significant changes, the collective effect of which is that some previously passing model output will now fail convergence checks. Other changes are a result of the collective shift away from the previously preferred MCMC method, Gibbs sampling (via JAGS, Plummer, 2003), toward a newer, more powerful, and more efficient method, Hamiltonian Monte Carlo (via Stan, Gelman et al., 2015; and PyMC3, Salvatier et al., 2016). The specific flavors of Hamiltonian Monte Carlo implemented by Stan and PyMC3 (e.g., Hoffman & Gelman, 2014) require that multiple additional diagnostic quantities are checked as a matter of course (e.g., BFMI; Betancourt, 2016). These notable changes to the core practices of Bayesian model fitting are still somewhat unfamiliar in the Bayesian cognitive modeling literature.

As such, the primary purpose of this tutorial is to present a current, thorough treatment of the computational and model consistency checks required for proper use of Bayesian cognitive models (see Figure 1), including clear guidance on what to do when model output fails one or more checks. Because some of the topics we cover here can seem arcane to psychological researchers who are not also Bayesian statistical researchers, we have taken care to provide conceptual explanations of all procedures, and to make connections to principles from cognitive science, if not actively demonstrating by example, where possible. To this end, we begin with an overview of the Bayesian cognitive modeling approach so as to build the conceptual groundwork that is prerequisite for successful troubleshooting. Then, we explain how to check for computational and other problems using convergence diagnostics, diagnostic plots, and consistency checks, while offering remedies for simpler issues along the way. Next, we explain how thornier issues related to parameterization and posterior geometry can be elucidated through the use of diagnostic plots and other visualizations. Throughout, we offer guidance on how to better utilize and triage among the many techniques for identifying the exact nature of the problem, as well as techniques for selecting appropriate solutions. Finally, we review more use case-dependent methods, such as posterior predictive checks, that test how capable and useful a model is (or is not) for a given research application.

Figure 1.

An abbreviated representation of the Bayesian cognitive modeling workflow that emphasizes those steps most relevant to troubleshooting. Model output that does not pass through the filter (representing the requisite computational and consistency checks) must be rejected. A troubleshooting process should be used to improve the model design such that the output might ultimately pass through all checks. Then, and only then, may the Bayesian cognitive model output be used as the basis for inference. (Note that the model-checking techniques listed in the figure need not be performed in the order in which they appear; prior predictive checks, for example, would ideally be performed before model fitting; see Gelman et al., 2020 for an exhaustive, ordered list.)

Many of the troubleshooting procedures that we recommend here rely on the visualization of Bayesian cognitive model output (as is recommended for all Bayesian workflows; Gabry et al., 2019). While excellent tools for visualization exist in other languages, including bayesplot in R (Gabry & Mahr, 2021) and ArviZ in Python (Kumar et al., 2019), few resources exist for MATLAB users. We created matstanlib, a library of MATLAB code for processing, analysis, and visualization of output from Bayesian models, to put MATLAB-based psychologists on an equal footing. Therefore, a second purpose of this paper is to show how matstanlib, which is freely available on GitHub (https://github.com/baribault/matstanlib), supports Bayesian cognitive model-based research by facilitating the techniques required by current best practices. Many example models, including the hierarchical Bayesian reinforcement learning model that we use for most of our visual demonstrations, are also available as part of matstanlib.

Nonetheless, it is our intent that this paper will serve as a general reference for how to detect, diagnose, and correct frequently-encountered problems with Bayesian cognitive models, regardless of one’s preferred sampling software or programming language. While we reference matstanlib functions most frequently, similar functionality is available in both R and Python (see the Appendix for a conversion chart), and could be coded from scratch in any language. Ultimately, it is the principles and processes of troubleshooting that we seek to explicate here, and the same core approach applies to all use of Bayesian methods for cognitive modeling.

Bayesian cognitive modeling

We begin with a review of the Bayesian cognitive modeling approach, with an emphasis on the specific Bayesian methods and packages that are currently most commonly used for Bayesian cognitive model fitting. While we assume a general familiarity with Bayesian principles (see Etz et al., 2018 for a first introduction, or Gelman et al., 2013 for a deeper treatment) and computational modeling (e.g., Farrell & Lewandowsky, 2018), we include this overview to establish conceptual ideas and terminology that we rely on throughout the tutorial.

The Bayesian framework

All Bayesian analysis derives from Bayes’ theorem:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta}$$

and Bayesian cognitive modeling, of course, is no exception. Bayes’ theorem tells us how the prior probability, p(θ), of an unobserved parameter or set of parameters, θ, and the likelihood, p(x|θ), of the observed data, x, may be used to derive the posterior probability, p(θ|x), of the parameters in light of the data.
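To make the roles of the prior, likelihood, and normalizing integral concrete, the following minimal sketch evaluates Bayes' theorem on a grid for a deliberately simple one-parameter case. The Bernoulli rate model, the Uniform(0, 1) prior, the data (14 successes in 20 trials), and the function name `grid_posterior` are all illustrative assumptions and are not part of matstanlib or the reinforcement learning example used in this paper.

```python
def grid_posterior(successes, trials, grid_size=1001):
    """Approximate p(theta | x) on an evenly spaced grid over [0, 1]."""
    width = 1.0 / (grid_size - 1)
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # Uniform(0, 1) prior density, p(theta)
    # Bernoulli likelihood p(x | theta), up to a constant that cancels below
    likelihood = [t ** successes * (1.0 - t) ** (trials - successes) for t in thetas]
    numerator = [lik * pri for lik, pri in zip(likelihood, prior)]
    evidence = sum(numerator) * width  # Riemann sum standing in for the integral
    return thetas, [value / evidence for value in numerator], width


# Hypothetical data: 14 successes in 20 trials
thetas, posterior, width = grid_posterior(successes=14, trials=20)
posterior_mean = sum(t * d for t, d in zip(thetas, posterior)) * width
posterior_mass = sum(posterior) * width  # should integrate to 1
```

For this toy case the exact posterior is Beta(15, 7), whose mean is 15/22, so the grid result can be checked analytically. Grid evaluation becomes impractical as the number of parameters grows, which is one reason sampling methods are used for realistic cognitive models.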

In a Bayesian cognitive model, the likelihood is specifically used to express the cognitive process or mechanism theorized to have produced the behavioral data. Cognitive model likelihoods are typically nonlinear and may be rather complex, as in an evidence accumulator model of response times (Ratcliff & McKoon, 2008; Vandekerckhove et al., 2008), a cumulative prospect theory-based model of decisions (Nilsson et al., 2011), or a reinforcement learning model of action selection (Dearden et al., 1998). This assumed distribution of the data is conditional on the unobserved parameters, which in a cognitive model will have meaningful psychological interpretations as they are intended to capture one aspect or shape one dynamic of the cognitive process expressed in the model. For example, in the respective aforementioned models, we interpret ν as the speed of evidence accumulation, λ as the relative weighting of losses and gains, and α as the learning rate. Bayesian analysis requires that each parameter has an associated prior distribution, which should be defined over all conceivably possible values. In a cognitive model, these priors are often seen as an opportunity to incorporate domain knowledge relevant to each of the parameterized cognitive dynamics. In mature subfields, this knowledge can be considerable, especially when a particular model has long been used successfully. (For a worked example of how to use domain expertise to support prior elicitation, see Vanpaemel, 2010.)

Modern Bayesian cognitive models are nearly always hierarchical, in that they include additional model structure to simultaneously account for data from multiple participants, groups, conditions, tasks, and so on. At a minimum, Bayesian cognitive models should include a hierarchical extension of the model over participants, such that each participant is allowed a unique set of parameter values (e.g., in a reinforcement learning model, their own learning rate, α), but all instances of each parameter (e.g., all participants’ α parameters) are assumed to be drawn from a common group-level hyperprior distribution. (In other words, most modern Bayesian cognitive models are multilevel models.) This confers the dual benefits of sharing informative power, which is helpful in small-data situations, and regularizing parameter estimates across participants, which engenders more reliable estimates (Katahira, 2016; Lee, 2011). Other hierarchies may be included such that the dependencies among the behavioral data and the parameters suit the experimental context, among other needs (Lee, 2011; Katahira, 2016; Scheibehenne & Pachur, 2015).

Taken together, the likelihood and all priors comprise a Bayesian model specification. Because deriving the exact posterior analytically is feasible only for the simplest of models, virtually all Bayesian cognitive model fitting relies on MCMC sampling methods to approximate the joint posterior distribution. MCMC algorithms allow for samples to be drawn from the joint posterior in such a way that each possible value of a given parameter should be drawn with a probability proportional to its posterior density:

$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$$

If an infinite number of samples were to be collected, one would be guaranteed to recover the true posterior (among other mathematical guarantees; S. Brooks et al., 2011; Gelman et al., 2013). The finite number of posterior samples collected in practice serves to approximate the true posterior in the same way that one might use a histogram of collected scores as an approximation of the true distribution of scores in the population.
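The idea that draws made in proportion to the unnormalized posterior can stand in for the posterior itself may be easier to see in code. The sketch below uses a random-walk Metropolis sampler, which is a much simpler algorithm than the HMC/NUTS sampler run by Stan and is shown here only because it fits in a few lines. The Bernoulli rate model, the hypothetical data (14 successes in 20 trials), the step size, and all function names are assumptions made for illustration.

```python
import math
import random


def log_unnormalized_posterior(theta, successes=14, trials=20):
    """log p(x | theta) + log p(theta) for a Bernoulli rate with a Uniform(0, 1) prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero prior density outside (0, 1)
    return successes * math.log(theta) + (trials - successes) * math.log(1.0 - theta)


def metropolis_chain(n_iterations, step_size=0.2, seed=0):
    """Random-walk Metropolis (NOT HMC/NUTS); returns one chain of draws, in order."""
    rng = random.Random(seed)
    theta = rng.uniform(0.05, 0.95)  # random initial position
    current = log_unnormalized_posterior(theta)
    chain = []
    for _ in range(n_iterations):
        proposal = theta + rng.gauss(0.0, step_size)
        proposed = log_unnormalized_posterior(proposal)
        # Accept with probability min(1, posterior ratio); 1 - random() lies in (0, 1]
        if math.log(1.0 - rng.random()) < proposed - current:
            theta, current = proposal, proposed
        chain.append(theta)
    return chain


chain = metropolis_chain(n_iterations=6000)
kept = chain[1000:]  # discard early draws that may still reflect the starting value
posterior_mean = sum(kept) / len(kept)
```

Only the ratio of posterior densities enters the accept step, so the normalizing integral in Bayes' theorem is never computed. With enough kept draws, summaries such as `posterior_mean` should approach the analytic value for this toy case (15/22), in the same way a histogram approaches a population distribution.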

An example in Stan

As an example of a hierarchical Bayesian cognitive model, we consider a Bayesian implementation of a reinforcement learning (RL) model (and reference it throughout the paper to help explain multiple troubleshooting techniques). In matstanlib, we include a script, example_RL.m, that will specify, apply, and troubleshoot this model. In the example script, the model is applied to data simulated from an experimental design in which each “participant” completes four blocks of a probabilistic three-armed bandit problem. In each block, the simulated participant sees the bandit stimuli 20 times and must learn from the results of their choices, over time, which arm is most likely to give a point reward (see Figure 2a). Across the trials within each block, the participants learn which bandit has the highest reward probability. If this choice is considered correct, then we will expect accuracy to start at chance (1/3), rise, and asymptote in a classic learning curve (as seen in Figure 2b).

Figure 2.

Most of the other figures in this paper present output from our hierarchical Bayesian implementation of a classic delta-rule reinforcement learning model (detailed in text). When behavioral data for a probabilistic 3-armed bandit task (a) were simulated according to the model specification, characteristic learning curves (b) are seen at the group level (thick black line) and in the simulated data for individual subjects (lighter gray lines).

Our reinforcement learning model (Sutton & Barto, 2018) captures participants’ learning by allowing for each stimulus to be assigned a separate value. At the start of each block, we assume that every participant begins by expecting each stimulus to have some starting value, e.g., Q0 = [0, 0, 0]. Then, on every trial, t, the values, Q, are scaled by an inverse temperature parameter, β, and run through a softmax function to determine the probability of selecting each stimulus, π. The participant then makes a choice, c, according to those probabilities:

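$$\begin{aligned}
\pi_t^i &= \frac{e^{\beta Q_t^i}}{\sum_{j=1}^{3} e^{\beta Q_t^j}} \\
c_t &\sim \mathrm{Categorical}(\pi_t)
\end{aligned}$$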

The difference between the reward resulting from that choice, r, and the current value of the chosen stimulus constitutes a prediction error, δ. The prediction error is used to update the chosen stimulus’ value according to a learning rate, α:

$$\begin{aligned}
\delta_t &= r_t - Q_t(c_t) \\
Q_t(c_t) &= Q_t(c_t) + \alpha \delta_t
\end{aligned}$$

Finally, to capture how participants may forget over time, all Q-values are subject to decay with rate ϕ before the next trial begins:

$$Q_{t+1} = Q_t + \phi (Q_0 - Q_t)$$
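The action selection, learning, and forgetting steps described above can be sketched as a short simulation of a single block for a single participant. This is a minimal illustration and not the code in example_RL.m: the function names, the reward probabilities, and the parameter values passed in the usage lines are hypothetical, since none of them are specified in the text.

```python
import math
import random


def softmax(values, beta):
    """Choice probabilities from Q-values scaled by the inverse temperature beta."""
    highest = max(values)
    # Subtracting the maximum leaves the probabilities unchanged but avoids overflow
    weights = [math.exp(beta * (v - highest)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_block(alpha, beta, phi, reward_probs, n_trials=20, seed=1):
    """Simulate choices and rewards for one block of the bandit task."""
    rng = random.Random(seed)
    q0 = [0.0] * len(reward_probs)  # starting values, Q0
    q = list(q0)
    choices, rewards = [], []
    for _ in range(n_trials):
        pi = softmax(q, beta)
        choice = rng.choices(range(len(q)), weights=pi)[0]  # c_t ~ Categorical(pi_t)
        reward = 1.0 if rng.random() < reward_probs[choice] else 0.0
        delta = reward - q[choice]             # prediction error
        q[choice] = q[choice] + alpha * delta  # learning: update the chosen value only
        q = [qi + phi * (q0i - qi) for qi, q0i in zip(q, q0)]  # forgetting: decay toward Q0
        choices.append(choice)
        rewards.append(reward)
    return choices, rewards, q


# Hypothetical parameter values and reward probabilities, for illustration only
choices, rewards, final_q = simulate_block(
    alpha=0.3, beta=5.0, phi=0.05, reward_probs=[0.8, 0.2, 0.2]
)
initial_probs = softmax([0.0, 0.0, 0.0], beta=5.0)  # equal values give chance-level choice
```

Because all starting values are equal, the first choice is made at chance, which is the starting point of the learning curve described for Figure 2b.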

Together, the action selection, learning, and forgetting mechanisms describe the cognitive process of learning over time. As the exact dynamics of the process will be unique to each individual, each participant p is allowed to have a different β, α, and ϕ. However, we also assume that all participants come from a group4 that shares common cognitive processes. To express this knowledge in the model, we incorporate a hierarchy over participants, and we set participant-level priors for each parameter:

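$$\begin{aligned}
\beta_p &\sim \mathrm{HalfNormal}(\mu_\beta, \sigma_\beta) \\
\alpha_p &\sim \mathrm{Normal}(\mu_\alpha, \sigma_\alpha)_{\mathcal{T}[0,1]} \\
\phi_p &\sim \mathrm{Normal}(\mu_\phi, \sigma_\phi)_{\mathcal{T}[0,1]}
\end{aligned}$$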

that are dependent on group-level hyperparameters, with associated hyperpriors:

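$$\begin{aligned}
\mu_\beta &\sim \mathrm{HalfNormal}(10, 5) & \sigma_\beta &\sim \mathrm{HalfNormal}(0, 5) \\
\mu_\alpha &\sim \mathrm{Uniform}(0, 1) & \sigma_\alpha &\sim \mathrm{Normal}(0, 0.5)_{\mathcal{T}[0,1]} \\
\mu_\phi &\sim \mathrm{Uniform}(0, 1) & \sigma_\phi &\sim \mathrm{Normal}(0, 0.5)_{\mathcal{T}[0,1]}
\end{aligned}$$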

(The subscript 𝒯 indicates a truncation of the distribution to between the specified bounds or bound.)
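One way to build intuition for this hierarchy is to draw from it top down: first the group-level hyperparameters, then each participant's parameters conditional on them. The sketch below does this with a simple rejection sampler for the truncated distributions. It is an illustration under stated assumptions rather than the matstanlib implementation: the function names are hypothetical, and any HalfNormal with a nonzero location is read here as a normal distribution with that location truncated to positive values, which is one possible interpretation of the notation.

```python
import random


def truncated_normal(rng, mu, sigma, lower, upper, max_tries=100000):
    """Redraw from Normal(mu, sigma) until the draw lands in [lower, upper]."""
    for _ in range(max_tries):
        draw = rng.gauss(mu, sigma)
        if lower <= draw <= upper:
            return draw
    raise RuntimeError("too little mass inside the truncation bounds")


def draw_from_priors(n_participants, seed=2):
    """Draw group-level hyperparameters, then participant-level parameters."""
    rng = random.Random(seed)
    positive = (0.0, float("inf"))
    group = {
        # ASSUMPTION: HalfNormal(10, 5) read as Normal(10, 5) truncated to positive values
        "mu_beta": truncated_normal(rng, 10.0, 5.0, *positive),
        "sigma_beta": abs(rng.gauss(0.0, 5.0)),  # HalfNormal(0, 5)
        "mu_alpha": rng.uniform(0.0, 1.0),
        "sigma_alpha": truncated_normal(rng, 0.0, 0.5, 0.0, 1.0),
        "mu_phi": rng.uniform(0.0, 1.0),
        "sigma_phi": truncated_normal(rng, 0.0, 0.5, 0.0, 1.0),
    }
    participants = []
    for _ in range(n_participants):
        participants.append({
            # ASSUMPTION: same truncated-normal reading of HalfNormal(mu_beta, sigma_beta)
            "beta": truncated_normal(rng, group["mu_beta"], group["sigma_beta"], *positive),
            "alpha": truncated_normal(rng, group["mu_alpha"], group["sigma_alpha"], 0.0, 1.0),
            "phi": truncated_normal(rng, group["mu_phi"], group["sigma_phi"], 0.0, 1.0),
        })
    return group, participants


group, participants = draw_from_priors(n_participants=10)
```

Rejection sampling is adequate here because each untruncated distribution is centered inside or on the edge of its bounds, so a reasonable share of draws is accepted. Draws of this kind can also serve as randomly generated true values for a simulation study.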

We selected this simple delta-rule learning model because it has many of the features common to Bayesian cognitive models that are known to pose problems for MCMC algorithms, such as the aforementioned nonlinear and complicated likelihood and the hierarchical model structure. In addition, its parameters require restricted ranges (α, ϕ ∈ [0, 1], β ∈ ℝ⁺, exclusive of 0), are decidedly not normally-distributed (e.g., the empirical distribution for β is positively skewed), and are well known to be correlated (in some reinforcement learning experiments and models, though certainly not all). As such, this popular model presents good opportunities for demonstrating the principles of troubleshooting.

In fact, while the model specification above may appear sufficient at first glance, it will reliably fail most of the required computational and consistency checks (as described in the next section). These failures are intentional: Over the course of this tutorial, most of the figures we present will include flawed output and questionable diagnostic plots that mirror those we generated in hopes of uncovering a way to improve the above model. By including real model output in all figures, we hope to demonstrate how each and every failed check can offer crucial help in identifying the nature of the problem with the model specification above and may, in turn, lead us to discover ways to correct the model.

In the examples folder of matstanlib, the example_RL.m script will run both this initial, flawed version of the model and a final, improved version of the model. For readers who are not MATLAB users but wish to follow along, standalone files containing Stan code for the model specification only are also included in examples (as RL_broken.stan and RL_fixed.stan, respectively). This script demonstrates how matstanlib functions may be used to extract the model output, run diagnostic checks, and visualize each RL model’s output. Running this script and another example script, example_funnel.m, will collectively reproduce most of the figure panels in the remainder of this paper (all of which present real Bayesian cognitive model output).

A brief introduction to sampling algorithms

By fitting or running the model, we specifically mean using software that, given a model specification and data, automates running of an MCMC algorithm to estimate the joint posterior of the model. In order to ensure that the MCMC output does not violate any assumptions of MCMC, and offers enough power for the planned inference, various computational checks of the MCMC output are required (as discussed in detail in Detecting problems, below). It is important to build intuition for how the posterior samples are generated in order to understand these sampler diagnostics, as they will motivate most of the troubleshooting techniques discussed here. For the purposes of this paper, our discussion of MCMC sampling will implicitly assume the use of a dynamic variant of Hamiltonian Monte Carlo (HMC) such as the No-U-Turn Sampler (NUTS), as this is the most commonly implemented state-of-the-art algorithm in currently available sampling software such as Stan and PyMC3 (but see Van Ravenzwaaij et al., 2018, for an accessible introduction to Metropolis/Gibbs sampling that emphasizes many core principles of MCMC techniques). Hereafter, we will use a composite acronym, HMC/NUTS.

With a Bayesian cognitive model specification and dataset in hand, the MCMC sampler is initialized at a random point in the parameter space, as defined by the model. This place, and every subsequent place the sampler visits in the parameter space, is recorded as a sample from the joint posterior. From the random initial position, the sampling algorithm is used to compute a trajectory within the parameter space from the current position to the next, and again from there to another position, and so on, until a pre-specified number of joint posterior samples have been collected. These samples, in order, are called a chain. Because the first few samples or iterations in the chain will usually be more representative of the initializing value than of the true posterior (called the target distribution), the first handful or more of iterations in a chain are discarded. In modern sampling software, this warmup period (or burn-in, in older sources) is also used for adaptation of the sampler itself. For example, the approximate size of the step taken in the joint parameter space with each successive iteration is a tuning parameter that is adjusted during warmup (for a review of HMC/NUTS sampler dynamics, see Betancourt, 2017).

In practice, multiple chains are run simultaneously. Doing so saves precious time and, more importantly, without multiple chains we cannot perform some of the computational checks required to assess the quality of the sampling (Gelman & Rubin, 1991). A good expectation for Bayesian cognitive model applications is to collect at least 2000 total iterations for each of four chains, with at least the first 500 apportioned for warmup, and the remaining 1500 kept and used for inference.5 Ultimately, the warmup period should be long enough that the chains have converged (or agreed) on a stationary distribution, and the subsequent period of collecting kept iterations should be sufficient both to pass the computational checks and to support all planned uses of the marginal posteriors for inference. However, when one is first beginning to work with a model, we recommend first collecting only 50 warmup and 50 kept iterations, just to ensure the model runs. Then, we recommend observing whether shorter runs of the model, such as 150 warmup and 500 kept iterations, might reveal problems. These abbreviated runs will allow one to begin the iterative process of troubleshooting (which requires repeated model runs) without spending as much time waiting for failures (Gelman et al., 2020). A high-level view of the Bayesian cognitive modeling approach that emphasizes the iterative nature of model testing and troubleshooting is presented in Figure 1.

Sampler output

To run the example model or any other Bayesian cognitive model, we submit the model specification and the data — whether experimentally collected or simulated — to the sampling software via an interface specific to our programming environment. In our example, we use Stan to perform the MCMC sampling with the dynamic HMC/NUTS algorithm, and the MATLAB interface to Stan handles returning the Stan output to the workspace for us to check and analyze. This process is collectively called running the model.

Specifically, the output will be a collection of samples from the marginal posterior distribution for every parameter of the model, and various diagnostic quantities for the MCMC computation used to generate each iteration in each chain. (Be mindful that which chain a sample was collected in, and the order in which the samples were collected within each chain, are crucially important pieces of information. As such, one should never engage a ‘permute’ option during sample extraction: it will prevent many model-checking diagnostics, and as such, will make your output unusable.) For many Stan interfaces, the posterior samples for a scalar parameter should be returned as a matrix of size [N M], where M is the number of chains and N is the number of iterations within each chain (excluding warmup iterations, which are discarded). Samples for non-scalar parameters (meaning parameters with multiple instances) will have additional indices for the array dimensions, A, and parameter dimensions, P: [N M (A) (P)].
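The reason permuting is so damaging can be shown with a small, fabricated example. In the sketch below, draws for one scalar parameter are stored in the [N M] layout described above, with one chain deliberately centered on a different value than the other three. The draws are hypothetical random numbers generated for illustration, not output from any sampler, and the function name is likewise an assumption.

```python
import random

n_iterations, n_chains = 500, 4
rng = random.Random(3)

# Hypothetical draws: the fourth chain sits around a different value than the others.
chain_centers = [0.0, 0.0, 0.0, 3.0]
samples = [
    [rng.gauss(chain_centers[m], 1.0) for m in range(n_chains)]
    for _ in range(n_iterations)
]  # samples[n][m] holds iteration n of chain m, i.e., an [N M] layout


def chain_means(draws):
    """Mean of each column (chain) of an [N M] nested list."""
    n_rows, n_cols = len(draws), len(draws[0])
    return [sum(row[m] for row in draws) / n_rows for m in range(n_cols)]


# Permuting pools every draw and deals them back out, erasing chain identity and order.
pooled = [value for row in samples for value in row]
rng.shuffle(pooled)
permuted = [pooled[i * n_chains:(i + 1) * n_chains] for i in range(n_iterations)]

ordered_spread = max(chain_means(samples)) - min(chain_means(samples))
permuted_spread = max(chain_means(permuted)) - min(chain_means(permuted))
```

With chain identity intact, the disagreement between chains is obvious in `ordered_spread`. After permutation, every column is a mixture of all four chains, so `permuted_spread` is small and the same draws would look as though the chains agree.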

Unfortunately for MATLAB users, neither the recognized MATLAB interface to Stan, MatlabStan (https://github.com/brian-lau/MatlabStan), nor the alternative interface, Trinity (https://github.com/joachimvandekerckhove/trinity), returns samples in this format. It can be especially laborious to extract samples from structures within the StanFit object generated by MatlabStan, as this needs to be done parameter instance by parameter instance. The matstanlib function, extractsamples.m, is designed to automate this process.

[samples,diagnostics] = extractsamples('MatlabStan', fit)

This extracts samples for all parameters from a StanFit object and reorganizes them in a structure that is consistent with the format described above. Posterior samples and diagnostic quantities are returned in separate structures, and for both, matstanlib ensures that chain identity and the iteration order is faithfully maintained.

(For those who are not MATLAB users, we include a table of analogous commands in R and Python for these and all other matstanlib functions referenced in the paper in the Appendix.)

Detecting problems

After the Bayesian cognitive model finishes running and posterior samples and MCMC diagnostics have been extracted, we first must check whether we can detect any problems in the output, and if so, we must initiate a troubleshooting process to identify a potential underlying cause. For a Bayesian cognitive model, these checks will likely be performed many times. At an absolute minimum, they should be performed twice: first, after the model is run with simulated data, and again after applying the model to experimentally-collected data. The model will tend to be run many more times as the troubleshooting process is iterative: Each time a check fails, one should tweak the model setup and run the model again.

In the simulation study, data should be simulated according to the data distribution in the model specification, using known or true parameter values. While these values may be hand-selected, it is better to randomly generate the true values directly from the prior distributions, and to let new values be drawn each time the simulation study script is called. Simulation studies with randomly-selected true values are a more robust way to check model performance because, over repeated runs, one will gain a sense of the model’s behavior across the parameter space.6 Once troubleshooting is complete for the simulation study, we can progress to troubleshooting the model’s performance with experimentally-collected data (if needed).
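The logic of a simulation study with randomly generated true values can be sketched for a toy one-parameter model. In this hedged example, the Bernoulli rate model, the Uniform(0, 1) prior, the number of trials, and the function names are all assumptions chosen for brevity. The conjugate posterior mean is used as a stand-in for the estimate that a full MCMC fit would provide, so no sampler is run.

```python
import random


def recovery_study(n_simulations=200, n_trials=50, seed=4):
    """Draw true values from the prior, simulate data, and estimate each value."""
    rng = random.Random(seed)  # pass seed=None to get new true values on every call
    true_values, estimates = [], []
    for _ in range(n_simulations):
        theta = rng.random()  # true value drawn from the Uniform(0, 1) prior
        successes = sum(rng.random() < theta for _ in range(n_trials))
        # Stand-in for an MCMC fit: the exact posterior is Beta(1 + k, 1 + n - k),
        # so its mean is available in closed form.
        estimates.append((successes + 1) / (n_trials + 2))
        true_values.append(theta)
    return true_values, estimates


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


true_values, estimates = recovery_study()
recovery_r = pearson_r(true_values, estimates)
```

Because the true values span the whole prior range, a high correlation here indicates good recovery across the parameter space rather than at a single hand-picked point. For a real cognitive model, the closed-form estimate would be replaced by a posterior summary from each fitted simulation.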

In both applications, problems can be detected using the recommended suite of computational checks and consistency checks to probe whether the MCMC sampling and the model itself respectively have functioned as intended. The currently required MCMC diagnostics (which we discuss in detail below) are R̂, divergences, BFMI, and ESS. Other quantities, such as k̂ (which is computed as part of Bayesian cross-validation; Vehtari et al., 2017), are diagnostically useful, but as they are not currently recommended default diagnostics (and require a deeper technical understanding) we consider them to be beyond the scope of the present paper. The assessments that we call consistency checks are a necessary complement to the computational checks, as they are designed to target whether a model is behaving consistently with one’s expectations and domain knowledge, or whether the model may be misspecified. All of these core diagnostics have associated visualizations, which we call diagnostic plots.

To run the required computational diagnostics in MATLAB, two functions from matstanlib are needed: mcmctable.m and interpretdiagnostics.m. First, we compute those diagnostics that are based on the posterior samples (along with some basic summary statistics):

posteriorTable = mcmctable(samples);

Then, an automated assessment of all MCMC diagnostics may be printed at the command line:

interpretdiagnostics(diagnostics,posteriorTable)

If a structure of true parameter values is in the workspace, then it is easy to generate the parameter recovery plots required for the first of the model consistency checks:

plotrecovery(samples,trueValues)

The second consistency check requires custom specification to perform meaningfully, as we explain toward the end of this section.

Each of these diagnostics presents a different opportunity for troubleshooting of a Bayesian cognitive model, as each is geared toward the detection of different types of problems. It is intuitive to see these diagnostic checks in terms of the questions they are most helpful in answering. These questions are:

  1. Is there any evidence that the chains disagree about any of the marginal posteriors?

  2. Is there any evidence that the posterior distribution was not fully explored?

  3. Is there any evidence that sampling was not efficient enough to support a good posterior approximation?

  4. Is the model failing to generate coherent parameter estimates?

  5. Is the data the model expects to encounter unreasonable or otherwise inconsistent with my domain expertise?

If the diagnostics suggest that the answer to any of these questions is “yes”, then there is a problem with the model setup that absolutely must be corrected. By understanding what each diagnostic is designed to assess, and what information each diagnostic plot is designed to express, one can identify the problem and, accordingly, a solution.

Computational checks

Convergence and divergence

The most familiar MCMC diagnostic is R^ (which is sometimes called the “Gelman-Rubin statistic” in older sources; Gelman & Rubin, 1992). If all MCMC chains have converged on the target distribution, then the chains should agree so strongly that they appear identical, in which case R^ will be close to 1. For each parameter, the chains should specifically agree with respect to both the location and spread of the marginal posterior distribution. Furthermore, there should be no remaining influence of any chain’s starting value, and over the full range of the kept iterations, all chains should appear stationary. In this ideal case, R^ will be exactly 1; a value of R^ that is meaningfully greater than 1 suggests that the chains have failed to converge.

Understanding how R^ is computed can build intuition for what kinds of problems can be detected by high R^ (which we review below). After splitting the chains (such that the first and second half might temporarily be considered as separate chains), the R^ computation essentially compares the between-chain variance to the within-chain variance. R^ will be high if the chains are not mixing (meaning failing to sample from similar ranges of values), or if any of the chains are not stationary (meaning that a notably different range of values is sampled over time), as in both cases the between-chain variance will be disproportionately high (Gelman et al., 2013). To make R^ more stable, the current formulation of R^ is the greater of R^ for the raw samples, and R^ for the z-scored samples. This enables R^ to be sensitive to some patterns that were not able to be detected by previous implementations (Vehtari et al., 2021). (The improved ability of R^ to detect convergence failures is also why the criterion has been lowered.)
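The split-chain comparison of between-chain and within-chain variance can be sketched in a few lines of base MATLAB. This is a minimal illustration of the classic split-R^ only, assuming toy chains stored as an [iterations × chains] matrix; it omits the additional transformations used by current implementations and is not the code used by matstanlib.

```matlab
% Split-Rhat sketch for a single parameter (toy chains, no rank
% normalization). Columns are chains, rows are kept iterations.
rng(1);
nIter = 1000; nChains = 4;
chainsGood = randn(nIter,nChains);          % chains that agree
chainsBad  = chainsGood;
chainsBad(:,1) = chainsBad(:,1) + 3;        % chain 1 sits in a different location

halfLen  = floor(nIter/2);
examples = {chainsGood, chainsBad};
rhat = zeros(1,2);
for k = 1:2
    x = examples{k};
    % treat the first and second half of each chain as separate chains
    halves = [x(1:halfLen,:), x(halfLen+1:2*halfLen,:)];
    W = mean(var(halves,0,1));              % within-chain variance
    B = halfLen*var(mean(halves,1));        % between-chain variance
    varPlus = (halfLen-1)/halfLen*W + B/halfLen;
    rhat(k) = sqrt(varPlus/W);
end
```

Here `rhat(1)` stays very close to 1, whereas the shifted chain inflates the between-chain variance and pushes `rhat(2)` well above 1.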

As such, R^ may be interpreted as the degree to which the chains disagree, and a value of R^ < 1.01 is required for every instance of every parameter in the model. While high R^ values do not suggest a remedy in and of themselves, with the assistance of trace plots, we can begin to understand why the chain disagreement flagged by R^ might be occurring. Trace plots visualize chain behavior by plotting the sequence of parameter values sampled at each iteration in each chain, in order, as a line (called the chain trace). In the trace plots generated by matstanlib’s tracedensity.m function, a histogram of the samples across all chains is also included, as this representation of the marginal posterior often helps to interpret the trace plot. Some classic examples of ideal, acceptable, and unacceptable chain behavior are presented in Figure 3.

Figure 3.

Trace plots can be used to visualize chain (dis)agreement, and support troubleshooting of convergence issues signaled by high R^. Only the chain traces in (a) and (b) are acceptable; the traces in (c–f) each have a common yet serious problem. Extreme autocorrelation (c), drift (d), label-switching (e), and sticking (f), all tend to cause high R^, and all are unacceptable. Troubleshooting techniques are necessary to identify the root of these problems.

In an ideal situation where R^ is close to 1 and the chains are stationary and mixing well, the chain traces will tend to appear in a trace plot as in Figure 3a; this appearance has been said to resemble a furry caterpillar or a “bottle-brush.” More commonly in Bayesian cognitive models, there will be some degree of autocorrelation within each chain, meaning a tendency for similar values to be sampled in successive iterations versus more distant iterations (as seen in Figure 3b). Unless other diagnostics have failed (or effective sample size, as discussed below, is undesirably low), a mild amount of autocorrelation is not of concern, and the model may simply be run for more kept iterations. Extreme autocorrelation, on the other hand (as in Figure 3c), should be investigated further. If other diagnostics have failed, it is likely that this behavior indicates that the parameter space defined by the model is making it difficult for the sampler to move efficiently. Effective sample size plots (discussed in the next subsection) will likely help to determine the specific underlying issue.

Another unacceptable chain behavior that may be recognized in a trace plot is chain drift (Figure 3d). This may occur if starting points of one or more chains are still exerting an influence on the sampled values. Alternatively, this may occur if the drifting chain was in a local posterior maximum, and is (somewhat slowly) transitioning to a higher-density region of the posterior space. Regardless of cause, R^ will be high in cases of drift because one (or more) of the chains is not stationary. The first remedy to try in this situation is to run the model again with a longer warmup period (e.g., twice the number of warmup iterations).

A more challenging pattern to resolve is when each individual chain is stationary and moving well, but collectively the chains fail to mix, as they disagree on the location of the posterior distribution (which we call confident disagreement in Figure 3e). This is common to see when a cognitive model is insufficiently identified. For example, in a latent mixture model, behavior may be modeled as a weighted combination of two or more cognitive processes. If these component processes predict similar behavior, then the mixture parameter may only be weakly identified, and different chains may settle on different values of the mixture proportion. This phenomenon, known as label-switching, can also occur in models where two parameters are directly multiplied, but the priors and data are insufficient to identify more than the parameters’ product. In both cases, a first remedy is to make the priors more informative. However, if domain knowledge is not available or appropriate to incorporate, and the component processes or parameters cannot be distinguished in another way, then the experiment in which the behavioral data was collected may simply not be sufficient to distinguish the processes intended to be captured by the model.

The last pattern that may be signaled by high R^ is when a chain will seem to get stuck at a particular value for extended periods of time (as in Figure 3f). When this sticking behavior occurs, it is often a result of the sampler trying and failing to reach a nearby area in the joint parameter space. A common scenario in which this occurs is when the chain for a standard deviation parameter becomes stuck near 0. Most of the time, when sticking behavior is seen in a trace plot for one parameter, it will be seen for others as well. Unfortunately, this can sometimes make it challenging to identify the true culprit, but if divergences (which we discuss in a moment) occur when the chain sticks (as they often do), those divergences will likely be more useful to investigate.

In some situations, trace plots can be exceptionally difficult to interpret. For example, when a very high number of samples have been collected, squashing the long traces into a standard-sized plot can hide some problematic chain behaviors, and the traces will spuriously appear good (see Figure 4). Trace plots can also be difficult to judge when distributions are highly skewed and/or fat-tailed, in which case the stereotypical “bottlebrush” pattern would not be expected even when chains have converged and are mixing well. For these reasons, it is now recommended to use rank plots (Vehtari et al., 2021) in addition to, if not in place of, trace plots, so that any differences in sampled values across chains can be more reliably visually inspected. Rank plots are a new diagnostic plot that is generated by ranking the samples pooled across all chains, then presenting a histogram of the ranks originating from each chain separately. If the chains have perfectly converged, then the distribution of ranks for each chain should approximate a uniform distribution (as in Figure 4a). Deviations from uniformity can indicate a wider variety of convergence issues. Two examples of problems that are more easily detected in rank plots are presented in Figure 4b and 4c (see figure caption); more examples are included in Vehtari et al. (2021).
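The ranking-then-histogram procedure behind a rank plot can be sketched as follows. This is a minimal base-MATLAB illustration on toy chains (one of which is made artificially narrow); the bin count is an arbitrary assumption, ties are ignored, and the code is not the implementation used by any particular plotting function.

```matlab
% Rank-plot sketch: rank samples pooled across chains, then tabulate the
% ranks belonging to each chain separately.
rng(2);
nIter = 1000; nChains = 4;
chains = randn(nIter,nChains);
chains(:,3) = 0.5*chains(:,3);          % chain 3 samples a restricted range

% ranks of all samples pooled across chains (1 = smallest value)
[~,order] = sort(chains(:));
ranks = zeros(numel(order),1);
ranks(order) = (1:numel(order))';
ranks = reshape(ranks,nIter,nChains);   % back to [nIter x nChains]

% histogram of ranks for each chain
nBins  = 20;                            % arbitrary choice for illustration
edges  = linspace(0,nIter*nChains,nBins+1);
counts = zeros(nBins,nChains);
for c = 1:nChains
    counts(:,c) = histcounts(ranks(:,c),edges)';
end
% if chains agree, each entry of counts is near nIter/nBins
```

Calling `bar(counts(:,c))` for each chain would display the histograms; in this toy example the narrow chain contributes almost no samples to the lowest and highest rank bins, so its histogram is peaked rather than uniform.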

Figure 4.

Rank plots are a new way to visualize chain (dis)agreement, and support troubleshooting of convergence issues signaled by high R^. In some cases, rank plots can expose problems that were not visible in a classic trace plot. While all three trace plots look acceptable, the corresponding rank plots of the same chains reveal that this impression is only genuine for (a), where the ranks for all chains appear roughly uniformly distributed. In (b), the sticking behavior is hidden under the bulk of the chain traces, but is readily apparent from the peak in chain 1’s rank plot. Similarly, the lower variance of one chain in (c) is not discernible in the trace plot, but the skewed rank plot for chain 3 clearly suggests that this chain is sampling a restricted range of values relative to its fellows.

A recent addition to the suite of computational checks is divergences. For each posterior sample generated by the HMC/NUTS algorithm, whether that sample was a result of a diverging Hamiltonian trajectory is recorded as an indicator (0 or 1). Each divergence that occurs indicates that the sampler attempted to travel to a point in the joint posterior, but failed to do so. Most often this occurs when a chain is struggling to navigate a region of high posterior curvature, hence why divergences are sometimes very unevenly encountered across the independent chains, as in Figure 5a. Divergences are a critically important diagnostic because they signal that some part of the posterior distribution was unable to be sampled, and as a logical result, the available posterior samples are known to be biased.

Figure 5.

Diagnostic plots for troubleshooting divergences. (a) In these rug plots, a red tick marks each iteration within a given chain where a divergence occurred. (b) It is useful to include a similar plot of divergences (collapsed across chains) at the bottom of a trace plot. In this case, the divergences correlate with the chain sticking behavior (dark green chain 1, samples 650-800); the sampler is likely struggling to sample values near 0 for this parameter. (c) If univariate plots are insufficient to localize the issue, a bivariate density plot can demonstrate whether divergences (in red) are randomly distributed or are concentrated in one area. Here, the divergences concentrate at the tip of this funnel, where lower values of sigma, a standard deviation, increasingly constrain the ability to sample the values for mu, a mean parameter. (d) This is a common problem for hierarchical models that is overcome by reparameterizing the model (see (Re)parameterization subsection for details).

As such, when using an HMC/NUTS sampler, it is required to check that no divergences occurred. When divergences do occur, the samples should not be used for parameter estimation, model comparison, or any other type of inference. Instead, the output should be investigated in order to determine what parameter or part of the model specification may be inducing the unnavigable curvature.7
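A simple tally of the divergence indicators is often the first step of this investigation. The sketch below assumes the indicators are available as a logical [iterations × chains] matrix; that layout and the variable names are hypothetical, so the extraction step would need to be adapted to however one's own sampler output is stored.

```matlab
% Divergence-count sketch on a toy indicator matrix (hypothetical layout).
nIter = 1000; nChains = 4;
divergent = false(nIter,nChains);
divergent(650:800,1) = true;            % toy: chain 1 diverges over a stretch

nDivPerChain = sum(divergent,1);        % number of divergences in each chain
nDivTotal    = sum(nDivPerChain);
if nDivTotal > 0
    fprintf('%d divergences (%.1f%% of samples); by chain: %s\n', ...
        nDivTotal, 100*nDivTotal/numel(divergent), mat2str(nDivPerChain));
end

% iterations at which any chain diverged (as in a collapsed rug plot)
divIterations = find(any(divergent,2));
```

A count that is concentrated in one chain or in one stretch of iterations, as in this toy example, points to where the trace plots should be inspected first.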

Admittedly, in very rare cases, the sampler may record a divergence when the trajectory did not in fact diverge. While some sources note that divergences may be disregarded in special cases (Gabry et al., 2019; Schad et al., 2021), we find it important to note that these sources are universally written in the context of statistical linear models. In the context of Bayesian cognitive modeling, we do not recommend ever disregarding divergences. Unlike linear models, which have been used for more than a century and are exceptionally well-understood, cognitive models are fundamentally bespoke things: They are continually being customized, tweaked, and extended, and entirely novel models are regularly designed. In our experience, even established cognitive models can suddenly fail when applied to a new dataset (as in a case where a participant is not performing the task, and produces a series of nonsensical responses that “break” the model). For these reasons, we recommend that practitioners of Bayesian cognitive modeling always work from the assumption that divergences are genuine, and consider their diagnostic potential.8

The simplest strategy to use in investigating the cause of divergences is to generate trace plots with diagnostic overlays. In matstanlib, giving the diagnostics structure as an additional input to tracedensity.m:

tracedensity(samples,parameterNames,diagnostics)

will cause a rug plot of divergences to be included at the bottom of the trace plot (wherein each iteration for which a divergence was observed, regardless of chain identity, is marked by a red tick). Often if the aforementioned sticking behavior is seen in a trace plot, this sticking will correspond directly with the occurrence of divergences (as in Figure 5b). However, if a chain appears to stick in one parameter’s trace, this often constrains sampling for other parameters so that the same chain will appear to stick over the same iterations in their traces as well. To be sure of exactly which parameter is driving the divergences, one should therefore also visualize bivariate marginal densities with diagnostic overlay, as by using matstanlib’s jointdensity.m function:

jointdensity(samples,'mu','sigma',diagnostics)

For example, in Figure 5c, the divergences are concentrated at the bottom of the joint distribution, where the sigma parameter takes lower values. After fixing this model (Figure 5d), it is apparent that a part of the joint distribution near the divergences that was previously inaccessible is now able to be sampled. This model was corrected using reparameterization, as is often the required approach to overcome divergences. The goal of reparameterization is typically to enable the sampler to more easily navigate the posterior geometry. As such, the goal of investigating divergences is simply to identify which parameters or parts of the model specification are the best candidates for this reworking. We will return to problems indicated by divergent transitions in the next section.

Another recently introduced diagnostic that is specific to HMC/NUTS sampling is the estimated Bayesian fraction of missing information (BFMI or E-BFMI; Betancourt, 2016). BFMI is computed from the energy diagnostic recorded during the generation of each sample within each chain. It is a metric of the HMC/NUTS algorithm’s efficiency at a much deeper computational level than the other diagnostics we discuss here. Nonetheless, the interpretation of BFMI is clear: when the BFMI value computed from the energy history for a given chain is extremely low, it indicates that that chain was unable to efficiently and effectively explore the target distribution. In other words, similar to divergences, low BFMI values tell us that the posterior distribution was not fully explored, and as such, the samples we do have are insufficient and/or biased.

It is currently required that BFMI is ≥ 0.2 for all chains. If BFMI is less than 0.2 for one or more chains, the output should not be used as the basis for inference. Most often, the BFMI check will not be the only computational check that is failed; in this case, the other diagnostics are more targeted, and so should be investigated first. However, in some cases, only the BFMI computational check will not be passed. In this more difficult scenario, an energy diagnostic plot (Figure 6a,b) should be used to visualize the energy distribution differences for each chain, and a multivariate density plot that includes the energy diagnostic (Figure 6c) should be inspected. If one or more parameters’ posterior samples appear to correlate with energy, then places in the model specification that most directly involve these parameters are most likely to be problematic. Reparameterization is most likely to help.
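For intuition, the estimator described by Betancourt (2016) divides the summed squared energy changes between successive iterations by the summed squared deviations of the energy from its chain mean. The following base-MATLAB sketch applies that ratio to a toy energy history stored as an [iterations × chains] matrix; the toy values and variable names are assumptions for illustration, not real sampler output.

```matlab
% E-BFMI sketch on toy energy histories (columns are chains).
rng(3);
nIter = 1000; nChains = 4;
energy = randn(nIter,nChains);               % toy: energy decorrelates quickly
energy(:,4) = cumsum(0.1*randn(nIter,1));    % toy: chain 4's energy changes slowly

% per chain: squared energy transitions relative to overall energy variation
numer = sum(diff(energy,1,1).^2, 1);
denom = sum((energy - repmat(mean(energy,1),nIter,1)).^2, 1);
bfmi  = numer./denom;

lowBFMI = find(bfmi < 0.2);                  % chains failing the 0.2 criterion
```

In this toy example only the slowly changing chain falls below the criterion, because its energy transitions are small relative to the range of energies it eventually visits.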

Figure 6.

Diagnostic plots for troubleshooting low BFMI. (a) In an energy diagnostic plot, every chain’s marginal and transitional energy distributions should overlap. (b) If they do not, the discrepancy suggests that the chain has likely failed to efficiently explore the posterior distribution. (c) Including the energy diagnostic in a grid of bivariate densities may be used to identify which parameters are the most likely contributors to this inefficiency, as those parameters’ samples will tend to correlate with the energy history. (None of the parameters in this plot are suspicious.)

A third diagnostic that is specific to HMC/NUTS sampling is the treedepth used to reach the final parameter value selected for each sample. In those matstanlib functions that can highlight divergences in red (e.g., tracedensity.m, jointdensity.m, etc.), iterations for which the maximum treedepth was reached will be marked similarly in orange. This is generally not of much concern as it suggests computational inefficiency, but not failure. Nonetheless, if the maximum treedepth is being reached for many samples, it can indicate inadequate adaptation of the sampler, and therefore the model should be run again with a longer warmup. If treedepth warnings persist, the maximum treedepth setting of the sampler may be raised to allow longer trajectories, although this will severely increase the computation time needed per iteration. As this is less frequently encountered in Bayesian cognitive modeling, we do not discuss treedepth or other advanced topics related to sampler tuning dynamics here (but see Betancourt, 2017 or Hoffman & Gelman, 2014 for an introduction).

Sampling efficiency

The final computational check that model output must pass is related to sampling efficiency. Before the model output is used for inference, one should check that the samples offer enough certainty to support the specific sample-based estimates one intends to use as the basis for inference (Gelman et al., 2013; Vehtari et al., 2021). (While this concept is superficially similar to the idea of statistical power, unlike power, sampling efficiency can only be assessed post hoc. This is because the sampling efficiency check depends not on the number of samples one set out to collect, but on the sequence of samples that was actually collected.) Sometimes low sampling efficiency is obvious, as in cases of significant autocorrelation (see Figure 3c), where it appears that effectively fewer places were visited in the parameter space than actual samples were collected.9

Estimates of effective sample size (ESS; previously called the number of effective samples, Neff) get at exactly this issue, by quantifying sampling efficiency in a reliable way. When sampling efficiency is low, ESS will be lower than the actual number of samples collected. Although some inefficiency is extremely common in Bayesian cognitive models, typically it is not a problem unless the ESS is below the criterion. For all Bayesian models, it is now required that the ESS is at least 100 × the number of chains (i.e., ESS ≥ 400, assuming four chains) for all parameters. By default, the current implementation of ESS quantifies the sampling efficiency in both the bulk and tails of the posterior distribution, but ESS estimates can also be computed for other applications, such as for specific quantiles and small intervals of quantiles, as well as for the posterior mean, median, standard deviation, and mean absolute deviation (Vehtari et al., 2021).
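To see why autocorrelation lowers ESS, consider a deliberately simplified single-chain estimate: the number of samples divided by one plus twice the summed autocorrelations, with the sum truncated once adjacent pairs of lags stop being positive. The base-MATLAB sketch below applies this to a toy autocorrelated chain. It is an assumption-laden illustration only; it does not combine chains, rank-normalize, or separate bulk from tail as the estimators of Vehtari et al. (2021) do.

```matlab
% Simplified single-chain ESS sketch on a toy AR(1) chain.
rng(4);
N   = 4000;
phi = 0.9;                               % toy lag-1 autocorrelation
x = zeros(N,1);
x(1) = randn;
for t = 2:N
    x(t) = phi*x(t-1) + sqrt(1-phi^2)*randn;
end

% sample autocorrelations up to maxLag
xc = x - mean(x);
c0 = sum(xc.^2)/N;                       % lag-0 autocovariance
maxLag = 500;
rho = zeros(maxLag+1,1);                 % rho(k+1) holds the lag-k autocorrelation
for k = 0:maxLag
    rho(k+1) = sum(xc(1:N-k).*xc(1+k:N))/(N*c0);
end

% add pairs of lags while their sum stays positive, then stop
tau = -1;                                % ends as 1 + 2*sum(rho at lags >= 1)
for k = 1:2:maxLag
    pairSum = rho(k) + rho(k+1);         % lags (k-1) and k
    if pairSum <= 0, break; end
    tau = tau + 2*pairSum;
end
ess = N/tau;                             % far fewer than N effective samples
```

For a chain this strongly autocorrelated, the estimate is a small fraction of the 4000 samples actually collected, which is the situation the ESS criterion is designed to flag.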

Trace and rank plots may again be used to probe whether a low ESS seems to be a result of mild autocorrelation. If these diagnostic plots support this conjecture and no other problems are detected, then one may simply increase the number of kept iterations until all relevant ESS estimates are sufficient. (While past sources may recommend thinning the samples by a factor of n (meaning discarding every nth iteration) to reduce autocorrelation post hoc, thinning is no longer recommended, except in cases of severe computer memory constraints, as it degrades the precision of posterior estimates; Link & Eaton, 2012.) In many cases, however, when ESS is low for one or more parameters, it signals a deeper problem with the model. In particular, low ESS may suggest that one or another factor is making it difficult for the sampler to move through the parameter space. Diagnostic ESS plots (introduced very recently in Vehtari et al., 2021) may be used to clarify whether low ESS is indicative of any systematic sampling inefficiencies and biases.

The first such plot visualizes the efficiency over subsets of iterations (Figure 7a). This plot is particularly useful as some sampling inefficiencies may only become evident when sufficiently long chains are run. Ideally, sampling should be efficient enough that ESS estimates grow linearly with the number of samples. One should be wary if ESS estimates level off or decrease, as this suggests those periods of sampling were relatively less efficient; this metric should be stable over time.

Figure 7.

Diagnostic plots for troubleshooting low ESS. Ideally, ESS will grow linearly with the total number of samples (pooled across chains) in the efficiency per iteration plot (a), and all ESS estimates will be above the dashed line representing the minimum ESS in the efficiency of quantile estimates plot (b) and the local efficiency of small-interval estimates plot (c). While these patterns are seen for the well-behaved model (top), they do not hold for the problematic model (bottom), where tail ESS crashes as more samples are collected and other ESS measures are often below the criterion. Troubleshooting techniques are necessary to identify the root of these problems.

The other diagnostic ESS plots both help to visualize whether different values for a given parameter are being more or less efficiently estimated. Visualizing whether ESS estimates are notably lower for some quantiles (Figure 7b) or regions of quantiles (Figure 7c), especially if those quantiles seem to correlate with divergences or hitting max treedepth, can help to identify what areas of the marginal posterior are driving the low ESS. For example, while ESS might be somewhat lower for extreme quantiles, if it is so markedly lower in one region that it is below the ESS criterion, it may suggest that that area of the parameter space is unable to be effectively explored. Such a conclusion might help to explain other diagnostics, such as low BFMI, by isolating which part of the posterior is problematic.

In this way, diagnostic ESS plots can be helpful to distinguish cases of too few samples due to an acceptable amount of autocorrelation (in which case more samples may simply be collected) from deeper, more fundamental problems with a model (in which case the model specification should be improved). While the latter is more challenging to correct, we will discuss techniques to target and remedy these issues in the next section, Identifying the root issue.

Model consistency

Because the computational checks described in the previous section are designed to assess the quality of the MCMC sampling, a different approach is required to assess whether the model itself is behaving in a way that is consistent with one’s expectations of it. For example, by checking whether basic assumptions about model behavior, such as the ability to recover known parameter values, are satisfied or are violated, the internal consistency of the model can be assessed. Likewise, by checking what patterns of behavior are implied by the model before it is exposed to data, we can assess whether the model’s initial expectations are consistent with, and therefore appropriate as a model of, the range of behavior we might reasonably expect to observe. As such, we call this second style of model assessment consistency checks.

The two consistency checks that we discuss here are essential techniques for detecting problems with Bayesian cognitive model behavior, and should be seen as of equal importance to the computational checks.

Prior predictives

Even before the simulation study is performed, one can use the model specification to generate the prior predictive distribution of behavioral data for the model. The prior predictive is computed by simulating data in exact accordance with the model specification, as one would to generate data for a simulation study: hyperparameter values are drawn from the hyperpriors, and subsequently used to draw parameter values from the priors, which are used to simulate data from the model’s data distribution. After repeating this process many times, one may visualize the distribution of this data or, more usefully, of some meaningful function of the data. This is called a prior predictive check, and is useful to reveal both subtle and deep problems with a model specification as it allows one to observe whether the range of behavioral data that is implied by the model a priori (i.e., before any data is seen by the model) is reasonable and consistent with domain knowledge (Vanpaemel, 2010).
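The draw-parameters-then-simulate loop behind a prior predictive check can be sketched for a generic two-armed bandit learner. Everything specific in this base-MATLAB example is an assumption made for illustration: the flat (non-hierarchical) priors, the reward probabilities, the delta-rule update with a softmax choice rule, and the block size. It is not the specification of the RL model presented in this tutorial.

```matlab
% Prior predictive sketch: simulate choices from parameters drawn from
% placeholder priors, before any data are involved.
rng(5);
nSims = 500; nTrials = 60;
pReward = [0.8 0.2];                    % option 1 is the "correct" choice
correct = false(nSims,nTrials);

for s = 1:nSims
    alpha = rand;                       % learning rate ~ Uniform(0,1)
    beta  = 10*rand;                    % inverse temperature ~ Uniform(0,10)
    Q = [0.5 0.5];                      % initial action values
    for t = 1:nTrials
        pChoose1 = 1/(1 + exp(-beta*(Q(1)-Q(2))));   % two-option softmax
        if rand < pChoose1, a = 1; else, a = 2; end
        r = double(rand < pReward(a));  % Bernoulli reward
        Q(a) = Q(a) + alpha*(r - Q(a)); % delta-rule update
        correct(s,t) = (a == 1);
    end
end

% average prior predictive learning curve
learningCurve = mean(correct,1);

% per-simulation accuracy in blocks of 10 trials: the spread across rows
% is the prior predictive distribution of learning curves
blockAcc = squeeze(mean(reshape(correct,nSims,10,[]),2));
```

Plotting a histogram of each column of `blockAcc` shows how much prior weight the specification places on near-chance, typical, and near-perfect performance at each stage of learning.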

Typically, prior predictive checks for Bayesian cognitive models must be customized to the research context. An ideal prior predictive check for a Bayesian cognitive model would be to assess the specific patterns of behavior that inspired the model. In Figure 8, we present prior predictive distributions for three different specifications of our RL model of the bandit task. Specifically, we visualized the prior predictive distribution of accuracy across trials to better understand what range of learning curves (as in Figure 2b) are more or less likely under a given specification of the model.

Figure 8.

Prior predictive checks help to assess whether the model specification is consistent with one’s expectations about behavior. Here, for three versions of the example RL model, prior predictive learning curves are plotted as a probability density over the proportion correct relative to each trial (where darker colors indicate higher density). The prior predictive should not place excessive weight on unlikely patterns of behavior (a) nor should it place too little weight on patterns of behavior that might reasonably be observed (b). The ideal prior predictive for our RL model example (c) is consistent with the range of behaviors that is reasonably expected, but is diffuse enough to include all possible behavioral patterns.

If the prior predictive distribution reveals that too much probability has been placed on grossly unrealistic data, it strongly suggests a problem with the model specification, of which the prior is a core part (Gelman et al., 2017). For example, if an inappropriately wide swath of behavioral patterns are predicted (e.g., if equal weight were given to learning curves that fall over time and to learning curves that rise over time), it may suggest that the parameters of the model are too loosely constrained; in this case, more informative priors may help. Unfortunately, we see exactly this problem with the prior predictive for the model specification outlined earlier, which is shown in Figure 8a. This prior predictive is suggesting that the model considers it most likely for participants to have accuracy near chance across all trials. This is fundamentally inconsistent with the learning behavior we expect to observe.

It is also a problem if a severely restricted range of behavior is predicted, as in Figure 8b, where a smaller range of learning curves is implied by the model than we tend to observe in the lab. Even though the most weight is given to the most commonly observed patterns of behavior, this model has failed the prior predictive check because it is too tightly constrained, most likely as a result of priors that are excessively informative. Altering the priors and structure of the model specification in a way that alleviates these sorts of imbalances may not only lead to a more suitable prior predictive, but sometimes may be sufficient to resolve some computational and recovery failures.

Ideally, a prior predictive distribution will encompass a sufficiently broad range of possible behavior such that, in our example, any learning curve we could possibly expect to observe is given a nonzero amount of probability by the prior predictive, and the typical range of behavior we expect to observe is given just slightly more weight. After reparameterizing the RL model, the prior predictive matches this ideal description (Figure 8c). Even when problems are not revealed, performing a prior predictive check is an excellent way to better understand the behavior and capabilities of a model.

Parameter recovery

Another way to assess the capabilities of a Bayesian cognitive model is to perform a parameter recovery check. If a model is applied to data that was simulated from the exact same model specification (including randomly generating parameter values from the priors and hyperpriors), then the model should be able to infer parameter values from that data that are reasonably close to the true values (i.e., the model should be capable of recovering those values). Good recovery is especially critical to demonstrate for Bayesian cognitive models as, unlike linear models or Bayesian hypothesis tests, they are often custom or novel models. Even if an established Bayesian cognitive model is being used, recoverability might not generalize from one study to the next if the hierarchical structure of the model was adjusted to accommodate different participant groupings, experimental designs, etc., or if different computational methods are used for model fitting.

The technical requirement for a Bayesian parameter recovery check is that the true value of a parameter will fall within the corresponding 95% credible interval for 95% of the parameters in the model (or 89% will fall within the 89% interval, etc.; Rubin, 1984), but this should always be supported by visual inspection. The ability to recover known parameter values is necessary to establish before a model’s output is used as the basis for inference because if a model cannot recover from a known ground truth, then it certainly will not suddenly be able to offer coherent and useful parameter estimates when applied to experimentally collected data. Any statistical inference based on a model that fails a recovery check will be invalid.
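The coverage requirement can be checked numerically alongside the visual inspection. The base-MATLAB sketch below assumes toy posterior draws stored as a [samples × parameters] matrix and a matching vector of true values; both are fabricated stand-ins with hypothetical names, constructed so that the intervals are well calibrated.

```matlab
% Coverage sketch: proportion of true values inside their 95% credible
% intervals (toy draws, hypothetical variable names).
rng(6);
nParams = 200; nSamples = 4000;
trueTheta = randn(nParams,1);                        % toy ground truth
postSD    = 0.3;
postMean  = trueTheta + postSD*randn(nParams,1);     % toy posterior centers
draws     = repmat(postMean',nSamples,1) + postSD*randn(nSamples,nParams);

% central 95% credible interval for each parameter, from sorted draws
sorted = sort(draws,1);
lower  = sorted(round(0.025*nSamples),:)';
upper  = sorted(round(0.975*nSamples),:)';

isCovered = (trueTheta >= lower) & (trueTheta <= upper);
coverage  = mean(isCovered);             % should be near 0.95
```

A coverage proportion far below the nominal level would indicate that the model is overconfident or biased for at least some parameters, which the recovery plots can then help to localize.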

When a model is tested with simulated data (as it first should be), performing the parameter recovery check is simply a matter of generating a series of recovery plots. In matstanlib, recovery plots may be generated with the recoveryplot.m function. In Figure 9, each plot visualizes how closely the true parameter values and the model-derived estimates correspond. Ideally, most estimates will fall along the diagonal unity line (representing perfect recovery), and the credible intervals will be small, which suggests the model is very certain in its estimates (as in Figure 9a). However, for a Bayesian cognitive model, this degree of confidence can be difficult to achieve unless a great deal of data is available. For the smaller data sets that psychologists often work with, larger credible intervals and slightly more dispersion around the unity line are not unexpected (as in Figure 9b), and should also be seen as sufficient. Generally, if the true and estimated values appear moderately correlated or better, and the credible intervals are not undesirably wide or skewed, then recovery for that parameter may be classified as good (although for particular applications, weaker recovery for some parameters may be sufficient or stronger recovery for all parameters may be desired).

Figure 9.


Parameter recovery plots. When recovery is good, nearly all parameters’ credible intervals (vertical lines) will include the true value, and so will overlap the unity or “perfect recovery” line (diagonal), and the point estimates (markers) will tend to cluster nearby. The quality of recovery in (a–c) is all potentially acceptable, although the model is less certain in (b) and there is some amount of shrinkage in (c); depending on the context, these outcomes may or may not be sufficient. The quality of recovery in (d–f) is generally unacceptable. In (d), the credible intervals reveal that what initially looked like shrinkage (main panel) was better characterized as extreme uncertainty (inset). The consistent overestimation in (e) and abject failure to recover in (f) indicate moderate and severe problems with the model, respectively.

If the true and estimated values appear correlated, but also flattened (as in Figure 9c), such that higher true values are underestimated while lower true values are overestimated, then it suggests a mild to moderate degree of shrinkage is occurring. It is very common to see this kind of squashing of parameter estimates in hierarchical models, due to the regularizing influence of the hyperprior, which can sometimes markedly pull the estimates toward the group-level mean. However, extreme shrinkage, or nearly flat estimates, can indicate that the parameter estimates are being excessively influenced by the prior. This can occur if the hyperpriors are too strongly informative, or if the data is not sufficient to meaningfully inform the parameter (i.e., the likelihood is too flat).

It is important to note that the quality of parameter recovery for a hierarchical Bayesian model is judged not only by the point estimates, but also by the credible intervals, as they carry critical information about the degree of uncertainty in the parameter estimate, the exclusion of which may be misleading. For example, in Figure 9d, looking only at the point estimates suggests that an excessive amount of shrinkage has occurred. When the credible intervals are included in the recovery plot, it becomes apparent that the bigger issue is the extreme uncertainty. The credible intervals in this case mimic the highest-density region of the prior density, which suggests that either not enough data is available to fit the model, or the data that is available is not informative enough to update this parameter’s value.

Another common problem that recovery plots may reveal is a parameter being consistently overestimated or underestimated (as in Figure 9e). The first thing to check in this scenario is the prior. If the prior places most of its weight on values that are far from the values the model would otherwise infer, then, depending on the model, this can bias all estimates in the same direction. This kind of consistent bias may also occur when two or more parameters are “trading off,” in which case a pathological inverse coupling is observable each time the model is fit. While this behavior is sometimes just a structural fact of some models (Krefeld-Schwalb et al., 2021), in other cases, it can be ameliorated by making the priors for the relevant parameters more informative or by reparameterizing.

Finally, if a parameter is being estimated in a way that demonstrates no relationship to the true values whatsoever (as with the collapsed estimates in Figure 9f), there has been a total failure to recover. Sometimes this most extreme mode of recovery failure actually has a simple solution: it may be the result of a good ol’ fashioned implementation error in the model specification. (Figure 9f was actually the result of inadvertently commenting out a line in the data simulation code, such that the parameter in question had no influence on the likelihood.) Other times, such an abject lack of ability for the model to capture one or more dynamics of the proposed cognitive process will suggest a deeper problem with the model. In this case, many of the diagnostic plots that we discuss next may potentially help to get at the root of the problem.

Identifying the root issue

While some problems that have been detected have straightforward solutions, other problems will require a longer and more investigative troubleshooting process before a good candidate solution can be identified. In many cases, visualizing the posterior samples for multiple parameters simultaneously will be a fruitful approach to identify the root cause of the detected problems.

Posterior geometry

These visualizations are especially important tools when attempting to identify the cause of issues related to posterior geometry, such as the regions of high curvature that are notoriously difficult for HMC/NUTS samplers to traverse. As mentioned earlier, this is likely to be the case when the sampler reports divergences. The goal is to uncover where exactly that region is located, and which parameters are most directly implicated in creating the high curvature.

Grids of many bivariate marginal posterior densities are the most useful visualizations in this pursuit as they allow one to visualize posterior dependencies among parameters. In matstanlib, the multidensity.m function may be used to present bivariate densities for multiple pairs of parameters in the same figure (see Figure 10a). This makes it readily apparent when parameters are correlated, as the bivariate density will appear as an oblong shape. However, parameter correlations are far less of a problem for HMC/NUTS algorithms than they would be for a Gibbs sampler, and so are not as likely to be the source of the detected issue.

Figure 10.


Visualizing the posterior samples for multiple parameters simultaneously using (a) grids of bivariate marginal densities with diagnostic overlays and (b) parallel coordinate plots are both useful ways to search for problems related to parameterization. In the grid of densities, one should look for parameters where divergences (red x’s) are not randomly distributed, but rather are clustered together. In the (z-scored) parallel coordinate plot, each line represents a joint posterior sample. One should look for where the red lines representing divergent samples seem to “pull together.” Both of these plots clearly suggest the root of the issue is with the lower bound for sigma_phi.

A better candidate for the source of divergences is a bivariate density with a funnel shape, as is commonly observed in hierarchical models. In some contexts, when setting priors directly on mean and standard deviation hyperparameters, progressively smaller values of the standard deviation parameter will increasingly constrain what values of the mean parameter can feasibly be sampled. This portion of the bivariate density can take on a much higher curvature than the rest of the posterior, in which case the sampler will struggle to navigate this tip of the funnel (as signaled by the concentration of divergences at the bottom of the joint distribution in Figure 5c). If identified, this issue may be corrected by converting to a non-centered parameterization (as described in detail at the conclusion of this section), which enables the funnel to be fully explored by breaking the dependency between the relevant parameters (Betancourt & Girolami, 2015).

Other potentially difficult to navigate posterior regions can occur at parameter boundaries. For example, in a Gamma prior where the shape parameter is constrained such that the prior density is 0 at a parameter value of 0, but the highest density value in the target distribution is sufficiently near 0, a region of high curvature may be induced at the prior’s lower bound. Sometimes domain knowledge requires a prior such as this to be specified, but due to the constraint at 0, the sampler may attempt and fail to sample very small values. A multidensity plot will show evidence of this as the divergences will concentrate near the problematic boundary. Selecting an alternative parameterization of the prior can help to overcome this issue as well (as we discuss in the following subsection).

It can be difficult to use a multidensity plot when a very high proportion of the posterior samples are the result of divergences, as the densities can seem to all be covered in the red divergence indicators. In these cases, a multidensity plot can still be useful as a quick way to view many univariate marginal densities (shown along the diagonal of the figure): If bumps or multimodality are observed in the univariate distribution, this may suggest a place where the divergences are more highly concentrated, even when it is not otherwise apparent.

We can also observe concentrations of divergences in parallel coordinate plots, which in matstanlib are generated by the parallelsamples.m function. In such a plot, each joint posterior sample for a given subset of model parameters is presented as a separate line, and iterations resulting from divergent transitions are plotted in red (as in Figure 10b). These plots are especially useful to uncover the cause of divergences, as for the most closely implicated parameters, divergent sample lines will appear to “pull together” (while appearing randomly distributed across the values of other parameters). How this issue should be resolved will depend on that parameter’s role in the model specification.

A challenge in using both of these plots as diagnostic tools is that there may be too many parameters to include on the plot at once. A strategy that we have found useful is to begin with the parameters at the highest levels of hierarchy in the model, then work downward. At lower levels where there are many parameter instances, including one or two instances of each parameter is also generally more useful than visualizing many instances of the same parameter. However, in some cases, neither the parallel sample and multidensity plots, nor the R̂ statistics, ESS plots, or other previously discussed techniques will clearly implicate any particular part of the model specification. Critical review of the model specification may prove more worthwhile in these cases. If the specific issue is still unclear after a thorough troubleshooting process, some sort of reparameterization is generally likely to be the solution.

(Re)parameterization

Once the troubleshooting process has led to the identification of a parameter or segment of the model specification that is problematic, one or more strategies to alter the current parameterization of the model may be applied to attempt to resolve the model’s issues. Here, we review the most commonly useful techniques for such alterations, from simply bounding the hyperparameters to reparameterization techniques that enact deeper structural changes to a model.

If one has already conducted a prior predictive check and found the distribution of data implied by the model to be lacking, a complementary technique is to simply ensure that the priors are sensible by simulating and visualizing the distribution of priors permitted by the current model specification. In Bayesian cognitive models, it is not uncommon to use a non-conjugate, non-normal prior, nor is it uncommon to include hierarchical structure in the model. Unfortunately the conjunction of these two design decisions can make it exceptionally difficult to specify good hyperpriors. For example, while one may have an intuition for what a good participant-level Gamma prior distribution would be, one may feel at a loss in determining suitable hyperpriors for the shape and rate hyperparameters of that Gamma prior. In this scenario, using prior simulation to visualize the distribution of priors implied by different hyperpriors can be indispensable. (For an overview of prior elicitation for hierarchical Bayesian cognitive models, see Lee & Vanpaemel, 2018.)

matstanlib can automate prior simulation through the hyperpriortester.m function, the output of which is shown in Figure 11a and 11b. Given the specified hyperprior distributions (top left), random samples from each distribution (or function of it; top right) are used to define a random selection of priors (bottom). If extremely undesirable priors are too often being sampled, prior simulation should be used to explore alternative hyperpriors. In these simulations, it can also sometimes help to constrain or transform the hyperparameters in such a way that the unsuitable priors are no longer possible. For example, for a parameter defined over the (0, 1) interval, one may wish to set hyperpriors that imply a distribution over Beta priors that is not unduly biased toward any part of the parameter space (i.e., that is relatively uninformative in context). In Figure 11a, a choice of Gamma(1,1) hyperpriors is revealed to lead to the overselection of priors that place infinite density at both 0 and 1 (approximately 40% of priors), which is inconsistent with our intentions for this parameter (represented by the reference prior, Beta(2,2)). If a minimum value of 1 is enforced for each hyperparameter, as in Figure 11b, these U-shaped (“horseshoe”) priors are no longer possible, and a more reasonable distribution over priors is achieved.

Figure 11.


Prior simulation may be used to check whether the priors and hyperpriors are consistent with domain knowledge and other expectations. In this example, a Gamma hyperprior is specified for each hyperparameter of a Beta prior. (a) The original hyperpriors lead to undesirably high prior weight at the extreme values of the parameter of interest. (b) Enforcing a minimum value of 1 on both hyperparameters (by specifying ~ Beta(1 + a, 1 + b) instead of ~ Beta(a, b)) prevents the selection of U-shaped priors, allowing for a more appropriate distribution over priors that, on average, allows for a more even spread of prior weight across the whole range of the parameter value, excluding the bounds.
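The proportion of U-shaped priors implied by a pair of hyperpriors can also be estimated numerically. The Python sketch below is not matstanlib code; the function name `fraction_u_shaped` and its `offset` argument are hypothetical. It draws both Beta hyperparameters from Gamma(1, 1) hyperpriors and counts how often both fall below 1, which is the condition under which a Beta density is unbounded at both 0 and 1. Setting `offset=1.0` corresponds to specifying Beta(1 + a, 1 + b).

```python
import random


def fraction_u_shaped(n_draws=20000, offset=0.0, seed=1):
    """Estimate how often Beta(offset + a, offset + b) is U-shaped.

    a and b are drawn from Gamma(shape=1, rate=1) hyperpriors. Note that
    random.gammavariate takes a shape and a SCALE; with rate 1 the scale is
    also 1, so the two parameterizations coincide here.
    """
    rng = random.Random(seed)
    n_u_shaped = 0
    for _ in range(n_draws):
        a = offset + rng.gammavariate(1.0, 1.0)
        b = offset + rng.gammavariate(1.0, 1.0)
        # A Beta density diverges at 0 when a < 1 and at 1 when b < 1.
        if a < 1.0 and b < 1.0:
            n_u_shaped += 1
    return n_u_shaped / n_draws


print(fraction_u_shaped(offset=0.0))  # roughly 0.40 under Gamma(1, 1) hyperpriors
print(fraction_u_shaped(offset=1.0))  # 0.0 once a minimum value of 1 is enforced
```

The first value can be checked analytically: each hyperparameter falls below 1 with probability 1 − exp(−1) ≈ 0.63, so both do with probability of about 0.40, which agrees with the proportion reported for Figure 11a.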

We have found that prior simulation is often a critical support to prior predictive checks, and vice versa. While prior predictive checks may reveal implications of the model that are inconsistent with one’s expectations, they rarely also reveal the exact cause of that inconsistency: in these cases, prior simulation can help to identify the root of the issue. However, good prior simulation alone is likewise insufficient: prior predictive checks must be used to demonstrate that the priors make sense in the context of the likelihood (from which they cannot be divorced; Gelman et al., 2017). The selection of good hyperpriors can also be guided by the statistical literature on hyperprior elicitation (e.g., Berger et al., 2005).

If it is still difficult to simulate a sensible apportionment of probability across the distribution of priors, then changing the distributional form of the prior can open up additional opportunities for model improvement. In particular, changing the form of a prior may help in cases where one or more features of the model specification are known to induce posterior geometries that are challenging for MCMC algorithms to navigate. For example, prior distributions with fat tails, such as the Cauchy and Student’s t distributions, can lead to divergences and high treedepths when sampling in the tails. If this occurs, using a lighter-tailed alternative, such as a normal distribution, should be the first step.

Changing the form of the prior is especially likely to help in cases where the cognitive model demands that a parameter be defined over just a subset of the reals and truncation is currently being used to effect the domain constraint. In some cases, selecting an alternative prior form that is naturally defined on the desired domain may help to resolve a variety of issues. For example, rather than truncating a distribution to the positive reals (when it is naturally defined over the entire real line), one might use a distribution that is already defined only on the positive reals, such as an Exponential or Lognormal distribution. In a similar fashion, rather than doubly truncating a distribution when both a lower and an upper bound are needed, the generalized Beta distribution (meaning, a Beta distribution that has been scaled and/or shifted such that it is defined over a domain other than [0, 1]) is likewise a handy alternative.10 If such a change of prior leads to less interpretable hyperparameters, one can often use known formulae to derive more useful quantities (e.g., samples for a Gamma prior’s shape and rate hyperparameters may be used to compute, sample by sample, the hyper-level mean and standard deviation). matstanlib supports exactly these kinds of post hoc reparameterizations with the hypertransform.m function, which will apply select commonly used transformations as needed.
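To illustrate the kind of sample-by-sample transformation described above, the following Python sketch converts Gamma shape and rate values into the implied mean and standard deviation using the standard formulae mean = shape / rate and sd = sqrt(shape) / rate. It is written independently of matstanlib and is not a description of how hypertransform.m is implemented; the function name `gamma_mean_sd` and the placeholder draws are hypothetical.

```python
import math
import random


def gamma_mean_sd(shape, rate):
    """Mean and standard deviation of a Gamma(shape, rate) distribution."""
    return shape / rate, math.sqrt(shape) / rate


# Placeholder draws standing in for posterior samples of the shape and rate
# hyperparameters (illustrative values only, not output from any fitted model).
rng = random.Random(1)
shape_draws = [rng.uniform(2.0, 4.0) for _ in range(5)]
rate_draws = [rng.uniform(0.5, 1.5) for _ in range(5)]

# Apply the transformation to each joint sample in turn, so that the derived
# quantities retain the dependence between shape and rate.
derived = [gamma_mean_sd(a, b) for a, b in zip(shape_draws, rate_draws)]
for mean, sd in derived:
    print(round(mean, 3), round(sd, 3))
```

Transforming each joint sample, rather than transforming summary statistics of the shape and rate separately, yields samples from the implied distribution of the hyper-level mean and standard deviation.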

Alternatively, some priors with less-easily-interpreted parameters have established reparameterizations that are more intuitive to work with, and may facilitate the setting of hyperpriors in hierarchical models. For example, in Stan, the Gamma distribution is parameterized by a shape parameter, α, and a rate parameter, β (while in MATLAB, the Gamma distribution is parameterized slightly differently, in terms of a shape and scale parameter, where the scale is simply the inverse of the rate). A more interesting parameterization of the Gamma distribution is in terms of its mean. The mean of a Gamma distribution is the ratio of its shape to its rate, μ = α/β. A simple variable substitution permits the reparameterization Gamma(α, α/μ). Other distributions have known mean-based reparameterizations; we have found these to be helpful to strike an easier balance between sampler-friendly geometry and parameter interpretability in a variety of contexts.
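For reference, the substitution behind the mean-based Gamma parameterization can be written out in full. Writing the mean of a shape-rate Gamma distribution as the ratio of shape to rate and solving for the rate gives the following (the expression for the variance is a standard property of the Gamma distribution, added here for convenience):

$$
\begin{aligned}
\mu = \frac{\alpha}{\beta} \quad &\Longrightarrow \quad \beta = \frac{\alpha}{\mu}, \\
\theta \sim \text{Gamma}(\alpha, \beta) \quad &\Longleftrightarrow \quad \theta \sim \text{Gamma}\!\left(\alpha, \frac{\alpha}{\mu}\right), \\
\text{Var}(\theta) = \frac{\alpha}{\beta^{2}} &= \frac{\mu^{2}}{\alpha}.
\end{aligned}
$$

Under this parameterization, a hyperprior may be placed directly on the mean μ, while the shape α controls the dispersion around it: for a fixed mean, larger values of α imply a smaller variance.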

Other small parameterization tricks are especially useful when there is a need to exert control over the prior near a boundary to avoid model misspecification (as for parameters whose values cannot conceivably be 0 and sufficient domain knowledge is available to further specify what parameter values should qualify as “near 0” or “practically equivalent to 0”). For example, in the reinforcement learning model outlined earlier, an inverse temperature of β = 0 breaks the model, as the learned Q values then have no bearing on action selection. As such, a prior for β that apportions most of the prior probability to 0 and values near 0 is effectively a model misspecification, as the most prior weight is given to not just the least likely values, but values that are so inappropriate that one would conclude the model is malfunctioning rather than accept the estimates. One approach to cope with exactly this scenario is to use a boundary-avoiding prior (BAP) or zero-avoiding prior (ZAP). While these parameterizations are no longer seen as appropriate for standard deviation parameters, especially in Bayesian linear statistical models where they risk censoring valid model configurations, for some Bayesian cognitive model parameters, BAPs and ZAPs are sometimes not only permissible, but more appropriate than alternative prior specifications. In our corrected RL model, we use a Gamma prior for the inverse temperature parameter where the shape parameter is required to be greater than 1; this creates a ZAP by ensuring the Gamma distribution has zero density at a value of 0. In our earlier prior simulation example, in which we set lower bounds on both hyperparameters of a Beta distribution, you may notice that we created a simultaneous ZAP (at the lower bound) and BAP (at the upper bound; see Figure 11). While this category of prior is, again, inappropriate for standard deviations, our final reparameterization technique is likely to be an applicable alternative.

That reparameterization technique is non-centered parameterization (previously called the “Matt trick” in some older sources), which is uniquely useful for hierarchical models. If pathological funnel-shaped geometry has been generated (as shown earlier in Figure 5c), it may have occurred because a centered parameterization was used:

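$$
\begin{aligned}
\mu &\sim \text{Normal}(15, 5) \\
\sigma &\sim \text{Gamma}(1, 1) \\
\theta_n &\sim \text{Normal}(\mu, \sigma)
\end{aligned}
$$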

so called because the prior is centered on the mean parameter. This section of the model may be rewritten to use a non-centered parameterization:

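$$
\begin{aligned}
\mu &\sim \text{Normal}(15, 5) \\
\sigma &\sim \text{Gamma}(1, 1) \\
\eta &\sim \text{Normal}(0, 1) \\
\theta_n &= \mu + \sigma \eta
\end{aligned}
$$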

which is mathematically equivalent. Even though the same hyperpriors are used, this expansion allows the entirety of the funnel to be explored efficiently, by introducing an auxiliary sampled variable η that is independent of μ, and then rescaling it by the sampled standard deviation σ (Figure 5d). While non-centered parameterizations are most frequently applied in the context of normal distributions, they may be applied to overcome a funnel pathology for any distribution that is parameterized in terms of a location and scale parameter. The example_funnel.m script included in matstanlib demonstrates both a centered and non-centered parameterization for a toy model from Betancourt and Girolami (2015).
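The equivalence of the centered and non-centered forms can be checked by forward simulation. The Python sketch below draws from the hierarchical prior both ways and compares the first two moments of the resulting θ values. The function name `draw_theta` is hypothetical, and the second argument of each Normal distribution is assumed to be a standard deviation, as in Stan. Note that this only illustrates that the two forms define the same distribution; it does not reproduce the difference in posterior geometry that an HMC/NUTS sampler encounters, which requires fitting the model to data.

```python
import random
import statistics


def draw_theta(n_draws=20000, centered=True, seed=1):
    """Forward-simulate theta under a centered or non-centered form."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        mu = rng.gauss(15.0, 5.0)           # mu ~ Normal(15, 5)
        sigma = rng.gammavariate(1.0, 1.0)  # sigma ~ Gamma(1, 1); rate 1 = scale 1
        if centered:
            theta = rng.gauss(mu, sigma)    # theta ~ Normal(mu, sigma)
        else:
            eta = rng.gauss(0.0, 1.0)       # eta ~ Normal(0, 1), independent of mu
            theta = mu + sigma * eta        # deterministic shift and rescale
        draws.append(theta)
    return draws


centered = draw_theta(centered=True)
non_centered = draw_theta(centered=False)
print(statistics.mean(centered), statistics.variance(centered))
print(statistics.mean(non_centered), statistics.variance(non_centered))
```

Both sets of draws should have a mean near 15 and a variance near 27 (that is, 25 from μ plus E[σ²] = 2 from the Gamma(1, 1) hyperprior). In the non-centered form, the quantity actually sampled is η, whose prior does not depend on σ, which is what removes the funnel from the space the sampler must explore.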

Of course, parameterization and reparameterization are such broad terms that we cannot hope to cover more than a handful of the most popular methods, techniques, and tricks here. While some approaches that we do not cover here, such as parameter expansion, are broadly applicable (Gelman, 2004; for a Bayesian cognitive model example, see Matzke et al., 2015), you may find others that are only common within a specific category of Bayesian cognitive models.

From troubleshooting to model development

At this point in the tutorial, we have described in great detail how diagnostic checks and plots may be used to detect and identify the most commonly encountered problems in Bayesian cognitive modeling. Along the way, we have suggested which remedies are most likely to correct these problems in a wide variety of situations. Each time you apply a Bayesian cognitive model, it is always necessary to perform the model-checking steps and any subsequently needed troubleshooting as outlined here to ensure that the output from your model is (1) computationally sufficient and (2) consistent with your intentions. Even if you are using an established Bayesian cognitive model, diagnostic checks and plots can suddenly reveal problems when the model is applied to a new dataset, or is fit with a different sampling algorithm.

In this section, we discuss a few final techniques that each test different assumptions about model behavior, and discuss how they may support troubleshooting for Bayesian cognitive models in particular. While the latter two steps are more often spoken about in the context of model development, they should also be considered part of the troubleshooting process as they support the identification of the kind of shortcomings that can compromise the validity of model-based inferences.

Depending on how one is planning to apply a given Bayesian cognitive model (or models), some or all of these final checks may be needed to ensure your model is capable of doing what you will ask of it. Some of these techniques may need to be customized to your model to even be implemented at all.

Simulation-based calibration

A still somewhat procedural method of troubleshooting that is applicable to all Bayesian cognitive models is simulation-based calibration (SBC). Similar to parameter recovery, SBC is a way to establish that a model’s estimates are internally consistent (Cook et al., 2006; Talts et al., 2018). In fact, SBC is the technique to formally validate a Bayesian model as it is currently implemented, by testing whether the posteriors tend to be overly wide, overly narrow, or otherwise biased. This is accomplished by running Nrep replications of a recovery scoring routine. For each replication, true parameter values θ̃ drawn from the priors are used to simulate a dataset ỹ, which is then submitted to the model to collect a relatively small number L of post-warmup iterations. The result of each replication is each true parameter value’s rank within the corresponding posterior samples. If the model is correctly implemented, then the distribution of ranks across the Nrep replications will appear uniform for every parameter in the model (Talts et al., 2018). Depending on the computational demands of the model (and on Nrep and L), this may take a considerable amount of time.
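The basic rank computation can be sketched in a few lines of Python. This example uses a toy conjugate model (a Normal(0, 1) prior with unit-variance normal observations) so that independent draws can be taken directly from the exact posterior. This is a simplifying assumption that stands in for MCMC output, and because the draws are independent, chain autocorrelation does not arise here. The function names `sbc_ranks` and `rank_histogram` are hypothetical.

```python
import random


def sbc_ranks(n_rep=1000, n_post=19, n_obs=10, seed=1):
    """Return the rank of each true value among n_post posterior draws."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_rep):
        theta_true = rng.gauss(0.0, 1.0)                        # draw from the prior
        y = [rng.gauss(theta_true, 1.0) for _ in range(n_obs)]  # simulate a dataset
        # Exact conjugate posterior (stand-in for fitting the model with MCMC).
        post_var = 1.0 / (1.0 + n_obs)
        post_mean = sum(y) * post_var
        post_sd = post_var ** 0.5
        draws = [rng.gauss(post_mean, post_sd) for _ in range(n_post)]
        # Rank = number of posterior draws below the true value (0..n_post).
        ranks.append(sum(d < theta_true for d in draws))
    return ranks


def rank_histogram(ranks, n_post):
    """Count how often each possible rank (0..n_post) occurred."""
    counts = [0] * (n_post + 1)
    for r in ranks:
        counts[r] += 1
    return counts


ranks = sbc_ranks()
print(rank_histogram(ranks, n_post=19))
```

With a correctly implemented model, the counts should be roughly equal across the L + 1 possible ranks. A histogram that piles up at the extremes or in the middle would suggest posteriors that are too narrow or too wide, respectively, and a skewed histogram would suggest bias.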

There has been a recent push to consider SBC as less of an option and more of a requirement in Bayesian workflows, especially when working with models that are novel and relatively untested, as nearly all Bayesian cognitive models are. However, because chain autocorrelation violates the assumptions of SBC, an amended and extended version of the base SBC procedure will often need to be used to properly validate Bayesian cognitive model scripts. To use the extended version of SBC, psychologists will need to program a much more demanding and complex version of SBC and inspect additional diagnostic plots. This extended procedure is still being tested and actively developed by Bayesian statistical researchers. Since the version of SBC needed for Bayesian cognitive models is currently less well-defined, at this point in time we do not recommend that psychologists deploy SBC unless they are comfortable reading the most current version of the Talts et al. (2018) paper in detail, coding the autocorrelation-correcting version of the procedure, and, most importantly, keeping up to date with the SBC literature. However, we do recommend that all psychologists expect to incorporate SBC in their modeling pipelines soon, as it is the proper way to ensure that a model produces computationally valid, self-consistent estimates on a parameter-by-parameter basis.

Model recovery

When multiple models are to be applied to the same experimental data, one should strongly consider performing a model recovery study, which establishes whether one can adequately recover the identity of the model that generated the data, given the set of candidate models (Pitt et al., 2003). The model recovery procedure is simple to explain. First, N datasets are simulated from each of M models. Then, for each dataset, all M models are applied and compared using a fully Bayesian model comparison metric (e.g., WAIC or LOO; Vehtari et al., 2017). The output of the study is a contingency table summarizing the frequency with which each data-generating model was judged to be the best-fitting model (i.e., a confusion matrix). If the models are sufficiently distinguishable, then the model comparisons will identify the true generating model for the majority of the M · N datasets. If the confusion matrix instead shows that the generating model is frequently misidentified, then the results of any model comparison involving the application of this set of models to experimentally collected data should not be trusted. Unfortunately, conducting a model recovery study requires an even greater investment of time (and we acknowledge that not all researchers have a backup computer that can be tied up for a week). Still, we strongly recommend that a model recovery study be performed for new model comparisons, especially when a Bayesian cognitive model is novel or newly extended, as in these cases model recoverability is a uniquely important assumption to test. This is particularly important if the study means to draw strong conclusions based on model comparison, such as arbitrating between theories represented by two models (e.g., theory A represents data better than theory B).
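The structure of a model recovery study can be sketched with two toy models of binary choice data: a "fixed" model in which the response probability is 0.5, and a "free" model in which it has a uniform prior. To keep the example self-contained, the models are compared by their exact log marginal likelihoods, which are available in closed form here. This is a swapped-in stand-in for the WAIC or LOO comparison named above, not an implementation of either, and all names in the code are hypothetical.

```python
import math
import random

MODELS = ("fixed", "free")


def simulate(model, n_trials, rng):
    """Simulate the number of successes in n_trials under the given model."""
    p = 0.5 if model == "fixed" else rng.random()  # free: p ~ Uniform(0, 1)
    return sum(rng.random() < p for _ in range(n_trials))


def log_marginal(model, k, n_trials):
    """Exact log marginal likelihood of a sequence with k successes."""
    if model == "fixed":
        return n_trials * math.log(0.5)
    # Integral of p^k (1 - p)^(n - k) over a uniform prior is Beta(k+1, n-k+1).
    return (math.lgamma(k + 1) + math.lgamma(n_trials - k + 1)
            - math.lgamma(n_trials + 2))


def model_recovery(n_datasets=200, n_trials=100, seed=1):
    """Return a confusion matrix: confusion[generating][best_fitting]."""
    rng = random.Random(seed)
    confusion = {gen: {fit: 0 for fit in MODELS} for gen in MODELS}
    for gen in MODELS:
        for _ in range(n_datasets):
            k = simulate(gen, n_trials, rng)
            best = max(MODELS, key=lambda m: log_marginal(m, k, n_trials))
            confusion[gen][best] += 1
    return confusion


for gen, row in model_recovery().items():
    print(gen, row)
```

Each row of the resulting table corresponds to a generating model and each column to the model judged best; good recoverability appears as large counts on the diagonal. In a real study, the scoring step would be replaced by fitting each candidate model to each simulated dataset and computing the chosen comparison metric, which is what makes the procedure so time-consuming.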

Posterior predictives

Another way to understand the differences between models, or to better understand the behavior of a single model, is to compute the posterior predictive distribution (Gelman et al., 2013; Rubin, 1984). For a given model, the posterior predictive distribution p(x̃ | x) is the distribution of future data x̃ implied by the fitted model. It is computed by using each joint posterior sample in turn to generate a new data point or dataset. Visualizing the distribution of these behavioral responses gives a sense of what patterns of data one would expect to see given the model and the data that was already observed, while accounting for the posterior uncertainty in each parameter.11 One may perform a posterior predictive check by simply comparing a function of the posterior predictive distribution to a function of the observed data.

The simplest posterior predictive check, which may be generated using matstanlib’s postpredhist.m function, is to compare a histogram of the posterior predicted data, x̃, to a histogram of the observed data, x. While this is a good check for some linear models, for a Bayesian cognitive model, it is more common (and far more useful) to compare more meaningful functions of the data. As with a prior predictive check, posterior predictive checks for Bayesian cognitive models most often rely on selecting summaries of the data that are meaningful within the specific research context. For example, in a reinforcement learning model, these summaries may be the learning curves and asymptotic means; in a decision-making task, a good summary may be the participant-level rates of a sub-optimal behavior.

A posterior predictive check is a final assessment of the internal consistency of a model. If the model is behaving as intended, then the data used to estimate the parameters of the model should easily fall within the spread of the posterior predictive distribution of data based on those same estimates. These checks may be used to evaluate the suitability of a single Bayesian cognitive model, or to compare the relative adequacy of multiple different models, when used as a qualitative complement to a quantitative model comparison. In the latter case, a failed posterior predictive check may be used to falsify a candidate model, and thereby provide stronger conclusions about how well a successful model captures a phenomenon of interest (Palminteri et al., 2017; Wilson & Collins, 2019).
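A minimal numerical version of such a check is sketched below in Python for a toy model: a single response probability with a uniform prior, fit to a count of successes. The observed count and trial number are made-up illustrative values, the posterior is drawn from directly (a Beta distribution) rather than obtained by MCMC, and the function name `posterior_predictive_check` is hypothetical. The summary being checked is simply the number of successes; for a cognitive model, this would be replaced by a more meaningful summary such as a learning curve.

```python
import random


def posterior_predictive_check(k_obs=14, n_trials=20, n_samples=4000, seed=1):
    """Check whether the observed count falls in the 95% predictive interval.

    k_obs and n_trials are illustrative placeholder data, not real results.
    """
    rng = random.Random(seed)
    k_rep = []
    for _ in range(n_samples):
        # One posterior sample of p under a uniform (Beta(1, 1)) prior.
        p = rng.betavariate(1 + k_obs, 1 + n_trials - k_obs)
        # Use that sample to simulate one replicated dataset and summarize it.
        k_rep.append(sum(rng.random() < p for _ in range(n_trials)))
    k_rep.sort()
    lower = k_rep[int(0.025 * n_samples)]
    upper = k_rep[int(0.975 * n_samples) - 1]
    return lower, upper, lower <= k_obs <= upper


lower, upper, passed = posterior_predictive_check()
print(lower, upper, passed)
```

Because each replicated dataset is generated from a different posterior sample, the spread of the replicated summaries reflects both the sampling variability of the data and the posterior uncertainty in the parameter.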

While severe misfits between the posterior predictive and the observed behavior can be used to invalidate a model (or, again, suggest a need for further troubleshooting), small misfits may be seen as opportunities. This misfit may be reconsidered as the goal of a model development process — but at this point, the lines between model checking, model development, and model usage have become severely blurred. The techniques in this last collection of troubleshooting procedures are all also important steps in a larger Bayesian modeling workflow (Gelman et al., 2020) that may be used for a wider variety of purposes. What specific problems they might identify, how one should work to remedy those problems, and even how exactly the technique should be implemented in the first place, will all be more dependent on your domain knowledge, your planned applications, the specific hypotheses you will use the model(s) to test, and so on. Nonetheless, we still consider these final procedures to be troubleshooting techniques as each is capable of detecting flaws in a model that might be remedied through another iteration of the troubleshooting process.

The troubleshooting process ends when no further problems of any kind can be detected, and as such, from both a computational perspective and a model consistency perspective, one is reasonably confident that any inferences one will make based on the model will be computationally sufficient, internally consistent, and reasonable for the task at hand. Of course, even a properly troubleshot model’s estimates or predictions may still miss a subtler behavioral pattern, conflict with your domain knowledge, violate a theoretical maxim, etc. Even when a model is computationally sufficient, has good internal consistency, and generates reasonable predictions, some aspect of the model’s output may yet be inconsistent with your expectations. As such, you may wish to extend or modify the model so that it better matches your expert understanding of the relevant theory, past experimental research, and so on. At this point, the process of altering the model specification is no longer called troubleshooting, but model development. As such, one’s continued pursuit of a better Bayesian cognitive model is less of a problem to be solved, and more of a project in a line of model-based research.

Reporting results

When publishing results from research using Bayesian cognitive modeling, authors should explicitly mention that the required model checks were performed. It is not necessary to record and report exhaustively every detail of your troubleshooting and model development process (although this may be done as a “postregistration” of model-based work; Lee et al., 2019). However, the final specification of the model that is being used and what diagnostic checks were performed should always be made clear. All reports of results from Bayesian cognitive models should include the model specification (i.e., the likelihood and priors used), the sampling algorithm used (including any actively given sampler-specific inputs), and the criteria used to evaluate the computational sufficiency of the model. An example of how this may be reported is:

With these model specifications in hand, we used Stan (Gelman et al., 2015) to estimate the joint posterior distribution of each model via Markov chain Monte Carlo sampling. For each model, we ran 4 chains of 500 warmup iterations and 1500 kept iterations each, then performed a series of diagnostic checks. We required an R̂ value of ≤ 1.01 and an effective sample size of ≥ 400 for all parameters, a BFMI of ≥ 0.2 for all chains, and that no divergences were observed. When we reported 90% credible intervals (equal-tailed), we also required an effective sample size of ≥ 400 for the 5% and 95% quantiles of those parameters. These checks were supported by a visual inspection of diagnostic and other plots. Finally, before we applied the model to our data, we demonstrated that the model was capable of recovering known parameter values when fit to simulated data (see Figure X). Only kept iterations from models that met these criteria were used for inference.
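The numeric criteria in a statement like this one can also be encoded directly, so that the same thresholds are applied to every model. The sketch below is a minimal Python illustration under stated assumptions: the function `check_reporting_criteria`, its argument layout, and the example parameter names and values are hypothetical, and the diagnostic values themselves are assumed to have been computed beforehand by the modeling software or a support library.

```python
def check_reporting_criteria(rhat, ess, bfmi, n_divergent,
                             rhat_max=1.01, ess_min=400, bfmi_min=0.2):
    """Return a list of readable failure messages (empty if all checks pass).

    rhat, ess   : dicts mapping a parameter name to its diagnostic value
    bfmi        : list with one BFMI value per chain
    n_divergent : total number of divergent transitions across chains
    """
    failures = []
    for name, value in rhat.items():
        if value > rhat_max:
            failures.append(f"R-hat for {name} is {value:.3f} (> {rhat_max})")
    for name, value in ess.items():
        if value < ess_min:
            failures.append(f"ESS for {name} is {value:.0f} (< {ess_min})")
    for chain, value in enumerate(bfmi, start=1):
        if value < bfmi_min:
            failures.append(f"BFMI for chain {chain} is {value:.2f} (< {bfmi_min})")
    if n_divergent > 0:
        failures.append(f"{n_divergent} divergent transition(s) observed")
    return failures


# Hypothetical diagnostic values for one fitted model.
failures = check_reporting_criteria(
    rhat={"mu_alpha": 1.002, "sigma_alpha": 1.013},
    ess={"mu_alpha": 2150.0, "sigma_alpha": 380.0},
    bfmi=[0.91, 0.88, 0.95, 0.84],
    n_divergent=0,
)
for message in failures:
    print(message)
```

The effective sample sizes for the 5% and 95% quantiles can be passed as additional entries in the `ess` dictionary. A check like this complements, rather than replaces, the visual inspection of diagnostic plots.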

Conclusion

While Bayesian cognitive modeling can be a challenging method to use properly, it is also a rewarding approach to psychological research that is only increasing in popularity (Jarecki et al., 2020; van de Schoot et al., 2017). In this tutorial, we have sought to make the troubleshooting process clear and accessible, especially for psychologists who may be new to Bayesian cognitive modeling. While the exact sequence of troubleshooting steps needed will be different depending on one’s choice of cognitive model, experimental design, and planned application, one should now have a firm enough grasp on the core tenets of Bayesian troubleshooting to investigate one’s own models. One will not only now know the most essential steps (from the requisite automated computational checks, through the investigative toolbox of diagnostic plots, to the more custom methods to ensure the model is functioning as intended), but should also be able to judge the quality of the output at each step. Ultimately, it is our hope that this guide will not only encourage more vigilant and conscientious use of Bayesian cognitive models, but might also empower psychologists to build and apply Bayesian cognitive models in their own research with confidence in the quality of their work.

Acknowledgments

We would like to thank Aspen Yoo, Amy Zou, Milena Rmus, Gaia Molinaro, and Soobin Hong for their comments on an earlier draft. This work was supported by NIH grant #R01MH119383.

Appendix

Code command conversion chart

Table A1.

Closest counterparts of various matstanlib commands in similar libraries for other programming languages.

| command name | MATLAB (matstanlib) | R (bayesplot) | Python (ArviZ) |
|---|---|---|---|
| Core functionality | | | |
| reformat samples and diagnostics | extractsamples | as.array, nuts_params | from_pystan, etc. |
| generate a table of posterior statistics and convergence diagnostics | mcmctable | monitor | summary |
| diagnostics report | interpretdiagnostics | check_hmc_diagnostics | check_hmc_diagnostics |
| Diagnostic plots | | | |
| trace plot | tracedensity | mcmc_trace, mcmc_combo | plot_trace |
| rank plots | rankplots | mcmc_rank_hist | plot_rank |
| divergences by chain | plotdivergences | | |
| energy plot | plotenergy | mcmc_nuts_energy | plot_energy |
| ESS diagnostic plots | plotess | | plot_ess |
| bivariate density with marginals | jointdensity | mcmc_scatter | plot_pair |
| grid of bivariate densities | multidensity | mcmc_pairs | plot_pair |
| parallel coordinates plot | parallelsamples | mcmc_parcoord | plot_parallel |
| parameter recovery plot | plotrecovery | mcmc_recover_scatter | |
| Other functionality | | | |
| prior simulation | hyperpriortester | | |
| posthoc application of select known reparameterizations | hypertransform | | |

Function names in gray indicate the command is from the main interface package (i.e., Rstan, PyStan) as similar functionality is not included in the support package (i.e., is not available in bayesplot or ArviZ). If no function name is given, then as of this writing (12/2021), there is no counterpart in the interface package or the specified support package.

Footnotes

1

It is important to note that Bayesian cognitive modeling is also distinct from a Bayesian theory of mind approach (which is sometimes termed “Bayes in the head”; e.g., Griffiths et al., 2008). Bayesian models of mind view Bayes’ theorem as a cognitive mechanism in and of itself, one that is capable of capturing how one might rationally update their beliefs about the world in light of their experiences. In contrast, Bayesian cognitive models are used to express a wide variety of other candidate cognitive mechanisms and processes (which are not required to be rational; as such, models may even be explicitly designed to capture non-optimal behavioral patterns, e.g., Busemeyer et al., 2011); Bayesian methods are only used as the technique for parameter estimation.

2

Although other methods such as variational Bayes (e.g., Galdo et al., 2020) are sometimes used, MCMC methods are by far the most widely used family of techniques for Bayesian cognitive model fitting.

3

Some of these sources also use linear models as their guiding example, rather than cognitive process models. It is important to distinguish between the two, as some techniques for the development of linear models are of limited use for cognitive models, and vice versa.

4

Humans.

5

If you are familiar with Gibbs sampling, then you may notice that the recommended number of kept iterations is far lower than the number of samples recommended for Gibbs sampling. With HMC/NUTS, fewer posterior samples are required due to the much higher efficiency, especially with respect to the ability of HMC/NUTS to move throughout the kinds of complex and high-curvature posterior geometries common in Bayesian cognitive modeling. For example, while parameter correlations greatly hinder Gibbs sampling, this is not often the case for HMC/NUTS. As a result, where Gibbs sampling via JAGS might require 10,000–100,000 samples, HMC/NUTS sampling via Stan might require only 1000–2000 samples. You may also notice that we do not mention thinning samples. Thinning samples is no longer recommended (Link & Eaton, 2012).

6

While the statistically complete way to accomplish this is simulation-based calibration (SBC; Talts et al., 2018), for Bayesian cognitive modeling applications, we recommend parameter recovery studies over SBC because (1) most Bayesian cognitive model output violates a key assumption of SBC (as we explain in a later section), (2) parameter recovery studies are so familiar and expected in psychological research that SBC will likely supplement rather than replace recovery plots, and (3) if one always tests a Bayesian cognitive model using simulated data first, then one gets parameter recovery “for free” (meaning with no extra effort, whereas SBC requires a more concerted effort).

7

Veteran practitioners of Bayesian cognitive modeling who update their modeling pipelines from JAGS to Stan, for example, may find that model specifications that previously passed convergence checks suddenly fail due to the detection of divergences. While it is tempting to infer that Gibbs sampling is more capable of estimating such a model, this is unlikely to be the case, as HMC/NUTS is a more powerful sampler that overcomes many limitations of Gibbs sampling (such as being challenged by correlated parameters). Rather, it is most likely that the Gibbs sampler was silently failing to explore the posterior distribution fully, and these failures only became detectable with the advanced diagnostics of HMC/NUTS.

8

matstanlib includes a script to demonstrate exactly this issue. example_funnel.m implements a toy model from Betancourt and Girolami (2015) that was written to demonstrate a serious structural problem common in hierarchical models, that is sometimes signaled only by divergences. We encourage you to run this script a few times: You may notice that on some runs, only a small number, or even 0 divergences occur. Consider whether a couple of divergences can be disregarded, given that even this model which is designed to fail may not reliably throw divergences.

9

While this is technically an oversimplification, it is the correct intuition.

10

We also caution against truncation generally, even when it works well in a model, because it can make extending the model later on extremely difficult.

11

It is important to note that this is different than using the collection of point parameter estimates for each parameter to simulate new data, which is an incorrect approach for Bayesian models. Using only the point estimates ignores the uncertainty associated with these estimates, and as such is unlikely to support a full understanding of the range of behavior seen as likely by the fitted model. (Also consider that the collection of marginal posterior point estimates is not necessarily a point in the joint parameter space that was visited during sampling, let alone guaranteed to be the most likely point in the joint parameter space.)
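The parenthetical point can be made concrete with a deliberately stylized Python sketch. The ring-shaped set of joint draws below is a hypothetical stand-in for a posterior with strong nonlinear dependence between two parameters, chosen only because it makes the effect easy to see; it is not output from any model in this tutorial.

```python
import math
import statistics

# Hypothetical joint posterior draws for two parameters (a, b) that lie on
# a ring of radius 1, standing in for a strongly curved joint posterior.
n_draws = 1000
draws = [
    (math.cos(2 * math.pi * i / n_draws), math.sin(2 * math.pi * i / n_draws))
    for i in range(n_draws)
]

# Marginal posterior means, computed separately for each parameter.
mean_a = statistics.fmean([a for a, _ in draws])
mean_b = statistics.fmean([b for _, b in draws])

# Distance from the pair of marginal means to the closest joint draw.
nearest = min(math.hypot(a - mean_a, b - mean_b) for a, b in draws)

print(f"marginal means: ({mean_a:.3f}, {mean_b:.3f})")
print(f"distance to nearest joint draw: {nearest:.3f}")
```

Here the pair of marginal means sits at the center of the ring, about one full radius away from every draw, so data simulated from that pair would reflect a parameter combination the sampler never visited. Simulating from the joint draws themselves avoids this problem.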

References

  1. Addicott MA, Pearson JM, Schechter JC, Sapyta JJ, Weiss MD, & Kollins SH (2021). Attention-deficit/hyperactivity disorder and the explore/exploit trade-off. Neuropsychopharmacology, 46(3), 614–621.
  2. Ahn W-Y, Haines N, & Zhang L (2017). Revealing neurocomputational mechanisms of reinforcement learning and decision-making with the hBayesDM package. Computational Psychiatry, 24–57.
  3. Annis J, & Palmeri TJ (2018). Bayesian statistical approaches to evaluating cognitive models. Wiley Interdisciplinary Reviews: Cognitive Science, 9(2), e1458.
  4. Annis J, & Palmeri TJ (2019). Modeling memory dynamics in visual expertise. Journal of Experimental Psychology: Learning, Memory, and Cognition, 45(9), 1599.
  5. Berger JO, Strawderman W, & Tang D (2005). Posterior propriety and admissibility of hyperpriors in normal hierarchical models. The Annals of Statistics, 33(2), 606–646.
  6. Betancourt M (2016). Diagnosing suboptimal cotangent disintegrations in Hamiltonian Monte Carlo. arXiv preprint arXiv:1604.00695.
  7. Betancourt M (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.
  8. Betancourt M, & Girolami M (2015). Hamiltonian Monte Carlo for hierarchical models. In Upadhyay S, Singh U, Dey D, & Loganathan A (Eds.), Current trends in bayesian methodology with applications (pp. 79–101). Chapman & Hall/CRC.
  9. Blanchard TC, & Gershman SJ (2018). Pure correlates of exploration and exploitation in the human brain. Cognitive, Affective, & Behavioral Neuroscience, 18(1), 117–126.
  10. Boehm U, Marsman M, Matzke D, & Wagenmakers E-J (2018). On the importance of avoiding shortcuts in applying cognitive models to hierarchical data. Behavior research methods, 50(4), 1614–1631.
  11. Brooks SP (2003). Bayesian computation: A statistical revolution. Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 361(1813), 2681–2697.
  12. Brooks S, Gelman A, Jones G, & Meng X-L (2011). Handbook of markov chain monte carlo. Chapman & Hall/CRC.
  13. Brown VM, Zhu L, Solway A, Wang JM, McCurry KL, King-Casas B, & Chiu PH (2021). Reinforcement learning disruptions in individuals with depression and sensitivity to symptom change following cognitive behavioral therapy. JAMA psychiatry, 78(10), 1113–1122.
  14. Bürkner P-C (2017). Advanced Bayesian multilevel modeling with the R package brms. arXiv preprint arXiv:1705.11123.
  15. Busemeyer JR, Pothos EM, Franco R, & Trueblood JS (2011). A quantum theoretical explanation for probability judgment errors. Psychological review, 118(2), 193.
  16. Cook SR, Gelman A, & Rubin DB (2006). Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics, 15(3), 675–692.
  17. Dearden R, Friedman N, & Andre D (1998). Bayesian q-learning. Proceedings of the 30th annual conference of the Cognitive Science Society, 761–768.
  18. Donkin C, Kary A, Tahir F, & Taylor R (2016). Resources masquerading as slots: Flexible allocation of visual working memory. Cognitive Psychology, 85, 30–42.
  19. Etz A, Gronau QF, Dablander F, Edelsbrunner PA, & Baribault B (2018). How to become a Bayesian in eight easy steps: An annotated reading list. Psychonomic Bulletin & Review, 25(1), 219–234.
  20. Etz A, & Vandekerckhove J (2018). Introduction to Bayesian inference for psychology. Psychonomic Bulletin & Review, 25(1), 5–34.
  21. Farrell S, & Lewandowsky S (2018). Computational modeling of cognition and behavior. Cambridge University Press.
  22. Gabry J, & Mahr T (2021). Bayesplot: Plotting for Bayesian models [R package version 1.8.0]. https://mc-stan.org/bayesplot/
  23. Gabry J, Simpson D, Vehtari A, Betancourt M, & Gelman A (2019). Visualization in Bayesian workflow. Journal of the Royal Statistical Society: Series A (Statistics in Society), 182(2), 389–402.
  24. Galdo M, Bahg G, & Turner BM (2020). Variational Bayesian methods for cognitive science. Psychological methods, 25(5), 535.
  25. Gelman A (2004). Parameterization and bayesian modeling. Journal of the American Statistical Association, 99(466), 537–545.
  26. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, & Rubin DB (2013). Bayesian data analysis (3rd ed.). Chapman & Hall/CRC.
  27. Gelman A, Carlin JB, Stern HS, & Rubin DB (1995). Bayesian data analysis (1st ed.). Chapman & Hall/CRC.
  28. Gelman A, Lee D, & Guo J (2015). Stan: A probabilistic programming language for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics, 40(5), 530–543.
  29. Gelman A, & Rubin DB (1991). A single series from the Gibbs sampler provides a false sense of security. Bayesian Statistics, 4, 625–631.
  30. Gelman A, & Rubin DB (1992). Inference from iterative simulation using multiple sequences. Statistical science, 7(4), 457–472.
  31. Gelman A, Simpson D, & Betancourt M (2017). The prior can often only be understood in the context of the likelihood. Entropy, 19(10), 555.
  32. Gelman A, Vehtari A, Simpson D, Margossian CC, Carpenter B, Yao Y, Kennedy L, Gabry J, Bürkner P-C, & Modrák M (2020). Bayesian workflow. arXiv preprint arXiv:2011.01808.
  33. Gilks WR, Richardson S, & Spiegelhalter D (1995). Markov chain Monte Carlo in practice. Chapman & Hall/CRC.
  34. Golubickis M, Falben JK, Cunningham WA, & Macrae CN (2018). Exploring the self-ownership effect: Separating stimulus and response biases. Journal of Experimental Psychology: Learning, Memory, and Cognition, 44(2), 295.
  35. Greene NR, & Rhodes S (2020). A tutorial on cognitive modeling for cognitive aging researchers.
  36. Griffiths TL, Kemp C, & Tenenbaum JB (2008). Bayesian models of cognition. In Sun R (Ed.), Cambridge handbook of computational psychology (pp. 59–100). Cambridge University Press.
  37. Haines N, Beauchaine TP, Galdo M, Rogers AH, Hahn H, Pitt MA, Myung JI, Turner BM, & Ahn W-Y (2020). Anxiety modulates preference for immediate rewards among trait-impulsive individuals: A hierarchical Bayesian analysis. Clinical Psychological Science, 8(6), 1017–1036.
  38. Hoffman MD, & Gelman A (2014). The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.
  39. Hütter M, & Klauer KC (2016). Applying processing trees in social psychology. European Review of Social Psychology, 27(1), 116–159.
  40. Jarecki JB, Tan JH, & Jenny MA (2020). A framework for building cognitive process models. Psychonomic bulletin & review, 1–12.
  41. Johnson DJ, Cesario J, & Pleskac TJ (2018). How prior information and police experience impact decisions to shoot. Journal of personality and social psychology, 115(4), 601.
  42. Katahira K (2016). How hierarchical models improve point estimates of model parameters at the individual level. Journal of Mathematical Psychology, 73, 37–58.
  43. Krefeld-Schwalb A, Pachur T, & Scheibehenne B (2021). Structural parameter interdependencies in computational models of cognition. Psychological Review.
  44. Kruschke J (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press.
  45. Kumar R, Carroll C, Hartikainen A, & Martin OA (2019). ArviZ a unified library for exploratory analysis of Bayesian models in Python. The Journal of Open Source Software. doi: 10.21105/joss.01143
  46. Lee MD (2008). Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review, 15(1), 1–15.
  47. Lee MD (2011). How cognitive modeling can benefit from hierarchical Bayesian models. Journal of Mathematical Psychology, 55(1), 1–7.
  48. Lee MD, Criss AH, Devezer B, Donkin C, Etz A, Leite FP, Matzke D, Rouder JN, Trueblood JS, White CN, et al. (2019). Robust modeling in cognitive science. Computational Brain & Behavior, 2(3), 141–153.
  49. Lee MD, & Vanpaemel W (2018). Determining informative priors for cognitive models. Psychonomic Bulletin & Review, 25(1), 114–127.
  50. Lee MD, & Wagenmakers E-J (2014). Bayesian cognitive modeling: A practical course. Cambridge university press.
  51. Link WA, & Eaton MJ (2012). On thinning of chains in mcmc. Methods in Ecology and Evolution, 3(1), 112–115.
  52. Matzke D, Boehm U, & Vandekerckhove J (2018). Bayesian inference for psychology, part III: Parameter estimation in nonstandard models. Psychonomic Bulletin & Review, 25(1), 77–101.
  53. Matzke D, Dolan CV, Batchelder WH, & Wagenmakers E-J (2015). Bayesian estimation of multinomial processing tree models with heterogeneity in participants and items. Psychometrika, 80(1), 205–235.
  54. Matzke D, Hughes M, Badcock JC, Michie P, & Heathcote A (2017). Failures of cognitive control or attention? the case of stop-signal deficits in schizophrenia. Attention, Perception, & Psychophysics, 79(4), 1078–1086.
  55. Navarro DJ, Newell BR, & Schulze C (2016). Learning and choosing in an uncertain world: An investigation of the explore–exploit dilemma in static and dynamic environments. Cognitive psychology, 85, 43–77.
  56. Navarro DJ (2021). If mathematical psychology did not exist we might need to invent it: A comment on theory building in psychology. Perspectives on Psychological Science, 1745691620974769.
  57. Nilsson H, Rieskamp J, & Wagenmakers E-J (2011). Hierarchical bayesian parameter estimation for cumulative prospect theory. Journal of Mathematical Psychology, 55(1), 84–93.
  58. Nunez MD, Gosai A, Vandekerckhove J, & Srinivasan R (2019). The latency of a visual evoked potential tracks the onset of decision making. Neuroimage, 197, 93–108.
  59. Palminteri S, Wyart V, & Koechlin E (2017). The importance of falsification in computational cognitive modeling. Trends in cognitive sciences, 21(6), 425–433.
  60. Peters J, & D’Esposito M (2020). The drift diffusion model as the choice rule in inter-temporal and risky choice: A case study in medial orbitofrontal cortex lesion patients and controls. PLoS computational biology, 16(4), e1007615.
  61. Pitt MA, Kim W, & Myung IJ (2003). Flexibility versus generalizability in model selection. Psychonomic bulletin & review, 10(1), 29–44.
  62. Plummer M (2003). Jags: A program for analysis of Bayesian graphical models using Gibbs sampling. Proceedings of the 3rd international workshop on distributed statistical computing, 124(125.10), 1–10.
  63. Ratcliff R, & McKoon G (2008). The diffusion decision model: Theory and data for two-choice decision tasks. Neural computation, 20(4), 873–922.
  64. Rouder JN, & Lu J (2005). An introduction to Bayesian hierarchical models with an application in the theory of signal detection. Psychonomic bulletin & review, 12(4), 573–604.
  65. Rouder JN, Speckman PL, Sun D, Morey RD, & Iverson G (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic bulletin & review, 16(2), 225–237.
  66. Rubin DB (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 1151–1172.
  67. Salvatier J, Wiecki TV, & Fonnesbeck C (2016). Probabilistic programming in Python using PyMC3. Peer J Computer Science, 2, e55.
  68. Schad DJ, Betancourt M, & Vasishth S (2021). Toward a principled Bayesian workflow in cognitive science. Psychological methods, 26(1), 103.
  69. Scheibehenne B, & Pachur T (2015). Using Bayesian hierarchical parameter estimation to assess the generalizability of cognitive models of choice. Psychonomic bulletin & review, 22(2), 391–407.
  70. Shiffrin RM, Lee MD, Kim W, & Wagenmakers E-J (2008). A survey of model evaluation approaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32(8), 1248–1284.
  71. Sutton RS, & Barto AG (2018). Reinforcement learning: An introduction. MIT press.
  72. Talts S, Betancourt M, Simpson D, Vehtari A, & Gelman A (2018). Validating Bayesian inference algorithms with simulation-based calibration. arXiv preprint arXiv:1804.06788.
  73. van de Schoot R, Winter SD, Ryan O, Zondervan-Zwijnenburg M, & Depaoli S (2017). A systematic review of Bayesian articles in psychology: The last 25 years. Psychological Methods, 22(2), 217.
  74. Van Ravenzwaaij D, Cassey P, & Brown SD (2018). A simple introduction to Markov Chain Monte-Carlo sampling. Psychonomic bulletin & review, 25(1), 143–154.
  75. Vandekerckhove J, Tuerlinckx F, & Lee M (2008). A bayesian approach to diffusion process models of decision-making. Proceedings of the 30th annual conference of the Cognitive Science Society, 1429–1434.
  76. Vanpaemel W (2010). Prior sensitivity in theory testing: An apologia for the Bayes factor. Journal of Mathematical Psychology, 54(6), 491–498.
  77. Vehtari A, Gelman A, & Gabry J (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and computing, 27(5), 1413–1432.
  78. Vehtari A, Gelman A, Simpson D, Carpenter B, & Bürkner P-C (2021). Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC. Bayesian Analysis, 16(2). doi: 10.1214/20-ba1221
  79. Wiecki TV, Sofer I, & Frank MJ (2013). Hddm: Hierarchical bayesian estimation of the drift-diffusion model in python. Frontiers in neuroinformatics, 7, 14.
  80. Wilson RC, & Collins AG (2019). Ten simple rules for the computational modeling of behavioral data. Elife, 8, e49547.
