Published in final edited form as: Biol Psychiatry Cogn Neurosci Neuroimaging. 2023 Jan 11;8(8):822–831. doi: 10.1016/j.bpsc.2023.01.001

From Classical Methods to Generative Models: Tackling the Unreliability of Neuroscientific Measures in Mental Health Research

Nathaniel Haines 1, Holly Sullivan-Toole 2, Thomas Olino 3
PMCID: PMC10333448  NIHMSID: NIHMS1902431  PMID: 36997406

Abstract

Advances in computational statistics and corresponding shifts in funding initiatives over the past few decades have led to a proliferation of neuroscientific measures being developed in the context of mental health research. Although such measures have undoubtedly deepened our understanding of neural mechanisms underlying cognitive, affective, and behavioral processes associated with various mental health conditions, the clinical utility of such measures remains underwhelming. Recent commentaries point toward the poor reliability of neuroscientific measures to partially explain this lack of clinical translation. Here, we: (1) provide a concise theoretical overview of how unreliability impedes clinical translation of neuroscientific measures; (2) discuss how various modeling principles—including those from hierarchical and structural equation modeling frameworks—can help to improve reliability; and then (3) demonstrate how to combine principles of hierarchical and structural modeling within the generative modeling framework to achieve more reliable, generalizable measures of brain-behavior relationships for use in mental health research.

Keywords: Generative modeling, Bayesian analysis, measurement error, reliability paradox, clinical assessment, neuroscientific measures

The Need for Improved Neuroscientific Measures in Mental Health Research

Neuroscientific measures have long held promise for revolutionizing clinical practice (1). This promise remains strong today, evidenced by funding initiatives such as the U.S. National Institute of Mental Health’s Strategic Plan to “define the brain mechanisms underlying complex behaviors” (2) and U.K. Research and Innovation’s funding priorities on “neurosciences and mental health” and “systems approaches to the biosciences” (3). Despite decades of promises backed by millions of dollars in funding, neuroscientific measures still fail to meet basic validity requirements that would make them useful for diagnosis or treatment personalization in a real-world clinical context (4–7). Such failures are likely due to a variety of factors, ranging from how we specify our theories of how clinical phenomena arise to the statistical methods we use to test those theories (8,9). Here, we give attention to the latter, with a particular focus on issues of reliability and measurement error in the context of behavioral and neural data. We then discuss frameworks that have been developed to account for imperfect reliability when making inferences, including classical test theory, hierarchical modeling, and structural equation modeling. Finally, we discuss how generative models can be developed to incorporate principles from these modeling frameworks, and how such models can be used to better understand brain-behavior relationships while accounting for low reliability or measurement error. Throughout, readers can refer to Box 1 for an overview of the terminology we use. Additionally, we recommend that interested readers follow along with our analyses using the companion Rmarkdown notebook located at: https://github.com/Nathaniel-Haines/biopsych-reliability-2022.

Box 1.

Reliability — Any measurement is composed of signal (the “true score”) and noise (measurement error); reliability is the fraction of measurement variance that is not error variance. Reliability is essentially a signal-to-noise ratio. Types of reliability measures include:

  • Internal Reliability — Consistency within a measure (e.g., consistency across items within a self-report measure). Can also be thought of as how “precisely” a parameter is estimated.

  • Split-Half Reliability — Consistency within a measure, quantifying the extent to which parts of a given measure contribute to the overall measurement (e.g., consistency of performance on the first half of a task with performance on the second half, or consistency of odd- compared to even-numbered trials within a task).

  • Test-Retest Reliability — Consistency of a measure across time or repeated assessments (e.g., stability of task performance across successive administrations).

  • Inter-Rater Reliability — Consistency of measurement across individuals (e.g., consistency of scores from multiple clinicians determining a diagnosis).

Measurement Error — All remaining variance in a measure that is beyond the variance attributable to the variable of interest’s “true score”; i.e., measurement noise.

Structural Equation Model — A broad family of statistical models that leverage the covariance structure among variables to understand associations between observed variables and estimate latent (unobserved) variables that reflect the error-free shared variance among a set of predictors.

Hierarchical Model — A statistical model in which measurement units are clustered within groups hierarchically (e.g., people within a population, trials within a person, time points within a self-report construct) and variance is attributed to both the lower-level units and the higher-level group. Similar terms include: multi-level models, random-effect models, and mixed-effect models.

  • Hierarchical Shrinkage — Individual “random effect” estimates in hierarchical models are “pooled” or “regressed” toward the group-level mean. Shrinkage occurs through the same mechanism as regression-based true score estimation per classical test theory.

Generative Model — A statistical model of the joint probability distribution p(y,θ) on observed data (y) and unobserved parameters (θ). Conceptually, generative models can be thought of as probabilistic models that can simulate forward all aspects of both the observed and unobserved data/variables of interest. For example, a trial-level reinforcement learning model with specified parameters can generate/simulate data that resembles the structure of a participant’s actual choices in a decision-making task, with latent parameters, such as the learning rate, representing unobserved variables that influence behavior. Note: this definition is distinct from the neuroscience construct of “generative model” used in the context of “predictive coding”, “active inference”, and “free energy”.

Bayesian Model — A statistical model within which all uncertainty is quantified by specifying probability distributions on data and parameters of interest. Inference with Bayesian models is based on Bayes’ rule, which is written as $p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{p(y)}$. Because $p(y, \theta) = p(y \mid \theta)\,p(\theta)$, Bayesian models are all, in principle, generative models. As such, both the theory and practice of developing Bayesian models are highly aligned with the philosophy of generative modeling.

The Reliability Paradox and Brain-Behavior Relationships

The goal of clinical assessment is to permit inferences about individuals’ behavior, performance, and, more broadly, outcomes. Yet, there are critical considerations related to measurement reliability that do not receive adequate attention. For example, a number of studies have demonstrated variability in neural results, and thus effects on reliability, depending on features of the analytic pipeline (10–12); assumptions about the hemodynamic response function (13); subject motion, especially when motion is linked to stimuli (14); whether activations or connectivity paths were located in the cortex versus subcortex (15–18); and a variety of other factors (see (19) for review).

The Reliability Paradox (20) has emerged as a particularly important issue affecting reliability in clinical neuroscience—many behavioral measures that produce robust group effects paradoxically show low reliability (ρ) coefficients¹. Typically, ρ is expressed as a ratio of between- relative to within-unit variability in a measure, where ρ ranges between 0 and 1, and higher or lower values indicate that a measure does a better or worse job of differentiating units within a group. Units within groups could be people within a population, trials within a person, or units within any number of other similarly structured hierarchical datasets. In the context of the Reliability Paradox, units are individual people, and the group is some population of interest from which people are sampled (but see Zorowitz & Niv (this issue) for a more comprehensive overview of the various conceptualizations of reliability). Therefore, when behavioral task measures have low reliability, they cannot support individual difference inference. For clinical applications, such inferences include determining whether someone scores above a clinical threshold on some behavioral, neural, or self-report measure.

Since Hedge et al. (20) coined the term, the Reliability Paradox has been observed throughout the psychological and brain sciences, with ρ ≈ .4–.5 across a wide range of behavioral tasks measuring self-control, implicit attitudes, learning and decision-making, and more (21–23). Beyond behavioral measures, the Reliability Paradox is also apparent in neuroscience—a recent meta-analysis of 90 fMRI experiments showed an average task-based BOLD response ρ ≈ .4 across different brain regions and tasks (16). Resting state, task-based, and regional activation/connectivity measures show comparably low reliability (24–26), suggesting that—like behavioral tasks—many neural measures cannot be used for clinical decision-making.

The low reliability of common behavioral and neural measures has led to considerable, justified concerns regarding the use of behavioral tasks and fMRI experiments to develop and test theories of individual differences (16,27–29). The primary issue with low reliability is that, if both behavioral and neural measures are unreliable, statistical power suffers, and the sample sizes required to overcome measurement error and detect significant associations with critical outcomes must increase to compensate. For example, the sample size needed to detect a true medium effect (r = .3, with 80% power at α = .05) between two measures with perfect reliabilities ($\rho_{\text{behavior}} = \rho_{\text{neural}} = 1$) is 84. Under a more realistic scenario where $\rho_{\text{behavior}} = \rho_{\text{neural}} = .4$–$.5$, the necessary sample size increases dramatically to 346–524, which would push scanner costs (assuming $750/hour) to ≈ $259,500 in an optimistic scenario. Given that most clinical neuroscience studies have sample sizes far below these numbers, it is likely the case that many reported findings are spurious or even in the wrong direction compared to true underlying effects (30,31).
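To make this arithmetic concrete, the short R sketch below (R being the language of the companion notebook) reproduces the logic of the example: it attenuates the true correlation by the reliabilities of the two measures (formalized later in equation 4) and computes the required sample size using the standard Fisher-z approximation. Because the exact numbers depend on the power routine used, the results land near, but not exactly on, the figures quoted above.

```r
# Approximate sample sizes for detecting a correlation, before and after
# attenuation by unreliable measures. Fisher-z approximation; values are rough.
n_for_r <- function(r, power = 0.80, alpha = 0.05) {
  z_a <- qnorm(1 - alpha / 2)              # two-sided critical value
  z_b <- qnorm(power)                      # quantile for the desired power
  ceiling((z_a + z_b)^2 / atanh(r)^2 + 3)
}

r_true <- 0.3
n_for_r(r_true)                            # ~85 with perfectly reliable measures

rho   <- c(0.4, 0.5)                       # plausible reliabilities for both measures
r_obs <- r_true * sqrt(rho * rho)          # attenuated observed correlations
n_for_r(r_obs)                             # required n balloons into the hundreds
```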

A Theoretical Dive into Measurement Error and Reliability

Despite the bleak reality painted above, recent re-analyses and commentaries suggest that low reliability can be overcome. Some commentaries point to different ways of developing measures (7,32), whereas others have used alternative statistical modeling frameworks to address issues arising from the use of traditional measures that otherwise result in unwarranted attenuation of reliability (9,33–35). Throughout this paper, we focus on this latter statistical modeling perspective, providing a concise overview of the key principles along with a brief tutorial that demonstrates these principles in action.

Classical Test Theory

The Classical Test Theory (CTT) perspective views low reliability as arising primarily from measurement error. Measurement error perturbs a “true score”, resulting in the “observed score” that is quantified. In its simplest form, the CTT model follows the form:

$$y_i = \theta_i + \epsilon_{i,t} \quad (1)$$

For person $i$, $y_i$ is their observed score, $\theta_i$ is their true score, and $\epsilon_{i,t}$ is their measurement error for observation $t$ (36). Typically, we assume that true scores and measurement error are normally distributed such that $\theta_i \sim \mathcal{N}(0, \sigma_\theta^2)$ and $\epsilon_{i,t} \sim \mathcal{N}(0, \sigma_\epsilon^2)$. Here, $\sigma_\theta^2$ indicates the variance in true scores across people, and $\sigma_\epsilon^2$ indicates the variance of observed scores around the true scores within people (i.e., the measurement error), where each person is assumed to share the same within-person variance.

The true score $\theta_i$ is the quantity that we aim to make inferential statements about. Herein lies the role of reliability in individual difference inference: as early as the 1920s, Kelley showed that true scores could be estimated ($\hat{\theta}_i$) by regressing observed scores $y_i$ toward the group-level mean score ($\bar{y}$) in proportion to the reliability ($\rho$) of the observed scores (e.g., reliability of person-level measures within a group) (37,38):

$$\hat{\theta}_i = \bar{y} + \rho\,(y_i - \bar{y}) \quad (2)$$

Here, if $\rho = 0$, person-level observed scores contain no information to differentiate people. As a result, each person’s estimated true score is simply the group-level average ($\hat{\theta}_i = \bar{y}$). Otherwise, if $\rho = 1$, person-level observed scores contain no measurement error, so the observed score is equivalent to the true score ($\hat{\theta}_i = y_i$). For values of $\rho$ between 0 and 1, observed scores are regressed toward the group-level average score in proportion to reliability.

Notably, $\hat{\theta}_i$ in equation 2 is now referred to as a James-Stein estimator throughout mathematical statistics (39–42). The discovery of such estimators produced a revolution in statistics due to their counter-intuitive properties—they are mathematically proven to give an equal or better estimate, in terms of mean squared error, relative to the unbiased sample mean for each person ($y_i$)². This result is often described as Stein’s Paradox. As outlined by (42), R. A. Fisher had already proved that the sample mean $y_i$ contains all possible information that can be used to inform estimates of the true mean $\theta_i$. However, Fisher did not consider that information across multiple sample means could be used to mutually constrain estimation of each individual sample mean. Intuitively, in the absence of further information, the group-level mean is your best estimate of a person’s true score before you can measure that individual. Regression-based true score estimators (or James-Stein estimators) formalize this intuition and extend it to the case where information on the individual is present, yet unreliable (or contaminated with measurement error). As we discuss later, this regression-based estimator ties together traditional CTT and more modern hierarchical models.

To use equation 2, we must first estimate reliability. In CTT, assuming that the observed scores yi are summaries of multiple conditionally independent measurements (e.g., the mean of multiple response times), reliability (ρ) is formulated as the ratio of true score variance (or between-person variance) relative to total variance:

$$\rho = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_\epsilon^2 / n} \quad (3)$$

Here, $n$ is the number of independent measurements (e.g., trials; although in CTT, observed scores are treated as single observations, or $n = 1$), and $\sigma_\epsilon^2 / n$ is then the squared standard error of measurement (assuming shared $n$ and $\sigma_\epsilon^2$ across all $y_i$). In a response time task, $\sigma_\theta^2$ captures the variance in mean response times across people, and $\sigma_\epsilon^2$ captures the variance of response times across trials within people. Note that this particular definition of ρ is termed average score reliability, also referred to as ICC(k) or ICC(1,k) as described by (43) and (44), respectively. A consequence of this definition is that reliability necessarily decreases as within-person variance ($\sigma_\epsilon^2$) increases, $n$ decreases, or between-person variance ($\sigma_\theta^2$) decreases.
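The brief R sketch below ties equations 1 through 3 together: it simulates noisy trial-level observations around known true scores, computes reliability from the variance components, and applies Kelley’s regression-based estimator. For simplicity the variance components are treated as known; in real data they would need to be estimated (e.g., from an ANOVA decomposition or a hierarchical model).

```r
# Equations 1-3 in miniature: noisy observations, reliability, and regression
# of observed scores toward the group mean. Variances are known here by construction.
set.seed(1)
n_person    <- 50
n_trial     <- 20
sigma_theta <- 1    # between-person (true score) SD
sigma_eps   <- 3    # within-person (trial-level) SD

theta <- rnorm(n_person, 0, sigma_theta)                        # true scores
y     <- replicate(n_trial, theta + rnorm(n_person, 0, sigma_eps))
y_bar <- rowMeans(y)                                            # observed scores

rho       <- sigma_theta^2 / (sigma_theta^2 + sigma_eps^2 / n_trial)   # equation 3
theta_hat <- mean(y_bar) + rho * (y_bar - mean(y_bar))                 # equation 2

mean((theta_hat - theta)^2)   # shrunken estimates: lower mean squared error...
mean((y_bar - theta)^2)       # ...than the raw person-level means (cf. footnote 2)
```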

In addition to being useful for understanding the relationship between observed and true scores, the concept of reliability in CTT can be used to better estimate “true correlations” given observed correlations. For example, in an fMRI study, we may collect behavioral response times in addition to ROI activations across participants and trials. We may then model the relationship between the vector of average response times across people ($y$) and the vector of average ROI activations across people ($x$) to determine if the ROI relates to the behavioral process underlying response times. In doing so, the true correlation of interest ($r_{\theta_y,\theta_x}$) is attenuated by the (un)reliability of both measures ($\rho_y$ and $\rho_x$), resulting in the following observed correlation:

$$r_{y,x} = r_{\theta_y,\theta_x} \sqrt{\rho_y\,\rho_x} \quad (4)$$

Because there is often high variability in both behavioral response times and neural ROI measures across trials, $\sigma_{\epsilon_y}^2$ and $\sigma_{\epsilon_x}^2$ are often high relative to between-person variability in either measure ($\sigma_{\theta_y}^2$ and $\sigma_{\theta_x}^2$). This results in low reliability for both measures ($\sqrt{\rho_y\,\rho_x} \ll 1$), in the end producing a low observed correlation ($r_{y,x}$) even if the true correlation ($r_{\theta_y,\theta_x}$) is quite high. Correcting an observed correlation for measurement error requires only a rearrangement of equation 4:

$$r_{\theta_y,\theta_x} = \frac{r_{y,x}}{\sqrt{\rho_y\,\rho_x}} \quad (5)$$

Above, if reliability is perfect for both measures ($\sqrt{\rho_y\,\rho_x} = 1$), then the observed and true correlations are equivalent. Otherwise, to the extent that $\sqrt{\rho_y\,\rho_x} \to 0$, the observed correlation is increasingly adjusted toward the true correlation. However, this correction (or disattenuation for unreliability) must be treated with caution—if reliability is not estimated appropriately, estimates of the true correlation are biased and potentially invalid as a result (44,45).
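A small simulation makes equations 4 and 5 tangible: a strong true correlation between two latent variables is attenuated once each is measured with trial-level noise, and dividing by the square root of the product of the reliabilities approximately recovers it. The variance values below are arbitrary choices for illustration.

```r
# Attenuation (equation 4) and disattenuation (equation 5) of a correlation.
set.seed(2)
n_person <- 500
n_trial  <- 30
r_true   <- 0.8

Sigma <- matrix(c(1, r_true, r_true, 1), nrow = 2)
theta <- MASS::mvrnorm(n_person, mu = c(0, 0), Sigma = Sigma)  # correlated true scores

noisy_mean <- function(true_score, sigma_eps) {
  rowMeans(replicate(n_trial, true_score + rnorm(n_person, 0, sigma_eps)))
}
y_bar <- noisy_mean(theta[, 1], sigma_eps = 3)   # e.g., mean response times
x_bar <- noisy_mean(theta[, 2], sigma_eps = 3)   # e.g., mean ROI activations

rho_y <- 1 / (1 + 3^2 / n_trial)   # equation 3, with true-score variance fixed at 1
rho_x <- rho_y

cor(y_bar, x_bar)                            # attenuated observed correlation
cor(y_bar, x_bar) / sqrt(rho_y * rho_x)      # disattenuated estimate, near r_true
```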

Hierarchical and Structural Latent Variable Models

It is well known that correlations estimated between latent variables are typically greater in magnitude than the same correlations estimated using summed scores of the respective measures—why is this the case? Typically, latent variables are discussed from the perspective of structural equation models (SEMs), wherein latent variables reflect common variance in manifest indicators (i.e. observed data). As a result, latent variables are considered “error free”, or as having perfect reliability, because averaging across indicator variables arrives at a more precise estimate of the “true” (i.e. latent) variable. Formally, we can understand the properties of latent variables by focusing on hierarchical models as a specific case (see Box 1), noting that hierarchical models can be viewed as a form of latent variable model within the SEM framework (46–48). Hierarchical models are models in which measurement units are clustered within groups hierarchically (e.g., people within a population), with variance attributed to both the lower-level units and the higher-level group. Hierarchical models exert a pooling or shrinkage (a.k.a. regularization or regression to the mean) effect on person-level parameters (which are latent variables). For example, in a hierarchical model where observed response times are assumed to be normally distributed within persons $i$ with mean $\theta_i$ and standard deviation $\sigma_\epsilon$, and where variation between persons is also assumed to follow a normal distribution with mean $\mu_\theta$ and standard deviation $\sigma_\theta$, the shrinkage factor for each person $i$ is given by (49):

$$\rho_i = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_\epsilon^2 / n_i} \quad (6)$$

Here, $\sigma_\epsilon^2 / n_i$ is the squared standard error of the mean within person $i$, which depends on the variance in their response times across trials and the number of trials $n_i$ that they underwent in the task. Hierarchical pooling of each person’s observed sample mean response time ($y_i$) toward the group-level mean ($\mu$) then follows the equation:

$$\hat{\theta}_i = \mu + \rho_i\,(y_i - \mu) \quad (7)$$

where $\hat{\theta}_i$ is the pooled person-level estimate.

Note that equations 6–7 are identical to equations 2–3—hierarchical pooling occurs through the same mechanism as regression-based true score estimation (50). As a result, we can view the “random effects” (or person-level estimates) estimated from hierarchical models as “true score estimates”, which are corrected for unreliability. The correspondence also extends to individual difference correlations. Akin to how the reliability of two measures can be used to disattenuate their observed correlation, models that directly estimate the correlation between latent person-level parameters result in individual difference correlations that are corrected for unreliability (9,33–35). The take-away is that CTT and hierarchical/latent variable models can account for measurement error in the same fundamental way, even to the point of producing identical results when the models are precisely specified. Figure 1 illustrates the effect of disattenuation (or hierarchical pooling).
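As one concrete illustration of this equivalence, the sketch below fits a simple hierarchical (random-intercept) model to simulated trial-level response times with lme4 and compares the pooled person-level estimates to the unpooled sample means; a Bayesian fit in brms or Stan behaves analogously. All variable names and parameter values are illustrative.

```r
# Hierarchical pooling in practice: random intercepts shrink person-level means
# toward the group mean and, on average, land closer to the true scores.
library(lme4)

set.seed(3)
n_person <- 40
n_trial  <- 25
theta <- rnorm(n_person, mean = 600, sd = 50)   # true mean response times (ms)
d <- data.frame(
  id = factor(rep(seq_len(n_person), each = n_trial)),
  rt = rep(theta, each = n_trial) + rnorm(n_person * n_trial, 0, 150)
)

fit <- lmer(rt ~ 1 + (1 | id), data = d)        # one random intercept per person
pooled   <- coef(fit)$id[, "(Intercept)"]       # shrunken ("true score") estimates
unpooled <- tapply(d$rt, d$id, mean)            # per-person sample means

sd(pooled) < sd(unpooled)                              # pooled estimates are less spread out
mean((pooled - theta)^2) < mean((unpooled - theta)^2)  # and typically closer to the truth
```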

Figure 1. Hierarchical pooling (or disattenuation) in behavioral data.


The same mathematics underlie both reliability-based disattenuation and the pooling in hierarchical models. Here we demonstrate hierarchical pooling in the context of delay discounting rates (k) estimated for the same people across multiple sessions (a test-retest design). Gray dots indicate discounting rate estimates using maximum likelihood applied separately to each person, whereas red dots indicate the same estimates derived from a generative model with a multivariate hierarchical structure. Lines connect estimates that belong to the same person, revealing the effect that pooling has on each person. Notice that pooling occurs within each dimension simultaneously, producing generative model estimates with an overall higher correlation than the corresponding maximum likelihood estimates. Figure reprinted (with permission) from (9).

Generative Modeling as a General Framework

Given that classical and latent variable modeling frameworks can both be used to account for reliability, choosing the best one comes down to a question of flexibility. In the context of mental health research, models can often become quite complex. For example, we may use item response theory to estimate latent traits from a self-report, a reinforcement learning model to estimate parameters governing learning dynamics in a behavioral task, or a model of the hemodynamic response function in an fMRI experiment to estimate the effect of task stimuli on neural activity. Further, we may want to combine these models in some way, facilitating estimation of correlations between self-report, behavioral, and neural model parameters. CTT and classical structural equation modeling frameworks simply lack the flexibility to fully specify and estimate parameters from complex models with data spanning multiple modalities (e.g., neural, behavioral, self-report), structures (e.g., two-level, three-level, etc. hierarchies), and timescales (e.g., seconds, trials, days).

The generative modeling framework is well suited to tackle complex models that incorporate multiple data structures (including hierarchical) from multiple different modalities and timescales (9,34,51,52). Formally, we define a generative model as a statistical model of the joint probability distribution $p(y, \theta)$ on observed data ($y$) and unobserved parameters ($\theta$). Conceptually, generative models can be thought of as probabilistic models that can simulate forward all aspects of both the observed and unobserved data/variables of interest. Once a generative model is fully specified, parameters can be estimated using either classical or Bayesian methods. However, Bayesian methods are often preferred for both practical and theoretical reasons. Practically, modern probabilistic programming languages (e.g., Stan, PyMC, JAGS) make specifying and fitting generative models to data relatively simple. Theoretically, Bayesian analysis aligns strongly with the philosophy of generative modeling—we first specify a joint probability distribution that captures our knowledge of the data generating process and our uncertainty in key parameters of interest (i.e. a generative model), and then we use probability theory (i.e. Bayes’ rule) to condition our model on observed data and update our representations of uncertainty. Although this does require careful attention to the assumed data-generating process (53,54), the benefit is that Bayesian analysis offers a unified approach for estimating parameters of arbitrarily complex generative models that represent the substantive theories we aim to test while simultaneously accounting for unreliability or measurement error (55,56).
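To give a flavor of what this looks like in code, the snippet below specifies the simple hierarchical response-time model from the previous section as a Bayesian generative model using brms (which writes and compiles the underlying Stan program). The priors are illustrative placeholders rather than recommendations; richer joint models, like the one developed below, are typically written directly in a probabilistic programming language.

```r
# A minimal Bayesian generative model in brms: trial-level response times with
# hierarchically pooled person-level means. Priors are illustrative only.
library(brms)

set.seed(4)
n_person <- 40
n_trial  <- 25
d <- data.frame(
  id = factor(rep(seq_len(n_person), each = n_trial)),
  rt = rep(rnorm(n_person, 600, 50), each = n_trial) + rnorm(n_person * n_trial, 0, 150)
)

fit_gen <- brm(
  rt ~ 1 + (1 | id),            # person-level intercepts drawn from a group distribution
  data   = d,
  family = gaussian(),
  prior  = c(
    prior(normal(600, 200), class = Intercept),
    prior(exponential(0.01), class = sd),      # between-person SD
    prior(exponential(0.01), class = sigma)    # within-person (trial-level) SD
  ),
  chains = 4, cores = 4, seed = 4
)

coef(fit_gen)$id[, , "Intercept"]   # posterior person-level means, uncertainty retained
```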

A Primer on Generative Brain–Behavior Modeling

To demonstrate the benefits of generative modeling, we will walk through a conceptual tutorial of joint brain-behavior modeling. Because reinforcement learning (RL) models have played such a large role in understanding a diverse range of mental health conditions (57,58), we focus our example on how to infer the relationship between trial-level prediction errors derived from a reinforcement learning model (for primers on RL models, see (59–61)) and corresponding BOLD signaling in a region of interest. We organized the following sections to: (a) detail the contrived experimental design; (b) describe the assumed behavioral and neural data-generating models, in addition to how they are linked together in the case of the generative model; (c) show how common analytical approaches can be viewed as special cases of the full generative model—cases that make unrealistic assumptions regarding measurement error; and end with (d) simulations to show how the generative approach compares to an approach that ignores reliability.

The Experimental Design

Throughout, we base our analyses on a simple event-related fMRI design wherein people undergo a probabilistic learning task. The task is a two-armed bandit where the true expected value of each option drifts across trials to motivate continued learning. Specifically, outcomes from each option follow a normal distribution with means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1 = \sigma_2 = 0.3$. Each option starts on trial 1 with $\mu_1 = \mu_2 = 0$. After an outcome is sampled on a given trial, both means are independently updated according to $\mu_{1:2} \leftarrow \mu_{1:2}\,\gamma + \epsilon_{t,1:2}$, where $\gamma = 0.99$ and $\epsilon_{t,1:2} \sim \mathcal{N}(0, 0.1)$. A high value for $\gamma$ enforces a high autocorrelation between successive values of $\mu$, which makes it possible for people to develop a preference for one option over another as $\mu_1$ and $\mu_2$ temporarily drift apart. As depicted in Figure 2, we assume that the outcome presentation phase in the task for trial $t$ is jittered such that it occurs $s_t$ seconds after the previous trial’s ($t-1$) outcome presentation, where $s_t \sim \log\mathcal{N}(2, 0.2)$. Finally, we assume that brain volumes are measured by the fMRI scanner once every 2 seconds (TR = 2 s). For simplicity, we focus purely on the stimulus outcome—but not choice—presentation phase of the task.
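The R sketch below simulates this environment as described. The lognormal jitter is parameterized here so that the median spacing between outcome onsets is roughly 2 seconds, matching the description in Figure 2; the exact parameterization used in the companion notebook may differ.

```r
# Simulated task environment: two drifting option means, noisy outcomes,
# and jittered outcome-onset times.
set.seed(5)
n_trial    <- 200
gamma      <- 0.99
outcome_sd <- 0.3

mu <- matrix(0, nrow = n_trial, ncol = 2)    # true expected values; both start at 0
for (t in 2:n_trial) {
  mu[t, ] <- mu[t - 1, ] * gamma + rnorm(2, 0, 0.1)   # independent drift per option
}

draw_outcome <- function(t, option) rnorm(1, mu[t, option], outcome_sd)

# Outcome-onset jitter; a median spacing of ~2 s is an assumption made for this sketch
s      <- rlnorm(n_trial, meanlog = log(2), sdlog = 0.2)
onsets <- cumsum(s)    # outcome presentation times, used later by the neural model
```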

Figure 2. Schematic of the experimental task.


We designed the task used throughout our simulations such that the outcome presentation for each trial $t$ occurred $s_t$ seconds after the previous trial’s outcome presentation. To simplify the task for the purpose of demonstration, we assumed that $s_t \sim \log\mathcal{N}(2, 0.2)$ such that the median time between outcome onsets was 2 seconds, with some variability from trial to trial. This design abstracts away the specific timings of each of the various task phases (e.g., fixation jitter time, choice time, and outcome presentation time), yet it retains the core elements necessary for simulations.

The Behavioral Model

We assume learning dynamics governed by delta-rule learning with learning rate $\alpha$ ($0 < \alpha < 1$):

$$V_t = V_{t-1} + \alpha\,PE_t, \qquad PE_t = (R_t - V_{t-1}) \quad (8)$$

where $V_t$ is the expected value on trial $t$, and $PE_t = R_t - V_{t-1}$ is the corresponding prediction error between the actual reward $R_t$ and the expected value. Note that $PE_t$ is the quantity that we aim to associate with corresponding neural activity.

Next, we assume that choice probabilities follow a softmax (i.e. multinomial logistic) choice function with inverse temperature $\xi$ ($0 < \xi < \infty$):

$$p_{t,c} = \frac{\exp(V_{t,c}\,\xi)}{\sum_{k=1}^{2} \exp(V_{t,k}\,\xi)} \quad (9)$$

where $p_{t,c}$ is the probability of choosing option $c$ on trial $t$. These choice probabilities then give rise to actual choices per a categorical (or multinomial) distribution:

$$y_{t,B} \sim \text{Categorical}(p_{t,1},\, p_{t,2}) \quad (10)$$

Here, $y_{t,B}$ indicates the observed choice for the respective person and trial, where $B$ indicates that this is the behavioral (as opposed to neural) data. Figure 3 demonstrates how expected value and choice probability are represented by the behavior of the model on a trial-by-trial basis.
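Putting equations 8 through 10 together, the sketch below simulates a delta-rule learner with a softmax choice rule interacting with the drifting bandit from the previous sketch. Function and variable names are our own.

```r
# Simulate choices, rewards, and latent prediction errors (equations 8-10).
set.seed(6)
n_trial <- 200
mu <- matrix(0, nrow = n_trial, ncol = 2)
for (t in 2:n_trial) mu[t, ] <- mu[t - 1, ] * 0.99 + rnorm(2, 0, 0.1)

simulate_agent <- function(mu, alpha = 0.2, xi = 1, outcome_sd = 0.3) {
  n_trial <- nrow(mu)
  V <- c(0, 0)                                    # learned expected values
  choice <- reward <- pe <- numeric(n_trial)
  for (t in seq_len(n_trial)) {
    p <- exp(V * xi) / sum(exp(V * xi))           # softmax probabilities (equation 9)
    choice[t] <- sample(1:2, size = 1, prob = p)  # categorical choice (equation 10)
    reward[t] <- rnorm(1, mu[t, choice[t]], outcome_sd)
    pe[t]     <- reward[t] - V[choice[t]]         # prediction error (equation 8)
    V[choice[t]] <- V[choice[t]] + alpha * pe[t]  # delta-rule update (equation 8)
  }
  data.frame(trial = seq_len(n_trial), choice, reward, pe)
}

behav <- simulate_agent(mu, alpha = 0.2, xi = 1)  # a "Low Learning Rate" agent
head(behav)
```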

Figure 3. Behavioral model learning dynamics.


The top row shows changes in the learned expected value for each option as choices are simulated from the behavioral model described by equations 8–10 (where α = .2 in the Low Learning Rate panel, α = .8 in the High Learning Rate panel, and ξ = 1 in both cases). The thick transparent lines indicate the true expected value for each option, which drifts with normally distributed noise across trials (note that the trajectory is identical across panels). Points on lines indicate that the given option was chosen and produced a positive prediction error on the given trial, resulting in an increase in learned expected value for the next trial. The bottom row shows how the expected values in the top row translate to choice probabilities. Together, the panels show that the simulated agent learns a noisy representation of the true expected value for each option, which is then passed through the choice rule to produce a slight preference for one option over the other dependent on past experience.

The Neural Model

Because the neural measure in our example is BOLD signaling measured throughout an fMRI scan, the neural model itself takes on the form of a hemodynamic response function (HRF). A detailed walkthrough on the properties of HRFs is outside the scope of the current tutorial (see (62) for details). For our purposes, we will assume that neural responses follow the canonical HRF—a double-gamma function that follows the form:

$$h(s) = \beta\,h_0(s) = \beta\left[\frac{s^{\alpha_1 - 1}\,b_1^{\alpha_1}\,\exp(-b_1 s)}{(\alpha_1 - 1)!} - c\,\frac{s^{\alpha_2 - 1}\,b_2^{\alpha_2}\,\exp(-b_2 s)}{(\alpha_2 - 1)!}\right] \quad (11)$$

Above, $h(s)$ is the value of the HRF at time $s$ (where $s$ is continuous time in seconds), and $h_0(s)$ can be thought of as the basis function of the HRF, the shape of which is governed by a set of pre-determined parameters as described in SPM12: $\alpha_1 = 6$, $\alpha_2 = 16$, $b_1 = b_2 = 1$, and $c = 1/6$ (63). The actual value of the HRF at time $s$ is determined by scaling this basis function by $\beta$, a parameter estimated directly from neural data. Thus, the parameter of interest in equation 11 is $\beta$, which represents the amplitude of the basis HRF in response to the behavioral stimuli assumed to elicit a neural response. We relegate details on the link between behavioral and neural model parameters to a later section.

The HRF described in equation 11 only captures the BOLD response to a single stimulus. To capture the temporal dynamics of the BOLD response over the course of an experiment, individual HRFs specified per equation 11 are assumed to follow in response to each stimulus outcome presentation. If we assume that BOLD responses are themselves time invariant, we can simply shift the HRF in equation 11 to start from each stimulus onset time, and then sum up the heights across HRFs at each timepoint to determine the expected BOLD response. Mathematically, this shifting is done by convolving the HRF with another function positioned at stimulus onset times (e.g., a boxcar or impulse function) (62). The summation across convolved HRFs is then conventionally done using matrix algebra, which provides a convenient way to separate out (and thereby estimate) the neural activation parameters $\beta$. Specifically, we can define a design matrix of convolved HRFs where rows represent trials $t = 1, 2, 3, \ldots, T$ and columns represent TRs $s = 0, 2, 4, \ldots, S$:

$$X = \begin{bmatrix} h_{0,1}(0) & \cdots & h_{0,1}(S) \\ \vdots & \ddots & \vdots \\ h_{0,T}(0) & \cdots & h_{0,T}(S) \end{bmatrix} \quad (12)$$

Here, $h_{0,1}(0)$ is the value of the convolved basis HRF for the first trial ($t = 1$) when the first brain volume is taken at the start of the experiment ($s = 0$). Each row is therefore a timeseries of the convolved basis HRF for a given trial over the entire course of the experiment. Then, we can define a row vector of neural activation parameters as:

$$\boldsymbol{\beta} = \begin{bmatrix} \beta_1 & \beta_2 & \cdots & \beta_T \end{bmatrix} \quad (13)$$

where $\beta_t$ captures neural activity in response to the stimulus outcome presentation on trial $t$. The scaling and summation of convolved basis HRFs can then be done simultaneously with:

$$\mu_{\text{Neural}}(s) = \boldsymbol{\beta} X \quad (14)$$

where $\mu_{\text{Neural}}(s)$ is the expected value of the BOLD signal at time $s$. Note that the assumed distribution of $\boldsymbol{\beta}$ will become important for linking behavioral and neural models together later. In isolation, however, we can simply assume that weights follow a normal distribution such that $\beta_t \sim \mathcal{N}(\mu_\beta, \sigma_\beta)$.

Notably, we do not observe $\mu_{\text{Neural}}(s)$ directly. Instead, the observed BOLD signal contains measurement error resulting from a variety of factors. For example, the raw timeseries of an ROI or individual voxel is only obtained after the whole brain image has been pre-processed. Such pre-processing pipelines undoubtedly introduce measurement error. Here, we assume that such measurement error is normally distributed as follows:

$$y_{\text{Neural}}(s) \sim \mathcal{N}\big(\mu_{\text{Neural}}(s),\, \sigma_{\text{Neural}}\big) \quad (15)$$

where $y_{\text{Neural}}(s)$ is the observed BOLD signal at time $s$, and $\sigma_{\text{Neural}}$ captures all sources of measurement error that result in mis-fit between the predicted and observed BOLD signal. Figure 4 depicts HRFs and the resulting observed BOLD signal simulated from the neural model described by equations 11–15.
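The R sketch below pieces equations 11 through 15 together: it evaluates the canonical double-gamma basis function, shifts a copy to each outcome onset to build the design matrix, scales the rows by single-trial amplitudes, and adds measurement noise. Parameter values follow the SPM defaults given above; the onset times and noise level are illustrative.

```r
# Canonical HRF, trial-wise design matrix, and noisy BOLD timeseries (equations 11-15).
hrf_basis <- function(s, a1 = 6, a2 = 16, b1 = 1, b2 = 1, c = 1 / 6) {
  # double-gamma basis h0(s); gamma(a) equals (a - 1)! for integer a
  s^(a1 - 1) * b1^a1 * exp(-b1 * s) / gamma(a1) -
    c * s^(a2 - 1) * b2^a2 * exp(-b2 * s) / gamma(a2)
}

set.seed(7)
TR     <- 2
scan_s <- seq(0, 400, by = TR)                                 # volume acquisition times
onsets <- cumsum(rlnorm(100, meanlog = log(2), sdlog = 0.2))   # outcome onsets (assumed)

# Design matrix X: one row per trial, one column per TR. Shifting the basis function
# to each onset is equivalent to convolving it with an impulse at that onset.
X <- t(sapply(onsets, function(o) hrf_basis(pmax(scan_s - o, 0))))

beta      <- rnorm(length(onsets), mean = 1, sd = 0.2)   # single-trial amplitudes
mu_neural <- as.vector(beta %*% X)                       # expected BOLD signal (equation 14)
y_neural  <- rnorm(length(mu_neural), mu_neural, 0.5)    # observed BOLD signal (equation 15)
```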

Figure 4. Shifting and summing up individual HRFs.


Individual HRFs are represented by dotted lines, each with a corresponding onset marked by the point at the start of the HRF. The solid line indicates the total BOLD signal as determined by equations 11–15. To show how variations in HRF amplitude—which capture stimulus effects—translate to variation in the observed BOLD signal, we assumed that $\beta_t \sim \mathcal{N}(1, .2)$. For visualization purposes, we set $\sigma_{\text{Neural}} = 0$. HRFs themselves do not appear smooth because we simulated them with a 2 second TR, capturing how they may be observed in an actual fMRI experiment.

Linking Behavioral and Neural Models

In the context of fMRI designs, behavioral and neural models can be combined in a variety of different ways dependent on our theory of the data-generating process (64). In our running example, we assume that behavioral task prediction errors directly give rise to the BOLD signals observed in response to outcome presentations throughout the experiment. From an inferential perspective, we could estimate the relationship between trial-level behavioral model prediction errors and corresponding single-trial neural β weights in three different ways for an individual person:

  1. Post-hoc model: Fit the models separately, extract trial-level point-estimates of prediction errors (PE^) and β weights (β^), and then fit a regression model predicting β^ with PE^ (the hat “^” denotes an estimate);

  2. Sequential model: Fit the behavioral model, extract trial-level point-estimates of prediction errors (PE^), and then enter PE^ as single-trial regressors in the neural model. This model is sometimes referred to as the “model-based regressor approach” throughout model-based neuroscience; or

  3. Generative model: Fit the behavioral and neural models simultaneously, where latent trial-level prediction errors are used as single-trial regressors in the neural model.

As we describe in the following paragraphs, the generative model is the only one that characterizes all assumed sources of uncertainty, thereby producing a better estimate of the relationship between trial-level prediction errors and corresponding BOLD signaling.

The generative model is specified by modifying the distribution on β as follows:

$$\mu_{\boldsymbol{\beta}} = \beta_0 + \beta_{PE}\,PE, \qquad \boldsymbol{\beta} \sim \mathcal{N}(\mu_{\boldsymbol{\beta}},\, \sigma_\beta) \quad (16)$$

Here, $\boldsymbol{\beta}$ is the vector of trial-level neural beta weights as described in equations 13–14, $PE$ is a vector of latent prediction errors as defined by equation 8, $\beta_0$ is an intercept term capturing background neural activation, $\beta_{PE}$ is a slope parameter that captures how strongly the latent prediction errors relate to BOLD responses, and $\sigma_\beta$ is an error term capturing the variation across $\boldsymbol{\beta}$ conditional on $PE$. Equation 16 can be interpreted in many ways—as a structural model between latent prediction errors and beta weights, a hierarchical model on beta weights with prediction errors as a group-level predictor, or a prior distribution on beta weights centered on latent prediction errors.
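To make the linking concrete, the sketch below writes out the joint log-likelihood that the generative model implies: a single function scores both the observed choices and the observed BOLD timeseries under shared latent prediction errors, so uncertainty about the learning rate propagates into the neural level. For brevity the sketch fixes σβ = 0 (making β a deterministic function of the prediction errors); the full generative model treats β as latent with its own distribution and is typically written in a probabilistic programming language such as Stan. All parameter names below are our own.

```r
# Joint (generative) log-likelihood implied by equation 16, with sigma_beta fixed at 0.
# choices/rewards come from the behavioral sketch; y_neural and X come from the neural
# sketch (and must describe the same trials).
joint_loglik <- function(pars, choices, rewards, y_neural, X) {
  alpha        <- plogis(pars["alpha_raw"])   # learning rate constrained to (0, 1)
  xi           <- exp(pars["xi_raw"])         # inverse temperature constrained to (0, Inf)
  b0           <- pars["b0"]
  b_pe         <- pars["b_pe"]
  sigma_neural <- exp(pars["log_sigma_neural"])

  n_trial <- length(choices)
  V  <- c(0, 0)
  pe <- numeric(n_trial)
  ll_behavior <- 0
  for (t in seq_len(n_trial)) {
    p <- exp(V * xi) / sum(exp(V * xi))             # softmax (equation 9)
    ll_behavior <- ll_behavior + log(p[choices[t]]) # categorical likelihood (equation 10)
    pe[t] <- rewards[t] - V[choices[t]]             # latent prediction error (equation 8)
    V[choices[t]] <- V[choices[t]] + alpha * pe[t]  # delta-rule update (equation 8)
  }

  beta      <- b0 + b_pe * pe                       # equation 16 with sigma_beta = 0
  mu_neural <- as.vector(beta %*% X)                # equation 14
  ll_neural <- sum(dnorm(y_neural, mu_neural, sigma_neural, log = TRUE))  # equation 15

  ll_behavior + ll_neural
}
```

Handing this function to optim() would give joint maximum-likelihood estimates; re-expressing the same structure in Stan or PyMC yields the fully Bayesian treatment described above, with priors on all parameters and posterior uncertainty carried through to the brain-behavior association.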

We can view both the post-hoc and sequential models as special cases of the generative model. Specifically, the sequential model is equivalent to the generative model under the assumption that prediction errors ($PE$) are estimated with perfect precision (i.e. without error), which can only be exactly true if the behavioral model parameters governing learning (just $\alpha$ in our example) are themselves estimated without error. Similarly, the post-hoc model assumes that both the behavioral and neural parameters ($PE$ and $\boldsymbol{\beta}$) are estimated with perfect precision, which only happens when all parameters contributing to uncertainty in learning or the BOLD response in either model ($\alpha, \beta_0, \beta_{PE}, \boldsymbol{\beta}, \sigma_\beta$) are estimated without error. Any imprecision in $PE$ or $\boldsymbol{\beta}$ is then propagated to $\sigma_{\text{Neural}}$, which has the effect of attenuating the true correlation between $PE$ and $\boldsymbol{\beta}$. Figure 5 demonstrates this attenuation with simulated data by comparing the recovered correlations between the post-hoc, sequential, and generative models. Despite the true correlation being 1 across all simulations, increases in uncertainty ($\sigma_{\text{Neural}}$) produce attenuated, overconfident estimates for the post-hoc model, but not the sequential or generative models, when learning rates are generally low. However, when learning rates are high and uncertainty ($\sigma_{\text{Neural}}$) is relatively low, the sequential model underestimates the true correlation. Overall, the sequential model performs comparably to the generative model in many cases—this is good news for previous research linking prediction errors to BOLD responses, much of which has used the sequential approach. The reason is quite nuanced, stemming from the phenomenon described by (65), wherein the trial-level prediction error signal can be estimated well even when the learning rate is estimated with high error. We return to this point in the discussion.

Figure 5. Recovered brain-behavior correlation.


Recovered correlation between behavioral model trial-level prediction errors and neural model single-trial β weights for the post-hoc, sequential, and generative models. We simulated data by varying $\sigma_{\text{Neural}}$ across a grid from 0 to 1. We held the behavioral (α = .2, ξ = 1 for Low Learning Rate; α = .8, ξ = 1 for High Learning Rate) and neural ($\beta_0 = 0$, $\beta_{PE} = 1$, $\sigma_\beta = 0$) model parameters constant to isolate the effects of $\sigma_{\text{Neural}}$. These particular parameter settings imply a perfect correlation (r = 1) between prediction errors and neural beta weights. After estimating parameters for each approach, we converted the regression model parameters to correlations with the formula: $r = \beta_{PE}\,\frac{SD(PE)}{SD(\beta)}$. Here, $SD(\beta)$ and $SD(PE)$ indicate the standard deviations of the β weights and prediction errors across trials. This formula is iterated over each posterior sample, preserving the full posterior distribution of correlations. Lines then indicate posterior means and intervals indicate 95% highest density intervals. Because we use the empirical standard deviations to convert parameters, the resulting correlation shows a small amount of error, which is why the intervals suggest that r > 1 in some cases.

Discussion

Reliability is a factor that impacts statistical inference and is a challenge in all domains of psychological and psychiatric research. In the practice of cross-modality research, common approaches to examining associations compound measurement unreliability while presuming that the units of analysis are perfectly measured. Through theoretical arguments and empirical simulations, we demonstrate how brain-behavior associations can be better estimated using fully generative models rather than models that rely on summary estimates of key quantities of interest. Further, we show how the assumptions underlying standard practices (e.g., the model-based regressor or sequential modeling approach) can be better understood within the generative modeling framework. Future theoretical work might build on the current models by considering variance due to researcher analytic decisions (10–12), individual differences in learning strategies (66), and differences in theoretical accounts of how neural activity relates to behavioral performance (66).

In the context of mental health research, the use of generative models holds potential to greatly improve measurement—both in the form of more precision and in the development of statistical models that better capture our theories of core neurocognitive phenomena (9,33,34,52,64). Our results show that, at minimum, researchers interested in developing neuroscientific measures based on prediction errors for mental health applications should use sequential (a.k.a. model-based regressor) over post-hoc models. However, there are cases wherein even the sequential model underperforms the more general generative model—with high learning rates, in a high neural signal-to-noise context, sequential model estimates of brain-behavior correlations show a bias toward 0. Given that clinical groups and healthy controls often show differences in learning rates (67), and may also show differences in BOLD signal variability (68), generative models may produce more consistent findings than sequential models when investigating clinical phenomena related to prediction errors. However, the extent to which these performance similarities generalize beyond the simple learning paradigms in the current study remains unknown. In fact, prediction errors may be a special case—they can be estimated well irrespective of learning rate estimation error given a relatively simple task design (65), which explains why sequential and generative models performed similarly in our simulations. Future work should systematically investigate how these results generalize across the learning task designs and associated models used throughout clinical neuroscience and computational psychiatry.

For researchers interested in digging deeper, the generative modeling philosophy and associated framework have continued to become more widely accessible throughout the past decade. For example, we now have: (a) introductory and advanced (69,70) statistics textbooks detailing the approach; (b) software packages that allow simplified linear model syntax-based specification of custom models (71), in addition to those that offer total flexibility through the use of probabilistic programming languages (72); and (c) open-source online courses designed to cover both general generative modeling principles (73) and computational cognitive models relevant for clinical science and computational psychiatry (74). For a more comprehensive tutorial on applying generative models to fMRI data, we highly recommend (62), which includes the practical R code and fMRI pre-processing details required to analyze real-world data. We anticipate that generative modeling principles will be increasingly incorporated into graduate programs and research communities, aiding in the translation of clinical science to real-world clinical practice.

Acknowledgements

I (NH) would like to thank Quentin Huys and Martin Paulus for organizing and co-editing this important special issue, and for including me as a part of it.

Footnotes

Disclosures

The authors declare no conflicts of interest. I (NH) carried out this work independently of any work at Aivo Health.

Supplemental Information

An R notebook with all code required to run our analyses and reproduce our figures can be found here: https://github.com/Nathaniel-Haines/biopsych-reliability-2022. We recommend that readers walk through the notebook alongside the main text to best understand the various concepts discussed throughout.

1

A measure itself is defined as any quantity being used to make inference, ranging from summed scores on self-report scales to parameters extracted from some model of BOLD response during an fMRI task.

2

For example, if $\theta_i$ indicates the true mean response time for person $i$, then the mean squared error between all pooled James-Stein estimates $\hat{\theta}_i$ and true values $\theta_i$ will always be equal to or less than the mean squared error between person-level sample means $y_i$ and true values $\theta_i$. Mathematically, $\frac{1}{n}\sum_{i=1}^{n}(\hat{\theta}_i - \theta_i)^2 \leq \frac{1}{n}\sum_{i=1}^{n}(y_i - \theta_i)^2$, where $n > 2$ is the number of people.

Contributor Information

Nathaniel Haines, Aivo Health, Department of Data Science

Holly Sullivan-Toole, Temple University, Department of Psychology

Thomas Olino, Temple University, Department of Psychology

References

  • 1.Insel TR, Quirion R (2005): Psychiatry as a clinical neuroscience discipline. JAMA 294:2221–2224. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.U.S. Department of Health and Human Services, National Institutes of Health, National Institute of Mental Health (n.d.): NIMH Strategic Plan for Research. Retrieved from https://www.nimh.nih.gov/sites/default/files/documents/about/strategic-planning-reports/NIMH%20Strategic%20Plan%20for%20Research_2022_0.pdf
  • 3.U.K. Research and Innovation (n.d.): U.K. Research and Innovation. Retrieved from https://www.ukri.org/what-we-offer/browse-our-areas-of-investment-and-support/ [Google Scholar]
  • 4.Winter NR, Leenings R, Ernsting J, Sarink K, Fisch L, Emden D, et al. (2022): Quantifying Deviations of Brain Structure and Function in Major Depressive Disorder Across Neuroimaging Modalities. JAMA Psychiatry. 10.1001/jamapsychiatry.2022.1780 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Infantolino ZP, Luking KR, Sauder CL, Curtin JJ, Hajcak G (2018): Robust is not necessarily reliable: From within-subjects fMRI contrasts to between-subjects comparisons. NeuroImage 173: 146–152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Hedge C, Bompas A, Sumner P (2020): Task Reliability Considerations in Computational Psychiatry. Biol Psychiatry Cogn Neurosci Neuroimaging 5: 837–839. [DOI] [PubMed] [Google Scholar]
  • 7.Blair RJR, Mathur A, Haines N, Bajaj S (2022): Future directions for cognitive neuroscience in psychiatry: recommendations for biomarker design based on recent test re-test reliability work. Curr Opin Behav Sci 44: 101102. [Google Scholar]
  • 8.Hitchcock PF, Fried EI, Frank MJ (2022): Computational Psychiatry Needs Time and Context. Annu Rev Psychol 73: 243–270. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Haines N, Kvam PD, Irving LH, Smith C, Beauchaine TP, Pitt MA, et al. (2020): Theoretically Informed Generative Models Can Advance the Psychological and Brain Sciences: Lessons from the Reliability Paradox. PsyArXiv. 10.31234/osf.io/xr7y3 [DOI] [Google Scholar]
  • 10.Botvinik-Nezer R, Holzmeister F, Camerer CF, Dreber A, Huber J, Johannesson M, et al. (2020): Variability in the analysis of a single neuroimaging dataset by many teams [no. 7810]. Nature 582: 84–88. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Della-Maggiore V, Chau W, Peres-Neto PR, McIntosh AR (2002): An empirical comparison of SPM preprocessing parameters to the analysis of fMRI data. NeuroImage 17: 19–28. [DOI] [PubMed] [Google Scholar]
  • 12.Poline J-B, Strother SC, Dehaene-Lambertz G, Egan GF, Lancaster JL (2006): Motivation and synthesis of the FIAC experiment: Reproducibility of fMRI results across expert analyses. 27. 10.1002/hbm.20268 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Fournier JC, Chase HW, Almeida J, Phillips ML (2014): Model Specification and the Reliability of fMRI Results: Implications for Longitudinal Neuroimaging Studies in Psychiatry. PLOS ONE 9: e105169. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Gorgolewski KJ, Storkey AJ, Bastin ME, Whittle I, Pernet C (2013): Single subject fMRI test–retest reliability metrics and confounding factors. NeuroImage 69: 231–243. [DOI] [PubMed] [Google Scholar]
  • 15.Korucuoglu O, Harms MP, Astafiev SV, Golosheykin S, Kennedy JT, Barch DM, Anokhin AP (2021): Test-Retest Reliability of Neural Correlates of Response Inhibition and Error Monitoring: An fMRI Study of a Stop-Signal Task. Front Neurosci 15. Retrieved December 11, 2022, from 10.3389/fnins.2021.624911 [DOI] [PMC free article] [PubMed]
  • 16.Elliott ML, Knodt AR, Ireland D, Morris ML, Poulton R, Ramrakha S, et al. (2020): What Is the Test-Retest Reliability of Common Task-Functional MRI Measures? New Empirical Evidence and a Meta-Analysis. Psychol Sci 31: 792–806. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Noble S, Spann MN, Tokoglu F, Shen X, Constable RT, Scheinost D (2017): Influences on the Test–Retest Reliability of Functional Connectivity MRI and its Relationship with Behavioral Utility. Cereb Cortex 27: 5415–5429. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Tang L, Yu Q, Homayouni R, Canada KL, Yin Q, Damoiseaux JS, Ofen N (2021): Reliability of subsequent memory effects in children and adults: The good, the bad, and the hopeful. Dev Cogn Neurosci 52: 101037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Noble S, Scheinost D, Constable RT (2021): A guide to the measurement and interpretation of fMRI test-retest reliability. Curr Opin Behav Sci 40: 27–32. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Hedge C, Powell G, Sumner P (2018): The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behav Res Methods 50: 1166–1186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Enkavi AZ, Eisenberg IW, Bissett PG, Mazza GL, MacKinnon DP, Marsch LA, Poldrack RA (2019): Large-scale analysis of test–retest reliabilities of self-regulation measures. Proc Natl Acad Sci 116: 5472–5477. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Gawronski B, Morrison M, Phills CE, Galdi S (2017): Temporal Stability of Implicit and Explicit Measures: A Longitudinal Analysis. Pers Soc Psychol Bull 43: 300–312. [DOI] [PubMed] [Google Scholar]
  • 23.Klein C (2020, June 3): Confidence Intervals on Implicit Association Test Scores Are Really Rather Large. PsyArXiv. 10.31234/osf.io/5djkh [DOI] [Google Scholar]
  • 24.Chen B, Xu T, Zhou C, Wang L, Yang N, Wang Z, et al. (2015): Individual Variability and Test-Retest Reliability Revealed by Ten Repeated Resting-State Brain Scans over One Month. PLOS ONE 10: e0144963. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Noble S, Scheinost D, Constable RT (2019): A decade of test-retest reliability of functional connectivity: A systematic review and meta-analysis. NeuroImage 203: 116157. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Baranger DAA, Lindenmuth M, Nance M, Guyer AE, Keenan K, Hipwell AE, et al. (2021): The longitudinal stability of fMRI activation during reward processing in adolescents and young adults. NeuroImage 232: 117872. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Dang J, King KM, Inzlicht M (2020): Why Are Self-Report and Behavioral Measures Weakly Correlated? Trends Cogn Sci 24: 267–269. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Schimmack U (2021): The Implicit Association Test: A Method in Search of a Construct. Perspect Psychol Sci 16: 396–414. [DOI] [PubMed] [Google Scholar]
  • 29.Wennerhold L, Friese M (2020): Why self-report measures of self-control and inhibition tasks do not substantially correlate. Collabra Psychol 6. 10.1525/collabra.276 [DOI] [Google Scholar]
  • 30.Gelman A, Carlin J (2014): Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspect Psychol Sci 9: 641–651. [DOI] [PubMed] [Google Scholar]
  • 31.Marek S, Tervo-Clemmens B, Calabro FJ, Montez DF, Kay BP, Hatoum AS, et al. (2022): Reproducible brain-wide association studies require thousands of individuals [no. 7902]. Nature 603: 654–660. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Kragel PA, Han X, Kraynak TE, Gianaros PJ, Wager TD (2021): Functional MRI Can Be Highly Reliable, but It Depends on What You Measure: A Commentary on Elliott et al. (2020). Psychol Sci 32: 622–626. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Brown VM, Chen J, Gillan CM, Price RB (2020): Improving the reliability of computational analyses: Model-based planning and its relationship with compulsivity. Biol Psychiatry Cogn Neurosci Neuroimaging 5: 601–609. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Chen G, Pine DS, Brotman MA, Smith AR, Cox RW, Haller SP (2021): Trial and error: A hierarchical modeling approach to test-retest reliability. NeuroImage 245: 118647. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Rouder JN, Haaf JM (2019): A psychometrics of individual differences in experimental tasks. Psychon Bull Rev 26: 452–467. [DOI] [PubMed] [Google Scholar]
  • 36.Lord FM, Novick MR, Birnbaum A (1968): Statistical Theories of Mental Test Scores. Oxford, England: Addison-Wesley. [Google Scholar]
  • 37.Kelley TL (1947): Fundamentals of Statistics. Oxford, England: Harvard U. Press, pp xvi, 755. [Google Scholar]
  • 38.Kelley TL (1927): Interpretation of Educational Measurements. Yonkers-on-Hudson, N.Y.: World Book Company. [Google Scholar]
  • 39.Efron B, Morris C (1973): Stein’s Estimation Rule and Its Competitors--An Empirical Bayes Approach. J Am Stat Assoc 68: 117–130. [Google Scholar]
  • 40.Stein C (1956): Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution. STANFORD UNIVERSITY STANFORD; United States. Retrieved August 23, 2022, from https://apps.dtic.mil/sti/citations/AD1028390 [Google Scholar]
  • 41.James W, Stein C (1961): Estimation with Quadratic Loss. Proc Fourth Berkeley Symp Math Stat Probab Vol 1 Contrib Theory Stat 4.1: 361–380. [Google Scholar]
  • 42.Efron B, Morris C (1977): Stein’s Paradox in Statistics. Sci Am 236: 119–127. [Google Scholar]
  • 43.McGraw KO, Wong SP (1996): Forming inferences about some intraclass correlation coefficients. Psychol Methods 1: 30–46. [Google Scholar]
  • 44.Shrout PE, Fleiss JL (1979): Intraclass correlations: Uses in assessing rater reliability. Psychol Bull 86: 420–428. [DOI] [PubMed] [Google Scholar]
  • 45.Shieh G (2016): Choosing the best index for the average score intraclass correlation coefficient. Behav Res Methods 48: 994–1003. [DOI] [PubMed] [Google Scholar]
  • 46.Curran PJ (2003): Have Multilevel Models Been Structural Equation Models All Along? Multivar Behav Res 38: 529–569. [DOI] [PubMed] [Google Scholar]
  • 47.Chow S-M, Ho MR, Hamaker EL, Dolan CV (2010): Equivalence and Differences Between Structural Equation Modeling and State-Space Modeling Techniques. Struct Equ Model Multidiscip J 17: 303–332. [Google Scholar]
  • 48.Olsen JA, Kenny DA (2006): Structural equation modeling with interchangeable dyads. Psychol Methods 11: 127–141. [DOI] [PubMed] [Google Scholar]
  • 49.Gelman A, Pardoe I (2006): Bayesian Measures of Explained Variance and Pooling in Multilevel (Hierarchical) Models. Technometrics 48: 241–251. [Google Scholar]
  • 50.Williams DR, Martin SR, DeBolt M, Oakes L, Rast P (2020, November 20): A Fine-Tooth Comb for Measurement Reliability: Predicting True Score and Error Variance in Hierarchical Models. PsyArXiv. 10.31234/osf.io/2ux7t [DOI] [Google Scholar]
  • 51.Turner BM, Forstmann BU, Wagenmakers E-J, Brown SD, Sederberg PB, Steyvers M (2013): A Bayesian framework for simultaneously modeling neural and behavioral data. NeuroImage 72: 193–206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Ahn W-Y, Krawitz A, Kim W, Busemeyer JR, Brown JW (2011): A model-based fMRI analysis with hierarchical Bayesian parameter estimation. J Neurosci Psychol Econ 4:95–110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Katahira K (2016): How hierarchical models improve point estimates of model parameters at the individual level. J Math Psychol 73: 37–58. [Google Scholar]
  • 54.Valton V, Wise T, Robinson OJ (2020): Recommendations for Bayesian hierarchical model specifications for case-control studies in mental health. 10.48550/arXiv.2011.01725 [DOI] [Google Scholar]
  • 55.Rouder JN, Lu J (2005): An introduction to Bayesian hierarchical models with an application in the theory of signal detection. Psychon Bull Rev 12: 573–604. [DOI] [PubMed] [Google Scholar]
  • 56.Lee MD, Bock JR, Cushman I, Shankle WR (2020): An application of multinomial processing tree models and Bayesian methods to understanding memory impairment. J Math Psychol 95. 10.1016/j.jmp.2020.102328 [DOI] [Google Scholar]
  • 57.Huys QJM, Browning M, Paulus MP, Frank MJ (2021): Advances in the computational understanding of mental illness. Neuropsychopharmacology 46: 3–19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Pike AC, Robinson OJ (2022): Reinforcement Learning in Patients With Mood and Anxiety Disorders vs Control Individuals: A Systematic Review and Meta-analysis. JAMA Psychiatry 79: 313–322. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Eckstein M, Wilbrecht L, Collins A (2021, May 4): What do Reinforcement Learning Models Measure? Interpreting Model Parameters in Cognition and Neuroscience. PsyArXiv. 10.31234/osf.io/e7kwx [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Lockwood PL, Klein-Flügge MC (2021): Computational modelling of social cognition and behaviour—a reinforcement learning primer. Soc Cogn Affect Neurosci 16: 761–771. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Zhang L, Lengersdorff L, Mikus N, Gläscher J, Lamm C (2020): Using reinforcement learning models in social neuroscience: frameworks, pitfalls and suggestions of best practices. Soc Cogn Affect Neurosci 15: 695–707. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Palestro JJ, Bahg G, Sederberg PB, Lu Z-L, Steyvers M, Turner BM (2018): A tutorial on joint models of neural and behavioral measures of cognition. J Math Psychol 84: 20–48. [Google Scholar]
  • 63.SPM12 Software - Statistical Parametric Mapping (n.d.): Retrieved August 23, 2022, from https://www.fil.ion.ucl.ac.uk/spm/software/spm12/
  • 64.Turner BM, Forstmann BU, Love BC, Palmeri TJ, Van Maanen L (2017): Approaches to Analysis in Model-based Cognitive Neuroscience. J Math Psychol 76: 65–79. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Wilson RC, Niv Y (2015): Is Model Fitting Necessary for Model-Based fMRI? PLOS Comput Biol 11: e1004237. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Lebreton M, Bavard S, Daunizeau J, Palminteri S (2019): Assessing inter-individual differences with task-related functional neuroimaging [no. 9]. Nat Hum Behav 3: 897–905. [DOI] [PubMed] [Google Scholar]
  • 67.Haines N, Vassileva J, Ahn W-Y (2018): The Outcome-Representation Learning Model: A Novel Reinforcement Learning Model of the Iowa Gambling Task. Cogn Sci 42: 2534–2561. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Månsson KNT, Waschke L, Manzouri A, Furmark T, Fischer H, Garrett DD (2022): Moment-to-Moment Brain Signal Variability Reliably Predicts Psychiatric Treatment Outcome. Biol Psychiatry 91: 658–666. [DOI] [PubMed] [Google Scholar]
  • 69.Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2015): Bayesian Data Analysis, 3rd ed. New York: Chapman and Hall/CRC. 10.1201/b16018 [DOI] [Google Scholar]
  • 70.Farrell S, Lewandowsky S (2018): Computational Modeling of Cognition and Behavior. Cambridge University Press. [Google Scholar]
  • 71.Bürkner P-C (2017): brms: An R Package for Bayesian Multilevel Models Using Stan. J Stat Softw 80: 1–28. [Google Scholar]
  • 72.Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, et al. (2017): Stan: A Probabilistic Programming Language. J Stat Softw 76: 1–32. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.McElreath R (2022, August 23): Statistical Rethinking (2022 Edition) [R]. Retrieved August 23, 2022, from https://github.com/rmcelreath/stat_rethinking_2022
  • 74.Zhang L (2022, August 17): BayesCog [R]. Retrieved August 23, 2022, from https://github.com/lei-zhang/BayesCog_Wien
