
The Psychometric Upgrade Psychophysiology Needs

Peter E. Clayson

Abstract

Although biological measurements are constrained by the same fundamental psychometric principles as self-report measurements, these essential principles are often neglected in most fields of neuroscience, including psychophysiology. Potential reasons for this neglect could include a lack of understanding of appropriate measurement theory or a lack of accessible software for psychometric analysis. Generalizability theory is a flexible and multifaceted measurement theory that is well suited to handling the nuances of psychophysiological data, such as the often unbalanced number of trials and intraindividual variability of scores of event-related brain potential (ERP) data. The ERP Reliability Analysis Toolbox (ERA Toolbox) was designed for psychophysiologists and is tractable software that can support the routine evaluation of psychometrics using generalizability theory. Psychometrics can guide task refinement, data-processing decisions, and selection of candidate biomarkers for clinical trials. The present review provides an extensive treatment of additional psychometric characteristics relevant to studies of psychophysiology, including validity and validation, standardization, dimensionality, and measurement invariance. Although the review focuses on ERPs, the discussion applies broadly to psychophysiological measures and beyond. The tools needed to rigorously assess psychometric reliability and validate psychophysiological measures are now readily available. With the profound implications that psychophysiological research can have on understanding brain-behavior relationships and the identification of biomarkers, there is simply too much at stake to ignore the crucial processes of evaluating psychometric reliability and validity.

Keywords: event-related potentials (ERPs), ERP psychometric reliability, validity, internal consistency, generalizability theory, psychophysiology, biomarker

1. Introduction

Success in the quantitative study of individual differences hinges on sound psychometrics. For example, the use of measurements with high psychometric reliability is crucial in any study of individual differences, including studies of personality characteristics, psychiatric symptoms, developmental changes, or any other quantitatively characterized relationship. The use of valid measures ensures that variations in measurements are caused by changes in the intended attribute or process of interest. Biological measurements are constrained by the same fundamental psychometric principles as self-report measurements, but these essential psychometric principles receive little to no attention in most fields of neuroscience, including psychophysiology. Grounding psychophysiological research in psychometrics will foster a better understanding of psychological and biological phenomena across a wide spectrum of disciplines, particularly those that pursue an individual differences approach, including clinical research and practice.

Psychometricians and theoretical psychologists have long been rigorously developing and debating approaches for evaluating psychometric reliability and concepts of validity (e.g., Brown, 1910; Cronbach & Meehl, 1955; Loevinger, 1957; Novick & Lewis, 1967; Spearman, 1910). Their focus was typically on elaborating and evaluating psychometric methods for self-report, achievement, or aptitude measures. Although robust methods are available to evaluate the psychometrics of self-report methods, only recently have some of these methods been tailored for use with psychophysiological data. These recent developments are important because commonplace approaches to assessing psychometric reliability are poorly equipped for the nuances of psychophysiological data. The purpose of this review article is to highlight some appropriate methods and solutions for evaluating the psychometric reliability and validation of psychophysiological data. The emphasis of the present review is on studies of event-related brain potentials (ERPs), but the principles described below are widely applicable to all psychophysiological modalities and beyond.

2. ERPs Are Operationalizations

An understanding of the basic nature of ERP recording and analysis is important in applying psychometric principles. The collection and analysis of ERP data are computationally intensive and methodologically complex, and many of the decisions in the data-processing pipeline leading up to and including the scoring of ERP waveforms are left to the researcher (Clayson et al., 2019; Luck & Gaspelin, 2017). An ERP waveform represents biological signals that are spectrally and temporally complex. An ERP component refers to a distinct voltage change in the scalp-recorded ERP waveform related to neural events, but an observed ERP peak in that waveform can reflect contributions from one or more ERP components (Donchin & Heffley, 1978; Kappenman & Luck, 2012). Varying the measurement of ERP peaks in temporal or spatial dimensions can capture distinct contributions of ERP components.

Considering the decisions required to distill raw continuous data into a single score for statistical analysis, the amalgamation of these decisions represents an operationalization of an ERP component. ERPs typically represent an operationalization of a construct because an ERP component is only one potential measure of any given construct (e.g., activity of anterior cingulate cortex [ACC]). Additionally, any particular approach to processing and scoring an ERP represents additional layers of that operationalization. ERP scores should not be reified, as they are no more than a descriptive summary of recorded activity at a particular time and location.

The error-related negativity (ERN) component provides a ready example of an operationalization of error-related performance monitoring. An ERN is a negative-going peak that occurs within 100 ms following error responses and represents a manifestation of cognitive, affective, motivational, and motor processes (Gehring et al., 2012). Although the functional significance of ERN is debated (Clayson et al., 2021e), ERN is generally considered an index of error detection. Interpretation of ERN findings across studies is complicated by inconsistent operationalizations, including the use of different tasks for recording (e.g., flanker, Stroop, go/nogo), different data-processing pipelines (e.g., filtering, ocular artifact correction/rejection, trial inclusion criteria), and different scoring (e.g., time windows, electrode sites, difference waves, scoring approaches [peak amplitude, time-window mean amplitude, etc.]).

Any single operationalization of ERN makes assumptions about the neural activity of interest, and those assumptions ultimately impact interpretations. Meta-analytic work supports the possibility that variations in how ERN is recorded, analyzed, and scored alter interpretations by showing that ERN findings can be moderated by task (Lutz et al., 2021; Martin et al., 2018; Mathews et al., 2012; Pasion & Barbosa, 2019; Riesel, 2019), data-processing pipeline (Clayson, 2020), and scoring procedures (Boen et al., 2022; Pasion & Barbosa, 2019). Further evidence comes from a multiverse analysis of different data-processing pipelines and scoring procedures (3,456 total ERN scores per person) that indicated that different methodological decisions changed the final ERN average amplitudes and variability of scores (Clayson et al., 2021b). Therefore, ERN recorded during different tasks or analyzed using different pipelines or scoring procedures could reflect different neural activity—possibly reflecting differences in the strength of contributions of cognitive, affective, motivational, and motor processes—and could have different psychometric properties.

The present review is agnostic about the optimal way to record, process, and score ERPs. That approach will probably differ depending on the research question and the intended purpose of measurement. Nonetheless, psychometrically informed methods provide the techniques and tools to evaluate important characteristics of any operationalization of an ERP. Routine psychometric analysis of ERP operationalizations will ensure that the neural event of interest is being recorded, processed, and analyzed rigorously.

3. Psychometric Reliability

Generalizability theory (G theory), developed over sixty years ago (Brennan, 2001; Cronbach et al., 1972; Cronbach et al., 1963; Shavelson & Webb, 1991; Vispoel et al., 2018a, 2018b), is a measurement theory with a flexible and multifaceted framework that is well suited to handling the data structures typical of ERP scores. The application of G theory to ERP research has been extensively described elsewhere (Baldwin et al., 2015; Carbine et al., 2021; Clayson et al., 2021a; Clayson et al., 2021c; Clayson et al., 2021d; Clayson & Miller, 2017a, 2017b). Considering that classical test theory (CT theory) remains popular among psychophysiologists, CT theory is briefly described first, a brief overview of important aspects of G theory is then provided, and the two measurement theories are contrasted with an eye toward application to studies of psychophysiology.

3.1. Classical Test Theory

CT theory posits that a person’s observed score comprises two contributions: the true score and an error score (Lord & Novick, 1968; Novick, 1965; Novick & Lewis, 1967). The true score represents the intended characteristic(s) being measured. The error score represents unsystematic measurement error and captures all random inaccuracies—measurement or otherwise (e.g., variations in electrode impedances, physiological or environmental interference, mood variations, fatigue, ambiguity in questions). An assumption of CT theory is that the measurement error is random and uncorrelated with the true score. To estimate the psychometric reliability of scores, parallel forms of a measure are constructed, and it is assumed that the true scores are the same on both measures, the variability in scores is similar, and error variance is unsystematic (see Vispoel et al., 2018a). If all these conditions hold, then the correlation between parallel forms will represent reliability. Under this assumption of classically parallel forms, the variance of observed scores equals the combination of true-score variance and error variance, and reliability can be represented as the proportion of true-score variance to total observed-score variance (i.e., true-score variance + error variance; see Vispoel et al., 2018a).
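
Written out, the decomposition and reliability ratio just described take the following standard form (the notation anticipates the variance components used in Table 1):

```latex
% Observed score = true score + uncorrelated error
X = T + E, \qquad \operatorname{Cov}(T, E) = 0
% Observed-score variance therefore decomposes additively
\sigma^2_X = \sigma^2_T + \sigma^2_E
% Reliability: proportion of observed-score variance attributable to true scores
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```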

3.2. Generalizability Theory

G theory represents an extension of CT theory and provides a comprehensive framework for decomposing the undifferentiated error term used in CT theory. The partitioning of true-score variance and error variance in CT theory is similar to a one-factor design in an analysis of variance (ANOVA). G theory can include this one-factor consideration or expand the estimation of reliability—simultaneously partitioning variances into universe-score variance and multiple sources of error, similar to a multiple factorial design in ANOVA (see Shavelson et al., 1989; Vispoel et al., 2018a). As a result, the emphasis of G theory is on estimating the universe score, and how the universe score is conceptualized provides the framework for estimating psychometric reliability. The term ‘universe score’ in G theory is analogous to classical test theory’s ‘true score’ and emphasizes that any observed score and its reliability depend on the domain(s) (i.e., universe(s)) to which a researcher wants to generalize (Cronbach et al., 1972). Each potential universe of generalization (e.g., items, occasions, raters, tasks) is represented as a facet or a potential source of measurement error within the targeted G-theory design that is analogous to a factor or variable within an ANOVA model. Different items, occasions, raters, tasks, and so on would represent conditions within a facet and would parallel levels within a factor of an ANOVA. Failing to consider all relevant facets and conditions can result in an overestimation of reliability relative to estimates that properly account for all relevant sources of measurement error (Vispoel et al., 2018a). People are considered the object of measurement, and universe and population are logically interchangeable terms in G theory.

There are many potential facets relevant to studies of ERPs. Examples include the task or hardware used to record the ERP, event type, occasion, number of trials retained for averaging, and sensors used for scoring. Levels of the task facet could include a flanker or Stroop task, levels of event type could include congruent or incongruent trials, and sensors used for scoring could include Cz or FCz. The G-theory framework is inherently flexible by design, and the researcher can include any number of facets and levels of each facet. This powerful flexibility facilitates the identification of how much each facet contributes to the variability in observed scores.
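
To make the mapping from facets to data concrete, the sketch below arranges single-trial ERP scores in long format with one column per facet; the column names and values are hypothetical, but this general layout is what variance-component estimation expects.

```python
import pandas as pd

# Hypothetical long-format trial-level data: one row per scored epoch.
# Columns other than 'amplitude' correspond to the object of measurement
# (person) and to facets (task, event, occasion); values are conditions.
trials = pd.DataFrame({
    "person":    ["p01", "p01", "p01", "p02", "p02", "p02"],
    "task":      ["flanker", "flanker", "stroop", "flanker", "stroop", "stroop"],
    "event":     ["congruent", "incongruent", "incongruent", "congruent", "congruent", "incongruent"],
    "occasion":  [1, 1, 2, 1, 2, 2],
    "amplitude": [-2.1, -4.3, -3.8, -1.2, -0.9, -2.7],  # e.g., N2 at FCz, in microvolts
})

# Each facet contributes a potential source of variance; a G-theory design
# specifies which facets (and interactions) to partition from person variance.
print(trials.groupby(["task", "event"])["amplitude"].mean())
```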

Different types of reliability can be estimated based on whether a researcher is interested in characterizing the relative standing of persons (i.e., generalizability coefficient, which uses a relative error term) or the absolute differences in scores (i.e., dependability coefficient, which uses an absolute error term). Generalizability and dependability coefficients represent different ways of conceptualizing error (see Figure 1), with dependability coefficients having a broader definition than generalizability coefficients that encompasses mean differences in scores. Generalizability coefficients capture the ranking of individuals, such as whether the same participants score high at Time 1 and at Time 2. Generalizability coefficients can be useful for ERP researchers when systematic changes are expected between Time 1 and Time 2 (e.g., due to habituation or fatigue), and the occasion variance can be ignored. Dependability coefficients consider not only the ranking of individuals but also any change in the values of scores, such as whether the same participants score high at Time 1 and Time 2 and show similar values at Time 1 and Time 2.

Figure 1.

Venn diagrams for a one-facet design (person × item) are shown. Total observed score variance is shown on the top, and score variance is represented by the circles, with each circle representing a facet. The area of overlap between circles represents interactions between facets. These Venn diagrams illustrate the difference between generalizability coefficients (Eρ²), which use the relative error variance, and dependability coefficients (ϕ), which use the absolute error variance. Generalizability coefficients only consider variances in the error term that impact the relative standing or ranking of participants in the estimation of reliability (i.e., relative error), because the focus is on the suitability of the measurements for comparing participant scores. Therefore, only variances that impact the ‘Person’ circle will be included in the estimation of reliability. For this one-facet design, only the interaction between person and item (σ²_pi,e) overlaps with the ‘Person’ circle, shown on the left (see Table 1 for formulas for comparison). However, dependability coefficients consider all sources of variance that impact observed scores in the estimation of reliability (absolute error variance). For the one-facet design, that includes item-score variance (σ²_i) and the interaction between person and item (σ²_pi,e; shown on the right).

Two types of reliability coefficients relevant to ERP studies are coefficients of equivalence and coefficients of stability (see Table 1 and Figures 1 & 2). Coefficients of equivalence include measures of internal consistency for items considered randomly equivalent, which refer to how well people’s average scores distinguish differences between people (i.e., between-person variability) relative to the scores that contribute to a person’s average (i.e., within-person variability). Thus, coefficients of equivalence represent the extent to which results can be generalized across items. Coefficients of stability capture the stability of measurements over time for items considered randomly equivalent (i.e., temporal stability of measurements) and represent the extent to which results can be generalized across occasions. Coefficients of equivalence can use data from multiple sessions to estimate internal consistency. An advantage of this approach is that occasion-specific variance in scores is partitioned out of internal consistency estimates. Similarly, coefficients of stability can partition the variance that is relevant to internal consistency from the variance related to test-retest reliability.

Table 1.

Generalizability Theory Estimates of Internal Consistency and Test-Retest Reliability

Single Session
CE: Internal Consistency
  Generalizability coefficient: $E\rho^2 = \dfrac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e}/n_i}$
  Dependability coefficient: $\phi = \dfrac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e}/n_i + \sigma^2_i/n_i}$

Multiple Sessions
CE: Internal Consistency
  Generalizability coefficient: $E\rho^2 = \dfrac{\sigma^2_p + \sigma^2_{po}/n_o}{\sigma^2_p + \sigma^2_{pi}/n_i + \sigma^2_{po}/n_o + \sigma^2_{pio,e}/(n_i n_o)}$
  Dependability coefficient: $\phi = \dfrac{\sigma^2_p + \sigma^2_{po}/n_o}{\sigma^2_p + \sigma^2_{pi}/n_i + \sigma^2_{po}/n_o + \sigma^2_{pio,e}/(n_i n_o) + \sigma^2_i/n_i + \sigma^2_{io}/(n_i n_o)}$
CS: Test-Retest Reliability
  Generalizability coefficient: $E\rho^2 = \dfrac{\sigma^2_p + \sigma^2_{pi}/n_i}{\sigma^2_p + \sigma^2_{pi}/n_i + \sigma^2_{po}/n_o + \sigma^2_{pio,e}/(n_i n_o)}$
  Dependability coefficient: $\phi = \dfrac{\sigma^2_p + \sigma^2_{pi}/n_i}{\sigma^2_p + \sigma^2_{pi}/n_i + \sigma^2_{po}/n_o + \sigma^2_{pio,e}/(n_i n_o) + \sigma^2_o/n_o + \sigma^2_{io}/(n_i n_o)}$
CES
  Generalizability coefficient: $E\rho^2 = \dfrac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi}/n_i + \sigma^2_{po}/n_o + \sigma^2_{pio,e}/(n_i n_o)}$
  Dependability coefficient: $\phi = \dfrac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi}/n_i + \sigma^2_{po}/n_o + \sigma^2_{pio,e}/(n_i n_o) + \sigma^2_i/n_i + \sigma^2_o/n_o + \sigma^2_{io}/(n_i n_o)}$

Note: For single-session data, internal consistency is estimated from the universe-score (person) variance (σ²_p), item variance (σ²_i), residual error variance (σ²_pi,e), and the number of trials (n_i). Multiple-session estimates consider the universe-score variance (σ²_p), item variance (σ²_i), occasion variance (σ²_o), person × occasion (transient error) variance (σ²_po), person × item (item-specific error) variance (σ²_pi), item × occasion variance (σ²_io), residual error variance (σ²_pio,e), the number of trials (n_i), and the number of occasions (n_o). These formulas and their derivations can be found elsewhere (see Clayson et al., 2021a; Clayson et al., 2021c; Clayson et al., 2021d; Clayson & Miller, 2017a; Vispoel et al., 2018a). CE = coefficient of equivalence; CS = coefficient of stability; CES = coefficient of equivalence and stability.
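
To make the Table 1 formulas concrete, the sketch below computes the coefficient of equivalence and stability (CES) row from a set of hypothetical variance components; the numbers are illustrative only, not estimates from any dataset.

```python
# Hypothetical variance components for a person x item x occasion design;
# values are illustrative only and would normally be estimated from data
# (e.g., via ANOVA, SEM, or multilevel models) before applying Table 1.
var = {
    "p": 4.0,      # universe-score (person) variance
    "i": 0.2,      # item variance
    "o": 0.3,      # occasion variance
    "pi": 1.0,     # person x item variance
    "po": 1.5,     # person x occasion variance
    "io": 0.1,     # item x occasion variance
    "pio_e": 6.0,  # residual (person x item x occasion, error) variance
}

def ces_coefficients(v, n_i, n_o):
    """Coefficient of equivalence and stability (CES) row of Table 1:
    generalizability (relative error) and dependability (absolute error)."""
    rel_error = v["pi"] / n_i + v["po"] / n_o + v["pio_e"] / (n_i * n_o)
    abs_error = rel_error + v["i"] / n_i + v["o"] / n_o + v["io"] / (n_i * n_o)
    e_rho2 = v["p"] / (v["p"] + rel_error)
    phi = v["p"] / (v["p"] + abs_error)
    return round(e_rho2, 3), round(phi, 3)

print(ces_coefficients(var, n_i=30, n_o=2))   # e.g., 30 trials over 2 occasions
print(ces_coefficients(var, n_i=100, n_o=2))  # more trials shrink only trial-related error
```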

Figure 2.

Venn diagrams for a two-facet design (person × item × occasion) are shown. Total observed score variance is shown on the top and score variance is represented by the circles. Each circle represents a facet, and the overlap between circles represents interactions between facets. Generalizability coefficients (Eρ²) only consider variance sources in the error term that impact the relative standing or ranking of participants (i.e., relative error) in the estimation of reliability. Therefore, only variances that overlap with the ‘Person’ circle will be included in the estimation of reliability. The contribution of error variance to the generalizability coefficient of equivalence and stability (CES; see Table 1) includes only those facets that interact with person scores: the interaction between person and item (σ²_pi), the interaction between person and occasion (σ²_po), and the three-way interaction (σ²_pio,e). The contribution of error variance to the dependability (ϕ) CES includes all facets and interactions between facets (absolute error variance).

3.2.1. Two-Facet G-Theory Analysis.

The following example illustrates how G theory can be used to identify the key sources of error variance that limit the reliability of scores. G theory was used to examine the internal consistency (coefficient of equivalence) and test-retest reliability (coefficient of stability) of N2 ERP component scores following food-related go and nogo stimuli (Carbine et al., 2021; Clayson et al., 2021d). The estimation of reliability used a two-facet model that included items and measurement occasion, allowing the partitioning of observed-score variance into person-related, item-related, and occasion-related variances (see Figure 2 for the Venn diagram). Estimates of internal consistency, both dependability coefficients and generalizability coefficients, were high (≥ .97), suggesting that N2 scores to food-related stimuli were suitable for studying individual differences. Given the high level of internal consistency, these findings might point to reducing task length, if helpful for the purpose of the study. The relationship between the number of trials included in the ERP average and internal consistency for nogo N2 trials is shown in Figure 3.

Figure 3.

Generalizability coefficients of equivalence (internal consistency, solid line) and coefficients of stability (test-retest reliability, dashed line) for nogo N2 scores using the estimated standard deviation components from Figure 1 in Clayson et al. (2021d; see also Carbine et al., 2021). Generalizability coefficients are plotted as the number of trials included in the reliability estimates increases from 1 to 100.

The generalizability and dependability estimates of test-retest reliability were low, ranging from .48 to .68. An analysis of the variance components clarified that the person × occasion variance (σ²_po) was large compared to other sources of variance, essentially placing a ceiling on the test-retest reliability estimates (see Figure 2 for relevant variance components). As shown in Table 1, the impact of person × occasion variance on overall test-retest reliability is not minimized by including more trials. Therefore, attempts to increase the internal consistency of N2 scores will have no beneficial impact on the test-retest reliability because there is no impact on the primary source of error variance: person × occasion variance. As shown in Figure 3, test-retest reliability appears to asymptote just below .70 due to the person × occasion variance not being impacted by the number of included trials. Therefore, this G-theory analysis pinpointed person × occasion variance as the necessary target for improving test-retest reliability of N2 scores. This indicates a need to determine the facets that impact individual differences between sessions, and in the context of the Carbine et al. (2021) study that could include controlling for levels of satiety, time of the testing session, and levels of fatigue across measurement occasions.
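
A brief sketch of the asymptote described above, plugging hypothetical variance components into the Table 1 coefficient of stability: as the number of trials grows, trial-related error terms shrink, but the person × occasion term remains and caps test-retest reliability.

```python
import numpy as np

# Hypothetical variance components (illustrative only, not the published estimates).
v_p, v_pi, v_po, v_pio_e = 4.0, 1.0, 2.0, 6.0
n_o = 1  # generalizing scores from one occasion to another

n_i = np.arange(1, 101)
# Generalizability coefficient of stability (test-retest reliability) from Table 1:
numerator = v_p + v_pi / n_i
cs = numerator / (numerator + v_po / n_o + v_pio_e / (n_i * n_o))

print(round(cs[0], 2), round(cs[-1], 2))   # reliability rises as trials increase...
print(round(v_p / (v_p + v_po / n_o), 2))  # ...but cannot exceed this asymptote (~.67 here)
```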

3.3. Features of ERP Data

The psychometric analysis of ERPs should reflect the characteristics of the underlying data. Best practices for psychometric analysis include choosing approaches that fit the researcher’s assumptions about the data, rather than forcing the data to fit the psychometric analysis approach. Therefore, it is helpful to characterize the features of a typical ERP score to identify the appropriate psychometric approaches for assessing reliability of ERP scores.

3.3.1. Normal Distribution.

The modal approach to scoring ERP components is to average the waveforms for all epochs together for a condition and a participant and to score the observed ERP component peak at one or more scalp electrode sites. The assumption of this averaging procedure is that any given trial includes the contribution of an ERP signal of interest and random noise (see Clayson et al., 2013). Therefore, epochs are averaged together to minimize the contribution of background noise to the ERP peaks of interest prior to scoring condition × participant averages. If random noise is constant across the task, then the ERP peaks will likely be normally distributed, assuming a constant ERP signal of interest over the course of the task. As such, ERP component scores are treated as normally distributed within participants.

Time-window mean amplitude scores (e.g., average activity from 200 to 400 ms) are more likely to be normally distributed than peak amplitude or peak latency scores. The arithmetic average of single-trial, time-window mean amplitude scores is the same as the score extracted from the subject average waveforms, because the measurement window is not biased by single-trial noise. However, this is not the case for peak amplitude or latency scores because random noise during any trial waveform biases the selection of a peak for scoring (Clayson et al., 2013; Luck et al., 2021). Therefore, the standard analysis of item scores in G theory is most appropriate for time-window mean amplitude scores and those psychophysiological data that are normally distributed across trials within people (see Other Psychophysiological Data section for a discussion of the use of data splits for peak amplitude or latency scores).
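
A minimal simulation of this point, with hypothetical signal and noise parameters: the time-window mean of the averaged waveform equals the average of single-trial time-window means, whereas the peak of the averaged waveform is smaller than the average of single-trial peaks because noise biases single-trial peak selection.

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_samples = 60, 200
t = np.arange(n_samples)
signal = 5.0 * np.exp(-0.5 * ((t - 100) / 15) ** 2)               # fixed ERP-like peak
trials = signal + rng.normal(0, 8.0, size=(n_trials, n_samples))  # single-trial noise

window = slice(80, 120)  # scoring window around the peak

avg_waveform = trials.mean(axis=0)
mean_of_avg = avg_waveform[window].mean()              # score the averaged waveform
avg_of_means = trials[:, window].mean(axis=1).mean()   # average the single-trial scores
print(np.isclose(mean_of_avg, avg_of_means))           # True: identical by linearity of the mean

peak_of_avg = avg_waveform[window].max()
avg_of_peaks = trials[:, window].max(axis=1).mean()
print(round(peak_of_avg, 2), round(avg_of_peaks, 2))   # single-trial peaks are inflated by noise
```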

3.3.2. Unbalanced Number of Trials.

During ERP data processing, epoch rejection procedures are commonly used to remove large artifacts from the ERP, such as gross motor artifact. As a result, even if the same number of stimuli is presented to each participant, the number of accepted trials may differ across conditions or participants. Those ERPs that depend on participant responses (e.g., correct or error trials in ERN analysis) may differ in number of trials per condition, because some high-performing participants might commit fewer errors than others.

3.3.3. Intraindividual Variability.

A feature of some ERPs, as with trial-wise data more broadly (e.g., response times; Williams et al., 2021), is that there are between-person differences in the intraindividual variability of ERP trial scores (Clayson et al., 2021c; Clayson et al., 2022c; Clayson et al., 2024; see also Volpert-Esmond, 2022). This intraindividual variability in scores means that some participants may be mischaracterized by group-level reliability estimates because some participants will show higher or lower data quality than the average participant.

3.4. Assumptions of G Theory vs. CT Theory

The assumptions of G theory are less restrictive than those of CT theory, and CT theory provides a useful comparison as a well-known measurement theory. Comparisons of G theory and CT theory specifically for ERP researchers can also be found elsewhere (Baldwin et al., 2015; Clayson et al., 2021c; Clayson & Miller, 2017a, 2017b), along with other useful comparisons of measurement theories for broader audiences (Brennan, 2000, 2010; Vispoel et al., 2018a).

An assumption in CT theory is that any observed score is the sum of a true score and an error score and that these scores are independent of each other. Whereas CT theory assumes a single, undifferentiated source of random error, G theory partitions observed variance into multiple sources (i.e., facets), including error variance. Therefore, the impact of multiple sources of variance can be differentiated, which provides a more comprehensive and realistic analysis of psychometric reliability than CT theory. In ERP research there are numerous potential sources of error variance (e.g., see discussion of facets above), and the contribution of each source and combination of sources can be pinpointed and optimized for psychometric reliability. For example, a two-facet G-theory analysis (person × trial × occasion) identified that, despite high internal consistency of ERP scores, individual differences in changes across occasions (the person × occasion interaction) placed an upper limit on test-retest reliability (see Two-Facet G-Theory Analysis section; see also Carbine et al., 2021; Clayson et al., 2021d).

The assumption of classically parallel measures is key in CT theory, and this refers to the notion that item splits (e.g., odd vs. even), forms (e.g., Stroop vs. flanker), or assessment occasions (e.g., time 1 vs. time 2) have equal means, standard deviations, error variances, and true scores. When these assumptions are not met, reliability coefficients can be misleading (Charter, 2001; Cho, 2016; Cronbach, 1951; Feldt & Charter, 2003; Warrens, 2015). G theory circumvents the assumption of equal variances by explicitly modeling the variance of each task split and has a less restrictive assumption of randomly parallel measures: all items come from the same universe of admissible observations, with each item considered exchangeable with any other (Cronbach et al., 1963; Vispoel et al., 2018a). In ERP studies, Spearman-Brown adjusted split-half reliability is widely used (Brown, 1910; Spearman, 1910), because it is straightforward to use with unbalanced data. However, the use of the coefficient likely leads to an overestimation of the internal consistency of ERP scores (Charter, 2001; Cho, 2016; Cronbach, 1951; Feldt & Charter, 2003; Warrens, 2015), as unequal variances between arbitrary splits are highly likely, particularly in instances when few trials are retained for averaging (see appendix of Clayson et al., 2021a).
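
For reference, the Spearman-Brown adjusted split-half coefficient mentioned above can be computed as in the sketch below (hypothetical trial-level scores); the closing comment notes the equal-variance assumption that G theory avoids.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical trial-level amplitudes: 40 participants x 20 retained trials.
true_scores = rng.normal(-5, 2, size=(40, 1))
trials = true_scores + rng.normal(0, 6, size=(40, 20))

odd_mean = trials[:, 0::2].mean(axis=1)    # average of odd-numbered trials
even_mean = trials[:, 1::2].mean(axis=1)   # average of even-numbered trials

r_half = np.corrcoef(odd_mean, even_mean)[0, 1]
spearman_brown = 2 * r_half / (1 + r_half)  # adjust the half-length correlation to full length
print(round(r_half, 2), round(spearman_brown, 2))
# The adjustment assumes the halves are classically parallel (equal variances, equal true
# scores); when split variances differ, the coefficient tends to overestimate reliability.
```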

Intraindividual variability appears endemic to ERP data (Clayson et al., 2022c; Clayson et al., 2024), and CT theory assumes homogeneity of variances of true and error scores across individuals. Even if one assesses the psychometric reliability of ERP scores from a group of participants and obtains one group-level coefficient that characterizes all participants, the danger is that group-level reliability estimates might mask the low reliability of some participant data. For example, Clayson et al. (2022c) observed that group-level internal consistency estimates mischaracterized the subject-level reliability of ERN scores (57% of correct-trial scores, 32% of error-trial scores), such that there was a sixfold increase from the smallest to the largest internal consistency estimate. The flexible framework of G theory allows the incorporation of participant-specific error variances in the estimations of psychometric reliability and can provide participant-level reliability estimates to characterize each participant’s data (see Clayson et al., 2021c).
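
One way to express the participant-level idea, roughly following the logic of Clayson et al. (2021c), is to replace the pooled error variance in the single-session generalizability coefficient with a participant-specific error variance and trial count; this is a sketch of the logic, not the exact formulation implemented in the toolbox.

```latex
% Group-level, single-session generalizability coefficient (Table 1) and a
% participant-specific analogue using participant s's error variance and trial count.
% A sketch of the logic only; see Clayson et al. (2021c) for the exact formulation.
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e}/n_i}
\qquad\longrightarrow\qquad
E\rho^2_s = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{e,s}/n_s}
```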

Some ERP research employs coefficient alpha (i.e., Cronbach’s alpha) to describe the internal consistency of scores. Cronbach himself discouraged the continued use of coefficient alpha for most data and advocated the use of the standard error of measurement (Cronbach, 2004). However, coefficient alpha remains a well-known and popular estimate from CT theory and is akin to the average of all possible split halves (Cronbach, 1951). Coefficient alpha is poorly suited to ERP data. Coefficient alpha requires all participants to have the same number of trials, which is uncommon due to artifact rejection during data processing. Although ERP data from a group of participants could be truncated so that all participants have the same number of trials for the estimation of coefficient alpha, this practice is ill advised. Doing so would misrepresent the internal consistency of the full dataset, which contains the scores that are ultimately used for statistical analysis. Any changes over the course of the task or the full intraindividual variability of a person’s ERP scores will not be captured by truncating the ERP data.

3.5. Software for G Theory

The benefits of G theory’s comprehensive framework for understanding and estimating sources of variance come with the added cost of analytic complexity. G-theory equations are built around estimates of variance, but how variance is estimated is left up to the researcher. Some approaches use traditional ANOVA procedures (Brennan, 2001, 2003), structural equation models (SEMs; Vispoel et al., 2023; Vispoel et al., 2018a), or multilevel models (Brennan, 2010; Li, 2023). Variance components can be estimated with these procedures, and then these components can be used in equations for psychometric reliability estimates (see Table 1 for formulas). Although there are several software packages for estimating variance components and directly examining psychometric reliability using generalizability theory, few general-purpose software packages can handle the unbalanced and multifaceted nature of ERP data and estimate the typical psychometric reliability metrics of common interest to ERP researchers (see Table 1).

The ERP Reliability Analysis Toolbox (ERA Toolbox; https://peclayson.github.io/ERA_Toolbox/) was designed for psychophysiologists (Clayson et al., 2021a; Clayson et al., 2021c; Clayson et al., 2021d; Clayson & Miller, 2017a). The ERA Toolbox uses Bayesian multilevel models to estimate variance components of scores, and these models are well suited for the unbalanced trial-level data common in ERP studies by taking advantage of partial pooling. Participants from the same population are expected to be similar to each other, and partial pooling takes advantage of this by pulling extreme observations closer to the group mean, resulting in more efficient parameter estimation than the arithmetic solution (e.g., Gelman, 2006; Gelman et al., 2012). Therefore, participants with few trials do not strongly impact parameter estimation, and the variance of such scores is pulled toward the group mean.
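
The shrinkage behind partial pooling can be illustrated with a simple empirical-Bayes style calculation (a sketch with hypothetical values; the ERA Toolbox itself fits full Bayesian multilevel models rather than the closed-form shrinkage shown here).

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical unbalanced trial counts and trial-level scores for 30 participants.
n_trials = rng.integers(5, 60, size=30)
group_mean, between_var, within_var = -6.0, 4.0, 25.0
obs_means = np.array([
    rng.normal(rng.normal(group_mean, np.sqrt(between_var)), np.sqrt(within_var), size=n).mean()
    for n in n_trials
])

# Partial pooling: each observed mean is pulled toward the group mean, with more
# shrinkage for participants contributing fewer (and therefore noisier) trials.
weight = between_var / (between_var + within_var / n_trials)
pooled = weight * obs_means + (1 - weight) * obs_means.mean()

few = n_trials.argmin()
print(n_trials[few], round(obs_means[few], 2), round(pooled[few], 2))
```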

The ERA Toolbox handles the estimation of variance components and uses these estimates to calculate psychometric reliability using generalizability theory. The toolbox provides traditional estimates of group-level internal consistency (Baldwin et al., 2015; Clayson & Miller, 2017a) and test-retest reliability (Carbine et al., 2021; Clayson et al., 2021d). The formulas for estimating the reliability coefficients are shown in Table 1. The toolbox can also provide the group-level internal consistency of difference scores (Clayson et al., 2021a), which are common in studies of ERPs. Additionally, participant-level data quality and internal consistency coefficients can be estimated by using participant-specific error variances from Bayesian multilevel models (Clayson et al., 2021c). Although the ERA Toolbox was designed with psychophysiology researchers in mind, the toolbox would work well with any trial-wise data, psychophysiological (e.g., fMRI, pupillography, facial electromyography) or otherwise (e.g., response times).

3.6. Applications of Psychometric Reliability

Psychometric reliability is a property of observed scores, but it is not a universal property of a measure (Thompson, 2003; Vacha-Haase, 1998). Psychometric reliability necessarily depends on many factors, such as the sample and population of interest, the hardware or task used for recording, the specific data-processing pipeline (e.g., sensors used for analysis), the contrast within a task (e.g., event-specific scores [error trials, unpleasant trials], difference scores [error minus correct, unpleasant minus pleasant]), or the scoring procedure (e.g., time-window mean amplitude, peak amplitude). A demonstration of adequate reliability in one narrow application (e.g., internal consistency of P300 in healthy undergraduates) cannot be assumed to generalize to other contexts (e.g., internal consistency of P300 in people diagnosed with schizophrenia). Poor psychometric reliability increases the likelihood both of finding non-replicable results and of missing true phenomena (Loken & Gelman, 2017) and even calls into question the validity of score interpretations (Meehl, 1986). These problems are particularly exacerbated in the small-sample studies common in psychopathology research, especially in clinical neuroscience (e.g., Szucs & Ioannidis, 2020) and in studies using ERPs (Clayson et al., 2019). Therefore, psychometric reliability should be routinely evaluated in any study of individual differences, and this practice is consistent with the author guidelines of Psychophysiology (Psychophysiology; Author Guidelines, 2023) and International Journal of Psychophysiology (International Journal of Psychophysiology; Guide for Authors, 2023).

Internal consistency estimates can guide decisions about how to process and analyze ERP scores for studies of individual differences. Different data-processing pipelines impact observed internal consistency, and the impact of some pipelines on the internal consistency of ERN scores, for example, has been examined. Key data-processing stages include event type (correct vs. error trial), filter cutoffs, reference schemes, eye-movement correction procedures, baseline adjustment windows, electrode sites, and amplitude measurement approaches (Clayson, 2020; Klawohn et al., 2020b; Sandre et al., 2020). Studies of ERN data quality also shed light on important methodological choices, because internal consistency is impacted by within-person variability (i.e., data quality). A multiverse analysis of 3,456 pipelines confirmed that each of the above data-processing stages had downstream effects on the data quality of ERN scores within the same sample of participants (Clayson et al., 2021b). The use of internal consistency estimates to guide data-processing decisions helps to identify optimal pipelines for studying individual differences in ERP scores.

Psychometric reliability estimates can guide the selection of candidate biological markers in clinical trials (Clayson et al., 2021d; Light et al., 2020; Light & Swerdlow, 2020). Ideally, therapeutic targets should show high internal consistency (an index of the capability of scores to distinguish participants) and test-retest reliability (stability of scores over a relevant time frame). For example, in studies of the frequency following response (FFR), there is no gold-standard measurement approach (Krizman & Kraus, 2019). Utilizing those measurement approaches that yield high internal consistency and test-retest reliability will identify the candidate markers most likely to be useful for intervention research. A psychometric evaluation of 18 different approaches to scoring FFR in schizophrenia patients (Clayson et al., 2022b) identified candidate measurements with high psychometric reliability for use in therapeutic interventions (Clayson et al., 2021f; see also Clayson et al., 2022a). Because unreliable data call into question the conclusions of a study, the verification of score reliability should be an early step in the selection of biological markers for use as therapeutic targets. A lack of attention to psychometric reliability has likely limited the widespread application of findings from studies of neural measures to psychopathology research (Lilienfeld & Strother, 2020).

3.7. Other Psychophysiological Data

There is an added challenge in evaluating the psychometric characteristics of psychophysiological data that are not normally distributed at the single-trial level. For example, time-frequency EEG poses such challenges because such data are intrinsically higher-dimensional than ERP data, and reliability estimates require models that reflect the chi-square distribution of single-trial power scores. Although the psychometric reliability of time-frequency EEG data has been investigated (e.g., Rocha et al., 2020), the reliability of oscillatory phenomena has received almost no attention in the literature. Multilevel models tailored to the underlying data distributions could be used to estimate variance components for reliability estimation with the formulas in Table 1.

Another challenge is posed by data in which the score from the average of trials is not the same as the average of single-trial scores. For example, when EEG trials are averaged together before calculating the average power in a frequency band (i.e., average power that primarily includes phase-locked information), different scores will be obtained than if power is first calculated on single trials and then single-trial estimates are averaged (i.e., total power that includes phase- and non-phase-locked information). Even more commonly, the average of single-trial ERP peaks is almost always higher than the peak of the average of single-trial ERP epochs, due to the variability in latency of the single-trial peaks. G theory has been developed to work on parallel item task splits, which could be appropriate for estimating reliability of average power or peak amplitude scores (see Vispoel et al., 2022). The use of task splits, instead of single-trial scores, would benefit from the mitigation of background EEG noise on power or peak amplitude scores afforded by averaging many trials together. When conducting a G-theory analysis of item splits, the scores derived from split data become the units of analysis. For example, average power from splits comprising 100 trials each could be extracted from data with 500 trials to estimate reliability and determine the optimal task length as a function of splits (i.e., 100 trials). The size of the splits is determined by the researcher, and the appropriate length of the splits will depend on the type of data.
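
A sketch of the split-based approach just described, with hypothetical dimensions: single-trial power is averaged within splits of 100 trials, and the split scores, rather than single trials, become the items entering the G-theory analysis.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical single-trial total power: 20 participants x 500 trials.
power = rng.chisquare(df=4, size=(20, 500)) * 1.5

split_size = 100                              # researcher-chosen split length
n_splits = power.shape[1] // split_size       # five splits of 100 trials each
splits = power[:, :n_splits * split_size].reshape(power.shape[0], n_splits, split_size)
split_scores = splits.mean(axis=2)            # participants x splits

# split_scores now serves as the person x item (split) data for a one-facet
# G-theory analysis, with each split treated as a randomly parallel item.
print(split_scores.shape)
```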

4. Psychometric Validation

Validation research is ultimately about epistemology and concerns the methodological effort to evaluate the relationships among tests and the consequences of their use. The focus on the application of scores means that a measure can have different validities for different applications (Fahrenberg et al., 1986). For example, an ERP score might yield different validities for inferences from a sample with psychiatric illness than for inferences from a sample of healthy comparison participants.

4.1. Construct Validation

Construct validity refers to whether evidence supports the interpretation or meaning of scores for an intended construct (Cronbach & Meehl, 1955). Convergent and discriminant validity are two types of construct validity relevant to ERPs. Whereas convergent validity describes the extent to which two measures of the same construct produce similar inferences, discriminant validity concerns the extent to which two measures of different theoretically relevant constructs produce different inferences (Campbell & Fiske, 1959; Fiske, 1971). For studies of ERPs, convergent and discriminant validity are highly relevant for attempts to generalize ERP findings across paradigms—particularly for ERP components referred to via generic nomenclature. For example, different paradigms used to record N2, a negative-going, scalp-recorded ERP component that peaks between 200 and 350 ms after stimulus onset, may elicit different processes, such as attention, cognitive control, mismatch, novelty, or sequential matching (see Folstein & Van Petten, 2008).

Studies of convergent and discriminant validity can identify instances in which inferences might be based on jingle or jangle fallacies. A jingle fallacy occurs when two measures assess different constructs but are assumed to measure the same construct because the measures have the same or similar names; a jangle fallacy occurs when two measures assess the same construct but are assumed to measure different constructs because the measures have different names (Kelley, 1927). An example of a jingle fallacy could be assuming that N2 findings from two different paradigms reflect the same process because they are both “N2s”. An example of a jangle fallacy could be assuming that N2 is distinct from ERN when recorded during the same paradigm simply because the components are given different names.

A practical approach for examining the convergent and discriminant validity of ERPs is the use of a multitrait-multimethod matrix, wherein theoretically similar and dissimilar measurements are correlated (Campbell & Fiske, 1959). Bivariate correlations among scores that purportedly measure the same construct should be high (convergent validity), and correlations among scores that measure different constructs should be low (discriminant validity). This approach has been applied to ERPs in the form of an alternatively named multicomponent-multitask matrix (Riesel et al., 2013). Multicomponent refers to different ERP components, and multitask refers to different tasks used for recording ERPs.
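
A minimal sketch of such a matrix with hypothetical scores: rows and columns are component-task combinations, and a pattern of high same-component and lower cross-component correlations is what supports convergent and discriminant validity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 80
# Hypothetical latent traits per person: error monitoring (ERN) and error awareness (Pe).
monitoring = rng.normal(size=n)
awareness = rng.normal(size=n)

scores = pd.DataFrame({
    "ERN_flanker": monitoring + rng.normal(0, 1.0, n),
    "ERN_stroop":  monitoring + rng.normal(0, 1.0, n),
    "Pe_flanker":  awareness + rng.normal(0, 1.0, n),
    "Pe_stroop":   awareness + rng.normal(0, 1.0, n),
})

# Same-component/different-task correlations (convergent validity) should exceed
# different-component correlations (discriminant validity) in the resulting matrix.
print(scores.corr().round(2))
```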

A multicomponent-multitask matrix was used to evaluate the convergent and discriminant validity of ERN and error positivity (Pe) recorded during flanker, Stroop, and Go/Nogo tasks (Riesel et al., 2013). The error positivity component (Pe) is a slow, tonic waveform that follows ERN and is larger for error trials than for correct trials (Nieuwenhuis et al., 2001; Overbeek et al., 2005). Whereas ERN is considered to index early error detection, Pe is considered to reflect later error awareness (Steinhauser & Yeung, 2012; Ullsperger et al., 2010; Wessel, 2012; Wessel et al., 2012). Convergent validity of ERN and, separately, of Pe was supported by showing that each component correlated with itself across paradigms, and discriminant validity of the two components was supported by showing numerically smaller correlations between ERN and Pe within a task than between each component and itself across tasks (Riesel et al., 2013). A registered report provided a direct replication that supported these conclusions, and it conceptually replicated the findings by statistically comparing the correlations (Clayson et al., in press). Although the strength of the correlations was modest for ERN (.45 to .53) and Pe (.41 to .67), statistical comparisons supported the convergent and discriminant validity of these two components. These findings suggest that the interpretations of ERN and Pe are not falling prey to jingle or jangle fallacies.

Clayson et al. (in press) extended the findings of Riesel et al. (2013) by examining N2. Comparisons of ERN and N2 did not show evidence of discriminant validity, and the correlations between ERN and N2 were similar in magnitude (.49 to .60) to the convergent-validity correlations for ERN (.45 to .53). These findings suggest that studies that rely on distinct interpretations of the functional significance of ERN and N2 could fall prey to the jangle fallacy. In fact, ERN and N2 have both been conceptualized as indices of discrepancy detection, with ERN related to response conflict and N2 related to stimulus conflict (Larson et al., 2014). The findings of Clayson et al. (in press) could provide weak support for the convergent validity of the two ERPs.

4.2. Generalizability Theory for Validation

G theory provides a helpful framework for evaluating construct validity. The inclusion of different facets of measurement in G theory analysis permits the examination of score consistency across different conditions of measurement. Ideally, scores would be consistent between different items, methodological parameters, paradigms used for recording, or groups. For example, if ERP scores show different between-person variances for different items (correct vs. error; pleasant vs. unpleasant; rare vs. frequent), experimental paradigms, or groups, then these differences might indicate the component is not consistently measuring the same construct across levels of a facet.

G theory can pinpoint facet-related differences in error variances, highlighting the sources of variability. Some sources of measurement error (i.e., within-person variability) could represent systematic differences between people. For example, particular participant groups might show greater within-person error variances than other groups, or certain paradigms might tend to have more variability in an ERP response than other paradigms. Other sources of measurement error might include methodological differences (e.g., filtering, ocular artifact correction) or interactions between facets (e.g., patient vs. control differences in error variances larger for an easy task than a difficult one). By identifying the sources of error variance, steps can be taken to minimize their impact on scores and consequently increase the validity of the measurements. Thus, G theory can be used to examine how scores generalize across conditions and facets of measurement and guide decisions for refining measures.

5. A Note on Psychometric Validity

Traditionally, validity is defined in terms of how a test is used. Some familiar definitions of validity include the extent to which a test measures what it purports to measure (Kelley, 1927), how well test scores correlate with scores on tests from other theoretically relevant constructs (Cronbach & Meehl, 1955), or how appropriate a test label is (Messick, 1989). These familiar definitions of validity emphasize the application of test scores (i.e., epistemology), descriptive theories of meaning, and correlations of a test with other tests or criteria. However, questions of psychometric validity are primarily theoretical ones and cannot be solely answered by methodological investigation.

A simple and intuitive definition of validity is that the attribute of interest exists and that changes in the attribute cause changes in measured observations (Borsboom, 2005; Borsboom & Mellenbergh, 2007; Borsboom et al., 2004). This alternative definition shifts validity theory away from epistemology, meaning, and correlation to ontology, reference, and causality (Borsboom et al., 2004). A theoretical attribute can include neural processes, such as hemodynamic or electromagnetic activity in the brain—the specific brain activity typically of interest is not directly observable, and a researcher requires inferential processes to judge differences in brain activity among people.

A measurement model describes how attributes cause changes in observations. For example, ERP waveforms represent a manifestation of activity from numerous postsynaptic potentials (see Luck, 2014). Psychophysiologists might assume that differences in observed ERP peaks are systematically related to differences in underlying brain activity (theoretical attribute) between people. In addition, a central aspect of psychophysiology involves studying psychological and biological phenomena together and determining individual differences in those phenomena. Biological phenomena (brain activity) have been advocated as measures of psychological phenomena, but clear measurement models are required to justify how psychological phenomena cause changes in brain activity or vice versa. Failing to do so (if it is even possible) can be construed as a problem of test validity, which concerns any attempt to connect data to attributes.

The problem of validity is a fundamental and substantive issue for all fields that use measurements. Studies that evaluate psychometric reliability and validity can practically advance measurement practices, but such studies should be motivated by theories of the validity of attributes. The present review focuses on the evaluation of psychometrics, but the validation process cannot replace the theoretical validity problem.

6. Additional Psychometric Concepts for Clinical Application

6.1. Standardization

Many different projects are moving performance-monitoring ERP components—such as ERN—toward standardization (e.g., Clayson, 2020; Klawohn et al., 2020b), optimization (e.g., Clayson et al., 2021b; Sandre et al., 2020), and clinical application (e.g., Clayson et al., 2022c; Hajcak et al., 2019). For example, a large ERN data set was published as normative (Imburgio et al., 2020), despite incomplete understanding of the psychometric validity of ERN in the field (see commentary Clayson et al., 2021e). Such efforts to establish normative datasets are challenged by a lack of standardization of methods for recording ERN and a lack of an understanding of the methodological choices that impact ERN scores (Clayson et al., 2021e), which can be advanced by studies that evaluate psychometric reliability and validity.

On the path to clinical application and publication of normative databases, rigorous standardization of paradigms, administration, and data analysis is critical to achieving dependability (distinct from dependability in G theory), which is a fundamental concept in psychological testing. Dependability in neuropsychological assessment implies that the information from psychological measures is invariant across time and situations (Russell et al., 2005). For example, during neuropsychological evaluations, a client’s scores are compared to normative data on standardized assessment measures, and these comparisons are used for individualized diagnosis and treatment planning. Normative scores provide information about how an individual’s performance compares to a reference group.

A keystone of all neuropsychological evaluation is the standardization of test administration, scoring, and interpretation, because nuisance factors must be mitigated to make observations comparable across administrations. A failure to properly standardize measures leads to spurious inferences, because deviant scores from the normative sample could be extreme due to any number of factors in the administration or scoring of the measures, rather than differences in the construct of interest (Bigler & Dodrill, 1997). Therefore, strict standardization in recruitment (cf. samples of convenience), assessment, and scoring should be followed when creating normative databases. Failing to standardize ERP assessment before providing normative databases has deleterious consequences for researchers and clinicians who may prematurely rely on them (Clayson et al., 2021e). After proper standardization and verification of psychometrics, G-theory approaches are well suited to assessing the reliability of optimal cut points (i.e., criterion scores) for distinguishing normal from abnormal function (e.g., Brennan, 2001; Vispoel et al., 2018a); for example, identifying an amplitude threshold to distinguish an abnormally large ERN.

6.2. Measurement Dimensionality and Invariance

Psychometric validation includes the examination of measurement invariance, which is a demonstration of psychometric equivalence across relevant conditions, occasions, or groups that theoretically should not impact scores (Meredith, 1964; Struening & Cohen, 1963; Vandenberg & Lance, 2000). The demonstration of measurement invariance justifies the comparison of conditions, occasions, or groups and ensures that differences in scores are not primarily due to artifacts of the measurement process (e.g., involvement of different processes). Typically, multigroup confirmatory factor analysis is used to examine measurement invariance of questionnaire data, and the loading of each item onto a latent factor is examined (see Putnick & Bornstein, 2016; Steinmetz et al., 2009).

However, in the typical ERP study, univariate analyses—not summed or average item scores from self-report data—are the primary foci. Analyses could include examining the relationship between an ERP component and a measure of impulsivity, clinical group differences in ERP amplitude, or between-condition ERP differences within persons. Unlike items on a questionnaire, ERP scores are typically considered parallel in that there is no special meaning to the first vs. second vs. third trial. For example, an ERP to an incongruent trial is considered exchangeable with any other incongruent trial in the task, and parallel items are an assumption of the averaging approach commonly used. A motivation of evaluating measurement invariance is to ensure that observations of condition, occasion, individual, or group differences are due to differences in the construct of interest rather than differences due to measurement issues.

Clinical studies of ERPs often focus on interpretations of how groups differ in ERP measures. The assumption is that any difference between ERPs is due to differences in the construct of interest. However, differences between groups could be due to differences in measurement error rather than differences in the construct of interest. High measurement error can cause magnitude (i.e., attenuated or exaggerated effects) and sign errors (i.e., change in pattern of effects, such as patients showing larger rather than smaller ERPs than controls) in between-group comparisons (Clayson et al., 2021c; Flegal et al., 2017; Gelman & Carlin, 2014) and in examinations of individual differences (Cooper et al., 2017; Hedge et al., 2017; Loken & Gelman, 2017; Rouder & Haaf, 2018; Seghier & Price, 2018). The observation of more error variance (i.e., within-person variability) in a clinical group than in a control group could be due to clinical factors related to the construct of interest or due to measurement issues (e.g., noisier data due to nuisance factors unrelated to the construct, such as movement artifact, attention, and fatigue). Directly examining between- and within-person variances can shed light on whether differences are due to a construct or due to measurement.

An extension of generalizability theory could include a straightforward statistical comparison using a multilevel location-scale model that examines between- and within-person variances. Traditional, location-only multilevel models include fixed and random effects to estimate conditional means but hold residual variance in the model constant across fixed/random effects (i.e., homoskedasticity). Multilevel location-scale models expand the multilevel structure to the scale portion of the model (i.e., variances of the data), permitting the simultaneous modeling of means (i.e., location) and variances (i.e., scale). Location-scale models have been successfully applied in ERP studies (Clayson et al., 2021c; Clayson et al., 2022c; Clayson et al., 2024; for commentary on the use of these models in studies of ERPs, see Volpert-Esmond, 2022) and can directly examine condition-related (e.g., trial type, clinical group, occasion) differences in within-person variances and determine the impact of covariates on those variances.
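
In equation form, a two-level location-scale model of trial-level amplitudes might be written as follows (a generic sketch, not the exact specification used in the cited studies), with a log link keeping within-person variances positive.

```latex
% Location (mean) submodel for trial t from person p; x_{tp} is a hypothetical trial-level covariate
y_{tp} = \beta_0 + \beta_1 x_{tp} + u_{p} + \varepsilon_{tp}, \qquad \varepsilon_{tp} \sim N\!\left(0, \sigma^2_{p}\right)
% Scale (variance) submodel: person-specific within-person variance with a hypothetical covariate z_p
\log \sigma_{p} = \eta_0 + \eta_1 z_{p} + v_{p}
% Random location and scale effects may covary across persons
(u_{p}, v_{p})^{\top} \sim \mathrm{MVN}(\mathbf{0}, \boldsymbol{\Sigma})
```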

6.3. Clinical Applications

The process of evaluating psychometric reliability and validity has implications for identifying biomarkers, which is an important focus of the NIMH Research Domain Criteria (RDoC) initiative. The RDoC initiative emphasizes examining the feasibility of psychological and neurophysiological measures of dimensional constructs with an eye toward clinical prediction and application (Cuthbert et al., 2023; Kozak & Cuthbert, 2016). The search for biomarkers of mental illness aims to improve precision medicine in psychiatry. This search will require explicit attention to the psychometrics of task measurements, psychophysiological or otherwise, applied to distinguish differences between people.

EEG and ERPs are often used in the search for tractable biomarkers because they are direct, non-invasive, and portable measures of neural activity. Some EEG/ERP features show promising clinical utility, such as ERN in anxiety (e.g., Hajcak et al., 2019), reward positivity in depression (e.g., Klawohn et al., 2020a), and auditory steady state responses in schizophrenia (Molina et al., 2020). However, their utility is limited by their psychometric properties and by the use of tasks that have not been optimized for clinical populations. As one example, standard practice employs tasks that show robust healthy vs. clinical group differences, but these tasks may not be well suited to examining dimensional relationships (see Infantolino et al., 2018). A focus on psychometric reliability can inform which tasks to use for distinguishing people and can provide a basis for determining the length of tasks. Unfortunately, psychophysiological data rarely receive close psychometric scrutiny, which leaves the reliability of many measures virtually unknown for clinical or even healthy samples. Unless the evaluation of reliability becomes standard practice, it is hard to imagine any biomarker gaining meaningful clinical traction in precision medicine.

The G theory formulas provided in Table 1 and implemented in the ERA Toolbox (Clayson & Miller, 2017a) can be used to optimize tasks for clinical psychophysiology research. Because unreliable data call into question a study’s conclusions, estimation of score reliability should be done before extensive data collection is underway, and the verification of score reliability should be an early step in data analysis. The demonstration of psychometric reliability in RDoC-inspired research is a precondition for between-subjects measurement comparison—including matching the “right individuals” to the “right treatment”. Guiding biomarker evaluation through psychometrics will pave the way for better selection of biomarkers and task development, ultimately improving the clinical utility of these biomarkers in precision medicine.
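To illustrate the general logic (not the exact formulas in Table 1 or the ERA Toolbox implementation, which also accommodate unbalanced designs), the sketch below estimates variance components for a balanced, fully crossed persons-by-trials design from its mean squares and then projects a dependability coefficient across candidate trial counts, the kind of decision study that can guide task length. The simulated variance components are arbitrary values chosen for illustration.

import numpy as np

rng = np.random.default_rng(4)

# Simulate a persons x trials matrix of ERP scores (e.g., amplitudes in uV)
n_persons, n_trials = 50, 20
person_effect = rng.normal(0.0, 3.0, size=(n_persons, 1))    # between-person variance
trial_effect = rng.normal(0.0, 1.0, size=(1, n_trials))      # systematic trial variance
residual = rng.normal(0.0, 6.0, size=(n_persons, n_trials))  # person x trial interaction + error
scores = -6.0 + person_effect + trial_effect + residual

# Estimate variance components from expected mean squares (random p x t design)
grand = scores.mean()
ss_p = n_trials * np.sum((scores.mean(axis=1) - grand) ** 2)
ss_t = n_persons * np.sum((scores.mean(axis=0) - grand) ** 2)
ss_res = np.sum((scores - grand) ** 2) - ss_p - ss_t
ms_p = ss_p / (n_persons - 1)
ms_t = ss_t / (n_trials - 1)
ms_res = ss_res / ((n_persons - 1) * (n_trials - 1))

var_pt_e = ms_res                                 # person x trial interaction + error
var_p = max((ms_p - ms_res) / n_trials, 0.0)      # person (between-subject) variance
var_t = max((ms_t - ms_res) / n_persons, 0.0)     # trial variance

# Decision study: project the absolute-decision dependability coefficient
for k in (4, 8, 16, 32):
    phi = var_p / (var_p + (var_t + var_pt_e) / k)
    print(f"{k:>2} trials: dependability = {phi:.2f}")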

7. Conclusion

The tools needed to rigorously assess the psychometric reliability of ERP scores and to validate ERP measures are now readily available. Generalizability theory is a multifaceted and flexible framework that is well suited to the psychometric analysis of ERPs and other trial-wise scores, and tractable software is already available to implement this measurement framework in studies of psychophysiological measurements (e.g., ERA Toolbox, https://peclayson.github.io/ERA_Toolbox/). The importance of routine evaluation of psychometrics in psychophysiology cannot be overstated; ERP work can only be as replicable as ERP measures are psychometrically reliable. As authors, journal reviewers, or grant reviewers, we have a responsibility to foster the integrity of the field by holding research to appropriately high standards concerning psychometrics. With the profound implications that psychophysiological research can have on theory and practice, there is simply too much at stake to ignore the crucial processes of evaluating psychometric reliability and validity.

Acknowledgements

The author thanks Scott A. Baldwin, Michael J. Larson, Gregory A. Miller, Walter P. Vispoel, and Cindy M. Yee-Bradbury for comments on an earlier draft of this manuscript. The present work was supported by an NIMH grant awarded to Peter Clayson (R01MH128208).

Footnotes

1

There are various theoretical explanations of ERN, including the reinforcement learning theory (Holroyd & Coles, 2002), the mismatch theory (Falkenstein et al., 1991; Gehring et al., 1993), and the conflict monitoring theory (Botvinick et al., 2001; Larson et al., 2014; Yeung et al., 2004). ERN likely represents a manifestation of a performance-monitoring system that supports behavioral adjustments (Weinberg et al., 2015), and ERN findings show relationships with cognitive and motivational processes, consistent with ERN’s potential role as a performance-monitoring signal (Larson et al., 2012; Larson et al., 2014; Olvet & Hajcak, 2008; Proudfit et al., 2013; Weinberg et al., 2015).

2

Barch and Mathalon (2011) provide a treatment of generalizability theory as it relates broadly to neural measurements. Strube and Newman (2007) describe generalizability theory with broad applications to psychophysiological measurements.

3

Item response theory (IRT) is another common psychometric theory with a different purpose than CT theory or G theory (Lord & Novick, 1968). Broadly speaking, IRT focuses on estimating the relationship between latent traits and scores on individual items, describing item characteristics like difficulty and discrimination. Whereas IRT emphasizes understanding item-level relationships with latent traits, CT theory and G theory are test-level theories, focusing on the reliability of composite scores. For an overview of the three measurement theories, see DeMars (2018), and for an application of IRT to identifying emotional images well suited to eliciting the late positive potential, see Wilson et al. (2021).

4

Reliability is a term that is used a bit differently across fields. Generally speaking, reliability refers to a quantification of consistencies or inconsistencies in measurements. In psychology, psychometric reliability typically refers to how clearly average scores distinguish differences between people (i.e., between-person variability) after considering the scores that contribute to those averages (i.e., within-person variability). However, in physics or engineering, reliability refers to how consistently an instrument measures a quantity (see Brandmaier et al., 2018). Reliability in the present review refers to the notion of psychometric reliability as used in the field of psychology.

5

Some ERP scores might systematically change over the course of a task (e.g., Brush et al., 2018; Volpert-Esmond et al., 2018). If these changes in ERP scores are ignored during statistical analysis, then the psychometric reliability analysis should similarly ignore them so that psychometric reliability estimates reflect how the data were analyzed. However, if, for example, time on task is considered in the statistical analysis, then time on task could be treated as a facet in the psychometric reliability analysis, with the appropriate levels included to match the task.

6

For a discussion of the assumptions of strictly parallel forms and the less restrictive assumption of tau-equivalence as they relate to split-half internal consistency estimates of ERP scores, see the appendix of Clayson et al. (2021a).

7

An interesting publication describes Cronbach’s thinking leading to the development of coefficient alpha and his reflections on alpha 50 years after the seminal work was published. Cronbach also described his misgivings about the use of coefficient alpha in many contexts and advocated for generalizability theory (Cronbach, 2004).

8

Two general-purpose software packages likely to be of interest are mGENOVA (https://education.uiowa.edu/casma/computer-programs) and gtheory (Moore, 2016).

9

A multiverse analysis is a transparent approach that considers different reasonable decisions for recording, processing, or analyzing data. For example, a multiverse analysis could examine task-specific relationships in ERN scores recorded during different versions of a flanker task (Clayson et al., 2023). A review of multiverse analyses of psychophysiological data and a walkthrough of how to perform them is provided in Clayson (2023).

10

In many contexts, power is assumed to be amplitude squared. Therefore, when trials are first averaged together before estimating power, some noise cancels out during the averaging. However, power estimated from single trials and then averaged retains some background noise, because power values are all positive (amplitude squared) and therefore do not cancel during averaging.
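This point can be verified numerically; the brief sketch below assumes noise-only epochs (a simplifying assumption for illustration) and shows that the power of the averaged waveform shrinks toward zero as trials accumulate, whereas the average of single-trial power estimates retains a positive noise floor.

import numpy as np

rng = np.random.default_rng(5)
trials = rng.normal(0.0, 1.0, size=(100, 500))   # 100 noise-only trials, 500 samples each

power_of_average = np.mean(trials.mean(axis=0) ** 2)   # noise cancels before squaring
average_of_power = np.mean(trials ** 2)                # positive values cannot cancel

print(f"power of the averaged waveform: {power_of_average:.3f}")  # ~0.01; shrinks with trial count
print(f"average single-trial power:     {average_of_power:.3f}")  # ~1.00; noise floor remains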

11

Although qualitative interpretations of validity coefficients are often arbitrary, they can provide useful benchmarks for comparison. Estimates of convergent validity are considered strong when the magnitude of correlations is above .70 and are considered weak or lacking support when the magnitude of the relationships is between .50 and .70 (Carlson & Herdman, 2012; Post, 2016). Based on these thresholds, support for convergent validity might be considered weak.

12

For a detailed history of conceptions of validity, consult Kane (2001).

13

A common assumption is that psychological phenomena are reducible to biology, but there remains an ontological gap between psychology and biology that undermines such reductionistic attempts. The philosophical issues and infeasibility of biological reductionism are described elsewhere (Borsboom et al., 2019; Miller, 1996, 2010; Miller & Bartholomew, 2020; Miller et al., 2014; Miller & Keller, 2000; Schwartz et al., 2016), and there are tractable alternatives to biological reductionism (Sharp & Miller, 2019; Thomas & Sharp, 2019).

14

Trial number of an ERP is typically not important or meaningful. It is unlikely that a person’s third ERP trial is related to another person’s third ERP trial in the same way that questionnaire items are structured. In a study of clinical depression, for example, a person’s answer to an item about anhedonia might be related to another person’s answer to that same item. Therefore, ERP trials are considered exchangeable with any other trial (see Baldwin et al., 2015; Clayson & Miller, 2017a).

15

Although the example here is about clinical group differences, the same concepts apply to other differences, such as differences in conditions, time (time 1 vs. time 2), or any external correlate (e.g., dimensional analyses of personality characteristics).

References

  1. Baldwin SA, Larson MJ, & Clayson PE (2015). The dependability of electrophysiological measurements of performance monitoring in a clinical sample: A generalizability and decision analysis of the ERN and Pe. Psychophysiology, 52(6), 790–800. 10.1111/psyp.12401 [DOI] [PubMed] [Google Scholar]
  2. Barch DM, & Mathalon DH (2011). Using brain imaging measures in studies of procognitive pharmacologic agents in schizophrenia: Psychometric and quality assurance considerations. Biological Psychiatry, 70(1), 13–18. 10.1016/j.biopsych.2011.01.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bigler ED, & Dodrill CB (1997). Assessment of neuropsychological testing. Neurology, 49(4), 1180–1182. 10.1212/WNL.49.4.1180-a [DOI] [PubMed] [Google Scholar]
  4. Boen R, Quintana DS, Ladouceur CD, & Tamnes CK (2022). Age-related differences in the error-related negativity and error positivity in children and adolescents are moderated by sample and methodological characteristics: A meta-analysis. Psychophysiology, 59(6), e14003. 10.1111/psyp.14003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Borsboom D (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge University Press. [Google Scholar]
  6. Borsboom D, Cramer AOJ, & Kalis A (2019). Reductionism in retreat. Behavioral and Brain Sciences, 42. 10.1017/S0140525X18002091 [DOI] [PubMed] [Google Scholar]
  7. Borsboom D, & Mellenbergh GJ (2007). Test validity in cognitive assessment. In Leighton JP & Gierl MJ (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 85–115). [Google Scholar]
  8. Borsboom D, Mellenbergh GJ, & van Heerden J (2004). The concept of validity. Psychological Review, 111(4), 1061–1071. 10.1037/0033-295X.111.4.1061 [DOI] [PubMed] [Google Scholar]
  9. Botvinick MW, Carter CS, Braver TS, Barch DM, & Cohen JD (2001). Conflict monitoring and cognitive control. Psychological Review, 108, 624–652. 10.1037/0033-295X.108.3.624 [DOI] [PubMed] [Google Scholar]
  10. Brandmaier AM, Wenger E, Bodammer NC, Kühn S, Raz N, & Lindenberger U (2018). Assessing reliability in neuroimaging research through intra-class effect decomposition (ICED). Elife, 7. 10.7554/eLife.35718.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Brennan RL (2000). (Mis)conceptions about generalizability theory. Educational Measurement: Issues and Practice, 19(1), 5–10. 10.1111/j.1745-3992.2000.tb00017.x [DOI] [Google Scholar]
  12. Brennan RL (2001). Generalizability theory: Statistics for social science and public policy. Springer-Verlag. [Google Scholar]
  13. Brennan RL (2003). Coefficients and indices in generalizability theory. In CASMA Research Report No. 1 (pp. 1–48). Center for Advanced Studies in Measurement and Assessment. https://www.education.uiowa.edu/docs/default-source/casma---research/01casmareport.pdf?sfvrsn=2 [Google Scholar]
  14. Brennan RL (2010). Generalizability theory and classical test theory. Applied Measurement in Education, 24(1), 1–21. 10.1080/08957347.2011.532417 [DOI] [Google Scholar]
  15. Brown W (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3(3), 196–322. 10.1111/j.2044-8295.1910.tb00207.x [DOI] [Google Scholar]
  16. Brush CJ, Ehmann PJ, Hajcak G, Selby EA, & Alderman BL (2018). Using multilevel modeling to examine blunted neural responses to reward in major depression. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 1–8. 10.1016/j.bpsc.2018.04.003 [DOI] [PubMed] [Google Scholar]
  17. Campbell DT, & Fiske DW (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105. 10.1037/h0046016 [DOI] [PubMed] [Google Scholar]
  18. Carbine KA, Clayson PE, Baldwin SA, LeCheminant J, & Larson MJ (2021). Using generalizability theory and the ERP Reliability Analysis (ERA) Toolbox for assessing test-retest reliability of ERP scores Part 2: Application to food-based tasks and stimuli. International Journal of Psychophysiology, 166, 188–198. 10.1016/j.ijpsycho.2021.02.015 [DOI] [PubMed] [Google Scholar]
  19. Carlson KD, & Herdman AO (2012). Understanding the impact of convergent validity on research results. Organizational Research Methods, 15(1), 17–32. 10.1177/1094428110392383 [DOI] [Google Scholar]
  20. Charter RA (2001). It is time to bury the Spearman-Brown “prophecy” formula for some common applications. Educational and Psychological Measurement, 61(4), 690–696. 10.1177/00131640121971446 [DOI] [Google Scholar]
  21. Cho E (2016). Making reliability reliable. Organizational Research Methods, 19(4), 651–682. 10.1177/1094428116656239 [DOI] [Google Scholar]
  22. Clayson PE (2020). Moderators of the internal consistency of error-related negativity scores: A meta-analysis of internal consistency estimates. Psychophysiology, 57(8), e13583. 10.1111/psyp.13583 [DOI] [PubMed] [Google Scholar]
  23. Clayson PE (2023). Beyond single paradigms, pipelines, and outcomes: Embracing multiverse analyses in psychophysiology. PsyArXiv. 10.31234/osf.io/5eawy [DOI] [PubMed] [Google Scholar]
  24. Clayson PE, Baldwin SA, & Larson MJ (2013). How does noise affect amplitude and latency measurement of event-related potentials (ERPs)? A methodological critique and simulation study. Psychophysiology, 50, 174–186. 10.1111/psyp.12001 [DOI] [PubMed] [Google Scholar]
  25. Clayson PE, Baldwin SA, & Larson MJ (2021a). Evaluating the internal consistency of subtraction-based and residualized difference scores: Considerations for psychometric reliability analyses of event-related potentials. Psychophysiology, 58(4), e13762. 10.1111/psyp.13762 [DOI] [PubMed] [Google Scholar]
  26. Clayson PE, Baldwin SA, Rocha HA, & Larson MJ (2021b). The data-processing multiverse of event-related potentials (ERPs): A roadmap for the optimization and standardization of ERP processing and reduction pipelines. NeuroImage, 245, 118712. 10.1016/j.neuroimage.2021.118712 [DOI] [PubMed] [Google Scholar]
  27. Clayson PE, Brush CJ, & Hajcak G (2021c). Data quality and reliability metrics for event-related potentials (ERPs): The utility of subject-level reliability. International Journal of Psychophysiology, 165, 121–136. 10.1016/j.ijpsycho.2021.04.004 [DOI] [PubMed] [Google Scholar]
  28. Clayson PE, Carbine KA, Baldwin SA, & Larson MJ (2019). Methodological reporting behavior, sample sizes, and statistical power in studies of event‐related potentials: Barriers to reproducibility and replicability. Psychophysiology, 111(6), 5–17. 10.1111/psyp.13437 [DOI] [PubMed] [Google Scholar]
  29. Clayson PE, Carbine KA, Baldwin SA, Olsen JA, & Larson MJ (2021d). Using generalizability theory and the ERP Reliability Analysis (ERA) Toolbox for assessing test-retest reliability of ERP scores Part 1: Algorithms, framework, and implementation. International Journal of Psychophysiology, 166, 174–187. 10.1016/j.ijpsycho.2021.01.006 [DOI] [PubMed] [Google Scholar]
  30. Clayson PE, Joshi YB, Thomas ML, Sprock J, Nungaray J, Swerdlow NR, & Light GA (2022a). Click-evoked auditory brainstem responses (ABRs) are intact in schizophrenia and not sensitive to cognitive training. Biomarkers in Neuropsychiatry, 6, 100046. 10.1016/j.bionps.2022.100046 [DOI] [Google Scholar]
  31. Clayson PE, Joshi YB, Thomas ML, Tarasenko M, Bismark A, Sprock J, Nungaray J, Cardoso L, Wynn JK, Swerdlow N, & Light GA (2022b). The viability of the frequency following response characteristics for use as biomarkers of cognitive therapeutics in schizophrenia. Schizophrenia Research, 243, 372–382. 10.1016/j.schres.2021.06.022 [DOI] [PubMed] [Google Scholar]
  32. Clayson PE, Kappenman ES, Gehring WJ, Miller GA, & Larson MJ (2021e). A commentary on establishing norms for error-related brain activity during the arrow flanker task among young adults. NeuroImage, 234, 117932. 10.1016/j.neuroimage.2021.117932 [DOI] [PubMed] [Google Scholar]
  33. Clayson PE, McDonald JB, Park B, Holbrook A, Baldwin SA, Riesel A, & Larson MJ (in press). Registered replication report of the construct validity of the error-related negativity (ERN): A multi-site study of task-specific ERN correlations with internalizing and externalizing symptoms. Psychophysiology. 10.1111/psyp.14496 [DOI] [PubMed] [Google Scholar]
  34. Clayson PE, & Miller GA (2017a). ERP Reliability Analysis (ERA) Toolbox: An open-source toolbox for analyzing the reliability of event-related potentials. International Journal of Psychophysiology, 111, 68–79. 10.1016/j.ijpsycho.2016.10.012 [DOI] [PubMed] [Google Scholar]
  35. Clayson PE, & Miller GA (2017b). Psychometric considerations in the measurement of event-related brain potentials: Guidelines for measurement and reporting. International Journal of Psychophysiology, 111, 57–67. 10.1016/j.ijpsycho.2016.09.005 [DOI] [PubMed] [Google Scholar]
  36. Clayson PE, Molina J, Joshi YB, Thomas ML, Sprock J, Nungaray J, Swerdlow N, & Light GA (2021f). Evaluation of the frequency following response as a predictive biomarker of response to cognitive training in schizophrenia. Psychiatry Research, 305, 114239. 10.1016/j.psychres.2021.114239 [DOI] [PubMed] [Google Scholar]
  37. Clayson PE, Rocha HA, Baldwin SA, Rast P, & Larson MJ (2022c). Understanding the error in psychopathology: Notable intraindividual differences in neural variability of performance monitoring. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 7(6), 555–565. 10.1016/j.bpsc.2021.10.016 [DOI] [PubMed] [Google Scholar]
  38. Clayson PE, Rocha HA, McDonald JB, Baldwin SA, & Larson MJ (2023). A registered report of a two-site study of variations of the flanker task: ERN experimental effects and data quality. PsyArXiv. 10.31234/osf.io/a3s42 [DOI] [PubMed] [Google Scholar]
  39. Clayson PE, Shuford J, Rast P, Baldwin SA, Weissman DH, & Larson MJ (2024). Normal congruency sequence effects in psychopathology: A behavioral and electrophysiological examination using a confound-minimized design. Psychophysiology, 61(1), e14426. 10.1111/psyp.14426 [DOI] [PubMed] [Google Scholar]
  40. Cooper SR, Gonthier C, Barch DM, & Braver TS (2017). The role of psychometrics in individual differences research in cognition: A case study of the AX-CPT. Frontiers in Psychology, 8, 136–116. 10.3389/fpsyg.2017.01482 [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Cronbach LJ (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. 10.1007/BF02310555 [DOI] [Google Scholar]
  42. Cronbach LJ (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64(3), 391–418. 10.1177/0013164404266386 [DOI] [Google Scholar]
  43. Cronbach LJ, Gleser GC, Nanda H, & Rajaratnum N (1972). The dependability of behavioral measures: Theory of generalizability for scores and profiles. John Wiley. [Google Scholar]
  44. Cronbach LJ, & Meehl PE (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. 10.1037/h0040957 [DOI] [PubMed] [Google Scholar]
  45. Cronbach LJ, Rajaratnam N, & Gleser GC (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16(2), 137–163. 10.1111/j.2044-8317.1963.tb00206.x [DOI] [Google Scholar]
  46. Cuthbert BN, Miller GA, Sanislow C, & Vaidyanathan U (2023). The research domain criteria project: Integrative translation for psychopathology. In Blaney PH & Krueger RF (Eds.), Oxford Textbook of Psychopathology (Third ed., pp. 78–102). Oxford University Press. [Google Scholar]
  47. DeMars CE (2018). Classical Test Theory and Item Response Theory. In The Wiley Handbook of Psychometric Testing (pp. 49–73). 10.1002/9781118489772.ch2 [DOI] [Google Scholar]
  48. Donchin E, & Heffley EF (1978). Multivariate analysis of event-related potential data: A tutorial review. In Otto D (Ed.), Multidisciplinary perspectives in event-related brain potential research (pp. 555–572). US Government Printing Office. [Google Scholar]
  49. Fahrenberg J, Foerster F, Schneider H-J, Müller W, & Myrtek M (1986). Predictability of individual differences in activation processes in a field setting based on laboratory measures. Psychophysiology, 23(3), 323–333. 10.1111/j.1469-8986.1986.tb00640.x [DOI] [PubMed] [Google Scholar]
  50. Falkenstein M, Hohnsbein J, Hoormann J, & Banke L (1991). Effects of crossmodal divided attention on late ERP components. II. Error processing in choice reaction tasks. Electroencephalography and clinical neurophysiology, 78, 447–455. 10.1016/0013-4694(91)90062-9 [DOI] [PubMed] [Google Scholar]
  51. Feldt LS, & Charter RA (2003). Estimating the reliability of a test split into two parts of equal or unequal length. Psychological Methods, 8(1), 102–109. 10.1037/1082-989X.8.1.102 [DOI] [PubMed] [Google Scholar]
  52. Fiske DW (1971). Measuring the concepts of personality. Aldine. [Google Scholar]
  53. Flegal KM, Kit BK, & Graubard BI (2017). Bias in hazard ratios arising From misclassification according to self-reported weight and height in observational studies of body mass index and mortality. American Journal of Epidemiology, 187(1), 125–134. 10.1093/aje/kwx193 [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Folstein JR, & Van Petten C (2008). Influence of cognitive control and mismatch on the N2 component of the ERP: A review. Psychophysiology, 45(1), 152–170. 10.1111/j.1469-8986.2007.00602.x [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Gehring WJ, Goss B, Coles MGH, Meyer DE, & Donchin E (1993). A neural system for error detection and compensation. Psychological Science, 4, 385–390. 10.1111/j.1467-9280.1993.tb00586.x [DOI] [Google Scholar]
  56. Gehring WJ, Liu Y, Orr JM, & Carp J (2012). The error-related negativity (ERN/Ne). In Luck SJ & Kappenman ES (Eds.), Oxford handbook of event-related potential components (pp. 231–291). Oxford University Press. [Google Scholar]
  57. Gelman A, & Carlin J (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641–651. 10.1177/1745691614551642 [DOI] [PubMed] [Google Scholar]
  58. Hajcak G, Klawohn J, & Meyer A (2019). The utility of event-related potentials in clinical psychology. Annual Review of Clinical Psychology, 15(1), 71–95. 10.1146/annurev-clinpsy-050718-095457 [DOI] [PubMed] [Google Scholar]
  59. Hedge C, Powell G, & Sumner P (2017). The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods, 103(3), 411–421. 10.3758/s13428-017-0935-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Holroyd CB, & Coles MGH (2002). The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109(4), 679–709. 10.1037/0033-295x.109.4.679 [DOI] [PubMed] [Google Scholar]
  61. Imburgio MJ, Banica I, Hill KE, Weinberg A, Foti D, & Macnamara A (2020). Establishing norms for error-related brain activity during the arrow Flanker task among young adults. NeuroImage, 213, 116694. 10.1016/j.neuroimage.2020.116694 [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Infantolino ZP, Luking KR, Sauder CL, Curtin JJ, & Hajcak G (2018). Robust is not necessarily reliable: From within-subjects fMRI contrasts to between-subjects comparisons. NeuroImage, 173, 146–152. 10.1016/j.neuroimage.2018.02.024 [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. International Journal of Psychophysiology; Guide for Authors. (2023). Retrieved 08/22/23 from https://www.elsevier.com/journals/international-journal-of-psychophysiology/0167-8760/guide-for-authors
  64. Kane MT (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342. 10.1111/j.1745-3984.2001.tb01130.x [DOI] [Google Scholar]
  65. Kappenman ES, & Luck SJ (2012). ERP components: The ups and downs of brainwave recordings. In Luck SJ & Kappenman ES (Eds.), The Oxford handbook of event-related potential components (pp. 3–30). Oxford University Press, Inc. [Google Scholar]
  66. Kelley TL (1927). Interpretation of educational measurements. World Book Co. [Google Scholar]
  67. Klawohn J, Burani K, Bruchnak A, Santopetro N, & Hajcak G (2020a). Reduced neural response to reward and pleasant pictures independently relate to depression. Psychological Medicine, 59, 1–9. 10.1017/S0033291719003659 [DOI] [PubMed] [Google Scholar]
  68. Klawohn J, Meyer A, Weinberg A, & Hajcak G (2020b). Methodological choices in event-related potential (ERP) research and their impact on internal consistency reliability and individual differences: An examination of the error-related negativity (ERN) and anxiety. Journal of Abnormal Psychology, 129(1), 29–37. 10.1037/abn0000458 [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Kozak MJ, & Cuthbert BN (2016). The NIMH Research Domain Criteria Initiative: Background, issues, and pragmatics. Psychophysiology, 53(3), 286–297. 10.1111/psyp.12518 [DOI] [PubMed] [Google Scholar]
  70. Krizman J, & Kraus N (2019). Analyzing the FFR: A tutorial for decoding the richness of auditory function. Hearing research, 382, 107779–107716. 10.1016/j.heares.2019.107779 [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Larson MJ, Clayson PE, & Baldwin SA (2012). Performance monitoring following conflict: Internal adjustments in cognitive control? Neuropsychologia, 50(3), 426–433. 10.1016/j.neuropsychologia.2011.12.021 [DOI] [PubMed] [Google Scholar]
  72. Larson MJ, Clayson PE, & Clawson A (2014). Making sense of all the conflict: A theoretical review and critique of conflict-related ERPs. International Journal of Psychophysiology, 93(3), 283–297. 10.1016/j.ijpsycho.2014.06.007 [DOI] [PubMed] [Google Scholar]
  73. Li G (2023). Which method is optimal for estimating variance components and their variability in generalizability theory? evidence form a set of unified rules for bootstrap method. PLoS ONE, 18(7), e0288069. 10.1371/journal.pone.0288069 [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Light GA, Joshi YB, Molina JL, Bhakta SG, Nungaray JA, Cardoso L, Kotz JE, Thomas ML, & Swerdlow NR (2020). Neurophysiological biomarkers for schizophrenia therapeutics. Biomarkers in Neuropsychiatry, 2, 100012. 10.1016/j.bionps.2020.100012 [DOI] [Google Scholar]
  75. Light GA, & Swerdlow NR (2020). Selection criteria for neurophysiologic biomarkers to accelerate the pace of CNS therapeutic development. Neuropsychopharmacology, 45(1), 237–238. 10.1038/s41386-019-0519-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  76. Lilienfeld SO, & Strother AN (2020). Psychological measurement and the replication crisis: Four sacred cows. Canadian Psychology, 61(4), 281–288. 10.1037/cap0000236 [DOI] [Google Scholar]
  77. Loevinger J (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635–694. 10.2466/pr0.1957.3.3.635 [DOI] [Google Scholar]
  78. Loken E, & Gelman A (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585. 10.1126/science.aal3618 [DOI] [PubMed] [Google Scholar]
  79. Lord FM, & Novick MR (1968). Statistical theories of mental test scores. Addison-Wesley. [Google Scholar]
  80. Luck SJ (2014). An introduction to the event-related potential technique (2nd ed.). The MIT Press. [Google Scholar]
  81. Luck SJ, & Gaspelin N (2017). How to get statistically significant effects in any ERP experiment (and why you shouldn’t). Psychophysiology, 54(1), 146–157. 10.1111/psyp.12639 [DOI] [PMC free article] [PubMed] [Google Scholar]
  82. Luck SJ, Stewart AX, Simmons AM, & Rhemtulla M (2021). Standardized measurement error: A universal metric of data quality for averaged event-related potentials. Psychophysiology, 58(6), e13793. 10.1111/psyp.13793 [DOI] [PMC free article] [PubMed] [Google Scholar]
  83. Lutz MC, Kok R, Verveer I, Malbec M, Koot S, van Lier PAC, & Franken IHA (2021). Diminished error-related negativity and error positivity in children and adults with externalizing problems and disorders: A meta-analysis on error processing. Journal of Psychiatry and Neuroscience, 46, E615–E627. 10.1503/jpn.200031 [DOI] [PMC free article] [PubMed] [Google Scholar]
  84. Martin EA, McCleery A, Moore MM, Wynn JK, Green MF, & Horan WP (2018). ERP indices of performance monitoring and feedback processing in psychosis: A meta-analysis. International Journal of Psychophysiology, 132, 365–378. 10.1016/j.ijpsycho.2018.08.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. Mathews CA, Perez VB, Delucchi KL, & Mathalon DH (2012). Error-related negativity in individuals with obsessive-compulsive symptoms: Toward an understanding of hoarding behaviors. Biological Psychology, 89(2), 487–494. 10.1016/j.biopsycho.2011.12.018 [DOI] [PMC free article] [PubMed] [Google Scholar]
  86. Meehl PE (1986). Diagnostic taxa as open concepts: Metatheoretical and statistical questions about reliability and construct validity in the grand strategy of nosological revision. In Millon T & Klerman GL (Eds.), Contemporary directions in psychopathology: Toward the DSM-IV. (pp. 215–231). The Guilford Press. [Google Scholar]
  87. Meredith W (1964). Notes on factorial invariance. Psychometrika, 29(2), 177–185. 10.1007/BF02289699 [DOI] [Google Scholar]
  88. Messick S (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11. 10.3102/0013189X018002005 [DOI] [Google Scholar]
  89. Miller GA (1996). Presidential address: How we think about cognition, emotion, and biology in psychopathology. Psychophysiology, 33(6), 615–628. 10.1111/j.1469-8986.1996.tb02356.x [DOI] [PubMed] [Google Scholar]
  90. Miller GA (2010). Mistreating psychology in the decades of the brain. Perspectives on Psychological Science, 5(6), 716–743. 10.1177/1745691610388774 [DOI] [PMC free article] [PubMed] [Google Scholar]
  91. Miller GA, & Bartholomew ME (2020). Challenges in the relationships between psychological and biological phenomena in psychopathology. In Philosophical Issues in Psychiatry V: The Problems of Multiple Levels, Explanatory Pluralism, Reduction and Emergence. Chapter based on an invited lecture presented at the conference, May 2018, Copenhagen, Denmark. [Google Scholar]
  92. Miller GA, Clayson PE, & Yee CM (2014). Hunting genes, hunting endophenotypes. Psychophysiology, 51(12), 1329–1330. 10.1111/psyp.12354 [DOI] [PMC free article] [PubMed] [Google Scholar]
  93. Miller GA, & Keller J (2000). Psychology and neuroscience: Making peace. Current Directions in Psychological Science, 9(6), 212–215. 10.1111/1467-8721.00097 [DOI] [Google Scholar]
  94. Molina JL, Thomas ML, Joshi YB, Hochberger WC, Koshiyama D, Nungaray JA, Cardoso L, Sprock J, Braff DL, Swerdlow NR, & Light GA (2020). Gamma oscillations predict pro-cognitive and clinical response to auditory-based cognitive training in schizophrenia. Translational Psychiatry, 10(1), 405. 10.1038/s41398-020-01089-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  95. Moore CT (2016). gtheory: Apply generalizability theory with R. In (Version 0.1.2) https://CRAN.R-project.org/package=gtheory
  96. Nieuwenhuis S, Ridderinkhof KR, Blom J, Band GP, & Kok A (2001). Error-related brain potentials are differentially related to awareness of response errors: Evidence from an antisaccade task. Psychophysiology, 38, 752–760. 10.1111/1469-8986.3850752 [DOI] [PubMed] [Google Scholar]
  97. Novick MR (1965). The axioms and principal results of classical test theory. ETS Research Bulletin Series, 1965(1), i–31. 10.1002/j.2333-8504.1965.tb00132.x [DOI] [Google Scholar]
  98. Novick MR, & Lewis C (1967). Coefficient alpha and the reliability of composite measurements. Psychometrika, 32(1), 1–13. 10.1007/BF02289400 [DOI] [PubMed] [Google Scholar]
  99. Olvet DM, & Hajcak G (2008). The error-related negativity (ERN) and psychopathology: Toward an endophenotype. Clinical Psychology Review, 28, 1343–1354. 10.1016/j.cpr.2008.07.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  100. Overbeek TJM, Nieuwenhuis S, & Ridderinkhof KR (2005). Dissociable components of error processing: On the functional significance of the Pe vis-à-vis the ERN/Ne. Journal of Psychophysiology, 19, 319–329. 10.1027/0269-8803.19.4.319 [DOI] [Google Scholar]
  101. Pasion R, & Barbosa F (2019). ERN as a transdiagnostic marker of the internalizing-externalizing spectrum: A dissociable meta-analytic effect. Neuroscience & Biobehavioral Reviews, 103, 133–149. 10.1016/j.neubiorev.2019.06.013 [DOI] [PubMed] [Google Scholar]
  102. Post MW (2016). What to Do With “Moderate” Reliability and Validity Coefficients? Archives of Physical Medicine and Rehabilitation, 97(7), 1051–1052. 10.1016/j.apmr.2016.04.001 [DOI] [PubMed] [Google Scholar]
  103. Proudfit GH, Inzlicht M, & Mennin DS (2013). Anxiety and error monitoring: The importance of motivation and emotion. Frontiers In Human Neuroscience, 7, 636. 10.3389/fnhum.2013.00636 [DOI] [PMC free article] [PubMed] [Google Scholar]
  104. Psychophysiology; Author Guidelines. (2023). Retrieved 08/22/23 from https://onlinelibrary.wiley.com/page/journal/14698986/homepage/forauthors.html
  105. Putnick DL, & Bornstein MH (2016). Measurement invariance conventions and reporting: The state of the art and future directions for psychological research. Developmental Review, 41, 71–90. 10.1016/j.dr.2016.06.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  106. Riesel A (2019). The erring brain: Error-related negativity as an endophenotype for OCD-A review and meta-analysis. Psychophysiology, 56(4), e13348. 10.1111/psyp.13348 [DOI] [PubMed] [Google Scholar]
  107. Riesel A, Weinberg A, Endrass T, Meyer A, & Hajcak G (2013). The ERN is the ERN is the ERN? Convergent validity of error-related brain activity across different tasks. Biological Psychology, 93(3), 377–385. 10.1016/j.biopsycho.2013.04.007 [DOI] [PubMed] [Google Scholar]
  108. Rocha HA, Marks J, Woods AJ, Staud R, Sibille K, & Keil A (2020). Re-test reliability and internal consistency of EEG alpha-band oscillations in older adults with chronic knee pain. Clinical Neurophysiology, 131(11), 2630–2640. 10.1016/j.clinph.2020.07.022 [DOI] [PMC free article] [PubMed] [Google Scholar]
  109. Rouder J, & Haaf JM (2018). A psychometrics of individual differences in experimental tasks. Psychonomic Bulletin & Review, 26, 452–467. 10.3758/s13423-018-1558-y [DOI] [PubMed] [Google Scholar]
  110. Russell EW, Russell SLK, & Hill BD (2005). The fundamental psychometric status of neuropsychological batteries. Archives of Clinical Neuropsychology, 20(6), 785–794. 10.1016/j.acn.2005.05.001 [DOI] [PubMed] [Google Scholar]
  111. Sandre A, Banica I, Riesel A, Flake J, Klawohn J, & Weinberg A (2020). Comparing the effects of different methodological decisions on the error-related negativity and its association with behaviour and genders. International Journal of Psychophysiology, 156, 18–39. 10.1016/j.ijpsycho.2020.06.016 [DOI] [PubMed] [Google Scholar]
  112. Schwartz SJ, Lilienfeld SO, Meca A, & Sauvigné KC (2016). The role of neuroscience within psychology: A call for inclusiveness over exclusiveness. American Psychologist, 71(1), 52–70. 10.1037/a0039678 [DOI] [PubMed] [Google Scholar]
  113. Seghier ML, & Price CJ (2018). Interpreting and utilising intersubject variability in brain function. Trends in Cognitive Sciences, 22(6), 517–530. 10.1016/j.tics.2018.03.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  114. Sharp PB, & Miller GA (2019). Reduction and autonomy in psychology and neuroscience: A call for pragmatism. Journal of Theoretical and Philosophical Psychology, 39(1), 18–31. 10.1037/teo0000085 [DOI] [Google Scholar]
  115. Shavelson RJ, & Webb NM (1991). Generalizability theory: A primer. SAGE Publications, Inc. [Google Scholar]
  116. Shavelson RJ, Webb NM, & Rowley GL (1989). Generalizability theory. American Psychologist, 44(6), 922–932. 10.1037/0003-066X.44.6.922 [DOI] [Google Scholar]
  117. Spearman C (1910). Correlation calculated from faulty data. British Journal of Psychology, 3(3), 271–295. 10.1111/j.2044-8295.1910.tb00206.x [DOI] [Google Scholar]
  118. Steinhauser M, & Yeung N (2012). Error awareness as evidence accumulation: Effects of speed-accuracy trade-off on error signaling. Frontiers In Human Neuroscience, 6, 240. 10.3389/fnhum.2012.00240 [DOI] [PMC free article] [PubMed] [Google Scholar]
  119. Steinmetz H, Schmidt P, Tina-Booh A, Wieczorek S, & Schwartz SH (2009). Testing measurement invariance using multigroup CFA: Differences between educational groups in human values measurement. Quality & Quantity, 43(4), 599–616. 10.1007/s11135-007-9143-x [DOI] [Google Scholar]
  120. Strube MJ, & Newman LC (2007). Psychometrics. In Cacioppo JT, Tassinary LG, & Berntson GG (Eds.), Handbook of Psychophysiology (3rd ed., pp. 789–811). Cambridge University Press. [Google Scholar]
  121. Struening EL, & Cohen J (1963). Factorial invariance and other psychometric characteristics of five opinions about mental illness factors. Educational and Psychological Measurement, 23(2), 289–298. 10.1177/001316446302300206 [DOI] [Google Scholar]
  122. Szucs D, & Ioannidis JPA (2020). Sample size evolution in neuroimaging research: An evaluation of highly-cited studies (1990–2012) and of latest practices (2017–2018) in high-impact journals. NeuroImage, 221, 117164. 10.1016/j.neuroimage.2020.117164 [DOI] [PubMed] [Google Scholar]
  123. Thomas JG, & Sharp PB (2019). Mechanistic science: A new approach to comprehensive psychopathology research that relates psychological and biological phenomena. Clinical psychological science, 7(2), 196–215. 10.1177/2167702618810223 [DOI] [Google Scholar]
  124. Thompson B (2003). Guidelines for authors reporting score reliability estimates. In Thompson B (Ed.), Score reliability: Contemporary thinking on reliability issues (pp. 91–101). Sage Publications, Inc. [Google Scholar]
  125. Ullsperger M, Harsay HA, Wessel JR, & Ridderinkhof KR (2010). Conscious perception of errors and its relation to the anterior insula. Brain Structure & Function, 214(5–6), 629–643. 10.1007/s00429-010-0261-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  126. Vacha-Haase T (1998). Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement, 58(1), 6–20. 10.1177/0013164498058001002 [DOI] [Google Scholar]
  127. Vandenberg RJ, & Lance CE (2000). A Review and Synthesis of the Measurement Invariance Literature: Suggestions, Practices, and Recommendations for Organizational Research. Organizational Research Methods, 3(1), 4–70. 10.1177/109442810031002 [DOI] [Google Scholar]
  128. Vispoel WP, Lee H, Chen T, & Hong H (2023). Using structural equation modeling to reproduce and extend ANOVA-based generalizability theory analyses for psychological assessments. Psych, 5(2), 249–273. 10.3390/psych5020019 [DOI] [Google Scholar]
  129. Vispoel WP, Morris CA, & Kilinc M (2018a). Applications of generalizability theory and their relations to classical test theory and structural equation modeling. Psychological Methods, 23(1), 1–26. 10.1037/met0000107 [DOI] [PubMed] [Google Scholar]
  130. Vispoel WP, Morris CA, & Kilinc M (2018b). Practical applications of generalizability theory for designing, evaluating, and improving psychological assessments. Journal of Personality Assessment, 100(1), 53–67. 10.1080/00223891.2017.1296455 [DOI] [PubMed] [Google Scholar]
  131. Vispoel WP, Xu G, & Schneider WS (2022). Using parallel splits with self-report and other measures to enhance precision in generalizability theory analyses. Journal of Personality Assessment, 104(3), 303–319. 10.1080/00223891.2021.1938589 [DOI] [PubMed] [Google Scholar]
  132. Volpert-Esmond HI (2022). Looking at change: Examining meaningful variability in psychophysiological measurements. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 7(6), 530–531. 10.1016/j.bpsc.2022.02.006 [DOI] [PubMed] [Google Scholar]
  133. Volpert-Esmond HI, Merkle EC, Levsen MP, Ito TA, & Bartholow BD (2018). Using trial-level data and multilevel modeling to investigate within-task change in event-related potentials. Psychophysiology, 55(5), e13044. 10.1111/psyp.13044 [DOI] [PMC free article] [PubMed] [Google Scholar]
  134. Warrens MJ (2015). A comparison of reliability coefficients for psychometric tests that consist of two parts. Advances in Data Analysis and Classification, 10(1), 71–84. 10.1007/s11634-015-0198-6 [DOI] [Google Scholar]
  135. Weinberg A, Dieterich R, & Riesel A (2015). Error-related brain activity in the age of RDoC: A review of the literature. International Journal of Psychophysiology, 98(Part 2), 276–299. 10.1016/j.ijpsycho.2015.02.029 [DOI] [PubMed] [Google Scholar]
  136. Wessel JR (2012). Error awareness and the error-related negativity: Evaluating the first decade of evidence. Frontiers In Human Neuroscience, 6, 88. 10.3389/fnhum.2012.00088 [DOI] [PMC free article] [PubMed] [Google Scholar]
  137. Wessel JR, Danielmeier C, Morton JB, & Ullsperger M (2012). Surprise and error: Common neuronal architecture for the processing of errors and novelty. Journal of Neuroscience, 32(22), 7528–7537. 10.1523/JNEUROSCI.6352-11.2012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  138. Williams DR, Mulder J, Rouder JN, & Rast P (2021). Beneath the surface: Unearthing within-person variability and mean relations with Bayesian mixed models. Psychological Methods, 26(1), 74–89. 10.1037/met0000270 [DOI] [PMC free article] [PubMed] [Google Scholar]
  139. Wilson KA, Clark DA, & MacNamara A (2021). Using item response theory to select emotional pictures for psychophysiological experiments. International Journal of Psychophysiology, 162, 166–179. 10.1016/j.ijpsycho.2021.02.003 [DOI] [PubMed] [Google Scholar]
  140. Yeung N, Botvinick MM, & Cohen JD (2004). The neural basis of error detection: Conflict monitoring and the error-related negativity. Psychological Review, 111(4), 931–959. 10.1037/0033-295x.111.4.931 [DOI] [PubMed] [Google Scholar]
