Author manuscript; available in PMC: 2018 Jan 1.
Published in final edited form as: Psychophysiology. 2017 Jan;54(1):123–138. doi: 10.1111/psyp.12629

Assessing the internal consistency of the event-related potential: An example analysis

Nina Thigpen 1, Emily Kappenman 2, Andreas Keil 1
PMCID: PMC5525326  NIHMSID: NIHMS879463  PMID: 28000264

Abstract

Event-related potentials (ERPs) are widely and increasingly used to address questions in psychophysiological research. As discussed in this special issue, a renewed focus on questions of reliability and stability marks the need for intuitive, quantitative descriptors that allow researchers to communicate the robustness of ERP measures used in a given study. This report argues that well-established indices of internal consistency and effect size meet this need and can be easily extracted from most ERP data sets, as demonstrated with example analyses using a representative data set from a feature-based visual selective attention task. We demonstrate how to measure the internal consistency of three aspects commonly considered in ERP studies: voltage measurements for specific time ranges at selected sensors, voltage dynamics across all time points of the ERP waveform, and the distribution of voltages across the scalp. We illustrate methods for quantifying the robustness of experimental condition differences by calculating effect size for different indices derived from the ERP. The number of trials contributing to the ERP waveform was manipulated to examine the relationship between signal-to-noise ratio (SNR), internal consistency, and effect size. In the present example data set, satisfactory consistency (Cronbach’s alpha > 0.7) of individual voltage measurements was reached at lower trial counts than were required to reach satisfactory effect sizes for differences between experimental conditions. Comparing different metrics of robustness, we conclude that the internal consistency and effect size of ERP findings greatly depend on the quantification strategy, the comparisons and analyses performed, and the SNR.

Keywords: reliability, event-related potentials, Cronbach’s alpha, internal consistency, effect size, signal-to-noise ratio


Event-related potentials (ERPs) represent large-scale brain electric fields that are time-locked to an event. They are non-invasively recorded from the scalp and have been used to investigate brain processes for more than half a century (Luck, 2014). ERPs have also been discussed as potential biomarkers for a variety of psychiatric and neurological disorders (Foti, Kotov, & Hajcak, 2013; Light & Swerdlow, 2015; Luck et al., 2011; Perez, Swerdlow, Braff, Näätänen, & Light, 2014) and as indices of individual differences in non-clinical samples (Anokhin et al., 2001; Cohen & Polich, 1997). An ERP can be regarded as a spatio-temporal matrix, often recorded from many scalp locations, and containing time varying voltage information at high temporal resolution. Numerous indices can be extracted from this spatio-temporal matrix using different quantification methods. Some indices are univariate in nature, such as the latency of a given component, the mean amplitude across a time-window, or area measurements of the amplitude for a given component at a given sensor location (Kappenman & Luck, 2012). Others are multivariate, such as the topographical distribution of voltages across the scalp or the temporal sequence of components in a waveform (Spencer, Dien, & Donchin, 1999; Dien, Spencer, & Donchin, 2004). Given their rich potential for answering questions in Psychophysiology, potential clinical applications, and the myriad techniques used to quantify ERP indices, a discussion of their psychometric properties is becoming increasingly important.

Recently, discussions about replicability in the cognitive and neural sciences have arisen, particularly regarding the reliability of ERP measures (Keil et al., 2014). The need for establishing the psychometric properties of ERP measures (such as reliability) is obvious when authors are interested in quantifying inter-individual differences, especially in the context of clinical and translational work. Quantitative indices of robustness and consistency, however, are also desirable for experimental studies comparing ERP metrics under different conditions, typically using within-participants comparisons. The reliability of a given ERP effect depends on a number of factors, including the recording hardware and sensors (affecting the overall signal quality in the raw EEG), how the dependent variable was derived from the spatio-temporal ERP matrix (the quantification method), and how much error variance (noise) affected the desired ERP signal, which can be measured by computing the signal-to-noise ratio (SNR). Another point affecting replicability of experimental reports is the sensitivity of a given ERP index to differences between conditions, readily quantified by computing effect size. The range of acceptable SNRs, effect sizes, and reliability indices will vary by study goals, experimental paradigm, and the specific ERP component examined in the study.

In the present report, we address the issue of reliability of ERP measures not by suggesting recommended parameters for ERP studies, but by providing example analyses of SNR, internal consistency, and effect size that can be readily applied to any ERP data set. We discuss how authors of within-participant studies may quantify and document the within-study reproducibility of selected ERP metrics using the variability across experimental conditions. We also illustrate the consequences of different quantification methods, and compare the reliability and robustness (measured by the effect size) of different types of dependent variables.

Considerations regarding signal-to-noise ratio

Signal-to-noise ratio (SNR) quantifies the strength of a signal of interest in the presence of noise (Teplan, 2002). Often, SNR is defined as a function of signal (S) and noise (sigma) in a single trial, scaled by the square root of the number of trials n: SNR = sqrt(n) * S/sigma. Thus, SNR increases with the square root of the number of trials¹ averaged to produce an ERP (Handy, 2005). SNR is important in determining the robustness and replicability of a given ERP finding, and recommendations regarding the design of ERP studies are therefore often based on SNR. For example, Luck (2005) recommended designing ERP studies to contain a trial count high enough to reach an SNR of 10. A comprehensive discussion of trial count recommendations is outside the scope of the present paper, but it should be noted that many recommendations do not contain systematic quantitative or psychometric analyses in their support, as discussed in the paper by Boudewyn, Luck, & Kappenman in this issue and in Kappenman & Luck (2012). Instead, the present paper focuses on simple methods for quantifying and documenting data quality that can inform researchers regarding the suitability of the ERP signal used as a dependent variable in a given study.
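The square-root relationship in the formula above can be illustrated numerically. The following is a minimal sketch rather than part of the original analysis; the function names and the single-trial values (a 5 µV signal in 20 µV noise) are hypothetical.

```python
import math

def snr_after_averaging(signal, noise_sd, n_trials):
    """SNR of an n-trial average: sqrt(n) * S / sigma."""
    return math.sqrt(n_trials) * signal / noise_sd

def trials_needed(signal, noise_sd, target_snr):
    """Smallest trial count whose average reaches target_snr."""
    return math.ceil((target_snr * noise_sd / signal) ** 2)

# Hypothetical single-trial values: 5 µV signal, 20 µV noise SD.
for n in (10, 40, 160):
    print(n, round(snr_after_averaging(5.0, 20.0, n), 2))

print(trials_needed(5.0, 20.0, target_snr=10))
```

Under this definition, quadrupling the trial count doubles the SNR, which is why gains from additional trials diminish as the trial count grows.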

Quantifying ERP robustness using internal consistency

The present study examines an important facet of reliability known as internal consistency. Internal consistency refers to a measure’s ability to quantify the same underlying construct or variable (here: ERP data) with different items or sub-variables. ERP measures are considered internally consistent if the rank ordering of subjects remains stable for the extracted variable across different experimental conditions, trials, or sessions (Simons and Miles, 1990). Thus, internal consistency is considered particularly important in studies with between-participants factors, in which researchers aim to characterize individuals by means of the ERP component of interest.

Internal consistency of ERP measures is also desirable in studies interested in the effects of within-participants manipulations on a dependent ERP variable. To assess internal consistency of the ERP in a within-participants design, researchers may use condition-averaged ERPs to serve as “items” for Cronbach’s alpha, as condition-averaged ERPs represent distinct samples drawn from the same population of trials. This metric answers questions regarding the consistency/reliability of the ERP that is attainable in a specific design, given the number of trials available: High consistency would indicate that the ERP is reliably seen across different averages obtained from the same participant. Another possibility includes randomly dividing all experimental trials into X number of arbitrary groups (ignoring experimental conditions), and re-averaging to form new ERPs to create “items” for Cronbach’s alpha (Fabiani, Gratton, Karis, & Donchin, 1987; Handy, 2005). Although re-averaging randomly drawn trials is a feasible and informative approach, it does not allow quantification of robustness of experimental effects (as different conditions are averaged together) and may obscure changes of robustness associated with condition-specific ERP modulations.
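The random re-averaging alternative described above can be sketched as follows. This is an illustrative assumption about how such a split might be coded, not the procedure of the cited authors; `random_split_averages` and the simulated single-trial amplitudes are hypothetical.

```python
import random

def random_split_averages(trial_values, n_groups, seed=0):
    """Shuffle single-trial amplitudes, deal them into n_groups,
    and return one average per group (the "items")."""
    rng = random.Random(seed)
    shuffled = list(trial_values)
    rng.shuffle(shuffled)
    groups = [shuffled[i::n_groups] for i in range(n_groups)]
    return [sum(g) / len(g) for g in groups]

# Hypothetical participant: 80 single-trial amplitudes (µV) at one sensor.
rng = random.Random(1)
trials = [rng.gauss(-2.0, 10.0) for _ in range(80)]
items = random_split_averages(trials, n_groups=4)
print(items)
```

Repeating this for every participant yields a participants-by-items matrix that can be submitted to Cronbach’s alpha, at the cost of mixing experimental conditions within each item.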

In the present report, we address the issue of robustness and consistency of variables derived from ERPs by using the condition-averaged ERPs as the items for consistency analysis. The rationale for this is as follows: (1) Condition averages are the data used for hypothesis testing, and assessing their reliability thus has higher relevance compared to analyses based on surrogate data; (2) Condition averages are readily available and do not require re-sampling, thus lessening the burden on researchers; (3) Similar ERP waveforms across conditions are typical for studies in which conditions differ only by a specific manipulation; (4) The experimental effects (in which the consistency of the ERP across conditions may be reduced) are typically confined to specific temporal regions, allowing the use of other temporal regions for analyzing the consistency of the overall waveform; (5) The internal consistency among ERP variables derived from different conditions can be considered a necessary condition for treating the variables as indices of the same brain process, which is often assumed in ERP studies focusing on amplitude modulation between experimental conditions.

A potential limitation of this approach arises when determining the consistency of differences (e.g. values obtained by subtracting one experimental condition from another). Here, Cronbach’s alpha might under-estimate the robustness of the ERP measure, because consistency requires that a participant’s ordinal position of the dependent variable is maintained across conditions. This requirement will not be met if participants vary strongly in their sensitivity to the experimental manipulation (i.e., to the conditions). Empirically, this question can be addressed by comparing internal consistency of components that are modulated versus components that are not modulated by the experimental manipulation to determine the consistency of the overall waveform, which is one of the approaches illustrated in the present study.

The role of quantification techniques

Quantification techniques differ in their sensitivity to the quality of the data. For example, peak amplitude measures are more sensitive to high-frequency noise compared to mean amplitude measures (Luck, 2014). Conversely, applying different quantification techniques may also lead to different SNRs of the dependent variable extracted. Thus, quantification techniques such as measuring the peak voltage, or the mean voltage of a given component, may also affect the reproducibility of the findings of a given ERP effect. For instance, averaging time points and sensors for the ERP signal of interest is often thought to increase within-participant SNR and decrease error variance across participants (Marco-Pallares, Cucurell, Münte, Strien, & Rodriguez-Fornells, 2011; Pontifex et al., 2010). However, the choice of the temporal length and spatial extent of voltages to be averaged (or otherwise integrated) into dependent variables may sometimes seem arbitrary, which has led to discussions of peak-picking and averaging as potential causes contributing to false positive findings (Dien, 2010; Keil et al., 2014). The contribution by Luck and Gaspelin in this issue illustrates this problem in greater detail.

In addition, an extensive discussion in the ERP literature has identified challenges associated with measures that focus on finding component peaks in general (Dien, Spencer, & Donchin, 2003; Donchin et al., 1977; Fabiani, Gratton, Corballis, Cheng, & Friedman, 1998; Spencer, Dien, & Donchin, 1999). For example, peaks may not capture the signal of interest, and may instead reflect brain processes that are shared between conditions and/or groups. Using difference waveforms is another common approach to address some of these problems (Kappenman & Luck, 2012). In this approach, replicability may be affected by the fact that the SNR of a difference waveform is typically lower than that of its parent waveforms. Thus, the present example analysis examines the effect size between experimental conditions as a measure of robustness with various quantification strategies, and discusses the relationship between these effect sizes and the SNR of the difference waveforms.
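One way to see why difference waveforms tend to have lower SNR is that subtraction shrinks the signal to the condition difference while the noise of the two averages adds. The sketch below assumes independent noise with equal standard deviation in both conditions and equal trial counts; the function names and amplitudes are hypothetical.

```python
import math

def average_snr(signal, noise_sd, n_trials):
    """SNR of a single condition average."""
    return abs(signal) / (noise_sd / math.sqrt(n_trials))

def difference_snr(signal_a, signal_b, noise_sd, n_trials):
    """SNR of (average A - average B) under independent, equal-SD noise."""
    diff_noise = math.sqrt(2) * noise_sd / math.sqrt(n_trials)
    return abs(signal_a - signal_b) / diff_noise

# Hypothetical amplitudes: 6 µV vs. 5 µV, 20 µV single-trial noise, 80 trials each.
print(round(average_snr(5.0, 20.0, 80), 2))
print(round(difference_snr(6.0, 5.0, 20.0, 80), 2))
```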

The present research

The goal of the present study is to provide a set of example analyses of SNR, internal consistency, and effect size in a typical ERP study, using metrics that are rapidly and easily computed. The ultimate goal of this approach is to stimulate more extensive use of quantitative reports of reliability and replicability in empirical reports (Keil et al., 2014). We address this question using a data set involving pattern-onset ERPs in a feature-based visual selective attention task, containing four experimental conditions, each producing an ERP containing five well-known components (i.e., the P1, N1, P2, N2, and P3). Based on a number of studies with this paradigm (see Harter & Aine, 1984; Müller & Keil, 2004; Schoenfeld et al., 2007; Hopf, Boelmans, Schoenfeld, Luck, & Heinze, 2004; McGinnis & Keil, 2011), we expected the spatio-temporal properties of the ERPs elicited in each experimental condition to be similar. Specifically, experimental condition-averaged ERPs are expected to differ during a brief time window from 190 to 220 ms post stimulus-onset, over parieto-occipital electrode locations (Anllo-Vento & Hillyard, 1996; Harter & Aine, 1984). In the literature, this time window is referred to as the selection negativity (SN), and has been shown to contain an amplitude enhancement for attended when compared to non-attended features. Thus, ERPs derived from the conditions in a feature-based attention paradigm are particularly suitable for demonstrating the use of Cronbach’s Alpha for quantifying internal consistency.

We quantified the internal consistency of selected variables derived from the pattern-onset ERP by measuring Cronbach’s Alpha, and the robustness of experimental effects (differences between conditions) by measuring effect sizes, expressed as R². In an attempt to illustrate the sensitivity of these approaches to SNR, we also varied the number of trials entering the ERPs used in the analyses. Solutions and example outcomes are presented for different quantification techniques, notably, averaging in the time domain (i.e., across time points) and averaging in the spatial domain (across electrodes), prior to statistical analysis. Finally, when using ERP data for hypothesis testing, researchers may be interested in three different types of variables: (i) The ERP topography at a given point in time may be relevant to testing a hypothesis regarding spatial extent of a voltage change. (ii) The shape of the ERP waveform at a given sensor or sensor group may be of interest when testing hypotheses regarding the temporal evolution of neurocognitive processes. (iii) The voltage amplitude at specific time points and electrodes may be used to examine hypotheses regarding the differences in neural population activity between experimental conditions. Accordingly, for each of these variable types, we demonstrate simple methods for estimating consistency and effect size, and examine their relation for different trial counts and quantification techniques.

Method

Participants

Nineteen healthy volunteers (12 females; mean age 18.6, SD 0.9; 1 left-handed) participated in the experiment in exchange for course credit. Participants were excluded if their response accuracy fell below two standard deviations from the mean, which applied to two participants. All participants gave written informed consent prior to participating. The institutional review board of the University of Florida, in line with the Declaration of Helsinki, approved all procedures.

Stimuli and Procedure

Stimuli consisted of four sinusoidal gratings filtered with a Gaussian envelope (i.e., Gabor Patches). All Gabor Patches were composed of grayscale gratings and were presented against a gray background with the same mean luminance (31 cd/m²) as the Gabor Patches (see Figure 1). The four stimuli varied with respect to two task-relevant features: orientation and spatial frequency. Stimulus orientation was manipulated by rotating the Gabor patch grating relative to a vertical axis (1.5° or 358.5°). Stimulus spatial frequency was either 1.33 or 1.78 cycles per degree, at a visual angle of 4.5°, achieved by seating participants 140 cm from a 23-inch 3-D LED monitor (Samsung LS23A950) set to a vertical refresh rate of 120 Hz.

Figure 1.

Participants performed a feature-based visual selective attention task, which involved discriminating target stimuli (a Gabor Patch defined by a combination of orientation and spatial frequency) from non-target stimuli. The experimental session was organized into 12 experimental blocks, with the target patch varying between blocks. Prior to a given trial block, participants were presented with a target stimulus in the middle of the screen (e.g., stimulus A in Figure 1), and asked to memorize the two features defining this particular target, which included a specific orientation and spatial frequency. Once participants reported familiarization with the target stimulus, they completed a block containing 40 trials. Each trial contained one of the four stimuli (shown in Figure 1), presented at the center of the screen for 66.7 milliseconds. Participants were instructed to indicate whether the presented stimulus matched or did not match the target stimulus for that block. Participants responded with their dominant hand by pressing one arrow key of a standard keyboard when a stimulus identical to the target stimulus appeared, and the other arrow key when any non-target stimulus appeared. The keyboard was placed in a comfortable location and could easily be operated by all participants, and the mapping of the arrow keys to target/non-target conditions was counterbalanced across participants. Between stimulus presentations, a fixation circle occupying 0.5° of visual angle was present for an interval varying between 1.5 and 2.1 seconds. If participants did not press either response button during this interval, it was counted as an incorrect response. A new target stimulus was assigned at the start of each block. Participants were allowed breaks as needed in between blocks. Both the order of stimuli presented within a block and the order of blocks were fully randomized. Participants were instructed to avoid head movements and to maintain gaze on the central fixation circle.

After data collection, each trial was assigned to one of four experimental conditions, contingent on the block’s target stimulus: trials containing (1) stimuli that matched the target’s spatial frequency and orientation (S+O+), (2) stimuli that matched the target’s spatial frequency, but not orientation (S+O−), (3) stimuli that matched the target’s orientation, but not spatial frequency (S−O+), and (4) stimuli that did not match the target in either spatial frequency or orientation (S−O−). A total of 120 trials were presented in each of the four conditions.
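The assignment rule above amounts to comparing each stimulus with the block’s target on the two features. A minimal sketch follows; the function name and the tuple representation of stimuli are hypothetical, and the labels use an ASCII hyphen in place of the minus sign.

```python
def condition_label(stimulus, target):
    """Label a trial by which target features the stimulus shares.

    Both arguments are (spatial_frequency, orientation) tuples.
    """
    s = "+" if stimulus[0] == target[0] else "-"
    o = "+" if stimulus[1] == target[1] else "-"
    return f"S{s}O{o}"

# The four stimuli: spatial frequency (cycles/degree) x orientation (degrees).
stimuli = [(1.33, 1.5), (1.33, 358.5), (1.78, 1.5), (1.78, 358.5)]
target = (1.33, 1.5)  # hypothetical target for one block
labels = [condition_label(stim, target) for stim in stimuli]
print(labels)
```

Because the target changes between blocks, the same physical stimulus contributes to all four conditions over the session.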

Behavioral Data

Participants’ accuracy and response time were calculated across blocks separately for each condition. This included the percentage of correctly identified targets (hits), incorrect responses to targets (misses), incorrect responses to a non-target (false alarms), and correct responses to a non-target (correct rejections). To ensure the four stimuli were comparable in their discriminability, a 2×2 repeated-measures analysis of variance (ANOVA) was conducted on both the hit rates and response times observed with the four different stimuli, with factors of spatial frequency and orientation.

Data Acquisition

Electroencephalogram (EEG) data were recorded continuously with a 129-channel Geodesic Sensor Net (Electrical Geodesic, Eugene Oregon) connected to a high-input impedance (> 200 MOhms) amplifier. Electrodes were evenly spaced across large areas of the head, including facial and neck regions (see Figure 2). Impedance for each electrode was kept below 60 kOhms, and the vertex electrode (Cz) was used as the recording reference. All channels were digitized at a rate of 500 Hz and filtered online using a Butterworth low pass filter with a 3 dB point (cut-off) at 200 Hz. All further data processing was done off-line.

Figure 2.

Trial Segmentation, Filtering and Artifact Handling

Continuous EEG data were digitally filtered offline using a 2nd order Butterworth high-pass filter having a 3 dB point at 0.15 Hz,² as well as a 12th order Butterworth low-pass filter with a 3 dB point at 40 Hz. Eye movement artifacts were detected and corrected using an artifact correction method based on linear regression performed on residuals, implemented in the Biosig suite of Matlab functions (Schlögl et al., 2007; Vidaurre, Sander, & Schlögl, 2011). It creates a linear model of the data based on representative ocular events, in which the contribution of electro-ocular processes to the EEG measured at each time point is estimated and removed through subtraction of the weighted EOG. This procedure bears the risk that brain-related activity is removed if it shares spatial and temporal variance with EOG events (compare Gratton et al., 1983, for a different approach). In the present data set, however, re-running the pre-processing without eye correction resulted in suppressed, not in augmented ERP amplitude. Following EOG correction, segments were extracted from the continuous EEG, with each segment having a duration of 1000 ms (200 ms before and 800 ms after stimulus onset).

These segments were submitted to a semi-automated artifact detection procedure designed for multi-channel electrophysiology, which is based on distributions of trial and channel statistics (Junghofer, 2000). First, specific channels that were bad throughout the experimental session were detected with voltage data given relative to the original recording reference (i.e., Cz). That is, channels that fell above a 2.5 standard deviation threshold with respect to the median of three distributions calculated across all trials (amplitude, standard deviation, and gradient) were interpolated across all time points using spherical spline functions (Junghöfer, Elbert, Tucker, & Rockstroh, 2000). Data at eliminated channels were replaced with a statistically weighted spherical spline interpolation from the full channel set (Junghöfer, Elbert, Leiderer, Berg, & Rockstroh, 1997).
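The thresholding idea behind the channel screening can be sketched as follows. This is a simplified illustration and not the published procedure (Junghöfer et al., 2000), which involves further steps and manual review; the function names, the rule that a channel is flagged when it is an outlier on any of the three statistics, and the data are assumptions made for this example.

```python
import statistics

def flag_outliers(values, n_sd=2.5):
    """Indices of values lying more than n_sd SDs above the median."""
    cutoff = statistics.median(values) + n_sd * statistics.stdev(values)
    return {i for i, v in enumerate(values) if v > cutoff}

def bad_channels(amplitude, std_dev, gradient):
    """Channels flagged on any of the three per-channel statistics."""
    flagged = flag_outliers(amplitude) | flag_outliers(std_dev) | flag_outliers(gradient)
    return sorted(flagged)

# Hypothetical per-channel statistics for 20 channels; channel 19 is noisy.
amplitude = [10.0] * 19 + [100.0]
std_dev = [2.0] * 19 + [30.0]
gradient = [1.0] * 20
print(bad_channels(amplitude, std_dev, gradient))
```

Channels returned by such a screen would then be replaced by spherical spline interpolation, which is not shown here.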

In a next step, based on the off-line average reference, distinct sensors from individual trials were also excluded and interpolated when located in the tails (2.5 standard deviations above the median) of the distribution of their absolute amplitude, maximum standard deviation, and gradient, calculated by integrating across the time points in each trial. Trials in which interpolated channels were clustered in one scalp region (quantified as described in Peyk, DeCesarei, & Junghöfer, 2011) and trials with fewer than 103 good channels were excluded entirely. Only trials with correct responses were retained for final ERP averaging, leading to an overall mean of 78.6 trials included per condition (SD = 14.3, range = 60 – 103). On average, 24% of trials were rejected due to artifact, and 11% of trials were not used for ERP analysis due to incorrect behavioral responses. The target condition (S+O+) included a mean of 77.2 trials across participants (SD = 15.5, range = 61 – 99); condition S+O− included a mean of 78.1 trials (SD = 14.9, range = 66 – 102); condition S−O+ included a mean of 79.2 trials (SD = 14.0, range = 61 – 103); and condition S−O− included a mean of 80.1 trials (SD = 16.7, range = 65 – 103).

Analysis of experimental effects: selection negativity

To ensure that the data set used for reliability analyses was representative, we established the extent to which the ERP waveforms in the present study replicated a large body of earlier work in this area (e.g., Anllo-Vento & Hillyard, 1996). Specifically, we expected to observe a greater posterior negativity for stimuli with target features compared to stimuli with no target features, during the time window of the selection negativity (typically 160 – 280 ms). To this end, a posterior cluster of electrodes was formed around Pz and its superior and inferior nearest neighbors (containing electrodes 54, 55, 61, 62, 72, 75, 78, 79, and 81, as shown in Figure 2), chosen based on earlier research with this paradigm (e.g., Keil & Müller, 2010). Then, considering the waveform differences seen in the grand mean as well as the previous studies discussed above (McGinnis & Keil, 2011; Müller & Keil, 2004), the mean voltage amplitude was extracted across this sensor cluster and across time windows representing early and late selection negativity (178 to 234 ms and 236 to 292 ms, respectively, in line with the studies cited above). A repeated-measures analysis of variance (ANOVA) was conducted on the early and late mean amplitudes with factors of Time (early SN, late SN), Spatial Frequency (match versus non-match with the target), and Orientation (match versus non-match with the target; Keil & Müller, 2010).
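Extracting a mean amplitude over a sensor cluster and time window reduces to converting latencies into sample indices and averaging. The sketch below uses the sampling rate, epoch start, cluster, and windows stated in the text; the dictionary layout of the ERP, the function names, and the synthetic voltages are assumptions for illustration.

```python
SRATE_HZ = 500          # sampling rate reported in the text
EPOCH_START_MS = -200   # segments start 200 ms before stimulus onset
CLUSTER = (54, 55, 61, 62, 72, 75, 78, 79, 81)

def ms_to_sample(t_ms):
    """Index of the sample recorded at latency t_ms within the epoch."""
    return round((t_ms - EPOCH_START_MS) * SRATE_HZ / 1000)

def mean_amplitude(erp, sensors, start_ms, end_ms):
    """Average voltage across the given sensors and an inclusive time window.

    erp maps sensor number -> list of voltages, one per sample.
    """
    first, last = ms_to_sample(start_ms), ms_to_sample(end_ms)
    values = [v for s in sensors for v in erp[s][first:last + 1]]
    return sum(values) / len(values)

# Synthetic ERP: -1 µV from 178 to 234 ms (samples 189 to 217), 0 µV elsewhere.
erp = {s: [-1.0 if 189 <= i <= 217 else 0.0 for i in range(500)] for s in CLUSTER}
early_sn = mean_amplitude(erp, CLUSTER, 178, 234)
late_sn = mean_amplitude(erp, CLUSTER, 236, 292)
print(early_sn, late_sn)
```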

Signal-to-noise ratio

SNR was determined at each sensor for the components P1, N1, P3, and Selection Negativity using averages based on varying numbers of trials (10 through 80 trials in steps of 10). Specifically, SNRs were calculated for each participant by dividing the voltage measurement (peak amplitude) obtained for each component by the peak across the baseline variance (from -100 to 0 ms). The peak amplitude of each component was determined by taking either the maximum or minimum voltage amplitude (for positive or negative components, respectively) across the time window centered around the grand mean component peak (defined as P1: 100 to 130 ms, N1: 160 to 190 ms, P3: 300 to 330 ms, and SN: 190 to 220 ms). The mean amplitude was calculated as the average voltage within these same time windows. SNR values were then averaged across participants. If not otherwise indicated, Figures display SNRs for the target condition.
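A peak-based SNR of this kind might be computed as sketched below. The text does not spell out how the "peak across the baseline variance" was obtained; this example assumes it is the largest absolute voltage in the baseline window, which may differ from the authors' implementation. The function name and the synthetic waveform are hypothetical.

```python
def peak_snr(waveform, times_ms, window, baseline=(-100, 0), positive=True):
    """Component peak divided by the largest absolute baseline voltage.

    Assumption: the baseline "peak" is max(|voltage|) within the baseline.
    """
    component = [v for t, v in zip(times_ms, waveform)
                 if window[0] <= t <= window[1]]
    base = [abs(v) for t, v in zip(times_ms, waveform)
            if baseline[0] <= t < baseline[1]]
    peak = max(component) if positive else min(component)
    return abs(peak) / max(base)

# Synthetic waveform sampled every 2 ms: +/-0.5 µV noise, 5 µV peak at 120 ms.
times_ms = list(range(-200, 800, 2))
waveform = [5.0 if t == 120 else (0.5 if i % 2 == 0 else -0.5)
            for i, t in enumerate(times_ms)]
p1_snr = peak_snr(waveform, times_ms, window=(100, 130))
print(p1_snr)
```

For negative components such as the N1 or the selection negativity, `positive=False` selects the minimum voltage in the window instead.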

Reliability Analyses

Reflective of the spatio-temporal nature of ERP data, reliability was assessed for various combinations of time windows and sensor clusters extracted from the ERP matrices. Many empirical studies do not form dependent variables based on peak amplitude measures at single sensors, but use voltage averages across multiple electrodes and time points for hypothesis testing (Fabiani, Gratton, Karis, & Donchin, 1987). The effects of this strategy were examined here by systematically averaging across increasing numbers of time points around component peaks and across increasing numbers of electrodes, prior to calculating internal consistency. Internal consistency was quantified with Cronbach’s Alpha, a coefficient representing the consistency of “items” (variables or repetitions) across observations (e.g., participants). The formation of items is described for each example analysis in the results. Cronbach’s alpha has been widely used to test the response similarities of items on a questionnaire across observations, and is considered to represent acceptable reliability when the coefficient is above .70 (Cronbach, 1951). More recently, Hinton and colleagues (Hinton, Brownlow, McMurray, & Cozens, 2004) have suggested that Cronbach’s alpha exceeding .90 indicates excellent internal consistency, alphas between .70 and .90 indicate high internal consistency, and alphas from .50 to .70 indicate moderate internal consistency, whereas a coefficient below .50 is considered poor. Confidence intervals for each analysis are shown in Figure 11.
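For readers who wish to compute the coefficient directly, the standard formula is alpha = k/(k − 1) × (1 − sum of item variances / variance of the summed scores), where k is the number of items. The following minimal sketch treats the four condition averages as items; the function name and the amplitudes are hypothetical and are not taken from the present data set.

```python
import statistics

def cronbach_alpha(scores):
    """Cronbach's alpha for a participants-by-items table.

    scores: list of rows, one per participant, one value per item.
    """
    k = len(scores[0])
    item_vars = [statistics.variance([row[i] for row in scores])
                 for i in range(k)]
    total_var = statistics.variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical P1 amplitudes (µV): 5 participants x 4 condition averages.
p1 = [
    [2.1, 2.4, 1.9, 2.2],
    [3.5, 3.1, 3.6, 3.3],
    [1.0, 1.4, 0.8, 1.1],
    [2.8, 2.6, 3.0, 2.7],
    [4.2, 3.9, 4.4, 4.0],
]
print(round(cronbach_alpha(p1), 3))

# A constant offset per condition leaves alpha at 1: only the ordering
# and spread of participants matter, not the condition means.
shifted = [[x, x + 1, x + 2, x + 3] for x in (1.0, 2.0, 4.0, 7.0)]
print(cronbach_alpha(shifted))
```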

Figure 11.

Effect size analyses

Cronbach’s Alpha could not be used to examine the internal consistency of difference waveforms in the present design, because the difference waveforms of interest contained the same (target) condition and thus represented linear combinations of each other, which violates the independence-of-items assumption required for calculating Cronbach’s alpha. Instead, to quantify effect size of the well-established selection negativity difference, a trend (F-contrast) analysis was performed using a General Linear Model procedure (Rosnow, Rosnow, & Rosenthal, 1996), with weights based on the hypothesis that feature-based attention increases the SN amplitude with the number of attended features (Harter & Aine, 1984; Hopf, Boelmans, Schoenfeld, Luck, & Heinze, 2004; McGinnis & Keil, 2011; Müller & Keil, 2004; Schoenfeld et al., 2007): Across the selection negativity time window (i.e., 160–280 ms), the four conditions were weighted according to their overlap with the target condition. The target condition itself was weighted the lowest (expected to show greatest selection negativity), the two conditions with one feature in common with the target (either orientation or spatial frequency) were weighted intermediate, and the condition with no matching features was weighted the highest (resulting in condition weights of −2, .5, .5, and 1, respectively). The effect size of the resulting F-contrast is readily expressed as R², which reflects the proportion of trend-related variance relative to the total variance (i.e., trend variance plus unique error variance). Traditionally, R² estimates of effect size are assigned to three levels: .14 a small effect, .39 a medium effect, and .59 a large effect (Cohen, 1988). The further computational steps taken to address different aspects of effect size are detailed below, in the results section.
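The contrast logic can be sketched as follows. This example assumes the trend is tested within participants by forming one weighted contrast score per participant, and that R² is obtained as F / (F + error degrees of freedom), which equals trend variance over trend plus error variance for a single-degree-of-freedom contrast. The authors' exact General Linear Model procedure may differ; the function name and the amplitudes are hypothetical.

```python
import statistics

# Weights from the text, in the order S+O+, S+O-, S-O+, S-O-.
WEIGHTS = (-2.0, 0.5, 0.5, 1.0)

def contrast_r2(data, weights=WEIGHTS):
    """R-squared of a within-participant contrast.

    data: one row per participant, one mean amplitude per condition.
    """
    scores = [sum(w * x for w, x in zip(weights, row)) for row in data]
    n = len(scores)
    t = statistics.mean(scores) / (statistics.stdev(scores) / n ** 0.5)
    f = t ** 2                   # F(1, n - 1) for the contrast
    return f / (f + (n - 1))

# Hypothetical SN amplitudes (µV) for 4 participants x 4 conditions.
sn = [
    [-3.0, -2.1, -2.4, -1.2],
    [-1.5, -1.0, -0.8, -0.2],
    [-4.2, -3.9, -3.0, -2.5],
    [-2.0, -1.6, -1.9, -1.0],
]
print(round(contrast_r2(sn), 3))
```

The weights sum to zero, so a participant whose amplitude is identical in all four conditions receives a contrast score of zero regardless of overall voltage level.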

Results

Behavioral data

Participants performed the task with high accuracy (M = 89% correct across all trials, SD = 9%), and response times as expected for this task (M = 760 ms, SD = 160 ms). The accuracy and response time data are shown in Table 1. The repeated-measures ANOVA for both hit rate and reaction time showed no significant effect for spatial frequency or orientation, suggesting that the four different Gabor patches did not differ in their discriminability or saliency as target stimuli.

Table 1.

Behavioral Data

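| Category | Accuracy Mean | Accuracy SD | Response Time Mean | Response Time SD |
|---|---|---|---|---|
| Hit | .89 | .09 | 764 | 161 |
| Miss | .11 | .09 | 805 | 288 |
| False Alarm | .05 | .08 | 933 | 351 |
| Correct Rejection | .95 | .08 | 714 | 158 |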

Note. Mean and standard deviation for accuracy and response time in milliseconds (N = 19).

ERP morphology and condition differences

Five well-known ERP components (i.e., the P1, N1, P2, N2, and P3) showed latencies and topographies typically consistent with previous studies of pattern-onset ERPs (see Figure 3), with component peaks in the grand mean centered at 120, 176, 210, 256, and 340 ms post-stimulus, respectively. A standard analysis of experimental effects (differences between voltage amplitudes) obtained in the different conditions was conducted to document the extent to which the present data set replicates known effects of feature based attention.

Figure 3.

Condition differences were prominent only during the selection negativity time window, paralleling previous work on feature-based attention with multi-feature stimuli (Anllo-Vento & Hillyard, 1996; Anllo-Vento, Luck, & Hillyard, 1998; Keil & Müller, 2010; McGinnis & Keil, 2011): Selection negativity was observed for attended features, over parieto-occipital sensors, at latencies between 178 and 292 ms post-stimulus. As shown in Figure 3, the selection negativity was most pronounced when comparing the target condition (S+ O+) to the condition with no target features (S− O−). That is, stimuli containing more target features evoked larger negative deflections than stimuli with fewer target features during the selection negativity time window (which encompasses the N1, P2, and N2 components). As expected, an ANOVA showed main effects of Orientation, F(1, 18) = 6.95, p < .05, ηp² = .27, and Spatial Frequency, F(1, 18) = 8.96, p < .01, ηp² = .33, both of which reflected larger negative deflections for stimuli with target features. There was no main effect of, or interaction involving, Time (i.e., early vs. late selection negativity). Furthermore, the two factors corresponding to attended features (Orientation and Spatial Frequency) did not interact, replicating previous work in which this finding was interpreted as indicating additive effects of feature dimensions on the selection negativity. In summary, analyses of condition differences using traditional ANOVA suggested that the present data set is consistent with previous work in terms of direction and size of experimental effects. Because condition differences were confined to a specific temporal region and absent during the remainder of the ERP epoch, this paradigm was considered particularly suitable for the purpose of internal consistency analysis, using conditions as items.

Signal-to-noise ratio

As expected, the signal-to-noise ratio (SNR) increased as a function of the number of trials included in the average ERP waveform. With 10 averaged trials, SNR for component peaks relative to the peak of the baseline variance tended to be around 3, and it increased linearly with the logarithm of the trial count (see Figure 4). Recommended SNRs (10 and above; Luck, 2014) for the component peaks of P1, N1, and P3 were reached with 40 trials. Additional averaging led to SNRs around 20, with topographical distributions consistent with the distribution of voltage maxima. As expected, the selection negativity difference waveform was associated with substantially lower SNR, as shown in Figure 4.

Figure 4.

Reliability analyses

Reliability of peak voltage at individual time points and sensors

One of the most common forms of ERP analysis is the statistical comparison of voltage measurements taken at a given sensor and time point. To illustrate how the internal consistency of these measurements can be assessed, and to document the range of possible outcomes of such an analysis, Cronbach’s alpha was calculated for individual ERP voltages at each sensor and time point, using the 4 conditions as items and the 19 participants as observations. This analysis yielded consistency estimates for each of the 129 sensors at all 501 epoch sample points (i.e., 1002 ms), for a total of 64,629 alphas. These calculations were repeated with varying numbers of trials included in the averaged ERP. The first 6 calculations were based on subsets of 10, 20, 30, 40, 50, and 60 trials per participant, respectively, in each condition. The 7th calculation included all artifact-free trials, with a median of 80 trials per participant (range: 60–103 trials).
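The trial-count manipulation described above could be sketched as follows. The text does not state how the trial subsets were selected, so random selection without replacement is an assumption here, as are the function name, the reduced array sizes, and the simulated single trials. The final line simply reproduces the count of alphas given in the text.

```python
import numpy as np

def subsample_erp(trials, n_trials, rng):
    """Average a random subset of single trials into one ERP.

    trials : (n_available_trials, n_sensors, n_samples) array for one
             participant and one condition
    """
    if n_trials > trials.shape[0]:
        raise ValueError("not enough artifact-free trials")
    # Assumption: subsets are drawn at random, without replacement.
    picked = rng.choice(trials.shape[0], size=n_trials, replace=False)
    return trials[picked].mean(axis=0)

rng = np.random.default_rng(1)
# Simulated single trials (hypothetical): 80 trials, 8 sensors, 50 samples.
trials = rng.normal(size=(80, 8, 50))

# Six fixed trial counts, plus one average over all available trials.
erps = {n: subsample_erp(trials, n, rng) for n in (10, 20, 30, 40, 50, 60)}
erps["all"] = trials.mean(axis=0)

# One alpha per sensor and sample point in the full data set.
n_alphas = 129 * 501
```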

Figure 5 shows the topographical distribution of Cronbach’s alpha for the peak across the baseline variance (−100 to 0 ms) and for each component peak voltage, calculated for different trial counts. Peak latencies were determined on the basis of the grand mean ERP waveform, a widely used practice in ERP studies. For all 5 ERP component peaks analyzed, high Cronbach’s alpha values (i.e., exceeding an alpha of .7) were observed with relatively low trial counts (20 trials), but only at scalp locations at which the respective signal was pronounced. For instance, peak voltages of P1 and N1 displayed excellent (> .9) internal consistency with as few as 20 trials in the posterior portion of the scalp, at sensors surrounding site Oz of the international 10–20 system. The later components P2, N2, and P3 similarly displayed high internal consistency with few trials at scalp locations associated with their voltage maximum. Including more trials in the average was associated with more widespread internal consistency of peak voltages, across all component peaks examined. At a trial count of 40, internal consistency reached levels of .8 or greater at 127 out of 129 sensors, for all component peak voltages examined, representing high internal consistency. Thus, experimental conditions used as replications (items) displayed high consistency in estimating the underlying dimension (here: the peak voltage at a given sensor) at trial counts exceeding 40, over widespread scalp regions. Note that Cronbach’s alpha as used above is easily determined for voltage scores extracted from individual participants (n) and a given number of conditions (k), arranged in one or more n × k matrices, using a wide range of statistics or computing software packages.
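For readers who prefer to compute the index directly, a minimal Python sketch of Cronbach's alpha for a single n × k matrix is given below. It uses the standard variance-based formula, alpha = k / (k − 1) × (1 − sum of item variances / variance of the summed score). The function name and the simulated voltages are illustrative assumptions and do not reproduce the study's software or data.

```python
import numpy as np

def cronbach_alpha(x):
    """Cronbach's alpha for an (n_observations, k_items) matrix."""
    x = np.asarray(x, dtype=float)
    k = x.shape[1]
    # Sum of the variances of the individual items (columns).
    item_var = x.var(axis=0, ddof=1).sum()
    # Variance of the per-observation total score (row sums).
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_var / total_var)

# Simulated example: 19 participants (rows) by 4 conditions (columns).
# Each participant has a stable amplitude that is shared across conditions,
# plus independent measurement noise in each condition.
rng = np.random.default_rng(2)
stable_amplitude = rng.normal(0.0, 2.0, size=(19, 1))
voltages = stable_amplitude + rng.normal(0.0, 1.0, size=(19, 4))
alpha = cronbach_alpha(voltages)
```

Repeating this call for the matrix at each sensor and sample point would yield maps of the kind summarized above.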

Figure 5.

Reliability of the voltage topography for each time point

Researchers interested in the internal consistency of the voltage topography (the distribution of voltages across the electrode array) may also employ Cronbach’s alpha. In the present example, observations were voltages for all 129 sensors for 19 participants, and items were the 4 experimental conditions, resulting in a 2451 by 4 matrix. One such matrix was created for each ERP time point, and each matrix produced one Cronbach’s alpha value. Thus, a time series of internal consistency estimates resulted, reflecting the consistency of the individual topographies across the four experimental conditions (see Figure 6). Across all trial counts, internal consistency was low during the baseline segment, as expected. The internal consistency of the topographical distribution increased with the rising slope of the P1 (at 110 ms), and again varied strongly with trial count. Cronbach’s alpha values exceeded the threshold for high internal consistency (i.e., Cronbach’s alpha > .7) when averaging 30 trials. Excellent internal consistency (Cronbach’s alpha > .9) was observed between 115 and 620 ms post-stimulus for 40 or more trials. For the duration of this time window, moderate reliability (i.e., values near .68) was observed with 20 averaged trials, and low reliability was found with 10 trials (i.e., Cronbach’s alpha < .5). The time range of the selection negativity (178 to 292 ms) was characterized by a sharp transient decrease in cross-condition consistency. Experimental conditions (used as variables) systematically differed in amplitude and topography during this time range in the present task. This added variability was associated with decreased internal consistency across conditions, although consistency remained at satisfactory to excellent levels. In combination with the analysis of individual voltage scores obtained from individual electrodes, this result highlights that a comparison of consistency indices across components may yield converging information about the reproducibility of the ERP measures of interest, across conditions.
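The stacking described above (sensors and participants as observations, conditions as items, one matrix per time point) might be implemented along the following lines. The array layout (participants, conditions, sensors, samples), the function names, and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def cronbach_alpha(x):
    """Cronbach's alpha for an (n_observations, k_items) matrix."""
    k = x.shape[1]
    item_var = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_var / total_var)

def topography_alpha(erp):
    """Alpha of the scalp topography at every sample point.

    erp : (n_participants, n_conditions, n_sensors, n_samples) array
    """
    n_part, n_cond, n_sens, n_samp = erp.shape
    alphas = np.empty(n_samp)
    for t in range(n_samp):
        # Observations: every participant-sensor pair. Items: conditions.
        items = erp[:, :, :, t].transpose(0, 2, 1).reshape(n_part * n_sens, n_cond)
        alphas[t] = cronbach_alpha(items)
    return alphas

# Simulated data (hypothetical): a topography shared across the 4 conditions
# plus independent noise, for 19 participants, 129 sensors, and 20 samples.
rng = np.random.default_rng(3)
shared = rng.normal(size=(19, 1, 129, 20))
erp = shared + 0.5 * rng.normal(size=(19, 4, 129, 20))
alphas = topography_alpha(erp)

n_rows = erp.shape[0] * erp.shape[2]  # 19 participants x 129 sensors = 2451
```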

Figure 6.

Reliability of the voltage time course for each sensor

A final example analysis quantified the internal consistency of the ERP time course (the entire voltage time series representing the post-stimulus ERP waveform), for each sensor. For each Cronbach’s alpha calculation, observations were 400 time points (the entire epoch, excluding the baseline) for 19 participants, and items were the 4 experimental conditions, resulting in a 7600 by 4 matrix. Each matrix yielded one Cronbach’s alpha value, and this value was determined for each EEG sensor. Thus, a topographical map of internal consistency estimates resulted, indicating the consistency of the voltage time courses across the four experimental conditions. Calculations were repeated for ERPs based on 10, 20, 40, and 80 averaged trials, in the same manner as the analyses described above. Paralleling other ERP measures, the internal consistency of the ERP time course at individual sensors increased with trial count (see Figure 7). As expected given the visual stimulus used in the present study, waveform consistency was highest at posterior sensors. High reliability of the ERP waveform across conditions was reached after 40 trials, with a majority of sensors displaying Cronbach’s alpha values exceeding .8. With all trials, Cronbach’s alpha values exceeded .9 (excellent internal consistency) at 110 (of 129) sensors, suggesting that waveforms were consistent across the four experimental conditions.
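A vectorized variant of the same idea, computing one alpha per sensor from the stacked post-stimulus time course, is sketched below. The time axis (2 ms steps from −100 ms), the choice to count 0 ms as the first post-stimulus sample, the array layout, and the simulated data are all assumptions chosen so that the matrix dimensions match the 7600 by 4 case described above.

```python
import numpy as np

def alpha_last_axes(x):
    """Cronbach's alpha for arrays shaped (..., n_observations, k_items)."""
    k = x.shape[-1]
    item_var = x.var(axis=-2, ddof=1).sum(axis=-1)
    total_var = x.sum(axis=-1).var(axis=-1, ddof=1)
    return k / (k - 1) * (1.0 - item_var / total_var)

def waveform_alpha(erp, post_mask):
    """One alpha per sensor for the post-stimulus voltage time course.

    erp       : (n_participants, n_conditions, n_sensors, n_samples) array
    post_mask : boolean mask selecting post-stimulus samples
    """
    post = erp[:, :, :, post_mask]
    # Reorder to (sensors, participants, samples, conditions), then stack
    # participants and samples into a single observation axis.
    stacked = post.transpose(2, 0, 3, 1)
    n_sens, n_part, n_t, n_cond = stacked.shape
    return alpha_last_axes(stacked.reshape(n_sens, n_part * n_t, n_cond))

times = np.arange(-100, 800, 2)   # hypothetical time axis in ms
post_mask = times >= 0            # assumption: 0 ms counts as post-stimulus

# Simulated data: sensors 0-7 carry a waveform shared across conditions,
# sensors 8-15 contain only independent noise.
rng = np.random.default_rng(5)
shared = rng.normal(size=(19, 1, 16, times.size))
shared[:, :, 8:, :] = 0.0
erp = shared + 0.5 * rng.normal(size=(19, 4, 16, times.size))
alphas = waveform_alpha(erp, post_mask)

n_rows = 19 * int(post_mask.sum())  # 19 participants x 400 samples = 7600
```

In this simulation, sensors with a shared waveform yield high alphas and noise-only sensors yield alphas near zero, which mirrors the posterior versus anterior pattern reported above only in a schematic way.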

Figure 7.

Reliability after averaging across time points and sensors

In many empirical ERP studies, the dependent variable is formed by averaging voltage across time points and/or sensors prior to statistical analysis, a step assumed to reduce error variance and noise. Thus, we also assessed the internal consistency of pattern-onset ERPs for dependent variables formed in this more common way. In this analysis, Cronbach’s alphas were calculated after parametrically increasing the number of sensors and time points included in an average. In terms of sensors, this procedure started with the midline electrode at which the grand mean pattern-evoked potential tended to be largest (Oz for P1 and N1; Pz for P2, N2, and P3; see Figures 2 and 3), and then included increasing numbers of additional electrodes, added in order of proximity to the first electrode. Averaging across time started with the peak within a time window for each component: P1 (100 to 140 ms), N1 (160 to 190 ms), P2 (200 to 230 ms), N2 (240 to 270 ms), and P3 (300 to 380 ms). Once the peak was found, each successive average included adjacent time points in both directions, until reaching the borders of the time windows encompassing the major pattern-onset ERP components. Cronbach’s alpha was calculated for each component separately, for all sensor cluster sizes and time window durations. Again, Cronbach’s alpha values were calculated for ERPs with different trial counts.
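The growing sensor clusters and time windows described above could be generated roughly as follows. Sensor coordinates, the use of Euclidean distance as the proximity measure, the window borders in sample indices, and all names are hypothetical; the text does not specify these details. Each resulting participants-by-conditions matrix can then be passed to any Cronbach's alpha routine.

```python
import numpy as np

def sensors_by_proximity(positions, seed):
    """Sensor indices ordered by Euclidean distance from the seed sensor."""
    distance = np.linalg.norm(positions - positions[seed], axis=1)
    return np.argsort(distance, kind="stable")

def window_around_peak(peak, half_width, lo, hi):
    """Sample indices within half_width of the peak, clipped to [lo, hi]."""
    return np.arange(max(lo, peak - half_width), min(hi, peak + half_width) + 1)

def cluster_window_scores(erp, sensor_idx, sample_idx):
    """Mean voltage over a sensor cluster and time window.

    erp : (n_participants, n_conditions, n_sensors, n_samples) array
    Returns an (n_participants, n_conditions) matrix.
    """
    return erp[:, :, sensor_idx][:, :, :, sample_idx].mean(axis=(2, 3))

# Simulated data and hypothetical 3-D sensor coordinates.
rng = np.random.default_rng(4)
erp = rng.normal(size=(19, 4, 32, 100))
positions = rng.normal(size=(32, 3))
seed = 0                                  # stand-in for the seed sensor (e.g., Oz)
order = sensors_by_proximity(positions, seed)

lo, hi = 50, 65                           # hypothetical component window (samples)
grand = erp.mean(axis=(0, 1))[seed]       # grand-mean waveform at the seed sensor
peak = lo + int(np.argmax(grand[lo:hi + 1]))  # positive peak inside the window

# One matrix per combination of cluster size and window half-width.
matrices = {}
for n_sensors in (1, 5, 10):
    for half_width in (0, 2, 5):
        samples = window_around_peak(peak, half_width, lo, hi)
        matrices[(n_sensors, half_width)] = cluster_window_scores(
            erp, order[:n_sensors], samples)
```

For a negative-going component such as the N1, the peak search would use the minimum rather than the maximum within the window.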

Quantifying Cronbach’s alpha for spatio-temporal voltage averages partly supported the notion that averaging across sensors and time points may heighten reliability. Figure 8 shows the increase in internal consistency associated with averaging across time points and sensors, for the N1 time window: Moderate consistency was observed for a measure of the N1 amplitude that was based on averaging across a 30 ms time window around the N1 peak (176 ms), at sensor Oz. High values (> .7) were obtained for ERPs based on 20 trials, when averaging across 2 to 10 posterior sensors, irrespective of temporal averaging. In the same set of averages (20 trials), Cronbach’s alpha values were excellent (> .9) after averaging 15 sensors in a cluster around Oz, and 40 time points in the window average around the peak of the N1. Importantly, including additional sensors was associated with a sharp decrease in internal consistency, in line with the consistency analysis of peak voltages at individual sensors presented above. When using all trials, no additional benefit in terms of consistency emerged from additional averaging across time points and sensors, but including sensors beyond the parieto-occipital region again led to a decrease in consistency. The same conclusions are suggested by analyses of the P1 and P3 components, as shown in Table 2.

Figure 8.

Table 2.

Percentage of Sensors and Time Points with Excellent Internal Consistency

Trial Count    P1                   P3
               Time      Sensors    Time      Sensors
10             45%       1%         0%        0%
20             100%      40%        100%      25%
40             100%      98%        100%      43%
80             100%      100%       100%      98%

Note. Excellent internal consistency was defined as Cronbach’s alpha > .9. Here, P1 and P3 amplitude measurements are shown after averaging across the temporal and spatial domains. Paralleling the strategy described for Figures 8 and 10, Cronbach’s alpha was calculated for increasing numbers of sample points within each component’s time range and for growing numbers of sensors included in the voltage measurement, for different trial counts (rows). After averaging and reliability analyses, the percentage of time windows and sensor clusters yielding excellent reliability values (Cronbach’s alpha > .9) is shown for each component. Data from the N1 and Selection Negativity components are shown in more detail in Figure 8 and Figure 10, respectively.

Robustness of differences between experimental conditions (effect size of the selection negativity)

Beyond internal consistency, replicability of ERP results in many cases depends on the robustness of differences between experimental conditions. In ERP research, these differences are often visualized using difference waveforms, such as the selection negativity in the current study. F-contrast analyses were used to quantify the effect size of the predicted voltage difference between attention conditions. Paralleling reliability analyses with Cronbach’s alpha, an initial effect size analysis was conducted at each electrode and time point within the selection negativity time range, for different trial counts.

As illustrated by a comparison of Figures 9 and 10, difference waveforms tended to have substantially lower SNR than non-difference waveforms. We first quantified the effect size of the attention-related differences during the selection negativity time window (178 to 292 ms) using planned contrasts, calculated for each sensor location. As shown in Figure 9, medium (.39) to large (.59) effect sizes were reached only when including all available trials, and were confined to a parieto-occipital sensor cluster. The greatest effect size of .47 was observed at sensor Pz, where the SN displayed the greatest SNR.

Figure 9.

Figure 10.

Robustness of condition differences after averaging across time points and sensors

Paralleling the approach of the reliability analyses above, we modeled the predicted differences between conditions after averaging time windows and sensor clusters surrounding Pz. The analysis started by calculating the effect size at the two difference wave peaks (230 and 256 ms, respectively) for sensor Pz. The analysis then grew to include larger time windows surrounding the peak, and larger sensor clusters expanding radially from Pz. Again, this analysis was conducted for all subsets of trials.

As shown in Figure 10, effect sizes of condition differences were affected by averaging across time points and sensors, and differed between the early (178 to 234 ms) and late (236 to 292 ms) selection negativity. In both the early and late SN time windows, a moderate effect size was observed when using all trials and averaging between 3 and 60 sensors around Pz, for any group of time points within the respective window. The early time window showed the highest effect sizes (maximum of .52) after averaging across 20 ms around the temporal peak of the SN (i.e., 206 ms) and a cluster of 12 sensors surrounding Pz. The late SN displayed its highest effect sizes (up to .41) when averaging across 60 ms around the peak and 15 sensors surrounding Pz. Importantly, including time points or sensors that were not consistent with the voltage topography and time course of the selection negativity component tended to dramatically decrease effect size estimates.

Discussion

The goal of this study was to provide an example analysis for how SNR, internal consistency, and robustness may be established for dependent variables derived from ERPs in an experimental design with within-participants manipulations. To illustrate possible ways towards quantifying (and maximizing) the internal consistency of ERP results, we systematically examined the relation between the trial count (and thus the SNR) and internal consistency, while using effect size as a measure of ERP robustness. In addition, the effects of several commonly used quantification techniques on reliability were investigated, such as measuring the peak voltage or the mean voltage across time points and/or electrodes. Given the spatio-temporal nature of ERP data, different types of dependent variables may be extracted from the electrode by time matrices available for each participant and condition. Of these different variables, we examined the internal consistency of (i) peak and mean voltage at selected sensors, (ii) the entire voltage topography at selected time points, and (iii) the entire waveform at selected electrodes. The findings have implications for a series of questions that are of theoretical and practical relevance for ERP researchers, discussed below.

Does calculating internal consistency metrics in studies of within-participants effects have practical value?

The present analysis found that the reproducibility of all variables examined across repeated measurements in the same participant was readily captured by calculating internal consistency using the experimental conditions as “items.” Notably, this approach was sensitive to several properties of ERP data known to affect reproducibility. For example, high SNR strongly predicted high consistency, and consistency also displayed spatial and temporal specificity reflective of the known time course and topography of pattern-evoked ERPs. An important question is how reactivity to the experimental manipulation will affect internal consistency, compared to consistency of components that are not modulated by the experimental manipulations. In the current study, Cronbach’s alpha for occipital voltage amplitude was relatively reduced during a narrow time-window (the selection negativity time-window: 160–280 ms), although still being at satisfactory to very good levels (see Figure 5). Thus, reactivity (the change of the ERP variable in response to the experimental manipulations) in the present study did not drastically alter the ranking of participants across conditions, again supporting the use of conditions as items for estimating consistency. In practice, to document the consistency of the ERP in a given study, researchers may compare consistency of aspects of the time-varying ERP signal that are outside versus inside the temporal region of interest. Relatively lower consistency accompanied by satisfactory effect size during the time window of interest then would point to an effect that is built on consistent, robust individual ERPs, as opposed to noisy and irregular waveforms.

Are there differences in internal consistency between spatial and temporal properties of the ERP?

The ERP is given as a 2-dimensional matrix with temporal and spatial properties. Thus, researchers often use portions of the temporal and/or spatial information to quantify the latency (Miller, Patterson, & Ulrich, 1998) and topographical distribution (McCarthy & Wood, 1985) of ERP components. Latency and topographical distribution of an ERP component can be utilized to compare amplitude differences across time, locations on the scalp, and experimental conditions (Dien, Spencer, & Donchin, 2004; Cuthbert, Schupp, Bradley, Birbaumer, & Lang, 2000; Foxe & Simpson, 2002; Kappenman & Luck, 2012).

The current study found both the topographical distribution of voltages (at individual time points), and the ERP waveform (the sequence of voltage changes over time at individual sensors) internally consistent under certain conditions: The reliable quantification (Cronbach’s alpha > .7) of the voltage topography at a given time point was restricted to the time range in which clear ERP components were seen, and required averages containing 40 to 50 trials, which corresponded to SNRs above 10 in the present analysis. The time-varying internal consistency of the voltage topography was most sensitive to differences between the experimental conditions. Since conditions were used as items here, a significant drop in consistency marked the time range of the selection negativity, in which the differences between the four experimental conditions were most pronounced. Thus, although the present approach of using experimental conditions as items or repetitions for consistency analysis is convenient and widely applicable in most empirical studies, caution is warranted in situations in which between-condition differences are expected to lead to qualitative changes in the ERP topography. Conversely, the time-varying consistency analysis demonstrated here may provide a sensitive, quantitative data mining tool for detecting time periods of systematic topographical differences between conditions. Future work may build on existing work by systematically examining the effects of electrode density on voltage measurements (Junghöfer et al., 1997) and source estimation (Hauk, Keil, Elbert, & Müller, 2002), and include the aspect of reliability of experimental differences.

Reliably measuring the temporal sequence of ERP components across the entire epoch was possible when using averages comprising 30 or more trials, across major portions of the posterior scalp. At these trial counts, SNRs of the main ERP components varied between approximately 4 and 10. Although the findings of the present report will necessarily be paradigm-specific, this may be taken to indicate that the dynamics of the waveform are replicable at lower SNR than is necessary to reliably capture voltage topographies. To fully harness the internal consistency available at relatively low trial counts, however, it is crucial to measure the voltage waveform at time points and sensors associated with the maximum SNR across components. In the present data set, high SNR across components was seen at parieto-occipital sensor locations only. Accordingly, reliable waveform estimation at anterior electrode sites requires substantially higher trial counts than at posterior sensors. This relation highlights the important role of SNR for reliable measurement of ERP voltages, discussed in greater detail in the next section.

How many trials are required for the robust quantification of an ERP effect?

Recommendations for trial counts are often based on experience, tradition, laboratory lore, or estimates of signal-to-noise of a given ERP component (Woodman, 2010). An alternative approach taken here consists of quantitatively assessing different psychometric and quality criteria. Internal consistency of an ERP measure is a minimal condition for its use as an index of a given brain process. Across the present study, high Cronbach’s alpha values (> .7) were observed for different measures derived from the ERP, at trial counts that may be considered surprisingly low: When considering the cross-condition internal consistency of individual (peak) voltage amplitude measurements, high consistency was observed after averaging 30 or more trials, at posterior scalp regions, and across different ERP components. At the same time, SNR for the 30-trial averages was in the range between 4 and 10 in parietal and occipital clusters, which is not considered optimal for empirical studies (Luck, 2005). It is important to keep in mind that Cronbach’s alpha indexes the extent to which a number of items (here, 4 experimental conditions) co-vary across observations, which is often interpreted as evidence that they measure the same underlying construct (here, the brain process of interest). Thus, internal consistency can be regarded as a necessary but not sufficient condition for robust estimation of ERP effects: Authors interested in using an ERP voltage measure (for example, a component peak such as the P3) as a marker for individual participants may rely on relatively low trial counts. However, high internal consistency of the non-difference voltages does not imply that any condition differences will be reliably detected.

The effects of SNR and quantification techniques on capturing experimental effects (condition differences) were measured by the effect size of the selection negativity effect across the four conditions. This procedure has the advantage that it takes all experimental conditions into account, thus paralleling the Cronbach’s alpha analyses. In strong contrast to the internal consistency measures, however, the observation of moderate to high effect sizes required using all available trials (i.e., a median of 80 artifact-free trials with correct responses). Likewise, SNR of the difference waveform was strongly attenuated compared to non-difference waveforms, with 80-trial averages associated with SNRs of around 5–6 at parieto-occipital sensors. These findings support trial count recommendations targeting an SNR of 10 (Luck, 2005), which in the present data set would be expected to lead to a greater spatial extent of high effect size measurements, across wide areas of the scalp. In many studies with clinical, pediatric, or aging populations, however, these trial counts may not always be achievable, and explicit measurements of SNR may assist authors wishing to document the data quality available in a given study. Keeping in mind the paradigm-specificity of the present results, researchers may expect to obtain reliable findings (at moderate to high effect size) when the SNR of the difference waveform is in the range of 5–6, specifically in studies where the time range and electrode location of the expected effect are known a priori. This a priori knowledge allows further improvement of SNR by using appropriate quantification techniques, discussed next.

How does the measurement technique impact reliability and effect size?

Previous studies examining reliability have focused on a particular ERP component of interest, often measured in many different ways. Measurement techniques widely used in ERP studies include averaging or integrating voltages across time points and sensors, with substantial variability regarding the extent (and type) of averaging in both the time and the spatial domain. Conventions for measuring a given ERP waveform are often grounded in tradition and tend to be flexibly adjusted to changing demands, for example, posed by studying a specific population or using a different paradigm or experimental task.

The current analysis demonstrated, not surprisingly, that averaging across time points prior to analyses improved reliability and effect size. This is of particular relevance for researchers interested in quantifying spatio-temporal dynamics at high temporal resolution (Dien, Spencer, & Donchin, 2004), based on spatial information derived from individual sample points. As illustrated in a recent analysis of the low-amplitude C1 component (Foxe et al., 2008), such approaches should be guided by caution, because important spatio-temporal information may be lost by generous averaging across time points when measuring mean amplitude. The temporal and spatial specificity, and thus the external validity, of ERP measurements may be endangered particularly in situations where mean amplitudes are computed across extended time periods of ERP signals measured at low SNR (Ravden & Polich, 1999). Many strategies have been proposed to address this issue, including combining time points according to their multivariate structure into temporal factors (Dien, 2010) or capitalizing on the rich information contained in the single trials entering the ERP average (Makeig, Debener, Onton, & Delorme, 2004). In a similar vein, techniques that use the variability in the time course and topography to determine temporally stable “micro-states” in the ERP (Pascual-Marqui, Michel, & Lehmann, 1995) may assist in ensuring that the integration of voltages at subsequent time points into one index does not reduce the validity of the measurements.

Averaging across any of the available domains (trials, time, or sensors) may increase both the signal-to-noise ratio and the internal consistency, at different rates for each domain. The present study strongly suggests caution, however, when applying this approach, because SNR, effect size, and internal consistency were all negatively affected by excessive averaging across electrodes and time points. For instance, varying the number of trials averaged together produced internally consistent results after ~40 trials for the entire time course, but only at EEG sensors located over occipital and parietal areas. Including frontal or facial EEG sensors drastically decreased internal consistency and effect size. Thus, the major components of the pattern-evoked visual ERP may be consistently measured based on a 40-trial average at any occipital or parietal sensor, but voltage differences at frontal or facial sensors will not be reliably captured by such an analysis. In a similar vein, measuring individual peaks of the pattern-evoked ERP from 40-trial averages is possible for posterior sensors, but the same 40-trial average would result in unsatisfactory internal consistency when considering anterior sensors. It is highly likely that these specific numerical results will not apply to other ERP studies, given differences in ERP components evoked by different stimuli and in different paradigms, along with varying data quality drawn from different populations and EEG systems. However, analyses of internal consistency are easily implemented and may accompany reports using new analysis techniques, new ERP variables, or new ERP measurement techniques, ideally together with a report of the SNR. Communicating quantitative indices of internal consistency such as Cronbach’s alpha may assist both authors and readers in assessing the robustness of effects, thus helping to increase reproducibility in future studies with similar paradigms.

Further highlighting this point, the present study found generally non-linear relations between SNR, internal consistency, and effect size, for different measurement techniques such as averaging across domains (e.g., trials, time, or sensors). These indices also varied greatly with the scalp location and time segment included in the analyses. As predicted, SNR increased logarithmically with the number of averaged trials, such that each doubling of the trial count added a constant SNR increment, but this relation was specific to scalp locations sensitive to the component under consideration. For example, as shown in Figure 4, SNR for the P1 ERP component measured at sensor Oz increased linearly as the number of trials doubled. Sensors near Oz (the location of the P1 maximum) showed similar increases in SNR, whereas sensors in frontal areas (distal to the location of the P1 maximum) showed low SNR regardless of trial count. By contrast, internal consistency increased with the number of trials averaged, across wide areas of the scalp, including at frontal and lateral sensors. Frontal EEG sensors reached excellent reliability with all trials at component peaks, despite small SNR at those locations.

Whereas internal consistency increased logarithmically with added trials, it changed quadratically as a function of averaging across multiple time points within a component. For example, for the N1 component (based on 10 trials) measured at sensor Oz, averaging 10, 20, 30, and 40 time points surrounding the peak yielded Cronbach’s alphas of .7, .85, .9, and .8, respectively. Thus, the N1 component was most consistent when using a 20–30 ms window average centered on the N1 peak, but reliability decreased if this window was expanded further. This is consistent with the intuitive notion that measuring the mean amplitude improves internal consistency only as long as the averaging window includes time points that are part of the component of interest. Including time points with different properties, with polarity being an obvious example, will necessarily reduce SNR and reliability, as well as external validity of the measurements. Time domain averaging for other components, including the Selection Negativity, produced reliability fluctuations in a similar quadratic pattern, with a time window of approximately 30 ms found to maximize internal consistency and effect size. Thus, averaging across trials and averaging across time points within a given component window each increased reliability, but trial domain averaging reached ceiling (Cronbach’s alpha values nearing 1) with all trials, while time domain averaging began to decrease internal consistency when extending the window beyond 30 ms.

In ERP studies, spatial averaging is sometimes implemented in a data-driven way, by selecting the EEG sensor with the largest SNR to serve as the center of an electrode cluster containing sensors for spatial averaging. In analogy to the temporal averaging described above, the present study examined effects of this simple technique by averaging across electrode clusters containing increasing numbers of sensors, while comparing quality indices of the data. Paralleling averaging across time points, spatial averaging showed strong non-linear effects on quality indices: For example, the analysis for the P1 component started with sensor Oz, where the SNR distribution of the P1 component showed a maximum. Additional sensors were then added to the cluster based on spatial proximity. As shown in Figure 8, internal consistency was highest when the cluster was smallest (only Oz), and tended to decrease as sensors were added to the cluster. Thus, cluster sizes of 5, 20, 35, and 50 were associated with internal consistency values of approximately .95, .9, .85, and .8, respectively, for the peak P1 voltage extracted from a 20-trial average. This somewhat unexpected negative relationship was apparent for all components examined (the P1, N1, P3, and Selection Negativity), and for all averaged trial counts (except 80 averaged trials, where internal consistency remained near 1 for nearly all cluster sizes). Close examination of Figure 8 shows that sensor averaging may result in very modest consistency increases compared to individual sensor measurements. The feature-based attention difference waveform (containing the SN) is, by virtue of being a difference waveform, particularly dependent on the signal-to-noise ratio of the non-difference ERPs on which it is based. In the present study, the SN showed modest effect sizes with trial counts of 80 when considering individual time points and sensors. Effect sizes were substantially greater when averaging across sensors and time points: Pooling voltages for posterior midline sensors in the time range of the N1 and N2 components, where the SN was maximal, resulted in the highest effect sizes, but still only reached values around 0.6. Together, these findings suggest that the internal consistency of pattern-evoked ERP measurements is promoted by including additional time points, but to a lesser extent by including additional sensors, in the component scores used as a dependent variable. Given the wide range of practices used in the ERP literature, the increased availability of similar quantitative analyses of quality indices would be desirable, allowing comparison of different quantification approaches.
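The cluster-growing procedure described above (start at the sensor with maximal SNR, then add sensors by spatial proximity) can be sketched as follows. The sensor coordinates and voltages are made-up values for illustration only, and the helper names `grow_cluster` and `cluster_mean` are hypothetical; Euclidean distance between sensor positions is assumed as the proximity measure.

```python
import math


def grow_cluster(seed, positions, size):
    """Return the `size` sensors closest to `seed`, seed first.

    positions: dict mapping sensor name to an (x, y, z) tuple.
    Proximity is assumed to be Euclidean distance between positions.
    """
    seed_pos = positions[seed]
    ranked = sorted(positions, key=lambda name: math.dist(positions[name], seed_pos))
    return ranked[:size]


def cluster_mean(voltages, cluster):
    """Average voltage across the sensors in the cluster."""
    return sum(voltages[name] for name in cluster) / len(cluster)


# Made-up unit-sphere-like coordinates and voltages (hypothetical values).
positions = {
    "Oz": (0.0, -1.0, 0.0),
    "O1": (-0.3, -0.95, 0.0),
    "O2": (0.3, -0.95, 0.0),
    "Pz": (0.0, -0.7, 0.7),
    "Fz": (0.0, 0.7, 0.7),
}
voltages = {"Oz": 4.0, "O1": 3.0, "O2": 2.0, "Pz": 1.0, "Fz": -1.0}

cluster = grow_cluster("Oz", positions, size=3)
pooled = cluster_mean(voltages, cluster)
```

Repeating this for increasing values of `size` and recomputing a consistency index on the pooled voltages would give a curve of consistency as a function of cluster size, comparable in spirit to the analysis summarized in Figure 8.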

Conclusions and Outlook

The present study explored ways in which the internal consistency of ERP measurements can be assessed. A representative data set from a selective attention task was used, involving pattern-evoked visual ERPs recorded by means of dense-array EEG. Main results converged to show high internal consistency of measurements taken from non-difference ERPs, even at surprisingly low trial counts, corresponding to relatively low SNRs. By contrast, robust quantification of voltage differences between experimental conditions, measured by the effect size, required substantially greater SNRs. Overall, consistency as well as effect size varied by SNR, but not in a linear fashion: SNR predicted consistency and effect size at posterior scalp locations where the pattern-evoked ERP signal was pronounced, but not at other sites. A comparison of quantification techniques assessed differences between measuring the peak amplitude and measuring the mean amplitude with varying time points and electrode sites included in the mean. Throughout these analyses, internal consistency and effect size benefitted from measuring the mean voltage rather than the peak voltage when (a) the SNR of the signal of interest was low, and (b) only neurophysiologically plausible time points and sensors were included in the mean amplitude measurement (that is, time points and electrode locations that captured the same process). Including additional scalp locations and time points was associated with a sharp decrease in internal consistency and effect size. Thus, the common method of measuring mean amplitude as spatio-temporal averages across a subset of the ERP matrix may be informed by quantitative analyses of consistency, to ensure that a given practice reliably captures the desired aspect of the ERP signal.
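The distinction between peak and mean amplitude, and the notion of an effect size for a condition difference, can be made concrete with a short sketch. The effect size shown is a paired-samples Cohen's d (mean difference divided by the standard deviation of the differences), which is one common choice and is assumed here for illustration; it is not necessarily the effect size index used in the present analyses. All function names and data values are hypothetical.

```python
import statistics


def peak_amplitude(waveform, start, stop):
    """Largest voltage among samples start..stop-1 (positive-going peak assumed)."""
    return max(waveform[start:stop])


def mean_amplitude(waveform, start, stop):
    """Average voltage across samples start..stop-1."""
    return statistics.fmean(waveform[start:stop])


def paired_cohens_d(cond_a, cond_b):
    """Paired-samples effect size: mean difference / SD of differences."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)


# Hypothetical single-participant waveform (arbitrary voltage units).
waveform = [0.0, 1.0, 3.0, 2.0, 0.5]
peak = peak_amplitude(waveform, 1, 4)
mean = mean_amplitude(waveform, 1, 4)

# Hypothetical per-participant amplitudes in two conditions.
attended = [2.0, 3.0, 4.0, 5.0]
unattended = [1.0, 1.0, 3.0, 3.0]
d = paired_cohens_d(attended, unattended)
```

Swapping `peak_amplitude` for `mean_amplitude` when building the per-participant condition values, and widening or narrowing the window, is one way to compare how each quantification choice affects the resulting effect size.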

It will be an interesting goal for future studies to explore the extent to which the present analyses may be extended to other experimental paradigms. Because the necessary computational resources and technical training are now widely available in ERP laboratories, quantitative analyses of internal consistency could easily accompany reports on experimental findings. Future studies may also wish to characterize reliability using additional paradigms and measurements common in ERP studies, such as a component’s temporal peak or metrics extracted from independent or principal component analysis. Overall, given the growing number of methodological developments, novel paradigms, and increased use of sophisticated measurement techniques, extensive practice of reporting internal consistency may be a welcome addition to the psychophysiologist’s toolbox.

Footnotes

1

It should be noted that a logarithmic relationship between SNR and trial count is based on the assumption that random noise in an ERP waveform decreases as trials are added to an average. In this sense, only random noise (and not systematic noise) decreases as trials are added to an average, leading to a higher SNR. Systematic noise could include an increase in alpha power over the recording session, or effects arising from oculomotor activity that may be temporally correlated with stimulus onset or offset, and would not necessarily be diminished with higher trial counts.

2

The P3 ERP component can be distorted by high-pass filters (Duncan-Johnson & Donchin, 1979). To ensure that the present filter setting did not significantly distort the P3 component in the current study, we re-analyzed our data after preprocessing with a 2nd order Butterworth high-pass filter having a 3 dB point set at .1 Hz. As expected, the amplitude of either component examined here was unaffected.

References

  1. Anllo-Vento L, Hillyard SA. Selective attention to the color and direction of moving stimuli: Electrophysiological correlates of hierarchical feature selection. Perception & Psychophysics. 1996;58(2):191–206. doi: 10.3758/bf03211875. http://doi.org/10.3758/BF03211875. [DOI] [PubMed] [Google Scholar]
  2. Anokhin AP, van Baal GCM, van Beijsterveldt CEM, de Geus EJC, Grant J, Boomsma DI. Genetic Correlation Between the P300 Event-Related Brain Potential and the EEG Power Spectrum. Behavior Genetics. 2001;31(6):545–554. doi: 10.1023/a:1013341310865. http://doi.org/10.1023/A:1013341310865. [DOI] [PubMed] [Google Scholar]
  3. Cohen J, Polich J. On the number of trials needed for P300. International Journal of Psychophysiology. 1997;25(3):249–255. doi: 10.1016/s0167-8760(96)00743-x. http://doi.org/10.1016/S0167-8760(96)00743-X. [DOI] [PubMed] [Google Scholar]
  4. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16(3):297–334. http://doi.org/10.1007/BF02310555. [Google Scholar]
  5. Cuthbert BN, Schupp HT, Bradley MM, Birbaumer N, Lang PJ. Brain potentials in affective picture processing: covariation with autonomic arousal and affective report. Biol Psychol. 2000;52(2):95–111. doi: 10.1016/s0301-0511(99)00044-7. [DOI] [PubMed] [Google Scholar]
  6. Dien J. The ERP PCA Toolkit: an open source program for advanced statistical analysis of event-related potential data. Journal of Neuroscience Methods. 2010;187(1):138–45. doi: 10.1016/j.jneumeth.2009.12.009. http://doi.org/10.1016/j.jneumeth.2009.12.009. [DOI] [PubMed] [Google Scholar]
  7. Dien J, Spencer KM, Donchin E. Localization of the event-related potential novelty response as defined by principal components analysis. Brain Res Cogn Brain Res. 2003;17(3):637–50. doi: 10.1016/s0926-6410(03)00188-5. [DOI] [PubMed] [Google Scholar]
  8. Dien J, Spencer KM, Donchin E. Parsing the late positive complex: mental chronometry and the ERP components that inhabit the neighborhood of the P300. Psychophysiology. 2004;41(5):665–78. doi: 10.1111/j.1469-8986.2004.00193.x. [DOI] [PubMed] [Google Scholar]
  9. Dien J, Spencer KM, Donchin E. Parsing the late positive complex: mental chronometry and the ERP components that inhabit the neighborhood of the P300. Psychophysiology. 2004;41(5):665–78. doi: 10.1111/j.1469-8986.2004.00193.x. http://doi.org/10.1111/j.1469-8986.2004.00193.x. [DOI] [PubMed] [Google Scholar]
  10. Donchin E, Callaway E, Cooper R, Desmedt JE, Goff WR, Hillyard S, Sutton S. Publication Criteria for Studies of Evoked Potentials in Man. In: Desmedt JE, editor. Attention, Voluntary Contraction and Event-Related Cerebral Potentials. Vol. 1. Brussels: Karger; 1977. pp. 1–11. [Google Scholar]
  11. Duncan-Johnson CC, Donchin E. The Time Constant in P300 Recording. Psychophysiology. 1979;16(1):53–55. doi: 10.1111/j.1469-8986.1979.tb01440.x. http://doi.org/10.1111/j.1469-8986.1979.tb01440.x. [DOI] [PubMed] [Google Scholar]
  12. Fabiani M, Gratton G, Corballis PM, Cheng J, Friedman D. Bootstrap assessment of the reliability of maxima in surface maps of brain activity of individual subjects derived with electrophysiological and optical methods. Behavior Research Methods, Instruments, & Computers. 1998;30(1):78–86. http://doi.org/10.3758/BF03209418. [Google Scholar]
  13. Fabiani M, Gratton G, Karis D, Donchin E. Definition, identification, and reliability of the P300 component of the event-related brain potential. 1987;2:1–78. Retrieved from http://www.researchgate.net/publication/225304622_Definition_identification_and_reliability_of_the_P300_component_of_the_event-related_brain_potential. [Google Scholar]
  14. Foti D, Kotov R, Hajcak G. Psychometric considerations in using error-related brain activity as a biomarker in psychotic disorders. 2013. doi: 10.1037/a0032618. Retrieved from http://psycnet.apa.org/journals/abn/122/2/520. [DOI] [PubMed]
  15. Foxe JJ, Simpson GV. Flow of activation from V1 to frontal cortex in humans: a framework for defining “early” visual processing. Experimental Brain Research. 2002;142:139–150. doi: 10.1007/s00221-001-0906-7. [DOI] [PubMed] [Google Scholar]
  16. Foxe JJ, Strugstad EC, Sehatpour P, Molholm S, Pasieka W, Schroeder CE, McCourt ME. Parvocellular and magnocellular contributions to the initial generators of the visual evoked potential: high-density electrical mapping of the “C1” component. Brain Topography. 2008;21(1):11–21. doi: 10.1007/s10548-008-0063-4. http://doi.org/10.1007/s10548-008-0063-4. [DOI] [PubMed] [Google Scholar]
  17. Handy TC. Event-related Potentials: A Methods Handbook. MIT Press; 2005. Retrieved from https://books.google.com/books?hl=en&lr=&id=OQyZEfgEzRUC&pgis=1. [Google Scholar]
  18. Harter MR, Aine CJ. Brain Mechanisms of Visual Selective Attention. 1984. Retrieved June 8, 2015, from http://www.researchgate.net/profile/Cheryl_Aine/publication/243784610_Brain_Mechanisms_of_Visual_Selective_Attention/links/53f4d0690cf2888a74912369.pdf. [Google Scholar]
  19. Hauk O, Keil A, Elbert T, Müller MM. Comparison of data transformation procedures to enhance topographical accuracy in time-series analysis of the human EEG. Journal of Neuroscience Methods. 2002;113(2):111–122. doi: 10.1016/s0165-0270(01)00484-8. http://doi.org/10.1016/S0165-0270(01)00484-8. [DOI] [PubMed] [Google Scholar]
  20. Hinton P, Brownlow C, McMurray I, Cozens B. SPSS explained. Routledge; 2004. Retrieved from https://scholar.google.com/scholar?hl=en&q=hinton+brownlow+mcmurray+cozens&btnG=&as_sdt=1%2C10&as_sdtp=#1.
  21. Hopf JM, Boelmans K, Schoenfeld MA, Luck SJ, Heinze HJ. Attention to features precedes attention to locations in visual search: evidence from electromagnetic brain responses in humans. J Neurosci. 2004;24(8):1822–32. doi: 10.1523/JNEUROSCI.3564-03.2004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Junghöfer M, Elbert T, Leiderer P, Berg P, Rockstroh B. Mapping EEG-potentials on the surface of the brain: A strategy for uncovering cortical sources. Brain Topography. 1997;9(3):203–217. doi: 10.1007/BF01190389. http://doi.org/10.1007/BF01190389. [DOI] [PubMed] [Google Scholar]
  23. Junghöfer M, Elbert T, Tucker DM, Rockstroh B. Statistical control of artifacts in dense array EEG/MEG studies. Psychophysiology. 2000;37(4):523–532. http://doi.org/10.1111/1469-8986.3740523. [PubMed] [Google Scholar]
  24. Kappenman ES, Luck SJ. ERP Components: The Ups and Downs of Brainwave Recordings. In: Luck SJ, Kappenman ES, editors. Oxford Handbook of ERP Components. New York: Oxford University Press; 2012. [Google Scholar]
  25. Keil A, Debener S, Gratton G, Junghofer M, Kappenman ES, Luck SJ, Yee CM. Committee report: Publication guidelines and recommendations for studies using electroencephalography and magnetoencephalography. Psychophysiology. 2014;51(1):1–21. doi: 10.1111/psyp.12147. http://doi.org/10.1111/psyp.12147. [DOI] [PubMed] [Google Scholar]
  26. Keil A, Muller MM. Feature selection in the human brain: electrophysiological correlates of sensory enhancement and feature integration. Brain Res. 2010;1313:172–84. doi: 10.1016/j.brainres.2009.12.006. http://doi.org/10.1016/j.brainres.2009.12.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Light GA, Swerdlow NR. Future clinical uses of neurophysiological biomarkers to predict and monitor treatment response for schizophrenia. Annals of the New York Academy of Sciences. 2015;1344(1):105–19. doi: 10.1111/nyas.12730. http://doi.org/10.1111/nyas.12730. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Luck SJ. An introduction to the event-related potential technique. Cambridge, MA: MIT Press; 2005. [Google Scholar]
  29. Luck SJ. An Introduction to the Event-Related Potential Technique. 2014. [Google Scholar]
  30. Luck SJ, Mathalon DH, O’Donnell BF, Hämäläinen MS, Spencer KM, Javitt DC, Uhlhaas PJ. A roadmap for the development and validation of event-related potential biomarkers in schizophrenia research. Biological Psychiatry. 2011;70(1):28–34. doi: 10.1016/j.biopsych.2010.09.021. http://doi.org/10.1016/j.biopsych.2010.09.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Makeig S, Debener S, Onton J, Delorme A. Mining event-related brain dynamics. Trends in Cognitive Sciences. 2004;8(5):204–10. doi: 10.1016/j.tics.2004.03.008. http://doi.org/10.1016/j.tics.2004.03.008. [DOI] [PubMed] [Google Scholar]
  32. Marco-Pallares J, Cucurell D, Münte TF, Strien N, Rodriguez-Fornells A. On the number of trials needed for a stable feedback-related negativity. Psychophysiology. 2011;48(6):852–860. doi: 10.1111/j.1469-8986.2010.01152.x. http://doi.org/10.1111/j.1469-8986.2010.01152.x. [DOI] [PubMed] [Google Scholar]
  33. McCarthy G, Wood CC. Scalp distributions of event-related potentials: an ambiguity associated with analysis of variance models. Electroencephalography and Clinical Neurophysiology. 1985;62:203–208. doi: 10.1016/0168-5597(85)90015-2. [DOI] [PubMed] [Google Scholar]
  34. McGinnis EM, Keil A. Selective processing of multiple features in the human brain: Effects of feature type and salience. PLoS ONE. 2011;6(2):1–12. doi: 10.1371/journal.pone.0016824. http://doi.org/10.1371/journal.pone.0016824. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Miller J, Patterson T, Ulrich R. Jackknife-based method for measuring LRP onset latency differences. Psychophysiology. 1998;35(1):99–115. [PubMed] [Google Scholar]
  36. Müller MM, Keil A. Neuronal synchronization and selective color processing in the human brain. Journal of Cognitive Neuroscience. 2004;16(3):503–522. doi: 10.1162/089892904322926827. http://doi.org/10.1162/089892904322926827. [DOI] [PubMed] [Google Scholar]
  37. Pascual-Marqui RD, Michel CM, Lehmann D. Segmentation of brain electrical activity into microstates: model estimation and validation. IEEE Trans Biomed Eng. 1995;42(7):658–65. doi: 10.1109/10.391164. http://doi.org/10.1109/10.391164. [DOI] [PubMed] [Google Scholar]
  38. Perez VB, Swerdlow NR, Braff DL, Näätänen R, Light GA. Using biomarkers to inform diagnosis, guide treatments and track response to interventions in psychotic illnesses. Biomarkers in Medicine. 2014;8(1):9–14. doi: 10.2217/bmm.13.133. http://doi.org/10.2217/bmm.13.133. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Peyk P, DeCesarei A, Junghöfer M. Electro Magneto Encephalography Software: overview and integration with other EEG/MEG toolboxes. Computational Intelligence and Neuroscience. 2011;2011. doi: 10.1155/2011/861705. Article ID 861705. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Pontifex MB, Scudder MR, Brown ML, O’Leary KC, Wu CT, Themanson JR, Hillman CH. On the number of trials necessary for stabilization of error-related brain activity across the life span. Psychophysiology. 2010;47(4):767–773. doi: 10.1111/j.1469-8986.2010.00974.x. http://doi.org/10.1111/j.1469-8986.2010.00974.x. [DOI] [PubMed] [Google Scholar]
  41. Ravden D, Polich J. On P300 measurement stability: habituation, intra-trial block variation, and ultradian rhythms. Biol Psychol. 1999;51(1):59–76. doi: 10.1016/s0301-0511(99)00015-0. [DOI] [PubMed] [Google Scholar]
  42. Rosnow RL, Rosenthal R. Computing contrasts, effect sizes, and counternulls on other people’s published data: General procedures for research consumers. Psychological Methods. 1996;1(4):331. [Google Scholar]
  43. Schlögl A, Keinrath C, Zimmermann D, Scherer R, Leeb R, Pfurtscheller G. A fully automated correction method of EOG artifacts in EEG recordings. Clinical Neurophysiology : Official Journal of the International Federation of Clinical Neurophysiology. 2007;118(1):98–104. doi: 10.1016/j.clinph.2006.09.003. http://doi.org/10.1016/j.clinph.2006.09.003. [DOI] [PubMed] [Google Scholar]
  44. Schoenfeld MA, Hopf JM, Martinez A, Mai HM, Sattler C, Gasde A, Hillyard SA. Spatio-temporal analysis of feature-based attention. Cereb Cortex. 2007;17(10):2468–77. doi: 10.1093/cercor/bhl154. [DOI] [PubMed] [Google Scholar]
  45. Spencer KM, Dien J, Donchin E. A componential analysis of the ERP elicited by novel events using a dense electrode array. Psychophysiology. 1999;36:409–414. doi: 10.1017/s0048577299981180. [DOI] [PubMed] [Google Scholar]
  46. Teplan M. Fundamentals of EEG Measurement. 2002. Retrieved June 8, 2015, from http://www.edumed.org.br/cursos/neurociencia/MethodsEEGMeasurement.pdf. [Google Scholar]
  47. Vidaurre C, Sander TH, Schlögl A. BioSig: the free and open source software library for biomedical signal processing. Computational Intelligence and Neuroscience. 2011;2011:935364. doi: 10.1155/2011/935364. http://doi.org/10.1155/2011/935364. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Woodman GF. A brief introduction to the use of event-related potentials in studies of perception and attention. Attention, Perception, & Psychophysics. 2010;72(8):2031–2046. doi: 10.3758/APP.72.8.2031. http://doi.org/10.3758/BF03196680. [DOI] [PMC free article] [PubMed] [Google Scholar]
