STAR Protocols. 2025 Mar 6;6(1):103682. doi: 10.1016/j.xpro.2025.103682

Protocol for semi-automatic EEG preprocessing incorporating independent component analysis and principal component analysis

Guang Ouyang 1,2,3, Yingzhe Li 1
PMCID: PMC11930125  PMID: 40053447

Summary

Preprocessing is a critical yet challenging step in electroencephalography (EEG) research due to its significant potential impact on results. We present a protocol for semi-automatic EEG preprocessing incorporating independent component analysis (ICA) and principal component analysis (PCA) with step-by-step quality checking to ensure removal of large-amplitude artifacts. We describe steps for interpolating bad channels, removal of major artifacts by ICA and PCA correction, and exporting processed data. This protocol produced consistent results from users with a broad range of experience.

Subject area: Bioinformatics, Cognitive Neuroscience, Behavior


Highlights

  • Instructions for EEG preprocessing protocol with step-by-step quality checking

  • Guidance on bandpass filtering and bad channel interpolation

  • Procedures for ICA-based ocular artifact correction

  • Steps for PCA-based large-amplitude transient artifact correction


Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.



Before you begin

Non-invasive electroencephalography (EEG) has long been a prominent research tool in cognitive research and is increasingly extending its impact to higher-level domains such as social psychology and education. As the number of users grows, there is a pressing need for a robust EEG preprocessing procedure because: 1) EEG preprocessing is essential for ‘cleaning up’ raw EEG data by removing major artifacts. Inadequate preprocessing can significantly impact the end results, rendering them questionable. 2) Many EEG users, especially those without extensive training in neuro/bio-signal processing, seek a packaged tool for collecting reliable neural indicators without delving into the intricate technical details underlying the generation of those indicators. However, new users often face challenges due to the high variability and ambiguity of preprocessing practices in the EEG research community.1,2,3,4,5,6,7 Given the absence of a fully standardized and automatic procedure and the associated feasibility concerns,8 the field urgently needs a practical protocol for users at the early stages of exploring neural correlates. For them, effectively ruling out major artifacts is a higher priority. Whether the neural correlates or indicators are influenced by remaining artifacts can be confirmed later in various ways. The presence of large, visible artifacts from the outset will undoubtedly lower data quality and increase the risk of misinterpretation.

Based on these considerations, we present a robust protocol designed to guide users through step-by-step procedures with demonstration and explanation for each step. The protocol is intended for EEG recording without electrooculography (EOG) and does not apply to EOG-based artifact handling. The scope and rationale for this protocol are provided below. First, EEG preprocessing steps can be found in various online tutorials and in virtually every EEG paper. However, there is still a high degree of variety and uncertainty in intermediate decision-making processes. Users may worry that major artifacts remain in their data even after strictly following all procedures. For instance, one of the main issues faced by new users is that the major ocular artifacts do not seem to be successfully isolated by the well-established method of Independent Component Analysis (ICA). This could simply be due to the existence of large-amplitude transient artifacts that violate the stationarity assumption of ICA.9 A simple solution, not well known among new users, is to select a stationary segment for ICA decomposition. Therefore, the current protocol features step-by-step quality checking.

This protocol does not claim theoretical superiority over other methods. Instead, it aims to provide a user-friendly set of procedures with accessible examination of each step’s effects while maintaining high reproducibility standards. The primary goal is to remove major artifacts that significantly impact late-stage neural signal analysis. While the protocol may have theoretical limitations, such as not guaranteeing the removal of all subtle, low-amplitude artifacts, it ensures that large, clearly removable artifacts do not substantially affect end results.

In summary, this protocol focuses on removing large artifacts that significantly contaminate results. It does not aim to identify and eliminate all artifact types based on their nature and specific characteristics, and the protocol is not based on EOG channels. The three types of large-amplitude artifacts to be removed are: 1) Large-amplitude random signals mainly due to electrode disconnection, 2) Ocular artifacts including eye blinks and movements, 3) Large-amplitude, non-specific transient artifacts such as muscle vibration and line noise. Ocular artifacts will be removed using ICA, and other large-amplitude, transient artifacts will be addressed using Principal Component Analysis (PCA) as a practical method for removing large-amplitude principal components.

Summary of the protocol framework

To provide some context, typical EEG fluctuations at the scalp level rarely exceed a range of ±100 μV.5 In the EEG dataset collected by our lab, the range of continuous clean EEG data filtered between 1 and 40 Hz usually falls between −30 and +30 μV. Furthermore, continuous scalp-level EEG signals are approximately stationary (characterized by a lack of abrupt jumps or systematic trends and by constant variance over time). Figure 1 presents examples of EEG segments with clearly identifiable artifacts, as well as an example of a typical clean EEG segment. These plots illustrate the goal of the protocol: to process EEG data with features similar to those in Figures 1A–1C and achieve a result resembling the clean segment shown in Figure 1D.

Figure 1. Examples of artifact patterns and typical clean data

Each plot shows a segment of 32-channel EEG data, with all channels overlaid.

(A) Typical eye-blink caused artifacts.

(B) Large-amplitude artifact without known source.

(C) Potential muscle artifact.

(D) A segment free of clearly identifiable artifacts.

The procedure involves three mandatory major steps with some optional minor steps interspersed. The major steps are mandatory because omitting them could leave strong artifacts in the data, posing a significant risk to the analysis of end results. The minor steps are optional either because they are the subject of ongoing discussion (e.g., whether to use average reference) or because they depend on users’ needs (e.g., whether to downsample). The three major steps are: 1) Basic bandpass filtering and bad channel interpolation, 2) ICA decomposition and ocular artifact removal, 3) Large-amplitude idiosyncratic artifact removal based on Principal Component Analysis (PCA). The rationale behind the arrangement of these three major steps is briefly explained below.

Since the current protocol operates in a no-EOG setting and relies on ICA as the key method for removing ocular artifacts, appropriate bandpass filtering is critical to ensure proper ICA decomposition. Several studies10,11,12 as well as the official EEGLAB tutorial (https://eeglab.org/tutorials/06_RejectArtifacts/RunICA.html) have pointed out that a relatively high cutoff for the high-pass filter (1–2 Hz) is crucial for obtaining good ICA decomposition, which is essential for isolating major ocular artifacts such as eye blinks and movements. However, filtering out activity below 1 Hz may remove potentially useful information. Therefore, the protocol also provides procedures to extract ICA weights from data filtered at a higher high-pass cutoff and then apply them to data filtered at a lower high-pass cutoff.
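The linear algebra behind reusing ICA weights on a differently filtered copy of the same recording can be illustrated with a minimal sketch in plain Matlab. The sketch does not run ICA. It assumes that an unmixing matrix is already available (here simply the inverse of a made-up mixing matrix A) and only shows how such weights are applied to data and how one component is removed by back-projection. All variable names and the synthetic signals are illustrative assumptions and are not taken from the protocol's code.

```matlab
% Synthetic illustration (not the protocol's implementation): apply an
% unmixing matrix obtained elsewhere to the data and remove one component.
n_chan = 4; n_samp = 1000;
t = (0:n_samp-1) / 250;                     % time axis, assuming 250 Hz sampling

% Hypothetical sources: row 1 plays the role of a blink-like source
S = [50 * exp(-((t - 2).^2) / 0.01); ...    % one large transient at t = 2 s
     sin(2*pi*10*t); ...
     sin(2*pi*6*t); ...
     cos(2*pi*3*t)];

% Made-up, invertible mixing matrix (sources -> channels)
A = [1.0 0.5 0.2 0.1; ...
     0.3 1.0 0.4 0.2; ...
     0.2 0.3 1.0 0.4; ...
     0.1 0.2 0.4 1.0];
X = A * S;                                  % simulated scalp data

% Stand-in for weights learned on the copy filtered at the higher cutoff
W    = A \ eye(n_chan);                     % unmixing matrix (channels -> sources)
Winv = A;                                   % its inverse; columns are scalp maps

% Apply the weights to the data to be analyzed and drop component 1
act     = W * X;                            % component activations
bad     = 1;                                % index of the component to remove
X_clean = X - Winv(:, bad) * act(bad, :);   % subtract its back-projection

% Reference: what the channels would contain without source 1
S_ref = S; S_ref(bad, :) = 0;
max_err = max(max(abs(X_clean - A * S_ref)));
fprintf('Largest deviation from the reference: %.2e\n', max_err);
```

In EEGLAB, the corresponding quantities are stored in the fields EEG.icaweights, EEG.icasphere, and EEG.icawinv of a dataset. How the protocol's own functions transfer them between the two filtered copies is not shown in this sketch.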

ICA requires data stationarity to achieve a proper decomposition result.9 As such, one might ask why the removal of large-amplitude transient artifacts is placed after ICA. Consider the scenario where the data contain a significant number of abrupt transient jumps (e.g., > 1000 μV, see Figure 1B): this can greatly compromise the quality of ICA decomposition or even lead to unsuccessful decomposition. One way to mitigate this issue is to ensure that the data segment fed into ICA is stationary. This segment can be either a short, stationary piece of the data to be analyzed (which must contain ocular artifacts) or a separate segment dedicated to capturing ocular artifacts (e.g., a short task asking participants to blink and move their eyes, as shown in the ocular movement task video from the Github page). This treatment ensures a proper ICA decomposition, which can then be applied to the entire data segment to be analyzed. The reason for placing the removal of large-amplitude idiosyncratic components after ICA is that algorithms designed to remove large-amplitude artifacts will inevitably treat ocular artifacts as such and attempt to remove them. However, to date, the most established algorithm for removing ocular artifacts without EOG is still infomax ICA.13 If the large-amplitude artifact removal algorithms do not remove ocular artifacts properly, subsequently applying ICA to the distorted data becomes risky. Taking all these factors into consideration, our protocol follows these major procedures: 1) bandpass filtering, 2) ICA-based ocular artifact removal based on a stationary segment with ocular artifacts, 3) removal of large-amplitude idiosyncratic artifacts that lack consistent statistical properties and are difficult to extract by ICA.

A final note regarding the philosophy of the current protocol is that it aims to correct the data rather than remove data segments. There are pros and cons to both approaches, similar to the choice in statistics between marking problematic samples as missing (and discarding them) and applying interpolation. One of the main advantages of the correction-based approach is the preservation of data structure and uniformity across samples (e.g., the same number of trials for all subjects, groups, or conditions), which is what the protocol aims to achieve. However, for subjects with excessive non-consistent artifacts, excluding the entire subject is recommended.

Before presenting the step-by-step instructions, we summarize all the steps involved in the current protocol including optional and mandatory ones in Figure 2 below.

Figure 2. Overview of the processing steps involved in the protocol

Institutional permissions

The EEG data were collected from experiments that had been approved by the Human Research Ethics Committee of The University of Hong Kong. The study was conducted in accordance with the Declaration of Helsinki.

Dataset

Two datasets were used for the demonstration. The first EEG dataset is from a task in which participants watched a sequence of cartoon faces exhibiting different emotional expressions and were instructed to press the space button upon seeing a happy face. All participants were college students between 18 and 35 years old, balanced in gender. The experiment was designed as a simple facial expression recognition task with three types of facial expressions: angry, happy, and sad. Each face was shown for 2 s if the participant did not respond, followed by an inter-stimulus interval (ISI) uniformly distributed between 2 and 10 s. If the participant responded by pressing the space button, the face immediately disappeared, and a feedback signal was shown for 500 ms before proceeding to the ISI. The feedback texts were as follows: 1) 'correct' for a correct button press, 2) 'missed' for a failure to press the button within 2 s of the stimulus onset, and 3) 'wrong' for an incorrect button press. There were 80 trials in total, with 12 happy faces randomly interspersed. The EEG data were collected at a 1000 Hz online sampling rate from 32 electrodes (Brain Products, actiCHamp) uniformly distributed on the scalp according to the 10–20 system. The online reference was set to electrode Fpz. The dataset contains 20 participants, each of whom performed the task twice.

The second dataset also contains 20 participants, each performing two tasks. It is used here to demonstrate the situation in which a separate session of pure ocular artifacts is available. The first task is an ocular movement task in which the participant was instructed by a short video to perform various ocular actions, including eye blinks and eye movements in different directions. The video can be downloaded from the Github page. The second task is a sensorimotor speed task in which participants were required to press the space button as quickly as possible in response to any visual stimulus, with the intertrial interval randomized between 0.5 and 2.5 s. This simple reaction task requires only basic sensory perception and motor response and is typically used to measure basic sensorimotor speed and the associated ERP waveforms. The two datasets provided here serve to demonstrate the procedures for preprocessing EEG with and without an ocular movement reference session. All data can be downloaded from the Github page.

Setup

Timing: 20 min

  • 1.

    Install Matlab.

Note: The Matlab version being tested in the current protocol was R2024a. If you encounter a Matlab error message reporting a missing Matlab toolbox, simply install that toolbox.

  • 2.
    Install EEGLAB.14
    • a.
      Download EEGLAB from https://eeglab.org/download/.
      Note: The EEGLAB version being tested in this protocol was v2022.0. At the time of testing this protocol, we found a technical bug in ICLabel extension in EEGLAB v2024.0, which is contained in ICLabel 1.5 (see https://Github.com/sccn/eeglab/issues/788 or https://eeglab.org/others/EEGLAB_revision_history.html). This bug has since been fixed in ICLabel 1.6 contained in EEGLAB v2024.1 or later. However, to ensure exact replication during the learning process, users are recommended to use EEGLAB v2022.0. Users can also try EEGLAB v2024.1 or later to see if the results can be replicated.
    • b.
      Extract the archive to a folder of your choice and add the directory path to your Matlab environment.
      Note: This suffices for launching EEGLAB by typing ‘eeglab’ in the Matlab command window and pressing ‘Enter’. More detailed instructions on EEGLAB installation can be found on the official EEGLAB website.
    • c.
      Install relevant data importing extensions.
      Note: For the current demonstration, you need to install the extension ‘bva-io’ to import raw data from Brain Vision Analyzer because we are using data from Brain Products. Install it through ‘File’ → ‘Manage EEGLAB extensions’. When applying the protocol to your own data with a different format, you can download a different EEGLAB data import extension that suits you.
  • 3.

    Add all the Matlab functions we additionally developed for this protocol to complete the interactive processes in this protocol later. Download all m-script files within the ‘Protocol’ folder from the hosting Github page and simply put all files into the Current Folder of Matlab or add the folder containing them to the Matlab search path (Matlab -> Set Path).

Key resources table

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited data
Raw EEG data, code, and video | This protocol | https://github.com/guangouyang/EEG_preprocessing_protocol
Software and algorithms
MATLAB R2024a | The MathWorks, Inc. | RRID:SCR_001622; https://ww2.mathworks.cn/en/products/matlab.html
EEGLAB | Delorme and Makeig, 2004 | RRID:SCR_007292; https://eeglab.org/download

Step-by-step method details

Part 1: Bandpass filtering and bad channel interpolation

Timing: 2 min

The major step 1 is a highly standardized procedure in EEG preprocessing. It bandpass filters the EEG data and interpolates bad channel(s).

  • 1.
    Load the raw data into EEGLAB.
    • a.
      Launch Matlab, execute ‘eeglab’ in the command window.
    • b.
      Within EEGLAB GUI, click File -> Import data -> Using EEGLAB functions and plug-ins -> From Brain Vis. Rec. .vhdr or .ahdr file, then select the ‘face_001_1.vhdr’ file.
      Note: This sample data file contains 32 channels (electrodes) with labels but does not have channel location information. Since they follow international standards, we can add channel location information within EEGLAB through Edit -> Channel locations and click ‘OK’ throughout. If you are using your own data that already contains electrode information, you can skip this step.
    • c.
      (Optional) Downsample the data (e.g., to 250 Hz) to speed up subsequent computation through Tools -> Change Sampling rate, then overwrite the current set.
  • 2.
    Bandpass filtering.
    • a.
      Filter the data within a typical band 1–40 Hz through Tools -> Filter the data -> Basic FIR filter, then choose the high and low cutoff values and click OK to overwrite the current set.

Note: The choice of boundaries for bandpass filtering EEG data is an open question with ongoing discussion in the field. What is clear is that a high-pass cutoff at around 1–2 Hz is necessary for a good ICA decomposition.10,11,12 The common choice of around 40 Hz as the upper cutoff reflects two considerations: AC line noise (50 Hz or 60 Hz, depending on the region) must be removed, and high-frequency bands are usually not relevant for ERP analysis. However, the band used here will inevitably eliminate neural signals below 1 Hz and above 40 Hz, which may be important for some researchers. For how to use one filter setting that is ideal for ICA decomposition while preserving information with another filter setting, see the content below.

Parameters of the filter: There are multiple filter options available in EEGLAB. Here we used the basic FIR (Finite Impulse Response) filter (zero-phase, non-causal, filter order: 3.3 s, transition bandwidth: 1 Hz, passband edge(s): [1 40] Hz, cutoff frequencies (−6 dB): [0.5 40.5] Hz).

  • 3.
    Identify and interpolate bad channels.
    • a.
      Scroll the data (Plot -> Channel data (scroll)) to check the overall quality for all channels. If you see bad channels (abnormal fluctuations like Figure 3, or totally flat line), interpolate it (them) through Tools -> Interpolate electrodes -> Select data from channels -> choose the channel(s) you want to interpolate and click OK throughout and overwrite the current set.

Note: Bad channels are channels that are clearly not connected to the scalp. They pose a significant risk to the subsequent analysis because disconnected channels usually display fluctuations with very large variance.

Note: You can also implement these steps using the following Matlab script. The script can be downloaded from the Github page hosting this protocol.

clear;
eeglab nogui;

%change below to your own data path, e.g., 'D:\dataset1_face'
data_path = 'your_data_path';

%change below to your EEGLAB path
eeglab_elec = 'your_EEGLAB_path\eeglab2022.0\plugins\dipfit\standard_BEM\elec\standard_1005.elc';

data_name = 'face_001_1.vhdr';

%load data, down_sample, and filter
EEG = pop_loadbv(data_path, data_name, [], []);
EEG = pop_chanedit(EEG, 'lookup',eeglab_elec);
EEG = pop_resample(EEG, 250);
EEG = pop_eegfiltnew(EEG,'hicutoff',40,'locutoff',1,'plotfreqz',0);

%check problematic channels
figure('WindowState', 'maximized');
plot(EEG.data' - ones(size(EEG.data,2),1)*[1:size(EEG.data,1)]*500);
for jj = 1:size(EEG.data,1)
    text(size(EEG.data,2),-500*jj,EEG.chanlocs(jj).labels);
end
axis tight;axis off;

%need to interpolate channel? If yes, run below
EEG=pop_interp(EEG);

Note: The second section of the script (‘check problematic channels’) plots an overview of the EEG data for the entire task session to facilitate identification of problematic channels. As shown in Figure 4, the problematic channel FT9 is easy to identify. Note that the overview in Figure 4 cannot reveal all problematic channels if one of them has a variance so large that it visually masks the others. In this case, re-check the overview after interpolating the first problematic channel (the one with enormous variance). In some cases, a channel may start to display issues only partway through the task; interpolation is also recommended in this case, which is another reason why an overview of the entire recording is helpful. After executing the line ‘EEG = pop_interp(EEG);’ and selecting the channel to interpolate (here FT9), you can rerun the script section under ‘check problematic channels’ to regenerate Figure 4, which will now show a clean version (FT9 corrected).
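As a complement to the visual overview, the following minimal sketch shows one way to pre-screen channels numerically before looking at the plot. It is not part of the protocol: the robust z-score on per-channel standard deviation, the cutoff of 5, and the flat-line tolerance are all assumptions that would need tuning, and the matrix 'data' is synthetic and merely stands in for EEG.data (channels x samples). The visual check remains the reference.

```matlab
% Hypothetical pre-screening helper (an assumption, not the protocol's method):
% flag channels whose spread is far from the typical spread across channels.
n_chan = 8; n_samp = 5000;
t = (0:n_samp-1) / 250;                          % assuming 250 Hz sampling
data = zeros(n_chan, n_samp);
for c = 1:n_chan
    data(c, :) = (8 + c) * sin(2*pi*(5 + c)*t + c);   % ordinary channels
end
data(3, :) = 800 * sin(2*pi*0.7*t);              % simulated disconnected channel
data(6, :) = 0;                                  % simulated flat channel

chan_sd  = std(data, 0, 2);                      % per-channel standard deviation
med_sd   = median(chan_sd);
mad_sd   = median(abs(chan_sd - med_sd));        % median absolute deviation
z_robust = (chan_sd - med_sd) / (1.4826 * mad_sd + eps);

z_cutoff = 5;                                    % assumed cutoff, tune per dataset
flat_tol = 1e-6;                                 % assumed tolerance for flat lines
noisy = find(z_robust > z_cutoff)';              % abnormally large fluctuations
flat  = find(chan_sd < flat_tol)';               % (nearly) flat channels

fprintf('Candidate noisy channels: %s, flat channels: %s\n', ...
        mat2str(noisy), mat2str(flat));
```

Because the median and the median absolute deviation are used instead of the mean and standard deviation, a single channel with enormous variance does not hide other candidates, which mirrors the masking issue described in the note above.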

Figure 3. An example of identified bad channel

Here, FT9 is problematic.

Figure 4. Overview of the all-channel data to identify problematic channel

Here, FT9 is problematic. After executing ‘EEG = pop_interp(EEG);’ and selecting the channel to interpolate (here FT9), you can re-plot this figure and you will see that the problematic FT9 has been interpolated.

In Matlab, a script can be run in parts. To run part of a script, select that part, right-click, and choose ‘Evaluate Selection in Command Window’.
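The filter parameters quoted in Part 1 (a length of about 3.3 s and −6 dB cutoffs of [0.5 40.5] Hz) can be cross-checked with a short calculation. The sketch below assumes a Hamming-windowed sinc FIR filter, for which a commonly used rule sets the order to 3.3 divided by the normalized transition bandwidth, rounded up to an even number, and assumes a sampling rate of 250 Hz after downsampling. Both assumptions are ours and are not stated explicitly in the text.

```matlab
% Worked arithmetic for a Hamming-windowed sinc FIR band-pass filter
% (assumed design rule; sampling rate assumed to be 250 Hz after resampling).
fs        = 250;        % sampling rate (Hz)
pass_edge = [1 40];     % passband edges (Hz)
df        = 1;          % transition bandwidth (Hz)

% Order from the normalized transition width, rounded up to the next even value
filt_order = ceil((3.3 / (df / fs)) / 2) * 2;
filt_dur   = filt_order / fs;                  % filter length in seconds

% -6 dB cutoffs lie half a transition band outside each passband edge
cutoff = pass_edge + [-df df] / 2;

fprintf('order = %d samples (%.2f s), cutoffs = [%.1f %.1f] Hz\n', ...
        filt_order, filt_dur, cutoff(1), cutoff(2));
% prints: order = 826 samples (3.30 s), cutoffs = [0.5 40.5] Hz
```

At a different sampling rate the order in samples changes proportionally, while the length in seconds and the cutoffs stay the same for the same transition bandwidth.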

Part 2: ICA-based ocular artifact correction

Timing: 3 min

This major step serves to remove ocular artifacts based on ICA and ICLabel algorithms.14,15

Note: Briefly, ICA is a type of blind source separation method16 that aims to recover the patterns of different source signals from data recorded by multiple sensors, each of which captures a mixture of the source projections. In the context of EEG, the data from different electrodes represent a mixture of signals projected by various sources, including neural and non-neural ones (e.g., artifacts). The main purpose of applying ICA in EEG preprocessing is to isolate and remove the non-neural sources, as they are likely to confound data analysis and mislead the interpretation of neural effects. There is more than one algorithm for implementing ICA, and EEGLAB provides various options for running it. The one we applied here is the most widely used: the extended infomax algorithm.13 This algorithm assumes that different sources are statistically independent of each other and aims to find an unmixing matrix that recovers a set of sources with minimal mutual information.

  • 4.

    After the major step 1, run the script ‘EEG = ICA_correction(EEG);’ and you will see the interface as shown in Figure 5.

Note: For this step, we used the sample data ‘face_010_2’ for better illustration (see the full script including the major step 1 below). Throughout the protocol, we use different datasets to demonstrate the removal of different artifacts because it is hard to find a single dataset that contains all types of artifacts.

  • 5.
    Select a stationary time window for ICA decomposition.
    • a.
      Click ‘Select Time Window’ to select a time window to feed into ICA decomposition module.
      Note: You can select a relatively stationary segment (e.g., 2 mins or above) with clearly presented eye-blink artifact (see Figure 5, green window).
    • b.
      Click ‘Correct’.
      Note: ICA will start to decompose the selected segment (which takes a few seconds) before a scalp map window appears (Figure 6). The scalp map shows the probability of each IC being assigned to each source category by ICLabel (e.g., Eye, Brain). In addition, the cleaned EEG signals, obtained by removing ICs classified as ‘Eye’ with a probability larger than 0.8, are shown in the lower panel of the ICA_correction interface (Figure 5). To ensure replicability of the results, the selected time window for ICA decomposition is automatically recorded in EEG.ica_win.
    • c.
      Check the quality of ICA results.
      Note: At this point, check two things: 1) Ensure there is at least one IC resembling eye blinks (e.g., IC 1, 2 in Figure 6) and at least one IC resembling eye movements (e.g., IC 3, 4 in Figure 6), 2) Verify that the ocular artifacts have been largely removed (see Figure 5, lower panel). Occasionally, the ICA decomposition and the ICLabel classification may fail to identify ocular components. This is uncommon but occurs in about 1 out of 10 cases in the current dataset, leaving significant artifacts in the data shown in the lower panel of Figure 5. Therefore, always check the lower panel in Figure 5. This issue may be due to non-stationarity in the data. If it occurs, click ‘Select Time Window’ again, choose another time window from the top panel of Figure 5 (e.g., another 2-min segment or a longer segment), and rerun the procedure until the two criteria above are fulfilled. In most cases, you will achieve results that meet these criteria. In exceptionally rare cases, you may only identify eye-blink IC(s) (like IC 1 or 2 in Figure 6) but no eye-movement IC (like IC 3 or 4 in Figure 6). If the ocular artifact has nevertheless been largely corrected, as shown in the lower panel of Figure 5, you may still accept the results.
      The stop criterion in ICA has been changed to 0.001 from the original 1 × 10⁻⁶, as we found the former sufficient for a clean identification of ocular artifacts. Users can change the stop criterion back to 1 × 10⁻⁶ by editing the code in ‘ICA_correction.m’.
  • 6.

    Click ‘Save Data’.

CRITICAL: Clicking ‘Save Data’ is essential because it updates the EEG data with the ICA-cleaned data. Otherwise, all the ocular artifacts will remain during the subsequent PCA correction procedure. After the ICA correction, the ocular artifacts should have been largely corrected. Large-amplitude idiosyncratic components may still remain; these will be corrected in the next major step.
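The selection rule mentioned in this part (removing ICs labeled ‘Eye’ with a probability larger than 0.8) amounts to thresholding one column of the ICLabel probability matrix. The sketch below uses a made-up probability matrix for four ICs to show the logic only. It is not the code inside ‘ICA_correction.m’, and the variable names are assumptions. The column order follows the ICLabel class list (Brain, Muscle, Eye, Heart, Line Noise, Channel Noise, Other).

```matlab
% Mock ICLabel output: rows are ICs, columns are the seven ICLabel classes in
% the order Brain, Muscle, Eye, Heart, Line Noise, Channel Noise, Other.
% With EEGLAB, the real matrix is stored in
% EEG.etc.ic_classification.ICLabel.classifications after running ICLabel.
probs = [0.02 0.01 0.95 0.00 0.00 0.01 0.01; ...   % IC 1: blink-like
         0.05 0.02 0.88 0.01 0.00 0.02 0.02; ...   % IC 2: blink-like
         0.10 0.03 0.70 0.02 0.01 0.04 0.10; ...   % IC 3: ambiguous
         0.90 0.02 0.01 0.01 0.01 0.02 0.03];      % IC 4: brain-like

eye_col = 3;      % column of the 'Eye' class
eye_thr = 0.8;    % probability threshold quoted in the protocol

eye_ics = find(probs(:, eye_col) > eye_thr)';      % ICs to be removed
fprintf('ICs flagged as ocular: %s\n', mat2str(eye_ics));
% prints: ICs flagged as ocular: [1 2]
```

In this made-up example, IC 3 has an ‘Eye’ probability of 0.70 and would be kept, which illustrates why the lower panel of the interface should always be inspected after correction. In EEGLAB, flagged components can be removed with pop_subcomp.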

Note: A brief tutorial for understanding and assessing the topographies of ICA sources is provided here. Figure 6 shows the pattern of independent sources projecting to the scalp. Roughly, they can be divided into three major categories: 1) Ocular artifacts, 2) Brain activity sources, and 3) Others. Ocular artifacts are usually divided into two types: eye blinks and eye movements. Eye-blink artifacts typically exhibit the pattern shown in IC 1 and 2, with a clear and focal distribution on the forehead. Eye-movement artifacts are predominantly contributed by horizontal eye movements, with the pattern shown in IC 3. Note that the topography can appear with reversed polarity because polarity is arbitrary in ICA topography results. Vertical eye movements have a pattern similar to eye blinks and can thus be grouped with them. These patterns are very typical, so an experienced EEG researcher can easily identify them. A more detailed description of ocular artifact patterns and their underlying mechanisms can be found elsewhere.17 The second major category is brain activity sources. Their topography is also easy to identify: they typically show a structured pattern with a continuous and relatively widespread distribution centered on a specific brain region (e.g., frontal, central, parietal, temporal, occipital, or a mixture of them). It has been claimed that they are dipole-like.18 The third category comprises sources that are more difficult to identify: they typically show unstructured topographies with sporadic focal points and usually appear late in the IC list. Typically, a good ICA decomposition produces a few clearly identified ocular artifacts, dozens of clearly structured brain activity sources, and unidentified sources in the rest. If most of the IC sources are unstructured and hard to identify, the most likely cause is low quality of the collected EEG data.

Note: We provided the following sample dataset (face_010_2) and script for users to try the ICA correction procedure. The sample dataset does not have problematic electrodes, so the overall visual checking and channel interpolation can be skipped.

clear;
eeglab nogui;

%change below to your own data path, e.g., 'D:\dataset1_face'
data_path = 'your_data_path';

%change below to your EEGLAB path
eeglab_elec = 'your_EEGLAB_path\eeglab2022.0\plugins\dipfit\standard_BEM\elec\standard_1005.elc';

data_name = 'face_010_2.vhdr';

%load data, down_sample, and filter
EEG = pop_loadbv(data_path, data_name, [], []);
EEG = pop_chanedit(EEG, 'lookup',eeglab_elec);
EEG = pop_resample(EEG, 250);
EEG = pop_eegfiltnew(EEG,'hicutoff',40,'locutoff',1,'plotfreqz',0);

%we skip the visual checking and electrode interpolation for this data as
%no electrode is problematic; you can find the code from script_1

%remove ocular
EEG = ICA_correction(EEG);

%this data does not require PCA procedure, you can directly save it:
pop_saveset(EEG,'filepath','your_saving_path','filename','face_010_2.set');

Figure 5. The ICA correction interface

Figure 6. Classification results from ICLabel

Part 3: PCA-based large-amplitude artifact correction

Timing: 0–5 min (depending on the amount of artifact)

This major step follows the ICA_correction step and serves to remove transient large-amplitude artifacts that do not have statistically stable features for ICA to isolate.

Note: The main assumption here is that those large-amplitude, transient segments are caused by singular, large-amplitude, idiosyncratic sources. Because of the idiosyncrasy of the large-amplitude source, it is assumed to be uncorrelated with other sources and thus can be isolated by PCA. PCA is a signal processing method used to decompose high-dimensional data into separate components with the first few components capturing the major variance in the multi-dimensional data. It does this by transforming the data into a new set of variables, called principal components, which are orthogonal (uncorrelated) and ordered by the amount of original variance they capture.
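The two properties of PCA mentioned above (components ordered by captured variance and mutually uncorrelated) can be verified on a small synthetic segment. The sketch below computes PCA through the singular value decomposition in plain Matlab. The three-channel data, the artifact shape, and its scalp weights are made up for illustration, and the sketch is not the code used in ‘PCA_correction.m’.

```matlab
% PCA of a multichannel segment via the singular value decomposition
% (synthetic data; all signals and weights are illustrative assumptions).
n_samp = 500;
t = (0:n_samp-1) / 250;                          % assuming 250 Hz sampling
bg   = [sin(2*pi*10*t); cos(2*pi*7*t); sin(2*pi*4*t + 1)];   % background
art  = 200 * exp(-((t - 1).^2) / 0.005);         % one large transient
topo = [1; 0.6; -0.4];                           % made-up artifact scalp weights
X = bg + topo * art;                             % channels x samples

Xc = X - mean(X, 2) * ones(1, n_samp);           % remove each channel's mean
[U, S, V] = svd(Xc, 'econ');                     % Xc = U * S * V'
scores  = S * V';                                % component time courses
var_exp = diag(S).^2 / sum(diag(S).^2);          % share of variance per component

% Components are uncorrelated: off-diagonal covariances are numerically zero
C = scores * scores' / (n_samp - 1);
off_diag = max(max(abs(C - diag(diag(C)))));

fprintf('Variance explained per component: %s\n', mat2str(var_exp', 3));
fprintf('Largest off-diagonal covariance: %.2e\n', off_diag);
```

In this example the first component captures almost all of the variance because the transient is far larger than the background, and the first column of U approximates the artifact's scalp pattern up to sign and scale.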

  • 7.

    Run the script ‘EEG = PCA_correction(EEG);’ and you will see the interface as in Figure 7.

Note: The previous example ‘face_010_2’ does not have much transient large-amplitude artifact to remove. We will use the data ‘face_014_2’ for the illustration. You can first apply the previous steps to this dataset yourself.

  • 8.

    Use the Matlab zoom-in tool (hover over the top-right corner and it will appear) to zoom into a segment that contains large-amplitude abnormal artifacts. Click on ‘Select Time Window’ to select time windows that precisely cover the artifacts. An example is shown in Figure 8.

Note: In practice, users may set a cutoff to label segments that contain large-amplitude artifacts as problematic. The ideal cutoff in the current dataset is ± 100 μV, which is a common threshold used to discard EEG segments in the EEG research community.5 However, we have seen participants with Alpha oscillations (around 10 cycles per second) that exceed ± 100 μV. Be careful not to mistake these Alpha oscillations for artifacts. You can always zoom in and count the cycles per second to confirm whether they are Alpha oscillations. An example of strong Alpha oscillation is provided at the end of the article. In some cases, even if the clusters of large-amplitude artifacts do not exceed ± 100 μV (which is just an empirical setting), they may still appear clearly transient and abnormal compared to the ongoing pattern. A tiny muscle twitch, for example, could generate such artifacts. This is one of the limitations discussed below.

  • 9.

    After selecting the time window(s), click ‘Correct’.

Note: PCA will remove the large-amplitude artifacts until the global field power of the back-projected data does not exceed the median + 2×(median absolute deviation) of the remaining data segments. You can select and correct one segment at a time, or select multiple segments and correct them simultaneously. The latter is recommended, because selecting all artifact segments at once ensures that the remaining data used for comparison are cleaner. To ensure the replicability of the results, all selected time windows are recorded in EEG.pca_wins.
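Note: The stopping rule can be sketched as follows. This is one plausible reading of the description above, not the code inside PCA_correction. Three assumptions are made here: global field power (GFP) is taken as the standard deviation across channels at each sample, the selected window is summarized by its mean GFP, and PCA is computed with svd on the window without re-centering (the data are assumed to be high-pass filtered already). All data and variable names are simulated or hypothetical.

```matlab
% Toy data: 4 channels x 2000 samples with one injected transient artifact
srate = 250;
t = (0:1999) / srate;
data = [5*sin(2*pi*10*t); 5*cos(2*pi*10*t); 4*sin(2*pi*6*t); 4*cos(2*pi*6*t)];
win = 800:900;                                       % hypothetical user-selected window (samples)
bump = 300 * sin(pi * (0:numel(win)-1) / (numel(win)-1));
data(:, win) = data(:, win) + [1; 0.3; 0.1; 0.05] * bump;

gfp = @(x) std(x, 0, 1);                             % assumed GFP: SD across channels per sample

% Threshold from the data outside the selected window: median + 2 x MAD
rest = true(1, size(data, 2));
rest(win) = false;
g_rest = gfp(data(:, rest));
med = median(g_rest);
thr = med + 2 * median(abs(g_rest - med));

% Remove leading principal components until the window's mean GFP is within the threshold
seg = data(:, win);
[U, S, V] = svd(seg, 'econ');
gfp_before = mean(gfp(seg));
cleaned = seg;
k = 0;
while mean(gfp(cleaned)) > thr && k < size(seg, 1)
    k = k + 1;
    cleaned = seg - U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';
end
gfp_after = mean(gfp(cleaned));

fprintf('removed %d component(s); mean GFP %.1f -> %.1f (threshold %.1f)\n', ...
        k, gfp_before, gfp_after, thr);
```

Because the threshold is computed from the data outside the selected windows, leaving an artifact unselected inflates the threshold, which is why selecting all artifact segments at once is recommended.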

  • 10.

    Click ‘Save Data’, close the PCA_correction window, and run the following script to save the EEG dataset to local disk: “pop_saveset(EEG,'filepath','your_path','filename','name.set');”.

Note: Now, the cleaned version has been saved, and you can do all the subsequent analyses on them. Before saving the final clean data, you may want to re-reference your data to the common average. To run this, simply execute “EEG = pop_reref(EEG,[]);”. You can also run this step after loading the data set in the future, but make sure it is done before calculating subsequent neural metrics (e.g., event-related potentials).
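Note: Conceptually, re-referencing to the common average subtracts the mean across channels from every channel at each time point. The toy example below (made-up numbers, plain matrix arithmetic rather than pop_reref itself) shows the operation.

```matlab
% Toy data: 3 channels x 4 samples, values are made up
data = [10 12  9 11;
         4  6  3  5;
        -2  0 -3 -1];

% Average reference: subtract the across-channel mean at every sample
avg_ref = data - mean(data, 1);

disp(sum(avg_ref, 1));   % each column now sums to (numerically) zero
```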

Note: We provided the following sample dataset (face_014_2) and script for users to try the PCA correction procedure. This sample dataset does not have problematic electrodes, so the overall visual checking and channel interpolation can be skipped.

clear;

eeglab nogui;

%change below to your own data path, e.g., 'D:\dataset1_face'

data_path = 'your_data_path';

%change below to your EEGLAB path

eeglab_elec = 'your_EEGLAB_path\eeglab2022.0\plugins\dipfit\standard_BEM\elec\standard_1005.elc';

data_name = 'face_014_2.vhdr';

%load data, down_sample, and filter

EEG = pop_loadbv(data_path, data_name, [], []);

EEG = pop_chanedit(EEG, 'lookup',eeglab_elec);

EEG = pop_resample(EEG, 250);

EEG = pop_eegfiltnew(EEG,'hicutoff',40,'locutoff',1,'plotfreqz',0);

%we skip the visual checking and electrode interpolation for this data as

%no electrode is problematic; you can find the code from script_1

%remove ocular

EEG = ICA_correction(EEG);%based on the time window from around 50 s to 200 s (remember to click 'Save Data')

%remove large-amplitude artifacts

EEG = PCA_correction(EEG);

pop_saveset(EEG,'filepath','your_saving_path','filename','face_014_2.set');

Note: Below we also provide Matlab code for the situation in which a calibration session of ocular movement is available. A common practice in EEG preprocessing is to deliberately collect a short segment of EEG data (e.g., a few minutes) that contains only eye blinks and eye movements, as instructed by the experimenter or task. This is referred to as calibration data.19 The calibration segment is used to train the ICA decomposition, which is then applied to the task data to remove ocular artifacts. One theoretical advantage of this approach is that the ICA-decomposed artifact components will contain minimal neural effects of interest in the task, because the ICA decomposition was trained exclusively on a dataset separate from the task data. For this situation, the major step 1 as described above needs a slight modification: instead of feeding a segment of task data to ICA, we use the ocular calibration data to obtain the ICA decomposition results. To keep all operations as consistent as possible with the previous procedure, we simply concatenate the ocular calibration/template data with the task data and then select a window from the ocular calibration data for ICA correction. All subsequent operations then remain essentially the same.

Figure 7.

PCA correction interface

Figure 8.

Selecting time window for PCA correction

Multiple time windows can be selected and corrected simultaneously. Top: before PCA correction; bottom: after PCA correction.

For this demonstration, we will use a dataset from a single participant with two separate sessions: one ocular movement session (EEG1), where the participant was instructed to perform several eye blinks and eye movements in different directions (see the ocular_movement_task.mp4 video on the GitHub page), and one task session (EEG2), where the participant pressed the space button as quickly as possible in response to any visual stimulus, with the intertrial interval randomized between 0.5 and 2.5 s. The two EEG sets can be concatenated using the following script.

clear;

eeglab nogui;

%change below to your own data path, e.g., 'D:\dataset2_reactiontime'

data_path = 'your_data_path';

%change below to your EEGLAB path

eeglab_elec = 'your_EEGLAB_path\eeglab2022.0\plugins\dipfit\standard_BEM\elec\standard_1005.elc';

data_name1 = 'arti_029.vhdr';

data_name2 = 'reaction time_029.vhdr';

%ocular template

EEG1 = pop_loadbv(data_path, data_name1, [], []);

EEG1 = pop_chanedit(EEG1, 'lookup',eeglab_elec);

EEG1 = pop_resample(EEG1, 250);

EEG1 = pop_eegfiltnew(EEG1,'hicutoff',40,'locutoff',1,'plotfreqz',0);

%experiment data to clean

EEG2 = pop_loadbv(data_path, data_name2, [], []);

EEG2 = pop_chanedit(EEG2, 'lookup',eeglab_elec);

EEG2 = pop_resample(EEG2, 250);

EEG2 = pop_eegfiltnew(EEG2,'hicutoff',40,'locutoff',1,'plotfreqz',0);

%merge

EEG = pop_mergeset(EEG1,EEG2); EEG.tplpoint = max(EEG1.times);

%channel checking and interpolation can be skipped as there is no

%problematic electrodes in this data

figure('WindowState', 'maximized');

plot(EEG.data' - ones(size(EEG.data,2),1)*[1:size(EEG.data,1)]*500);

for jj = 1:size(EEG.data,1), text(size(EEG.data,2),-500*jj,EEG.chanlocs(jj).labels); end

axis tight;axis off;

%need to interpolate channel?

EEG=pop_interp(EEG);

%remove ocular

EEG = ICA_correction(EEG);

%remove large-amplitude artifacts

EEG = PCA_correction(EEG);

pop_saveset(EEG,'filepath','your_saving_path','filename','data_name.set');

Note: After concatenating the two EEG segments, the ICA correction part will display a red vertical line (Figure 9) that separates the ocular template session and the task session. At this point, the user can simply select a segment from the ocular template session and run all the subsequent procedures in a way that is identical to the procedures above. Figure 9 shows the final outcome after ICA correction. Note that there are still spikes (e.g., t = 180 s and t = 220 s) that may need a further PCA correction.

Note: Below we also provide Matlab code for the situation in which the bandpass filter setting used for ICA decomposition differs from the frequency band to be retained in the final data. In the previous section we mentioned that the bandpass filter, particularly the high-pass cutoff, significantly impacts ICA decomposition performance.10,11,12 A high-pass filter with a cutoff of 1–2 Hz has been recommended in the literature10,11,12 as well as in the official EEGLAB tutorial (https://eeglab.org/tutorials/06_RejectArtifacts/RunICA.html). However, such a filter would remove neural activity that researchers might want to preserve (e.g., activity below 1 Hz). When an ocular artifact session is available, this issue can be easily addressed by using different bandpass filter settings for the ocular artifact data and the data of interest (e.g., task data). Relative to our previous example, we only need to adjust the bandpass filter settings for the second segment to be merged. Specifically, for the task data we use a bandpass filter setting of 0.5–80 Hz, with a notch filter to remove 50 Hz AC noise. For the ocular artifact calibration data, we use a bandpass filter setting of 1–40 Hz to improve ICA decomposition and achieve better extraction of ocular ICs.

clear;

eeglab nogui;

%change below to your own data path, e.g., 'D:\dataset2_reactiontime'

data_path = 'D:\Dropbox\work\projects\protocol\vhdr\dataset2_reactiontime';

%change below to your EEGLAB path

eeglab_elec = 'D:\Dropbox\work\code\eeglab2022.0\plugins\dipfit\standard_BEM\elec\standard_1005.elc';

data_name1 = 'arti_029.vhdr';

data_name2 = 'reaction time_029.vhdr';

%ocular template

EEG1 = pop_loadbv(data_path, data_name1, [], []);

EEG1 = pop_chanedit(EEG1, 'lookup',eeglab_elec);

EEG1 = pop_resample(EEG1, 250);

EEG1 = pop_eegfiltnew(EEG1,'hicutoff',40,'locutoff',1,'plotfreqz',0);

%experiment data to clean

EEG2 = pop_loadbv(data_path, data_name2, [], []);

EEG2 = pop_chanedit(EEG2, 'lookup',eeglab_elec);

EEG2 = pop_resample(EEG2, 250);

EEG2 = pop_eegfiltnew(EEG2,'hicutoff',80,'locutoff',0.5,'plotfreqz',0);

EEG2 = pop_eegfiltnew(EEG2, 'locutoff',49,'hicutoff',51,'revfilt',1,'plotfreqz',0);

%merge

EEG = pop_mergeset(EEG1,EEG2); EEG.tplpoint = max(EEG1.times);

%channel checking and interpolation can be skipped as there is no

%problematic electrodes in this data

figure('WindowState', 'maximized');

plot(EEG.data' - ones(size(EEG.data,2),1)*[1:size(EEG.data,1)]*500);

for jj = 1:size(EEG.data,1), text(size(EEG.data,2),-500*jj,EEG.chanlocs(jj).labels); end

axis tight;axis off;

%need to interpolate channel?

EEG=pop_interp(EEG);

%remove ocular

EEG = ICA_correction(EEG);

%remove large-amplitude artifacts

EEG = PCA_correction(EEG);

pop_saveset(EEG,'filepath','your_saving_path','filename','data_name.set');

Note: If no ocular calibration data segment is available and researchers still want to use different filter settings for ICA decomposition and for the retained data, they can use the following procedure. Here, we use the sensorimotor speed task data (reaction time_029.vhdr) to demonstrate, assuming that the ocular artifact session does not exist. First, we load the same recording twice, apply a different bandpass filter setting to each copy, and merge the two copies:

clear;

eeglab nogui;

%change below to your own data path, e.g., 'D:\dataset2_reactiontime'

data_path = 'your_data_path';

%change below to your EEGLAB path

eeglab_elec = 'your_EEGLAB_path\eeglab2022.0\plugins\dipfit\standard_BEM\elec\standard_1005.elc';

data_name1 = 'reaction time_029.vhdr';

data_name2 = 'reaction time_029.vhdr';

%copy filtered at 1-40 Hz for ICA decomposition (takes the place of the ocular template)

EEG1 = pop_loadbv(data_path, data_name1, [], []);

EEG1 = pop_chanedit(EEG1, 'lookup',eeglab_elec);

EEG1 = pop_resample(EEG1, 250);

EEG1 = pop_eegfiltnew(EEG1,'hicutoff',40,'locutoff',1,'plotfreqz',0);

%copy filtered at 0.5-80 Hz with a 50 Hz notch (the data to clean and keep)

EEG2 = pop_loadbv(data_path, data_name2, [], []);

EEG2 = pop_chanedit(EEG2, 'lookup',eeglab_elec);

EEG2 = pop_resample(EEG2, 250);

EEG2 = pop_eegfiltnew(EEG2,'hicutoff',80,'locutoff',0.5,'plotfreqz',0);

EEG2 = pop_eegfiltnew(EEG2, 'locutoff',49,'hicutoff',51,'revfilt',1,'plotfreqz',0);

%merge

EEG = pop_mergeset(EEG1,EEG2); EEG.tplpoint = max(EEG1.times);

%channel checking and interpolation can be skipped as there is no

%problematic electrodes in this data

figure('WindowState', 'maximized');

plot(EEG.data' - ones(size(EEG.data,2),1)*[1:size(EEG.data,1)]*500);

for jj = 1:size(EEG.data,1), text(size(EEG.data,2),-500*jj,EEG.chanlocs(jj).labels); end

axis tight;axis off;

%need to interpolate channel?

EEG=pop_interp(EEG);

%remove ocular

EEG = ICA_correction(EEG);

%remove large-amplitude artifacts

EEG = PCA_correction(EEG);

%cut out the segment for training ICA?

% EEG = pop_select(EEG,'point',[fix(EEG.tplpoint*EEG.srate/1000)+1,size(EEG.data,2)]);

pop_saveset(EEG,'filepath','your_saving_path','filename','data_name.set');

Note: After doing this, select a window in the session before the red vertical line for ICA decomposition. After you click ‘Correct’, the ICA results will be extracted from your selected window (which comes from the copy with a filter setting suitable for ICA) and the correction will be applied to the entire merged data. The second half of the data will thus have its ocular artifacts removed while preserving a different frequency band. After the ICA correction and a subsequent PCA correction (if needed), you can run the following line to cut out the first half and retain the second half before saving the data to the local computer using the data saving line above. In this situation, cutting out the first half is essential when subsequent procedures use the time markers to derive average ERPs; otherwise, the trials will be duplicated.

EEG = pop_select(EEG,'point',[fix(EEG.tplpoint*EEG.srate/1000)+1,size(EEG.data,2)]);
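As a rough check of the index arithmetic in this line, the sketch below assumes that latencies are stored in milliseconds starting at 0 (as in EEG.times for continuous data) and uses a hypothetical first copy of 1,000 samples at 250 Hz. The variable names n1, times_ms, and first_point are introduced only for this illustration.

```matlab
srate = 250;                                  % Hz, as set by pop_resample in the scripts above
n1 = 1000;                                    % hypothetical number of samples in the first copy
times_ms = (0:n1-1) * (1000 / srate);         % assumed layout of EEG1.times (ms, starting at 0)

tplpoint = max(times_ms);                     % mirrors EEG.tplpoint = max(EEG1.times)
first_point = fix(tplpoint * srate / 1000) + 1;

fprintf('tplpoint = %g ms, first retained sample = %d of %d\n', tplpoint, first_point, n1);
```

Under these assumptions the retained range starts at sample 1000, which is the final sample of the first copy. Users who need the cut to fall exactly on the boundary may want to verify the index on their own data.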

Figure 9.

ICA correction based on ocular calibration data

The calibration data is the segment before the red vertical line.

Expected outcomes

The primary goal of the three major steps is to remove artifacts that contribute significant variance to the data and contaminate the end results, such as ERPs. The protocol also includes intermediate visual checks to ensure the effectiveness of the major artifact removal steps. For instance, if ocular artifacts are not thoroughly removed, they can be easily identified in Figure 5. Additionally, the third major step, PCA correction, ensures that extremely large-amplitude artifacts are eliminated.

While the protocol involves a degree of subjectivity, reproducibility can be ensured if: 1) all decision-making parameters are documented (which is guaranteed because all selected time windows are recorded in the EEG structure data), and 2) different operators, following the same principles, produce consistent end results. To test this, we recruited three operators to conduct EEG preprocessing according to the three major steps outlined above. The first operator was an undergraduate student with no EEG data processing experience; the second was a first-year PhD student with one year of EEG analysis experience; and the third was an experienced researcher with over 10 years of EEG analysis experience. These three operators independently conducted the preprocessing according to the described procedures and principles. We expect that the three independent practices will lead to highly convergent ERP results if the protocol is stable. The comparison of the three operators' results is shown in Figure 10, which demonstrates high consistency.

Figure 10.

Comparison of end result (grand-average ERP from Pz) from three researchers independently conducting the pre-processing protocol

The digital results shown in Figure 10 and the preprocessed data from the three operators have been uploaded to the hosting GitHub page. Since all the time windows for ICA and PCA correction have been documented in the data, the results in Figure 10 can be replicated by strictly following the selected windows. Readers interested in practicing the protocol can perform the preprocessing themselves using the raw data (also uploaded) from these 20 participants to see if the ERPs they generate are close to the results in Figure 10. If they are, it would confirm the stability of the end results despite the subjective, principle-guided decisions made during the process.

We further applied the protocol to the second dataset and identified a simple research question to demonstrate the applicability of the protocol to cognitive research. Since the second dataset contains an ocular calibration session, we applied the procedure described above for the situation in which a calibration session of ocular movement is available. The simple research question was whether single trial ERP latencies are indicative of cognitive speed. In Figure 11, we plotted two versions of single trial ERPs sorted by RT and averaged across 20 participants (the dataset can be downloaded from the GitHub page). The first version is from the data after applying the pre-processing protocol (Figures 11A and 11B). The second version is from the data after only simple band-pass filtering (Figures 11C and 11D). As can be seen from the difference between Figures 11A and 11C, the protocol effectively removed the main artifacts, making the single trial data much cleaner. We further plotted the single trial RTs (averaged across 20 participants) against the detected single trial peak latencies. The cleaned version (Figure 11B) displayed a clearer association, while the uncleaned version (Figure 11D) appeared to contain substantial random variation caused by artifacts that compromised the association. The correlations in Figures 11B and 11D are 0.79 and 0.73, respectively. The data are from electrode Oz, where the later ERP component is the strongest in this dataset.

Figure 11.

Single trial ERP and its association with cognitive speed

(A) Single trial ERP from Oz averaged across 20 participants, sorted by single trial RTs.

(B) The scatterplot between detected peak latencies and single trial RTs from plot A. The results from (A and B) have been pre-processed by the protocol.

(C and D) Same as (A and B) but without pre-processing except for bandpass filtering.

Limitations

Scope of the protocol

As stated at the beginning, this protocol is designed to support researchers using EEG for cognitive research by removing major artifacts with intermediate data quality inspection, ensuring that no significant, large-amplitude artifacts remain in the pre-processed data. The protocol is applicable to most early-stage research or research paradigms aiming to obtain and characterize major neural correlates or indicators from EEG data. These neural correlates or indicators typically have well-established patterns that distinguish them clearly from artifacts (e.g., N170, P300, N400, alpha suppression effect, etc.).

However, this protocol may not be suitable for research that aims to examine highly subtle neural effects, as it is not positioned as one that thoroughly removes all artifacts with theoretical and methodological completeness. For example, eye movements produce muscle spike potentials that can mimic gamma oscillations and require very sophisticated procedures to remove,12,20,21 which are beyond the scope of the current protocol.

In terms of task and experimental paradigms, the protocol is best suited for laboratory-based paradigms where participants remain still throughout the entire task with minimal behavioral responses (e.g., only button presses). It may not be applicable to paradigms that tend to produce excessive artifacts (e.g., continuous speech production tasks, free-moving outdoor tasks, tasks performed by infants, etc.), which may require more aggressive methods to remove excessive artifacts.22,23

In summary, the protocol is oriented towards removing large-amplitude artifacts that are more likely to contribute variance to the end results. Some artifacts, though clearly non-neural in origin, may not need to be removed in certain contexts if they do not significantly contribute variance to the end results. One example is heartbeat-related artifacts. Compared to large-amplitude artifacts such as eye blinks and eye movements, heartbeat-related artifacts contribute much less variance to the EEG data and are less consistently found across participants. This is also the case for the current sample dataset with 20 participants: very few participants had heartbeat artifacts detected by ICLabel with high confidence, making the decision to remove them difficult. To demonstrate the difference between heartbeat artifacts and ocular artifacts in the variance they contribute to the end results, we identified a participant for whom ICLabel identified a heartbeat artifact IC with relatively high probability (around 40%). The time course of the source activity of this IC is shown in Figure 12B, which clearly shows a typical heartbeat frequency.

Figure 12.

Comparison of contributions of ocular artifacts and heartbeat artifacts to the trial-average ERP

(A–C) The scalp map, time segment, and spectrum of an IC identified as a heartbeat artifact from a single participant.

(D–F) Three versions of trial-average ERP from Pz: 1) without removing ocular artifacts (D), 2) after removing ocular artifacts (E), 3) after removing both ocular artifacts and the identified heartbeat artifact (F).

We then plotted three versions of ERPs: without any artifact identification and removal (Figure 12D); with ocular artifacts removed (Figure 12E); and with both ocular artifacts and the specifically identified heartbeat artifact removed (Figure 12F). The comparison of Figures 12D–12F clearly shows that the heartbeat artifact’s contribution to the end result is negligible (difference between Figures 12E and 12F) compared to the ocular artifact’s contribution (difference between Figures 12D and 12E). This result further supports the utility of the current protocol: focusing on removing large-amplitude artifacts can achieve a high level of practical utility.

Another limitation lies in the decision-making process during the PCA correction procedure. The PCA correction procedure is primarily designed to remove abnormally large and transient artifacts. By definition, “transient” means that there is no stable statistical property across the occurrences of the artifact, implying the idiosyncratic nature of each artifact. This characteristic makes these artifacts theoretically unremovable by the ICA procedure. Based on the assumption that large-amplitude segments are produced by high-variance sources independent of brain activity (e.g., temporary disconnection of a single electrode), PCA can effectively remove them by eliminating the large-amplitude PCs. However, deciding whether to correct a segment may not be straightforward.

One can establish a criterion beforehand and follow it during practice or instruct other data processing operators to adhere to it. For instance, we may decide that any segment in which a cluster exceeds the range of ±100 μV in the EEG (after the bandpass filtering step) must be corrected. This approach would inevitably leave some very transient-looking artifacts uncorrected. For example, Figure 13 shows two instances of muscle-like artifacts. If we set the cutoff at ±100 μV, the first instance would be corrected in the PCA correction procedure, while the second instance would likely be left out. However, the second cluster also appears quite transient and abnormal, and would convincingly be regarded as an artifact.

Figure 13.

An EEG segment with 32 channels superimposed showing two clusters of suspected artifacts

This exemplifies one of the most fundamental questions in the field: What constitutes signal and what constitutes noise? How can they be clearly separated? Nonetheless, this issue is less significant for ERP research if: 1) participants follow instructions and minimize movements; 2) there are a sufficient number of trials and participants to effectively cancel out noise produced by ambiguous remaining artifacts in single trials; and 3) during the ERP averaging procedure, outlier detection is conducted on single trials to exclude outlier trials before cross-trial averaging. For example, the mean global field power of each single trial ERP can be calculated, and outliers can be defined as trials whose value falls outside median ± n × (median absolute deviation), where n can be set by the users (e.g., 5).
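The outlier rule in point 3 can be sketched as follows. This is a minimal illustration on simulated epochs, not code from the protocol: the array layout (channels × samples × trials), the variable names, and the definition of global field power as the standard deviation across channels are all assumptions of this sketch.

```matlab
% Simulated epochs: 4 channels x 50 samples x 12 trials (values are made up)
nchan = 4; nsamp = 50; ntrial = 12;
[c, s, r] = ndgrid(1:nchan, 1:nsamp, 1:ntrial);
epochs = sin(0.3*s + c) .* (1 + 0.1*cos(r));   % similar amplitude across trials
epochs(:, :, 7) = 20 * epochs(:, :, 7);        % one contaminated trial

% Mean global field power of each single trial
gfp = squeeze(std(epochs, 0, 1));              % samples x trials
trial_gfp = mean(gfp, 1);                      % 1 x trials

% Flag trials outside median +/- n x (median absolute deviation)
n = 5;                                         % user-defined multiplier, as in the text
med = median(trial_gfp);
mad_val = median(abs(trial_gfp - med));
is_outlier = abs(trial_gfp - med) > n * mad_val;

% Average only the retained trials
erp = mean(epochs(:, :, ~is_outlier), 3);
fprintf('excluded trial(s): %s\n', mat2str(find(is_outlier)));
```

The median absolute deviation is computed by hand here so that the sketch does not depend on any toolbox.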

Another limitation that needs to be addressed concerns the theoretical foundation of the PCA correction procedure. The PCA correction procedure is theoretically appropriate if its main assumption is met: that the transient artifact has a large amplitude and is independent of genuine neural activity signals. A typical example is a short-period disconnection of a single electrode (e.g., a few seconds) or an unexpected sudden dragging of the cap, both of which can lead to extremely high fluctuations at the affected electrode(s). These types of artifacts typically manifest as large variance that is significantly different from the second PC and beyond in the scree plot (which shows the amplitude/variance of each PC). Additionally, the scalp map will appear non-brain-like. Two examples of these artifacts and their corrections are shown in Figure 14. In such scenarios, the assumption is met, and the PCA correction procedure can effectively correct the transient artifact (Figure 14).

Figure 14.

Scenarios when PCA is effective in removing artifacts with high abnormality

However, there are scenarios where the artifactual PCs are not among the first few large-amplitude PCs that are distinctly different from all other neural PCs. In such cases, removing the first few PCs until the remaining signals' global field power is normalized is not necessarily a theoretically sound approach but more of a practical solution. This issue is particularly evident in transient muscle artifacts. Muscle vibrations generate electrical potentials that are not statistically stationary and are not consistently found across participants, making these artifacts difficult to remove.

The complexity increases when some muscle artifacts are not single-sourced; there could be multiple sources underlying the same cluster of muscle artifacts. Figure 15 illustrates such an example. Several features of the PCA decomposition demonstrate the difficulty of removing these muscle artifacts: 1) The scree plot (Figure 15B) shows a gradually decreasing pattern of PC amplitudes rather than a clear drop at the early stages, implying non-differentiability of the PCs. 2) The time courses of the first eight PCs all display patterns that capture the transient activity of the muscle artifacts, further confirming that these artifacts cannot be isolated. 3) The scalp maps of the first eight PCs show diverse patterns, indicating that the muscle artifacts are distributed over the scalp and head in a spatially complex manner. Therefore, in scenarios similar to Figure 15, removing the first few PCs is a practical procedure aimed at reducing the contribution of large-amplitude components to the variance of the end results, rather than a theoretically sound one.
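The contrast between a clear drop and a gradual decrease in the scree plot can be illustrated with a toy comparison. The two simulated segments below are hypothetical and are not taken from Figure 14 or Figure 15: one contains a single dominant source, the other contains several comparably strong sources with different spatial patterns. The ratio of the first to the second singular value is used here as a simple, assumed summary of how well the first PC stands out.

```matlab
% Toy 1-s segments: 4 channels x 250 samples (all values are made up)
srate = 250;
t = (0:249) / srate;
bg = [sin(2*pi*10*t); cos(2*pi*10*t); sin(2*pi*6*t); cos(2*pi*6*t)];   % weak background

% Case 1: one large source projected to all channels with a fixed pattern
single = bg + [1; 0.3; 0.1; 0.05] * (50*sin(2*pi*2*t));

% Case 2: four comparably large, mutually uncorrelated sources, one per channel
multi = bg + 50*[sin(2*pi*31*t); sin(2*pi*37*t); sin(2*pi*43*t); sin(2*pi*47*t)];

s_single = svd(single);                        % singular values, largest first
s_multi  = svd(multi);

ratio_single = s_single(1) / s_single(2);      % large: clear drop after PC1
ratio_multi  = s_multi(1)  / s_multi(2);       % near 1: no PC stands out

fprintf('PC1/PC2 amplitude ratio: single source %.1f, multiple sources %.2f\n', ...
        ratio_single, ratio_multi);
```

When the ratio is close to 1, as in the second case, removing the first few PCs cannot be expected to isolate the artifact, which matches the practical rather than theoretical justification described above.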

Figure 15.

An example of EEG segment that contains muscle artifact-like component

This artifact is not generated by a separate source as shown by its permeation to most PCs.

(A) EEG segment containing artifact.

(B) Scree plot of PCA.

(C–J) Decomposed principal components and their scalp maps.

It is important to recognize that there are scenarios contrary to the previous ones: neural activities could be mistaken for artifacts. This can occur with Alpha activity. Although uncommon, we found in our datasets that some participants exhibit very strong Alpha oscillations, reaching up to ±100 μV. In the EEG segment shown in Figure 16, the top panel displays segments with spiky activity bursts reaching ±100 μV. Upon zooming in on a 2-s segment (Figure 16B), we observed that this bursting activity exhibits precise oscillations at 11 Hz, predominantly in a right-lateralized occipital area. These characteristics strongly suggest that this burst activity is an Alpha oscillation, distinct from a muscle artifact typically above 30 Hz. Therefore, when using the PCA procedure to correct clusters of large-amplitude artifacts, it is advisable to check the oscillation frequency of the cluster. A precise oscillation frequency around 10 Hz would indicate that the cluster is an Alpha oscillation, representing neural activity rather than artifacts.
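The frequency check can also be done numerically. The sketch below estimates the dominant frequency of a simulated 2-s single-channel burst with the built-in fft function. The signal, the variable names, and the 8–13 Hz range used to label the peak as Alpha are assumptions of this illustration and are not part of the protocol.

```matlab
% Simulated 2-s burst: strong 11 Hz oscillation plus a weak 35 Hz component
srate = 250;                                   % sampling rate in Hz (assumed)
t = (0:499) / srate;
seg = 90*sin(2*pi*11*t) + 5*sin(2*pi*35*t);    % microvolts, values are made up

% Power spectrum of the mean-removed segment
n = numel(seg);
spec = abs(fft(seg - mean(seg))).^2;
freqs = (0:n-1) * srate / n;                   % frequency of each FFT bin in Hz

% Dominant frequency among the positive frequencies (DC excluded)
half = 2:floor(n/2);
[~, idx] = max(spec(half));
peak_hz = freqs(half(idx));

in_alpha = (peak_hz >= 8) && (peak_hz <= 13);  % conventional Alpha range, assumed here
fprintf('dominant frequency: %.1f Hz (Alpha range: %d)\n', peak_hz, in_alpha);
```

With a 2-s window the frequency resolution is 0.5 Hz, which is fine enough to separate an Alpha burst from muscle activity above 30 Hz.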

Figure 16.

An EEG segment containing high amplitude, Alpha oscillation-like activity with a central frequency at 11 Hz

(A) EEG segment containing potential artifacts.

(B) Artifact-like activity with alpha band oscillation.

Another important note about PCA correction is that the procedure is designed to address transient (short-lasting), abnormal clusters or bursts of unexpected artifacts, typically lasting from a few hundred milliseconds to a few seconds. If persistently long artifacts (ranging from dozens of seconds to minutes) are detected and their nature is unclear, it is recommended to either remove the entire segment or exclude the participant’s data entirely.

Finally, we would like to remind readers that the current protocol is intended to present a package of procedures, rather than a set of theoretical arguments for the underlying methodologies. Every method involved, including PCA, ICA, and even filtering, has inherent limitations and issues. For example, while ICA can effectively correct ocular artifacts, it may also result in under-correction or over-correction. These issues are extensively discussed in the literature (e.g.,12,24,25). We hope that this protocol will help researchers effectively manage the complex and substantial artifacts in their data, allowing them to access neural signals more reliably.

Troubleshooting

Problem 1

Error prompted from Matlab: “Unrecognized function or variable 'pop_loadbv'”.

Potential solution

The current demonstration dataset was collected with EEG equipment from Brain Products, which saves the data in BrainVision Recorder format, where each recording consists of three files: a header file (*.vhdr), a marker file (*.vmrk), and a raw EEG data file (*.eeg). Because there are numerous commercial EEG systems that save EEG data in different formats, the data import tools are not automatically installed in EEGLAB but are available for download. Therefore, users should install the data import extension for BrainVision Recorder format to avoid the error reported above. To install it, launch EEGLAB, click “Manage EEGLAB extension”, then find and install ‘bva-io’. You can find different data import extensions from here.
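Before running the scripts, a quick check such as the one below can confirm whether the import function is visible to Matlab. It uses only the built-in exist function; the variable name has_loader and the messages are hypothetical.

```matlab
% exist(..., 'file') returns 2 when a matching file is found on the Matlab path
has_loader = (exist('pop_loadbv', 'file') == 2);

if has_loader
    disp('pop_loadbv found: BrainVision Recorder files can be imported.');
else
    disp('pop_loadbv not found: install the bva-io extension as described above.');
end
```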

Problem 2

EEGLAB and ICLabel version issue, step 5.

Potential solution

During the testing of our protocol, we noticed a bug in the ICLabel (version 1.5) results in some of the 2023 and 2024 versions of EEGLAB. This bug causes unambiguous ocular artifact ICs to be reported as brain signals. It has been reported to the EEGLAB team and has been fixed (https://Github.com/sccn/eeglab/issues/788) in EEGLAB v2024.1 (with ICLabel 1.6). We recommend that users practice the procedures described in the current protocol with EEGLAB 2022.0. Users who want to use a more recent EEGLAB version should use EEGLAB v2024.1 or later. If users see that the ocular artifacts are correctly reported by ICLabel, they can proceed with that version.

Problem 3

In the PCA correction procedure, ocular artifacts are still present, step 9.

Potential solution

This occurs when users forget to click ‘Save Data’ in the previous ICA correction step before closing the window.

Problem 4

There is a clearly identified eye artifact from the ICLabel scalp map results. But the eye-blink related traces are still clearly visible in the ‘after’ panel (Figure 5).

Potential solution

This issue occurs in about 5% of our practices, which is why real-time checking of the artifact removal effect is important. Typically, ICA produces more than one ocular artifact IC. Even for eye blinks, there could be more than one IC associated with them. It could happen that some major eye-blink artifact ICs fall below the 80% probability level. If you still see clear eye-blink-like activity in the ‘after’ panel in Figure 5, you could try a different time window for ICA training, or lengthen/shorten the time window until no eye-blink related traces remain in the ‘after’ panel in Figure 5.

Problem 5

Missing Matlab toolbox.

Potential solution

Whenever Matlab reports a missing toolbox (e.g., the Statistics and Machine Learning Toolbox, which is needed for the PCA part), simply follow Matlab’s instructions to install the toolbox (also called an Add-On in some contexts).

Problem 6

Not able to obtain good ICA decomposition even when the segment of calibration data ‘looks’ good?

Potential solution

If you are using a segment of ocular calibration data to obtain ICA decomposition results for ocular artifact removal in task data, you may encounter situations where the combination of ICA decomposition and ICLabel fails to properly identify and remove ocular artifacts, as shown in the ‘after’ panel in Figure 9. Several factors could contribute to this issue. One possibility is that the calibration data is too short. In the example we provided (ocular_movement_task.mp4), the calibration duration was only 81 s, which may be insufficient. We recommend using a longer calibration period (e.g., 3 min or more) to allow ICA to robustly identify all ocular artifacts.

Another potential cause could be the impact of the reference used. In a previous case where this issue occurred in our own practice, we found that applying an average reference before ICA decomposition resolved the problem. Although we do not fully understand how reference selection affects the performance of ICA and ICLabel, it appears to have a significant impact. However, when applying average re-referencing, ensure it is done after handling (interpolating or removing) bad channels.
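The reason for handling bad channels before average re-referencing can be seen in a small numerical sketch. The Python example below is a toy illustration with made-up values, not the protocol's implementation. It subtracts the across-channel mean from every channel at each sample, once with a bad channel included and once with it removed. Removal is used here only for simplicity; interpolation is the other option named in the text.

```python
def average_reference(data):
    """Subtract the across-channel mean at each sample.

    data: list of channels, each a list of samples of equal length.
    """
    n_chan = len(data)
    n_samp = len(data[0])
    means = [sum(ch[t] for ch in data) / n_chan for t in range(n_samp)]
    return [[ch[t] - means[t] for t in range(n_samp)] for ch in data]


# Toy data (made up): three good channels and one bad channel, two samples each.
good = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
bad = [1000.0, -1000.0]

with_bad = average_reference(good + [bad])   # bad channel leaks into the reference
without_bad = average_reference(good)        # bad channel handled first

print(with_bad[0])     # [-250.5, 249.75]
print(without_bad[0])  # [-1.0, -1.0]
```

With the bad channel included, its large values enter the reference and are subtracted from every good channel, which is why the order of operations matters.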

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Guang Ouyang (ouyangg@hku.hk).

Technical contact

Technical questions on executing this protocol should be directed to and will be answered by the technical contact, Guang Ouyang (ouyangg@hku.hk).

Materials availability

This study did not generate new unique reagents.

Data and code availability

The data and code are available at https://github.com/guangouyang/EEG_preprocessing_protocol (DOI: https://doi.org/10.5281/zenodo.14816760).

Acknowledgments

This work was supported by the Hong Kong Research Grant Council GRF (17609321) and Seed Fund for Basic Research from the University of Hong Kong (2202100568 and 2203100569) to G.O.

Author contributions

G.O. designed the protocol, developed the code, and wrote the manuscript. Y.L. co-developed the code and user interface, tested the code, and edited the manuscript.

Declaration of interests

The authors declare no competing interests.

References

  • 1. Pedroni A., Bahreini A., Langer N. Automagic: Standardized preprocessing of big EEG data. Neuroimage. 2019;200:460–473. doi: 10.1016/j.neuroimage.2019.06.046.
  • 2. Delorme A. EEG is better left alone. Sci. Rep. 2023;13:2372. doi: 10.1038/s41598-023-27528-0.
  • 3. Gabard-Durnam L.J., Mendez Leal A.S., Wilkinson C.L., Levin A.R. The Harvard Automated Processing Pipeline for Electroencephalography (HAPPE): Standardized Processing Software for Developmental and High-Artifact Data. Front. Neurosci. 2018;12:97. doi: 10.3389/fnins.2018.00097.
  • 4. Bigdely-Shamlo N., Mullen T., Kothe C., Su K.M., Robbins K.A. The PREP pipeline: standardized preprocessing for large-scale EEG analysis. Front. Neuroinform. 2015;9:16. doi: 10.3389/fninf.2015.00016.
  • 5. Luck S.J. MIT Press; 2014. An Introduction to the Event-Related Potential Technique.
  • 6. Debnath R., Buzzell G.A., Morales S., Bowers M.E., Leach S.C., Fox N.A. The Maryland analysis of developmental EEG (MADE) pipeline. Psychophysiology. 2020;57. doi: 10.1111/psyp.13580.
  • 7. Bailey N.W., Biabani M., Hill A.T., Miljevic A., Rogasch N.C., McQueen B., Murphy O.W., Fitzgerald P.B. Introducing RELAX: An automated pre-processing pipeline for cleaning EEG data - Part 1: Algorithm and application to oscillations. Clin. Neurophysiol. 2023;149:178–201. doi: 10.1016/j.clinph.2023.01.017.
  • 8. Clayson P.E. Beyond single paradigms, pipelines, and outcomes: Embracing multiverse analyses in psychophysiology. Int. J. Psychophysiol. 2024;197. doi: 10.1016/j.ijpsycho.2024.112311.
  • 9. Klug M., Gramann K. Identifying key factors for improving ICA-based decomposition of EEG data in mobile and stationary experiments. Eur. J. Neurosci. 2021;54:8406–8420. doi: 10.1111/ejn.14992.
  • 10. Winkler I., Debener S., Müller K.-R., Tangermann M. IEEE; 2015. On the Influence of High-Pass Filtering on ICA-Based Artifact Reduction in EEG-ERP; pp. 4101–4105.
  • 11. Chaumon M., Bishop D.V.M., Busch N.A. A practical guide to the selection of independent components of the electroencephalogram for artifact correction. J. Neurosci. Methods. 2015;250:47–63. doi: 10.1016/j.jneumeth.2015.02.025.
  • 12. Dimigen O. Optimizing the ICA-based removal of ocular EEG artifacts from free viewing experiments. Neuroimage. 2020;207. doi: 10.1016/j.neuroimage.2019.116117.
  • 13. Lee T.W., Girolami M., Sejnowski T.J. Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural Comput. 1999;11:417–441. doi: 10.1162/089976699300016719.
  • 14. Delorme A., Makeig S. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods. 2004;134:9–21. doi: 10.1016/j.jneumeth.2003.10.009.
  • 15. Pion-Tonachini L., Kreutz-Delgado K., Makeig S. ICLabel: An automated electroencephalographic independent component classifier, dataset, and website. Neuroimage. 2019;198:181–197. doi: 10.1016/j.neuroimage.2019.05.026.
  • 16. Comon P., Jutten C. Academic Press; 2010. Handbook of Blind Source Separation: Independent Component Analysis and Applications.
  • 17. Plochl M., Ossandon J.P., Konig P. Combining EEG and eye tracking: identification, characterization, and correction of eye movement artifacts in electroencephalographic data. Front. Hum. Neurosci. 2012;6:278. doi: 10.3389/fnhum.2012.00278.
  • 18. Delorme A., Palmer J., Onton J., Oostenveld R., Makeig S. Independent EEG sources are dipolar. PLoS One. 2012;7. doi: 10.1371/journal.pone.0030135.
  • 19. Winkler I., Haufe S., Tangermann M. Automatic classification of artifactual ICA-components for artifact removal in EEG signals. Behav. Brain Funct. 2011;7:30. doi: 10.1186/1744-9081-7-30.
  • 20. Keren A.S., Yuval-Greenberg S., Deouell L.Y. Saccadic spike potentials in gamma-band EEG: characterization, detection and suppression. Neuroimage. 2010;49:2248–2263. doi: 10.1016/j.neuroimage.2009.10.057.
  • 21. Muthukumaraswamy S.D. High-frequency brain activity and muscle artifacts in MEG/EEG: a review and recommendations. Front. Hum. Neurosci. 2013;7:138. doi: 10.3389/fnhum.2013.00138.
  • 22. Chang C.-Y., Hsu S.-H., Pion-Tonachini L., Jung T.-P. IEEE; 2018. Evaluation of artifact subspace reconstruction for automatic EEG artifact removal; pp. 1242–1245.
  • 23. Ouyang G., Dien J., Lorenz R. Handling EEG artifacts and searching individually optimal experimental parameter in real time: a system development and demonstration. J. Neural. Eng. 2022;19. doi: 10.1088/1741-2552/ac42b6.
  • 24. Groppe D.M., Makeig S., Kutas M., Diego S. Independent component analysis of event-related potentials. Cogn. Sci. 2008;6:1–44.
  • 25. Pontifex M.B., Gwizdala K.L., Parks A.C., Billinger M., Brunner C. Variability of ICA decomposition may impact EEG signals when used to remove eyeblink artifacts. Psychophysiology. 2017;54:386–398. doi: 10.1111/psyp.12804.

Articles from STAR Protocols are provided here courtesy of Elsevier
