Cognitive Neurodynamics. 2024 Aug 14;18(6):3757–3773. doi: 10.1007/s11571-024-10162-5

Set-pMAE: spatial-spEctral-temporal based parallel masked autoEncoder for EEG emotion recognition

Chenyu Pan 1,2, Huimin Lu 1,2, Chenglin Lin 1,2, Zeyi Zhong 1,2, Bing Liu 1
PMCID: PMC11655997  PMID: 39712088

Abstract

The utilization of Electroencephalography (EEG) for emotion recognition has emerged as the primary tool in the field of affective computing. Traditional supervised learning methods are typically constrained by the availability of labeled data, which can result in weak generalizability of the learned features. Additionally, EEG signals are highly correlated with human emotional states across the temporal, spatial, and spectral dimensions. In this paper, we propose a Spatial-spEctral-Temporal based parallel Masked Autoencoder (SET-pMAE) model for EEG emotion recognition. SET-pMAE learns generic representations of spatial-temporal features and spatial-spectral features through a dual-branch self-supervised task. The reconstruction task of the spatial-temporal branch aims to capture the spatial-temporal contextual dependencies of EEG signals, while the reconstruction task of the spatial-spectral branch focuses on capturing the intrinsic spatial associations of the spectral domain across different brain regions. By learning from both tasks simultaneously, SET-pMAE can capture generalized feature representations from both tasks, thereby reducing the risk of overfitting. In order to verify the effectiveness of the proposed model, a series of experiments are conducted on the DEAP and DREAMER datasets. The experimental results reveal that, by employing self-supervised learning, the proposed model effectively captures more discriminative and generalized features, thereby attaining excellent performance.

Keywords: EEG, Emotion recognition, Self-supervised learning, Transformer

Introduction

Affective computing aims to develop systems that can recognize, understand, process, and simulate human emotions. Emotion recognition is the core task in this field, as it directly determines a system's ability to accurately perceive and respond to the user's emotional state. Currently, emotion recognition is widely applied in various domains, including education, healthcare, security, and finance. Emotion recognition methods can be broadly classified into two categories: those based on non-physiological signals, such as facial expressions (Liu et al. 2023), body postures (Yin et al. 2024), and speech signals (Wagner et al. 2023), and those based on physiological signals, such as Electroencephalogram (EEG) (Jafari et al. 2023), Electromyography (EMG) (Xu et al. 2023a), and Electrocardiogram (ECG) (Fan et al. 2023), among others. EEG signals are electrical signals recorded from different cortical locations that reflect the firing of neurons in the corresponding brain regions. Compared with other physiological signals, EEG is more sensitive to the activity state of the brain and has a stronger correlation with emotions. Furthermore, EEG has the advantages of high resolution and being hard to camouflage in comparison to facial expressions and sounds (Li et al. 2023b). However, it also faces numerous challenges: EEG signals are high-dimensional and nonlinear, containing a significant amount of noise and artifacts, which makes it extremely difficult to extract meaningful emotional features from them. Additionally, emotional states are subjective and multidimensional; annotating emotional data typically relies on self-report or external observation, which may introduce bias. Meanwhile, the rapid development of non-invasive EEG recording methods has facilitated the widespread use of EEG signals in research on emotion recognition (Zhang et al. 2019; Xu et al. 2023b; Li et al. 2023a), making it a focal area of study in human-computer interaction and affective computing (Can et al. 2023). Currently, in EEG-based emotion recognition, there are two primary emotion models: discrete models and dimensional models. Discrete models categorize emotions into six basic types: anger, disgust, fear, happiness, sadness, and surprise; Aydın and Onbaşı (2024) and Kılıç and Aydın (2022) have extensively investigated discrete emotion models. Dimensional models define emotions as points in a Cartesian coordinate system, incorporating dimensions such as valence, arousal, and dominance. Depending on the dimensions used, dimensional models can be divided into two-dimensional (valence-arousal) and three-dimensional (valence-arousal-dominance) models. The dimensional model can represent a greater range of emotional nuances than the discrete model, aligning more closely with how humans actually perceive external stimuli. Therefore, the dimensional model is adopted in this research.

With the development of deep learning, EEG emotion recognition methods based on deep learning have been widely proposed. Song et al. (2018) proposed a multi-channel EEG emotion recognition method based on a novel Dynamic Graph Convolutional Neural Network (DGCNN) that models multi-channel EEG signal features with a graph structure. Yin et al. (2021) extracted EEG graph-domain features with multiple GCNNs and temporal features with long short-term memory (LSTM), and finally fused the two feature types for emotion classification. However, convolutional neural networks (CNNs) tend to ignore global information, while recurrent neural networks (RNNs) cannot capture spatial information and have low parallel efficiency. EEG signals are rich in temporal, spatial, and spectral features: Gong et al. (2023b) performed emotion recognition by fusing the spatial and temporal features of EEG signals, and Wang et al. (2023) integrated the spectral and spatial information of EEG signals for emotion recognition. Although existing emotion recognition models have achieved high accuracy, most of them consider only a single type of feature or a combination of two types. These models lack complementarity between features, which limits their performance and leads to inadequate results in complex emotion recognition tasks. In addition, most machine learning or deep learning models for emotion recognition are trained under the supervised learning paradigm, which has some limitations. First, in the typical supervised learning paradigm, models need to be trained from scratch for each task, which requires significant computational resources and time. Additionally, the representations learned by a supervised model are typically specific to the dataset, which may result in overfitting and poor generalization. Finally, supervised learning methods rely on large manually annotated datasets for training; the annotation process is time-consuming, labor-intensive, and poses challenges for widespread application. To overcome these limitations, self-supervised learning offers excellent data efficiency and generalization capability. Self-supervised learning leverages the structural information inherent in the data itself, significantly reducing reliance on manual annotation. This not only reduces the cost of annotation but also improves model generalization, enabling excellent performance across various emotion recognition tasks.

To address the issues mentioned above, this paper proposes a Spatial-spEctral-Temporal based parallel Masked Autoencoder (SET-pMAE) model for EEG emotion recognition. The model consists of two branches: a spatial-temporal branch and a spatial-spectral branch. It is pre-trained by reconstructing masked spatial-temporal and spatial-spectral features simultaneously. Training the two tasks concurrently enables information interaction and mutual reinforcement between the model's two branches, thereby enhancing the effectiveness of pre-training. The masked reconstruction pre-training endows the two branches with strong capabilities to extract temporal context features and spatial-spectral correlation information, respectively, which benefits downstream emotion recognition tasks.

The main contributions of this paper are as follows:

  • In order to learn generic representations of EEG signals, we design masked reconstruction tasks for both the spatial-temporal features and the spatial-spectral features of EEG signals. By learning these two tasks simultaneously, the network can extract spatial-temporal contextual information and spectral-spatial correlation information, which reduces overfitting of the network to the emotion recognition task.

  • We propose a parallel masked autoencoder network that integrates the complex spatial, spectral, and temporal features of EEG signals simultaneously through a dual-branch network in an end-to-end framework. The EEG signals are transformed into 3D spatial-temporal representations and 3D spatial-spectral representations before being input to the network.

  • We conducted experiments on the DEAP and DREAMER datasets. On the DEAP dataset, the average accuracies for valence and arousal are 97.79% and 97.90%, respectively. On the DREAMER dataset, the average accuracies for valence, arousal, and dominance are 95.08%, 96.44%, and 96.90%, respectively. These results outperform the other methods on both datasets and demonstrate the strength of our method.

The remainder of this article is organized as follows. Related work is described in Sect. 2, and Sect. 3 presents the details of the proposed SET-pMAE network, including EEG signal preprocessing, network structure, self-supervised pre-training, and fine-tuning. The experiments conducted in Sect. 4 demonstrate the effectiveness of the proposed SET-pMAE and present the experimental results. Section 5 concludes this paper.

Related work

EEG-based emotion recognition

EEG-based emotion recognition is currently attracting more and more attention. The general process of EEG emotion recognition includes feature extraction and classification. Feature extraction is one of the most critical stages: given the complexity of EEG signals, the feature extractors need to perform well. Before deep learning methods were widely adopted, EEG spectral features such as power spectral density (PSD) (Wang et al. 2022), differential entropy (DE) (Duan et al. 2013), and differential asymmetry (DASM) (Liu and Sourina 2013) were commonly used, and support vector machines (SVM), random forests (RF), or logistic regression (LR) were then used to classify the extracted spectral features. Among them, DE features have proven to be the most accurate and stable in EEG-based emotion recognition tasks (Zheng et al. 2017).

With the wide application of deep learning methods, various techniques such as CNNs, RNNs, and GNNs (Graph Neural Networks) have been employed to further extract features from EEG signals. These methods are gradually being explored to elucidate the spectral, spatial, and temporal domains of EEG features. Compared with traditional machine learning algorithms, deep learning can automatically exploit the features and mine the intrinsic information of the data during training. Aydın (2020) proposed a novel emotion complexity marker, combining principal component analysis (PCA) with the phase space trajectory matrix (PSTM), and utilized deep neural networks for the classification of discrete emotions. Du et al. (2022) utilized an attention-based LSTM, enabling the model to focus on emotion-related EEG channels through the attention mechanism. Gao et al. (2021) developed an innovative channel-fused dense convolutional network that effectively extracts features from noisy EEG signals. Considering that existing studies had neglected the complex dependencies between neighbouring EEG signals, Zhang et al. (2019) transformed EEG signals into a 2D spatial mapping and proposed two deep learning-based frameworks, both consisting of CNNs and RNNs, to efficiently explore the retained spatial and temporal information in a cascaded or parallel manner. Zhou et al. (2023) performed emotion recognition by computing PSD and DE features of EEG signals and mapping them to a 2D spatial representation, then extracting temporal and spatial features with BiLSTM and UNet, respectively. In addition to RNNs and CNNs, recent work based on Transformers and GNNs has achieved good results. Gong et al. (2023a) proposed an attention-based convolutional Transformer hybrid network that achieves state-of-the-art performance. Wei et al. (2023) proposed a Transformer capsule network, which mainly contains an EEG Transformer module for extracting EEG features and an emotion capsule module for refining the features and classifying the emotional states, showing effectiveness on multiple datasets. Liu et al. (2023a) employed a graph convolutional network (GCN) to integrate both global and local features, thereby facilitating the extraction of more complex features. Zhong et al. (2020) proposed a Regularized Graph Neural Network (RGNN) for EEG-based emotion recognition, which takes into account the biological topology among different brain regions to capture the local and global relationships between different EEG channels.

Self-supervised learning

Self-supervised learning is a popular method for learning intrinsic information from unlabeled data. Because it can exploit large amounts of unlabeled data, self-supervised models have great potential for learning generalizable and robust representations. In the field of natural language processing, BERT (Devlin et al. 2018) is a typical example of self-supervised learning: it learns by randomly masking words in text, and predicting the masked words is its self-supervised task. SpanBERT (Joshi et al. 2020) proposes masking contiguous spans instead of random individual words. ERNIE (Sun et al. 2019) proposes entity-level and phrase-level masking, which incorporates phrase-level and entity-level knowledge into linguistic representations. In the field of computer vision, numerous studies have obtained general image features through colorization or jigsaw tasks and applied them to image classification (Zhang et al. 2016, 2017; Noroozi and Favaro 2016; Chen et al. 2021). Gidaris et al. (2018) exploited the spatial structure of images and constructed a self-supervised task of predicting the rotation angle of an image to train a network. Furthermore, self-supervised methods based on contrastive learning have achieved excellent performance (Chen et al. 2020; Grill et al. 2020).

In the field of EEG-based emotion recognition, Xie et al. (2021) proposed a pretext task that applies six different transformations to EEG signals and learns a generic representation by recognizing which transformation was applied. Kostas et al. (2021) used contrastive self-supervised learning to learn a representation of EEG signals. Li et al. (2023b) employed a contrastive self-supervised learning approach, integrating multiple self-supervised tasks, including spatial and spectral jigsaw puzzles as well as a contrastive learning task, to learn a more generic EEG signal representation. The masked autoencoder (MAE) (He et al. 2022) is an autoencoder-based unsupervised learning algorithm; by masking part of the data, it can learn semantic information in high-dimensional data (e.g., images) more efficiently. MAE has been employed in sleep-stage EEG classification studies (Chien et al. 2022) and EEG-based motor imagery research (Cai and Zeng 2024). However, most existing MAE-based methods learn EEG representations by masking the temporal information of the raw EEG signals, while disregarding the spectral and spatial domains of the EEG signal.

Methods

In order to take advantage of the abundant information in the EEG signal across the spatial, spectral, and temporal domains, the original EEG signals are first transformed into 3D representations. SET-pMAE consists of an independent spatial-temporal branch and spatial-spectral branch that share the same Transformer structure; the inputs of the two branches are the 3D spatial-temporal representation and the 3D spatial-spectral representation of the EEG signal, respectively. Self-supervised pre-training is then performed on both branches to extract feature representations of the EEG signals, and finally the downstream task of emotion classification is conducted by fusing the features extracted by the two branches. Figure 1 shows the overall structure of the proposed SET-pMAE.

Fig. 1 The pipeline of the proposed SET-pMAE method. The top section shows the pre-train phase, beginning with the transformation of input data and proceeding to masked reconstruction pre-training. The bottom section illustrates the fine-tune phase, where the weights obtained from pre-training are utilized to separately initialize the temporal-spatial and spectral-spatial encoders, and the features ZT and ZF are subsequently fused for classification

EEG 3D Representation

In order to preserve the spatial information of the brain regions corresponding to the electrode channels, all the channels are projected into a 2D matrix. The transformation of the raw signals can be seen in Fig. 2.

Fig. 2 Flowchart of EEG signals converted into a 2D matrix. This operation projects all EEG channels into a 2D matrix to preserve positional information

To acquire the 3D representation of the spatial-temporal features of the EEG signals, the baseline signal is first subtracted from the original signal to eliminate the baseline, and the original EEG signals $S_n = \{s_n^1, \ldots, s_n^c, \ldots, s_n^C\} \in \mathbb{R}^C$ of the $n$-th time frame are then converted into a 2D matrix $m_n \in \mathbb{R}^{H \times W}$, where $c$ indexes the channel, $C$ is the number of channels, $n$ denotes the time frame, and $H = W = 9$. The spatial-temporal features of each sample can therefore be expressed as $X^T = \{m_1, \ldots, m_N\} \in \mathbb{R}^{N \times H \times W}$. In this paper, 1 s of EEG signal is taken as a sample, so $N$ equals the sampling rate of the signal.
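As a concrete illustration of this transformation, the following sketch builds the spatial-temporal representation for one 1-second sample. The `CHANNEL_POS` mapping and the function name are hypothetical placeholders, since the exact electrode-to-grid coordinates of Fig. 2 are not listed in the text.

```python
import numpy as np

# Hypothetical channel-to-grid mapping: each entry maps a channel index to its
# (row, col) position in the 9x9 matrix according to the electrode layout.
# The coordinates actually used in the paper (Fig. 2) may differ.
CHANNEL_POS = {0: (0, 3), 1: (1, 3), 2: (2, 2)}  # ... one entry per EEG channel

def to_spatial_temporal(trial, baseline, fs=128):
    """Convert one 1-second segment (C, fs) into X^T of shape (fs, 9, 9).

    trial    : raw EEG of shape (C, fs) for one second
    baseline : per-channel baseline mean of shape (C,), subtracted first
    """
    signal = trial - baseline[:, None]          # baseline removal
    x_t = np.zeros((fs, 9, 9), dtype=np.float32)
    for ch, (r, c) in CHANNEL_POS.items():      # scatter channels into the 2D grid
        x_t[:, r, c] = signal[ch]               # one 9x9 frame per sampling point
    return x_t
```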

In order to obtain the 3D spatial-spectral representation of the EEG signals, the baseline-eliminated EEG signals are divided into multiple frequency bands and features are extracted from each band. Previous studies have shown that entropy features play a significant role in EEG analysis (Özçelik and Altan 2023). For each sample, the theta (4-8 Hz), alpha (8-14 Hz), beta (14-31 Hz), and gamma (31-45 Hz) frequency bands of each EEG electrode channel are filtered out. Both DE and PSD features have proven to be highly effective in EEG-based emotion recognition (Duan et al. 2013; Zheng and Lu 2015). Therefore, we extract the DE and PSD features of each of the four frequency bands from every electrode lead, as illustrated in Fig. 3.

Fig. 3 The DE and PSD feature extraction process of EEG signals. DE and PSD features of four frequency bands are extracted from each channel within a 1-second window for each EEG signal segment

Differential entropy extends the concept of Shannon entropy to measure the complexity of continuous random variables. It is defined as in Eq. (1), where $x$ is a random variable and $f(x)$ is its probability density function. If the random variable follows a Gaussian distribution $N(\mu, \sigma^2)$, the differential entropy can be simplified as in Eq. (2), where $\mu$ and $\sigma$ denote the mean and standard deviation of the signal $x$, respectively.

$$DE = -\int_{-\infty}^{\infty} f(x)\log f(x)\,dx \qquad (1)$$
$$DE = -\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}} \log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right) dx = \frac{1}{2}\log\!\left(2\pi e\sigma^2\right) \qquad (2)$$

PSD is defined as in Eq. (3), where $x$ denotes the random signal, i.e., the EEG signal within a time step, and $E[\cdot]$ denotes expectation.

$$PSD = E\!\left[x^2\right] \qquad (3)$$

Then, analogously to the spatial-temporal representation, the extracted DE and PSD features are converted into 2D matrix representations and concatenated along the frequency-band dimension to obtain the spatial-spectral representation $X^F = \{b_1, \ldots, b_{H \times W}\} \in \mathbb{R}^{H \times W \times 2B}$ for each sample, where $B$ denotes the number of frequency bands. Finally, each sample is normalized.
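The following sketch illustrates how the per-band DE and PSD features could be computed and assembled into $X^F$. It assumes a fourth-order Butterworth band-pass filter (the filter type is not specified in the text) and reuses the hypothetical `CHANNEL_POS` mapping from the previous sketch.

```python
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"theta": (4, 8), "alpha": (8, 14), "beta": (14, 31), "gamma": (31, 45)}

def band_de_psd(x, fs=128):
    """DE and PSD of a 1-second, baseline-removed channel signal, per band."""
    feats = []
    for lo, hi in BANDS.values():
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        xb = filtfilt(b, a, x)
        de = 0.5 * np.log(2 * np.pi * np.e * np.var(xb))    # Eq. (2)
        psd = np.mean(xb ** 2)                               # Eq. (3)
        feats.append((de, psd))
    return feats

def to_spatial_spectral(signal):
    """Build X^F of shape (9, 9, 2B) from a baseline-removed segment (C, fs)."""
    x_f = np.zeros((9, 9, 2 * len(BANDS)), dtype=np.float32)
    for ch, (r, c) in CHANNEL_POS.items():
        for b, (de, psd) in enumerate(band_de_psd(signal[ch])):
            x_f[r, c, b] = de                   # DE features in the first B planes
            x_f[r, c, len(BANDS) + b] = psd     # PSD features in the last B planes
    return (x_f - x_f.mean()) / (x_f.std() + 1e-8)           # per-sample normalization
```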

Model architecture

The proposed network framework for self-supervised emotion recognition comprises two network branches: the spatial-temporal branch and the spatial-spectral branch. Initially, a generic feature representation of the EEG signal is obtained through self-supervised pre-training. This representation is then applied to the downstream task of emotion recognition. The structure of the network will be described in detail in the following section.

Spatial-temporal branch

The spatial-temporal branch enables the network to extract temporal context information effectively by masking out a specific percentage of time-frame data and then restoring the masked information through the decoder. For the spatial-temporal 3D representation $X^T$ of the EEG signals, the visible spatial-temporal representation $X^T_{visible} \in \mathbb{R}^{\rho N \times H \times W}$ is obtained after randomly masking out time frames according to the masking ratio $\rho$. The encoder of the spatial-temporal branch maps the time frames $m_n$ of $X^T_{visible}$ to the latent representation $Z^T \in \mathbb{R}^{\rho N \times d}$, and the decoder reconstructs the latent representation into the original EEG signal $\hat{X}^T \in \mathbb{R}^{N \times H \times W}$, where $d$ is the embedding dimension. The input of the encoder is $X^T_{visible}$; to reduce the amount of computation, an asymmetric design is adopted, with a lightweight decoder that decodes the latent representation and reconstructs the original spatial-temporal 3D representation at the time-frame level. Each component is described in detail below.
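A minimal sketch of the random time-frame masking is given below, following the MAE-style shuffle-and-keep strategy. Here `mask_ratio` denotes the fraction of frames that is hidden, and the function name and return convention are illustrative assumptions.

```python
import torch

def random_mask(tokens, mask_ratio):
    """Randomly keep a (1 - mask_ratio) fraction of time-frame tokens.

    tokens : (batch, N, d) embedded time frames
    Returns the visible tokens plus the indices needed to restore temporal order.
    """
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=tokens.device)       # one random score per frame
    ids_shuffle = torch.argsort(noise, dim=1)             # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)       # inverse permutation
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_keep, ids_restore
```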

Input embedding

In order to process the EEG signals with the Transformer, the 2D matrices of the EEG spatial-temporal 3D representation are first encoded into $d$-dimensional feature vectors representing the embedding of each time frame. A fully connected (FC) layer performs this encoding, so that for the input $X^T_{visible}$ the tokens $t = \{t_1, t_2, \ldots, t_{\rho N}\} \in \mathbb{R}^{\rho N \times d}$ of the visible time frames are obtained.

Given that self-attention cannot capture the positional information of its inputs, we employ a fixed sinusoidal position embedding of dimension $d$ to encode position. The sinusoidal position embedding $P_i$ of the $i$-th token $t_i$ is formulated as:

$$P_{i,2l} = \sin\!\left(\frac{i}{10000^{2l/d}}\right) \qquad (4)$$
$$P_{i,2l+1} = \cos\!\left(\frac{i}{10000^{2l/d}}\right) \qquad (5)$$

where $d$ is the position embedding dimension and $l \in [0, d/2)$ is the $l$-th dimension. Before being fed into the encoder, the sinusoidal positional embedding is added to the corresponding unmasked time-frame tokens to obtain the visible time-frame tokens $\bar{t} = \{t_1 + P_1, t_2 + P_2, \ldots, t_{\rho N} + P_{\rho N}\}$ with positional information, where the positional index of each token corresponds to its original time frame.
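The fixed sinusoidal embedding of Eqs. (4)-(5) can be generated as in the following sketch; the function name is illustrative.

```python
import torch

def sinusoidal_position_embedding(n_positions, d):
    """Fixed sinusoidal embeddings P of shape (n_positions, d), Eqs. (4)-(5)."""
    position = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)  # i
    l = torch.arange(0, d, 2, dtype=torch.float32)                          # 2l
    div = torch.pow(10000.0, l / d)                                         # 10000^(2l/d)
    pe = torch.zeros(n_positions, d)
    pe[:, 0::2] = torch.sin(position / div)
    pe[:, 1::2] = torch.cos(position / div)
    return pe

# Visible tokens receive the embedding of their original (unmasked) position
# before entering the encoder, e.g.: visible = visible + pe[ids_keep]
```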

Temporal Encoder & Decoder

The Vision Transformer (ViT) architecture is used as the spatial-temporal encoder; it is very effective at capturing long-range contextual dependencies through self-attention, as shown in Fig. 4. The input of the encoder is the visible time-frame tokens $\bar{t}$. The encoder stacks multiple Transformer blocks, each containing Multi-head Self-Attention (MHSA), a Feed-Forward Network (FFN), and Layer Normalization (LN) modules. In MHSA, the input sequence is mapped into three sequences, $Q \in \mathbb{R}^{n \times d_q}$, $K \in \mathbb{R}^{n \times d_k}$, and $V \in \mathbb{R}^{n \times d_v}$, by three different linear transformations, where $d_q$, $d_k$, and $d_v$ are the dimensions of the $Q$, $K$, and $V$ vectors, respectively, and $n$ is the number of time-frame tokens. Each of the $H$ heads is then obtained by linearly projecting $Q$, $K$, and $V$ to dimension $d_h$ and applying self-attention:

$$\text{head}_i = \text{Attention}\!\left(QW_i^Q,\; KW_i^K,\; VW_i^V\right) \qquad (6)$$

where $W_i^Q \in \mathbb{R}^{d_q \times d_h}$, $W_i^K \in \mathbb{R}^{d_k \times d_h}$, and $W_i^V \in \mathbb{R}^{d_v \times d_h}$ are the parameter matrices of the $i$-th head, and Attention denotes the self-attention mechanism, calculated as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (7)$$

The output of the MHSA is then obtained by concatenating the outputs of the individual heads $\text{head}_i$ and applying a fully connected layer:

$$\text{MHSA}(\bar{t}) = \text{FC}\!\left(\text{head}_1 \,\|\, \text{head}_2 \,\|\, \cdots \,\|\, \text{head}_H\right) \qquad (8)$$

Subsequently, $F$ is obtained after the residual connection and LN operations:

$$F = \text{MHSA}\!\left(\text{LN}(\bar{t})\right) + \bar{t} \qquad (9)$$

Finally, $F$ is fed into the FFN layer, which consists of two linear mapping layers; after the residual connection and LN operations, the output of the block is:

$$Y = \text{FFN}\!\left(\text{LN}(F)\right) + F \qquad (10)$$

The output $Y$ is fed into the next block, and the encoder finally yields the latent representation $Z^T = \{z_1^T, z_2^T, \ldots, z_{\rho N}^T\}$.
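A compact PyTorch sketch of one such encoder block is shown below. It mirrors Eqs. (6)-(10) with a pre-LN layout; the GELU activation and the FFN expansion ratio are assumptions not stated in the text.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block: pre-LN multi-head self-attention + FFN, Eqs. (6)-(10)."""

    def __init__(self, d=256, n_heads=4, ffn_ratio=4, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.mhsa = nn.MultiheadAttention(d, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(                 # two linear mappings, Eq. (10)
            nn.Linear(d, ffn_ratio * d),
            nn.GELU(),
            nn.Linear(ffn_ratio * d, d),
        )

    def forward(self, x):
        h = self.ln1(x)
        f = x + self.mhsa(h, h, h, need_weights=False)[0]  # F = MHSA(LN(x)) + x, Eq. (9)
        return f + self.ffn(self.ln2(f))                   # Y = FFN(LN(F)) + F, Eq. (10)
```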

Fig. 4 The structure of the spatial-temporal branch masked autoencoder. The encoder is composed of random masking, linear projection, and n Transformer block layers, where n is a hyperparameter

The decoder contains Transformer blocks with the same structure as the encoder. The input to the decoder is $Z^T$. To fill the masked parts of the input embeddings, $N \times (1 - \rho)$ learnable, randomly initialized mask tokens are utilized. These tokens are combined with the $Z^T$ from the encoder according to their temporal order. Full sinusoidal positional embeddings are then added to all tokens to restore the positional information of the missing parts. The final layer of the decoder is a linear projection layer, which reconstructs the input by predicting all values at the time-frame level.
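The sketch below illustrates this decoding step, reusing the `TransformerBlock` and `sinusoidal_position_embedding` sketches above. The class name, default sizes, and the index-gathering used to restore temporal order are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalDecoder(nn.Module):
    """Lightweight decoder: insert mask tokens, add full positions, reconstruct frames."""

    def __init__(self, d_enc=256, d_dec=128, depth=2, n_heads=2, n_frames=128, frame_dim=81):
        super().__init__()
        self.proj = nn.Linear(d_enc, d_dec)                       # encoder -> decoder width
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_dec))  # learnable mask token
        self.pos = nn.Parameter(sinusoidal_position_embedding(n_frames, d_dec),
                                requires_grad=False)              # full positional embedding
        self.blocks = nn.ModuleList(TransformerBlock(d_dec, n_heads) for _ in range(depth))
        self.head = nn.Linear(d_dec, frame_dim)                   # predict the 9x9 frame

    def forward(self, z_visible, ids_restore):
        z = self.proj(z_visible)                                  # (B, n_keep, d_dec)
        b, n_keep, d = z.shape
        n_masked = ids_restore.shape[1] - n_keep
        masks = self.mask_token.expand(b, n_masked, -1)
        z_full = torch.cat([z, masks], dim=1)                     # append mask tokens
        z_full = torch.gather(                                    # restore temporal order
            z_full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, d))
        z_full = z_full + self.pos.unsqueeze(0)                   # add positions to all tokens
        for blk in self.blocks:
            z_full = blk(z_full)
        return self.head(z_full)                                  # (B, N, H*W) reconstruction
```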

Spatial-spectral branch

The spatial-spectral branch is mainly used to obtain the spectral-spatial correlation information of EEG signals. Unlike the spatial-temporal branch, it focuses on the spectral and spatial characteristics of the EEG. Since the DE and PSD features of the EEG signal lack temporal characteristics, transforming them into 3D representations allows us to explore the spatial structure of the EEG spectrum. For the spatial-spectral 3D representation $X^F$ of the EEG signal, we randomly mask out spatial positions using the same masking ratio $\rho$ as in the spatial-temporal branch to obtain the visible spatial-spectral representation $X^F_{visible} \in \mathbb{R}^{\rho H \times W \times 2B}$. The spatial-spectral branch encoder maps the DE and PSD features at each spatial location of $X^F_{visible}$ to the latent representation $Z^F \in \mathbb{R}^{\rho H \times W \times d}$, where $d$ is the embedding dimension, and the decoder reconstructs $Z^F$ into the spatial-spectral representation of the EEG, $\hat{X}^F \in \mathbb{R}^{H \times W \times 2B}$.

Input embedding

The $H \times W$ spatial positions of the EEG spatial-spectral representation are encoded as $d$-dimensional feature vectors, representing the embeddings of each spatial position. As in the spatial-temporal branch, an FC layer performs this encoding, so that for the visible input $X^F_{visible}$ the tokens $f = \{f_1, f_2, \ldots, f_{\rho H \times W}\} \in \mathbb{R}^{\rho H \times W \times d}$ of the unmasked spatial positions are obtained. Before being fed into the encoder, the sinusoidal positional embeddings of the unmasked positions are added to the corresponding spectral position tokens, according to their original positions. The final visible spectral position tokens $\bar{f} = \{f_1 + P_1, f_2 + P_2, \ldots, f_{H \times W} + P_{H \times W}\}$ are then generated.

Spectral Encoder & Decoder

The architectures of the spectral encoder and decoder are the same as those of the spatial-temporal branch, as shown in Fig. 5; the input of the encoder is $X^F_{visible}$ and the output is the latent representation $Z^F$. The input of the decoder is $Z^F$. To fill in the masked portions of the input embeddings, $(1 - \rho) \times H \times W$ learnable, randomly initialized mask tokens are used. These tokens are integrated with $Z^F$ based on their original positions. Subsequently, sinusoidal positional embeddings are added to all tokens to restore the positional information of the missing parts. The final layer of the decoder is a linear projection layer, which predicts the DE and PSD feature values at the spectral-spatial level to complete the input reconstruction task.

Fig. 5 The structure of the spatial-spectral branch masked autoencoder. The encoder is composed of random masking, linear projection, and n Transformer block layers, where n is a hyperparameter

Self-supervised pre-training

In order to comprehensively learn temporal context information and spectral-spatial correlation information, a portion of the time-frame embeddings and of the spectral-spatial embeddings are randomly masked, and the masked information is then reconstructed by the respective decoders. The mean squared error between the predicted and true values is used as the reconstruction loss during pre-training:

$$L_t = \frac{1}{N_t}\sum_{j=1}^{N_t}\left\|\hat{X}_j^{T} - X_j^{T}\right\|^2 \qquad (11)$$
$$L_f = \frac{1}{N_f}\sum_{j=1}^{N_f}\left\|\hat{X}_j^{F} - X_j^{F}\right\|^2 \qquad (12)$$

where $N_t$ is the number of masked time frames, $N_f$ is the number of masked spatial locations, $\hat{X}_j^T$ is the reconstructed spatial-temporal representation, $X_j^T$ is the original spatial-temporal input, $\hat{X}_j^F$ is the reconstructed spatial-spectral representation, and $X_j^F$ is the original spatial-spectral input. The total pre-training loss is:

$$Loss = L_t + L_f \qquad (13)$$
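A minimal sketch of this combined objective is given below, assuming boolean masks that mark the hidden time frames and spatial locations; the function name and argument shapes are illustrative.

```python
import torch.nn.functional as F

def pretrain_loss(x_t_hat, x_t, x_f_hat, x_f, masked_t, masked_f):
    """Total pre-training loss Lt + Lf (Eqs. 11-13), computed on masked positions only.

    masked_t / masked_f are boolean masks marking the time frames / spatial
    locations that were hidden from the respective encoders.
    """
    l_t = F.mse_loss(x_t_hat[masked_t], x_t[masked_t])   # spatial-temporal branch, Eq. (11)
    l_f = F.mse_loss(x_f_hat[masked_f], x_f[masked_f])   # spatial-spectral branch, Eq. (12)
    return l_t + l_f                                     # Eq. (13)
```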

Fine-tuning & classifier

After pre-training is completed, we fine-tune the pre-trained model to perform binary emotion prediction. This step is supervised with labeled data. The weights obtained from pre-training are used to initialize the spatial-temporal and spatial-spectral encoders, respectively. Notably, the encoders omit the masking operation in the fine-tuning phase. The outputs of the two branches are then fused through a fusion layer. In the classification stage, a linear classifier $F$ is applied directly on top of the fusion layer, and the inputs of the model are the spatial-temporal and spatial-spectral representations of the entire EEG signal. We then fine-tune all the parameters of the model, including the pre-trained weights. The final emotion classification result is $Y_{classification} = F(z_n^T, z_n^F)$, where $Y_{classification}$ denotes the final classification and $z_n^T$, $z_n^F$ are the outputs of the spatial-temporal encoder and the spatial-spectral encoder, respectively.
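A sketch of the fine-tuning head is given below. Concatenation as the fusion operation and mean pooling over tokens are assumptions, while the [256, 512] hidden sizes and dropout of 0.7 follow the fine-tuning settings reported later in the experiments.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Fine-tuning head: fuse the two encoders' outputs and classify (a sketch)."""

    def __init__(self, encoder_t, encoder_f, d=256, n_classes=2):
        super().__init__()
        self.encoder_t = encoder_t        # pre-trained spatial-temporal encoder (no masking)
        self.encoder_f = encoder_f        # pre-trained spatial-spectral encoder (no masking)
        self.classifier = nn.Sequential(  # hidden layers [256, 512] with BN, ReLU, dropout 0.7
            nn.Linear(2 * d, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.7),
            nn.Linear(256, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.7),
            nn.Linear(512, n_classes),
        )

    def forward(self, x_t, x_f):
        z_t = self.encoder_t(x_t).mean(dim=1)   # pool over tokens (assumed)
        z_f = self.encoder_f(x_f).mean(dim=1)
        fused = torch.cat([z_t, z_f], dim=-1)   # fusion layer: concatenation (assumed)
        return self.classifier(fused)
```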

In this task, cross entropy is used as the loss function, defined as follows:

$$L = -\frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{M} y_{nm}\log \hat{y}_{nm} \qquad (14)$$

where $N$ denotes the batch size and $M$ denotes the number of categories; $y_{nm}$ and $\hat{y}_{nm}$ are the true label and predicted probability of the corresponding category, respectively. Finally, we obtain the classification probability for each category, and the category with the highest probability is taken as the final prediction.

Experiment and results

Datasets

The DEAP dataset (Koelstra et al. 2011) is a classical EEG emotion dataset and a large open-source dataset containing multiple physiological signals for emotion assessment. It captures 32-channel EEG signals and 8 peripheral physiological signals from 32 participants while they watched forty 60-second music videos as emotion-evoking stimuli. In this paper, only the 32-channel EEG signals are used. The subjects (50% male, 50% female) were between 19 and 37 years old, and the recorded EEG signals include a 3-second baseline. All signals are downsampled to 128 Hz. After viewing each video, the subjects rated their valence, arousal, liking, and dominance on a scale from 1 to 9.

The DREAMER (Katsigiannis and Ramzan 2017) dataset comprises EEG signals from 14 electrodes from 23 subjects (14 males and 9 females). The subjects were asked to watch 18 film clips to induce emotion. The duration of each film clip ranged from 65 to 393 s, and the recorded EEG signals also included a baseline signal lasting 4 s. All EEG signals were recorded at a sampling rate of 128 Hz and filtered using a band-pass Hamming sinusoidal linear phase FIR filter to eliminate artifacts. Following the viewing of each video, participants rated their valence, arousal, liking, and dominance on a scale from 1 to 5.

In the experiments, emotional labels are divided into two states using thresholds of 5 and 3 for the DEAP (low: < 5, high: ≥ 5) and DREAMER (low: < 3, high: ≥ 3) datasets, respectively. Note that the stimulus-related signals are computed by first subtracting the baseline data from the trial signals, after which the DE and PSD features are extracted.
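A trivial sketch of this label binarization (function name illustrative):

```python
import numpy as np

def binarize_labels(ratings, threshold):
    """Map continuous ratings to binary classes: low (< threshold) vs. high (>= threshold).

    DEAP uses threshold 5 (ratings 1-9); DREAMER uses threshold 3 (ratings 1-5).
    """
    return (np.asarray(ratings) >= threshold).astype(np.int64)
```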

Model training

For both the DEAP dataset and the DREAMER dataset, the EEG signals for each subject were segmented using a 1-second sliding window to avoid interactions between different periods. After segmentation, there were 2400 samples per subject in DEAP and 3728 samples per subject in DREAMER (both excluding the baseline signal). Each sample is converted to spatial-temporal representation, and then the DE and PSD features of each sample are extracted and converted to spatial-spectral representation. To evaluate the model, five-fold cross-validation was conducted for each subject, using average accuracy and standard deviation as the final metrics. The model was implemented with PyTorch 1.12.1, and experiments were performed on an Nvidia RTX 3090 24GB GPU.

Pre-training. For the DEAP dataset, we use the EEG signals of all 32 subjects for pre-training, giving a total of 32 × 2400 pre-training samples. For the DREAMER dataset, we likewise use the EEG signals of all 23 subjects, giving a total of 23 × 3728 pre-training samples. The hyperparameters of the ViT encoder and decoder are as follows: encoder embedding dimension d = 256, encoder depth 4, and 4 attention heads; decoder embedding dimension d = 128, decoder depth 2, and 2 attention heads. In the experiments on the DEAP dataset, we found that the best fine-tuning performance for the downstream valence/arousal classification task is obtained with a mask ratio of 0.4/0.25, while in the DREAMER experiments the best mask ratio is 0.45; details can be found in Sect. 4.5.2. The AdamW optimizer and the loss function Loss are used to update the model parameters, with a learning rate of 0.001, the ReduceLROnPlateau learning-rate decay method, a pre-training batch size of 256, and 300 training epochs.
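The optimization setup described above could be wired up roughly as follows; `model` and `pretrain_loader` are assumed placeholders, and the forward pass is assumed to return the combined loss of Eq. (13).

```python
import torch

# A minimal sketch of the pre-training optimization setup, assuming `model` is the
# parallel masked autoencoder and `pretrain_loader` yields (X^T, X^F) batches of size 256.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

for epoch in range(300):
    epoch_loss = 0.0
    for x_t, x_f in pretrain_loader:
        loss = model(x_t, x_f)              # assumed to return Lt + Lf, Eq. (13)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss / len(pretrain_loader))   # plateau-based LR decay
```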

Fine-tuning. We use the above datasets to fine-tune our models (encoders and classifiers) so that they can perform downstream emotion recognition tasks, and we evaluate our approach with 5-fold cross-validation. When fine-tuning, we initialize the spatial-temporal encoder and the spatial-spectral encoder with the pre-trained parameters and add a linear classifier after the fusion layer; the classifier has hidden layers of size [256, 512], each followed by batch normalization and ReLU, and an output layer that projects the output to the categories. For each fold, the model is fine-tuned for 30 epochs using the AdamW optimizer with lr = 0.0005 and weight_decay = 0.0005, with the same learning-rate decay strategy as in pre-training; to avoid overfitting, we use a dropout of 0.7 in the linear classifier.

Results and comparison with baseline models

To validate the effectiveness of our proposed method, we compared the performance of SET-pMAE with various baseline methods, including SVM (Cortes and Vapnik 1995), GRU (Chung et al. 2014), LSTM (Hochreiter and Schmidhuber 1997), EEGNet (Lawhern et al. 2018), and DGCNN (Song et al. 2018). We used the average accuracy and standard deviation from five-fold cross-validation as our metrics and performed a detailed statistical analysis using a paired t-test.

Comparison of the DEAP dataset

Table 1 presents the detailed average accuracy and standard deviation for the valence and arousal classification tasks on the DEAP dataset. The results demonstrate that our proposed method achieved superior performance in both the valence and arousal dimensions, with average accuracies of 97.79% and 97.90%, respectively. Compared to SVM, GRU, LSTM, EEGNet, and DGCNN, SET-pMAE improved the average valence accuracy by 9.37%, 17.02%, 16.35%, 15.48%, and 5.79%, respectively. For the arousal dimension, the improvements were 8.47%, 16.23%, 15.40%, 12.90%, and 5.79%. Statistical analysis showed that all p-values were < 0.001, indicating that the results are highly statistically significant. Figure 6 shows the binary emotion recognition results of each subject in the valence and arousal dimensions; the average accuracies of the 32 subjects are 97.79% and 97.90%, respectively, showing that SET-pMAE performs excellently on the DEAP dataset. Except for subjects 17 and 26, whose arousal accuracies are 94.87% and 94.67%, respectively, all subjects achieve an accuracy of more than 95% in both dimensions. This can be attributed to the proposed self-supervised tasks, which significantly enhance the model's generalization ability, resulting in excellent and balanced performance. Most subjects can be classified accurately in both dimensions, demonstrating the robustness of the method.

Table 1.

Comparison of average accuracy (%) for valence and arousal tasks on the DEAP dataset with other baseline models, along with statistical analysis results. The significance level is set at α = 0.05

| Model | Valence: Avg accuracy ± std (%) | Valence: p value | Valence: 95% CI (lower, upper) | Arousal: Avg accuracy ± std (%) | Arousal: p value | Arousal: 95% CI (lower, upper) |
|---|---|---|---|---|---|---|
| SVM | 88.42 ± 5.27 | < 0.001 | (0.076, 0.111) | 89.43 ± 5.48 | < 0.001 | (0.066, 0.104) |
| GRU | 80.77 ± 4.53 | < 0.001 | (0.156, 0.184) | 81.67 ± 5.20 | < 0.001 | (0.146, 0.179) |
| LSTM | 81.44 ± 4.50 | < 0.001 | (0.150, 0.177) | 82.50 ± 5.16 | < 0.001 | (0.137, 0.171) |
| EEGNet | 82.31 ± 6.36 | < 0.001 | (0.111, 0.147) | 85.00 ± 5.82 | < 0.001 | (0.111, 0.147) |
| DGCNN | 91.82 ± 4.40 | < 0.001 | (0.045, 0.074) | 91.93 ± 4.30 | < 0.001 | (0.046, 0.074) |
| SET-pMAE | 97.79 ± 1.01 | – | – | 97.90 ± 1.16 | – | – |

The bold fonts indicate best results

Fig. 6 The performance of SET-pMAE on DEAP dataset

Comparison of the DREAMER dataset

On the DREAMER dataset, Table 2 presents the detailed average accuracy and standard deviation for the valence, arousal, and dominance classification tasks. Our proposed method achieves average accuracies of 95.08%, 96.44%, and 96.90%, respectively. The results demonstrate that our method outperformed the other five baseline methods in the valence, arousal, and dominance dimensions. Compared to SVM, GRU, LSTM, EEGNet, and DGCNN, SET-pMAE improved the average valence accuracy by 11.96%, 7.28%, 8.82%, 12.60%, and 10.80%, respectively. For arousal, the improvements were 10.14%, 4.54%, 5.26%, 8.14%, and 6.53%; for dominance, they were 9.58%, 4.61%, 4.78%, 7.95%, and 6.41%. Figure 7 shows the binary emotion recognition results for each subject in the valence, arousal, and dominance dimensions. SET-pMAE demonstrates outstanding performance on the DREAMER dataset as well.

Table 2.

Comparison of average accuracy (%) for valence, arousal and dominance tasks on the DREAMER dataset with other baseline models, along with statistical analysis results. The significance level is set at α = 0.05

| Model | Valence: Avg accuracy ± std (%) | Valence: p value | Valence: 95% CI (lower, upper) | Arousal: Avg accuracy ± std (%) | Arousal: p value | Arousal: 95% CI (lower, upper) | Dominance: Avg accuracy ± std (%) | Dominance: p value | Dominance: 95% CI (lower, upper) |
|---|---|---|---|---|---|---|---|---|---|
| SVM | 83.12 ± 5.16 | < 0.001 | (0.103, 0.136) | 86.30 ± 8.40 | < 0.001 | (0.075, 0.128) | 87.32 ± 8.03 | < 0.001 | (0.068, 0.122) |
| GRU | 87.80 ± 5.14 | < 0.001 | (0.059, 0.086) | 91.90 ± 6.24 | < 0.001 | (0.031, 0.060) | 92.29 ± 5.78 | < 0.001 | (0.031, 0.061) |
| LSTM | 86.98 ± 5.11 | < 0.001 | (0.066, 0.096) | 91.18 ± 6.64 | < 0.001 | (0.036, 0.070) | 92.12 ± 5.85 | < 0.001 | (0.032, 0.063) |
| EEGNet | 82.48 ± 6.29 | < 0.001 | (0.107, 0.145) | 88.30 ± 8.55 | < 0.001 | (0.055, 0.108) | 88.95 ± 7.61 | < 0.001 | (0.055, 0.103) |
| DGCNN | 84.28 ± 5.24 | < 0.001 | (0.092, 0.124) | 89.91 ± 6.92 | < 0.001 | (0.046, 0.084) | 90.49 ± 6.71 | < 0.001 | (0.044, 0.084) |
| SET-pMAE | 95.08 ± 3.08 | – | – | 96.44 ± 3.28 | – | – | 96.90 ± 2.70 | – | – |

The bold fonts indicate best results

Fig. 7 The performance of SET-pMAE on DREAMER dataset

Comparison with the state-of-the-art models

To further validate the superior performance of our proposed method, we conducted a detailed comparison with other state-of-the-art models in EEG emotion recognition using the DEAP and DREAMER datasets. Tables 3 and 4 provide the comparison results of average accuracy and standard deviation of models on the DEAP and DREAMER datasets, respectively. The comparison with the state-of-the-art models in the tables is sourced from the original literature to ensure the reliability of the results.

Table 3.

Comparison of mean binary classification accuracy (%) between DEAP valence and arousal (mean/standard deviation)

| Model | Cross-validation | Average accuracy (valence) | Average accuracy (arousal) |
|---|---|---|---|
| 4D-CRNN (Shen et al. 2020) | 5-fold cross-validation | 94.22 ± 2.61 | 94.58 ± 3.69 |
| ACRNN (Tao et al. 2020) | 10-fold cross-validation | 93.72 ± 3.21 | 92.38 ± 3.73 |
| ECLGCNN (Yin et al. 2021) | 5-fold cross-validation | 90.45 ± / | 90.60 ± / |
| MTCA-CapsNet (Li et al. 2022) | 10-fold cross-validation | 97.24 ± 1.58 | 97.41 ± 1.47 |
| 4D-aNN (Xiao et al. 2022) | 5-fold cross-validation | 96.90 ± 1.65 | 97.39 ± 1.75 |
| GLFANet (Liu et al. 2023a) | 10-fold cross-validation | 94.53 ± / | 94.91 ± / |
| TR&CA (Peng et al. 2023) | 10-fold cross-validation | 95.18 ± 2.46 | 95.58 ± 2.28 |
| AMDET (Xu et al. 2023b) | 5-fold cross-validation | 96.85 ± 1.66 | 97.48 ± 0.99 |
| Caps-EEGNet (Chen et al. 2023) | 10-fold cross-validation | 96.67 ± 2.01 | 96.75 ± 1.90 |
| LResCapsule (Fan et al. 2024) | 10-fold cross-validation | 97.45 ± 1.49 | 97.58 ± 1.31 |
| SET-pMAE (Ours) | 5-fold cross-validation | 97.79 ± 1.01 | 97.90 ± 1.16 |

The bold fonts indicate best results

Table 4.

Comparison of mean binary classification accuracy (%) for DREAMER valence, arousal and dominance (mean/standard deviation)

| Model | Cross-validation | Average accuracy (valence) | Average accuracy (arousal) | Average accuracy (dominance) |
|---|---|---|---|---|
| THR (Topic and Russo 2021) | 10-fold cross-validation | 88.20 ± 2.99 | 90.43 ± 4.00 | 89.92 ± 3.52 |
| gcForest (Cheng et al. 2020) | 10-fold cross-validation | 89.03 ± 5.56 | 90.41 ± 5.33 | 89.89 ± 6.19 |
| MTCA-CapsNet (Li et al. 2022) | 10-fold cross-validation | 94.96 ± 3.60 | 95.54 ± 3.63 | 95.52 ± 3.78 |
| GLFANet (Liu et al. 2023a) | 10-fold cross-validation | 94.57 ± / | 94.82 ± / | 95.51 ± / |
| Caps-EEGNet (Chen et al. 2023) | 10-fold cross-validation | 91.12 ± 3.82 | 92.60 ± 5.10 | 93.74 ± 5.64 |
| ICaps-ResLSTM (Fan et al. 2024) | 10-fold cross-validation | 94.71 ± 3.63 | 94.97 ± 3.71 | / |
| LResCapsule (Fan et al. 2024) | 10-fold cross-validation | 95.15 ± 3.51 | 95.77 ± 3.82 | 95.59 ± 3.82 |
| SET-pMAE (Ours) | 5-fold cross-validation | 95.08 ± 3.08 | 96.44 ± 3.28 | 96.90 ± 2.70 |

The bold fonts indicate best results

For the DEAP dataset, we compared our method with 4D-CRNN (Shen et al. 2020), ACRNN (Tao et al. 2020), ECLGCNN (Yin et al. 2021), MTCA-CapsNet (Li et al. 2022), 4D-aNN (Xiao et al. 2022), GLFANet (Liu et al. 2023a), TR&CA (Peng et al. 2023), AMDET (Xu et al. 2023b), Caps-EEGNet (Chen et al. 2023), and LResCapsule (Fan et al. 2024), all of which perform well on the DEAP dataset. Overall, our method shows a clear advantage in terms of mean accuracy and standard deviation in both the valence and arousal dimensions of the emotion recognition task. Compared to the CNN + RNN model (4D-CRNN), the accuracy for valence and arousal is improved by 3.57% and 3.32%, respectively. Compared to the GCNN + RNN model (ECLGCNN), our model has a significant advantage, with accuracy improvements of 7.34% and 7.30% for valence and arousal, respectively. Compared with the attention-based CNN network (4D-aNN), our method improves accuracy by 0.89% and 0.53% on the binary classification tasks. There is also an improvement over the Transformer-based networks (TR&CA and AMDET), and our method likewise shows superiority over the remaining methods. Furthermore, our proposed method has the smallest standard deviation on valence.

For the DREAMER dataset, we compare our method with seven other state-of-the-art models, namely THR (Topic and Russo 2021), gcForest (Cheng et al. 2020), MTCA-CapsNet (Li et al. 2022), GLFANet (Liu et al. 2023a), Caps-EEGNet (Chen et al. 2023), ICaps-ResLSTM (Fan et al. 2024), and LResCapsule (Fan et al. 2024). Our method shows outstanding superiority in terms of mean accuracy and standard deviation on all three emotion recognition tasks. Compared to the CNN model based on topographic and holographic feature-map representations of the EEG signal (THR), our model improves the average accuracy by 6.88%, 6.01%, and 6.98% in valence, arousal, and dominance, respectively. Compared to the traditional deep forest algorithm (gcForest), our method not only has a far superior average accuracy but also a significantly lower standard deviation. Compared to the latest GCN-based network (GLFANet), the accuracies on the binary classification tasks are improved by 0.51%, 1.62%, and 1.39%, respectively. The average accuracy of our model is also improved compared to capsule-network-based models (MTCA-CapsNet, Caps-EEGNet, ICaps-ResLSTM), and our method has a smaller standard deviation. The experimental results indicate that our model achieves optimal performance.

Thus, benefiting from spatial-spectral-temporal feature fusion and self-supervised pre-training, the findings demonstrate that our model is capable of extracting more discriminative features from the spatial, spectral, and temporal domains of EEG signals, thereby enhancing classification performance. In addition, the smaller standard deviation indicates that the performance of our model is more reliable across subjects.

Discussions

Ablation study

We conduct exhaustive ablation experiments to demonstrate the effectiveness of the proposed SET-pMAE model. To explore the contribution of the spatial-temporal branch and the spatial-spectral branch, together with the pre-training task, to the valence and arousal classification results, we first report the results of the spatial-temporal branch and the spatial-spectral branch (referred to as T-S and S-S) under supervised training and after using our proposed self-supervised pre-training method. We then report the results after fusing the features of the two branches under supervised training (referred to as S-S-T) and after using our self-supervised pre-training method. As shown in Table 5, the models with pre-training clearly outperform the models without pre-training on average, for the single T-S branch, the single S-S branch, and the fusion of the two. This demonstrates that our proposed self-supervised method can mine more discriminative spatial-temporal and spatial-spectral features. Compared with the models that use only spatial-temporal or spatial-spectral features, the fused dual-branch model, with or without self-supervised pre-training, outperforms the single-branch models in terms of average accuracy and standard deviation, which demonstrates the effectiveness of the proposed temporal-spatial-spectral fusion. Figures 8 and 9 show a more detailed per-subject comparison for the ablation experiments.

Table 5.

Ablation study of mean accuracy (%) on DEAP and DREAMER. ‘wo’ means without, and ‘w’ means with

| Ablation models | DEAP: Valence | DEAP: Arousal | DREAMER: Valence | DREAMER: Arousal |
|---|---|---|---|---|
| S-S_wo_pretrain | 87.01 ± 5.13 | 88.08 ± 5.03 | 86.14 ± 5.99 | 90.34 ± 7.07 |
| S-S_w_pretrain | 94.49 ± 3.14 | 95.43 ± 2.84 | 90.45 ± 5.89 | 93.61 ± 5.42 |
| T-S_wo_pretrain | 94.16 ± 3.0 | 94.72 ± 2.84 | 89.37 ± 4.48 | 93.13 ± 5.60 |
| T-S_w_pretrain | 96.85 ± 1.23 | 97.05 ± 1.16 | 94.16 ± 3.70 | 95.69 ± 4.28 |
| S-S-T_wo_pretrain | 95.57 ± 1.79 | 95.98 ± 2.02 | 91.20 ± 4.27 | 93.60 ± 5.31 |
| SET-pMAE | 97.79 ± 1.01 | 97.90 ± 1.16 | 95.08 ± 3.08 | 96.44 ± 3.28 |

The bold fonts indicate best results

Fig. 8 Mean accuracy (%) of classification tasks for each subject performing ablation experiments on the DEAP dataset

Fig. 9 Mean accuracy (%) of classification tasks for each subject performing ablation experiments on the DREAMER dataset

Effects of mask ratio

The mask ratio is a crucial parameter of the MAE model, and its optimal value can differ depending on the task. Figure 10 shows the downstream classification results on the DEAP dataset when the mask ratio ranges from 0.2 to 0.8. For valence, as the mask ratio increases from 0.2 to 0.8, the accuracy first increases and then decreases, reaching a maximum of 97.79% at a mask ratio of 0.40. For arousal, the accuracy reaches a maximum of 97.90% at a mask ratio of 0.25 and then gradually decreases. For the DREAMER dataset, as shown in Fig. 11, valence, arousal, and dominance all reach their highest classification accuracy at a mask ratio of 0.45, followed by a gradual decrease. We found that different mask ratio settings have a significant impact on the quality of the learned representation, and the optimal mask ratio varies between datasets and downstream tasks, which also reflects the significant differences between datasets. We consider that the optimal mask ratio is smaller for the DEAP dataset than for the DREAMER dataset because DEAP has more electrodes, so its EEG signals contain more high-level semantic information than those of DREAMER.

Fig. 10 Fine-tuning performance of models pretrained with different masking rates on the DEAP dataset

Fig. 11 Fine-tuning performance of models pretrained with different masking rates on the DREAMER dataset

Computational complexity

The SET-pMAE model employs the Transformer as its backbone network, and the total computational complexity of the pre-training phase is 0.344G FLOPs on the DREAMER dataset. This is because part of the data is masked during the pre-training phase. For the subsequent fine-tuning of each subject, the computational complexity is 0.498G FLOPs. Benefiting from the Transformer's excellent global context-capturing capability and spatial-spectral-temporal feature fusion, our model effectively achieves a balance between performance and computational complexity.

Limitations and future directions

Despite the excellent classification performance achieved by the proposed method, our current work still has some limitations. First, all available EEG channels from the dataset were incorporated into the model, whereas practical applications may involve fewer or damaged channels. Therefore, in future work, we will consider using dynamic adaptive EEG channel selection to assist self-supervised training so that it can cope with a wide range of situations. Second, our method currently uses only EEG signals, whereas emotion is reflected by a combination of multiple physiological signals, and a single EEG signal may not comprehensively capture all aspects of emotional change. Therefore, a pivotal direction of our future work is to integrate self-supervised emotion recognition methods that incorporate other physiological signals. A multimodal approach can compensate for the limitations of a single EEG signal by leveraging the complementary and integrated nature of multiple signals, thereby enhancing the accuracy and robustness of emotion recognition. Additionally, we plan to apply the model to more emotion recognition datasets and explore avenues to further enhance the overall performance of emotion recognition systems.

Conclusion

In this study, we propose a novel model called the Spatial-spEctral-Temporal based parallel Masked AutoEncoder. By reconstructing the spatial-temporal representation and the spatial-spectral representation of the randomly masked EEG signal, the model can extract complex semantic information from the signal. We also investigate the performance of the model under different mask ratios. Benefiting from the representations learned during pre-training, our encoder shows better generalization and excellent performance. We conduct extensive experiments on the DEAP and DREAMER datasets, and our model achieves superior performance in valence and arousal classification compared to state-of-the-art methods. In particular, the effectiveness of the proposed self-supervised pre-training method and of the fusion of spatial-spectral-temporal information is demonstrated through ablation studies. SET-pMAE employs a self-supervised approach that does not rely on large amounts of manually annotated data, which is particularly crucial for EEG-based emotion recognition, because the acquisition of high-quality emotion labels is both time-consuming and costly. By leveraging vast amounts of unlabeled data, self-supervised methods can reveal latent structures and features within the data, enabling the model to learn robust and more generalizable feature representations. This capability holds significant implications for improving the performance of emotion recognition. With the increasing availability of large-scale unlabeled data, SET-pMAE has the potential to be applied in fields such as psychological health and brain-computer interfaces.

Author contributions

Chenyu Pan: Conceptualization, Methodology, Software, Writing - original draft. Huimin Lu: Project Funding, Supervision, Project Administration, Writing-review & editing. Chenglin Lin: Revision, Writing - Review & Editing. Zeyi Zhong: Data curation. Bing Liu: Project Administration.

Funding Information

This research is supported by the Industrial Technology Research and Development Special Project of Jilin Provincial Development and Reform Commission in 2023 (No. 2023C042-6), the Key Project of Science and Technology Research Plan of Jilin Provincial Department of Education in 2023 (No. JJKH20230763KJ), and the Project of Science and Technology Research Plan of Jilin Provincial Department of Education in 2023 (No. JJKH20230765KJ).

Data availability

The datasets analyzed during the current study are public datasets. It is available at the following URL: DREAMER dataset: https://zenodo.org/record/546113; DEAP dataset: http://www.eecs.qmul.ac.uk/mmv/datasets/deap/.

Declarations

Conflict of interest

The authors declare that they have no conflict of interest. All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors. Informed consent was obtained from all individual participants included in the study.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Aydın S (2020) Deep learning classification of neuro-emotional phase domain complexity levels induced by affective video film clips. IEEE J Biomed Health Inform 24(6):1695–1702. 10.1109/JBHI.2019.2959843
2. Aydın S, Onbaşı L (2024) Graph theoretical brain connectivity measures to investigate neural correlates of music rhythms associated with fear and anger. Cogn Neurodyn 18(1):49–66. 10.1007/s11571-023-09931-5
3. Cai M, Zeng Y (2024) Mae-eeg-transformer: a transformer-based approach combining masked autoencoder and cross-individual data augmentation pre-training for eeg classification. Biomed Signal Process Control 94:106131. 10.1016/j.bspc.2024.106131
4. Can YS, Mahesh B, André E (2023) Approaches, applications, and challenges in physiological emotion recognition-a tutorial overview. Proc IEEE 111(10):1287–1313. 10.1109/JPROC.2023.3286445
5. Chen P, Liu S, Jia J (2021) Jigsaw clustering for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11526–11535
6. Chen K, Jing H, Liu Q et al (2023) A novel caps-eegnet combined with channel selection for eeg-based emotion recognition. Biomed Signal Process Control 86:105312. 10.1016/j.bspc.2023.105312
7. Chen T, Kornblith S, Norouzi M et al (2020) A simple framework for contrastive learning of visual representations. In: International conference on machine learning, PMLR, pp 1597–1607
8. Cheng J, Chen M, Li C et al (2020) Emotion recognition from multi-channel eeg via deep forest. IEEE J Biomed Health Inf 25(2):453–464. 10.1109/JBHI.2020.2995767
9. Chien HYS, Goh H, Sandino CM et al (2022) Maeeg: Masked auto-encoder for eeg representation learning. In: NeurIPS Workshop. https://arxiv.org/abs/2211.02625
10. Chung J, Gulcehre C, Cho K et al (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 Workshop on Deep Learning, December 2014. http://arxiv.org/abs/1412.3555
11. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297. 10.1007/BF00994018
12. Devlin J, Chang MW, Lee K et al (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, pp 4171–4186. 10.18653/v1/N19-1423
13. Du X, Ma C, Zhang G et al (2022) An efficient lstm network for emotion recognition from multichannel eeg signals. IEEE Trans Affect Comput 13(3):1528–1540. 10.1109/TAFFC.2020.3013711
14. Duan RN, Zhu JY, Lu BL (2013) Differential entropy feature for eeg-based emotion classification. In: 2013 6th international IEEE/EMBS conference on neural engineering (NER), IEEE, pp 81–84. 10.1109/NER.2013.6695876
15. Fan T, Qiu S, Wang Z et al (2023) A new deep convolutional neural network incorporating attentional mechanisms for ecg emotion recognition. Comput Biol Med 159:106938. 10.1016/j.compbiomed.2023.106938
16. Fan C, Wang J, Huang W et al (2024) Light-weight residual convolution-based capsule network for eeg emotion recognition. Adv Eng Inf 61:102522. 10.1016/j.aei.2024.102522
17. Fan C, Xie H, Tao J et al (2024) Icaps-reslstm: improved capsule network and residual lstm for eeg emotion recognition. Biomed Signal Process Control 87:105422. 10.1016/j.bspc.2023.105422
18. Gao Z, Wang X, Yang Y et al (2021) A channel-fused dense convolutional network for eeg-based emotion recognition. IEEE Trans Cognit Dev Syst 13(4):945–954. 10.1109/TCDS.2020.2976112
19. Gidaris S, Singh P, Komodakis N (2018) Unsupervised representation learning by predicting image rotations. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings
20. Gong L, Li M, Zhang T et al (2023a) Eeg emotion recognition using attention-based convolutional transformer neural network. Biomed Signal Process Control 84:104835. 10.1016/j.bspc.2023.104835
21. Gong P, Jia Z, Wang P et al (2023b) Astdf-net: attention-based spatial-temporal dual-stream fusion network for eeg-based emotion recognition. In: Proceedings of the 31st ACM International Conference on Multimedia. Association for Computing Machinery, pp 883–892. 10.1145/3581783.3612208
22. Grill JB, Strub F, Altché F et al (2020) Bootstrap your own latent - a new approach to self-supervised learning. Adv Neural Inf Process Syst 33:21271
23. He K, Chen X, Xie S et al (2022) Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 16000–16009
24. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. 10.1162/neco.1997.9.8.1735
25. Jafari M, Shoeibi A, Khodatars M et al (2023) Emotion recognition in eeg signals using deep learning methods: a review. Comput Biol Med 165:107450. 10.1016/j.compbiomed.2023.107450
26. Joshi M, Chen D, Liu Y et al (2020) Spanbert: improving pre-training by representing and predicting spans. Trans Assoc Comput Linguist 8:64–77. 10.1162/tacl_a_00300
27. Katsigiannis S, Ramzan N (2017) Dreamer: a database for emotion recognition through eeg and ecg signals from wireless low-cost off-the-shelf devices. IEEE J Biomed Health Inf 22(1):98–107. 10.1109/JBHI.2017.2688239
28. Kılıç B, Aydın S (2022) Classification of contrasting discrete emotional states indicated by eeg based graph theoretical network measures. Neuroinformatics 20(4):863–877. 10.1007/s12021-022-09579-2
29. Koelstra S, Muhl C, Soleymani M et al (2011) Deap: a database for emotion analysis; using physiological signals. IEEE Trans Affect Comput 3(1):18–31. 10.1109/T-AFFC.2011.15
30. Kostas D, Aroca-Ouellette S, Rudzicz F (2021) Bendr: using transformers and a contrastive self-supervised learning task to learn from massive amounts of eeg data. Front Human Neurosci 15:653659. 10.3389/fnhum.2021.653659
31. Lawhern VJ, Solon AJ, Waytowich NR et al (2018) Eegnet: a compact convolutional neural network for eeg-based brain-computer interfaces. J Neural Eng 15(5):056013. 10.1088/1741-2552/aace8c
32. Li R, Ren C, Ge Y et al (2023a) Mtlfusenet: a novel emotion recognition model based on deep latent feature fusion of eeg signals and multi-task learning. Knowl-Based Syst 276:110756. 10.1016/j.knosys.2023.110756
33. Li Y, Chen J, Li F et al (2023b) Gmss: graph-based multi-task self-supervised learning for eeg emotion recognition. IEEE Trans Affect Comput 14(3):2512–2525. 10.1109/TAFFC.2022.3170428
34. Li C, Wang B, Zhang S et al (2022) Emotion recognition from eeg based on multi-task learning with capsule network and attention mechanism. Comput Biol Med 143:105303. 10.1016/j.compbiomed.2022.105303
35. Liu Y, Sourina O (2013) Real-time fractal-based valence level recognition from eeg. In: Transactions on computational science XVIII: special issue on Cyberworlds, Springer, pp 101–120. 10.1007/978-3-642-38803-3_6
36. Liu D, Dai W, Zhang H et al (2023) Brain-machine coupled learning method for facial emotion recognition. IEEE Trans Pattern Anal Mach Intell 45(9):10703–10717. 10.1109/TPAMI.2023.3257846
37. Liu S, Zhao Y, An Y et al (2023) Glfanet: a global to local feature aggregation network for eeg emotion recognition. Biomed Signal Process Control 85:104799. 10.1016/j.bspc.2023.104799
38. Noroozi M, Favaro P (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In: European conference on computer vision, Springer, pp 69–84. 10.1007/978-3-319-46466-4_5
39. Özçelik YB, Altan A (2023) A comparative analysis of artificial intelligence optimization algorithms for the selection of entropy-based features in the early detection of epileptic seizures. In: 2023 14th International Conference on Electrical and Electronics Engineering (ELECO), IEEE, pp 1–5. 10.1109/ELECO60389.2023.10415957
40. Peng G, Zhao K, Zhang H et al (2023) Temporal relative transformer encoding cooperating with channel attention for eeg emotion analysis. Comput Biol Med 154:106537. 10.1016/j.compbiomed.2023.106537
41. Shen F, Dai G, Lin G et al (2020) Eeg-based emotion recognition using 4d convolutional recurrent neural network. Cognit Neurodyn 14:815–828. 10.1007/s11571-020-09634-1
42. Song T, Zheng W, Song P et al (2018) Eeg emotion recognition using dynamical graph convolutional neural networks. IEEE Trans Affect Comput 11(3):532–541. 10.1109/TAFFC.2018.2817622
43. Sun Y, Wang S, Li Y et al (2019) Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223. 10.48550/arXiv.1904.09223
44. Tao W, Li C, Song R et al (2020) Eeg-based emotion recognition via channel-wise attention and self attention. IEEE Trans Affect Comput 14(1):382–393. 10.1109/TAFFC.2020.3025777
45. Topic A, Russo M (2021) Emotion recognition based on eeg feature maps through deep learning network. Eng Sci Technol Int J 24(6):1442–1454. 10.1016/j.jestch.2021.03.012
46. Wagner J, Triantafyllopoulos A, Wierstorf H et al (2023) Dawn of the transformer era in speech emotion recognition: closing the valence gap. IEEE Trans Pattern Anal Mach Intell 45(9):10745–10759. 10.1109/TPAMI.2023.3263585
47. Wang J, Song Y, Gao Q et al (2023) Functional brain network based multi-domain feature fusion of hearing-impaired eeg emotion identification. Biomed Signal Process Control 85:105013. 10.1016/j.bspc.2023.105013
48. Wang Z, Wang Y, Hu C et al (2022) Transformers for eeg-based emotion recognition: A hierarchical spatial information learning model. IEEE Sens J 22(5):4359–4368. 10.1109/JSEN.2022.3144317
49. Wei Y, Liu Y, Li C et al (2023) Tc-net: a transformer capsule network for eeg-based emotion recognition. Comput Biol Med 152:106463. 10.1016/j.compbiomed.2022.106463
50. Xiao G, Shi M, Ye M et al (2022) 4d attention-based neural network for eeg emotion recognition. Cognit Neurodyn. 10.1007/s11571-021-09751-5
51. Xie Z, Zhou M, Sun H (2021) A novel solution for eeg-based emotion recognition. In: 2021 IEEE 21st International Conference on Communication Technology (ICCT), IEEE, pp 1134–1138. 10.1109/ICCT52962.2021.9657922
52. Xu M, Cheng J, Li C et al (2023a) Spatio-temporal deep forest for emotion recognition based on facial electromyography signals. Comput Biol Med 156:106689. 10.1016/j.compbiomed.2023.106689
53. Xu Y, Du Y, Li L et al (2023b) Amdet: attention based multiple dimensions eeg transformer for emotion recognition. IEEE Trans Affect Comput. 10.1109/TAFFC.2023.3318321
54. Yin Y, Jing L, Huang F et al (2024) Msa-gcn: multiscale adaptive graph convolution network for gait emotion recognition. Pattern Recognit 147:110117. 10.1016/j.patcog.2023.110117
55. Yin Y, Zheng X, Hu B et al (2021) Eeg emotion recognition using fusion model of graph convolutional neural networks and lstm. Appl Soft Comput 100:106954. 10.1016/j.asoc.2020.106954
56. Zhang R, Isola P, Efros AA (2016) Colorful image colorization. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, Springer, pp 649–666. 10.1007/978-3-319-46487-9_40
57. Zhang D, Yao L, Chen K et al (2019) Making sense of spatio-temporal preserving representations for eeg-based human intention recognition. IEEE Trans Cybern 50(7):3033–3044. 10.1109/TCYB.2019.2905157
58. Zhang R, Zhu JY, Isola P et al (2017) Real-time user-guided image colorization with learned deep priors. ACM Trans Graph 36(4):119. 10.1145/3072959.3073703
59. Zheng WL, Lu BL (2015) Investigating critical frequency bands and channels for eeg-based emotion recognition with deep neural networks. IEEE Trans Auton Mental Dev 7(3):162–175. 10.1109/TAMD.2015.2431497
60. Zheng WL, Zhu JY, Lu BL (2017) Identifying stable patterns over time for emotion recognition from eeg. IEEE Trans Affect Comput 10(3):417–429. 10.1109/TAFFC.2017.2712143
61. Zhong P, Wang D, Miao C (2020) Eeg-based emotion recognition using regularized graph neural networks. IEEE Trans Affect Comput 13(3):1290–1301. 10.1109/TAFFC.2020.2994159
62. Zhou Q, Shi C, Du Q et al (2023) A multi-task hybrid emotion recognition network based on eeg signals. Biomed Signal Process Control 86:105136. 10.1016/j.bspc.2023.105136


