Abstract
Electroencephalogram (EEG) emotion recognition plays a vital role in affective computing. A limitation of the EEG emotion recognition task is that features from multiple domains are rarely included in the analysis simultaneously because of the lack of an effective feature organization form. This paper proposes a video-level feature organization method to effectively organize temporal-, frequency-, and spatial-domain features. In addition, a deep neural network, the Channel Attention Convolutional Aggregation Network, is designed to explore deeper emotional information from video-level features. The network uses a channel attention mechanism to adaptively capture critical EEG frequency bands, obtains the frame-level representation of each time point through multi-layer convolution, and finally aggregates the frame-level features through NeXtVLAD to learn time-sequence-related features. The proposed method achieves the best classification performance on the SEED and DEAP datasets. The mean accuracy and standard deviation on the SEED dataset are 95.80% and 2.04%. On the DEAP dataset, the average accuracy with standard deviation for arousal and valence is 98.97% ± 1.13% and 98.98% ± 0.98%, respectively. The experimental results show that our approach based on video-level features is effective for EEG emotion recognition tasks.
Keywords: EEG, Emotion recognition, Video-level features, NeXtVLAD
Introduction
Emotion is a complex psychological and physical state that helps humans establish relationships and adapt to survive; there is no doubt that emotion is vital to us (Aadam et al. 2022). As human-computer interaction and artificial intelligence technology improve by leaps and bounds, a field of computing concerned with emotion, emotion generation, and influencing emotion has emerged: Affective Computing (Picard 2010). Its purpose is to enable machines to automatically perceive, recognize, analyze, and infer people's emotions so that machines can interact more intelligently.
Emotion recognition is a critical link in affective computing. At present, non-physiological signals such as speech (Liu et al. 2018), text (Tahir et al. 2022), and body movements (Ahmed et al. 2020) are widely used in emotion recognition tasks because their data are easier to obtain. However, individuals can subjectively control them: in lie detection, mental illness examination, and other tasks, a person can deceive the machine through disguise, so the credibility of recognition is low.
In contrast, physiological signals such as GSR (Wu et al. 2010), EMG (Mithbavkar and Shah 2021), and EEG (Yin et al. 2021) are not easy to disguise. They objectively reflect an individual's emotional state, so more and more scholars have begun to study emotion recognition based on physiological signals. Compared with other physiological signals, EEG signals originate from the synchronous synaptic activity of many neurons in the cerebral cortex and are highly correlated with emotion, which makes them especially suitable for emotion research (Rahman et al. 2022). However, EEG signals are complex because they contain information from multiple domains. Specifically, EEG signals are time-varying signals that carry temporal information, as well as synchronous neural pulse signals that carry frequency-domain information. Additionally, EEG signals are collected through electrodes placed at different locations on the scalp according to the international 10–20 system, providing spatial-domain information. Taken together, analyzing and processing EEG signals requires considering multiple domains.
In previous studies, to address the need to consider multiple domain features, scholars have employed different feature organization forms and model structures. For example, Song et al. (2020) divide EEG signals into five frequency bands and extract the differential entropy (DE), power spectral density (PSD), differential causality (DCAU), differential asymmetry (DASM), and rational asymmetry (RASM) features from each frequency band and from all frequency bands. They propose a dynamical graph convolutional neural network (DGCNN) to model the multi-channel EEG features. The accuracy of each feature is calculated separately, and the DE features extracted from all frequency bands achieve the highest average accuracy of 90.40% on the SEED dataset. However, this feature organization form does not contain enough temporal information. Similarly, Jia et al. (2020) divide EEG signals into five frequency bands and extract DE features from them. To consider the three domains of EEG signals (time, frequency, and space), they design two kinds of 3D feature organization forms, spatial-spectral and spatial-temporal, and propose a two-stream network (Simonyan and Zisserman 2014) to process them respectively. Finally, the features learned by the two streams are fused and classified, achieving an accuracy of 96.02% on the SEED dataset and 84.92% on the SEED-IV dataset. Although this approach considers all three domains of EEG signals, the data organization and model structure are complex and need further optimization.
In order to solve the above problems, this paper proposes an effective feature organization method in video form. Firstly, the EEG features from different electrodes are organized into a 2D feature map, which can be understood as an input image. Secondly, frequency bands are extracted from the EEG signals, which can be understood as the RGB channels of an image, as shown in the upper half of Fig. 1. Finally, the 2D feature maps are ordered in a temporal sequence, which can be understood as a series of frames in a video, as shown in the lower half of Fig. 1. Furthermore, this paper designs a single-stream deep learning model that includes an efficient channel attention network for the weighted fusion of EEG features in different frequency bands; a convolutional neural network extracts frame features, and NeXtVLAD aggregates the frame features into a video-level feature. The contributions of this paper are summarized below:
This paper converts EEG signals into a feature form similar to video features and designs a Channel Attention Convolutional Aggregation Network (CACAN) based on video-level features, providing new ideas for the research of EEG emotion recognition.
This paper introduces an ECA module with an EEG band-based attention mechanism to adaptively capture the key EEG frequency bands. The study finds that the gamma band is more sensitive and discriminative in emotion recognition tasks, and that ECA-weighted EEG signals generate topographic maps with clearer signals and reduced noise while retaining highly activated brain regions.
This paper applies the NeXtVLAD module to aggregate frame-level features into video-level features and finds that NeXtVLAD not only achieves higher accuracy than recurrent neural network models such as LSTM but also has advantages in handling long time series and capturing richer features.
Fig. 1.
The similarity between EEG and video
The remainder of this paper is organized as follows. Section "Related work" reviews previous work. Section "Method" describes the data preprocessing process and the key components of our approach. Section "Experiments" introduces the datasets, experimental setup, results, and analysis. Section "Conclusion" summarizes the work, and Section "Future work" discusses future work. To keep the main body of the paper concise, some experimental results are included in the "Appendices".
Related work
Deep learning has gradually developed into two camps led by computer vision and natural language processing, both of which are advancing rapidly and achieving remarkable results. Researchers from diverse fields are attempting to incorporate deep learning methods from these two domains into their own, including EEG emotion recognition. For example, Cheah et al. (2021) restructure the ResNet network designed for image processing according to the characteristics of EEG signals and propose a ResNet18 architecture, achieving an average accuracy of 93.42% on the SEED dataset. Bagherzadeh et al. (2022) use the direct Directed Transfer Function (dDTF) method to obtain 32 × 32 images and input them into four different pre-trained image-processing models, AlexNet, ResNet-50, Inception-v3, and VGG-19, fine-tuning the parameters of each model independently. The average accuracy of this method on the MAHNOB-HCI and DEAP datasets is 99.41% and 98.17%, respectively. The transformer was first proposed in natural language processing, was then introduced into computer vision and speech recognition, and performed very well. Aadam et al. (2022) apply the transformer to EEG emotion recognition, obtaining valence and arousal accuracies of 90.41% and 91.49% on the DEAP dataset. Previous work proves that applying models from computer vision and natural language processing to EEG emotion recognition is feasible, but this line of work faces two problems.
One of the major challenges is transforming raw EEG signals into feature forms suitable for deep learning networks. Convolutional neural networks (CNNs) have been extensively applied in computer vision, but since raw EEG signals are one-dimensional vectors, previous work mostly avoided applying CNNs to them directly. Hwang et al. (2020) project the DE features of 62 electrode channels and five frequency bands into a three-dimensional coordinate system by a polar projection method, fill the empty positions with three different interpolation methods (nearest neighbor, linear, and Clough-Tocher cubic interpolation) to obtain a 32 × 32 × 5 EEG image, and feed the generated EEG image into a 2D convolutional neural network. The method achieves the best average accuracy of 90.41% on the SEED dataset. Phan et al. (2021) divide EEG signals into four frequency bands, Theta, Alpha, Beta, and Gamma, and extract six features from each band: differential entropy (DE), mean (μ), first-order difference mean, second-order difference mean, variance (σ²), and standard deviation (σ), which are then normalized. For each feature, the 32 electrode channels are mapped to a 9 × 9 two-dimensional matrix according to the electrode positions, with empty positions filled with 0. To preserve the multi-feature representation, the two-dimensional matrices of the four frequency bands are merged into one larger matrix per feature, finally yielding an 18 × 18 × 6 three-dimensional matrix. The resulting matrix is fed into a convolutional neural network with multi-scale kernels and verified on the DEAP dataset, where valence and arousal reach 98.27% and 98.36%, respectively.
The second problem is applying and improving models to achieve the best performance on EEG signals. Shen et al. (2020) propose a convolutional neural network in which only the last pooling layer is kept, rather than following every convolutional layer with pooling; because the 2D feature map is small, removing the intermediate pooling layers avoids excessive information loss. At the same time, an LSTM network captures the time-series features. The two models are fused more deeply by segmentation and verified on both the SEED and DEAP datasets, with average accuracies of 94.74% (SEED), 94.22% (DEAP-valence), and 94.58% (DEAP-arousal). Sun et al. (2022) design a parallel network structure consisting of a transformer and a three-dimensional convolutional neural network. First, the preprocessed data passes through a sliding window to obtain the first data format of shape 32 × 128. The first data format is then reformed into a second data format of shape 9 × 9 × 128 according to the electrode positions. The first format is fed into the transformer and the second into the 3D convolutional neural network; finally, the features of the two models are fused into the final features for classification. The method is verified on the DEAP, DREAMER, and SEED datasets. The highest average accuracies are achieved on the DEAP dataset, with arousal and valence reaching 98.53% and 98.27%, respectively.
The above studies show that it is feasible to convert EEG signals into images and analyze them with conventional computer vision (CV) techniques. This paper advances this approach further by incorporating the temporal dimension of the data, expanding from image-based to video-based analysis, and addresses the two major problems of data organization and model adaptation. This work provides a new perspective and insight for emotion recognition based on EEG signals.
Method
Data preparation
Video-level feature organization
As shown in Fig. 2, this paper divides the EEG signal collected in each trial into clips of the same size, and each clip represents a short video segment. Because the clip duration t (in seconds) determines the number of clips N_c, the part of the original signal that is not evenly divisible by the clip length is discarded. The retained EEG signal is therefore a matrix of N electrode channels by L samples, where t denotes the clip duration in seconds, f denotes the sampling frequency, N_c denotes the number of clips, and L = t × f × N_c denotes the length of the time series after discarding. Each clip is then split into frames, each containing n sample points; in this paper, n is half the sampling frequency, i.e., each frame spans 0.5 s. Next, a third-order Butterworth band-pass filter is applied to the signal of each frame to obtain X_b, where b denotes one of the five frequency bands (Delta [1–4 Hz], Theta [4–8 Hz], Alpha [8–14 Hz], Beta [14–31 Hz], Gamma [31–51 Hz]). Then, according to Formula (1) (Zheng et al. 2019), the DE feature is extracted from X_b, where μ and σ² denote the mean and variance of the EEG signal X_b. Finally, this paper maps the channels of the DE features according to the position mapping matrix shown in Fig. 3 and obtains a graph representation of each frame with shape H × W × 5, where H is the height of the feature map, set to 8, and W is the width of the feature map, set to 9.
$$DE(X) = -\int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \log\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right) dx = \frac{1}{2}\log\!\left(2\pi e \sigma^2\right) \tag{1}$$
Fig. 2.
Video-level feature organization method. We first segment the raw EEG signal at the bottom of the figure into clips and take each clip as an input to the model. Then, we further segment each clip into frames. Finally, we filter each frame, extract DE features, concatenate them, and map them to obtain the final frame
Fig. 3.
Position mapping matrix
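To make this organization concrete, the following is a minimal sketch of the per-frame processing described above, assuming a SciPy Butterworth band-pass filter and the Gaussian DE estimate of Formula (1). The position lookup `pos_map` is a hypothetical stand-in for the mapping matrix of Fig. 3, not the authors' exact code.

```python
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 51)}

def de_feature(x):
    """Gaussian differential entropy of a 1D signal (Formula (1))."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(x))

def frame_to_map(frame, fs, pos_map, h=8, w=9):
    """frame: (channels, samples) EEG of one 0.5 s frame -> (h, w, 5) feature map."""
    feat = np.zeros((h, w, len(BANDS)))
    for b, (lo, hi) in enumerate(BANDS.values()):
        bb, aa = butter(3, [lo, hi], btype="bandpass", fs=fs)  # 3rd-order band-pass
        filtered = filtfilt(bb, aa, frame, axis=-1)
        for ch in range(frame.shape[0]):
            r, c = pos_map[ch]          # electrode position from Fig. 3 (assumed lookup)
            feat[r, c, b] = de_feature(filtered[ch])
    return feat
```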
Average baseline correction
Since the DEAP dataset contains a 3 s pretest signal, this paper uses the average baseline correction method inspired by the work of Yang et al. (2018) to eliminate EEG noise caused by spontaneous brain activity. Firstly, the 3 s pretest signal is divided into six baseline frames with a span of 0.5 s. Secondly, the DE features of the five frequency bands are extracted from each baseline frame. Thirdly, for each frequency band b, the average of the DE features of the six baseline frames is taken as the average baseline value, as shown in Formula (2), where the subscripts base_1 to base_6 denote baseline frames one to six at band b. Finally, the average baseline value is subtracted from the DE feature of each frame of the test signal to obtain each frame's baseline-corrected result, as shown in Formula (3).
$$\overline{DE}_{base}^{\,b} = \frac{1}{6}\sum_{i=1}^{6} DE_{base_i}^{\,b} \tag{2}$$
$$\widetilde{DE}_{j}^{\,b} = DE_{j}^{\,b} - \overline{DE}_{base}^{\,b} \tag{3}$$
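A minimal sketch of Formulas (2) and (3), assuming the DE features of the six baseline frames and of the trial frames have already been extracted into NumPy arrays:

```python
import numpy as np

def baseline_correct(trial_de, baseline_de):
    """
    trial_de:    (n_frames, h, w, n_bands) DE features of the test signal
    baseline_de: (6, h, w, n_bands)        DE features of the 3 s pretest signal
    Returns the baseline-corrected DE features (Formula (3)).
    """
    base_mean = baseline_de.mean(axis=0)   # average baseline value, Formula (2)
    return trial_de - base_mean            # broadcast subtraction over all frames
```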
Model
Efficient channel attention
The study of Duan et al. (2013) shows that the human emotional state is closely related to the five frequency bands of EEG signals. However, previous work usually feeds the EEG features of the five frequency bands into the neural network indiscriminately, which reduces the model's ability to select the critical frequency bands. Therefore, this paper introduces a channel attention mechanism into the network architecture, which captures the importance of each frequency band and weights it accordingly, strengthening critical bands and suppressing non-critical ones.
Efficient Channel Attention (ECA) (Wang et al. 2020) is an optimization of the Squeeze-and-Excitation Network (SE-Net) (Hu et al. 2020). It replaces the linear layers in SE-Net with a one-dimensional convolution, realizing local cross-channel interaction and avoiding the information loss caused by dimensionality reduction. ECA also proposes an adaptive method for selecting the convolution kernel size, as shown in Formula (4), which avoids manual parameter tuning. As illustrated in Fig. 4, the input features first pass through global average pooling (GAP). The weight of each frequency band is then calculated by a fast one-dimensional convolution, and the sigmoid function normalizes the weights. Finally, the normalized weights and the input features are multiplied band by band to obtain the weighted features. The computation follows Formulas (5)-(7) and involves only k parameters (k ≤ 9), greatly reducing the complexity of the model.
$$k = \psi(C) = \left|\frac{\log_2(C)}{\gamma} + \frac{b}{\gamma}\right|_{odd} \tag{4}$$
where k is the convolution kernel size, C is the channel dimension, ψ(·) is the mapping between C and k, |·|_odd denotes the nearest odd number, and γ and b are hyperparameters. In the experiment, when C = 5, k = 3.
$$g(X) = \frac{1}{WH}\sum_{i=1}^{W}\sum_{j=1}^{H} X_{ij} \tag{5}$$
where X is the input feature; W, H, and C are the width, height, and number of channels of the input feature; and g(·) denotes channel-wise global average pooling.
$$\omega = \sigma\big(\mathrm{C1D}_k\left(g(X)\right)\big) \tag{6}$$
$$\widetilde{X} = \omega \otimes X \tag{7}$$
where ω denotes the channel (frequency band) weights, C1D denotes one-dimensional convolution, σ denotes the sigmoid function, k denotes the convolution kernel size, ⊗ denotes band-wise multiplication, and X̃ is the output feature.
Fig. 4.

Efficient channel attention model
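As a concrete illustration, the following is a minimal PyTorch sketch of an ECA block applied over the frequency-band axis, treating the five bands as channels. The default kernel size of 3 follows the value reported above for C = 5; the module name and layout are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention over the frequency-band axis (bands as channels)."""
    def __init__(self, k=3):
        # k is the 1D convolution kernel size; the paper reports k = 3 for C = 5 bands,
        # obtained adaptively from the channel count via Formula (4).
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                       # GAP, Formula (5)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                                         # x: (batch, bands, H, W)
        y = self.pool(x).squeeze(-1).transpose(1, 2)              # (batch, 1, bands)
        w = torch.sigmoid(self.conv(y))                           # band weights, Formula (6)
        w = w.transpose(1, 2).unsqueeze(-1)                       # (batch, bands, 1, 1)
        return x * w                                              # band-wise re-weighting, Formula (7)
```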
Convolutional neural networks
The structure of the convolutional neural network is shown in Fig. 5; it extracts the spatial-domain and frequency-domain information of the band-weighted feature map. To keep the data size unchanged after convolution, zero-padding is applied at the data boundary, and all convolutions use a stride of 1 in all directions. After each convolution operation, ReLU is used as the activation function to improve the nonlinear fitting ability of the model. To reduce computation and increase the model's generalization ability, a max-pooling layer with a stride of 2 downsamples the features after the last convolution layer. The pooled output is then flattened and sent into a linear layer to obtain a one-dimensional representation of the feature, and the ReLU function is applied again to ensure that the output is non-negative.
Fig. 5.
Convolutional neural networks model
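The following is a hedged sketch of such a per-frame CNN. The number of convolution layers and the channel widths are illustrative assumptions, since only the same-padding, stride-1 convolutions, the single final max-pooling layer, and the 768-dimensional output are specified above.

```python
import torch.nn as nn

class FrameCNN(nn.Module):
    """Per-frame CNN: same-padded convs, ReLU, one final max-pool, flatten, linear -> 768.
    Layer counts and channel widths below are illustrative assumptions."""
    def __init__(self, in_bands=5, out_dim=768):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_bands, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),   # only the last conv is followed by pooling
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, out_dim),          # 8x9 map becomes 4x4 after pooling
            nn.ReLU(),                               # keep the frame representation non-negative
        )

    def forward(self, x):                            # x: (batch, bands, 8, 9)
        return self.head(self.features(x))
```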
NeXtVLAD module
Each clip consists of a series of frames. To classify clips, we use the ECA and CNN modules described above to extract features from each frame; we then need to aggregate the features of all frames to generate a video-level feature. Since there is considerable similarity between frames, a simple way to aggregate their features is to take the average over all frames, as shown in Fig. 6a. However, this approach loses much information, such as temporal information and the variation in activation intensity of different brain regions.
Fig. 6.
a The process of mean aggregation; b The process of NeXtVLAD aggregation; c VLAD vector for K = 64
Instead, NeXtVLAD (Lin et al. 2019) partitions all frame features into multiple clusters, as shown by the dashed lines in Fig. 6b, and aggregates similar frame features within each cluster to obtain an aggregated feature. Finally, all the aggregated features are concatenated into a global feature vector that serves as the video feature. The aggregation process of NeXtVLAD is implemented using the VLAD method, which discards the absolute frame features and retains only the differences between each frame feature and the cluster center to which it belongs. These differences are represented by a VLAD vector. The VLAD vector illustrated in Fig. 6c is calculated by summing the differences obtained by subtracting the cluster center from the frame features belonging to the same cluster, as shown in Fig. 6b, where K represents the number of cluster centers.
Figure 7 shows the model structure of NeXtVLAD. Specifically, the input feature x of M frames and N dimensions is expanded via a linear expansion layer to obtain ẋ with shape (M, λN), where λ is the width multiplier, set to 2 in the experiment. Secondly, to reduce the number of model parameters, ẋ is split into groups to obtain x̃ with shape (M, G, λN/G), where G is the number of groups, set to 3 in the experiment. To aggregate each group's features, a group attention mechanism is introduced into the model to obtain the group weights, computed as shown in Formula (8).
$$\alpha_{g}(\dot{x}_i) = \sigma\big(w_g^{\top}\dot{x}_i + b_g\big) \tag{8}$$
where α_g(ẋ_i) denotes the weight of the i-th frame feature in group g, σ is the sigmoid function, and w_g and b_g are learnable parameters.
Fig. 7.
NeXtVLAD model
Unlike hard assignment, which assigns each feature point to a unique cluster center, soft assignment classifies each grouped feature x̃_i^g into every cluster center c_k with a probability α_gk(x̃_i^g), as shown in Formula (9). When x̃_i^g is closer to c_k, the value of α_gk approaches 1; otherwise, it approaches 0, where k ranges from 1 to K and K is the number of cluster centers. In the experiment, K is set to 64.
$$\alpha_{gk}(\tilde{x}_i^{\,g}) = \frac{e^{\,w_k^{\top}\tilde{x}_i^{\,g} + b_k}}{\sum_{s=1}^{K} e^{\,w_s^{\top}\tilde{x}_i^{\,g} + b_s}} \tag{9}$$
Then the soft-assignment weights α_gk(x̃_i^g) are combined with the group attention weights α_g(ẋ_i) to weight the differences between the grouped features and the cluster centers c_k. Summing over the M frame and G group dimensions yields the VLAD vector, as shown in Formula (10), where j indexes the feature dimension.
$$y_{jk} = \sum_{i=1}^{M}\sum_{g=1}^{G}\alpha_{g}(\dot{x}_i)\,\alpha_{gk}(\tilde{x}_i^{\,g})\big(\tilde{x}_{ij}^{\,g} - c_{kj}\big) \tag{10}$$
Finally, L2 normalization is performed on the VLAD vector, which is then reshaped into a one-dimensional vector and sent to a fully connected layer to obtain an output vector of size O = 1024.
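A compact PyTorch sketch of this aggregation, using the settings reported above (λ = 2, G = 3, K = 64, output size 1024, dropout 0.3); the exact normalization order and parameterization are assumptions that follow the NeXtVLAD paper only approximately.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeXtVLAD(nn.Module):
    """Sketch of NeXtVLAD aggregation (Lin et al. 2019) for frame-level EEG features."""
    def __init__(self, dim=768, lam=2, groups=3, clusters=64, out_dim=1024, dropout=0.3):
        super().__init__()
        self.G, self.K = groups, clusters
        self.D = lam * dim // groups                           # per-group size lambda*N/G
        self.expand = nn.Linear(dim, lam * dim)                # linear expansion layer
        self.group_attn = nn.Linear(lam * dim, groups)         # group attention, Formula (8)
        self.assign = nn.Linear(lam * dim, groups * clusters)  # soft-assignment logits
        self.centers = nn.Parameter(torch.randn(clusters, self.D))  # cluster centers c_k
        self.fc = nn.Sequential(nn.Dropout(dropout), nn.Linear(clusters * self.D, out_dim))

    def forward(self, x):                                      # x: (batch, M, N) frame features
        B, M, _ = x.shape
        x = self.expand(x)                                     # (B, M, lam*N)
        attn = torch.sigmoid(self.group_attn(x))               # alpha_g, (B, M, G)
        assign = F.softmax(self.assign(x).view(B, M, self.G, self.K), dim=-1)  # Formula (9)
        assign = assign * attn.unsqueeze(-1)                   # combine both weights
        a = assign.reshape(B, M * self.G, self.K)
        xg = x.view(B, M, self.G, self.D).reshape(B, M * self.G, self.D)
        # Formula (10): weighted residuals to the centers, summed over frames and groups
        vlad = torch.einsum("bmk,bmd->bkd", a, xg)
        vlad = vlad - a.sum(dim=1).unsqueeze(-1) * self.centers
        vlad = F.normalize(vlad, p=2, dim=-1)                  # L2 normalization
        return self.fc(vlad.flatten(1))                        # (B, out_dim)
```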
Network architecture for CACAN
The CACAN architecture is shown in Fig. 8. To accelerate training and expand the number of samples, this paper uses the clip as the training unit, with shape (F, H, W, B), where F denotes the number of frames, H the height of the feature map, W the width of the feature map, and B the number of frequency bands. Each clip is composed of a series of frames; Section "The effect of the number of frames" explores different values of F. Specifically, the frames are first sent to the ECA module to obtain the band-weighted features, a process that does not change the feature shape. Secondly, to prevent excessive attention to any single frequency band, a residual structure is added for regularization. Thirdly, the features after the residual structure are fed into the CNN module to extract the frame features of each feature map, yielding one-dimensional features of size 768. The same ECA and CNN modules process all frames, so their weights are shared. Fourthly, the one-dimensional representations of all frames in a clip are concatenated along the time dimension and fed into the NeXtVLAD module to obtain a one-dimensional vector of size 1024 as the video-level feature. Finally, this vector is connected to the output nodes through a fully connected layer, and the final classification result is obtained through softmax.
Fig. 8.
Network architecture for CACAN
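Putting the pieces together, the following is a minimal sketch of the forward pass, reusing the hypothetical `ECA`, `FrameCNN`, and `NeXtVLAD` sketches above; the exact form of the residual connection (here x + ECA(x)) is one plausible reading of the description.

```python
import torch.nn as nn

class CACAN(nn.Module):
    """Sketch of the pipeline: ECA + residual -> shared per-frame CNN -> NeXtVLAD -> classifier."""
    def __init__(self, n_classes, n_bands=5, frame_dim=768, video_dim=1024):
        super().__init__()
        self.eca = ECA()
        self.cnn = FrameCNN(in_bands=n_bands, out_dim=frame_dim)
        self.vlad = NeXtVLAD(dim=frame_dim, out_dim=video_dim)
        self.classifier = nn.Linear(video_dim, n_classes)

    def forward(self, clip):                       # clip: (batch, F, H, W, B)
        b, f = clip.shape[:2]
        x = clip.permute(0, 1, 4, 2, 3).reshape(
            b * f, clip.shape[4], clip.shape[2], clip.shape[3])  # frames as a batch
        x = x + self.eca(x)                        # residual connection around ECA (SC)
        frames = self.cnn(x).view(b, f, -1)        # shared-weight frame features
        return self.classifier(self.vlad(frames))  # logits; softmax is applied in the loss
```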
Experiments
Datasets
SEED dataset
The SEED dataset (Duan et al. 2013; Zheng and Lu 2015) is a public EEG emotion recognition dataset provided by the BCMI laboratory. Fifteen subjects, seven males and eight females, are invited to watch 15 movie clips. Each clip lasts about 4 min, with a 5 s hint before it is played, followed by 45 s of self-evaluation and 15 s of rest. Each subject participates in three sessions in total, at intervals of about one week, and watches all 15 video materials in each session. The data structure of the dataset is shown in Table 1. The 62 electrode channels are arranged according to the international 10–20 system, as shown in Fig. 9. The experiment uses the preprocessed data provided by the official website, which is down-sampled to 200 Hz and filtered by a 0–75 Hz band-pass filter. All samples are divided into three categories (positive, neutral, and negative); among the 15 video materials, there are five films of each type.
Table 1.
The data structure of the SEED dataset
| Array | Array shape | Array content |
|---|---|---|
| Data | 15 × 3 × 15 × 62 × 37,000 | Subjects × sessions × videos × channel × data (200 Hz × 185 s) |
| Labels | 15 × 3 × 15 × 1 | Subjects × sessions × videos × label (+ 1 for positive/0 for neutral/− 1 for negative) |
Fig. 9.

International 10–20 system
DEAP dataset
The DEAP dataset (Koelstra et al. 2012) is a public dataset provided by Koelstra et al. and is used as an evaluation standard in many EEG emotion recognition studies. Thirty-two subjects, 16 men and 16 women, are invited to watch 40 music video materials, each lasting 60 s, with 3 s of silence before each video is played. The data structure of the dataset is shown in Table 2. Among the 40 signal channels, the first 32 are EEG channels and the remaining 8 are other physiological signal channels, with a sampling frequency of 512 Hz. The experiment uses the preprocessed data provided by the official website, which resamples the original data at 128 Hz and removes ocular noise. After watching each video, subjects spend about 15 s on a Self-Assessment Manikin (SAM) evaluation, rated on a scale of 1–9. The two-dimensional valence-arousal coordinate system is the mainstream method for emotion evaluation and measures two emotional dimensions. Valence indicates the degree of emotional pleasure, transitioning gradually from negative to positive as the rating increases from 1 to 9. Arousal indicates the level of emotional excitement, transitioning gradually from calm to excited from 1 to 9. In this paper, the threshold is set to 5: labels greater than five are judged as high arousal or high valence, while labels less than or equal to five are judged as low arousal or low valence. Among the 1280 samples, 708 have valence greater than five and 572 less than or equal to five; 737 have arousal greater than five and 543 less than or equal to five.
Table 2.
The data structure of the DEAP dataset
| Array | Array shape | Array content |
|---|---|---|
| Data | 32 × 40 × 40 × 8064 | Subjects × videos × channels × data (128 Hz × 63 s) |
| Labels | 32 × 40 × 4 | Subjects × videos × labels (Valence/Arousal/Dominance/Liking) |
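The label binarization described above can be sketched as follows, assuming the label array layout of Table 2 (valence, arousal, dominance, liking):

```python
import numpy as np

def binarize_deap_labels(labels, threshold=5.0):
    """labels: (n_samples, 4) SAM ratings [valence, arousal, dominance, liking] in 1-9.
    Returns binary high/low targets for valence and arousal using the threshold of 5."""
    valence = (labels[:, 0] > threshold).astype(np.int64)   # 1 = high valence, 0 = low
    arousal = (labels[:, 1] > threshold).astype(np.int64)   # 1 = high arousal, 0 = low
    return valence, arousal
```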
Experimental settings
The CACAN model is implemented in Python with the PyTorch framework. The experimental environment is an Intel(R) Xeon(R) Gold 6134 CPU @ 3.20 GHz with 128 GB of memory, running a 64-bit Linux system. The PyTorch version is 1.11.0, the Python version is 3.7.0, and the CUDA version is 11.2. The maximum number of training epochs is 200, and the batch size is 64. AdamW (Xin et al. 2023) is used to optimize the label smoothing cross-entropy (Müller et al. 2019) loss function. The initial learning rate is 5e−5, reached through 30 steps of linear warmup, and dropout is set to 0.3 in the NeXtVLAD module. We carry out subject-dependent experiments, and each subject's average accuracy and standard deviation are obtained by tenfold cross-validation to evaluate the model's performance. An example of the training process for subject 1 on the DEAP dataset can be seen in Fig. 10.
Fig. 10.
The training process of subject 1 in terms of training and validation accuracy and loss on the DEAP dataset in subject-dependent experiment
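For concreteness, the training setup can be sketched as follows; `model`, `train_loader`, and the label-smoothing factor are assumptions not specified in the text.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)   # smoothing value assumed
optimizer = AdamW(model.parameters(), lr=5e-5)

warmup_steps = 30
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))  # linear warmup

for epoch in range(200):                  # up to 200 epochs; batch size 64 set in the loader
    for clips, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
```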
Experimental results and discussion
In this section, we conduct experiments aiming to answer the following key research questions (RQs) and provide statistical support for each:
RQ1: During the data processing, how many frames should be included in a Clip to achieve higher classification accuracy?
RQ2: Are there certain frequency bands that are more sensitive and discriminative for emotional classification tasks? What are the advantages of the ECA module?
RQ3: What are the advantages of the NeXtVLAD module compared to RNNs?
RQ4: Are different modules in the model effective? And how much do each of the parts contribute to the model's performance?
RQ5: How does CACAN perform compared to state-of-the-art methods that also use the DEAP and SEED public datasets in subject-dependent experiments?
The effect of the number of frames
When an external stimulus is applied, it takes time for the nervous system to respond. At the same time, emotional intensity is variable and changes over time (Pereira et al. 2021). There is currently no research on how long an EEG segment should be to obtain high classification accuracy. Therefore, this paper takes the number of frames per sample as the only variable to explore its influence on classification accuracy; each frame contains 0.5 s of EEG signal. The experimental results are shown in Fig. 11. On the SEED dataset, the model accuracy increases with the number of frames, reaches a peak of 95.80% at 30 and 36 frames, and then decreases as the frame number grows further. Similarly, the classification accuracy on the DEAP dataset first increases and then decreases. When the number of frames equals 24, the valence and arousal dimensions of the DEAP dataset simultaneously reach peaks of 98.98% and 98.97%. The curves of the arousal and valence dimensions almost coincide, which indicates a specific interaction between the two dimensions. Generally, when the number of frames is less than 12, the EEG signal in each sample is too short and carries insufficient information, resulting in low classification accuracy. When the number of frames is greater than 36, the EEG signal in each sample is too long; the regulatory mechanism of the nervous system changes the emotional intensity, which introduces noise into the sample data and reduces classification accuracy on all datasets, especially DEAP. Therefore, it is reasonable to set the frame number between 12 and 36.
Fig. 11.
Accuracy of SEED and DEAP datasets at different frame numbers
ECA captures crucial EEG frequency bands
The learned attention weights of the Efficient Channel Attention (ECA) module are presented in Fig. 12. The results demonstrate that the Gamma band has higher weights than the other four frequency bands, indicating that the Gamma band is more sensitive and discriminative for emotion classification tasks. This finding is consistent with (Ullah and Halim 2021; Zhong et al. 2022). In contrast, the weights of the Beta, Alpha, Theta, and Delta bands decrease gradually, suggesting that their ability to capture emotion-related features diminishes accordingly. Additionally, previous research has shown that Gamma- and Beta-band activity in the brain increases when experiencing positive emotions such as joy and surprise (Zheng and Lu 2015). Compared to low-frequency signals, high-frequency signals vary more rapidly over shorter periods of time. Thus, high-frequency signals provide more detailed and informative features in the temporal and spectral domains, which the attention mechanism can fully exploit in emotion classification tasks.
Fig. 12.
The attention weights for the five frequency bands change with varying numbers of frames
The study employs the MNE library to generate topographic maps for the five frequency bands of the SEED dataset. A linear summation approach is then applied to the EEG of the five frequency bands before and after ECA weighting, and the corresponding topographic maps are generated as shown in Fig. 13, where redder colors indicate stronger activation. The results indicate that, compared to unprocessed data, the ECA-weighted topographic maps display significantly clearer signals and less noise while retaining the highly activated brain regions. This finding not only provides a feasible way to limit the impact of noise on EEG signals but also offers new insights for further improving EEG feature extraction.
Fig. 13.
EEG topographic maps of five frequency bands and before and after ECA weighting
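A hedged sketch of how such maps could be produced with MNE, assuming a recent MNE version in which `plot_topomap` accepts an `Info` object and that channel names follow the standard 10–20 montage; this is not necessarily the authors' exact plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
import mne

def plot_band_topomap(band_values, ch_names, sfreq=200.0):
    """band_values: one value per channel (e.g., summed DE of a band, before or after
    ECA weighting); ch_names: electrode names in 10-20 notation. Channels absent from
    the standard montage (e.g., CB1/CB2 in SEED) would need custom positions."""
    info = mne.create_info(ch_names=list(ch_names), sfreq=sfreq, ch_types="eeg")
    info.set_montage(mne.channels.make_standard_montage("standard_1020"),
                     on_missing="ignore")
    fig, ax = plt.subplots()
    mne.viz.plot_topomap(np.asarray(band_values), info, axes=ax, show=False)
    return fig
```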
To verify that the ECA module is superior to other channel attention models, this paper replaces the ECA module in the CACAN model with other channel attention mechanisms, including CA and SE (Hu et al. 2020). CA is a modification of the Multi-Head Attention model (Vaswani et al. 2017), i.e., self-attention in the channel dimension. To apply it, the input of shape (F, H, W, B) is first reshaped to (F × H × W, B), where F is the number of frames, H the height of the feature map, W the width of the feature map, and B the number of frequency bands, and then fed into the Multi-Head Attention model with the same parameter settings as the original paper. The experimental results are shown in Table 3. ECA performs better than the other two channel attention models, and by using a convolution instead of the linear layers of the other models, it significantly reduces model complexity and avoids overfitting. In summary, this paper chooses the ECA module to weight the importance of the frequency bands.
Table 3.
The result of ECA and other channel attention models ACC ± STD (%)
| Method | SEED | DEAP-valence | DEAP-arousal |
|---|---|---|---|
| CA | 94.93% ± 2.36% | 98.64% ± 1.55% | 98.72% ± 1.49% |
| SE | 95.60% ± 2.32% | 98.86% ± 1.09% | 98.86% ± 1.33% |
| ECA | 95.80% ± 2.04% | 98.98% ± 0.98% | 98.97% ± 1.13% |
Comparisons with RNN models
Recurrent neural networks have advantages in processing time series and are commonly used to extract temporal features in video classification tasks, so this paper compares NeXtVLAD with recurrent neural networks, including LSTM (Hochreiter and Schmidhuber 1997), BiLSTM (Joshi et al. 2022), GRU (Cho et al. 2014), and BiGRU (Chung et al. 2014). In terms of model parameters, the input size of all recurrent neural networks is set to 768. The hidden size is 1024 for LSTM and GRU and 512 for BiLSTM and BiGRU, and the hidden-layer features are used as the output, so all recurrent models produce one-dimensional vectors of size 1024. The experimental results are shown in Table 4 and demonstrate that NeXtVLAD extracts time-varying spatial information more effectively than recurrent models such as LSTM.
Table 4.
The result of NeXtVLAD and common RNN models ACC ± STD (%)
| Method | SEED | DEAP-valence | DEAP-arousal |
|---|---|---|---|
| LSTM | 91.59% ± 3.22% | 96.27% ± 2.44% | 96.89% ± 2.16% |
| BiLSTM | 91.48% ± 3.33% | 94.72% ± 2.83% | 95.64% ± 3.00% |
| GRU | 92.45% ± 3.30% | 98.47% ± 1.80% | 98.53% ± 1.52% |
| BiGRU | 92.41% ± 3.08% | 98.09% ± 2.27% | 98.13% ± 2.20% |
| NeXtVLAD | 95.80% ± 2.04% | 98.98% ± 0.98% | 98.97% ± 1.13% |
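For reference, a minimal sketch of the LSTM baseline configuration described above, under the assumption that the final hidden state serves as the 1024-dimensional output representation:

```python
import torch.nn as nn

class LSTMHead(nn.Module):
    """Recurrent baseline used in place of NeXtVLAD: the last hidden state of an LSTM
    over the frame features serves as the 1024-dim video-level representation."""
    def __init__(self, in_dim=768, hidden=1024):
        super().__init__()
        self.lstm = nn.LSTM(input_size=in_dim, hidden_size=hidden, batch_first=True)

    def forward(self, frames):              # frames: (batch, M, 768)
        _, (h_n, _) = self.lstm(frames)
        return h_n[-1]                      # (batch, 1024)
```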
We compared the accuracy of NeXtVLAD and LSTM at different frame numbers, as shown in Fig. 14. The accuracy of NeXtVLAD gradually improves from the 6th to the 24th frame and begins to decline only after the 24th frame, whereas the accuracy of LSTM declines consistently from the 12th frame onward. The results demonstrate that NeXtVLAD performs better on longer time series. Our analysis suggests two main reasons. Firstly, due to the similarity between adjacent frames, NeXtVLAD can aggregate similar frame features by clustering, which achieves feature dimensionality reduction; in contrast, LSTM and other recurrent neural networks must process each frame feature individually at every time step, resulting in a larger model, higher computational complexity, and greater training difficulty. Secondly, because of the recurrent feedback in LSTM, gradients must propagate over many time steps and can decay or grow exponentially, causing vanishing or exploding gradients that make it difficult for the model to capture long-term dependencies and thus reduce accuracy. By comparison, NeXtVLAD handles long time series better and captures richer features, enhancing the training effectiveness of the model.
Fig. 14.
a Accuracy of DEAP-V at different frame numbers; b Accuracy of DEAP-A at different frame numbers
Ablation test
The CACAN model consists of three base modules (ECA, CNN, NeXtVLAD) and a residual connection structure (SC). To verify each part's contribution to the model's performance, an ablation test is designed, with results shown in Table 5. Firstly, the ECA module is removed. Since the SC structure is no longer meaningful once ECA is removed (the result obtained through SC differs only by a simple multiple from the result without SC), ECA and SC are removed together while the other parts of the model remain unchanged. The results show that SEED, DEAP valence, and DEAP arousal decrease by 0.41%, 0.04%, and 0.03%, respectively, indicating that ECA is slightly helpful in improving model performance. Secondly, only SC is removed. On the SEED dataset, the classification accuracy decreases by 1.26%, whereas on the DEAP dataset, the accuracy of the valence and arousal models increases by 0.11% each. This indicates that the SC structure improves accuracy on the SEED dataset but slightly reduces it on the DEAP dataset. However, the SC structure significantly reduces the standard deviation of the classification results on both datasets, making the results more stable; therefore, the SC structure is meaningful. Thirdly, the CNN module is removed. Because the shape of the NeXtVLAD input features changes, a flatten layer and a linear layer with input size 360 and output size 768 are added before NeXtVLAD, changing the feature shape from (batch_size, 32, 8, 9, 5) to (batch_size, 32, 768), while the other parts remain unchanged. The table shows that SEED, DEAP valence, and DEAP arousal decrease by 1.83%, 2.71%, and 2.07%, respectively, indicating that the CNN module effectively extracts the local features of the feature map. Finally, the NeXtVLAD module is removed, the last linear layer of the CNN module is connected directly to the output nodes, the ReLU after that linear layer is also removed, and the other parts remain unchanged. The classification performance on all datasets drops significantly: SEED, DEAP valence, and DEAP arousal decrease by 15.44%, 8.91%, and 7.77%, respectively, indicating that NeXtVLAD contributes the most to the performance of the CACAN model and can learn more emotion-related features from video-level EEG features.
Table 5.
The result of ablation test ACC ± STD (%)
| Method | SEED | DEAP-valence | DEAP-arousal |
|---|---|---|---|
| ALL | 95.80% ± 2.04% | 98.98% ± 0.98% | 98.97% ± 1.13% |
| No ECA&SC | 95.39% ± 2.24% | 98.94% ± 1.09% | 98.94% ± 1.25% |
| No SC | 94.54% ± 2.48% | 99.09% ± 1.10% | 99.08% ± 1.34% |
| No CNN | 93.97% ± 2.91% | 96.27% ± 2.06% | 96.91% ± 2.00% |
| No NeXtVLAD | 80.36% ± 5.90% | 90.07% ± 3.94% | 91.21% ± 3.92% |
Comparisons with other methods
We compare our method with other methods that also use the SEED and DEAP datasets; the results are shown in Table 6. TOPO-FM (Topic and Russo 2021) extracts band power (BP), differential entropy (DE), and nine other features from EEG signals, creates topographic maps according to the electrode positions defined by the 10–20 system, uses a CNN to extract the features of each topographic map separately, and concatenates them with a fully connected layer; an SVM then performs classification, and the final result is obtained after tenfold cross-validation. HOLO-FM (Topic and Russo 2021) is similar to TOPO-FM, but computer-generated holography is used to generate the two-dimensional feature maps instead of electrode-position topographic maps. The ATDD-LSTM model (Du et al. 2022) inputs the differential entropy (DE) features into a 2-layer LSTM to capture temporal features, which are then fed in parallel into an attention-based encoder-decoder and a domain discriminator to learn domain-invariant and emotion-related features. Two kinds of experiments are carried out in that paper, subject-dependent and subject-independent; because the subject-dependent setting matches this paper's experimental setup, we compare only against its subject-dependent results. Fused CNN-LSTM (Ramzan and Dawn 2021) combines the raw signals at each sampling point into a 2D image according to the electrode positions and arranges the images in temporal order. A group of 2D images is fed into a CNN via a sliding window to extract features from each image, which are then spliced together; meanwhile, the 1D signals are fed into a two-layer LSTM to extract time-sequence features. Finally, the features extracted by the CNN and LSTM are spliced and passed to a fully connected layer, and the final result is obtained through tenfold cross-validation. 4D-CRNN (Shen et al. 2020) converts the differential entropy features into a 4D structure and sends them into a CRNN model combining CNN and LSTM, with the final result obtained through fivefold cross-validation. DE-CNN-BiLSTM (Cui et al. 2022) inputs the differential entropy features into a convolutional neural network and a BiLSTM, also with fivefold cross-validation. All results are taken from the original papers. Compared with these methods, the proposed method achieves the best performance on the SEED dataset for the three-class problem (positive, neutral, negative) and on the DEAP dataset for the two binary classification problems of the valence and arousal dimensions.
Table 6.
The result of ours compared with other methods ACC ± STD (%)
| Method | SEED | DEAP-valence | DEAP-arousal |
|---|---|---|---|
| TOPO-FM (Topic and Russo 2021) | 70.37% ± 3.41% | 76.30% ± 1.86% | 76.54% ± 2.69% |
| HOLO-FM (Topic and Russo 2021) | 88.45% ± 1.56% | 76.61% ± 2.13% | 77.72% ± 2.87% |
| ATDD-LSTM (Du et al. 2022) | 91.08% ± 6.42% | 91.53% ± 9.00% | 91.55% ± 11.32% |
| Fused CNN-LSTM (Ramzan and Dawn 2021) | 93.74% | 97.41% | 97.39% |
| 4D-CRNN (Shen et al. 2020) | 94.74% ± 2.32% | 94.22% ± 2.61% | 94.58% ± 3.69% |
| DE-CNN-BiLSTM (Cui et al. 2022) | 94.82% | 94.02% | 94.86% |
| CACAN (Ours) | 95.80% ± 2.04% | 98.98% ± 0.98% | 98.97% ± 1.13% |
Conclusion
In this paper, we propose an organization method for video-level features that effectively fuses the time-domain, frequency-domain, and spatial-domain information of EEG signals. We also design a deep neural network based on video-level features, the Channel Attention Convolutional Aggregation Network (CACAN), for EEG emotion recognition. The network adaptively captures critical EEG frequency bands through an efficient channel attention module, uses a residual structure to improve the stability of classification, employs a convolutional neural network to extract the frequency and spatial features of each feature map, and learns how spatial features evolve over time through the NeXtVLAD module. The method is validated on two public datasets, SEED and DEAP. The average classification accuracy with standard deviation is 95.80% ± 2.04% on the SEED dataset, and 98.98% ± 0.98% (valence) and 98.97% ± 1.13% (arousal) on the DEAP dataset. Compared with other methods, the proposed method achieves state-of-the-art performance. The results demonstrate that it is feasible to organize EEG signals into video form and to study EEG emotion recognition from the perspective of video classification.
Future work
Since EEG signals can be organized into the same feature form as video, we plan to combine facial and EEG signals for multimodal learning. Meanwhile, because EEG signals exhibit individual differences, performance on cross-subject emotion classification remains limited; we expect multimodal learning to help address this problem as well.
Appendices
Fig. 15.
The accuracy of each subject on the DEAP dataset using CACAN
Fig. 16.
The accuracy of each subject on the SEED dataset using CACAN
Fig. 17.
a The arousal dimension's confusion matrix on the DEAP dataset using CACAN; b The valence dimension's confusion matrix on the DEAP dataset using CACAN; c The confusion matrix on the SEED dataset using CACAN
Table 7.
The accuracy, recall, precision, and F1 score of each subject and their averages on the DEAP dataset using CACAN
| Subject | Accuracy (%) | Recall (%) | Precision (%) | F1-Score (%) | ||||
|---|---|---|---|---|---|---|---|---|
| Arousal | Valence | Arousal | Valence | Arousal | Valence | Arousal | Valence | |
| S01 | 100.00 | 99.00 | 100.00 | 98.10 | 100.00 | 100.00 | 100.00 | 99.04 |
| S02 | 98.00 | 98.50 | 97.50 | 98.89 | 97.50 | 97.80 | 97.50 | 98.34 |
| S03 | 100.00 | 99.50 | 100.00 | 98.89 | 100.00 | 100.00 | 100.00 | 99.44 |
| S04 | 98.50 | 98.50 | 98.26 | 99.17 | 99.12 | 98.35 | 98.69 | 98.76 |
| S05 | 98.00 | 97.00 | 97.14 | 97.50 | 99.03 | 95.12 | 98.08 | 96.30 |
| S06 | 99.50 | 99.00 | 99.13 | 98.00 | 100.00 | 98.00 | 99.56 | 98.00 |
| S07 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| S08 | 97.50 | 98.00 | 97.65 | 98.89 | 96.51 | 96.74 | 97.08 | 97.80 |
| S09 | 99.00 | 98.50 | 97.50 | 98.00 | 100.00 | 98.99 | 98.73 | 98.49 |
| S10 | 98.50 | 99.50 | 98.89 | 100.00 | 97.80 | 99.01 | 98.34 | 99.50 |
| S11 | 98.50 | 97.00 | 100.00 | 97.50 | 97.66 | 95.12 | 98.81 | 96.30 |
| S12 | 99.50 | 100.00 | 100.00 | 100.00 | 97.22 | 100.00 | 98.59 | 100.00 |
| S13 | 99.50 | 99.50 | 100.00 | 100.00 | 96.77 | 99.14 | 98.36 | 99.57 |
| S14 | 99.50 | 99.50 | 100.00 | 100.00 | 98.48 | 99.01 | 99.24 | 99.50 |
| S15 | 98.50 | 100.00 | 99.05 | 100.00 | 98.11 | 100.00 | 98.58 | 100.00 |
| S16 | 100.00 | 99.50 | 100.00 | 100.00 | 100.00 | 99.21 | 100.00 | 99.60 |
| S17 | 99.50 | 98.00 | 98.75 | 97.78 | 100.00 | 97.78 | 99.37 | 97.78 |
| S18 | 99.50 | 99.50 | 100.00 | 98.75 | 98.68 | 100.00 | 99.34 | 99.37 |
| S19 | 100.00 | 99.50 | 100.00 | 98.82 | 100.00 | 100.00 | 100.00 | 99.41 |
| S20 | 99.50 | 98.50 | 97.78 | 97.65 | 100.00 | 98.81 | 98.88 | 98.22 |
| S21 | 99.50 | 100.00 | 100.00 | 100.00 | 97.56 | 100.00 | 98.77 | 100.00 |
| S22 | 94.50 | 96.00 | 88.75 | 96.40 | 97.26 | 96.40 | 92.81 | 96.40 |
| S23 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| S24 | 99.50 | 99.50 | 97.14 | 99.09 | 100.00 | 100.00 | 98.55 | 99.54 |
| S25 | 100.00 | 99.50 | 100.00 | 99.05 | 100.00 | 100.00 | 100.00 | 99.52 |
| S26 | 99.00 | 99.00 | 100.00 | 97.14 | 98.29 | 100.00 | 99.14 | 98.55 |
| S27 | 99.50 | 100.00 | 98.46 | 100.00 | 100.00 | 100.00 | 99.22 | 100.00 |
| S28 | 98.00 | 99.00 | 99.09 | 97.33 | 97.32 | 100.00 | 98.20 | 98.65 |
| S29 | 97.50 | 99.50 | 98.67 | 98.82 | 94.87 | 100.00 | 96.73 | 99.41 |
| S30 | 99.00 | 98.00 | 100.00 | 96.92 | 98.13 | 96.92 | 99.06 | 96.92 |
| S31 | 99.50 | 99.50 | 100.00 | 100.00 | 99.06 | 98.84 | 99.53 | 99.42 |
| S32 | 98.00 | 99.00 | 96.92 | 99.00 | 96.92 | 99.00 | 96.92 | 99.00 |
| ALL | 98.97 | 98.98 | 98.77 | 98.80 | 98.63 | 98.88 | 98.69 | 98.84 |
Table 8.
The accuracy, recall, precision, and F1 score of each subject and their averages on the SEED dataset using CACAN
| Subject | Accuracy (%) | Recall (%) | Precision (%) | F1-Score (%) |
|---|---|---|---|---|
| S01 | 93.86 | 93.87 | 93.96 | 93.92 |
| S02 | 92.32 | 92.69 | 92.43 | 92.56 |
| S03 | 95.24 | 95.35 | 95.27 | 95.31 |
| S04 | 92.63 | 92.77 | 92.58 | 92.67 |
| S05 | 96.31 | 96.36 | 96.31 | 96.33 |
| S06 | 97.08 | 97.10 | 97.07 | 97.09 |
| S07 | 95.70 | 95.72 | 95.69 | 95.71 |
| S08 | 97.39 | 97.39 | 97.39 | 97.39 |
| S09 | 96.16 | 96.15 | 96.21 | 96.18 |
| S10 | 94.62 | 94.64 | 94.62 | 94.63 |
| S11 | 96.62 | 96.63 | 96.62 | 96.63 |
| S12 | 97.08 | 97.11 | 97.09 | 97.10 |
| S13 | 97.70 | 97.71 | 97.70 | 97.70 |
| S14 | 94.32 | 94.32 | 94.35 | 94.33 |
| S15 | 100.00 | 100.00 | 100.00 | 100.00 |
| ALL | 95.80 | 95.85 | 95.82 | 95.84 |
Funding
This work is supported by the Science and Technology Project of the Education Department of Jilin Province (JJKH20220245KJ, JJKH20220226SK), the National Natural Science Foundation of China Joint Fund Project (U19A200496).
Data availability
The datasets analyzed during the current study are public datasets. They are available at the following URLs: SEED dataset: https://bcmi.sjtu.edu.cn/home/seed/; DEAP dataset: http://www.eecs.qmul.ac.uk/mmv/datasets/deap/.
Declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- Aadam TA, Al-Obeidat F, Halim Z, Waqas M, Qayum F (2022) EmoPercept: EEG-based emotion classification through perceiver. Soft Comput 26:10563–10570. 10.1007/s00500-021-06578-4
- Ahmed F, Bari ASMH, Gavrilova ML (2020) Emotion recognition from body movement. IEEE Access 8:11761–11781. 10.1109/ACCESS.2019.2963113
- Bagherzadeh S, Maghooli K, Shalbaf A, Maghsoudi A (2022) Emotion recognition using effective connectivity and pre-trained convolutional neural networks in EEG signals. Cogn Neurodyn 16:1087–1106. 10.1007/s11571-021-09756-0
- Cheah KH, Nisar H, Yap VV, Lee CY, Sinha GR (2021) Optimizing residual networks and VGG for classification of EEG signals: identifying ideal channels for emotion recognition. J Healthc Eng 2021:5599615. 10.1155/2021/5599615
- Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1724–1734. 10.3115/v1/D14-1179
- Chung J, Gülçehre Ç, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555
- Cui F, Wang R, Ding W, Chen Y, Huang L (2022) A novel DE-CNN-BiLSTM multi-fusion model for EEG emotion recognition. Mathematics 10:582. 10.3390/math10040582
- Du X, Ma C, Zhang G, Li J, Lai YK, Zhao G, Deng X, Liu YJ, Wang H (2022) An efficient LSTM network for emotion recognition from multichannel EEG signals. IEEE Trans Affect Comput 13:1528–1540. 10.1109/TAFFC.2020.3013711
- Duan RN, Zhu JY, Lu BL (2013) Differential entropy feature for EEG-based emotion classification. In: 2013 6th international IEEE/EMBS conference on neural engineering (NER), pp 81–84
- Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–1780. 10.1162/neco.1997.9.8.1735
- Hu J, Shen L, Albanie S, Sun G, Wu E (2020) Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell 42:2011–2023. 10.1109/TPAMI.2019.2913372
- Hwang S, Hong K, Son G, Byun H (2020) Learning CNN features from DE features for EEG-based emotion recognition. Pattern Anal Appl 23:1323–1335. 10.1007/s10044-019-00860-w
- Jia Z, Lin Y, Cai X, Chen H, Gou H, Wang J (2020) SST-EmotionNet: spatial-spectral-temporal based attention 3D dense network for EEG emotion recognition. In: Proceedings of the 28th ACM international conference on multimedia, pp 2909–2917
- Joshi VM, Ghongade RB, Joshi AM, Kulkarni RV (2022) Deep BiLSTM neural network model for emotion detection using cross-dataset approach. Biomed Signal Process Control 73:103407. 10.1016/j.bspc.2021.103407
- Koelstra S, Muhl C, Soleymani M, Lee J, Yazdani A, Ebrahimi T, Pun T, Nijholt A, Patras I (2012) DEAP: a database for emotion analysis using physiological signals. IEEE Trans Affect Comput 3:18–31. 10.1109/T-AFFC.2011.15
- Liu Z-T, Xie Q, Wu M, Cao W-H, Mei Y, Mao J-W (2018) Speech emotion recognition based on an improved brain emotion learning model. Neurocomputing 309:145–156. 10.1016/j.neucom.2018.05.005
- Mithbavkar SA, Shah MS (2021) Analysis of EMG based emotion recognition for multiple people and emotions. In: 2021 IEEE 3rd Eurasia conference on biomedical engineering, healthcare and sustainability (ECBIOS), pp 1–4. 10.1109/ECBIOS51820.2021.9510858
- Müller R, Kornblith S, Hinton G (2019) When does label smoothing help? In: Proceedings of the 33rd international conference on neural information processing systems, pp 4694–4703. 10.5555/3454287.3454709
- Pereira ET, Gomes HM, Veloso LR, Mota MRA (2021) Empirical evidence relating EEG signal duration to emotion classification performance. IEEE Trans Affect Comput 12:154–164. 10.1109/TAFFC.2018.2854168
- Phan TD, Kim SH, Yang HJ, Lee GS (2021) EEG-based emotion recognition by convolutional neural network with multi-scale kernels. Sensors (Basel) 21:5092. 10.3390/s21155092
- Picard RW (2010) Affective computing: from laughter to IEEE. IEEE Trans Affect Comput 1:11–17. 10.1109/T-AFFC.2010.10
- Rahman AU, Tubaishat A, Al-Obeidat F, Halim Z, Tahir M, Qayum F (2022) Extended ICA and M-CSP with BiLSTM towards improved classification of EEG signals. Soft Comput 26:10687–10698. 10.1007/s00500-022-06847-w
- Ramzan M, Dawn S (2021) Fused CNN-LSTM deep learning emotion recognition model using electroencephalography signals. Int J Neurosci. 10.1080/00207454.2021.1941947
- Shen F, Dai G, Lin G, Zhang J, Kong W, Zeng H (2020) EEG-based emotion recognition using 4D convolutional recurrent neural network. Cogn Neurodyn 14:815–828. 10.1007/s11571-020-09634-1
- Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Adv Neural Inf Process Syst 1:568–576
- Song T, Zheng W, Song P, Cui Z (2020) EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Trans Affect Comput 11:532–541. 10.1109/TAFFC.2018.2817622
- Sun J, Wang X, Zhao K, Hao S, Wang T (2022) Multi-channel EEG emotion recognition based on parallel transformer and 3D-convolutional neural network. Mathematics 10:3131. 10.3390/math10173131
- Tahir M, Halim Z, Rahman AU, Waqas M, Tu S, Chen S, Han Z (2022) Non-acted text and keystrokes database and learning methods to recognize emotions. ACM Trans Multimed Comput Commun Appl 18:1–24. 10.1145/3480968
- Topic A, Russo M (2021) Emotion recognition based on EEG feature maps through deep learning network. Eng Sci Technol Int J 24:1442–1454. 10.1016/j.jestch.2021.03.012
- Ullah S, Halim Z (2021) Imagined character recognition through EEG signals using deep convolutional neural network. Med Biol Eng Comput 59:1167–1183. 10.1007/s11517-021-02368-0
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems, Long Beach, California, USA
- Wang Q, Wu B, Zhu P, Li P, Zuo W, Hu Q (2020) ECA-Net: efficient channel attention for deep convolutional neural networks. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 11531–11539. 10.1109/CVPR42600.2020.01155
- Wu G, Liu G, Hao M (2010) The analysis of emotion recognition from GSR based on PSO. In: 2010 international symposium on intelligence information processing and trusted computing, pp 360–363
- Xin R, Miao F, Cong P, Zhang F, Xin Y, Feng X (2023) Multiview feature fusion attention convolutional recurrent neural networks for EEG-based emotion recognition. J Sens 2023:9281230. 10.1155/2023/9281230
- Yang Y, Wu Q, Qiu M, Wang Y, Chen X (2018) Emotion recognition from multi-channel EEG through parallel convolutional recurrent neural network. In: 2018 international joint conference on neural networks (IJCNN), pp 1–7. 10.1109/IJCNN.2018.8489331
- Yin Y, Zheng X, Hu B, Zhang Y, Cui X (2021) EEG emotion recognition using fusion model of graph convolutional neural networks and LSTM. Appl Soft Comput 100:106954. 10.1016/j.asoc.2020.106954
- Zheng W, Lu B (2015) Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Trans Auton Ment Dev 7:162–175. 10.1109/TAMD.2015.2431497
- Zheng W-L, Zhu J-Y, Lu B-L (2019) Identifying stable patterns over time for emotion recognition from EEG. IEEE Trans Affect Comput 10:417–429. 10.1109/TAFFC.2017.2712143
- Zhong P, Wang D, Miao C (2022) EEG-based emotion recognition using regularized graph neural networks. IEEE Trans Affect Comput 13:1290–1301. 10.1109/TAFFC.2020.2994159