Abstract
Through emotion recognition with EEG signals, brain responses can be analyzed to monitor and identify individual emotional states. The success of emotion recognition relies on comprehensive emotion information extracted from EEG signals and on the constructed emotion identification model. In this work, we proposed an innovative approach, called the spatial-spectral-temporal-based convolutional recurrent neural network (CRNN) with lightweight attention mechanisms (SST-CRAM). Firstly, we combined power spectral density (PSD) with differential entropy (DE) features to construct four-dimensional (4D) EEG feature maps that capture richer spatial, spectral, and temporal information. Additionally, a spatial interpolation algorithm was applied to enhance the utilization of this information. Next, the constructed 4D EEG feature maps were fed into a convolutional neural network (CNN) integrated with the convolutional block attention module (CBAM) and the efficient channel attention module (ECA-Net) to extract spatial and spectral features. The CNN learned spatial and spectral information, CBAM prioritized global information to obtain detailed and accurate features, and ECA-Net further highlighted key brain regions and frequency bands. Finally, a bidirectional long short-term memory (LSTM) network was used to explore the temporal correlation of the EEG feature maps for comprehensive feature extraction. To assess the performance of our model, we tested it on the publicly available DEAP dataset. Our model demonstrated excellent performance and achieved high accuracy (98.63% for arousal classification and 98.66% for valence classification). These results indicated that SST-CRAM can fully utilize spatial, spectral, and temporal information to improve emotion recognition performance.
Keywords: EEG, Emotion recognition, Lightweight attention mechanism, Convolutional recurrent neural network
Introduction
Emotion plays a crucial role in various cognitive, decision-making, and social activities, allowing individuals to express their feelings and understand the inner emotions of others (George 2000). Emotion recognition involves the classification of discrete emotional states (Aydin 2020; Kılıç and Aydın 2022), arousal and valence scores (Shen et al. 2020a), and three states (pleasant, unpleasant, and neutral) (Tepelena et al. 2017). Additionally, emotion identification has brought significant improvements in the domains of human–computer interaction (Cowie et al. 2001; Fragopanagos and Taylor 2005), mental disorder diagnosis (Collin et al. 2013; Fieldhouse et al. 2023), fatigue driving detection (Sikander and Anwar 2019; Shang et al. 2022), and mental workload assessment (Bagheri and Power 2020; Giorgi et al. 2021). Emotion recognition research mainly relies on non-physiological signals such as facial expressions (Busso et al. 2004) and body movements (Ahmed et al. 2020), as well as physiological signals including the electroencephalogram (EEG) (Li et al. 2023), electromyogram (EMG) (Kerdjidj et al. 2023), and electrocardiogram (ECG) (Nita et al. 2022). According to neuroscientific studies, several major brain cortical regions are closely associated with emotion (Lotfi and Akbarzadeh-T. 2014). EEG signals, in particular, have definite benefits, including their non-invasiveness, low cost, and high temporal resolution, thus facilitating the objective capture of potential emotional responses (Ehrlich et al. 2017). Consequently, a growing number of investigators have recently focused on EEG-based emotion recognition.
Prior to the emergence of deep learning, the EEG features used for emotion recognition primarily encompassed low-level features in the temporal domain, frequency domain (Koelstra et al. 2012), and time–frequency domain. However, these features alone cannot fully characterize emotions at a fundamental level. Deep learning techniques have recently outpaced conventional machine learning (Goh et al. 2018). Deep learning algorithms have become mainstream technologies in many fields, including image recognition (He et al. 2016), epilepsy detection (Özçelik and Altan 2023a), diabetic retinopathy classification (Özçelik and Altan 2023b), and crop classification (Karasu and Altan 2022), as they can obtain advanced features automatically.
EEG-based emotion recognition has made substantial use of deep learning algorithms, mainly the convolutional neural network (CNN) and recurrent neural network (RNN). Li et al. (2018) organized the differential entropy (DE) from various channels into a 2D mapping, preserved the spatial and spectral information of the EEG signals, and applied a hierarchical convolutional neural network (HCNN) to emotion classification. Li et al. (2019) extracted the power spectral density (PSD) features of EEG signals and captured the temporal dynamics of emotions with the long short-term memory (LSTM) network. Hwang et al. (2020) extracted the DE from EEG and employed LSTM to learn temporal correlations. Wang et al. (2018) designed a 3D CNN to simultaneously extract spatial and temporal features and found that, even with deep learning algorithms, a single extracted feature alone could not capture the comprehensive emotion-related information contained in EEG signals. In addition, a single neural network failed to adaptively focus on the more important features, thus leading to poor emotion recognition performance.
To address the aforementioned issues, feature extraction and classification algorithms have been extensively explored. Shen et al. (2020a) converted DE from different channels into a 4D structure to combine spatial, spectral, and temporal information and then utilized a 4D convolutional recurrent neural network (4DCRNN) for emotion recognition. Zhang et al. (2020) compared the performance of CNN and CNN-LSTM fusion networks based on DEAP dataset and found that the recognition accuracy of the CNN-LSTM fusion network was 4.05% higher than that of CNN alone. Tang et al. (2023) proposed a spatial–temporal information learning network (STILN) composed of CNN and Bi-LSTM based on an attention mechanism with the 2D power topographic map of PSD features as the input and demonstrated that Bi-LSTM performed better than LSTM in capturing temporal dependencies. However, it is challenging to combine different types of information for different types of EEG features and construct a deep learning model capable of capturing the most effective key features for emotion recognition. In this study, we integrated multiple types of information with diverse features and constructed a hybrid deep neural network model to adaptively and accurately capture and identify EEG information related to emotions from important brain regions, frequency bands, and timestamps. The constructed model solved the problems of incomplete extraction of EEG features and the inability to adaptively capture the discriminative information in EEG signals in existing methods and improved the accuracy and stability of EEG emotion recognition.
In this paper, we constructed a novel hybrid emotion recognition model called SST-CRAM based on DE and PSD by integrating spatial, spectral, and temporal information into 4D EEG feature maps. Firstly, we extracted PSD and DE from each channel and frequency band to capture the spectral EEG information. Secondly, we preserved the spatial information through electrode location mapping. Finally, we arranged the EEG feature maps of each time slice in chronological order to explore the correlations and dependencies between different time slices, thus producing the 4D EEG feature maps. Subsequently, with a CNN, we extracted the spatial and spectral information from the 4D EEG feature maps and captured their temporal information with a Bi-LSTM. Besides, we introduced two lightweight attention modules into the CNN to effectively leverage the differences between various brain regions and frequency bands, adaptively capture the crucial information in the EEG, and focus on key emotional information. We tested this method on the DEAP dataset, compared it with other approaches, and demonstrated its superior feature extraction capability and improved accuracy of arousal and valence recognition. The main contributions are summarized below. We proposed a new method called SST-CRAM based on two efficient modules (CBAM and ECA-Net) to achieve effective feature enhancement and integrated two key features (DE and PSD) by constructing fused feature maps; the merged feature maps outperformed those created with DE or PSD alone. In addition, we employed a spatial interpolation algorithm to enhance the EEG feature maps and mitigate the impact of minor electrode position deviations on the maps.
In the subsequent sections, we described the proposed method and presented the utilized dataset, experimental setup, and experimental results. Finally, conclusions are drawn.
Methods
Overview of SST-CRAM
Figure 1 depicts the architecture of SST-CRAM, an advanced framework, for identifying emotions with EEG data. SST-CRAM comprises 4D spatial-spectral-temporal feature maps as the input of the spatial and spectral feature extraction network, the temporal feature extraction network, and a classifier for emotion classification. The parameters are displayed in Table 1.
Fig. 1.
Overview of SST-CRAM. Notes: 32 denotes the width and length of the 3D feature maps; 8 denotes twice the number of frequency bands; and 2T denotes the number of 0.5 s time slices in a segment of length T s
Table 1.
Parameters of SST-CRAM
| Layers | Operations | Kernel size/stride | Padding | Activation function |
|---|---|---|---|---|
| CBAM | Channel and spatial-wise attention | 1 × 1 / 1 | SAME | ReLU |
| Conv1 | Convolutional (2D) | 5 × 5 / 1 | SAME | ReLU |
| Conv2 | Convolutional (2D) | 4 × 4 / 1 | SAME | ReLU |
| Conv3 | Convolutional (2D) | 4 × 4 / 1 | SAME | ReLU |
| Conv4 | Convolutional (2D) | 1 × 1 / 1 | SAME | ReLU |
| MP | MaxPooling (2D) | 2 × 2 / 1 | – | – |
| ECA | Channel-wise attention | 1 × 1 / 1 | SAME | – |
| FC | Linear | – | – | ReLU |
| Bi-LSTM | LSTM | – | – | – |
| Output | Linear | – | – | SoftMax |
Construction of 4D EEG feature maps
We designed the 4D spatial-spectral-temporal EEG feature maps as the input of the proposed SST-CRAM. These feature maps serve as the training data for SST-CRAM and provide an intuitive representation of the spatial, spectral, and temporal information of the EEG. The formation process of the 4D feature maps is shown in Fig. 2. Firstly, based on previous studies (Shen et al. 2020a; Xiao et al. 2022), to increase the number of training samples, N non-overlapping emotional segments of length T s were extracted from each subject's preprocessed EEG signals (sampling frequency 128 Hz). Subsequently, these emotional segments were decomposed into four frequency bands, θ (4–8 Hz), α (8–14 Hz), β (14–31 Hz), and γ (31–51 Hz), with fifth-order Butterworth filters whose low-pass and high-pass cutoff frequencies corresponded to the boundaries of the respective bands. The δ band (0–4 Hz) is not significant in awake individuals (Dadebayev et al. 2022), so it was not considered here. Then, within 0.5 s windows, DE (Topic and Russo 2021) and PSD (Sarma and Barma 2021), which have been proven to be promising and stable features (Zheng et al. 2019), were extracted from each of the four frequency bands to obtain the EEG spectral information.
Fig. 2.
Formation process of 4D EEG feature maps. We generated the 4D EEG feature maps by extracting DE and PSD from four frequency bands of each Ts EEG signal segment with a window of 0.5 s and then transformed the features into 4D EEG feature maps consisting of 2 T slices
DE is a generalization of Shannon entropy to continuous variables. DE can be used to describe the differences in energy distribution and activity patterns between different frequency bands. DE is calculated as:
$$h(X) = -\int_{a}^{b} p(x)\,\log p(x)\,dx \tag{1}$$
where p(x) denotes the probability density function of the continuous variable and [a, b] denotes the range of its values. For a segment with a specific length approximately following a Gaussian distribution N(μ, σ²), DE is equal to the logarithm of its energy spectrum in a specific frequency band and can be expressed as:
$$h(X) = \frac{1}{2}\log\!\left(2\pi e\sigma^{2}\right) \tag{2}$$
where e is the base of the natural logarithm and σ denotes the standard deviation of the time series.
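For reference, Eq. (2) follows directly from substituting the Gaussian density into Eq. (1); this standard derivation is not given in the original text but is sketched below:

```latex
% Substituting p(x) = (2\pi\sigma^2)^{-1/2}\exp\{-(x-\mu)^2/(2\sigma^2)\} into Eq. (1):
\begin{aligned}
h(X) &= -\int_{-\infty}^{\infty} p(x)\,\log p(x)\,dx
      = \mathbb{E}\!\left[\tfrac{1}{2}\log\!\left(2\pi\sigma^{2}\right)
        + \frac{(X-\mu)^{2}}{2\sigma^{2}}\right] \\
     &= \tfrac{1}{2}\log\!\left(2\pi\sigma^{2}\right) + \tfrac{1}{2}
      = \tfrac{1}{2}\log\!\left(2\pi e\sigma^{2}\right).
\end{aligned}
```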
PSD represents the distribution of power across different frequencies within that band and is defined as:
$$P = \frac{1}{N}\sum_{n=1}^{N} x^{2}(n) \tag{3}$$
where x(n), n = 1, 2, …, N, represents the band-limited signal within a window obtained from a certain EEG channel, and N is the number of samples in the window.
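The following is a minimal sketch (not the authors' exact pipeline) of the per-window feature extraction described above: band-pass filtering with fifth-order Butterworth filters, DE computed from the window variance as in Eq. (2), and PSD taken as the average band power within the window, one common choice consistent with Eq. (3). Function names and the array layout are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128  # DEAP sampling rate after downsampling
BANDS = {"theta": (4, 8), "alpha": (8, 14), "beta": (14, 31), "gamma": (31, 51)}

def bandpass(signal, low, high, fs=FS, order=5):
    """Fifth-order Butterworth band-pass filter applied along the time axis."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, signal, axis=-1)

def de_psd_features(segment, win=FS // 2):
    """segment: (n_channels, n_samples). Returns DE and PSD of shape
    (n_windows, n_channels, n_bands), one value per 0.5 s window and band."""
    n_ch, n_samp = segment.shape
    n_win = n_samp // win
    de = np.zeros((n_win, n_ch, len(BANDS)))
    psd = np.zeros_like(de)
    for b_idx, (low, high) in enumerate(BANDS.values()):
        filtered = bandpass(segment, low, high)
        for w in range(n_win):
            chunk = filtered[:, w * win:(w + 1) * win]
            var = chunk.var(axis=1) + 1e-12                           # per-channel variance
            de[w, :, b_idx] = 0.5 * np.log(2 * np.pi * np.e * var)    # Eq. (2)
            psd[w, :, b_idx] = (chunk ** 2).mean(axis=1)              # average band power
    return de, psd
```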
1D vectors $X_f^{D}\in\mathbb{R}^{E}$ and $X_f^{P}\in\mathbb{R}^{E}$ are defined to respectively represent the DE and PSD of all EEG channels in frequency band f, where E = 32 is the number of electrodes and the superscripts D and P indicate the initials of DE and PSD. To preserve the spatial information of the EEG electrodes, spatial mapping was performed on the EEG features based on the electrode positions. $X_f^{D}$ and $X_f^{P}$ were transformed into 2D feature maps $M_f^{D}\in\mathbb{R}^{h\times w}$ and $M_f^{P}\in\mathbb{R}^{h\times w}$, where h and w indicate the height and width of the 2D feature maps and are set to 8 and 9, respectively. As shown in Fig. 3, the left image represents the 32 recording electrodes. Based on their relative positions, the DE and PSD of every frequency band were mapped to the 2D matrix (Fig. 3b). EEG features filled the electrode positions and zeros occupied the remaining positions.
Fig. 3.
a 32 channels of 2D compact mapping and b different colors indicating different brain regions
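A minimal sketch of this electrode-to-grid mapping is shown below: each channel's DE (or PSD) value for one band is placed at its 2D position in an 8 × 9 matrix and the remaining cells are zero-filled. The coordinate dictionary is only a hypothetical excerpt for illustration; the full mapping follows the layout in Fig. 3a.

```python
import numpy as np

GRID_H, GRID_W = 8, 9

# Hypothetical (row, col) positions for a few of the 32 DEAP channels.
CHANNEL_POS = {"Fp1": (0, 3), "Fp2": (0, 5), "F3": (2, 2), "F4": (2, 6)}

def to_2d_map(values, channel_names, positions=CHANNEL_POS):
    """values: one band's feature vector over channels (e.g. X_f^D)."""
    grid = np.zeros((GRID_H, GRID_W))
    for name, v in zip(channel_names, values):
        if name in positions:            # unmapped positions stay zero here
            r, c = positions[name]
            grid[r, c] = v
    return grid
```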
However, the aforementioned method still has some limitations. Filling missing electrode positions with zeros affects the accuracy and reliability of the spatial information and introduces additional noise, thereby interfering with subsequent data processing. Therefore, a spatial interpolation algorithm was used to calculate feature values at the zero-filled positions. This overcame the limitations of zero-filling and enhanced the resolution of the 2D compact maps so that they more accurately reflected the true positions of the electrodes on the head. Specifically, we adopted an improved bicubic-EEG interpolation method (Deng et al. 2021) to interpolate the 2D feature maps of each frequency band of DE and PSD separately. The width and height of the interpolated 2D feature maps were both 32. The principle of the enhanced bicubic-EEG interpolation algorithm is as follows:
The original bicubic interpolation algorithm (Keys 1981) is
$$W(x)=\begin{cases}(a+2)\lvert x\rvert^{3}-(a+3)\lvert x\rvert^{2}+1, & \lvert x\rvert\le 1\\ a\lvert x\rvert^{3}-5a\lvert x\rvert^{2}+8a\lvert x\rvert-4a, & 1<\lvert x\rvert<2\\ 0, & \text{otherwise}\end{cases} \tag{4}$$
where a ∈ [−1, 0]; Keys (1981) achieved the best results with a = −0.5. The point to be interpolated is denoted as (x, y), with known EEG channel points (x_i, y_j) taken from a 3 × 3 neighborhood (i, j = 0, 1, 2). The interpolated value is calculated as:
$$f(x,y)=\sum_{i=0}^{2}\sum_{j=0}^{2} f(x_i,y_j)\,W(x-x_i)\,W(y-y_j) \tag{5}$$
where f(x, y) indicates the value to be filled at (x, y), and f(xi, yj) indicates the value of the EEG channel at (xi, yj).
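A minimal NumPy sketch of Eqs. (4)–(5) is given below: upsampling one 2D compact map with cubic convolution interpolation (Keys 1981, a = −0.5). This uses the standard 4 × 4 neighborhood of the original algorithm rather than the improved bicubic-EEG variant of Deng et al. (2021), so it is an approximation of the procedure used in the paper.

```python
import numpy as np

def keys_kernel(x, a=-0.5):
    """Cubic convolution kernel W(x) of Eq. (4)."""
    x = np.abs(x)
    return np.where(x <= 1, (a + 2) * x**3 - (a + 3) * x**2 + 1,
                    np.where(x < 2, a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a, 0.0))

def bicubic_resize(grid, out_h=32, out_w=32):
    """Resize a 2D feature map (e.g. 8 x 9) to (out_h, out_w) via Eq. (5)."""
    in_h, in_w = grid.shape
    padded = np.pad(grid, 2, mode="edge")             # guard band for border points
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            y = r * (in_h - 1) / (out_h - 1)          # source-grid coordinates
            x = c * (in_w - 1) / (out_w - 1)
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            for j in range(-1, 3):                    # 4 x 4 neighborhood
                for i in range(-1, 3):
                    w = keys_kernel(y - (y0 + j)) * keys_kernel(x - (x0 + i))
                    out[r, c] += padded[y0 + j + 2, x0 + i + 2] * w
    return out
```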
Next, after stacking the interpolated 2D compact maps of all frequency bands for DE and PSD, we obtained 3D DE and 3D PSD feature maps, denoted as $S_t^{D}\in\mathbb{R}^{32\times 32\times f}$ and $S_t^{P}\in\mathbb{R}^{32\times 32\times f}$, respectively. Here, t = 1, 2, …, 2T is the time index and f is the number of frequency bands (f = 4). Furthermore, the 3D DE feature maps and 3D PSD feature maps were concatenated to generate fused feature maps, denoted as $S_t\in\mathbb{R}^{32\times 32\times 2f}$. Finally, to explore the correlation and dependency between different time slices, the fused maps of all 0.5 s intervals in T s were arranged in chronological order, resulting in the 4D EEG feature maps $X\in\mathbb{R}^{32\times 32\times 2f\times 2T}$. Ultimately, X was used as the input of SST-CRAM.
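A minimal sketch of the final assembly step follows: for each 0.5 s slice, the four interpolated DE maps and four interpolated PSD maps are concatenated along the band axis (2f = 8), and the 2T slices are kept in chronological order. The slice-first axis ordering (2T, 32, 32, 8) is an implementation choice, not the authors' stated memory layout.

```python
import numpy as np

def build_4d_map(de_maps, psd_maps):
    """de_maps, psd_maps: arrays of shape (2T, 32, 32, 4) after interpolation.
    Returns the fused 4D EEG feature map of shape (2T, 32, 32, 8)."""
    return np.concatenate([de_maps, psd_maps], axis=-1)
```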
Spatial-spectral feature extraction network
Firstly, spatial and spectral features were extracted from the 4D EEG feature maps. The spatial and spectral feature extraction network consists of a CBAM module, a CNN, and ECA-Net modules; the order, number, and parameters of these modules were determined after extensive tuning and debugging. Deep learning models often rely on a large amount of data, yet only a small number of key fine-grained features are ultimately influential in training. To pinpoint these key fine-grained emotion-related features, we introduced CBAM (Woo et al. 2018), a lightweight and versatile module that can be seamlessly integrated into any CNN architecture and is widely applied to enhance the representation power of CNNs. The composition of CBAM is shown in Fig. 4. The channel attention module (CAM) and the spatial attention module (SAM) are the two sub-modules of CBAM, and Fig. 4a and b respectively show their structures. With the feature map $X\in\mathbb{R}^{C\times H\times W}$ as the input, CBAM sequentially generates a one-dimensional channel attention map $M_c\in\mathbb{R}^{C\times 1\times 1}$ and a two-dimensional spatial attention map $M_s\in\mathbb{R}^{1\times H\times W}$, where W and H are the width and height of the feature maps and C is the number of channels. The refinement process of features with CBAM can therefore be represented as follows:
$$X' = M_c(X)\otimes X \tag{6}$$
$$X'' = M_s(X')\otimes X' \tag{7}$$
where ⊗ indicates element-wise multiplication. During the multiplication, the attention values are duplicated (broadcast) along the corresponding dimensions to match the shape required for the multiplication. $X'$ and $X''$ respectively indicate the channel-refined features and the final output of CBAM. $M_c$ and $M_s$ are expressed as follows:
$$M_c(X)=\sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(X))+\mathrm{MLP}(\mathrm{MaxPool}(X))\big) \tag{8}$$
$$M_s(X)=\sigma\big(f^{1\times 1}([\mathrm{AvgPool}(X);\mathrm{MaxPool}(X)])\big) \tag{9}$$
where σ indicates the sigmoid activation function; MLP indicates the multilayer perceptron; AvgPool and MaxPool respectively indicate average pooling and max pooling; and $f^{1\times 1}$ indicates a 2D convolution. Considering the small size of our feature maps and the need for finer features, the kernel size of this convolution was set to 1 × 1.
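A minimal Keras sketch of a CBAM block as described by Eqs. (6)–(9) is given below: channel attention with a shared MLP over average- and max-pooled descriptors, followed by spatial attention with a 1 × 1 convolution as stated in the text. The reduction ratio and layer arrangement are assumptions, not values reported by the authors.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam_block(x, reduction=8):
    c = x.shape[-1]
    # Channel attention, Eq. (8): shared MLP on pooled descriptors
    avg = layers.GlobalAveragePooling2D()(x)
    mx = layers.GlobalMaxPooling2D()(x)
    mlp1 = layers.Dense(max(c // reduction, 1), activation="relu")
    mlp2 = layers.Dense(c)
    ca = layers.Activation("sigmoid")(layers.Add()([mlp2(mlp1(avg)), mlp2(mlp1(mx))]))
    ca = layers.Reshape((1, 1, c))(ca)
    x = layers.Lambda(lambda z: z[0] * z[1])([x, ca])          # Eq. (6)
    # Spatial attention, Eq. (9): pool over channels, then a 1 x 1 convolution
    avg_s = layers.Lambda(lambda z: tf.reduce_mean(z, axis=-1, keepdims=True))(x)
    max_s = layers.Lambda(lambda z: tf.reduce_max(z, axis=-1, keepdims=True))(x)
    sa = layers.Conv2D(1, kernel_size=1, padding="same",
                       activation="sigmoid")(layers.Concatenate()([avg_s, max_s]))
    return layers.Lambda(lambda z: z[0] * z[1])([x, sa])       # Eq. (7)
```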
Fig. 4.

CBAM structure
The refined features from the CBAM module were then fed into the ECA-CNN module. To further explore discriminative local patterns across brain regions and frequency bands, optimize the representation capacity of the feature maps, and emphasize their importance, we introduced an efficient channel attention module for CNNs, called ECA-Net. ECA-Net (Wang et al. 2020a) is an enhanced SENet (Hu et al. 2020) that avoids the information loss caused by dimensionality reduction in SENet. Instead of reducing dimensionality, ECA-Net achieves cross-channel interaction through a 1D convolution and significantly reduces model complexity. Additionally, ECA-Net adaptively selects the size of the 1D convolution kernel, which determines the coverage range of local cross-channel interaction. Whereas existing methods tend to require more complex attention modules to achieve better performance, ECA-Net improves performance with only a small number of additional parameters. Figure 5 displays the ECA-Net architecture. GAP denotes global average pooling, which compresses the data of dimension C × H × W into C × 1 × 1 and enhances the fusion of each channel feature (Hu et al. 2020). Local cross-channel interaction is captured by considering each channel together with its k neighbors. The convolutional kernel size k indicates the coverage range of local cross-channel interaction, namely the number of neighboring channels involved in the attention prediction for a channel, and is calculated as:
$$k=\left|\frac{\log_{2}C}{\gamma}+\frac{b}{\gamma}\right|_{\mathrm{odd}} \tag{10}$$
where $|\cdot|_{\mathrm{odd}}$ denotes the nearest odd number and b indicates the offset of the linear mapping between k and the number of channels C (Wang et al. 2020b). Typically, the ECA-Net module first uses GAP to aggregate the convolutional features, then determines k and performs the 1D convolution, and finally uses the sigmoid function to learn the attention weights of the local neighboring channels. This structure allows the ECA-Net module to retain the functionality of channel attention with higher efficiency and speed.
Fig. 5.

ECA-Net structure
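A minimal Keras sketch of an ECA block, following the structure described above, is given below: global average pooling, a 1D convolution whose kernel size k is chosen from the channel count via Eq. (10) with γ = 2 and b = 1, and a sigmoid gate. Layer names are assumptions.

```python
import math
from tensorflow.keras import layers

def eca_block(x, gamma=2, b=1):
    c = x.shape[-1]
    t = int(abs((math.log2(c) + b) / gamma))    # Eq. (10): |log2(C)/gamma + b/gamma|
    k = t if t % 2 else t + 1                   # round to the nearest odd kernel size
    gap = layers.GlobalAveragePooling2D()(x)    # (batch, C)
    gap = layers.Reshape((c, 1))(gap)           # treat channels as a length-C sequence
    attn = layers.Conv1D(1, kernel_size=k, padding="same", use_bias=False)(gap)
    attn = layers.Activation("sigmoid")(attn)
    attn = layers.Reshape((1, 1, c))(attn)
    return layers.Lambda(lambda z: z[0] * z[1])([x, attn])   # channel re-weighting
```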
The ECA-CNN module (Fig. 1) includes four convolutional layers, four ECA-Net modules, one max pooling layer, one fully connected layer, five dropout layers, and five batch normalization (BN) layers. The numbers of filters in the four convolutional layers are 64, 128, 256, and 64, with kernel sizes of 5 × 5, 4 × 4, 4 × 4, and 1 × 1, respectively. Zero padding and rectified linear unit (ReLU) activation functions are applied to all convolutional layers. To prevent overfitting, a batch normalization layer and a dropout layer (dropout rate 0.2) are added after each layer. Given the limited dataset and feature information, we placed a single 2 × 2 max pooling layer with a stride of 2 immediately after the fourth convolutional layer to improve the network's resilience while preserving more details. After each convolutional layer, an ECA-Net module is inserted to improve the representation of frequency band correspondences. Finally, the output of the ECA-CNN module is flattened and fed into a fully connected layer with 512 units, so the spatial-spectral feature extraction network produces a 512-dimensional feature vector for each time slice.
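A minimal Keras sketch of this spatial-spectral branch applied to one time slice (32 × 32 × 8) is shown below, reusing the `cbam_block` and `eca_block` sketches above. The exact ordering of BN, dropout, and ECA within each block and the placement of the fifth BN/dropout pair are assumptions; the pooling stride follows the text rather than Table 1.

```python
from tensorflow.keras import layers, models

def spatial_spectral_branch(input_shape=(32, 32, 8)):
    inp = layers.Input(shape=input_shape)
    x = cbam_block(inp)                                    # refine the raw feature map first
    for filters, ksize in [(64, 5), (128, 4), (256, 4), (64, 1)]:
        x = layers.Conv2D(filters, ksize, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.2)(x)
        x = eca_block(x)                                   # channel re-weighting per conv layer
    x = layers.MaxPooling2D(pool_size=2)(x)                # 2 x 2 pooling, stride 2
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.BatchNormalization()(x)                     # fifth BN/dropout pair (placement assumed)
    x = layers.Dropout(0.2)(x)
    return models.Model(inp, x)
```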
Temporal feature extraction network
Because the variations between different time slices of the 4D EEG feature maps carry temporal information, more accurate emotion recognition can be achieved by extracting temporal features from the output of the ECA-CNN module with a bidirectional LSTM (Bi-LSTM) containing 128 memory units.
Compared with a unidirectional LSTM, Bi-LSTM combines the advantages of bidirectionality and the gated structure, allowing better retention and processing of information (Graves et al. 2013). Fig. 6 depicts the structure of Bi-LSTM. The forward network links information from previous sequences to the present sequence, while the backward network links information from future sequences to the current sequence (Zheng and Chen 2021). Finally, the predicted value is calculated as:
$$y_t=\delta\!\left(W\left[\overrightarrow{h}_t;\overleftarrow{h}_t\right]+b\right) \tag{11}$$
where δ indicates the sigmoid function; W denotes the weight matrix; b represents the bias vector associated with the respective weights; and $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are respectively the hidden states of the forward LSTM and the backward LSTM at time step t. The output of the Bi-LSTM integrates the spatial, spectral, and temporal information of an EEG segment with a length of T s. Finally, two softmax classifiers predict the two labels of the 4D EEG feature maps: valence and arousal.
Fig. 6.

Bi-LSTM structure
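A minimal Keras sketch of the temporal branch and classifier is given below: the spatial-spectral branch is applied to each of the 2T slices via TimeDistributed, a bidirectional LSTM with 128 units summarizes the sequence, and a softmax layer predicts one label dimension (valence or arousal). Building one model per label dimension is an assumption about how the two classifiers are realized.

```python
from tensorflow.keras import layers, models

def build_sst_cram(num_slices=10, num_classes=2):
    """num_slices = 2T; T = 5 s with 0.5 s windows gives 10 slices."""
    inp = layers.Input(shape=(num_slices, 32, 32, 8))
    feats = layers.TimeDistributed(spatial_spectral_branch())(inp)   # (batch, 2T, 512)
    seq = layers.Bidirectional(layers.LSTM(128))(feats)              # forward + backward summary
    out = layers.Dense(num_classes, activation="softmax")(seq)       # one label dimension
    return models.Model(inp, out)
```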
Experiments
In this part, we present the dataset used, describe the experimental setup, and report the outcomes of the experiments conducted on the dataset.
DEAP dataset
In the DEAP dataset (Koelstra et al. 2012), video clips were used as stimuli to evoke various emotions. It comprises 40 one-minute video segments, whose emotional categories were annotated with arousal and valence ratings ranging from 1 to 9. These video segments were viewed by 32 participants in total, and EEG signals were recorded with a 32-channel Biosemi ActiveTwo device according to the international 10–20 system. To ascertain whether the intended emotions were successfully elicited, participants self-evaluated their degrees of arousal and valence after each trial. EMG and EOG artifacts were removed before the dataset was released, and the EEG signals were band-pass filtered and downsampled to 128 Hz.
Experimental setup
We set the dropout rate and the training batch size to 0.2 and 64, respectively. The loss function was minimized through gradient descent with the Adam optimizer at a learning rate of 0.001. Furthermore, γ and b in the ECA-Net module were set to 2 and 1, respectively, to achieve a compromise between the number of frequency bands and the size of the convolutional kernels (Wang et al. 2020b). The number of training epochs was set to 200. The model was implemented with Keras and trained on an NVIDIA RTX 3090 GPU. During training, we evaluated the model through five-fold cross-validation and calculated the average classification accuracy (ACC) and standard deviation (STD) of each participant over the five folds. The ACC and STD were then averaged across all participants as the metrics for assessing the performance of the model.
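A minimal sketch of this training configuration is given below: Adam with a learning rate of 0.001, batch size 64, 200 epochs, and five-fold cross-validation per subject. The data loading step and variable names are assumptions; `build_sst_cram` refers to the model sketch in the Methods section.

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.optimizers import Adam

def train_subject(X, y, build_model):
    """X: (n_segments, 2T, 32, 32, 8) feature maps; y: one-hot labels for one dimension."""
    accs = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(X):
        model = build_model()
        model.compile(optimizer=Adam(learning_rate=0.001),
                      loss="categorical_crossentropy", metrics=["accuracy"])
        model.fit(X[train_idx], y[train_idx], batch_size=64, epochs=200, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        accs.append(acc)
    return float(np.mean(accs)), float(np.std(accs))   # per-subject ACC and STD

# Example: mean_acc, std_acc = train_subject(X_subject, y_arousal, build_sst_cram)
```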
Results
In SST-CRAM, the 4D EEG feature maps were the input. In this study, we preserved the spatial relationships among the electrodes through a compact 2D layout with h = 8 and w = 9 and expanded h and w to 32 with the improved bicubic-EEG interpolation algorithm. Combinations of frequency bands enhance one another and work better than any single band (Yang et al. 2018); therefore, we set f = 4. The spatial and temporal information that the model perceives from the EEG signals is influenced by the parameters h, w, and T. We therefore investigated the influence of the 2D feature map dimensions (h and w) and the EEG segment length (T) on recognition accuracy (Shen et al. 2020a). Then, through ablation experiments, we assessed the overall performance of the model and compared it with other models.
Influence of EEG segment length
We explored the effect of the EEG segment length on the emotion recognition results because the segment length determines how much emotional information is contained. Meanwhile, we set h, w, and f to 32, 32, and 4, respectively. The ACC and STD averaged over all participants under various T values are shown in Table 2.
Table 2.
Variations of the performance of SST-CRAM with fragment length
| T (s) | Training samples | Testing samples | Arousal ACC (%) | Arousal STD (%) | Valence ACC (%) | Valence STD (%) | Average ACC (%) | Average STD (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 122,880 | 30,720 | 97.88 | 2.02 | 97.71 | 2.11 | 97.80 | 2.07 |
| 2 | 61,440 | 15,360 | 98.65 | 1.74 | 98.54 | 1.52 | 98.60 | 1.63 |
| 3 | 40,960 | 10,240 | 98.68 | 1.66 | 98.45 | 1.60 | 98.57 | 1.63 |
| 4 | 30,720 | 7680 | 98.60 | 1.82 | 98.67 | 1.36 | 98.64 | 1.59 |
| 5 | 24,576 | 6144 | 98.63 | 1.77 | 98.66 | 1.39 | 98.65 | 1.58 |
| 6 | 20,480 | 5120 | 98.52 | 1.86 | 98.47 | 1.65 | 98.50 | 1.66 |
The bold values in the table indicate the maximum average ACC and the minimum average STD for arousal and valence, respectively. The numbers of training and testing samples varied with T. The best arousal performance (98.68%) was achieved at T = 3, whereas the best valence performance (98.67%) was achieved at T = 4, and the best mean performance over arousal and valence (98.65%) was achieved at T = 5. Additionally, the differences in emotion recognition accuracy among different T values were small: the maximum differences in arousal, valence, and their mean were only 0.80%, 0.96%, and 0.85%, respectively, indicating that the value of T had little effect on the model's ability to extract intrinsic temporal information from the EEG signals. T = 5 was therefore selected for the subsequent experiments.
Influence of different 2D maps
EEG electrodes can be transformed into two types of 2D maps: the compact map (Shen et al. 2020a) and the sparse map (Li et al. 2018). Based on the compact map, we employed feature fusion and the spatial interpolation algorithm to generate 2D compact maps with higher resolution. Figure 7 displays the three types of 2D maps evaluated: the compact map, the sparse map, and the interpolated 2D feature map, with dimensions of 8 × 9, 19 × 19, and 32 × 32, respectively. In this section, we evaluated their performance with SST-CRAM. The sparse map outperformed the non-interpolated compact map, with improvements of 0.82% in the arousal dimension and 0.75% in the valence dimension (Table 3). Moreover, the interpolated feature map achieved the best performance, improving the arousal dimension by 0.59% and the valence dimension by 0.68% compared with the sparse map, because the interpolation algorithm filled in the missing electrode values and produced more complete EEG feature maps. Based on this analysis, the interpolated 2D compact map with h = w = 32 was selected for the subsequent experiments.
Fig. 7.
3 types of 2D maps: a compact map, b sparse map, c compact map with interpolation
Table 3.
Performance comparison of three types of 2D maps
| 2D map | Map shape | Arousal ACC (%) | Arousal STD (%) | Valence ACC (%) | Valence STD (%) |
|---|---|---|---|---|---|
| Compact map | 8 × 9 | 97.22 | 2.40 | 97.23 | 2.02 |
| Sparse map | 19 × 19 | 98.04 | 1.85 | 97.98 | 1.99 |
| Compact map with interpolation | 32 × 32 | 98.63 | 1.77 | 98.66 | 1.39 |
Overall performance
Based on the above analysis, the ideal parameters were h = w = 32 and T = 5. Figure 8 illustrates the overall performance of SST-CRAM on the DEAP dataset. In terms of arousal, the average ACC and STD over the 32 participants were 98.63% and 1.77%, respectively, and 22 participants exceeded the average accuracy. In terms of valence, the average ACC and STD were 98.66% and 1.39%, respectively, and 21 participants exceeded the average accuracy. Participant 17 had the lowest accuracy in both arousal (95.78%) and valence (96.33%), possibly because this participant was absent-minded during the experiment or gave inaccurate subjective feedback afterwards.
Fig. 8.
Overall performance of SST-CRAM: a classification accuracy and b standard deviation
Method comparison
We compared our model with six baseline models on the DEAP dataset in terms of feature type, input feature shape, information contained in the features, and recognition accuracy in valence and arousal, demonstrating the effectiveness of our proposed model. These baseline models are introduced as follows:
Firstly, in LSTM (Alhagry et al. 2017), temporal information can be extracted from raw EEG signals.
Secondly, in 3D-Continuous CNN (Yang et al. 2018), 3D feature structure is constructed from DE extracted from EEG signals and spatial and spectral data are extracted with CNN for EEG emotion recognition.
Thirdly, with EmotioNet (Wang et al. 2018), based on the analysis of the covariance shift of EEG signals, the input features are reshaped into a 3D tensor (2D electrode topology × time samples) and both temporal and spatial information of the EEG signals is extracted with a 3D CNN.
Fourthly, in Casc-CNN-LSTM (Chen et al. 2020), the one-dimensional chain PSD vector sequence of EEG signals is transformed into a two-dimensional mesh matrix sequence and a cascading and parallel mixed convolutional recurrent neural network is introduced to accurately predict the emotional category of each EEG sample.
Fifthly, in 4DCRNN (Shen et al. 2020b), 4D feature map is constructed from the DE extracted from EEG signals and spatial, spectral and temporal data are extracted with CRNN for EEG emotion recognition.
Sixthly, with MT-CNN (Rudakov et al. 2021), DE and PSD extracted from EEG signals are converted into brain maps and the brain maps of all frequency bands are stacked in depth. Then, with a fully convolutional architecture, both spatial and spectral data of EEG signals are extracted.
Table 4 presents the results obtained by our proposed model and the six baseline models. The LSTM model (Alhagry et al. 2017) took raw EEG signals as input and extracted only temporal information, disregarding spatial and spectral information; it therefore performed worse than the models that extracted specific features such as DE and PSD. The 3D-Continuous CNN model (Yang et al. 2018) converted the EEG signals into 3D data but considered only spatial and spectral information and neglected the time series of the EEG signals, achieving accuracies of 90.24% in arousal and 89.25% in valence. EmotioNet analyzed the covariance shift of the EEG signals and reshaped the input features into a 3D tensor (2D electrode topology × time samples), disregarding spectral information. The 4DCRNN model (Shen et al. 2020b) utilized only DE and ignored the internal relationships between frequency bands during feature extraction. The Casc-CNN-LSTM model (Chen et al. 2020) transformed the 1D PSD vector sequence of the EEG signals into a 2D mesh matrix sequence and achieved recognition accuracies of 93.26% in the arousal dimension and 93.64% in the valence dimension. Although the MT-CNN model (Rudakov et al. 2021) combined DE with PSD, it overlooked temporal information.
Table 4.
Performance comparison of baseline models
| Models | Features | Input shapes | Information | Arousal ACC (%) | Arousal STD (%) | Valence ACC (%) | Valence STD (%) |
|---|---|---|---|---|---|---|---|
| LSTM (Alhagry et al. 2017) | Raw EEG signals | – | Temporal | 85.65 | – | 85.45 | – |
| 3D-Continuous CNN (Yang et al. 2018) | DE | 9 × 9 × d | Spatial, spectral | 90.24 | – | 89.25 | – |
| EmotioNet (Wang et al. 2018) | Covariance shift | 7 × 9 × time samples | Spatial, temporal | 73.24 | 3.13 | 72.16 | 3.26 |
| 4DCRNN (Shen et al. 2020a) | DE | 8 × 9 × d × 2T | Spatial, spectral, temporal | 94.58 | 3.69 | 94.22 | 2.61 |
| Casc-CNN-LSTM (Chen et al. 2020) | PSD | 9 × 9 × time samples | Spatial, spectral, temporal | 93.26 | – | 93.64 | – |
| MT-CNN (Rudakov et al. 2021) | DE + PSD | 8 × 9 × 2d | Spatial, spectral | 96.62 | – | 96.28 | – |
| SST-CRAM | DE + PSD | 32 × 32 × 2d × 2T | Spatial, spectral, temporal | 98.63 | 1.77 | 98.66 | 1.39 |
| SST-CRAM (DE) | DE | 32 × 32 × d × 2T | Spatial, spectral, temporal | 95.62 | 3.99 | 95.14 | 2.85 |
| SST-CRAM (PSD) | PSD | 32 × 32 × d × 2T | Spatial, spectral, temporal | 95.89 | 3.26 | 95.40 | 2.69 |
Furthermore, to further investigate the effectiveness of feature fusion, we conducted two additional experiments using SST-CRAM with only DE or only PSD as the input. With DE as the input, the ACC values for arousal and valence were respectively 1.04% and 0.92% higher than those obtained with 4DCRNN. With PSD as the input, the ACC values for arousal and valence were respectively 2.63% and 1.76% higher than those obtained with Casc-CNN-LSTM. The fused DE + PSD input produced the best overall performance, highlighting the effectiveness of the feature fusion strategy.
Ablation experiments
To verify the efficacy of every module in our model, we carried out five ablation tests on the DEAP dataset. All ablation experiments shared the same experimental settings and differed only in the model. Figure 9a and b depict the CNN module and the ECA-CNN module, respectively, and Fig. 9c presents the complete SST-CRAM model, named CBAM-ECA-CNN-Bi-LSTM. To evaluate the contribution of CBAM, we designed ECA-CNN-Bi-LSTM, in which CBAM is removed; to test ECA-Net, we designed CBAM-CNN-Bi-LSTM, in which the ECA-CNN module is replaced with the plain CNN module (Fig. 9d, e). The role of the attention mechanisms was examined by removing CBAM and ECA-Net simultaneously and keeping only CNN and Bi-LSTM, named CNN-Bi-LSTM (Fig. 9f). To validate the temporal representation ability of Bi-LSTM, we further removed all modules except the CNN module (Fig. 9g).
Fig. 9.
Network structures for ablation experiments; a CNN module; b ECA-CNN module; c CBAM-ECA-CNN-Bi-LSTM (SST-CRAM); d CBAM-CNN-Bi-LSTM; e ECA-CNN-Bi-LSTM; f CNN-Bi-LSTM; g Only CNN
Table 5 presents the ACC and STD for arousal and valence in the ablation experiments. The three models with attention mechanisms (CBAM-ECA-CNN-Bi-LSTM, ECA-CNN-Bi-LSTM, and CBAM-CNN-Bi-LSTM) matched or outperformed the models without attention (CNN-Bi-LSTM and Only CNN) in both arousal and valence. Remarkably, the CBAM-CNN-Bi-LSTM model achieved an accuracy of 98.78% in arousal, higher than that of CBAM-ECA-CNN-Bi-LSTM, indicating its strong learning ability for arousal labels; however, its valence accuracy was lower. Because this study aims to recognize both arousal and valence, the overall performance was evaluated as the average over the two dimensions. The complete structure, CBAM-ECA-CNN-Bi-LSTM, achieved the highest average recognition accuracy and the minimum standard deviation. Removing CBAM decreased the average accuracy by 0.16%, removing ECA-Net decreased it by 0.10%, and removing both attention modules decreased it by 0.19%. Furthermore, the model with only CNN performed the worst, with an accuracy drop of 0.30%. In conclusion, the complete SST-CRAM demonstrated higher accuracy and stability in identifying different emotional states, and the ablation experiments validated the important roles of CBAM, ECA-Net, and Bi-LSTM in the proposed EEG emotion classification model.
Table 5.
The results of ablative experiments with different network structures
| Models | Arousal ACC (%) | Arousal STD (%) | Valence ACC (%) | Valence STD (%) | Average ACC (%) | Average STD (%) |
|---|---|---|---|---|---|---|
| CBAM-ECA-CNN-Bi-LSTM (SST-CRAM) | 98.63 | 1.77 | 98.66 | 1.39 | 98.65 | 1.58 |
| ECA-CNN-Bi-LSTM | 98.47 | 1.61 | 98.50 | 1.57 | 98.49 | 1.59 |
| CBAM-CNN-Bi-LSTM | 98.78 | 1.47 | 98.32 | 1.70 | 98.55 | 1.59 |
| CNN-Bi-LSTM | 98.49 | 1.61 | 98.32 | 1.60 | 98.46 | 1.61 |
| Only CNN | 98.44 | 1.64 | 98.25 | 1.89 | 98.35 | 1.77 |
Discussion
Experimental results indicated that SST-CRAM model outperformed most existing emotion recognition methods. In this section, we comprehensively discussed the contributions of key components of SST-CRAM, including spatial interpolation algorithm, the fusion of DE and PSD, CBAM module, and ECA-Net.
First, SST-CRAM achieved the optimal performance by transforming two types of features (DE and PSD) into 4D EEG feature maps that include spatial, spectral, and temporal information. By combining the complementary emotional information of the two features, the deep learning model could learn more emotion-related information and extract more comprehensive features, thereby improving the accuracy of emotion recognition. The model with both DE and PSD outperformed the models using only one of them or raw EEG, such as 3D-Continuous CNN (Yang et al. 2018), 4DCRNN (Shen et al. 2020b), and Casc-CNN-LSTM (Chen et al. 2020), demonstrating the benefit of utilizing multiple effective features. Moreover, the 4D EEG feature maps outperformed the 3D feature structure used in the MT-CNN model because they contain temporal information in addition to spectral information and spatial information mapped from electrode positions across channels.
Second, SST-CRAM achieved better performance in emotion recognition owing to the interpolation of the EEG feature maps with the spatial interpolation algorithm and the selection of an appropriate emotional segment length. Our experimental results indicated that the sparse 2D map performed better than the compact 2D map, with accuracy improvements of 0.82% in arousal and 0.75% in valence. However, the interpolated compact map used in SST-CRAM outperformed the sparse map, improving arousal accuracy and valence accuracy by 0.59% and 0.68%, respectively. Through interpolation of the EEG feature maps, SST-CRAM overcame the limitation of most existing methods, in which missing electrode positions are simply zero-padded during spatial mapping. SST-CRAM could thus provide clearer EEG features to the model and enhance its representation capacity, thereby improving the accuracy of EEG emotion recognition. By adjusting the emotional segment length, SST-CRAM could more accurately capture the contextual information contained in emotional segments and identify different emotional states.
Last, SST-CRAM achieved the optimal performance by constructing a deep model capable of adaptively and accurately capturing crucial EEG emotional information. Compared with 4DCRNN (Shen et al. 2020b) and Casc-CNN-LSTM (Chen et al. 2020), SST-CRAM extracts spatial-spectral features through a CNN and adaptively captures the most emotion-related information in important brain regions and frequency bands through the CBAM and ECA-Net attention mechanisms, respectively. Both CBAM and ECA-Net are lightweight, general-purpose modules that can be seamlessly integrated into any CNN, improving its representation ability at negligible computational cost and thereby the efficiency and accuracy of emotion recognition. In addition, we used BN and dropout layers to improve the convergence speed and generalization ability of the model and to prevent overfitting. We tuned the hyperparameters of the CNN, using smaller convolutional kernels to reduce the computational load and accelerate feature extraction, which contributed to the good performance of SST-CRAM. After the spatial-spectral feature extraction, a bidirectional LSTM was used to explore the temporal correlations contained in the 4D EEG feature maps. The ablation experiments on different models showed that the attention and Bi-LSTM modules played a significant role in boosting the accuracy of SST-CRAM; network modules with explicit responsibilities for extracting the corresponding EEG information improved the overall performance of SST-CRAM.
The high accuracy of SST-CRAM in arousal and valence classification on the DEAP dataset proved its outstanding performance, indicating its potential impact. The results obtained by our proposed method suggested that effective EEG representations and appropriate identification models could fully leverage the advantages of EEG itself. By improving EEG-based emotion recognition, SST-CRAM contributed to the scientific understanding of brain-emotion dynamics and the development of more effective tools and interventions in mental health monitoring, personalized therapy, and human–computer interaction.
Conclusion
In this paper, we proposed a new method called SST-CRAM. 4D EEG feature maps generated from two EEG features (DE and PSD) integrated EEG spatial, spectral, and temporal information and were used as the input of SST-CRAM. Attention mechanisms were incorporated into the CNN for more accurate extraction of EEG information: the CNN module handled the EEG spatial and spectral information, and the CBAM and ECA-Net modules adaptively captured the key features for emotion classification. A temporal feature extraction network consisting of a Bi-LSTM was employed to explore the temporal correlation of the 4D EEG feature maps. The experimental results on the DEAP dataset suggested that our model performed better than all baseline models. Ablation studies confirmed the effectiveness of the fused EEG feature maps generated from DE and PSD and the role of the attention mechanism modules in the proposed model, validating the enhancement brought by the lightweight attention mechanisms. SST-CRAM greatly improved the accuracy of EEG emotion recognition and could also be applied in fields such as motor imagery, epilepsy detection, and image classification. The adaptability and lightweight nature of SST-CRAM allow its integration into wearable devices or mobile apps for continuous mood monitoring and intervention. In future work, we will validate the proposed method on other commonly used public EEG datasets such as DREAMER, SEED, and AMIGOS. Additionally, we will tackle the significant individual discrepancies in EEG signals and explore the utilization of multimodal data in cross-subject emotion recognition.
Acknowledgements
This research was financially supported by the Natural Science Research Council Fund Projects of Rizhao City (RZ2022ZR17).
Data availability
Data will be made available on request.
Declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- Ahmed F, Bari ASMH, Gavrilova ML (2020) Emotion recognition from body movement. IEEE Access 8:11761–11781. 10.1109/ACCESS.2019.2963113
- Alhagry S, Aly A, El-Khoribi RA (2017) Emotion recognition based on EEG using LSTM recurrent neural network. Int J Adv Comput Sci Appl 8(10). 10.14569/IJACSA.2017.081046
- Aydin S (2020) Deep learning classification of neuro-emotional phase domain complexity levels induced by affective video film clips. IEEE J Biomed Health Inform 24:1695–1702. 10.1109/JBHI.2019.2959843
- Bagheri M, Power SD (2020) EEG-based detection of mental workload level and stress: the effect of variation in each state on classification of the other. J Neural Eng 17:056015. 10.1088/1741-2552/abbc27
- Busso C, Deng Z, Yildirim S, et al (2004) Analysis of emotion recognition using facial expressions, speech and multimodal information. In: Proceedings of the 6th International Conference on Multimodal Interfaces. ACM, State College, PA, USA, pp 205–211
- Chen J, Jiang D, Zhang Y, Zhang P (2020) Emotion recognition from spatiotemporal EEG representations with hybrid convolutional recurrent neural networks via wearable multi-channel headset. Comput Commun 154:58–65. 10.1016/j.comcom.2020.02.051
- Collin L, Bindra J, Raju M et al (2013) Facial emotion recognition in child psychiatry: a systematic review. Res Dev Disabil 34:1505–1520. 10.1016/j.ridd.2013.01.008
- Cowie R, Douglas-Cowie E, Tsapatsoulis N et al (2001) Emotion recognition in human-computer interaction. IEEE Signal Process Mag 18:32–80. 10.1109/79.911197
- Dadebayev D, Goh WW, Tan EX (2022) EEG-based emotion recognition: review of commercial EEG devices and machine learning techniques. J King Saud Univ Comput Inf Sci 34:4385–4401. 10.1016/j.jksuci.2021.03.009
- Deng X, Zhu J, Yang S (2021) SFE-Net: EEG-based emotion recognition with symmetrical spatial feature extraction
- Ehrlich S, Guan C, Cheng G (2017) A closed-loop brain-computer music interface for continuous affective interaction. In: 2017 International Conference on Orange Technologies (ICOT). IEEE, Singapore, pp 176–179
- Fieldhouse JLP, Singleton EH, Van Engelen ME et al (2023) Decreased emotion recognition and reduced focus on facial hallmarks in behavioral variant frontotemporal dementia compared to primary psychiatric disorders and controls. Eur J Neurol 30:2222–2229. 10.1111/ene.15837
- Fragopanagos N, Taylor JG (2005) Emotion recognition in human–computer interaction. Neural Netw 18:389–405. 10.1016/j.neunet.2005.03.006
- George JM (2000) Emotions and leadership: the role of emotional intelligence. Human Relations 53:1027–1055. 10.1177/0018726700538001
- Giorgi A, Ronca V, Vozzi A et al (2021) Wearable technologies for mental workload, stress, and emotional state assessment during working-like tasks: a comparison with laboratory technologies. Sensors 21:2332. 10.3390/s21072332
- Goh SK, Abbass HA, Tan KC et al (2018) Spatio-spectral representation learning for electroencephalographic gait-pattern classification. IEEE Trans Neural Syst Rehabil Eng 26:1858–1867. 10.1109/TNSRE.2018.2864119
- Graves A, Mohamed A, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, Vancouver, BC, Canada, pp 6645–6649
- He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Las Vegas, NV, USA, pp 770–778
- Hu J, Shen L, Albanie S et al (2020) Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell 42:2011–2023. 10.1109/TPAMI.2019.2913372
- Hwang S, Hong K, Son G, Byun H (2020) Learning CNN features from DE features for EEG-based emotion recognition. Pattern Anal Appl 23:1323–1335. 10.1007/s10044-019-00860-w
- Karasu S, Altan A (2022) Agricultural crop classification with R-CNN and machine learning methods
- Kerdjidj O, Amara K, Harizi F, Boumridja H (2023) Implementing hand gesture recognition using EMG on the Zynq circuit. IEEE Sens J 23:10054–10061. 10.1109/JSEN.2023.3259150
- Keys R (1981) Cubic convolution interpolation for digital image processing. IEEE Trans Acoust Speech Signal Process 29:1153–1160. 10.1109/TASSP.1981.1163711
- Kılıç B, Aydın S (2022) Classification of contrasting discrete emotional states indicated by EEG based graph theoretical network measures. Neuroinformatics 20:863–877. 10.1007/s12021-022-09579-2
- Koelstra S, Muhl C, Soleymani M et al (2012) DEAP: a database for emotion analysis; using physiological signals. IEEE Trans Affect Comput 3:18–31
- Li D, Wang Z, Wang C et al (2019) The fusion of electroencephalography and facial expression for continuous emotion recognition. IEEE Access 7:155724–155736. 10.1109/ACCESS.2019.2949707
- Li J, Zhang Z, He H (2018) Hierarchical convolutional neural networks for EEG-based emotion recognition. Cogn Comput 10:368–380. 10.1007/s12559-017-9533-x
- Li X, Zhang Y, Tiwari P et al (2023) EEG based emotion recognition: a tutorial and review. ACM Comput Surv 55:1–57. 10.1145/3524499
- Lotfi E, Akbarzadeh-T M-R (2014) Practical emotional neural networks. Neural Netw 59:61–72. 10.1016/j.neunet.2014.06.012
- Nita S, Bitam S, Heidet M, Mellouk A (2022) A new data augmentation convolutional neural network for human emotion recognition based on ECG signals. Biomed Signal Process Control 75:103580. 10.1016/j.bspc.2022.103580
- Özçelik YB, Altan A (2023a) A comparative analysis of artificial intelligence optimization algorithms for the selection of entropy-based features in the early detection of epileptic seizures. In: 2023 14th International Conference on Electrical and Electronics Engineering (ELECO). IEEE, Bursa, Turkiye, pp 1–5
- Özçelik YB, Altan A (2023b) Overcoming nonlinear dynamics in diabetic retinopathy classification: a robust AI-based model with chaotic swarm intelligence optimization and recurrent long short-term memory. Fractal Fract 7:598. 10.3390/fractalfract7080598
- Rudakov E, Laurent L, Cousin V, et al (2021) Multi-task CNN model for emotion recognition from EEG brain maps. In: 2021 4th International Conference on Bio-Engineering for Smart Technologies (BioSMART). IEEE, Paris/Créteil, France, pp 1–4
- Sarma P, Barma S (2021) Emotion recognition by distinguishing appropriate EEG segments based on random matrix theory. Biomed Signal Process Control 70:102991. 10.1016/j.bspc.2021.102991
- Shang Y, Yang M, Cui J et al (2022) Driver emotion and fatigue state detection based on time series fusion. Electronics 12:26. 10.3390/electronics12010026
- Shen F, Dai G, Lin G et al (2020) EEG-based emotion recognition using 4D convolutional recurrent neural network. Cogn Neurodyn 14:815–828. 10.1007/s11571-020-09634-1
- Sikander G, Anwar S (2019) Driver fatigue detection systems: a review. IEEE Trans Intell Transp Syst 20:2339–2352. 10.1109/TITS.2018.2868499
- Tang Y, Wang Y, Zhang X, Wang Z (2023) STILN: a novel spatial-temporal information learning network for EEG-based emotion recognition. Biomed Signal Process Control 85:104999. 10.1016/j.bspc.2023.104999
- Tepelena I, Frantzidis C, Salvari V, et al (2017) Are elderly less responsive to emotional stimuli? An EEG-based study across pleasant, unpleasant and neutral Greek words. In: 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS). IEEE, Thessaloniki, pp 469–474
- Topic A, Russo M (2021) Emotion recognition based on EEG feature maps through deep learning network. Eng Sci Technol Int J 24:1442–1454. 10.1016/j.jestch.2021.03.012
- Wang Q, Wu B, Zhu P, et al (2020a) ECA-Net: efficient channel attention for deep convolutional neural networks
- Wang Q, Wu B, Zhu P, et al (2020b) ECA-Net: efficient channel attention for deep convolutional neural networks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp 11531–11539
- Wang Y, Huang Z, McCane B, Neo P (2018) EmotioNet: a 3-D convolutional neural network for EEG-based emotion recognition. In: 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, Rio de Janeiro, pp 1–7
- Woo S, Park J, Lee J-Y, Kweon IS (2018) CBAM: convolutional block attention module. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y (eds) Computer Vision – ECCV 2018. Springer International Publishing, Cham, pp 3–19
- Xiao G, Shi M, Ye M et al (2022) 4D attention-based neural network for EEG emotion recognition. Cogn Neurodyn 16:805–818. 10.1007/s11571-021-09751-5
- Yang Y, Wu Q, Fu Y, Chen X (2018) Continuous convolutional neural network with 3D input for EEG-based emotion recognition. In: Cheng L, Leung ACS, Ozawa S (eds) Neural Information Processing. Springer International Publishing, Cham, pp 433–443
- Zhang Y, Chen J, Tan JH et al (2020) An investigation of deep learning models for EEG-based emotion recognition. Front Neurosci 14:622759. 10.3389/fnins.2020.622759
- Zheng W-L, Zhu J-Y, Lu B-L (2019) Identifying stable patterns over time for emotion recognition from EEG. IEEE Trans Affect Comput 10:417–429. 10.1109/TAFFC.2017.2712143
- Zheng X, Chen W (2021) An attention-based Bi-LSTM method for visual object classification via EEG. Biomed Signal Process Control 63:102174. 10.1016/j.bspc.2020.102174