Health Information Science and Systems
. 2024 Feb 17;12(1):9. doi: 10.1007/s13755-024-00271-0

Enhanced performance of EEG-based brain–computer interfaces by joint sample and feature importance assessment

Xing Li 1, Yikai Zhang 1, Yong Peng 1,2,, Wanzeng Kong 1,2
PMCID: PMC10874355  PMID: 38375134

Abstract

Electroencephalography (EEG) has been a reliable data source for building brain–computer interface (BCI) systems; however, it is not reasonable to directly perform recognition on the feature vector extracted from multiple EEG channels and frequency bands, due to two deficiencies. One is that EEG signals are weak and non-stationary, so different EEG samples easily differ in quality. The other is that different feature dimensions, corresponding to different brain regions and frequency bands, have different correlations with a given mental task, which has not been sufficiently investigated. To this end, a Joint Sample and Feature importance Assessment (JSFA) model is proposed to simultaneously explore the different impacts of EEG samples and features in mental state recognition, in which the former is based on the self-paced learning technique while the latter is completed by the feature self-weighting technique. The efficacy of JSFA is extensively evaluated on two EEG data sets, i.e., SEED-IV and SEED-VIG; one is a classification task for emotion recognition and the other is a regression task for driving fatigue detection. Experimental results demonstrate that JSFA can effectively identify the importance of different EEG samples and features, leading to enhanced recognition performance of the corresponding BCI systems.

Keywords: EEG, Driving fatigue detection, Emotion recognition, Joint assessment, Sample and feature importance

Introduction

With recent developments in neurotechnology and artificial intelligence, the use of physiological signals in BCI communication has advanced from perception to higher-order cognitive activities [1]. A BCI aims to establish a direct channel to transmit information between the human brain and external devices. Compared with other physiological signals, EEG is more often used in cognitive and neuroscience research since it is closely related to the neural activities of the cerebral cortex. Thanks to rapid progress in weak-signal acquisition and analysis techniques, EEG has played important roles in diverse fields such as healthcare, disease diagnosis and rehabilitation [2]. The EEG signal is acquired by non-invasively measuring, through scalp electrodes, the electrical potentials generated by neural activities, and is characterized by low cost and high temporal resolution.

Studies have shown that EEG signals dominate recent research efforts in physiological signal-based affective computing and fatigue detection. Many machine learning paradigms, such as semi-supervised learning, transfer learning and deep learning, have been employed to classify emotional states from EEG [3]. Transfer learning is a domain-adaptive approach that finds a common space where inter-subject (inter-session) EEG data discrepancies are reduced while discriminative information across different subjects (sessions) is preserved [4]. Deep feature representations usually obtain better recognition results than manually extracted features. For example, by combining long short-term memory and attention networks, the formulated S-LSTM-ATT model can effectively handle time-series EEG data and recognize intrinsic connections and patterns [5]. However, the learned features are less interpretable. Similarly, among the different data modalities used in fatigue detection, EEG has been considered the gold standard, offering more objective detection results [6–8].

The weak and non-stationary properties make EEG data easily contaminated by noise, meaning that the quality of different EEG samples might differ in mental state recognition. Besides, we usually extract EEG features from multiple frequency bands and channels and then concatenate them to form sample vectors for subsequent processing. As a result, features of different dimensions contribute differently to the recognition task. Though some researchers have tried to identify the different functions of different EEG channels and rhythms [9], the metaheuristic optimization method used has a limited theoretical basis. Most existing models in EEG-based BCI research do not simultaneously take the quality of EEG samples and features into consideration, which obviously violates the common sense that different EEG samples as well as different features contribute differently to mental state recognition. More reasonably, it is necessary to weaken the impact of noisy samples and enhance the impact of high-quality samples to improve model robustness. Similarly, the importance of different EEG features should also be explored to improve the model's discriminative ability.

To jointly complete the above mentioned two tasks, we propose a novel JSFA model to jointly measure the importance of EEG samples and features for mental state recognition. Because EEG samples in vector representation are usually arranged as rows (or columns) of the data matrix, JSFA intuitively performs the measurement along both the horizontal and vertical directions. Specifically, we use the self-paced learning technique to gradually incorporate samples into model training from easy to more difficult ones. Meanwhile, feature weighting technique is used to adaptively assign large or small weights to different EEG features according to their different contributions in mental state recognition. By jointly measuring the importance of EEG samples and features, on one hand, the EEG decoding performance of BCI systems is greatly enhanced; on the other hand, the critical EEG frequency bands and channels in characterizing the mental states of respective BCI tasks can be automatically identified according to the correspondence between each EEG feature dimension and the specific frequency band (channel).

In summary, the contributions of this paper are as follows.

  • We propose to jointly measure the importance of samples and features in EEG-based BCI systems. As far as we know, this is the first attempt to improve BCI system performance by investigating the importance of both samples and features. As a result, the importance value of each sample and feature is quantitatively obtained. The detailed JSFA model formulation as well as its optimization method are provided.

  • Beyond improving the recognition performance of BCI systems, JSFA provides a basis for further knowledge discovery from EEG data. That is, according to the correspondence between EEG spectral features and the frequency bands (channels), we automatically identify the specific spatial-frequency EEG patterns of a certain BCI task from the learned feature importance variable.

  • Our proposed JSFA model is flexible enough for both classification and regression tasks, making it competent for diverse EEG-based BCI paradigms. In our experiments, it performs well on two representative BCI paradigms, i.e., emotion recognition and driving fatigue detection, which correspond to classification and regression tasks, respectively.

The rest of this paper is structured as follows. Section “Related works” reviews some recent advances in EEG-based BCI research and related techniques. In section “Method”, we propose the novel JSFA model and solve the derived optimization problem by an efficient iterative algorithm. In section “Experimental studies”, comparative studies are conducted to demonstrate the effectiveness of JSFA by applying it to a synthetic data set, an emotion recognition data set SEED-IV (i.e., a classification task) and a driving fatigue detection data set SEED-VIG (i.e., a regression task). We show the conclusions in section “Conclusion”.

Notations We use lowercase letters, boldface lowercase letters and boldface uppercase letters to denote scalars, vectors, and matrices, respectively. For a matrix $\mathbf{M}$, its $i$-th row and $j$-th column are denoted as $\mathbf{m}^i$ and $\mathbf{m}_j$, respectively. The EEG frequency bands are denoted as Delta, Theta, Alpha, Beta and Gamma.

Related works

Below we review the related works from two aspects, recent advances in EEG-based BCIs with the emphasis on emotion recognition and driving fatigue detection, and the related techniques on sample and feature purification.

EEG-based BCIs

In a narrow sense, EEG-based BCIs aim to establish a new type of information communication and control channel between the brain and external devices by decoding EEG signals and translating them into commands [10]. Typical BCI paradigms include motor imagery, P300, and the steady-state visual evoked potential, among others. The central problem in EEG-based BCIs is how to accurately decode EEG signals, which is usually formulated as a pattern recognition task. In this work, EEG-based emotion recognition and driving fatigue detection are considered in our experiments; that is, we need to determine the emotional states of subjects and the fatigue indices of drivers from EEG signals.

According to the general pipeline of pattern recognition, a typical EEG data analysis procedure includes three consecutive stages: data preprocessing, feature extraction and model learning. Data preprocessing consists of operations such as down-sampling, filtering and artifact removal, in order to provide clean and reliable EEG data for subsequent analysis [11]. Then, EEG features can be extracted from different domains to depict the abundant characteristics of EEG data [12]. Time-domain features, such as event-related potentials, statistics, energy, power, and high-order zero-crossing analysis, are the most intuitive because raw EEG data is multi-channel time-series data. Research has shown that frequency-domain features, such as the power spectral density (PSD), event-related synchronization (desynchronization), high-order spectrum, and differential entropy (DE), are more stable than time-domain ones [13]. Sometimes, time–frequency features are necessary to capture frequency information that varies over time, which can be achieved by wavelet transformation. To exploit the multi-channel property, connectivity features have been developed to utilize spatial information. For example, the differential asymmetry and rational asymmetry features respectively explore the difference and ratio of features on symmetric electrodes of the left and right hemispheres. Brain networks that encode functional connectivity among electrodes also provide useful information for EEG decoding [14].

On machine learning-based EEG feature transformation and mental state recognition, considerable efforts have been made in the past decades. We can roughly divide the existing models into two categories: linear and nonlinear ones. The nonlinear models are mainly implemented by the kernel trick or neural networks. To explore the complementary information among the spatial, temporal and spectral domains of EEG signals, a spatio-temporal-spectral network termed STSNet was proposed for subject-independent EEG-based emotion recognition [15]. In [16], the random vector functional link network was extended to semi-supervised learning by jointly optimizing the model variables and estimating the emotional states of the unlabeled EEG samples. In [17], a classification procedure was proposed by combining correlation-based feature selection and a k-nearest neighbor classifier for EEG-based attention recognition. In [18], by incorporating an adaptive graph learning strategy into the semi-supervised regression framework, the authors achieved average results of 78.18%, 80.55% and 81.99% on the three subject-dependent cross-session emotion recognition tasks. By exploring the label-common and label-specific EEG features in cross-session EEG emotion recognition, the proposed JCSFE model not only obtained improved recognition performance but also provided data-driven EEG spatial-frequency activation patterns [19]. Because EEG data discrepancies usually appear in cross-subject (cross-session) scenarios, transfer learning models have been widely used to enhance model universality in motor imagery [20], emotion recognition [21, 22], driving fatigue detection [23], and epileptic recognition [24]. It is worth mentioning that some deep learning models unify the feature extraction and recognition stages, leading to an end-to-end mode. Though deep models exhibit promising performance in diverse BCI applications, their interpretability still needs to be improved [25].
Recent advances in emotion recognition and driving fatigue detection can be found in [26, 27], respectively.

Related techniques

Based on the fact that EEG features are often extracted from multiple frequency bands and channels, they should have different impacts in mental state recognition. In [28], a unified framework was proposed for feature importance learning, which is well suited for evaluating the importance of EEG frequency bands and channels in emotion recognition. EEG features originating from different brain regions have different correlations with drowsiness, based on which a softmax feature weighting technique was incorporated into episodic training for driver drowsiness estimation [29]. In [30], the authors used a short-tailed Gaussian function to weight the common spatial pattern features rather than discarding unreliable features for EEG-based motor imagery.

Besides, EEG data is weak and easily contaminated by noise, which sometimes makes the obtained EEG samples unreliable for subsequent analysis. Some robust learning methods have been proposed to enhance model robustness in (but not limited to) EEG data analysis. Peng et al. [31] proposed a robust face recognition model based on structured sparse representation, whose objective was optimized under the half-quadratic framework. To handle low-quality test data, a neural process method was proposed to more robustly estimate vigilance from EEG [32]. Recently, Meng et al. proposed a robust learning theory for modeling and overcoming noise by self-paced learning (SPL) [33, 34], inspired by the fact that easy concepts should be taught before difficult ones in teaching–learning activities. Following this idea, machine learning models should be trained in a self-paced fashion, that is, by gradually incorporating samples from easy to difficult. Since then, the SPL technique has been an effective tool for improving the robustness of existing models such as matrix factorization [35], feature selection [36] and multi-view learning [37]. In this paper, we mainly rely on the SPL technique for sample quality assessment.

Method

This section first presents the JSFA model formulation and then its optimization. Besides, discussions are provided to illustrate the rationality of SPL-based robust learning and the use of EEG feature weights in spatial-frequency pattern analysis.

Model formulation

Without loss of generality, below we take the classification paradigm as an example to state the EEG decoding settings. We are given the training samples $\mathbf{X}=[\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_n]\in\mathbb{R}^{d\times n}$, where $n$ and $d$ respectively denote the number of training samples and the feature dimensionality. The corresponding label matrix is $\mathbf{Y}=[\mathbf{y}_1;\mathbf{y}_2;\ldots;\mathbf{y}_n]\in\mathbb{B}^{n\times c}$, where $c$ represents the number of classes. Each $\mathbf{y}_i\in\mathbb{B}^{1\times c}$ ($i=1,\ldots,n$) uses one-hot encoding to indicate the category of the $i$-th sample.

As shown in Fig. 1, we propose to measure the importance along both the sample and feature directions. Concretely, we use $v_i$ ($i=1,\ldots,n$) to characterize the importance of the $n$ samples $\mathbf{x}_i$ and $\theta_j$ ($j=1,\ldots,d$) to depict the importance of the $d$ EEG features. Then, we should learn both vectors $\mathbf{v}\in\mathbb{R}^{n}$ and $\boldsymbol{\theta}\in\mathbb{R}^{d}$ (which satisfies $\boldsymbol{\theta}\ge 0$ and $\mathbf{1}^T\boldsymbol{\theta}=1$) from the given EEG data. To this end, we treat both vectors as variables and incorporate them into the least squares regression formulation due to its simplicity and efficacy, and propose the following objective function of our JSFA model:

$$\min_{\mathbf{W},\mathbf{v},\mathbf{b},\boldsymbol{\theta}}\ \frac{C}{2}\sum_{i=1}^{n} v_i\left\|\mathbf{x}_i^T\boldsymbol{\Theta}\mathbf{W}+\mathbf{b}^T-\mathbf{y}_i\right\|_2^2+\frac{1}{2}\|\mathbf{W}\|_2^2+f(\lambda,\mathbf{v}),\quad \text{s.t.}\ \mathbf{v}\ge 0,\ \boldsymbol{\theta}\ge 0,\ \mathbf{1}^T\boldsymbol{\theta}=1. \tag{1}$$

In Eq. (1), $\boldsymbol{\Theta}\in\mathbb{R}^{d\times d}$ is a diagonal matrix whose $j$-th diagonal element is defined as $\Theta_{jj}=\sqrt{\theta_j}$ ($j=1,\ldots,d$). $\mathbf{W}\in\mathbb{R}^{d\times c}$ and $\mathbf{b}\in\mathbb{R}^{c}$ are respectively the slope and intercept variables of the least squares regression. $C$ is a parameter to depict the impact of the first term in objective function (1). $f(\lambda,\mathbf{v})$ is a regularization term associated with variable $\mathbf{v}$ and parameter $\lambda$, which determines how the EEG samples are incorporated into model learning and how their importance values are calculated. Formally, this term is called the self-paced function since it controls the learning pace through the age parameter $\lambda$ ($\lambda>0$). In this paper, we define its exact form according to the linear self-paced regularization proposed in [34]. That is,

$$f(\lambda,\mathbf{v})=\frac{\lambda}{2}\sum_{i=1}^{n}\left(v_i^2-2v_i\right),\quad \text{s.t.}\ v_i\ge 0,\ i=1,\ldots,n. \tag{2}$$

By introducing an intermediate variable $\mathbf{A}$ to replace $\boldsymbol{\Theta}\mathbf{W}$, Eq. (1) can be rewritten as

$$\min_{\mathbf{A},\mathbf{v},\mathbf{b},\boldsymbol{\theta}}\ \frac{C}{2}\sum_{i=1}^{n} v_i\left\|\mathbf{x}_i^T\mathbf{A}+\mathbf{b}^T-\mathbf{y}_i\right\|_2^2+\frac{1}{2}\left\|\boldsymbol{\Theta}^{-1}\mathbf{A}\right\|_2^2+f(\lambda,\mathbf{v}),\quad \text{s.t.}\ \mathbf{v}\ge 0,\ \boldsymbol{\theta}\ge 0,\ \mathbf{1}^T\boldsymbol{\theta}=1. \tag{3}$$

When $\mathbf{v}$, $\mathbf{A}$ and $\mathbf{b}$ are fixed, based on the definition of $\Theta_{jj}$ and the constraint $\mathbf{1}^T\boldsymbol{\theta}=1$, we have the following equation

$$\min_{\boldsymbol{\theta}\ge 0,\,\mathbf{1}^T\boldsymbol{\theta}=1}\left\|\boldsymbol{\Theta}^{-1}\mathbf{A}\right\|_2^2=\min_{\boldsymbol{\theta}\ge 0,\,\mathbf{1}^T\boldsymbol{\theta}=1}\sum_{j=1}^{d}\frac{\|\mathbf{a}^j\|_2^2}{\theta_j}. \tag{4}$$

The corresponding Lagrangian function with respect to $\boldsymbol{\theta}$ is

$$\mathcal{L}(\boldsymbol{\theta})=\sum_{j=1}^{d}\frac{\|\mathbf{a}^j\|_2^2}{\theta_j}+\eta\left(\mathbf{1}^T\boldsymbol{\theta}-1\right)+\boldsymbol{\theta}^T\boldsymbol{\beta}, \tag{5}$$

where $\eta\in\mathbb{R}$ and $\boldsymbol{\beta}\in\mathbb{R}^{d}$ are two Lagrange multipliers. By setting $\partial\mathcal{L}(\boldsymbol{\theta})/\partial\theta_j$ to zero and taking the normalization constraint on $\boldsymbol{\theta}$ into consideration, the solution to $\boldsymbol{\theta}$ is calculated as

$$\theta_j=\frac{\|\mathbf{a}^j\|_2}{\sum_{j'=1}^{d}\|\mathbf{a}^{j'}\|_2}. \tag{6}$$

Then, we have the equivalent form of Eq. (4) as

$$\min_{\boldsymbol{\theta}\ge 0,\,\mathbf{1}^T\boldsymbol{\theta}=1}\left\|\boldsymbol{\Theta}^{-1}\mathbf{A}\right\|_2^2=\|\mathbf{A}\|_{2,1}^2. \tag{7}$$

Now, Eq. (3) can be rewritten as

$$\min_{\mathbf{A},\mathbf{v},\mathbf{b}}\ \frac{C}{2}\sum_{i=1}^{n} v_i\left\|\mathbf{x}_i^T\mathbf{A}+\mathbf{b}^T-\mathbf{y}_i\right\|_2^2+\frac{1}{2}\|\mathbf{A}\|_{2,1}^2+f(\lambda,\mathbf{v}). \tag{8}$$
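As a quick numerical sanity check of the derivation above (a sketch using a small hypothetical matrix $\mathbf{A}$, not taken from the paper), the closed-form weights of Eq. (6) indeed attain the minimum of the weighted objective in Eq. (4), which equals the squared $\ell_{2,1}$-norm of Eq. (7):

```python
import numpy as np

# Hypothetical small matrix A (rows correspond to EEG feature dimensions).
A = np.array([[3.0, 4.0],   # ||a^1||_2 = 5
              [0.0, 2.0],   # ||a^2||_2 = 2
              [1.0, 0.0]])  # ||a^3||_2 = 1

row_norms = np.linalg.norm(A, axis=1)
theta = row_norms / row_norms.sum()      # Eq. (6): theta_j proportional to 5, 2, 1
weighted = np.sum(row_norms ** 2 / theta)  # objective of Eq. (4) at the optimum
l21_sq = row_norms.sum() ** 2              # ||A||_{2,1}^2, Eq. (7)
print(weighted, l21_sq)  # 64.0 64.0
```

Changing any $\theta_j$ away from the normalized row norms only increases the weighted sum, which is why the minimum collapses to $\|\mathbf{A}\|_{2,1}^2$.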

Fig. 1. The overall architecture of JSFA

Model optimization

We propose to optimize v, A and b in objective function (8) by the alternating direction method; that is, one variable is updated with the others fixed. Below the detailed derivations are provided for each of the three variables.

  • Update $\mathbf{v}$ with $\mathbf{A}$ and $\mathbf{b}$ fixed. The objective function $O(\mathbf{v})$ defined on $\mathbf{v}$ is
    $$\min_{\mathbf{v}}\ \frac{C}{2}\sum_{i=1}^{n} v_i\left\|\mathbf{x}_i^T\mathbf{A}+\mathbf{b}^T-\mathbf{y}_i\right\|_2^2+f(\lambda,\mathbf{v}). \tag{9}$$
    To simplify the following notations, we use $\ell_i$ to denote the squared regression loss $\|\mathbf{x}_i^T\mathbf{A}+\mathbf{b}^T-\mathbf{y}_i\|_2^2$ on EEG sample $\mathbf{x}_i$. By merging parameter $C$ with $\lambda$ into a new one [38], we rewrite $O(\mathbf{v})$ as
    $$\min_{\mathbf{v}}\ \sum_{i=1}^{n} v_i\ell_i+\frac{\lambda}{2}\sum_{i=1}^{n}\left(v_i^2-2v_i\right),\quad \text{s.t.}\ v_i\ge 0,\ i=1,\ldots,n. \tag{10}$$
    By calculating the derivative of Eq. (10) with respect to $v_i$ and setting it to zero, we have
    $$\frac{\partial O(v_i)}{\partial v_i}=\ell_i+\lambda v_i-\lambda=0. \tag{11}$$
    It is easy to verify that the closed-form solution to $v_i$ is
    $$v_i=\begin{cases}1-\dfrac{\ell_i}{\lambda}, & \ell_i<\lambda;\\[4pt] 0, & \ell_i\ge\lambda.\end{cases} \tag{12}$$
  • Update $\mathbf{b}$ with $\mathbf{A}$ and $\mathbf{v}$ fixed. Now objective function (8) degenerates to
    $$\min_{\mathbf{b}}\ \frac{C}{2}\sum_{i=1}^{n} v_i\left\|\mathbf{x}_i^T\mathbf{A}+\mathbf{b}^T-\mathbf{y}_i\right\|_2^2. \tag{13}$$
    Denote $\mathbf{U}=\mathrm{diag}(\sqrt{v_1},\ldots,\sqrt{v_n})$ and we obtain the more compact matrix form of the above equation as
    $$\min_{\mathbf{b}}\ \left\|\mathbf{U}\left(\mathbf{X}^T\mathbf{A}+\mathbf{1}\mathbf{b}^T-\mathbf{Y}\right)\right\|_2^2. \tag{14}$$
    By calculating the partial derivative of the above equation with respect to $\mathbf{b}$ and setting it to 0, the updating rule for the intercept variable $\mathbf{b}$ is
    $$\mathbf{b}=\left(\mathbf{H}^T\mathbf{H}\right)^{-1}\left(\mathbf{T}-\mathbf{G}\mathbf{A}\right)^T\mathbf{H}, \tag{15}$$
    where $\mathbf{H}=\mathbf{U}\mathbf{1}$, $\mathbf{T}=\mathbf{U}\mathbf{Y}$ and $\mathbf{G}=\mathbf{U}\mathbf{X}^T$.
  • Update $\mathbf{A}$ with $\mathbf{v}$ and $\mathbf{b}$ fixed. The objective function defined on variable $\mathbf{A}$, i.e., $O(\mathbf{A})$, is
    $$\min_{\mathbf{A}}\ \frac{C}{2}\sum_{i=1}^{n} v_i\left\|\mathbf{x}_i^T\mathbf{A}+\mathbf{b}^T-\mathbf{y}_i\right\|_2^2+\frac{1}{2}\|\mathbf{A}\|_{2,1}^2. \tag{16}$$
    To avoid the singularity problem caused when the $\ell_2$-norm of a certain row of $\mathbf{A}$ is zero, we regularize $\|\mathbf{A}\|_{2,1}^2$ as $\big(\sum_{j=1}^{d}\sqrt{\|\mathbf{a}^j\|_2^2+\epsilon}\big)^2$, where $\epsilon>0$ is a small enough constant. Then, we have
    $$\min_{\mathbf{A}}\ \frac{C}{2}\left\|\mathbf{U}\left(\mathbf{X}^T\mathbf{A}+\mathbf{1}\mathbf{b}^T-\mathbf{Y}\right)\right\|_2^2+\frac{1}{2}\Bigg(\sum_{j=1}^{d}\sqrt{\|\mathbf{a}^j\|_2^2+\epsilon}\Bigg)^2. \tag{17}$$
    According to the previous definitions of $\mathbf{G}$, $\mathbf{H}$ and $\mathbf{T}$, we rewrite Eq. (17) as
    $$O(\mathbf{A})=\frac{C}{2}\left\|\mathbf{G}\mathbf{A}+\mathbf{H}\mathbf{b}^T-\mathbf{T}\right\|_2^2+\frac{1}{2}\mathrm{Tr}\left(\mathbf{A}^T\mathbf{D}\mathbf{A}\right), \tag{18}$$
    where $\mathbf{D}\in\mathbb{R}^{d\times d}$ is a diagonal matrix whose $j$-th diagonal element is calculated as
    $$d_{jj}=\frac{\sum_{p=1}^{d}\sqrt{\|\mathbf{a}^p\|_2^2+\epsilon}}{\sqrt{\|\mathbf{a}^j\|_2^2+\epsilon}}. \tag{19}$$
    Similarly, by calculating $\partial O(\mathbf{A})/\partial\mathbf{A}$ and setting its value to zero, we have
    $$C\left(\mathbf{G}^T\mathbf{G}\mathbf{A}+\mathbf{G}^T\mathbf{H}\mathbf{b}^T-\mathbf{G}^T\mathbf{T}\right)+\mathbf{D}\mathbf{A}=0. \tag{20}$$
    Thus, the updating rule for $\mathbf{A}$ is
    $$\mathbf{A}=\left(\mathbf{G}^T\mathbf{G}+\frac{\mathbf{D}}{C}\right)^{-1}\mathbf{G}^T\left(\mathbf{T}-\mathbf{H}\mathbf{b}^T\right). \tag{21}$$

Since A and D are mutually involved in respective solutions, we propose to optimize variable A iteratively, as shown in Algorithm 1. Based on the above derivations, we summarize the complete optimization procedure to JSFA objective function in Algorithm 2.
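Since the algorithms themselves appear only as figures here, the following NumPy sketch illustrates how the three updates can be interleaved. The function name `jsfa_fit` and the toy data are our own, not from the paper; $\mathbf{U}$ is taken as $\mathrm{diag}(\sqrt{v_i})$ so that the Frobenius loss reproduces the $v_i$-weighted loss, and the inner loop alternates Eq. (19) and Eq. (21) as in Algorithm 1:

```python
import numpy as np

def jsfa_fit(X, Y, C=1.0, lam=2.0, eps=1e-8, outer=20, inner=30):
    """Sketch of Algorithm 2: alternately update v (Eq. 12), b (Eq. 15)
    and A (Eqs. 19 and 21) for objective (8)."""
    d, n = X.shape
    c = Y.shape[1]
    A, b = np.zeros((d, c)), np.zeros((c, 1))
    for _ in range(outer):
        # Update v with A, b fixed: closed form of Eq. (12).
        losses = ((X.T @ A + b.T - Y) ** 2).sum(axis=1)
        v = np.where(losses < lam, 1.0 - losses / lam, 0.0)
        U = np.diag(np.sqrt(v))  # ||U(.)||_F^2 reproduces the v_i-weighted loss
        G, H, T = U @ X.T, U @ np.ones((n, 1)), U @ Y
        # Update b with A, v fixed: Eq. (15); guard against all-zero v.
        b = ((T - G @ A).T @ H) / max(float((H ** 2).sum()), eps)
        # Update A with v, b fixed: inner loop of Algorithm 1.
        for _ in range(inner):
            s = np.sqrt((A ** 2).sum(axis=1) + eps)   # sqrt(||a^j||^2 + eps)
            D = np.diag(s.sum() / s)                  # Eq. (19)
            A = np.linalg.solve(G.T @ G + D / C, G.T @ (T - H @ b.T))  # Eq. (21)
    theta = np.sqrt((A ** 2).sum(axis=1))
    theta = theta / theta.sum()                       # feature importance, Eq. (6)
    return A, b, v, theta

# Toy 3-class problem: class k has mean 2*e_k in the first three dimensions.
rng = np.random.default_rng(0)
n, d, c = 60, 5, 3
labels = rng.integers(0, c, n)
M = np.zeros((d, c))
M[np.arange(c), np.arange(c)] = 2.0
X = 0.3 * rng.normal(size=(d, n)) + M[:, labels]
Y = np.eye(c)[labels]

A, b, v, theta = jsfa_fit(X, Y, C=10.0, lam=2.0)
pred = (X.T @ A + b.T).argmax(axis=1)
print((pred == labels).mean())  # high training accuracy on this toy problem
```

Note that `lam` must exceed the initial losses (here, 1 for one-hot targets at $\mathbf{A}=\mathbf{0}$), otherwise every sample would be excluded at the first iteration.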

Algorithm 1. The algorithm to solve sub-objective (16)

Algorithm 2. The algorithm to optimize objective (8)

Below we analyze the computational complexity of using Algorithm 2 to optimize the JSFA objective function. In big-$O$ notation, the complexity of updating variable $\mathbf{v}$ is $O(dnc+nc)$, and that of updating $\mathbf{b}$ is $O(dnc+nc+n)$. Updating the projection matrix $\mathbf{A}$ consumes $O(d^3+d^2n+dnc+d^2c+n^2d)$. Considering that in the general case $n$ is larger than $d$ and $c$, the overall complexity of JSFA is $O(tn^2d)$, where $t$ is the number of optimization iterations.

Discussions on JSFA

Below we provide a brief explanation of the rationality of JSFA in joint sample and feature importance assessment.

We know that $v_i$ ($i=1,\ldots,n$) reflects the importance of the $i$-th sample. Obviously, $v_i=0$ means that the weighted loss of the $i$-th sample is zero; that is, this sample is not involved in model learning. Conventionally, all samples are treated equally, meaning that they share the same weight of one. The weighted loss of the $i$-th sample, $v_i\ell_i$, is decreased when the $i$-th sample is identified as a noisy sample and $v_i$ is assigned a small value. As pointed out by [39], the latent self-paced learning loss under the linear regularizer is

$$F_\lambda^L(\ell)=\begin{cases}\ell-\dfrac{\ell^2}{2\lambda}, & \ell<\lambda;\\[4pt] \dfrac{\lambda}{2}, & \ell\ge\lambda,\end{cases} \tag{22}$$

whose graphical illustration is provided in Fig. 2; we omit the subscript $i$ here for succinctness. Essentially, when $\lambda=\infty$, $F_\lambda^L(\ell)$ degenerates to the original least squares loss. However, when $\lambda$ is set to a reasonable value, we can easily observe the evident suppressing effect of $F_\lambda^L(\ell)$: once the loss exceeds the given threshold, $F_\lambda^L(\ell)$ becomes a constant [31]. This explains how SPL deals with outliers or heavy noise so as to improve model robustness. Specifically, if the loss value of a sample is larger than the age parameter, it has little influence on model training because of its zero gradient; in other words, the importance values $v_i$ of such samples are zero, so they have no influence on the model optimization.
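The suppressing effect of Eq. (22) can be seen directly by evaluating the latent loss; a minimal sketch (the loss values below are arbitrary):

```python
def latent_spl_loss(loss, lam):
    """Latent linear-SPL loss of Eq. (22): quadratic below lam, flat above."""
    return loss - loss ** 2 / (2 * lam) if loss < lam else lam / 2

lam = 1.0
print([round(latent_spl_loss(l, lam), 3) for l in (0.2, 0.8, 1.5, 10.0)])
# [0.18, 0.48, 0.5, 0.5] -> losses beyond lam are capped at lam/2
```

However large an outlier's residual grows, its contribution saturates at $\lambda/2$, which is exactly the robustness mechanism described above.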

Fig. 2. Graphical illustration of the latent self-paced loss $F_\lambda^L(\ell)$

Once the model training of JSFA is completed, the learned variable $\boldsymbol{\theta}$ acts as the quantitative importance of features; specifically, $\theta_i$ depicts the importance value of the $i$-th EEG feature. As shown in Fig. 3, for the widely used spectral features (i.e., power spectral density and differential entropy), each EEG feature dimension always corresponds to a certain frequency band and channel. If an EEG data set has $M$ frequency bands and $R$ channels in total, and we form the sample vector by concatenating the $R$ channel features of each of the $M$ frequency bands, then we can quantify the importance value of the $m$-th ($m=1,\ldots,M$) frequency band as

$$\phi(m)=\theta_{(m-1)R+1}+\theta_{(m-1)R+2}+\cdots+\theta_{mR}. \tag{23}$$

Similarly, we can obtain the quantitative importance of the $r$-th ($r=1,\ldots,R$) channel by

$$\psi(r)=\theta_r+\theta_{r+R}+\cdots+\theta_{r+(M-1)R}. \tag{24}$$

Recent studies have pointed out that identifying the critical frequency bands and channels not only provides more insight into the task-related EEG spatial-frequency activation patterns, but also lays a foundation for simplifying the hardware design of task-specific EEG acquisition devices in the future.
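Assuming features are concatenated band by band as described, Eqs. (23) and (24) reduce to row and column sums of $\boldsymbol{\theta}$ reshaped into an $M\times R$ grid. A small sketch with a randomly generated, hypothetical $\boldsymbol{\theta}$ (the SEED-IV dimensions $M=5$, $R=62$ are used only for illustration):

```python
import numpy as np

M, R = 5, 62                           # frequency bands and channels, as in SEED-IV
rng = np.random.default_rng(2)
theta = rng.dirichlet(np.ones(M * R))  # hypothetical learned importance, 1^T theta = 1

grid = theta.reshape(M, R)             # row m-1 holds the R channel weights of band m
phi = grid.sum(axis=1)                 # Eq. (23): band importance
psi = grid.sum(axis=0)                 # Eq. (24): channel importance
print(phi.sum(), psi.sum())            # both sum to 1, inherited from theta
```

Because the reshape is row-major, `phi[m-1]` collects exactly the entries $\theta_{(m-1)R+1},\ldots,\theta_{mR}$ of Eq. (23).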

Fig. 3. Graphical illustration of the feature importance-based identification of EEG spatial-frequency activation patterns

Experimental studies

This section conducts experiments to evaluate the effectiveness of the proposed JSFA model. This work involved human subjects in its research. Approval of all ethical and experimental procedures and protocols was granted by the Research Ethics Committee of Shanghai Jiao Tong University under Protocol No. 2017060.

Experiments on synthetic data

We first explain how the synthetic data set is constructed and then conduct experiments to illustrate the effectiveness of JSFA on joint sample and feature importance assessment.

As shown in Fig. 4a, the three Gaussian-distributed clusters in different colors correspond to three different classes, each of which consists of 150 data points. Using the one-versus-one mode, we trained three least squares regression (LSR) classifiers whose decision boundaries are shown by the corresponding lines. For example, the blue line is the decision boundary between the cyan and magenta classes. Then, we deliberately introduced some noisy samples to investigate the model robustness. There are five outliers belonging to the blue class and 15 outliers belonging to the magenta class in Fig. 4b. We find that these noisy samples significantly affect the originally obtained decision boundaries. Taking the magenta line, the boundary between the blue and cyan classes, as an example: since there is no mechanism to guarantee robustness, the LSR classifier has to take the five blue-class outliers into consideration (i.e., LSR has to try its best to correctly classify these two classes). As a result, the original decision boundary is rotated anticlockwise by some degrees.

Fig. 4. Experiments on synthetic data

Given a certain model, the fitting errors of these deliberately introduced 20 points are much larger than those of the remaining samples, indicating their poor quality. To decrease their impact on model training, they should be assigned small weights to improve the model robustness. For our proposed JSFA model, we set the age parameter $\lambda$ to 0.85 in the experiment. After model training, the average importance value of these 20 noisy samples was 0.5996, while that of the other points was 0.9724. In Fig. 4c, these samples are highlighted in boxes and the decision boundaries obtained by JSFA are almost identical to those in Fig. 4a. That is, the negative effects caused by these noisy samples are largely eliminated.

For this synthetic data set, if we want the samples to be well classified, they should be projected onto the x-axis rather than the y-axis, meaning that the first feature dimension is more discriminative than the second one. From Fig. 5b, we find that the cyan and blue classes overlap almost completely when projected onto the y-axis. Accordingly, the first feature dimension should be assigned a larger weight. By feeding this data set into JSFA, the learned feature importance vector is $\boldsymbol{\theta}=[0.7793, 0.2207]$. The value corresponding to the first feature dimension is significantly larger than that of the second. Therefore, the importance of different feature dimensions is adaptively learned by maximizing the model's discriminative ability.

Fig. 5. Data points respectively projected onto the x-axis (top) and y-axis (bottom)

Experiments on emotion recognition

Data descriptions

SEED-IV is a video-evoked emotional EEG data set. Fifteen healthy subjects were recruited for the EEG data acquisition experiments, and each subject participated in the experiments three times (also termed three sessions). Therefore, SEED-IV consists of 45 sessions in total. In each session, four different emotional states (i.e., sad, fear, happy and neutral) were elicited by asking the subjects to watch 24 well-chosen video clips, with six clips corresponding to each emotional state. While the subjects were watching the videos, EEG data was simultaneously recorded using the ESI NeuroScan system with a 62-channel cap. The electrode placement is in line with the international 10–20 system.

For each session, EEG data was partitioned into multiple 4-s non-overlapping segments, each of which corresponds to one sample for model learning. After being down-sampled from 1000 to 200 Hz, EEG data was first bandpass filtered to 1–75 Hz and then decomposed into five frequency bands, i.e., the Delta, Theta, Alpha, Beta and Gamma bands, whose frequency ranges are respectively 1–4 Hz, 4–8 Hz, 8–14 Hz, 14–31 Hz and 31–50 Hz. The DE feature was extracted from each segment in each of these five frequency bands. Assuming that EEG data is a random variable $X$ which follows the Gaussian distribution $f(x)=\mathcal{N}(x;\mu,\sigma^2)$, the DE feature [13, 40] can be calculated by

$$h(X)=-\int_{-\infty}^{+\infty} f(x)\ln\!\left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\right]dx=\frac{1}{2}\ln(2\pi\sigma^2)+\frac{\mathrm{Var}(X)}{2\sigma^2}=\frac{1}{2}\ln(2\pi e\sigma^2). \tag{25}$$

From the above equation, it is easy to find an equivalence between DE and the logarithm of the power spectrum. For each frequency band, there are 62 channel-wise features. We concatenate the features of all five frequency bands together, leading to a sample dimensionality of 310 in SEED-IV. Because the video clips have slightly different time durations, there are respectively 851, 832 and 822 EEG samples in the three sessions.
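Under the Gaussian assumption, the DE feature of a band-passed segment is simply a function of its variance. A minimal sketch with synthetic Gaussian samples (not real EEG; the segment length matches a 4-s window at 200 Hz):

```python
import math
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0
segment = rng.normal(0.0, sigma, size=800)  # hypothetical 4-s segment at 200 Hz

de = 0.5 * math.log(2 * math.pi * math.e * segment.var())     # Eq. (25) estimate
closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
print(abs(de - closed_form) < 0.1)  # True: sample estimate matches the closed form
```

Since the variance of a band-passed signal equals its band power, this also makes the equivalence between DE and log band power explicit.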

Experimental settings

Since each subject has EEG data from three different sessions, we perform emotion recognition in the cross-session setting. Following the chronological order, three tasks are considered: 'session1→session2', 'session1→session3' and 'session2→session3'. For example, in the 'session2→session3' task, the labeled EEG samples from the second session of each subject are used for model training and the unlabeled samples from the third session are used for model testing.

We compare our proposed JSFA model with some closely related ones, including the support vector machine (SVM) and least squares regression (LSR). In addition, to show the effectiveness of the feature importance variable $\boldsymbol{\theta}$ and the self-paced regularization term in Eq. (8), we additionally include comparisons with Rescaled Linear Square Regression (RLSR) and self-paced learning (SPL). Here, RLSR is a supervised classification model obtained by incorporating the feature self-weighting variable $\boldsymbol{\theta}$ into LSR, which is different from the semi-supervised version proposed in [41]. SPL augments the least squares loss with the linear self-paced regularization term. In SVM, the linear kernel is used. Each of the four compared models has only one regularization coefficient $C$ to tune. The regularization parameters of the compared models were tuned from $\{2^{-25}, 2^{-24}, \ldots, 2^{25}\}$. The set $\{1.1, 1.2, \ldots, 3.0\}$ defines the search scope of the step size parameter $k$ in SPL and JSFA.

Results and analysis

Table 1 shows the results of JSFA and the other compared models, where we mark the best results in bold. These results depict the following meaningful points.

  • Though the EEG data collected in different sessions usually exhibits considerable distribution discrepancies, our proposed JSFA model still achieves promising emotion recognition accuracies. To be specific, the average accuracies of JSFA on the three cross-session tasks are 80.79%, 82.52% and 81.20%, which are respectively 4.96%, 6.74% and 4.35% higher than those of the second-placed model. Therefore, we conclude that the internal subject-dependent emotional pattern is potentially stable but obscured by a layer of external factors, and our JSFA model can effectively remove these factors by jointly filtering out meaningless EEG samples and features.

  • During the acquisition process, EEG data is easily and sometimes inevitably contaminated by different types of noise, such as that from hardware devices and other physiological signals. Therefore, it is necessary to take outliers and noise into consideration rather than ignoring them. In terms of average results, SPL outperformed LSR by 6.41%, 9.15% and 6.22% respectively in the three emotion recognition tasks, benefiting from the sample importance descriptor that adaptively increases or decreases the impact of samples. Specifically, if a sample is difficult to fit with the current model, it is considered a noisy sample and is automatically assigned a smaller weight.

  • Based on the consensus that different EEG frequency bands as well as different brain regions might have different correlations with neural activities, the discriminative abilities of the extracted EEG features should differ in mental state recognition. By introducing the feature self-weighting variable to adaptively explore the contributions of different EEG feature dimensions, RLSR obtained performance superior to LSR. Moreover, the average performance of JSFA outperforms that of SPL by 4.96%, 6.74% and 4.35% in the three cross-session tasks, indicating that adaptive learning of feature weights is beneficial for improving the recognition accuracy.

  • The experimental results show that the quality of both EEG samples and features determines the emotion recognition performance to a large extent. Therefore, seamlessly merging them into a unified model is beneficial for enhancing the recognition performance. In our view, these two aspects are complementary. In JSFA, the sample importance descriptor v is jointly optimized with the feature importance vector θ to better capture the EEG data components that are more correlated with emotion expression, leading to improved emotion recognition performance.
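The joint weighting described in the points above can be sketched minimally. The hard self-paced rule below is a common instantiation of such a sample importance descriptor; the threshold, variable names and toy losses are illustrative, not JSFA's exact formulation:

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Hard self-paced weighting: samples whose loss exceeds the
    age parameter lam are treated as noisy and get zero weight."""
    return (losses < lam).astype(float)

# Toy per-sample losses: three easy samples and one hard (noisy) one.
losses = np.array([0.1, 0.2, 0.15, 2.0])
v = self_paced_weights(losses, lam=1.0)
print(v)  # -> [1. 1. 1. 0.]
```

As the age parameter lam grows during training, harder samples are gradually admitted into the model, which is the "easy samples first" behavior of self-paced learning.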

Below we perform one-way analysis of variance (ANOVA) between the experimental results obtained by JSFA and each of the other compared models. For each model, there are 15 recognition accuracies corresponding to the 15 subjects in each session, leading to a sequence of 45 recognition accuracies in total, as shown in Table 1. The null hypothesis assumes that the means of the result groups corresponding to different models are equal. Table 2 shows the p-values returned by the ANOVA function, from which we find that JSFA significantly outperforms all the other models in emotion recognition and the null hypothesis should definitely be rejected, demonstrating the effectiveness of our joint sample and feature assessment strategy.
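A significance test of this kind can be reproduced with standard tooling. The snippet below runs a one-way ANOVA on two synthetic 45-value accuracy sequences standing in for the per-subject results of two compared models; the generated numbers are illustrative, not the paper's:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Hypothetical accuracy sequences (45 values each), one per model.
acc_model_a = rng.normal(loc=81.5, scale=8.0, size=45)
acc_model_b = rng.normal(loc=76.2, scale=8.0, size=45)

# One-way ANOVA; reject the equal-means null hypothesis when p < 0.05.
stat, p = f_oneway(acc_model_a, acc_model_b)
print(f"F={stat:.3f}, p={p:.4f}")
```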

Table 1.

The recognition accuracies (%) of JSFA and other compared methods on SEED-IV

ID Session1→Session2 Session1→Session3 Session2→Session3
SVM LSR RLSR SPL JSFA SVM LSR RLSR SPL JSFA SVM LSR RLSR SPL JSFA
Sub1 37.50 52.88 52.64 71.15 75.00 50.85 79.44 85.77 84.79 87.47 66.79 59.73 61.19 68.61 69.83
Sub2 83.89 85.82 92.79 94.47 97.36 72.02 40.51 50.36 64.36 80.90 82.12 45.86 46.84 63.75 64.84
Sub3 51.44 79.69 79.69 84.25 84.98 41.73 61.19 69.22 71.90 80.41 61.07 67.27 77.49 77.98 78.47
Sub4 32.45 69.35 73.80 72.96 82.93 54.01 76.16 78.35 81.87 82.60 62.04 84.79 85.04 85.77 88.69
Sub5 54.33 67.43 67.43 74.16 80.65 50.24 58.76 60.95 73.48 80.17 62.04 70.92 76.64 72.63 85.40
Sub6 44.23 68.75 69.59 71.03 74.76 84.67 90.15 87.47 92.34 94.89 67.40 91.73 92.82 92.70 94.28
Sub7 71.63 86.78 92.07 93.03 95.55 65.45 82.12 84.55 87.23 95.01 81.14 81.14 94.89 90.75 96.84
Sub8 65.38 67.55 77.76 74.40 81.85 84.55 83.70 86.01 89.05 94.04 73.97 74.93 77.13 76.28 77.13
Sub9 77.28 54.93 66.71 71.15 82.69 59.49 45.50 57.18 69.70 77.25 56.20 51.82 63.50 68.49 75.55
Sub10 41.59 57.81 58.29 68.15 70.19 34.79 58.15 58.64 63.26 75.43 67.03 61.56 66.30 75.55 77.37
Sub11 48.80 59.25 60.82 63.82 66.83 62.53 63.99 63.99 69.22 75.18 53.65 69.83 76.28 70.32 83.94
Sub12 43.63 65.87 73.08 66.23 74.76 30.17 51.46 53.16 58.52 64.60 64.84 70.80 68.13 72.87 75.67
Sub13 57.21 57.57 68.03 62.50 69.83 54.62 51.58 56.57 60.58 64.72 50.85 54.26 57.18 55.47 63.75
Sub14 78.61 74.40 77.76 77.28 78.73 63.38 77.86 84.18 86.50 92.58 82.12 87.23 86.86 91.73 91.73
Sub15 88.82 93.15 91.11 92.91 95.67 84.06 78.95 82.00 83.94 92.58 85.28 87.59 90.27 89.90 94.53
Avg. 58.45 69.42 73.44 75.83 80.79 59.50 66.63 70.56 75.78 82.52 67.77 70.63 74.70 76.85 81.20
Table 2.

The analysis of the variance (ANOVA) between JSFA and each of the other models

JSFA vs. SVM JSFA vs. LSR JSFA vs. RLSR JSFA vs. SPL
p-value 4.3707e−10** 3.34879e−06** 0.0007** 0.0165*

**p-value<0.01, *p-value<0.05

To provide more details on the recognition performance of each emotional state, we reorganize the recognition accuracies of the compared models in the form of confusion matrices in Fig. 6. From this figure, we gain some insights into (1) the average recognition rate of each emotional state obtained by each compared model; (2) the rates of misclassifying samples from one class into the others; and (3) the performance improvements for each emotional state made by JSFA in comparison with the other models. Taking the neutral state as an example, the average recognition rate of JSFA is 87.77%, which is 11.78% higher than that of RLSR, i.e., 75.99%. In addition to the fact that 87.77% of the neutral EEG samples were classified correctly, the confusion matrix of JSFA also shows that 4.27%, 4.65% and 3.30% of the neutral samples were misclassified as the sad, fear and happy states, respectively. Among the four emotional states, JSFA achieved the highest recognition rate on the neutral state.
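Row-normalized confusion matrices like those in Fig. 6 can be computed as follows; the labels and predictions here are toy stand-ins for the four SEED-IV states, not the paper's results:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for the four SEED-IV states
# (0 = neutral, 1 = sad, 2 = fear, 3 = happy).
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 3, 3, 3])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 3, 3, 2])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
# Row-normalize so each row gives per-class recognition rates.
rates = cm / cm.sum(axis=1, keepdims=True)
print(rates)
```

The diagonal of `rates` gives the per-state recognition rate; off-diagonal entries give the misclassification rates from one state into another.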

Fig. 6.

Fig. 6

Confusion matrices of the compared models

Activation patterns mining from emotion data

In EEG-based emotion recognition, we are interested in more than the recognition accuracy; we also expect JSFA to perform knowledge discovery on the EEG spatial-frequency activation patterns in emotion expression. In view of the abundant frequency and channel information contained in the EEG data, it is necessary to investigate the correlations between different frequency bands (brain regions) and emotion recognition. Based on the descriptions in the second part of section “Discussions on JSFA”, JSFA provides us with a quantitative way to identify the critical frequency bands and channels in EEG-based emotion recognition according to its learned feature importance vector θ.

As shown in Fig. 7, we visualize the learned θs. For example, the θ in Fig. 7a corresponds to the average of the 15 cases in the session1→session2 task, and the last subfigure is the average across all 45 cases. Obviously, these EEG features contribute very differently to emotion recognition. We divide the horizontal axis into five intervals so that it corresponds more intuitively to the five EEG frequency bands. According to Eq. (23), the importance of all five frequency bands is quantified in Fig. 8, where the mean values are annotated on top of the bars. From these results, we experimentally demonstrate that the Gamma frequency band has the strongest correlation with the occurrence of affective effects, followed by the Delta band. Similarly, to check the significance between the Gamma band and the others, one-way ANOVA returns p-values of 0.0063, 1.1413e−16, 2.4991e−15 and 1.0236e−18, respectively, indicating that the importance of the Gamma band is significantly higher than that of the others. Though the obtained results are similar, the identification of EEG spatial-frequency activation patterns by JSFA is more flexible and adaptive in comparison with trial-and-error methods [42, 43].
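The band-level aggregation can be sketched as below. It assumes a band-major feature layout (the horizontal axis splits into five contiguous channel blocks) and that Eq. (23) averages the learned importance values within each block; both are assumptions here, since the equation is defined earlier in the paper, and the θ values are random placeholders:

```python
import numpy as np

BANDS = ["Delta", "Theta", "Alpha", "Beta", "Gamma"]
N_CH = 62  # SEED-IV channels

rng = np.random.default_rng(0)
theta = rng.random(N_CH * len(BANDS))  # stand-in for the learned importance vector

# Band-major layout: dimensions [i*N_CH, (i+1)*N_CH) belong to band i.
band_importance = {b: theta[i * N_CH:(i + 1) * N_CH].mean()
                   for i, b in enumerate(BANDS)}
print(band_importance)
```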

Fig. 7.

Fig. 7

The quantitative feature importance values learned by JSFA

Fig. 8.

Fig. 8

The mean importance of different frequency bands in SEED-IV

As analyzed above, the channel-wise EEG features (i.e., DE in this experiment) have different discriminative abilities; in turn, we want to identify the contributions of different EEG channels, and further of different brain regions, in emotion recognition. Similar to the above analysis on frequency bands, the importance of EEG channels can be measured by Eq. (24) once the feature importance vector θ is fitted to the data. Instead of directly listing their importance values here, we adopt the brain topographical map to more intuitively show the critical brain regions in Fig. 9. Obviously, the prefrontal, left/right central and (central) parietal lobes are considered more important in emotion recognition. The data-driven identification results of EEG spatial-frequency activation patterns not only provide us with more insights into the underlying neural mechanism of emotion processing, but also inspire us to design specialized EEG acquisition devices for emotion recognition in the future.
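Under the same assumed band-major layout as before, the channel-level aggregation of Eq. (24) reduces to averaging θ over the five bands for each channel. This is a sketch with placeholder θ values, not the paper's exact implementation:

```python
import numpy as np

N_BANDS, N_CH = 5, 62
rng = np.random.default_rng(1)
theta = rng.random(N_BANDS * N_CH)  # stand-in for the learned importance vector

# Reshape to (bands, channels), then average over bands per channel.
channel_importance = theta.reshape(N_BANDS, N_CH).mean(axis=0)
print(channel_importance.shape)  # -> (62,)
```

The resulting 62-value vector is what a topographical map such as Fig. 9 would plot over the scalp layout.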

Fig. 9.

Fig. 9

The spatial activation patterns of the three cross-session tasks (ac) and their average (d) identified by JSFA

Effect of the sample importance measurement

In addition to the theoretical analysis in section “Discussions on JSFA”, below we illustrate by experiments how the sample importance descriptor v acts in weighting the importance of samples and further improving the robustness of JSFA. We take the EEG samples from ‘subject 2: session 1’ as an example and visualize them in a two-dimensional subspace by the t-distributed stochastic neighbor embedding (t-SNE) method in Fig. 10a, where the four different colors correspond to the four different states in SEED-IV. Obviously, there is an overlapping area highlighted by a rectangle, whose enlarged version is provided in Fig. 10b. Below we provide an illustration from the perspective of the data acquisition paradigm in SEED-IV.
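A t-SNE projection of this kind can be obtained with scikit-learn; the data below is random and only illustrates the call, not the actual SEED-IV features:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for one session's DE features: 200 samples x 50 dims,
# with integer labels for the four emotional states.
X = rng.normal(size=(200, 50))
y = rng.integers(0, 4, size=200)

# Project to 2-D; each row of emb can then be scattered, colored by y.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # -> (200, 2)
```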

Fig. 10.

Fig. 10

An example to illustrate the sample importance measurement in JSFA. a The 2-D visualization by t-SNE of the samples from session 1 of subject 2; b A larger version corresponding to the rectangle part in a

In SEED-IV, self-assessment is conducted for each subject in a 45-s interval between trials. However, when a subject is completely immersed in the video clip of one trial and cannot extricate himself (herself), it is difficult for him (her) to quickly recover from that emotional state during such a short break. Then, in the front part of the next trial, this state will inevitably act as a background component, which results in inconsistencies between the extracted EEG features and the labeled emotional states. To be specific, though the EEG samples belonging to the front part of the next trial are labeled with another emotional state, the underlying EEG features might be more similar to those in the previous trial. Therefore, the samples from both trials can be considered to have similar features but different emotional labels. Correspondingly, we see some overlapped samples in the rectangle of Fig. 10, and there is no doubt that these samples are difficult to distinguish. To enhance the model robustness, we can perform model training by decreasing the weights of these samples to eliminate their side effects. Under the self-paced regularizer, these hard samples are treated as noisy ones whose weights are close to zero, while the remaining samples are treated as normal ones whose weights are close to one. As a result, samples are treated differently according to their importance in the training process, and the robustness of JSFA is enhanced by reducing the influence of noisy samples.

Experiments on driving fatigue detection

Data descriptions

SEED-VIG is a benchmark EEG data set for driving fatigue detection. During the EEG data collection experiments, subjects were asked to perform simulated driving operations in a virtual reality-based driving scene. To ensure that fatigue states definitely exist in the collected EEG data, the virtual road scenes were predominantly straight and monotonous. Besides, most experiments were conducted in the early afternoon after lunch, when circadian sleepiness peaks. 23 subjects participated in the experiments, and the data acquisition experiment for each subject lasted about 2 h. The Neuroscan system and a 62-channel EEG cap were used to record both the EEG and EOG data of the subjects during the experiments. Additionally, SMI eye-tracking glasses were worn to simultaneously capture eye movements. The PERCLOS indicator values can then be calculated to serve as the fatigue indices:

\mathrm{PERCLOS} = \frac{eye\_closing\_time}{total\_time}. (26)
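Eq. (26) is a simple ratio of eye-closure time to total time; as a sketch (the time arguments below are illustrative):

```python
def perclos(eye_closing_time, total_time):
    """PERCLOS (Eq. 26): fraction of time the eyes are closed."""
    return eye_closing_time / total_time

# E.g., eyes closed for 12 s within a 60-s window.
print(perclos(12.0, 60.0))  # -> 0.2
```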

As shown in Fig. 11, the 18 EEG channels which are mainly from the temporal and posterior lobes, and which are considered more correlated to fatigue expression, were used for extracting EEG features. SEED-VIG provides both the PSD and DE features, extracted from the same five EEG frequency bands as in SEED-IV. We used the DE features smoothed by a linear dynamical system in the subsequent experiments. Since the time window used in the short-time Fourier transform is 8 s and there is a bad channel, we get about 885 EEG samples for each subject and the feature dimensionality is 85 (i.e., 17 channels and five frequency bands).

Fig. 11.

Fig. 11

The 18 channels used in SEED-VIG data set [44]

Experimental settings

Comparative studies are conducted between JSFA and some popular regression models, i.e., support vector regression (SVR), least squares regression (LSR), rescaled least squares regression (RLSR) and self-paced learning (SPL). It is worth mentioning that here RLSR incorporates the feature importance variable θ into LSR; it is a regression model and differs from that in [41]. The available EEG samples of each subject are randomly partitioned into a training set of 600 samples and a test set of 285 samples. The parameter settings of each model are consistent with those in emotion recognition.
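The random 600/285 per-subject partition can be sketched as follows (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_train = 885, 600  # per-subject samples in SEED-VIG

# Shuffle indices once, then split into disjoint train/test sets.
perm = rng.permutation(n_samples)
train_idx, test_idx = perm[:n_train], perm[n_train:]
print(len(train_idx), len(test_idx))  # -> 600 285
```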

Two metrics, i.e., root mean square error (RMSE) and mean absolute percentage error (MAPE), are used to evaluate the regression performance of the compared models. They are respectively defined as

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} (27)

and

\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|y_i - \hat{y}_i\right|}{y_i}, (28)

where n, y_i and ŷ_i respectively denote the number of samples, the ground-truth fatigue index and the estimated fatigue index of the i-th sample. Obviously, the smaller both metrics are, the better the driving fatigue detection performance.
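Both metrics are straightforward to implement; the sketch below follows Eqs. (27) and (28), reporting MAPE in percent as in Table 3 (the toy fatigue indices are illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (27)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error, Eq. (28), in percent."""
    return 100.0 * np.mean(np.abs(y_true - y_pred) / y_true)

y = np.array([0.2, 0.4, 0.5])      # ground-truth PERCLOS indices
yhat = np.array([0.25, 0.35, 0.5])  # estimated indices
print(rmse(y, yhat), mape(y, yhat))
```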

Results and analysis

According to the experimental setup above, the driving fatigue detection results are shown in Table 3, where the best ones are highlighted in bold. In terms of both regression metrics, the average performance of JSFA ranks first among all the compared models, demonstrating its superiority in EEG-based driving fatigue detection. Meanwhile, JSFA also exhibits its flexibility in handling regression tasks.

Table 3.

Driving fatigue detection results by different models evaluated by the two metrics of RMSE and MAPE

ID RMSE MAPE
SVR LSR RLSR SPL JSFA SVR LSR RLSR SPL JSFA
Sub1 0.2549 0.0441 0.0440 0.0405 0.0396 34.76 6.48 6.47 5.03 5.03
Sub2 0.1498 0.0611 0.0610 0.0606 0.0514 39.97 20.71 20.29 15.99 13.69
Sub3 0.0801 0.0404 0.0405 0.0388 0.0386 12.82 6.80 6.80 5.76 5.67
Sub4 0.1186 0.0595 0.0596 0.0593 0.0573 41.58 18.75 18.77 15.17 14.54
Sub5 0.1591 0.0657 0.0656 0.0648 0.0647 44.04 13.53 13.53 10.32 10.25
Sub6 0.2410 0.0883 0.0878 0.0875 0.0863 45.00 27.10 27.13 21.40 19.30
Sub7 0.1687 0.0485 0.0485 0.0480 0.0463 25.39 16.48 16.50 11.56 9.64
Sub8 0.1009 0.0474 0.0474 0.0474 0.0436 19.64 11.04 11.04 9.91 9.27
Sub9 0.0569 0.0280 0.0281 0.0280 0.0280 9.53 5.43 5.43 5.04 5.00
Sub10 0.4544 0.0502 0.0502 0.0502 0.0489 232.83 19.45 19.13 17.08 14.52
Sub11 0.0740 0.0404 0.0404 0.0404 0.0397 13.71 7.01 7.01 6.11 5.96
Sub12 0.1279 0.0610 0.0608 0.0601 0.0586 21.68 11.74 11.67 10.53 10.30
Sub13 0.0988 0.0488 0.0488 0.0488 0.0488 19.95 10.90 10.81 9.90 9.84
Sub14 0.1432 0.0704 0.0704 0.0698 0.0695 47.70 16.02 15.80 13.75 13.36
Sub15 0.1753 0.0408 0.0408 0.0405 0.0376 19.98 4.73 4.72 4.27 3.73
Sub16 0.1387 0.0816 0.0805 0.0806 0.0765 32.07 12.62 12.61 10.44 10.25
Sub17 0.1468 0.0387 0.0387 0.0387 0.0384 21.79 8.26 8.23 7.54 7.08
Sub18 0.2513 0.0595 0.0595 0.0595 0.0595 33.90 10.97 11.03 9.77 8.82
Sub19 0.0543 0.0409 0.0409 0.0399 0.0384 9.71 6.68 6.63 5.74 5.57
Sub20 0.0921 0.0546 0.0545 0.0527 0.0521 17.92 12.88 12.84 9.93 8.91
Sub21 0.0884 0.0294 0.0292 0.0294 0.0287 13.60 5.36 5.30 4.69 4.60
Sub22 0.1105 0.0665 0.0665 0.0649 0.0635 172.22 43.47 41.24 17.91 17.27
Sub23 0.2164 0.0780 0.0780 0.0775 0.0774 59.49 61.25 59.66 28.94 27.89
Avg. 0.1523 0.0541 0.0540 0.0534 0.0519 43.01 15.55 15.33 11.17 10.46

Since the channels used in SEED-VIG have already been fixed in the temporal, parietal and occipital lobes, instead of using all 62 channels to cover all brain regions, we do not perform critical EEG channel identification and only provide knowledge discovery on the EEG frequency bands. According to the θ learned by JSFA, we obtain the importance of different frequency bands on the two metrics in Fig. 12. Generally, the importance distributions of frequency bands in driving fatigue detection are completely different from those in the emotion recognition experiment. Below we summarize the possible reasons why the Alpha band contributes the most in driving fatigue detection, from the two perspectives of the neural mechanism of sleepiness and the characteristics of SEED-VIG itself. First, Alpha activities have been considered the most reliable index to depict the transition between wakefulness and sleepiness; specifically, they are attenuated when the vigilance of drivers decreases [45]. In other words, the EEG features corresponding to the Alpha band are sensitive to changes in drivers’ fatigue states. Therefore, the Alpha band is deservedly identified as the most important one. Second, Alpha waves generally appear more often in the parietal and occipital lobes, especially the latter. Among the 18 channels used in SEED-VIG, 12 are concentrated in these areas, which increases the chance of detecting Alpha activities. From the results shown in Table 4, we see statistical significance between the Alpha band and all the other bands except the Beta band. Though the above results are totally data-driven, they show the important role of the Alpha band in driving fatigue detection. Since the Alpha rhythm is closely related to fatigue and sleep research [46–48], we will continue to investigate this topic from both the perspectives of engineering practice and cognitive neuroscience.

Fig. 12.

Fig. 12

The mean importance of different frequency bands in SEED-VIG

Table 4.

The analysis of the variance (ANOVA) between Alpha and each of the other bands

Alpha vs. Delta Alpha vs. Theta Alpha vs. Beta Alpha vs. Gamma
p-value(RMSE) 0.0006** 0.0189* 0.2365 7.1114e−07**
p-value(MAPE) 0.0022** 0.0024** 0.2291 7.8639e−07**

**p-value<0.01, *p-value<0.05

Conclusion

In this paper, a joint sample and feature importance assessment (JSFA) model was proposed for improving the EEG decoding performance of BCI systems. Unlike the usual approaches which treat all samples and features equally, JSFA adaptively learns their importance from EEG data according to their contributions to a certain recognition task. Extensive experiments were conducted on two typical supervised EEG-based BCI applications (i.e., emotion classification and driving fatigue detection) and the results demonstrated the superiority of such a joint optimization approach in improving the recognition performance. Moreover, JSFA provided us with more insights into the occurrence of emotion and fatigue effects, mainly from the perspective of identifying the EEG spatial-frequency activation patterns. The limitations of the proposed JSFA model are mainly from two aspects. One is that JSFA is a linear model, which might not be competent enough to characterize the nonlinear structures in EEG data. The other is that the sample importance is evaluated based on the squared ℓ2-norm approximation error, whose robustness can be further improved. As future work, we will develop neural network-based nonlinear JSFA variants to improve the nonlinear feature learning ability and incorporate more robust measures of the data reconstruction error, such as the ℓ2,1-norm.

Acknowledgements

This work was supported in part by the National Key Research and Development Program of China (Grant No. 2023YFE0114900) and in part by the National Natural Science Foundation of China (Grant No. 61971173).

Author contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by XL, YZ, YP, and WK. The first draft of the manuscript was written by XL and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding

This work was supported in part by the National Key Research and Development Program of China (Grant No. 2023YFE0114900) and in part by the National Natural Science Foundation of China (Grant No. 61971173).

Data availability

The two EEG data sets, SEED-IV and SEED-VIG, can be publicly accessed from https://bcmi.sjtu.edu.cn/~seed/index.html.

Code availability

The source code is available from https://github.com/SunseaIU/JSFA.

Declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Ethical approval

This work involved human subjects or animals in its research. Approval of all ethical and experimental procedures and protocols was granted by the Research Ethics Committee of Shanghai Jiao Tong University under Approval No. 2017060.

Consent for publication

All authors have checked the manuscript and have agreed to the submission.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Gao X, Wang Y, Chen X, Gao S. Interface, interaction, and intelligence in generalized brain–computer interfaces. Trends Cogn Sci. 2021;25(8):671–84.
  • 2. Pourbabaee B, Roshtkhari MJ, Khorasani K. Deep convolutional neural networks and learning ECG features for screening paroxysmal atrial fibrillation patients. IEEE Trans Syst Man Cybern Syst. 2018;48(12):2095–104.
  • 3. Chen L, Wu M, Zhou M, Liu Z, She J, Hirota K. Dynamic emotion understanding in human–robot interaction based on two-layer fuzzy SVR-TS model. IEEE Trans Syst Man Cybern Syst. 2020;50(2):490–501.
  • 4. Lan Z, Sourina O, Wang L, Scherer R, Müller-Putz GR. Domain adaptation techniques for EEG-based emotion recognition: a comparative study on two public datasets. IEEE Trans Cognit Dev Syst. 2018;11(1):85–94.
  • 5. Abgeena A, Garg S. S-LSTM-ATT: a hybrid deep learning approach with optimized features for emotion recognition in electroencephalogram. Health Inf Sci Syst. 2023;11(1):40.
  • 6. King J-T, Prasad M, Tsai T, Ming Y-R, Lin C-T. Influence of time pressure on inhibitory brain control during emergency driving. IEEE Trans Syst Man Cybern Syst. 2020;50(11):4408–14.
  • 7. Li C, Bao Z, Li L, Zhao Z. Exploring temporal representations by leveraging attention-based bidirectional LSTM-RNNs for multi-modal emotion recognition. Inf Process Manag. 2020;57(3):102185.
  • 8. Yang Y, Gao Z, Li Y, Cai Q, Marwan N, Kurths J. A complex network-based broad learning system for detecting driver fatigue from EEG signals. IEEE Trans Syst Man Cybern Syst. 2021;51(9):5800–8.
  • 9. Olmez Y, Koca GO, Sengur A, Acharya UR. PS-VTS: particle swarm with visit table strategy for automated emotion recognition with EEG signals. Health Inf Sci Syst. 2023;11(1):22.
  • 10. Wolpaw JR, Millán JDR, Ramsey NF. Brain–computer interfaces: definitions and principles. Handb Clin Neurol. 2020;168:15–23.
  • 11. Chen X, Liu Q, Tao W, Li L, Lee S, Liu A, Chen Q, Cheng J, McKeown MJ, Wang ZJ. ReMAE: user-friendly toolbox for removing muscle artifacts from EEG. IEEE Trans Instrum Meas. 2019;69(5):2105–19.
  • 12. Zhang G, Yu M, Chen G, Han Y, Zhang D, Zhao G, Liu Y-J. A review of EEG features for emotion recognition. Sci Sin Inf. 2019;49(9):1097–118.
  • 13. Duan R-N, Zhu J-Y, Lu B-L. Differential entropy feature for EEG-based emotion classification. In: Proceedings of international IEEE/EMBS conference on neural engineering. 2013. p. 81–4.
  • 14. Li J, Thakor N, Bezerianos A. Brain functional connectivity in unconstrained walking with and without an exoskeleton. IEEE Trans Neural Syst Rehabil Eng. 2020;28(3):730–9.
  • 15. Li R, Ren C, Zhang S, Yang Y, Zhao Q, Hou K, Yuan W, Zhang X, Hu B. STSNet: a novel spatio-temporal-spectral network for subject-independent EEG-based emotion recognition. Health Inf Sci Syst. 2023;11(1):25.
  • 16. Peng Y, Li Q, Kong W, Qin F, Zhang J, Cichocki A. A joint optimization framework to semi-supervised RVFL and ELM networks for efficient data classification. Appl Soft Comput. 2020;97:106756.
  • 17. Hu B, Li X, Sun S, Ratcliffe M. Attention recognition in EEG-based affective learning research using CFS+KNN algorithm. IEEE/ACM Trans Comput Biol Bioinform. 2018;15(1):38–45.
  • 18. Sha T, Zhang Y, Peng Y, Kong W. Semi-supervised regression with adaptive graph learning for EEG-based emotion recognition. Math Biosci Eng. 2023;20(6):11379–402.
  • 19. Peng Y, Liu H, Li J, Huang J, Lu B-L, Kong W. Cross-session emotion recognition by joint label-common and label-specific EEG features exploration. IEEE Trans Neural Syst Rehabil Eng. 2023;31:759–68.
  • 20. Wu D, Xu Y, Lu B-L. Transfer learning for EEG-based brain–computer interfaces: a review of progress made since 2016. IEEE Trans Cognit Dev Syst. 2022;14(1):4–19.
  • 21. Li W, Huan W, Hou B, Tian Y, Zhang Z, Song A. Can emotion be transferred? A review on transfer learning for EEG-based emotion recognition. IEEE Trans Cognit Dev Syst. 2022;14:833–46.
  • 22. Peng Y, Wang W, Kong W, Nie F, Lu B-L, Cichocki A. Joint feature adaptation and graph adaptive label propagation for cross-subject emotion recognition from EEG signals. IEEE Trans Affect Comput. 2022;13(4):1941–58.
  • 23. Liu Y, Lan Z, Cui J, Sourina O, Müller-Wittig W. Inter-subject transfer learning for EEG-based mental fatigue recognition. Adv Eng Inform. 2020;46:101157.
  • 24. Xia K, Ni T, Yin H, Chen B. Cross-domain classification model with knowledge utilization maximization for recognition of epileptic EEG signals. IEEE/ACM Trans Comput Biol Bioinform. 2021;18(1):53–61.
  • 25. Gong S, Xing K, Cichocki A, Li J. Deep learning in EEG: advance of the last ten-year critical period. IEEE Trans Cognit Dev Syst. 2022;14(2):348–65.
  • 26. Suhaimi NS, Mountstephens J, Teo J. EEG-based emotion recognition: a state-of-the-art review of current trends and opportunities. Comput Intell Neurosci. 2020;2020:8875426.
  • 27. Sikander G, Anwar S. Driver fatigue detection systems: a review. IEEE Trans Intell Transp Syst. 2018;20(6):2339–52.
  • 28. Peng Y, Qin F, Kong W, Ge Y, Nie F, Cichocki A. GFIL: a unified framework for the importance analysis of features, frequency bands and channels in EEG-based emotion recognition. IEEE Trans Cognit Dev Syst. 2022;14(3):935–47.
  • 29. Cui Y, Xu Y, Wu D. EEG-based driver drowsiness estimation using feature weighted episodic training. IEEE Trans Neural Syst Rehabil Eng. 2019;27(11):2263–73.
  • 30. Mishuhina V, Jiang X. Feature weighting and regularization of common spatial patterns in EEG-based motor imagery BCI. IEEE Signal Process Lett. 2018;25(6):783–7.
  • 31. Peng Y, Lu B-L. Robust structured sparse representation via half-quadratic optimization for face recognition. Multimed Tools Appl. 2017;76(6):8859–80.
  • 32. Yao C-L, Lu B-L. A robust approach to estimating vigilance from EEG with neural processes. In: Proceedings of IEEE international conference on bioinformatics and biomedicine. 2020. p. 1202–5.
  • 33. Kumar M, Packer B, Koller D. Self-paced learning for latent variable models. In: Proceedings of advances in neural information processing systems. 2010. p. 1189–97.
  • 34. Jiang L, Meng D, Mitamura T, Hauptmann A. Easy samples first: self-paced reranking for zero-example multimedia search. In: Proceedings of ACM international conference on multimedia. 2014. p. 547–56.
  • 35. Zhao Q, Meng D, Jiang L, Xie Q, Xu Z, Hauptmann AG. Self-paced learning for matrix factorization. In: Proceedings of AAAI conference on artificial intelligence. 2015. p. 3196–202.
  • 36. Gan J, Wen G, Yu H, Zheng W, Lei C. Supervised feature selection by self-paced learning regression. Pattern Recogn Lett. 2020;132:30–7.
  • 37. Ma F, Meng D, Dong X, Yang Y. Self-paced multi-view co-training. J Mach Learn Res. 2020;21:1–38.
  • 38. Li L, Zhao K, Li S, Sun R, Cai S. Extreme learning machine for supervised classification with self-paced learning. Neural Process Lett. 2020;52(3):1723–44.
  • 39. Meng D, Zhao Q, Jiang L. What objective does self-paced learning indeed optimize? arXiv preprint. 2015. arXiv:1511.06049.
  • 40. Shi L-C, Jiao Y-Y, Lu B-L. Differential entropy feature for EEG-based vigilance estimation. In: Proceedings of international conference of the IEEE engineering in medicine and biology society. 2013. p. 6627–30.
  • 41. Chen X, Yuan G, Nie F, Ming Z. Semi-supervised feature selection via sparse rescaled linear square regression. IEEE Trans Knowl Data Eng. 2020;32(1):165–76.
  • 42. Peng Y, Lu B-L. Discriminative manifold extreme learning machine and applications to image and EEG signal classification. Neurocomputing. 2016;174:265–77.
  • 43. Zheng W-L, Zhu J-Y, Lu B-L. Identifying stable patterns over time for emotion recognition from EEG. IEEE Trans Affect Comput. 2019;10:417–29.
  • 44. Zheng W-L, Lu B-L. A multimodal approach to estimating vigilance using EEG and forehead EOG. J Neural Eng. 2017;14(2):026017.
  • 45. Shi L-C, Lu B-L. Dynamic clustering for vigilance analysis based on EEG. In: Proceedings of annual international conference of the IEEE engineering in medicine and biology society. 2008. p. 54–7.
  • 46. Pivik RT, Harman K. A reconceptualization of EEG alpha activity as an index of arousal during sleep: all alpha activity is not equal. J Sleep Res. 1995;4(3):131–7.
  • 47. Benca RM, Obermeyer WH, Larson CL, Yun B, Dolski I, Kleist KD, Weber SM, Davidson RJ. EEG alpha power and alpha power asymmetry in sleep and wakefulness. Psychophysiology. 1999;37(4):430–6.
  • 48. Kerr CE, Sacchet MD, Lazar SW, Moore CI, Jones SR. Mindfulness starts with the body: somatosensory attention and top-down modulation of cortical alpha rhythms in mindfulness meditation. Front Hum Neurosci. 2013;7(12):1–15.
