Abstract
Purpose
Depression is a global challenge causing psychological and intellectual problems that require efficient diagnosis. Electroencephalogram (EEG) signals represent the functional state of the human brain and can help build an accurate and viable technique for the early prediction and treatment of depression.
Methods
An attention-based gated recurrent units transformer (AttGRUT) time-series model is proposed to efficiently identify EEG perturbations in depressive patients. Statistical, spectral, wavelet and autoregressive features were first extracted from the 60-channel EEG signal data. Then, two feature selection techniques, recursive feature elimination and the Boruta algorithm, both with Shapley additive explanations, were utilised for selecting essential features.
Results
The proposed model outperformed the two baseline and two hybrid time-series models—long short-term memory (LSTM), gated recurrent units (GRU), convolutional neural network-LSTM (CNN-LSTM), and CNN-GRU—achieving an accuracy of up to 98.67%. Feature selection considerably increased the performance across all time-series models.
Conclusion
Based on the obtained results, the feature selection step substantially improved the performance of the baseline and hybrid time-series models. The proposed AttGRUT can be implemented and tested in other domains using different modalities for prediction.
Supplementary Information
The online version contains supplementary material available at 10.1007/s13755-022-00205-8.
Keywords: Electroencephalogram, Transformer, Multi head attention, Deep learning, Depression
Introduction
Depression results from a complex and varied combination of sociological, psychological, and physiological variables [1]. It is a collection of conditions linked to mood fluctuations [2] and affects the brain’s thought processes and, accordingly, the individual’s actions and mood [3]. Therefore, early detection of depression is critical to prevent it from progressing to a detrimental level where it may endanger the lives of those with the condition [4]. Establishing appropriate and effective measures for identifying depression remains a growing research topic, and recent advances in devices and smart sensors have created new possibilities for diagnosing depression [5].
Over the years, published research has adopted electroencephalography (EEG) signals for mental health predictions, including conditions such as schizophrenia [6], epilepsy [7], dementia [8] and autism [9]. This study presents a computer-aided depression detection method that uses EEG signals as biomarkers. Depression can be efficiently identified using EEG because these brain signals are recorded over time [10]. Artificial intelligence approaches have significant potential for addressing depression prediction challenges [11].
EEG signals record electrical impulses from the scalp surface of the head. This is possible because of the electrical activity generated by continuous spiking activation of numerous neurons in the brain. The procedure has various advantages for measuring brain activity and is a harmless, inexpensive, non-invasive, high-temporal-resolution technology that serves as an excellent alternative for neuroscientific research compared to other physiological modalities [12].
This study investigated a hybrid deep learning (DL) classification model for EEG signals with the following two main contributions:
A general feature selection (FS) method typically determines feature importance based on a single execution. Herein, two FS methods—recursive feature elimination (RFE) and the Boruta algorithm with SHapley Additive exPlanations (SHAP) importance—were applied to iteratively select significant features.
The novel attention-based gated recurrent units transformer (AttGRUT) time-series neural network (NN) proposed herein combines a succession of 'Conv1D transformer' units, each of which includes multi-head attention and convolutional layers to learn cross-representational connections and concurrent temporal activity. Two GRU layers that utilise less memory are added to retain useful information over an extended period.
The remainder of this paper is organised as follows. A review of prior scientific work on the prediction of depression using EEG signals is presented in Sect. 2. Section 3 describes the proposed AttGRUT model, while Sect. 4 discusses the experiment and its results. Section 5 considers and compares the proposed models with the baseline models, and Sect. 6 presents concluding remarks.
Related work
Many machine learning (ML) and deep learning (DL) techniques have been applied using EEG data for depression detection [13]. A review article on depression detection showed that almost 48.8% of articles were on ML, 32.6% addressed DL and 18.6% applied both ML and DL [14]. According to a review article, the most common ML model used is the support vector machine (SVM) approach, which demonstrates excellent performance. Among DL approaches, CNN and CNN-LSTM are the most widely used models for mental health prediction (58.92% of the total number of papers). Concurrently, recurrent neural networks (RNNs), GRU, and bidirectional LSTM models represent only 3.45% of the total papers [15]. Most authors have applied DL directly to raw EEG data to enable the automatic extraction of features [16]. In contrast, the authors that used ML primarily extracted handcrafted features for training the models [17].
Lei et al. [18] employed a novel CNN architecture to differentiate between bipolar and major depressive disorders using a multichannel resting-state EEG and obtained a mean accuracy of 96.88% for automatic feature extraction using a deep CNN. Song et al. [19] proposed a combination of a CNN-LSTM architecture with five frequency bands as features, and the proposed model performed better than the baseline ML and DL models. Aydemir et al. [20] used melamine patterns and discrete wavelet transform (DWT) processing for feature extraction, with neighbourhood component analysis (NCA) used as the FS method. These features were then fed to the SVM and k-nearest neighbour models for classification, achieving an accuracy of 95%.
Sharma et al. [10] devised a hybrid neural network architecture to predict depression by combining the power of CNN and LSTM. This model is less computationally expensive and executes in less time because it uses a windowing technique. Seal et al. [21] proposed a novel CNN architecture for classifying depressed and control patients, obtaining an accuracy of 99.37%. Feature extraction has been an important step, specifically for applying ML models. Cukic et al. [22] discriminated depression from EEG using two non-linear features, namely Higuchi's fractal dimension (HFD) and sample entropy (SampEn), with baseline ML models. They found that SampEn was the better discriminator for depression. Zhao et al. [23] introduced two asymmetry measures, SampEn and Lempel–Ziv complexity (LZC), for feature extraction. Their investigation validated an elevated frontal alpha asymmetry in depressive patients. Akbari et al. [24] introduced two new feature extraction methods, reconstructed phase space (RPS) and geometric features, achieving an accuracy of up to 99.30%.
Various methods have been investigated for FS and dimension reduction in depression detection using EEG. The genetic algorithm (GA) and principal component analysis (PCA) are the most common methods used in this regard [25]. Feature selection for DL applications has been underexplored in the literature, potentially because DL's ability to manage large data volumes reduces the apparent need for it.
Implementing transformers in DL [26] has recently sparked significant interest owing to outstanding results in natural language processing (NLP) [27], computer vision (CV) [28], speech signal processing [29] and other fields [30, 31]. Transformers have demonstrated a high level of modelling ability for long-range connections and interconnections in sequential data and are currently being used to model time-series data. Various transformer versions [32] have recently been proposed to address specific issues in time-series modelling and have been effectively applied to a wide range of time-series activities, for example, prediction, outlier detection and classification. Transformer-in-time series is still an emerging area in DL [33].
To the best of our knowledge, the current study represents the first application of a transformer for EEG-based depression prediction. However, transformers have been applied in other study areas. Jha et al. [34] applied a transformer with multi-head attention and temporal modules to predict autism using functional magnetic resonance imaging (fMRI) signals and achieved an accuracy of 77.40%. Yi et al. [35] used a basic time-series transformer architecture to predict mental workload based on EEG signals, with 95.28% accuracy. Bagchi et al. [36] proposed a ConvTransformer architecture with multi-head attention and temporal convolution layers using EEG signals with visual stimuli. A hybrid DL model is proposed in this study by combining the power of multi-head attention with a GRU and a transformer. The following section describes the architecture of the proposed model.
Proposed attention-based GRU-transformer
The proposed time-series-based AttGRUT architecture is based on the original transformer architecture (presented in [26] and [33]) but with a modified output for time-series classification and no requirement for positional encoding [37]. The encoder design, including the layer normalisation and feed-forward components, follows that of the original self-attention architecture. The transformer network comprises encoders and decoders. Here, the encoder is stacked three times, followed by two GRU layers and, finally, a multilayer perceptron head. The decoder is replaced with dense layers because decoding is no longer required, and the final layer is, thus, a dense output layer. The output tensor of the encoder component of our model must be reduced to a set of vectors for each data point in the current sample. A pooling layer is a common approach for achieving this goal. A global average pooling layer was found to be suitable in this scenario because it summarises the representation produced by the deeper structure. The architecture of the AttGRUT model is shown in Fig. 1.
Fig. 1.
a Architecture of the proposed attention-based GRU transformer network, b multi-head attention block, c architecture of adopted stacked GRU
The various modules of the proposed transformer are as follows:
Input embeddings In NLP models, input embeddings are frequently used to map relatively low-dimensional vectors to higher-dimensional ones to ease sequence modelling [30]. To maintain the correlations between distinct features without considering temporal information, a time-series sequence embedding is required [31]. To obtain the k-dimensional embeddings at each time step, the method presented herein employed a one-dimensional (1D) convolutional layer. The model processed tensors of shape batch size → sequence length → features, where 'sequence length' refers to the number of time steps and 'features' refers to each incoming time series.
Transformer encoder block Residual linkages, layer normalisation and dropout were all included in this block. The resulting layer was repeated several times. The model performed best when the transformer encoder block was layered three times. Therefore, the three layers of the encoder block were fixed for all experiments. Subsequently, a multilayer perceptron head was added as the classifier unit.
Multi-head attention (MHA) An essential component of the transformer architecture is multi-head attention (MHA), which acquires a position-independent representation of the input data. Linearly projecting the queries, keys and values $H$ times with different learned linear projections to $d_q$ (query), $d_k$ (key) and $d_v$ (value) dimensions, respectively, is more advantageous than conducting a single attention operation with $d_{model}$-dimensional queries, keys and values. The scaled dot-product attention used in transformers is defined over the query–key–value framework as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$

where the queries $Q\in\mathbb{R}^{n\times d_k}$, keys $K\in\mathbb{R}^{m\times d_k}$ and values $V\in\mathbb{R}^{m\times d_v}$; $n$ and $m$ are the query and key lengths, and $d_k$ and $d_v$ are the query/key and value dimensions. Instead of using one attention function, the transformer employs MHA with $H$ separate sets of trained projections:

$$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_H)\,W^{O} \tag{2}$$

Here, $\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V})$.
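To make Eqs. (1) and (2) concrete, the following minimal NumPy sketch computes scaled dot-product attention and a multi-head variant; it is illustrative only, and the actual model relies on the equivalent Keras attention layer rather than this hand-rolled version.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (1): softmax(Q K^T / sqrt(d_k)) V for batched inputs."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # (batch, n, m)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over the keys
    return weights @ V                                          # (batch, n, d_v)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Eq. (2): run H attention heads in parallel and project the concatenation."""
    heads = [scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o
```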
Feed-forward network (FFN) Along with three 1D convolution layers for feature expansion, a feed-forward network (FFN) was applied with the 'ReLU' activation function. The Keras.layers module was used to implement the 1D convolution projection layers, enabling the model to focus on information from various representational subspaces at various positions. As a fully connected unit, the point-wise FFN is expressed as follows:

$$\mathrm{FFN}(x)=\mathrm{ReLU}(xW_1+b_1)\,W_2+b_2 \tag{3}$$

where $x$ is the output of the previous layer and $W_1$, $W_2$, $b_1$, $b_2$ are the parameters for training. A residual self-attention (SelfAttention) unit is added to each module within the deeper structure, followed by a layer-normalisation (Norm) unit, as follows:

$$x'=\mathrm{Norm}\big(\mathrm{SelfAttention}(x)+x\big) \tag{4}$$

$$y=\mathrm{Norm}\big(\mathrm{FFN}(x')+x'\big) \tag{5}$$
Gated recurrent unit layer (GRU) The GRU was introduced as an alternative for extracting patterns from time-series data. The gated structure used by the GRU controls the information flow, determining at every time step how much information is retained. The GRU can remember temporal trends over a longer period. For these reasons, two GRU layers were stacked after the FFN. The first and second layers consisted of 64 and 32 units, respectively. The default activation function 'tanh' was used. Further, 'return_sequences' was set to 'true', which returns the whole sequence as output rather than an abstracted depiction of the input data, so that no information is lost. The architecture of the implemented stacked GRU layer is shown in Fig. 1(c). The output features $x_{1:L}$ from the transformer block are fed as input to the first GRU layer, and $h_L$ is the hidden representation at the $L$-th time step, defined as:

$$h_L=\mathrm{GRU}\big(x_{1:L};\theta\big) \tag{6}$$

where $\theta$ represents the parameters of the model and $\mathrm{GRU}(\cdot)$ is the function computed by the two recurrent layers. The output of the GRU model is a sequence of predicted values produced using the activation function $\tanh(\cdot)$. The fully gated recurrent unit is given as:

$$z_t=\sigma\big(W_z x_t+U_z h_{t-1}+b_z\big) \tag{7}$$

$$r_t=\sigma\big(W_r x_t+U_r h_{t-1}+b_r\big) \tag{8}$$

$$\hat{h}_t=\tanh\big(W_h x_t+U_h\,(r_t\odot h_{t-1})+b_h\big) \tag{9}$$

$$h_t=(1-z_t)\odot h_{t-1}+z_t\odot\hat{h}_t \tag{10}$$

where the update gate $z_t$ decides which information to forget; the reset gate $r_t$ decides how much information is forwarded to the next state; $\hat{h}_t$ is the intermediary (candidate) state; $h_t$ is the output vector; $W$, $U$ and $b$ are the weight parameter matrices; and $\odot$ indicates the entry-wise product.
Global average pooling The next step was to convert the output vector of the transformer encoder section of the model into a set of feature vectors for each data point in the current batch before the stack of dense layers. A pooling layer is a popular approach to accomplish this, and the GlobalAveragePooling1D layer was suitable for this scenario. Computing the average value of all variables in a feature map is known as global average pooling and is mainly used to reduce the number of learnable parameters.
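Putting the modules together, a Keras sketch of the architecture summarised in Table 2 might look as follows. The exact attention hyperparameters (num_heads, key_dim) and the residual wiring of the Conv1D feed-forward block are assumptions inferred from the text, not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def transformer_encoder(x, num_heads=4, key_dim=256, ff_filters=4, dropout=0.25):
    # Multi-head self-attention sub-layer with a residual connection (Eqs. 1-2, 4)
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    attn = layers.Dropout(dropout)(attn)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)
    # Point-wise feed-forward sub-layer built from 1D convolutions (Eqs. 3, 5)
    ff = layers.Conv1D(ff_filters, kernel_size=1, activation='relu')(x)
    ff = layers.Dropout(dropout)(ff)
    ff = layers.Conv1D(ff_filters, kernel_size=1)(ff)
    ff = layers.Dropout(dropout)(ff)
    ff = layers.Conv1D(x.shape[-1], kernel_size=1)(ff)       # project back to input width
    ff = layers.Dropout(dropout)(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + ff)

def build_attgrut(n_features=90):                             # 204 / 90 / 60 depending on FS
    inputs = layers.Input(shape=(n_features, 1))
    x = inputs
    for _ in range(3):                                        # encoder stacked three times
        x = transformer_encoder(x)
    x = layers.GRU(64, activation='tanh', return_sequences=True)(x)
    x = layers.GRU(32, activation='tanh', return_sequences=True)(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(128, activation='relu')(x)               # MLP head
    x = layers.Dropout(0.25)(x)
    outputs = layers.Dense(2, activation='softmax')(x)
    return models.Model(inputs, outputs)
```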
Experiments
This section explains the process for evaluating and comparing the efficacy of AttGRUT with other baseline methods. Figure 2 illustrates the methodology proposed in this study.
Fig. 2.
Workflow proposed for this research
Data description
The EEG dataset used in this study was acquired from OpenNeuro; it was collected between 2008 and 2010 in the John J. B. Allen laboratory at the University of Arizona (https://openneuro.org/datasets/ds003478/versions/1.1.0) [38]. This dataset comprised 122 college-attending participants, among which we selected 46 with major depressive disorder (MDD) and 46 healthy control participants. The remaining 30 participants (all belonging to the healthy category) were excluded to avoid class imbalance. The data were collected in a resting state with eyes closed and open. The age of the selected individuals ranged from 18 to 24 years (mean 18.793, standard deviation (SD) 1.162); 56 were female and 36 were male. All individuals were assessed using the Beck Depression Inventory (BDI) scale for diagnosing depression, where a BDI score < 7 was considered to indicate stable low depression (control) and a BDI score ≥ 13 was considered to reflect a high level of depression. Only subjects with BDI scores of < 7 or ≥ 13 were included in this study. The average BDI score was 22.22 for depressed participants and 1.83 for their healthy counterparts.
Data pre-processing
Dataset pre-processing was performed using EEGLAB 2022.0 [39]. Figure 3 shows the steps taken for pre-processing each EEG signal. First, the 66-channel raw EEG data and channel locations were imported. The EEG data were then re-referenced to the mastoids M1 and M2 (the electrodes placed behind the ears), which were subsequently excluded from the data. Next, the data were downsampled from 500 to 128 Hz. A direct current shift can introduce significant filter artifacts prior to signal filtering; as such, it was removed at this stage.
Fig. 3.
Steps followed for pre-processing the raw EEG signals
The next step was to filter the data to remove the slow, large-amplitude drift. A basic finite impulse response filter was used to achieve this, with the lower edge of the bandpass set to 0.5 Hz and the higher edge set to 50 Hz. The CleanLine function of EEGLAB was applied to remove line noise. Manual channel removal was then performed to remove empty and unimportant channels: the horizontal and vertical electrooculogram channels (used to capture eye movement artifacts) were removed from the data, and the CB1 and CB2 cerebellar electrode channels were also excluded because they did not record any neural activity. The final pre-processing step was decomposing the data using independent component analysis (ICA). Because of its capacity to filter artifacts from the signal, ICA is frequently used during the signal pre-processing stage in EEG analysis. The benefits of using ICA become even more evident when processing a multichannel signal.
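The steps above were carried out in EEGLAB. For readers working in Python, an analogous pipeline can be sketched with MNE-Python as below; the file name, channel labels, notch frequency and ICA settings are assumptions for illustration and do not reproduce the authors' EEGLAB configuration exactly.

```python
import mne

# Hypothetical EEGLAB export for one participant
raw = mne.io.read_raw_eeglab('sub-001_task-rest_eeg.set', preload=True)

raw.set_eeg_reference(['M1', 'M2'])                 # re-reference to the mastoids
raw.drop_channels(['M1', 'M2', 'HEOG', 'VEOG',      # remove mastoid, EOG and
                   'CB1', 'CB2'])                   # cerebellar channels (labels assumed)
raw.resample(128)                                   # downsample 500 Hz -> 128 Hz
raw.filter(l_freq=0.5, h_freq=50.0,
           fir_design='firwin')                     # FIR band-pass 0.5-50 Hz
raw.notch_filter(freqs=60.0)                        # line-noise removal (CleanLine analogue)

ica = mne.preprocessing.ICA(n_components=20, random_state=42)
ica.fit(raw)                                        # decompose into independent components
# ...inspect the components, mark artifactual ones, then:
ica.apply(raw)                                      # remove the marked components
```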
The above pre-processing steps were applied to the raw EEG data for all included participants. The final selection of the 60 EEG channel map is shown in Fig. 4.
Fig. 4.
Electrode positioning for 60 channels with respect to 10–20 electrode positioning system
Feature extraction
Feature extraction has several benefits: it reduces the amount of crucial information lost from the signal, minimises the risk of overfitting, improves overall visualisation and reduces the amount of data required to represent the signal precisely, which eases implementation challenges. The feature extraction process was applied to the 60-channel pre-processed EEG data. A total of 204 statistical, spectral, wavelet and autoregressive features were extracted for classification; these feature families were chosen for their complementary advantages, and combining them was intended to merge those advantages while offsetting each family's disadvantages. All the features were tested separately and in combination, with the motivation of determining how well the model performed with separate versus combined features and of building a better depression detection system. Table 1 outlines the advantages and disadvantages of the selected feature extraction methods. A description of these features is given in the following section.
Table 1.
Advantages and disadvantages of selected feature extraction methods
| Method | Advantages | Disadvantages |
|---|---|---|
| Statistical features | Demonstrates the viability of analysing lengthy continuous EEG signal slices | Suitable for stationary signals |
| Spectral features | Works well for narrowband signals and is faster compared to other feature extraction methods | It has poor spectrum estimation and is unable to analyse shorter EEG signals |
| Wavelet features | Window size varies, being wide at low and confined at high frequencies | Choosing the right mother wavelet is important |
| Autoregressive features | Better frequency resolution is produced by AR, which decreases the loss of spectrum instabilities | Challenging to choose the model order for spectral estimation |
Statistical features
Two types of statistical features were extracted in this study. First, basic statistical features were extracted, which included kurtosis, skewness, second difference mean, second difference max, coefficient of variation, first difference mean, first difference max, and the variance and mean of the vertex-to-vertex slope [40].
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable around its mean; it can be positive, zero, negative or undefined. Kurtosis is a statistical term describing the extent to which observations concentrate in the tails or the peak of a frequency distribution, with the peak being the highest point of the distribution and the tails the lowest. The formulas for skewness and kurtosis are given by:
$$\text{Skewness}=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{S}\right)^{3} \tag{11}$$

$$\text{Kurtosis}=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{S}\right)^{4} \tag{12}$$

where $\bar{x}$ is the data mean, $S$ is the SD and $n$ is the total number of samples. The coefficient of variation is a statistical measure of the proportional dispersion of the data points in a series around the mean and is given as follows:

$$CV=\frac{\sigma}{\mu} \tag{13}$$

where $\sigma$ is the sample standard deviation and $\mu$ is the sample mean.
Next, the Hjorth activity, mobility and complexity were extracted from the pre-processed data. In 1970, Hjorth devised a set of three components to describe EEG signals in the time domain: activity, mobility, and complexity [41]. These components can be specified using the first and second derivatives, also known as normalised gradient descriptors. The first component is the measurement of the mean power, which represents signal activity. The second component, mobility, approximates the mean frequency. The bandwidth of the signal was estimated using the final component (complexity). Because the Hjorth parameters are calculated using variance, this method requires minimal computing cost. In addition, the Hjorth time-domain aspect may be useful in instances when continuous EEG analyses are required. Three components (activity, mobility and complexity) were applied in this study. The following equations define the activity (a), mobility (m) and complexity (c):
$$a=\int_{-\infty}^{\infty}S_y(\omega)\,d\omega=\operatorname{var}\big(y(t)\big) \tag{14}$$

$$m=\sqrt{\frac{\operatorname{var}\big(y'(t)\big)}{\operatorname{var}\big(y(t)\big)}} \tag{15}$$

$$c=\frac{m\big(y'(t)\big)}{m\big(y(t)\big)} \tag{16}$$

where $S_y(\omega)$ is the power density spectrum and $y(t)$ is the EEG signal as a function of time. The discrete equations used to calculate these components are as follows:
$$\hat{a}=\operatorname{var}\big(y(n)\big) \tag{17}$$

$$\hat{m}=\sqrt{\frac{\operatorname{var}\big(\Delta y(n)\big)}{\operatorname{var}\big(y(n)\big)}} \tag{18}$$

$$\hat{c}=\frac{\hat{m}\big(\Delta y(n)\big)}{\hat{m}\big(y(n)\big)} \tag{19}$$

where $\hat{a}$, $\hat{m}$ and $\hat{c}$ are the Hjorth activity, mobility and complexity, respectively, and $\Delta y(n)$ denotes the first difference of the sampled signal.
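As an illustration of Eqs. (11) to (19), the following Python sketch computes skewness, kurtosis, the coefficient of variation and the three Hjorth parameters for a single-channel EEG segment; it uses NumPy/SciPy only and is not the authors' exact feature-extraction code.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def hjorth_parameters(y):
    """Discrete Hjorth activity, mobility and complexity (Eqs. 17-19)."""
    dy = np.diff(y)                                   # first difference
    ddy = np.diff(dy)                                 # second difference
    activity = np.var(y)
    mobility = np.sqrt(np.var(dy) / np.var(y))
    complexity = np.sqrt(np.var(ddy) / np.var(dy)) / mobility
    return activity, mobility, complexity

def basic_statistics(y):
    return {
        'skewness': skew(y),                          # Eq. (11)
        'kurtosis': kurtosis(y, fisher=False),        # Eq. (12), non-excess form
        'coeff_of_variation': np.std(y) / np.mean(y), # Eq. (13)
        'hjorth': hjorth_parameters(y),               # Eqs. (17)-(19)
    }
```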
Spectral features
The fast Fourier transform (FFT) method is commonly used for spectral feature extraction [42]. This approach analyses EEG data using mathematical methods or instruments. To properly represent the EEG signal, a power spectral density (PSD) approximation was used to determine the properties of the recorded EEG signal. The principal distinctive waveforms of the EEG spectral range were present in the four frequency ranges. Therefore, four FFT bands, that is, delta (0.1–3.9 Hz), theta (4–7.9 Hz), alpha (8–13.9 Hz) and beta (14–30 Hz) bands, were extracted as features.
PSD was computed by Fourier-transforming the nonparametrically estimated autocorrelation sequence, and Welch's approach is one method for achieving this. The data sequence is divided into segments, each segment is windowed, and modified periodograms are computed. The $j$th data segment is given as follows:

$$x_j(n)=x(n+jD),\qquad n=0,1,\ldots,R-1,\quad j=0,1,\ldots,S-1 \tag{20}$$

where $jD$ is the start of the $j$th sequence, $R$ is the segment length and $S$ is the number of segments. The Welch power spectrum $P_x^{W}(f)$, obtained by averaging the modified periodograms $\tilde{P}_j(f)=\frac{1}{RW}\left|\sum_{n=0}^{R-1}x_j(n)\,w(n)\,e^{-j2\pi fn}\right|^{2}$, is given as:

$$P_x^{W}(f)=\frac{1}{S}\sum_{j=0}^{S-1}\tilde{P}_j(f) \tag{21}$$

where $W$ is the power normalisation factor and $w(n)$ is the window function.
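A brief sketch of the spectral features described above, using SciPy's Welch estimator for the PSD of Eqs. (20) and (21) and then the maximum power in each band; the segment length is an assumption rather than the authors' setting.

```python
import numpy as np
from scipy.signal import welch

BANDS = {'delta': (0.1, 3.9), 'theta': (4.0, 7.9),
         'alpha': (8.0, 13.9), 'beta': (14.0, 30.0)}

def band_max_power(y, fs=128, nperseg=256):
    """Welch PSD (averaged modified periodograms) and per-band maximum power."""
    freqs, psd = welch(y, fs=fs, nperseg=nperseg)
    return {name: psd[(freqs >= lo) & (freqs <= hi)].max()
            for name, (lo, hi) in BANDS.items()}
```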
Wavelet transform features
The wavelet transform (WT) is useful in the identification and diagnosis of depression because it reduces many data points in a time-varying biological signal to a small number of signal-defining variables [43]. Because EEG is considered nonstationary, time–frequency methods such as WT represent the best choice for extracting features from raw data. The WT is a spectral estimation method in which any task is conveyed as an infinite number of wavelets. Because WT allows for varying frames, it offers additional flexibility in terms of signal-time representation. Large WT timeframes are used to obtain finer low-frequency resolutions; conversely, short timeframes are used to achieve a high-frequency output.
In this study, the WT was used for feature extraction based on a multiscale feature representation, with each scale under evaluation capturing a different level of detail in the EEG signal. The number of levels into which the wavelet decomposition was split determined the main frequency content captured from the EEG data.
The following formula depicts the relationship between the WT and the low-pass filter $h$:

$$H(z)H(z^{-1})+H(-z)H(-z^{-1})=1 \tag{22}$$

where $H(z)$ is the z-transform of the filter $h$. The z-transform of the corresponding high-pass filter is written as

$$G(z)=zH(-z^{-1}) \tag{23}$$
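A sketch of the wavelet features using PyWavelets is given below; the mother wavelet ('db4') and the decomposition level are assumptions, since the paper does not state them, and the entropy shown is one common normalised-energy variant.

```python
import numpy as np
import pywt

def wavelet_features(y, wavelet='db4', level=4):
    """Mean, std, energy and entropy of approximation and detail coefficients."""
    coeffs = pywt.wavedec(y, wavelet, level=level)     # [cA_n, cD_n, ..., cD_1]
    approx, details = coeffs[0], np.concatenate(coeffs[1:])

    def describe(c):
        energy = float(np.sum(c ** 2))
        p = c ** 2 / energy                            # normalised energy distribution
        entropy = float(-np.sum(p * np.log2(p + 1e-12)))
        return {'mean': c.mean(), 'std': c.std(),
                'energy': energy, 'entropy': entropy}

    return {'approximate': describe(approx), 'detailed': describe(details)}
```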
Autoregressive features
Using a parametric approach, autoregressive (AR) techniques were used to estimate the EEG PSD [43]. Unlike nonparametric approaches, AR methods do not suffer from spectral leakage and hence provide better frequency resolution. The PSD was estimated by estimating the coefficients (parameters) of the model under inspection. Selecting the model order is an important task that influences spectral leakage and resolution: a model order that is too high introduces spurious peaks in the spectrum, whereas one that is too low produces an over-smoothed spectrum. In this study, a model order of 3 was used. Burg's AR method was employed for the feature extraction, and the total number of features obtained was 180.

To satisfy the Levinson–Durbin recursion, the AR spectral estimation was centred on minimising the forward and backward prediction errors. Burg's method calculates the reflection coefficient without requiring calculation of the autocorrelation function. The advantages of this method are as follows: Burg's approach can approximate the PSD of the available data to closely match the original input, and when the signal has a low noise level it can resolve closely spaced sinusoids. The PSD for Burg's method was calculated as follows:
$$P_{\mathrm{Burg}}(f)=\frac{\hat{E}_p}{\left|1+\sum_{k=1}^{p}\hat{a}_p(k)\,e^{-j2\pi fk}\right|^{2}} \tag{24}$$

where $\hat{E}_p$ is the (approximate) least mean square error of the $p$th-order predictor, $k$ indexes the coefficients and $\hat{a}_p(k)$ are the AR parameters.
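For readers wishing to reproduce the AR features, the sketch below implements Burg's recursion in NumPy, returning the order-p coefficients and error power used in Eq. (24); with a model order of 3 per channel this yields the 3 × 60 = 180 AR coefficients mentioned above. It is an illustrative implementation, not the authors' code.

```python
import numpy as np

def burg_ar(x, order=3):
    """Burg's method: AR coefficients a_1..a_p and prediction-error power E_p."""
    x = np.asarray(x, dtype=float)
    ef = x.copy()                                    # forward prediction errors
    eb = x.copy()                                    # backward prediction errors
    a = np.zeros(0)                                  # AR coefficients, one added per stage
    E = np.mean(x ** 2)                              # order-0 error power
    for _ in range(order):
        efp, ebp = ef[1:], eb[:-1]
        k = -2.0 * np.dot(efp, ebp) / (np.dot(efp, efp) + np.dot(ebp, ebp))
        ef, eb = efp + k * ebp, ebp + k * efp        # update the error sequences
        a = np.concatenate((a + k * a[::-1], [k]))   # Levinson-Durbin update
        E *= (1.0 - k ** 2)                          # shrink the error power
    return a, E
```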
Feature selection
Having many features is beneficial for the development of ML/DL models. However, employing only a small number of the most important features can create a better and more effective model. To achieve this, a semi-automated technique is required to select the relevant features for the supervised classifier. In this study, two wrapper-based FS methods were used—RFE and the Boruta algorithm with SHAP importance—to enhance generalisation.
RFE
Recursive feature elimination is an FS technique that recursively fits a supervised model on progressively smaller sets of features [44], eliminating features based on their importance weights.
The steps for recursive feature elimination are as follows (an illustrative code sketch is given after the list):
- Step 1:
Build an estimator (in this case, a Light Gradient Boosting Machine (LGBM) estimator was used with all features).
- Step 2:
Calculate the ranking based on the feature importance (in the current case, SHAP importance).
- Step 3:
The features are sorted according to their importance.
- Step 4:
Remove the most irrelevant features and re-fit the estimator.
- Step 5:
Compute the differences in performance between the models in successive iterations in Step 4.
- Step 6:
Features with no improvement in performance are excluded.
- Step 7:
Repeat Steps 4–6 until all features have been considered.
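A hedged sketch of this recursive loop using scikit-learn's RFE with an LGBM estimator is shown below; the study itself used the 'shap-hypetune' BoostRFE wrapper with SHAP importances, so the plain-importance version here only illustrates the elimination idea, and the placeholder data stand in for the extracted feature matrix.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.feature_selection import RFE

X_train = np.random.randn(500, 204)          # placeholder for the 204 extracted features
y_train = np.random.randint(0, 2, 500)       # placeholder depression/control labels

selector = RFE(estimator=LGBMClassifier(n_estimators=200),
               n_features_to_select=90,       # number of features retained with RFE here
               step=1)                        # drop one feature per iteration
selector.fit(X_train, y_train)
selected_mask = selector.support_             # boolean mask of the retained features
```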
Boruta
Boruta is a simple, lesser-known technique in which a tree-based supervised model is iteratively fitted on an extended version of the tabular dataset [45]. In each iteration, the extended version is created from the original data by horizontally concatenating duplicates of the columns with their values shuffled. Only features that rank higher than the best of the randomised (shadow) features and that perform better than chance (assessed using a binomial distribution) are retained in each iteration.
The steps involved in the Boruta FS are as follows (an illustrative code sketch is given after the list):
- Step 1:
Duplicates are created for all features in the given data collection, which are then randomised and known as shadow features.
- Step 2:
Using the expanded dataset, the Boruta FS trains a random forest classifier and uses a feature importance metric (the default is mean decreased accuracy) to evaluate the relevance of each feature, with higher values indicating greater importance.
- Step 3:
Next, check if an actual feature is more significant than the high-valued shadow features (i.e. if the feature has a higher Z-score than the largest Z-score of its shadow features) at each iteration, and remove features that are considered insignificant.
- Step 4:
Once all features have been approved/discarded or when the number of gradient boost runs reaches a predefined limit, the process is complete.
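The sketch below shows the canonical Boruta procedure using the 'boruta' package (BorutaPy) with a random forest; the study instead used a gradient-boosting variant via 'shap-hypetune' with SHAP importances, and the data here are placeholders.

```python
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

X = np.random.randn(500, 204)                 # placeholder feature matrix
y = np.random.randint(0, 2, 500)              # placeholder binary labels

rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
boruta = BorutaPy(rf, n_estimators='auto', random_state=42)
boruta.fit(X, y)                              # compares each feature with its shadow copies
selected_mask = boruta.support_               # features confirmed across the iterations
```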
Result
Environmental setting
The training and testing of the models were performed in Google Colab Pro using 32 gigabytes of random-access memory and a Tesla T4 graphics processing unit (GPU) for accelerated performance. All processes were implemented in Python using the Keras library. The first step was to randomly divide the data into three sets: 80% for training, 10% for validation and 10% for testing.
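A minimal sketch of this 80/10/10 random split using scikit-learn is given below; the feature matrix and labels are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 90)                 # placeholder: selected feature matrix
y = np.random.randint(0, 2, 1000)             # placeholder: depression vs. control labels

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
# Result: 80% training, 10% validation, 10% testing
```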
Evaluation metric
To evaluate the performance of the time-series models, accuracy, precision, recall, F1-score and the receiver operating characteristic (ROC) curve with area under the curve (AUC) values were used as metrics. Defining each of these performance metrics is beyond the scope of this study; interested readers can refer to [46] for more information. Each metric takes a value between 0 and 1, with values closer to 1 indicating better model performance. These values were converted into percentages in this study.
Computation protocol and results
Three experiments were conducted for the data training: without FS (204 features), with RFE FS (90 features) and with Boruta FS (60 features). The AttGRUT model is a time-series model, and its performance was compared with that of the four most popular time-series models: LSTM, GRU, CNN-LSTM and CNN-GRU. Table 2 provides an architectural summary of the layers and parameters of the proposed AttGRUT model. A brief description of the number of layers and the different operations, along with the parameters for the baseline time-series models, is given in Table 3. All models were trained for 200 epochs.
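A hedged sketch of the training setup follows, reusing the build_attgrut function and split placeholders from the earlier sketches; the optimiser, loss and batch size are assumptions, as the paper only states that the models were trained for 200 epochs.

```python
# Assumes build_attgrut(), X_train, y_train, X_val and y_val from the earlier sketches.
model = build_attgrut(n_features=90)          # 204 / 90 / 60 depending on the FS setting
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(X_train[..., None], y_train,              # add a trailing channel axis
                    validation_data=(X_val[..., None], y_val),
                    epochs=200, batch_size=32)
```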
Table 2.
Architectural summary of the proposed attention-based gated recurrent unit transformer model
| Operation | Parameter |
|---|---|
| Input | Without FS: (204,1), with RFE FS (90, 1), with Boruta FS (60, 1) |
| Encoder block × 3 | |
| Multihead attention | (256, 4 × 4) |
| Dropout | 0.25 |
| Layer Normalisation | – |
| Conv1D + ReLU | 4 × 1 |
| Dropout | 0.25 |
| Conv1D | 4 × 1 |
| Dropout | 0.25 |
| Conv1D | 4 × 1 |
| Dropout | 0.25 |
| Layer Normalisation | – |
| GRU | 64 |
| GRU | 32 |
| Global average pooling | 1 × 1 |
| Dense + ReLU (MLP units) | 128 |
| Dropout | 0.25 |
| Dense + softmax | 2 |
Table 3.
Architecture of LSTM, GRU, CNN-LSTM, and CNN-GRU models
| Layers | LSTM | GRU | ||
|---|---|---|---|---|
| Operation | Parameters | Operation | Parameters | |
| 1 | LSTM | 256 | GRU | 256 |
| 2 | Dropout | 0.2 | Dropout | 0.2 |
| 3 | LSTM | 32 | GRU | 32 |
| 4 | Dropout | 0.2 | Dropout | 0.2 |
| 5 | Flatten | - | Flatten | - |
| 6 | Dense + ReLU | 128 | Dense + ReLU | 128 |
| 7 | Dropout | 0.2 | Dropout | 0.2 |
| 8 | Dense + Softmax | 2 | Dense + Softmax | 2 |
| CNN-LSTM | CNN-GRU | |||
|---|---|---|---|---|
| 1 | Conv1D + ReLU | 128 × 3 | Conv1D + ReLU | 128 × 3 |
| 2 | MaxPooling1D | 2 | MaxPooling1D | 2 |
| 3 | Dropout | 0.2 | Dropout | 0.2 |
| 4 | Conv1D + ReLU | 128 × 3 | Conv1D + ReLU | 128 × 3 |
| 5 | MaxPooling1D | 2 | MaxPooling1D | 2 |
| 6 | Dropout | 0.2 | Dropout | 0.2 |
| 7 | LSTM | 256 | GRU | 256 |
| 8 | Dropout | 0.2 | Dropout | 0.2 |
| 9 | LSTM | 32 | GRU | 32 |
| 10 | Dropout | 0.2 | Dropout | 0.2 |
| 11 | Flatten | – | Flatten | – |
| 12 | Dense + ReLU | 128 | Dense + ReLU | 128 |
| 13 | Dropout | 0.2 | Dropout | 0.2 |
| 14 | Dense + Softmax | 2 | Dense + Softmax | 2 |
The 'shap-hypetune' Python package was used for parameter tuning and feature selection, with its 'BoostBoruta' and 'BoostRFE' classes used for the feature selection step. Both methods used LGBM as the estimator with SHAP as the feature importance. For parameter tuning, the authors considered 200 estimators; the learning rates were 0.3, 0.2 and 0.1, and the numbers of leaves were 20, 25 and 30, respectively. The FS procedure was repeated ten times for each method, and the selected features were recorded for each trial using conventional tree-based feature importance with SHAP importance. Only the features selected in all ten trials were considered as the final inputs for prediction. Table 4 shows the important features selected by the two FS methods (RFE and Boruta) across all ten iterations, and Figures S1 and S2 in the supplementary material show the number of times each feature was selected across trials.
Table 4.
Features selected after feature selection
| Features | RFE-SHAP | Boruta-SHAP |
|---|---|---|
| Statistical | 7 (Activity, mobility, complexity, 2nd Difference Mean, Coefficient of Variation, 1st Difference Mean, Mean of Vertex-to-Vertex Slope) | 6 (Mobility, complexity, 2nd diff mean, Coefficient of Variation, 1st Difference Mean, Mean of Vertex-to-Vertex Slope) |
| Spectral | 4 (FFT Delta Max Power, FFT Theta Max Power, FFT Alpha Max Power, FFT Beta max power) | 4 (FFT Delta Max Power, FFT Theta Max Power, FFT Alpha Max Power, FFT Beta max power) |
| WT | 6 (Wavelet Approximate Mean, Wavelet Approximate Std Deviation, Wavelet Detailed Std Deviation, Wavelet Detailed Energy, Wavelet Approximate Entropy, Wavelet Detailed Entropy) | 5 (Wavelet Approximate Mean, Wavelet Detailed Std Deviation, Wavelet Approximate Energy, Wavelet Detailed Energy, Wavelet Approximate Entropy) |
| Autoregressive | 73 autoregressive coefficients | 45 autoregressive coefficients |
| Total | 90 | 60 |
Table 5 highlights the comparative performance of all models for the different feature families separately. Based on this table, the proposed model outperformed the baseline models. Of all the features, the autoregressive features performed the best, with 92.38% accuracy obtained with RFE FS using the proposed model. The next step was to combine all the features to increase the accuracy of the proposed model.
Table 5.
Test performance and comparison of the proposed AttGRUT model with the baseline and hybrid time-series models with separate features (in percentage)
| Models | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| With spectral features | ||||
| LSTM | 66.4459 | 65.3110 | 63.1944 | 64.2353 |
| GRU | 64.7903 | 70.3072 | 47.0320 | 56.3611 |
| CNN-LSTM | 66.8874 | 65.7534 | 65.7534 | 65.7534 |
| CNN-GRU | 66.4459 | 63.7860 | 70.7763 | 67.0996 |
| AttGRUT (proposed) | 68.2119 | 68.2039 | 64.1553 | 66.1176 |
| With wavelet features | ||||
| LSTM | 67.2185 | 66.9746 | 65.3153 | 66.1345 |
| GRU | 62.3620 | 70.2786 | 48.0932 | 57.1069 |
| CNN-LSTM | 68.6534 | 70.7048 | 68.0085 | 69.3305 |
| CNN-GRU | 73.5099 | 74.4726 | 74.7881 | 74.6300 |
| AttGRUT (proposed) | 73.7307 | 75.8850 | 72.6695 | 74.2424 |
| With statistical features | ||||
| LSTM | 63.6865 | 62.1013 | 72.2707 | 66.8012 |
| GRU | 61.4790 | 66.1721 | 48.6900 | 56.1006 |
| CNN-LSTM | 69.2053 | 70.5747 | 67.0306 | 68.7570 |
| CNN-GRU | 68.6998 | 69.2984 | 69.8849 | 69.0415 |
| AttGRUT (proposed) | 70.0883 | 68.1553 | 76.6376 | 72.1480 |
| With autoregressive features | ||||
| LSTM | 62.3620 | 61.2836 | 65.7778 | 63.4512 |
| GRU | 58.0574 | 61.6667 | 41.1111 | 49.3333 |
| CNN-LSTM | 90.0662 | 90.3587 | 89.5556 | 89.9554 |
| CNN-GRU | 89.8455 | 90.4328 | 88.8143 | 89.6163 |
| AttGRUT (proposed) | 92.3841 | 92.8090 | 91.7778 | 92.2905 |
Table 6 shows the comparative performance of the proposed model and the four baseline models with combined features, with and without feature selection. Figure 5 shows a graphical representation of the accuracy obtained across all the models. AttGRUT outperformed the other hybrid and single models in classifying depression and control, with or without feature selection, as shown in Fig. 5. The highest accuracy achieved by AttGRUT, with RFE feature selection, was 98.67%. Throughout the experiment, an increase in overall performance was observed with feature selection. The second highest overall accuracy was obtained with Boruta FS (97.01%) using the proposed model. CNN-GRU was the second highest-performing model after the proposed model, with an accuracy of up to 93.92%. GRU was the lowest-performing model overall.
Table 6.
Test performance and comparison of the proposed AttGRUT model with baseline and hybrid time-series models with combined features (in percentage)
| Models | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Without FS | ||||
| LSTM | 73.9514 | 80.1609 | 64.8590 | 71.7026 |
| GRU | 56.0706 | 76.9231 | 19.5228 | 31.1419 |
| CNN-LSTM | 88.7417 | 94.1032 | 83.0803 | 88.2488 |
| CNN-GRU | 93.0464 | 94.2222 | 91.9740 | 93.0845 |
| AttGRUT (proposed) | 95.8057 | 96.8254 | 94.6785 | 95.7399 |
| With BORUTA FS | ||||
| LSTM | 93.8190 | 93.4924 | 94.3107 | 93.8998 |
| GRU | 67.3289 | 76.0518 | 51.4223 | 61.3577 |
| CNN-LSTM | 85.0993 | 85.0000 | 85.5580 | 85.2781 |
| CNN-GRU | 92.6049 | 93.5268 | 91.6849 | 92.5967 |
| AttGRUT (proposed) | 97.0199 | 97.2574 | 97.0526 | 97.1549 |
| With RFE FS | ||||
| LSTM | 93.1567 | 95.3125 | 91.2393 | 93.2314 |
| GRU | 66.6667 | 77.6667 | 49.7863 | 60.6771 |
| CNN-LSTM | 95.3642 | 96.1039 | 94.8718 | 95.4839 |
| CNN-GRU | 93.9294 | 95.1860 | 92.9487 | 94.0541 |
| AttGRUT (proposed) | 98.6755 | 98.6175 | 98.6175 | 98.6175 |
FS feature selection
Fig. 5.
Graphical representation of accuracy obtained across different classifier models with and without feature selection (in percentage)
Table 6 shows that RFE was the best FS method, yielding the highest accuracy for most classification models. The maximum increments in accuracy obtained by LSTM, GRU, CNN-LSTM and CNN-GRU with RFE FS, compared with classification without feature selection, were 19.20%, 10.59%, 6.62% and 0.88%, respectively. AttGRUT with RFE FS showed a 2.87% increase in accuracy compared with no FS, and it performed better with RFE FS than with Boruta FS by just 1.66%. The proposed model performed consistently well, with and without feature selection, with an accuracy above 95%. For the baseline models, FS appeared to be an essential step for better classification performance. Although the performances of the two FS methods were similar for the proposed model, their execution times differed: RFE FS achieved better performance, whereas Boruta FS required a comparatively lower execution time.
Figures 6, 7 and 8 show the ROC curves comparing the AttGRUT model with the other baseline models without FS, with Boruta FS and with RFE FS, respectively. The ROC curve determines the area under the curve (AUC) value, which is shown in the figures. The AUC value ranges from 0 to 1, with values closer to 1 indicating a better prediction. From the figures, it can be observed that the AUC value of the proposed model is very close to 1, indicating an excellent prediction.
Fig. 6.

ROC curves for all models used in the experiment without feature selection
Fig. 7.

ROC curve for all models used in the experiment with boruta feature selection
Fig. 8.

ROC curves for all models used in the experiment with RFE feature selection
Discussion
Depression is among the world's biggest health concerns, and detecting it in its initial stages is challenging. Electroencephalography-based technologies have recently been used to study depression as a condition reflected in the electrical activity of the human brain. These EEG signals are time-series signals; thus, a transformer for time-series data with multi-head attention was proposed in this study. The added attention units make the model's predictions more amenable to human interpretation, while the retention of useful long-term information is handled by the GRU layers in the proposed model: GRUs have forget and update gates that help remember useful information and forget irrelevant information. In contrast to the existing literature, two less-explored iterative FS approaches were used to select a robust set of features. The results obtained in this study outperformed those of the baseline and hybrid models. The remaining discussion focuses on these aspects.
This study demonstrated the utility of the important features selected in Table 4, which helped considerably increase the model's predictive performance. In the literature, few studies have employed multiple feature categories together, with most using a single feature category. In this study, autoregressive features gave the best performance compared with the other features. Autoregressive features have been commonly used for feature extraction, and the findings are in line with the existing literature, where autoregressive features have performed exceptionally well for making predictions [47, 48]. The next best performing feature family was the DWT, with the advantages of a varying window size and better handling of unexpected signal irregularities. The statistical and FFT features were the worst performing, even though these two categories have been the most used in the literature [14].
RFE and the Boruta FS approach had previously been implemented for mental health predictions [49] [50] [51]. In this study, the most important FS method was RFE. In [52], two optimal features were selected using SVM-RFE to detect major depression, with an accuracy of 74.4%. The Boruta method performed like RFE but is generally lesser known. In an article on depression prediction, the Boruta algorithm was found to be the greatest contributor to the final FS [53].
Transformers have achieved outstanding performance on various NLP and CV tasks, giving rise to additional studies on their potential role in a time-series context. Among the many benefits of transformers, the ability to capture long-range connections and interconnections is particularly appealing for time-series analysis and has driven advancements in various time-series applications. In the past few years, time-series transformers have been implemented in different application domains, including forecasting [54], spatial and temporal forecasting [55], event classification [56], anomaly detection [57] and classification tasks [58].
The AttGRUT model proposed in the present research achieved outstanding performance because of the attention-based mechanism implemented for training the transformer. The attention mechanism is known for its greater focus on a particular part of a complex unit. The mechanism aims to break down complex tasks into smaller attention blocks for subsequent processing. Additionally, the model’s successful performance could also be attributed to the addition of GRU layers, which helped boost the model’s available memory, thereby making training the model easier.
Table 7 presents a comparison between the methodologies adopted in the present study and those in the existing studies. Notably, the methodology adopted in this study outperformed those in existing research.
Table 7.
Methodology comparison of the present study with existing studies
| Year | Dataset | Subjects | Features | Classifier | Accuracy (%) |
|---|---|---|---|---|---|
| 2018 [59] | Own | 13 healthy, 13 depressed | Linear, non-linear methods | Logistic regression | 92 |
| 2019 [60] | publicly available | 30 healthy, 30 depressed | Linear, non-linear | Multi-layer perceptron, radial basis function | 93.33 |
| 2019 [61] | Own | 20 healthy, 24 depressed | Frequency bands, sample entropy, Detrended Fluctuation Analysis | Gaussian SVM | 90.26 |
| 2022 [62] | Own | 58 healthy, 34 Depressed | Delta, theta, alpha, beta | Transductive SVM | 89 |
| 2022 [19] | Own | 40 healthy, 40 depressed | Five frequency bands | Proposed CNN-LSTM | 94.69 |
| Present study (proposed AttGRUT) | OpenNeuro | 46 healthy, 46 depressed | Statistical, spectral, wavelet, autoregressive | Attention-based GRU transformer with RFE FS | 98.67 |
The execution times for the proposed model with and without FS were 3–4 min and 26 min, respectively, whereas the baseline and hybrid models required comparatively less execution time. Although AttGRUT was computationally expensive, it outperformed the other models in identifying depression. Figure 9 shows the accuracy and loss curves obtained from the AttGRUT model with RFE FS. The accuracy curve represents the model's fit in terms of its ability to make predictions, while the loss curve shows the loss values, that is, the summation of the errors generated by the model; lower loss values represent better model performance. Analysing the accuracy and loss curves together was insightful: based on Fig. 9, the accuracy increased and the loss decreased over time, an ideal situation that illustrates good performance.
Fig. 9.
AttGRUT model accuracy and loss graph with RFE feature selection
This study proposes a reliable system for identifying people with depression in real-world situations. In the future, psychiatrists could monitor a patient’s mental state using the proposed methodology.
However, the proposed model has certain drawbacks. Brain region-specific predictions can be incorporated for improved classification. Furthermore, this study was limited to participants within a relatively young age group. Further investigations should include larger sample sizes from different age groups.
Conclusion and future work
The EEG data used in this study, analysed with DL models, indicated a shift in mental activity in cases of depression. The proposed AttGRUT model, an attention-based GRU transformer model, outperformed the other time-series models, both with and without FS. Features from four extraction methods (statistical, spectral, WT and AR) were extracted and combined, and two FS methods (RFE and Boruta) were applied with SHAP feature importance. Each FS method was iterated ten times, and the features selected in all rounds were considered as the final input to the classifier.
A novel time-series model was proposed and compared with other baseline and hybrid DL time-series models. The highest accuracy achieved was 98.67% with RFE and 97.01% with the Boruta FS algorithm; without FS, the highest accuracy achieved was 95.80%. The AttGRUT model performed exceptionally well with or without FS, whereas the baseline models showed a substantial performance increase with FS.
This research is aimed at creating a brain-computer interface (BCI) that uses EEG data from the brain to detect levels of depression, anxiety and other anomalies. The BCI transfers information to a web server and makes therapeutic recommendations. The application aims to act as a tool for psychiatrists, assisting them in tracking their patients' treatment histories. The strategy proposed in this work serves as a first step towards improving people's mental health by improving detection procedures. Finally, the efficacy of the AttGRUT model should be tested with other physiological modalities.
Supplementary Information
Below is the link to the electronic supplementary material.
Declarations
Conflict of interest
The author confirms that there is no conflict of interest and there are no financial funds.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Benazzi F. Various forms of depression. Dialogues Clin Neurosci. 2022;8:151–161. doi: 10.31887/DCNS.2006.8.2/fbenazzi.
- 2.Paykel ES. Basic concepts of depression. Dialogues Clin Neurosci. 2022;10:279–289. doi: 10.31887/DCNS.2008.10.3/espaykel.
- 3.Kamenov K, Caballero FF, Miret M, Leonardi M, Sainio P, Tobiasz-Adamczyk B, Haro JM, Chatterji S, Ayuso-Mateos JL, Cabello M. Which are the most burdensome functioning areas in depression? A cross-national study. Front Psychol. 2016;7:1342. doi: 10.3389/fpsyg.2016.01342.
- 4.Cacheda F, Fernandez D, Novoa FJ, Carneiro V. Early detection of depression: social network analysis and random forest techniques. J Med Internet Res. 2019;21(6):e12554. doi: 10.2196/12554.
- 5.Ayano G, Demelash S, Haile K, Tulu M, Assefa D, Tesfaye A, Haile K, Solomon M, Chaka A, Tsegay L. Misdiagnosis, detection rate, and associated factors of severe psychiatric disorders in specialized psychiatry centers in Ethiopia. Ann Gen Psychiatry. 2021;20(1):1. doi: 10.1186/s12991-021-00333-7.
- 6.Prabhakar SK, Rajaguru H, Lee SW. A framework for schizophrenia EEG signal classification with nature inspired optimization algorithms. IEEE Access. 2020;8:39875–39897.
- 7.Usman SM, Khalid S, Bashir Z. Epileptic seizure prediction using scalp electroencephalogram signals. Biocybern Biomed Eng. 2021;41(1):211–220.
- 8.Sánchez-Reyes LM, Rodríguez-Reséndiz J, Avecilla-Ramírez GN, García-Gomar ML, Robles-Ocampo JB. Impact of EEG parameters detecting dementia diseases: a systematic review. IEEE Access. 2021;9:78060.
- 9.Cannon J, O'Brien AM, Bungert L, Sinha P. Prediction in autism spectrum disorder: a systematic review of empirical evidence. Autism Res. 2021;14(4):604–630. doi: 10.1002/aur.2482.
- 10.Sharma G, Parashar A, Joshi AM. DepHNN: a novel hybrid neural network for electroencephalogram (EEG)-based screening of depression. Biomed Signal Process Control. 2021;66:102393.
- 11.Liu GD, Li YC, Zhang W, Zhang L. A brief review of artificial intelligence applications and algorithms for psychiatric disorders. Engineering. 2020;6(4):462–467.
- 12.Chen X, Li C, Liu A, McKeown MJ, Qian R, Wang ZJ. Toward open-world electroencephalogram decoding via deep learning: a comprehensive survey. IEEE Signal Process Mag. 2022;39(2):117–134.
- 13.Safayari A, Bolhasani H. Depression diagnosis by deep learning using EEG signals: a systematic review. Med Novel Technol Devices. 2021;12:100102.
- 14.Khosla A, Khandnor P, Chand T. Automated diagnosis of depression from EEG signals using traditional and deep learning approaches: a comparative analysis. Biocybern Biomed Eng. 2021;42:108–142.
- 15.Rivera MJ, Teruel MA, Maté A, Trujillo J. Diagnosis and prognosis of mental disorders by means of EEG and deep learning: a systematic mapping study. Artif Intell Rev. 2021;2021:1–43.
- 16.Roy Y, Banville H, Albuquerque I, Gramfort A, Falk TH, Faubert J. Deep learning-based electroencephalography analysis: a systematic review. J Neural Eng. 2019;16(5):051001. doi: 10.1088/1741-2552/ab260c.
- 17.Hosseini MP, Hosseini A, Ahi K. A review on machine learning for EEG signal processing in bioengineering. IEEE Rev Biomed Eng. 2020;14:204–218. doi: 10.1109/RBME.2020.2969915.
- 18.Lei Y, Belkacem AN, Wang X, Sha S, Wang C, Chen C. A convolutional neural network-based diagnostic method using resting-state electroencephalograph signals for major depressive and bipolar disorders. Biomed Signal Process Control. 2022;72:103370.
- 19.Song X, Yan D, Zhao L, Yang L. LSDD-EEGNet: an efficient end-to-end framework for EEG-based depression detection. Biomed Signal Process Control. 2022;75:103612.
- 20.Aydemir E, Tuncer T, Dogan S, Gururajan R, Acharya UR. Automated major depressive disorder detection using melamine pattern with EEG signals. Appl Intell. 2021;51(9):6449–6466.
- 21.Seal A, Bajpai R, Agnihotri J, Yazidi A, Herrera-Viedma E, Krejcar O. DeprNet: a deep convolution neural network framework for detecting depression using EEG. IEEE Trans Instrum Meas. 2021;70:1–3.
- 22.Čukić M, Stokić M, Simić S, Pokrajac D. The successful discrimination of depression from EEG could be attributed to proper feature extraction and not to a particular classification method. Cogn Neurodyn. 2020;14(4):443–455. doi: 10.1007/s11571-020-09581-x.
- 23.Zhao L, Yang L, Li B, Su Z, Liu C. Frontal alpha EEG asymmetry variation of depression patients assessed by entropy measures and Lempel–Ziv complexity. J Med Biol Eng. 2021;41(2):146–154.
- 24.Akbari H, Sadiq MT, Rehman AU, Ghazvini M, Naqvi RA, Payan M, Bagheri H, Bagheri H. Depression recognition based on the reconstruction of phase space of EEG signals and geometrical features. Appl Acoust. 2021;179:108078.
- 25.Craik A, He Y, Contreras-Vidal JL. Deep learning for electroencephalogram (EEG) classification tasks: a review. J Neural Eng. 2019;16(3):031001. doi: 10.1088/1741-2552/ab0ab5.
- 26.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:5998–6008.
- 27.Kenton JD, Toutanova LK. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019).
- 28.Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J. An image is worth 16x16 words: transformers for image recognition at scale. https://arxiv.org/abs/2010.11929 (2020).
- 29.Dong L, Xu S, Xu B. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE. pp. 5884–5888 (2018).
- 30.Chen L, Lu K, Rajeswaran A, Lee K, Grover A, Laskin M, Abbeel P, Srinivas A, Mordatch I. Decision transformer: reinforcement learning via sequence modeling. Adv Neural Inf Process Syst. 2021;34:15084–15097.
- 31.Wu N, Green B, Ben X, O'Banion S. Deep transformer models for time series forecasting: the influenza prevalence case. https://arxiv.org/abs/2001.08317 (2020).
- 32.Ahmed S, Nielsen IE, Tripathi A, Siddiqui S, Rasool G, Ramachandran RP. Transformers in time-series analysis: a tutorial. https://arxiv.org/abs/2205.01138 (2022).
- 33.Wen Q, Zhou T, Zhang C, Chen W, Ma Z, Yan J, Sun L. Transformers in time series: a survey. https://arxiv.org/abs/2202.07125 (2022).
- 34.Jha RR, Bhardwaj A, Garg D, Bhavsar A, Nigam A. MHATC: Autism Spectrum Disorder identification utilizing multi-head attention encoder along with temporal consolidation modules. https://arxiv.org/abs/2201.00404 (2021).
- 35.Yi P, Chen K, Ma Z, Zhao D, Pu X, Ren Y. EEGDnet: fusing non-local and local self-similarity for 1-D EEG signal denoising with 2-D transformer. https://arxiv.org/abs/2109.04235 (2021).
- 36.Bagchi S, Bathula DR. EEG-ConvTransformer for single-trial EEG-based visual stimulus classification. Pattern Recogn. 2022;129:108757.
- 37.Wang YA, Chen YN. What do position embeddings learn? An empirical study of pre-trained language model positional encoding. https://arxiv.org/abs/2010.04903 (2020).
- 38.Cavanagh JF. EEG: Depression rest. OpenNeuro. (2021) https://openneuro.org/datasets/ds003478/versions/1.1.0. Accessed 9 June 2022.
- 39.Delorme A, Makeig S. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J Neurosci Methods. 2004;134(1):9–21. doi: 10.1016/j.jneumeth.2003.10.009.
- 40.Übeyli ED. Statistics over features: EEG signals analysis. Comput Biol Med. 2009;39(8):733–741. doi: 10.1016/j.compbiomed.2009.06.001.
- 41.Hjorth B. EEG analysis based on time domain properties. Electroencephalogr Clin Neurophysiol. 1970;29(3):306–310. doi: 10.1016/0013-4694(70)90143-4.
- 42.Li M, Chen W. FFT-based deep feature learning method for EEG classification. Biomed Signal Process Control. 2021;66:102492.
- 43.Al-Fahoum AS, Al-Fraihat AA. Methods of EEG signal features extraction using linear analysis in frequency and time-frequency domains. Int Sch Res Not. 2014;2014:730218. doi: 10.1155/2014/730218.
- 44.Marcílio WE, Eler DM. From explanations to feature selection: assessing SHAP values as feature selection mechanism. In: 2020 33rd SIBGRAPI conference on graphics, patterns and images (SIBGRAPI). IEEE, pp. 340–347 (2020).
- 45.Gramegna A, Giudici P. Shapley feature selection. FinTech. 2022;1(1):72–80.
- 46.Erickson BJ, Kitamura F. Magician's corner: 9. Performance metrics for machine learning models. Radiology. 2021;3(3):e200126. doi: 10.1148/ryai.2021200126.
- 47.Yu PN, Liu CY, Heck CN, Berger TW, Song D. A sparse multiscale nonlinear autoregressive model for seizure prediction. J Neural Eng. 2021;18(2):026012. doi: 10.1088/1741-2552/abdd43.
- 48.Attia A, Moussaoui A, Chahir Y. Epileptic seizures identification with autoregressive model and firefly optimization based classification. Evol Syst. 2021;12(3):827–836.
- 49.Mohan P, Paramasivam I. Feature reduction using SVM-RFE technique to detect autism spectrum disorder. Evol Intell. 2021;14(2):989–997.
- 50.Zulfiker MS, Kabir N, Biswas AA, Nazneen T, Uddin MS. An in-depth analysis of machine learning approaches to predict depression. Curr Res Behav Sci. 2021;2:100044.
- 51.Haque UM, Kabir E, Khanam R. Detection of child depression using machine learning methods. PLoS ONE. 2021;16(12):e0261131. doi: 10.1371/journal.pone.0261131.
- 52.Byun S, Kim AY, Jang EH, Kim S, Choi KW, Yu HY, Jeon HJ. Detection of major depressive disorder from linear and nonlinear heart rate variability features during mental task protocol. Comput Biol Med. 2019;112:103381. doi: 10.1016/j.compbiomed.2019.103381.
- 53.Alghowinem SM, Gedeon T, Goecke R, Cohn J, Parker G. Interpretation of depression detection models via feature selection methods. In: IEEE transactions on affective computing (2020).
- 54.Zhou T, Ma Z, Wen Q, Wang X, Sun L, Jin R. FEDformer: frequency enhanced decomposed transformer for long-term series forecasting. https://arxiv.org/abs/2201.12740 (2022).
- 55.Yu C, Ma X, Ren J, Zhao H, Yi S. Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In: European conference on computer vision. Springer, Cham; pp. 507–523 (2020).
- 56.Mei H, Yang C, Eisner J. Transformer embeddings of irregularly spaced events and their participants. In: International conference on learning representations (2021).
- 57.Xu J, Wu H, Wang J, Long M. Anomaly transformer: time series anomaly detection with association discrepancy. https://arxiv.org/abs/2110.02642 (2021).
- 58.Liu M, Ren S, Ma S, Jiao J, Chen Y, Wang Z, Song W. Gated transformer networks for multivariate time series classification. https://arxiv.org/abs/2103.14438 (2021).
- 59.Bachmann M, Päeske L, Kalev K, Aarma K, Lehtmets A, Ööpik P, Lass J, Hinrikus H. Methods for classifying depression in single channel EEG using linear and nonlinear signal analysis. Comput Methods Programs Biomed. 2018;155:11–17. doi: 10.1016/j.cmpb.2017.11.023.
- 60.Mahato S, Paul S. Detection of major depressive disorder using linear and non-linear features from EEG signals. Microsyst Technol. 2019;25(3):1065–1076.
- 61.Mahato S, Goyal N, Ram D, Paul S. Detection of depression and scaling of severity using six channel EEG data. J Med Syst. 2020;44(7):1–2. doi: 10.1007/s10916-020-01573-y.
- 62.Lin H, Jian C, Cao Y, Ma X, Wang H, Miao F, Fan X, Yang J, Zhao G, Zhou H. MDD-TSVM: a novel semisupervised-based method for major depressive disorder detection using electroencephalogram signals. Comput Biol Med. 2022;140:105039. doi: 10.1016/j.compbiomed.2021.105039.