Abstract
High inter-subject variability and the non-stationary nature of EEG signals pose significant challenges for subject-independent Brain-Computer Interfaces (BCIs), leading to poor model generalization. Differences in neural activity patterns, electrode placements, and external noise further degrade performance, making it difficult to develop BCIs that remain reliable across users without extensive recalibration. This study presents a Compact Convolutional Swin Transformer (CCST) that addresses this issue by combining hierarchical window-based self-attention with convolutional feature extraction to efficiently capture both local electrode interactions and global temporal dependencies. This multi-scale feature representation enhances generalization across subjects, a critical factor for real-world BCI deployment. We evaluated CCST on the BCI Competition IV (2a, 2b) and PhysioNet MI datasets using Leave-One-Subject-Out (LOSO) cross-validation, achieving state-of-the-art classification accuracies of 68.27%, 76.61%, and 71.70%, respectively. Statistical analysis using the Wilcoxon signed-rank test with Bonferroni correction confirms significant performance improvements over benchmark models. Additionally, CCST achieves a reduction in parameters and a decrease in FLOPs compared to full self-attention models, making it more efficient for real-time BCI applications. These results establish CCST as a scalable and efficient framework for adaptive subject-independent BCIs, with promising applications in neurorehabilitation, assistive technology, and cognitive training.
Keywords: Brain computer interface, Motor imagery, Deep learning, EEG, Swin Transformer
Subject terms: Neuroscience, Medical research, Biomedical engineering
Introduction
Brain-Computer Interfaces (BCIs) allow direct communication between the brain and external devices1. Among them, motor imagery (MI)-based BCIs play a crucial role in assistive technologies and neurorehabilitation2,3. Electroencephalography (EEG), with its high temporal resolution, non-invasive nature, and portability, is widely used in BCI research, particularly for motor imagery tasks4,5. However, a critical challenge persists: ensuring generalization across subjects. Due to inter-subject variability, most existing methods require subject-specific calibration, which limits their real-world applicability6,7. The inherent variability in EEG signals across individuals makes designing robust models for EEG-based BCIs particularly challenging6,8.
Deep learning architectures, including Convolutional Neural Networks (CNNs), have shown promising results in EEG-based BCI applications; however, CNNs face challenges in capturing long-range dependencies6,9–11. Transformers have recently gained attention in EEG-based BCIs because of their ability to capture long-range dependencies, making them highly effective for analyzing complex neural signals12–14. Transformers provide unique functionalities in MI-BCI research, as highlighted by Abibullaev et al.12. Their capacity for automated feature extraction reduces the dependency on manual feature engineering and simplifies preprocessing workflows15–18. The multi-head attention mechanism in transformers enhances robustness to noise by selectively focusing on prominent EEG segments and suppressing artifacts that impair signal interpretation16,17,19,20. Additionally, transformers excel at modeling temporal dependencies in EEG sequences, outperforming conventional CNNs and RNNs in capturing MI-related dynamics16,19,21–25. Hybrid architectures, such as CNN-Transformer fusions and gated transformer variants, further emphasize the flexibility of transformer-based approaches in capturing spatial and temporal EEG dynamics15,17,21,23,25–27. These functional strengths make transformers adaptable tools for advancing MI-EEG decoding, particularly in scenarios requiring noise resilience, temporal precision, and cross-domain adaptability. Other CNN-based deep learning architectures have also been explored for EEG-based MI classification, such as TCN, TIDNet, DeepConvNet, EEGNet, ShallowConvNet, and EEGInception28–32. Some approaches use dilated causal convolutions to capture long-range temporal dynamics, others combine specialized convolutions with normalization techniques to improve generalization across participants, and yet others extract hierarchical features or use hybrid models to provide frequency-specific information33–35.
However, these approaches are sensitive to differences between subjects, require substantial computational resources and large datasets to avoid overfitting, and show lower accuracy in subject-independent applications. Therefore, some studies have attempted to bridge the gap by integrating CNNs with attention mechanisms21,28,36,37. These hybrid approaches have improved performance but still face challenges in subject-independent generalization, and full self-attention transformers also introduce high computational overhead, limiting real-time usability6. Recent deep learning models such as EEGNet and EEGConformer have shown strong results in subject-dependent scenarios9,21. EEGNet, a lightweight convolutional neural network, is well suited for extracting spatial and temporal features from EEG data and provides excellent performance when trained and tested on data from the same subject9. Although EEGNet has demonstrated strong generalization capabilities in cross-subject setups, its performance can still be influenced by inter-subject variability in brain activity, which remains a key challenge in subject-independent settings38,39. Similarly, EEGConformer, which uses the transformer architecture along with convolutional layers, has demonstrated strong performance in subject-dependent tasks but still struggles to generalize across subjects due to large inter-individual differences in EEG signals21. Additionally, although EEGInception effectively captures multi-scale features and shows strong performance in ERP-based BCI applications, it did not show promising results on MI datasets, likely due to its design focus on ERP paradigms and its relatively high computational complexity when applied to MI tasks32. Despite advances in EEG-based deep learning models, a key gap remains: efficiently capturing local (short-range) and global (long-range) dependencies in EEG signals while ensuring robustness to cross-subject variability40.
Conventional CNN-based BCI models excel at local feature extraction but struggle with long-range temporal dependencies, limiting cross-subject adaptability9,28–31. On the other hand, Transformer-based models utilize global attention but often fail to retain critical localized spatial features due to their full self-attention mechanism21,41–43. Additionally, naive application of Transformers to high-dimensional EEG data can be computationally expensive6,44. These issues emphasize the need for a hybrid approach that balances localized and global feature learning while maintaining computational efficiency.
To address these limitations, we introduce the Compact Convolutional Swin Transformer (CCST), a novel hybrid architecture that fuses convolutional layers for efficient local spatio-temporal feature extraction with a customized Swin Transformer backbone for hierarchical global modeling of EEG representations. Unlike simple CNN + Transformer combinations, CCST's contributions include a compact design that minimizes parameter overhead through optimized window partitioning and shifting mechanisms, enabling progressive capture of multi-scale dependencies while preserving spatial correlations among EEG electrodes. This customized integration not only enhances robustness against limited training data but also prioritizes subject-independent generalization, outperforming prior models in cross-individual adaptability. In contrast to standard transformers that rely on computationally intensive global self-attention45, CCST utilizes window-based hierarchical attention with shifting to achieve a better balance of efficiency, interpretability, and performance46. Figure 1 illustrates the fundamental differences between traditional CNN-based feature extraction and CCST's Swin-inspired hierarchical self-attention.
Fig. 1.
Comparison of CNN and Swin Transformer for EEG-based BCIs. CNNs capture local features, while Swin Transformers use hierarchical self-attention for better global context integration.
Through these improvements, CCST significantly enhances cross-subject adaptability in motor imagery EEG classification by using hierarchical window-based attention with shifting and merging of windows at different scales, thereby balancing the extraction of local and global features. This design allows CCST to maintain robust cross-subject generalization with improved computational efficiency, making it well suited to real-time BCI applications. We compared CCST with existing architectures to show its superior generalizability and effectiveness in subject-independent scenarios. We showed that CCST not only achieves higher accuracy in cross-subject motor imagery classification but also maintains efficiency, thereby advancing the field of subject-independent EEG-based BCIs.
Related work
The advent of deep learning (DL) has shifted the paradigm toward end-to-end models that automatically learn hierarchical features from raw EEG signals, reducing reliance on manual preprocessing. Convolutional neural networks (CNN) have been particularly prominent due to their ability to capture local spatiotemporal patterns. Single-scale CNN use compact architectures with depthwise and separable convolutions to extract temporal and spatial features, achieving competitive accuracies9,30. However, these models use fixed convolution scales, limiting their adaptability to the multi-resolution nature of EEG signals and individual differences among subjects.
To address these limitations, multi-scale CNNs (MSCNNs) incorporate parallel branches with varying kernel sizes to capture features at different temporal resolutions. For instance, Dai et al. proposed HS-CNN, a hybrid-scale CNN that combines multiple convolution scales to handle subject-specific variations in EEG MI classification47. Evaluated on the BCI Competition IV 2a and 2b datasets, HS-CNN achieved average accuracies of 91.57% and 87.6% in subject-dependent settings, outperforming single-scale baselines by adapting to diverse spatiotemporal dependencies. Similarly, other MSCNN variants, such as MCNN, fuse features from multi-layer streams with different depths, enhancing robustness but potentially increasing computational complexity without fully modeling long-range dependencies48. More recently, Transformer-based models have emerged to overcome the limited receptive fields of CNNs by utilizing self-attention mechanisms for global dependency modeling45. Pure Transformers, inspired by their success in natural language processing and vision tasks, treat EEG time series as sequences to capture long-distance correlations21. However, they often underperform in EEG decoding due to inadequate local feature extraction and high data requirements. Hybrid CNN-Transformer designs address this issue by integrating the local feature advantages of CNNs with the global attention capabilities of Transformers.
Zhao et al. introduced TCANet, a temporal convolutional attention network that hierarchically integrates a multi-scale convolutional module (MSCM) for local spatiotemporal features, a temporal convolutional module (TCM) for fusion and compression, and a stacked multi-head self-attention (MHSA) module for global refinement49. Tested on BCI IV 2a and 2b datasets, TCANet achieved subject-dependent accuracies of 83.06% and 88.52%, with kappa values of 0.7742 and 0.7703, respectively, demonstrating superior performance in both subject-dependent and subject-independent scenarios. Similarly, Zhao et al. proposed CTNet, which utilizes a CNN module analogous to EEGNet for initial local and spatial feature extraction, followed by a Transformer encoder with MHSA to identify global dependencies50. CTNet achieved 82.52% and 88.49% accuracies on BCI IV 2a and 2b in subject-specific evaluations, and 58.64% and 76.27% in cross-subject settings, highlighting its generalization potential.
Building on this, Zhao et al. developed MSCFormer, a multi-scale convolutional Transformer that uses parallel CNN branches for multi-resolution feature extraction and a Transformer encoder for global integration51. With data augmentation via segmentation and reconstruction, MSCFormer attained 82.95% and 88.00% accuracies on BCI IV 2a and 2b, with kappa values of 0.7726 and 0.7599, outperforming several state-of-the-art methods in five-fold cross-validation. Deng et al. extended hybrid approaches to 3D representations with ConSwinFormer, transforming EEG signals into a three-dimensional structure (time, electrode plane, and frequency bands) and utilizing CNN for local features followed by a Swin Transformer for global extraction52. This method achieved 83.99% accuracy on BCI IV 2a, emphasizing the benefits of window-based attention for high-dimensional, low-SNR EEG data.
Despite these advances, existing hybrid models often face challenges in balancing local-global feature fusion, handling limited training data, and ensuring interpretability. Moreover, while multi-scale designs improve adaptability, they may overlook spatial correlations in electrode distributions or require excessive parameters. Our work addresses these gaps by proposing a novel framework that integrates CNN for local feature extraction and Swin Transformer for global feature modeling, achieving superior accuracy and robustness on benchmark datasets. Furthermore, unlike many existing approaches that focus on subject-dependent paradigms, our approach specifically emphasizes subject-independent performance to enhance generalizability across diverse individuals.
Methodology
CCST model architecture
The Compact Convolutional Swin Transformer (CCST) is designed to address the challenges of capturing multi-scale spatio-temporal dependencies in EEG data. Unlike standard Transformers, which compute global attention across the entire sequence, or CNNs, which focus on local receptive fields, the Swin-based hierarchical approach learns localized features within smaller windows and progressively captures global context by shifting these windows28,45,46. This design choice also significantly reduces computational cost compared to standard transformer implementations21,45,53. For EEG signals, which are characterized by local rhythmic activity (e.g., alpha, beta) and significant inter-subject variability, this window-based architecture maintains subject-specific EEG signal characteristics while enabling hierarchical feature extraction54. The CCST architecture comprises four key stages: patch embedding for initial feature extraction, positional encoding for temporal ordering, multiple Swin Transformer blocks for hierarchical attention and feature refinement, and a classification head that outputs the final class predictions. The overall architecture of CCST is shown in Fig. 2, which depicts the hierarchical patch embedding, attention mechanism, and classification head that collectively enable efficient feature extraction and learning.
Fig. 2.
Overall architecture of the Compact Convolutional Swin Transformer (CCST) model. The design includes four key stages: patch embedding for feature extraction, positional embedding for encoding temporal structure, multiple Swin Transformer Blocks for hierarchical self-attention, and a Classification Head for final motor imagery EEG decoding.
Table 1 provides a comparison between our proposed CCST model and several previous deep learning models for EEG-based BCIs. The table summarizes each model's architecture, the type of self-attention used, its subject-independent performance, and its computational cost.
Table 1.
Comparison between our proposed model (CCST) and previous models.
| Model | Architecture | Self-attention type | Subject-independent performance | Computational cost |
|---|---|---|---|---|
| TCN28 | Temporal CNN | None | Moderate | Low |
| TIDNet31 | Deep CNN | None | Low | High |
| EEGNet9 | CNN | None | High | Moderate |
| DeepConvNet30 | Deep CNN | None | Low | High |
| Conformer21 | CNN + Transformer | Global | Moderate | High |
| ShallowConvNet30 | Shallow CNN | None | Moderate | Low |
| MSCFormer51 | CNN + Transformer | Yes | High | Low |
| TCANet49 | TCN + Transformer | Yes | High | Moderate |
| CTNet50 | CNN + Transformer | Yes | High | Moderate |
| EEGInception32 | CNN | None | Low | Very high |
| CCST (Proposed) | Swin Transformer | Local Windowed | High | Low |
Patch embedding
The Patch Embedding module processes the raw EEG tensor of shape $(B, C, T)$, where $B$ is the batch size, $C$ is the number of electrodes, and $T$ is the number of time samples. To extract local temporal features, a temporal convolution with a kernel size of $(1, 25)$ and 40 filters is applied:
$$X_t = \sigma(W_t * X + b_t) \tag{1}$$
where $W_t$ is the temporal filter, $b_t$ is the bias term, and $\sigma(\cdot)$ denotes the Exponential Linear Unit (ELU) activation function. Next, a spatial convolution with a kernel size of $(C, 1)$ is applied to capture cross-electrode interactions:
$$X_s = W_s * X_t + b_s \tag{2}$$
where $W_s$ represents the spatial filter and $b_s$ is the bias. Following this, batch normalization and ELU activation are applied to stabilize feature representations. An average pooling layer along the temporal axis (with a fixed kernel size and stride) reduces the temporal dimension, producing 15 tokens per EEG trial while preserving the electrode-wise features. Finally, a $1 \times 1$ convolutional projection maps the feature space into a 64-dimensional embedding:
$$X_p = W_p * X_{\text{pool}} + b_p \tag{3}$$
where $X_{\text{pool}}$ is the pooled feature map, $W_p$ is the projection weight, and $b_p$ is the bias term. The output is then reshaped to $(B, 15, 64)$, forming 15 tokens per EEG trial. These tokens serve as compact, expressive representations, preventing self-attention layers from being saturated by high temporal resolution. Notably, recent foundation models such as LaBraM have shown that complete EEG trials can be effectively represented in a 64-dimensional latent space, aligning with our use of compact embeddings while differing in the underlying pretraining methodology55.
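The way the token count follows from the pooling stage can be illustrated with the standard output-length formula for a strided window. The sketch below is illustrative only: the function names are hypothetical, and the trial length, pooling kernel, and pooling stride in the usage note are example values chosen for the arithmetic, not the settings used for CCST. It also assumes that the temporal convolution leaves the time axis unchanged (as listed in Table 5) and that the spatial convolution collapses the electrode axis.

```python
def pooled_length(t_in: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded 1-D pooling window sliding over t_in samples."""
    if t_in < kernel:
        raise ValueError("input is shorter than the pooling kernel")
    return (t_in - kernel) // stride + 1


def patch_embedding_shapes(batch, channels, t_in, pool_kernel, pool_stride,
                           n_filters=40, embed_dim=64):
    """Trace tensor shapes through the patch-embedding stages described above."""
    shapes = [("input", (batch, 1, channels, t_in))]
    # Temporal convolution (1, 25) with 40 filters; time axis assumed unchanged.
    shapes.append(("temporal conv", (batch, n_filters, channels, t_in)))
    # Spatial convolution (C, 1); assumed to collapse the electrode axis to 1.
    shapes.append(("spatial conv", (batch, n_filters, 1, t_in)))
    # Average pooling along time sets the number of tokens.
    n_tokens = pooled_length(t_in, pool_kernel, pool_stride)
    shapes.append(("average pool", (batch, n_filters, 1, n_tokens)))
    # 1x1 projection to the embedding dimension, then reshape to (B, N, D).
    shapes.append(("tokens", (batch, n_tokens, embed_dim)))
    return shapes
```

For example, `patch_embedding_shapes(8, 22, 1000, 75, 15)` traces a hypothetical 1000-sample, 22-channel trial; different pooling settings change only the token axis.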
Positional embedding
To capture the inherent temporal ordering within EEG, the model adds element-wise positional embeddings to the patch embeddings. We explore three strategies: learnable embeddings, fixed sinusoidal embeddings, and none56. Fixed sinusoidal embeddings have shown effectiveness in many transformer-based tasks and provide a global frequency-based position encoding57. However, EEG signals often have subject- and trial-specific timing variations that may not align neatly with a purely sinusoidal pattern58. Learnable embeddings, initialized with a truncated normal distribution, can adaptively encode these variations, potentially providing better alignment with the subtle temporal dynamics of motor imagery across subjects59–61. If no positional embedding is used, the model relies solely on attention mechanisms to identify temporal dependencies, which can be suboptimal when exact sequence ordering carries crucial information about brain oscillations. We adopt learnable embeddings initialized with a truncated normal distribution, as fixed sinusoidal embeddings do not effectively capture the variable timing of motor imagery onset across subjects. Given an input patch embedding sequence $X_p$, where $B$ is the batch size, $N$ is the sequence length, and $D$ is the embedding dimension:
$$X_p \in \mathbb{R}^{B \times N \times D} \tag{4}$$
the learnable positional embeddings are defined as a parameter:
$$E_{\text{pos}} \in \mathbb{R}^{1 \times N \times D} \tag{5}$$
and the final input to the transformer is computed by an element-wise addition:
$$Z_0 = X_p + E_{\text{pos}} \tag{6}$$
Empirical results show that learnable embeddings improve CCST classification accuracy by approximately 1.2% over fixed embeddings.
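The two non-trivial positional strategies discussed above can be contrasted with a small standard-library sketch. It is a minimal illustration, not the CCST implementation: the function names are hypothetical, the sinusoidal table follows the widely used sine/cosine formulation, and the truncated-normal standard deviation (0.02) and truncation bound (two standard deviations) are assumed values that the text does not specify. In a real model the second table would be a trainable parameter updated by backpropagation.

```python
import math
import random


def sinusoidal_embedding(n_tokens, dim):
    """Fixed position table: sine on even feature indices, cosine on odd ones."""
    table = []
    for pos in range(n_tokens):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def trunc_normal_embedding(n_tokens, dim, std=0.02, bound=2.0, seed=0):
    """Initial values for a learnable table, drawn from N(0, std^2) and
    truncated to [-bound*std, +bound*std] (std and bound are assumptions)."""
    rng = random.Random(seed)

    def sample():
        while True:  # rejection sampling keeps only in-range draws
            value = rng.gauss(0.0, std)
            if abs(value) <= bound * std:
                return value

    return [[sample() for _ in range(dim)] for _ in range(n_tokens)]


def add_positions(tokens, pos_table):
    """Element-wise addition of one trial's tokens (N x D) and the position table."""
    return [[t + p for t, p in zip(token_row, pos_row)]
            for token_row, pos_row in zip(tokens, pos_table)]
```

With 15 tokens and a 64-dimensional embedding, both tables have shape 15 x 64 and are broadcast over the batch dimension when added to the tokens.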
Transformer blocks
The core of CCST consists of Swin Transformer-inspired blocks, which adopt a window-based self-attention mechanism. Instead of attending across all 15 tokens, which leads to $O(N^2)$ complexity, the sequence is divided into non-overlapping windows, reducing the complexity to $O(N \cdot W)$. The window size $W$ is set to 4 tokens to balance capturing local temporal EEG dynamics with computational efficiency.
$$X_w = \mathrm{WindowPartition}(Z) \in \mathbb{R}^{(B \cdot N_w) \times W \times D} \tag{7}$$
where $Z \in \mathbb{R}^{B \times N \times D}$ is the input token sequence of the block and $N_w$ is the number of windows. Experiments showed that increasing the window size to $W = 8$ led to over-smoothing, while reducing it to $W = 2$ failed to capture important spatial context. These choices were empirically validated through ablation studies62,63. Setting $W = 4$ gave the best balance between local EEG dynamics and computational efficiency. This step is particularly suited to EEG signals, where local temporal windows can capture short-lived yet informative events such as bursts in alpha or beta rhythms. Within each window, the self-attention mechanism with a residual connection is applied:
$$\hat{X}_w = X_w + \mathrm{MSA}(\mathrm{LN}(X_w)) \tag{8}$$
where the attention operation is defined as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{9}$$
with $Q$, $K$, and $V$ obtained via linear projections of $\mathrm{LN}(X_w)$ and $d_k$ the dimension of each attention head. To ensure coverage of the entire sequence, cyclic shifting is applied every other block, shifting the sequence by half the window size (2 tokens) before partitioning:
$$Z_{\text{shift}} = \mathrm{Roll}(Z, -W/2) \tag{10}$$
As a result, tokens at the boundary of one window appear in an adjacent window in the next block, allowing the model to integrate contextual information across segments. Each block contains multi-head self-attention with 4 heads, two layer normalization steps (one before self-attention and another before the MLP), a feedforward network (hidden dimension 128, GELU activation), and a 10% dropout after both self-attention and the MLP layer. Residual connections are included to promote stable gradient flow and preserve information from earlier layers.
$$X'_w = \hat{X}_w + \mathrm{MLP}(\mathrm{LN}(\hat{X}_w)) \tag{11}$$
where the MLP consists of two linear layers with a GELU activation and dropout between them. Stacking three Swin Transformer blocks provides a hierarchical scheme in which lower blocks learn localized EEG features and higher blocks incorporate progressively broader contextual information, improving robustness to inter-subject variability. The windows are then merged back to reconstruct the full sequence:
$$Z_{\text{merged}} = \mathrm{WindowMerge}(X'_w) \in \mathbb{R}^{B \times N \times D} \tag{12}$$
and if a cyclic shift was applied initially, the inverse shift is performed:
$$Z_{\text{out}} = \mathrm{Roll}(Z_{\text{merged}}, +W/2) \tag{13}$$
Thus, the overall operation of the transformer block can be summarized as:
$$Z_{\text{out}} = \mathrm{Roll}^{-1}\big(\mathrm{WindowMerge}\big(\mathrm{Block}\big(\mathrm{WindowPartition}(\mathrm{Roll}(Z))\big)\big)\big) \tag{14}$$
where $\mathrm{Block}$ denotes the attention and MLP sub-layers of Eqs. (8) and (11), and the roll operations reduce to the identity in non-shifted blocks.
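The interplay of cyclic shifting, window partitioning, merging, and the inverse shift described above can be traced on plain token indices. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the shift direction (left by half a window before partitioning) is assumed rather than taken from the text, and a 16-token sequence is used for simplicity because the text does not state how 15 tokens are divided into windows of 4. The per-window function `fn` stands in for the attention and MLP sub-layers.

```python
def roll(seq, shift):
    """Cyclic shift along the token axis; a positive shift moves tokens right."""
    n = len(seq)
    shift %= n
    return seq[-shift:] + seq[:-shift] if shift else list(seq)


def window_partition(seq, window):
    """Split a sequence into consecutive non-overlapping windows."""
    if len(seq) % window:
        raise ValueError("sequence length must be a multiple of the window size")
    return [seq[i:i + window] for i in range(0, len(seq), window)]


def window_merge(windows):
    """Concatenate windows back into one sequence."""
    return [token for window in windows for token in window]


def windowed_pass(seq, fn, window=4, shifted=False):
    """Apply fn to each window; shifted blocks roll by half a window first
    and undo the roll afterwards so token order is restored."""
    shift = window // 2 if shifted else 0
    rolled = roll(seq, -shift)
    processed = [fn(w) for w in window_partition(rolled, window)]
    return roll(window_merge(processed), shift)
```

With tokens `0..15`, a regular block sees windows starting `[0, 1, 2, 3]`, while a shifted block sees `[2, 3, 4, 5]` first and `[14, 15, 0, 1]` last, so tokens 3 and 4 (separated in the regular block) now share a window. The last shifted window wraps around the sequence ends; whether CCST masks attention across that wrap is not stated in the text.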
Classification head
Following the final Swin Transformer block, the output sequence (15 tokens) is averaged to produce a single feature vector per trial. This global average pooling step combines the learned local contexts from the individual windows, generating a compact representation of the entire EEG trial. A dropout layer is then applied to reduce overfitting. The dropout rate was selected through an empirical study across different values (0.1, 0.3, 0.5)64–66. A dropout rate of 0.3 achieved the best trade-off between regularization and classification accuracy, reducing overfitting while preserving informative features. Empirical tests showed that higher dropout rates ($> 0.3$) degraded performance, while lower dropout rates ($< 0.3$) resulted in overfitting67. A final fully connected layer maps the aggregated information into a binary output space relevant to the motor imagery task (e.g., left vs. right hand movement):
$$\hat{y} = W_{fc}\,\mathrm{Dropout}\left(\frac{1}{N}\sum_{i=1}^{N} z_i\right) + b_{fc} \tag{15}$$
where $z_i$ are the token embeddings from the final Swin Transformer block (with $N = 15$ tokens); the average pooling aggregates these embeddings into a single feature vector; $\mathrm{Dropout}(\cdot)$ applies a dropout with a rate of 0.3; and $W_{fc}$ and $b_{fc}$ are the weights and bias of the final fully connected layer mapping the feature vector to the two-class output space.
By pairing window-based self-attention with learnable positional embeddings, CCST captures both local details and higher-level global structure, making it effective in handling the complex spectral and temporal features found in EEG data.
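The classification head (average pooling, dropout, and a fully connected layer) can be written out for a single trial as follows. This is a minimal sketch, not the authors' code: the function name and argument layout are hypothetical, and dropout is implemented in the "inverted" form (surviving features rescaled by 1/(1 - rate) during training, identity at inference), which is an assumption about the framework convention.

```python
import random


def classification_head(tokens, weight, bias, drop_rate=0.3,
                        training=False, rng=None):
    """tokens: N x D token embeddings for one trial.
    weight: n_classes x D matrix, bias: n_classes vector.
    Returns one logit per class."""
    n_tokens = len(tokens)
    dim = len(tokens[0])
    # Global average pooling over the token axis.
    pooled = [sum(tok[j] for tok in tokens) / n_tokens for j in range(dim)]
    if training:
        rng = rng or random.Random()
        keep = 1.0 - drop_rate
        # Inverted dropout (assumed convention): zero a feature with
        # probability drop_rate and rescale the survivors.
        pooled = [v / keep if rng.random() < keep else 0.0 for v in pooled]
    # Fully connected layer mapping the pooled vector to class logits.
    return [sum(w * v for w, v in zip(row, pooled)) + b
            for row, b in zip(weight, bias)]
```

At inference time (`training=False`) the dropout step is skipped, so the output depends only on the pooled features and the layer weights.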
Model complexity
To evaluate the computational cost of our EEG decoding model and the benchmark models, we measured three key metrics: the number of floating point operations (FLOPs), the total number of trainable parameters, and the average inference time. These metrics provide insight into model efficiency and suitability for real-time applications. FLOPs provide an estimate of the computational workload required to process a single input. A lower FLOP count generally indicates a lighter model that is faster to run. The total number of learnable parameters indicates the memory footprint of the model. Models with fewer parameters are typically more lightweight and may be more resistant to overfitting. Inference speed is measured as the average time taken to process a single input (forward pass). Fast inference is critical for real-time brain-computer interface applications, where low latency is essential. For both parameter count and model summary, we used the default values given in the original papers for each model. For other measurements, we used dummy inputs generated to have the same shape as our EEG data (e.g., a tensor of shape (B, C, T)). This ensures that all models are compared under consistent conditions.
Table 2 shows a complete breakdown of the complexity metrics for each model, including FLOP, parameter count and average inference time per forward pass.
Table 2.
Summary of model complexity metrics.
| Model | FLOPs | Params (k) | Inference time (s) |
|---|---|---|---|
| EEGNet | 3.81 MMac | 1.78 | 0.000659 |
| TCANet | 3.58 MMac | 13.59 | 0.001631 |
| TCN | 8.74 MMac | 25.65 | 0.001991 |
| ShallowConvNet | 50.32 KMac | 37.52 | 0.000608 |
| MSCFormer | 4.35 MMac | 149.71 | 0.004123 |
| CTNet | 3.02 MMac | 149.92 | 0.005147 |
| CCST | 18.94 MMac | 150.96 | 0.002407 |
| TIDNet | 142.87 MMac | 186.74 | 0.001574 |
| DeepConvNet | 7.95 MMac | 204.05 | 0.001093 |
| EEGConformer | 19.33 MMac | 318.47 | 0.007451 |
| EEGInception | 2.86 GMac | 8,870 | 0.006249 |
To further contextualize the efficiency of the CCST model, we analyze its computational complexity in comparison to conventional global self-attention models. Standard transformer architectures compute self-attention across the entire sequence, leading to a complexity of $O(L^2)$, or $O(L^2 d)$ when the feature dimension $d$ is taken into account, where $L$ is the sequence length68,69. In contrast, CCST utilizes a window-based self-attention mechanism that restricts attention to non-overlapping windows of size $W$. This design reduces the computational complexity to $O(L \cdot W \cdot d)$, which scales linearly with the window size. Consequently, as long as $W$ is much smaller than $L$, this approach significantly cuts down on the number of FLOPs and the inference latency. This reduction in complexity is critical for real-time brain-computer interface applications, where fast processing is essential.
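The two complexity expressions can be compared numerically for the token and window sizes reported in this paper. The sketch below counts only the multiply-accumulates of the attention score matrix (the $QK^{\top}$ product), ignoring projections and the MLP; the function names are hypothetical, and rounding the number of windows up when the sequence length is not a multiple of the window size is an assumption.

```python
def global_attention_cost(seq_len, dim):
    """Score-matrix multiply-accumulates for full self-attention: L * L * d."""
    return seq_len * seq_len * dim


def windowed_attention_cost(seq_len, window, dim):
    """Score-matrix multiply-accumulates when attention is restricted to
    non-overlapping windows: (number of windows) * W * W * d."""
    n_windows = -(-seq_len // window)  # ceiling division (assumed padding)
    return n_windows * window * window * dim


full = global_attention_cost(15, 64)         # 15 tokens, 64-dim embedding
local = windowed_attention_cost(15, 4, 64)   # window size W = 4
print(full, local, round(full / local, 2))
```

The gap widens as the sequence grows: the global cost is quadratic in the number of tokens, while the windowed cost grows linearly for a fixed window size.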
Experimental setup
Datasets
We used three widely recognized open-source EEG datasets for motor imagery (MI) classification which includes BCI Competition IV (2a and 2b)70 and the PhysioNet EEG Motor Imagery dataset71. These datasets provide standardized benchmarks for evaluating EEG based BCI models and facilitate assessment of subject-independent generalization.
BCI Competition IV 2a consists of EEG recordings from nine subjects performing MI tasks. Recordings were obtained using 22 EEG channels and 3 EOG channels positioned according to the international 10–20 system.
BCI Competition IV 2b also includes EEG data from nine subjects performing MI tasks. Recordings were made with only three EEG electrodes.
PhysioNet EEG Motor Imagery dataset contains EEG recordings from 109 subjects performing motor tasks. Data were recorded using a standard 64-channel EEG cap. The large subject pool allows evaluation of CCST on a more diverse population, reflecting real-world variability in EEG patterns.
These datasets were used to evaluate the generalization capability of the proposed CCST model under a subject-independent (leave-one-subject-out) cross-validation setting. Table 3 summarizes the key characteristics of the three datasets used in this study.
Table 3.
Summary of the BCI competition IV 2a, 2b, and PhysioNet MI dataset for motor imagery classification.
| Dataset | Subjects | Channels | Trials | Sessions | Labels |
|---|---|---|---|---|---|
| BCI IV 2a | 9 | 22 | 5184 | 2 | LH, RH, TG, BF |
| BCI IV 2b | 9 | 3 | 6520 | 5 | LH, RH |
| PhysioNet MI | 109 | 64 | 4683 | 1 | LH, RH |
Preprocessing
A finite impulse response (FIR) band-pass filter (7–30 Hz) was applied to extract the mu (7–12 Hz) and beta (13–30 Hz) frequency bands. These frequency ranges are well-known for capturing sensorimotor rhythms linked to motor imagery tasks allowing us to focus on neural activities most relevant to classification72. The band-pass filter also suppresses line noise and removes very low-frequency drifts thereby retaining cleaner and more meaningful EEG data.
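A 7–30 Hz FIR band-pass of the kind described above can be sketched with a windowed-sinc design. This is an illustration under stated assumptions rather than the filter used in the study: the design method (difference of two Hamming-windowed low-pass filters), the number of taps (101), and the 250 Hz sampling rate in the usage note are all assumed, since the text specifies only the filter type and pass band. The function names are hypothetical.

```python
import cmath
import math


def sinc_lowpass(cutoff_hz, fs, numtaps):
    """Hamming-windowed ideal low-pass filter (numtaps should be odd)."""
    mid = (numtaps - 1) / 2
    fc = cutoff_hz / fs  # cutoff in cycles per sample
    taps = []
    for n in range(numtaps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        hamming = 0.54 - 0.46 * math.cos(2 * math.pi * n / (numtaps - 1))
        taps.append(ideal * hamming)
    return taps


def bandpass_taps(low_hz, high_hz, fs, numtaps=101):
    """Band-pass as the difference of two low-pass filters with equal length."""
    upper = sinc_lowpass(high_hz, fs, numtaps)
    lower = sinc_lowpass(low_hz, fs, numtaps)
    return [u - l for u, l in zip(upper, lower)]


def gain(taps, freq_hz, fs):
    """Magnitude of the filter's frequency response at freq_hz."""
    omega = 2 * math.pi * freq_hz / fs
    return abs(sum(h * cmath.exp(-1j * omega * n) for n, h in enumerate(taps)))
```

For instance, `taps = bandpass_taps(7, 30, 250)` gives a filter whose gain can be inspected with `gain(taps, 18, 250)` (inside the pass band) or `gain(taps, 50, 250)` (line-noise frequency, outside it).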
Following the filtering step, each trial was brought to a fixed size of $E \times T$, where $E$ is the number of channels and $T$ is the number of time samples. For trials shorter than the required length, zero-padding was applied, and for longer trials, the excess data was truncated. This procedure yields a uniform input dimension across all trials without discarding the critical time points that often capture the onset of motor imagery. Event markers were then remapped to two classes (0 / 1) to match our binary classification objective.
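The padding and truncation rule above amounts to a few lines of code. In this minimal sketch the function name is hypothetical, and both the zero-padding and the truncation are applied at the end of the trial, which is an assumption: the text does not say which end is modified.

```python
def fix_length(trial, target_len):
    """trial: list of channels, each a list of samples.
    Returns a copy in which every channel has exactly target_len samples."""
    fixed = []
    for channel in trial:
        if len(channel) >= target_len:
            # Longer trials: drop the excess samples (assumed: from the end).
            fixed.append(list(channel[:target_len]))
        else:
            # Shorter trials: append zeros (assumed: at the end).
            fixed.append(list(channel) + [0.0] * (target_len - len(channel)))
    return fixed
```

Keeping the start of the trial intact is consistent with the stated aim of retaining the time points around motor imagery onset.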
Prior to model training, we applied channel-wise z-score normalization using statistics (mean and standard deviation) computed from the training set. By restricting these statistics to the training data, we avoided potential data leakage into validation or test folds. To further enhance robustness and reduce overfitting, several data augmentations were incorporated during training. Specifically, temporal shifts of ±10 samples (±40 ms) were used to accommodate natural timing variations in motor imagery onset. We also introduced Gaussian noise with a standard deviation of 0.01 and 5% electrode dropout to simulate the variability or sensor loss that may occur in real-world EEG recordings.
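The leakage-free normalization and the three augmentations can be sketched with the standard library as follows. The helper names are hypothetical, and several details are assumptions not given in the text: statistics are pooled over all training trials and time samples per channel, one temporal shift is drawn per trial and applied to all channels, vacated samples are filled with zeros, and a dropped electrode is replaced by zeros.

```python
import math
import random


def channel_stats(train_trials):
    """Per-channel (mean, std) computed from training trials only."""
    stats = []
    for c in range(len(train_trials[0])):
        values = [v for trial in train_trials for v in trial[c]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, math.sqrt(var) or 1.0))  # avoid division by zero
    return stats


def zscore(trial, stats):
    """Normalize any trial (train, validation, or test) with training statistics."""
    return [[(v - m) / s for v in ch] for ch, (m, s) in zip(trial, stats)]


def augment(trial, rng, max_shift=10, noise_std=0.01, drop_prob=0.05):
    """Temporal shift, additive Gaussian noise, and electrode dropout."""
    shift = rng.randint(-max_shift, max_shift)  # same shift for all channels
    out = []
    for ch in trial:
        if rng.random() < drop_prob:
            out.append([0.0] * len(ch))  # simulated sensor loss
            continue
        if shift > 0:
            shifted = [0.0] * shift + list(ch[:len(ch) - shift])
        elif shift < 0:
            shifted = list(ch[-shift:]) + [0.0] * (-shift)
        else:
            shifted = list(ch)
        out.append([v + rng.gauss(0.0, noise_std) for v in shifted])
    return out
```

In a training loop, `channel_stats` would be called once per fold on the training split, and `augment` only on training trials.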
Hyperparameter selection and model configuration
To ensure strong performance while maintaining computational efficiency, we conducted a small-scale grid search over key hyperparameters. Hyperparameters were chosen to balance model complexity against generalization. The 40-dimensional patch embedding was selected to capture sufficient local spatio-temporal detail while minimizing model complexity and parameter count. This embedding was subsequently linearly projected to a 64-dimensional transformer space to enhance representation capacity within the Swin-1D encoder. The configuration of 3 transformer layers with 4 attention heads was chosen to effectively model the complex interdependencies in the input data while maintaining computational efficiency. The 128-node MLP was used to provide sufficient capacity for nonlinear feature transformation, and a dropout rate of 0.3 was empirically determined to provide the best balance between regularization and performance. The window size and shift size in the Swin Transformer blocks were determined by evaluating model performance on a held-out validation set. To ensure reproducibility, all training configurations were standardized across experiments. We used Leave-One-Subject-Out (LOSO) cross-validation to assess the generalization of the Compact Convolutional Swin Transformer (CCST), as listed in Algorithm 1.
Algorithm 1.
Training procedure of CCST.
LOSO cross-validation provides a robust measure of subject-independent performance by training on all but one subject and testing on the unseen subject43. This setup closely simulates real-world BCI scenarios where a model must generalize to new users without recalibration. In each fold, the model was trained on all but one subject, with 10% of the training data reserved for validation to enable early stopping. The best-performing model from each fold was then tested on the unseen subject. Experiments were conducted using Kaggle GPUs (T4 x2), which provided the necessary computational resources for efficient training. A batch size of 72 was used to balance memory usage and convergence speed, and the model was trained for up to 100 epochs per fold. Early stopping with a patience of 10 epochs was used to prevent overfitting and ensure model convergence.
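The fold construction and the early-stopping rule described above can be sketched as follows. The names `loso_folds` and `EarlyStopping` are hypothetical, and the validation split is drawn at random from the pooled training trials, which is an assumption: the text states only that 10% of the training data was held out.

```python
import random


def loso_folds(subject_ids, val_fraction=0.1, seed=0):
    """Yield (held_out_subject, train_idx, val_idx, test_idx) per fold.
    subject_ids[i] is the subject that trial i belongs to."""
    rng = random.Random(seed)
    for held_out in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        rest = [i for i, s in enumerate(subject_ids) if s != held_out]
        rng.shuffle(rest)
        n_val = max(1, int(round(val_fraction * len(rest))))
        yield held_out, rest[n_val:], rest[:n_val], test_idx


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Each fold would create a fresh model and a fresh `EarlyStopping(patience=10)`, train on `train_idx`, monitor `val_idx`, and report accuracy on `test_idx` only once.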
The final CCST configuration included a 40-dimensional patch embedding, linearly projected to a 64-dimensional transformer embedding, 3 transformer layers, 4 attention heads, a 128 node MLP, and a 0.3 dropout rate. Learnable positional embeddings and data augmentation techniques were incorporated to improve generalization73. The learning rate was set to
with an adaptive scheduler (ReduceLROnPlateau) adjusting the learning rate based on validation loss. Table 4 summarizes the hyperparameters used for both CCST configurations, including the transformer architecture (number of layers, attention heads, MLP size), embedding dimensions, and training settings (dropout, learning rate).
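The idea behind a reduce-on-plateau schedule can be illustrated without any deep learning library. The class below is a simplified stand-in written for this example, not PyTorch's `ReduceLROnPlateau` itself (it omits thresholds and cooldown). The initial learning rate, `factor`, and `patience` are placeholder values; the values actually used are not recoverable from the text.

```python
class PlateauScheduler:
    """Simplified reduce-on-plateau rule: shrink the learning rate when the
    validation loss has not improved for more than `patience` epochs."""

    def __init__(self, lr, factor=0.5, patience=5, min_lr=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # Plateau detected: scale the learning rate and restart the count.
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr


# Placeholder settings, chosen only to make the behavior visible.
scheduler = PlateauScheduler(lr=1e-3, factor=0.5, patience=2)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9]]
```

In this toy run the loss stalls at 0.9 for three consecutive epochs, so the learning rate is halved on the last step.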
Table 4.
Hyperparameter configurations for CCST.
| Dataset | Kernel | Emb. Dim. | Trans. Dim. | Layers | Heads | MLP | Dropout (FC) | LR |
|---|---|---|---|---|---|---|---|---|
| BCI IV 2a | (1,25) / (22,1) | 40 | 64 | 3 | 4 | 128 | 0.3 | |
| BCI IV 2b | (1,25) / (3,1) | 40 | 64 | 3 | 4 | 128 | 0.3 | |
In addition to the overall CCST configuration described above, Table 5 provides a detailed summary of the operations within the feature extractor module. This table outlines the sequential processing of the raw EEG data denoted by C (channels) and T (time samples) starting from the initial temporal and spatial convolutions followed by normalization, non-linear activation, and pooling. The feature maps are then reshaped and projected into a higher-dimensional space suitable for subsequent transformer processing. This concise overview highlights the design choices that enable effective extraction of spatio temporal representations while maintaining computational efficiency.
Table 5.
Detailed parameters and operations within the CCST feature extractor module.
| Operation | Input shape | Parameters | Output shape |
|---|---|---|---|
| Temporal convolution | (B, 1, C, T) | Conv2d: kernel_size = (1,25), stride = (1,1) | (B, 40, C, T) |
| Spatial convolution | (B, 40, C, T) | Conv2d: kernel_size = (C,1), stride = (1,1) | (B, 40, 1, T) |
| Batch normalization | (B, 40, 1, T) | BatchNorm2d: num_features = 40 | (B, 40, 1, T) |
| Activation (ELU) | (B, 40, 1, T) | ELU activation | (B, 40, 1, T) |
| Average pooling | (B, 40, 1, T) | AvgPool2d: kernel_size = (1,75), stride = (1,15) | |
| Dropout | | Dropout: p = 0.5 | |
| Projection convolution | | Conv2d: in_channels = 40, out_channels = 40, kernel_size = (1,1) | |
| Rearrangement | | Rearrange: 'b e (ht) (w) -> b (ht w) e' | |
| Embedding projection | | Linear: in_features = 40, out_features = 64 | |
| Positional encoding | | {learnable, sine, none}, shape = (1, 15, 64) | |
| Transformer | | Multiple layers: num_layers = 3, num_heads = 4, mlp_hidden = 128, window_size = 4 | (B, 64) |
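The shape bookkeeping in Table 5 can be traced with simple arithmetic. The sketch below assumes that the temporal convolution preserves the time dimension, as the table lists, and that the average pooling is unpadded with floor rounding. The helper names and the example values (B = 72, C = 22, T = 285) are illustrative; T = 285 is chosen only because it yields the 15 tokens implied by the positional-encoding shape (1, 15, 64).

```python
def pooled_length(t, kernel=75, stride=15):
    """Output length of an unpadded pooling window with floor rounding."""
    return (t - kernel) // stride + 1


def trace_shapes(b, c, t, emb=40, dim=64):
    """List the tensor shape after each stage of the feature extractor."""
    n = pooled_length(t)
    return [
        ("input", (b, 1, c, t)),
        ("temporal convolution", (b, emb, c, t)),   # time length kept, as in Table 5
        ("spatial convolution", (b, emb, 1, t)),    # kernel (C, 1) collapses channels
        ("average pooling", (b, emb, 1, n)),
        ("rearrangement", (b, n, emb)),             # token sequence of length n
        ("embedding projection", (b, n, dim)),
        ("transformer output", (b, dim)),
    ]


# Hypothetical input size used only for illustration.
shapes = trace_shapes(b=72, c=22, t=285)
```

Under these assumptions each trial becomes a sequence of 15 tokens of width 64 before entering the Swin-1D encoder.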
By structuring preprocessing around sensorimotor frequencies, enforcing consistent trial lengths, normalizing features, and applying augmentations, CCST was provided with clean yet representative input data. This facilitated the extraction of meaningful spatio-temporal representations enhancing its ability to differentiate between left and right hand motor imagery across different subjects.
Results
Subject-independent classification performance was evaluated using a leave-one-subject-out cross-validation scheme. In each iteration, one subject was held out for testing while the model was trained on the remaining subjects. The reported performance metrics are based on the aggregated test performance across all subjects. The t-SNE visualization was also applied to the complete dataset including trials from all subjects. This approach allowed us to assess how the learned feature representations from our model are distributed across subjects. By reducing the high dimensional features to two dimensions, the visualization shows clusters corresponding to different classes highlighting both the distinguishing power of the model and the inter-subject variability present in the EEG data. Figure 3a–c shows the t-SNE visualization of the learned feature representations across all subjects on BCI IV 2a, 2b and PhysioNet MI datasets.
Fig. 3.
t-SNE visualization of the learned feature representations across all subjects. Panel (a) shows the BCI IV 2a dataset, panel (b) shows the BCI IV 2b dataset, and panel (c) shows the PhysioNet MI dataset.
Performance on the BCI IV 2a dataset
The proposed CCST model was first evaluated on the BCI IV 2a dataset and compared against established benchmark models. A leave-one-subject-out cross-validation was performed across nine subjects, and the resulting subject-independent test accuracies were recorded.
The results in Table 6 demonstrate substantial inter-subject variability across all models. CCST achieved accuracies ranging from 29.51% to 81.08% when data augmentation (DA) was used, and from 34.90% to 91.49% without DA. CCST ranked as the top-performing model for most subjects under both settings. It also achieved the highest overall mean accuracy, 54.71% with DA and 68.27% without DA, outperforming all benchmark models by a clear margin.
Table 6.
Comparison of results of CCST and benchmark models on BCI IV 2a dataset.
| DA type | Model | Sub 1 | Sub 2 | Sub 3 | Sub 4 | Sub 5 | Sub 6 | Sub 7 | Sub 8 | Sub 9 | Avg ± SD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| With DA | CCST | 60.94 | 29.51 | 67.88 | 46.53 | 29.69 | 48.61 | 51.91 | 81.08 | 76.22 | 54.71 |
| | EEGConformer | 63.37 | 25.69 | 59.38 | 38.37 | 25.35 | 28.30 | 32.29 | 47.22 | 47.92 | 40.88 |
| | ShallowConvNet | 58.16 | 26.04 | 59.90 | 33.68 | 26.04 | 31.42 | 37.50 | 61.63 | 52.43 | 42.98 |
| | EEGNet | 62.67 | 24.31 | 64.93 | 40.45 | 26.04 | 27.26 | 38.72 | 57.99 | 57.29 | 44.41 |
| | DeepConvNet | 52.26 | 24.13 | 52.43 | 31.77 | 26.56 | 34.20 | 32.47 | 55.38 | 54.51 | 40.41 |
| | TCN | 54.86 | 24.48 | 38.72 | 35.59 | 24.13 | 28.12 | 38.54 | 52.78 | 35.24 | 36.94 |
| | TIDNet | 42.01 | 24.13 | 48.61 | 35.94 | 25.69 | 30.73 | 33.33 | 40.28 | 44.62 | 36.15 |
| | MSCFormer | 59.03 | 23.26 | 57.99 | 36.28 | 25.17 | 24.31 | 33.85 | 55.21 | 43.75 | 39.87 |
| | TCANet | 64.41 | 26.91 | 52.26 | 36.81 | 26.22 | 31.77 | 34.90 | 59.20 | 55.03 | 43.06 |
| | CTNet | 55.38 | 24.83 | 58.33 | 38.37 | 26.22 | 27.95 | 37.67 | 63.19 | 53.99 | 42.88 |
| | EEGInception | 58.51 | 29.34 | 37.67 | 37.50 | 28.12 | 31.94 | 39.24 | 51.91 | 53.12 | 40.82 |
| Without DA | CCST | 66.67 | 38.89 | 86.28 | 61.46 | 34.90 | 65.28 | 79.86 | 91.49 | 89.58 | 68.27 |
| | EEGConformer | 59.55 | 24.48 | 56.42 | 35.76 | 26.22 | 28.82 | 36.63 | 59.38 | 59.72 | 43.00 |
| | ShallowConvNet | 63.72 | 25.87 | 68.92 | 34.72 | 26.56 | 30.21 | 40.80 | 63.72 | 63.37 | 46.43 |
| | EEGNet | 66.32 | 24.83 | 65.80 | 40.97 | 24.65 | 26.56 | 43.92 | 64.58 | 61.81 | 46.60 |
| | DeepConvNet | 69.97 | 25.52 | 67.53 | 30.90 | 27.26 | 34.55 | 37.67 | 58.85 | 57.29 | 45.51 |
| | TCN | 63.72 | 23.61 | 60.07 | 46.18 | 24.31 | 35.76 | 34.72 | 59.20 | 47.40 | 43.89 |
| | TIDNet | 43.23 | 28.65 | 53.12 | 39.24 | 23.78 | 36.63 | 35.42 | 43.75 | 48.61 | 39.16 |
| | MSCFormer | 68.92 | 27.60 | 62.85 | 40.45 | 26.56 | 27.78 | 40.28 | 67.71 | 65.97 | 47.57 |
| | TCANet | 62.67 | 27.95 | 58.68 | 39.41 | 29.51 | 26.74 | 35.24 | 57.99 | 65.28 | 44.83 |
| | CTNet | 64.93 | 26.22 | 58.85 | 30.03 | 27.60 | 28.99 | 31.08 | 61.28 | 60.94 | 43.33 |
| | EEGInception | 51.04 | 25.17 | 51.56 | 29.86 | 27.78 | 34.38 | 32.47 | 44.62 | 60.94 | 39.76 |
Significant values are in [bold].
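The reported averages can be recomputed directly from the per-subject values in Table 6. The sketch below does this for the two CCST rows and also computes a standard deviation, which the column header mentions but the table does not list. Using the sample standard deviation (`statistics.stdev`) is an assumption; the paper does not state which estimator was used.

```python
from statistics import mean, stdev

# Per-subject CCST accuracies (%) copied from Table 6.
ccst_with_da = [60.94, 29.51, 67.88, 46.53, 29.69, 48.61, 51.91, 81.08, 76.22]
ccst_without_da = [66.67, 38.89, 86.28, 61.46, 34.90, 65.28, 79.86, 91.49, 89.58]


def summarize(accuracies):
    """Return (mean, sample standard deviation), both rounded to two decimals."""
    return round(mean(accuracies), 2), round(stdev(accuracies), 2)


mean_with_da, sd_with_da = summarize(ccst_with_da)
mean_without_da, sd_without_da = summarize(ccst_without_da)
# The means reproduce the table entries: 54.71 (with DA) and 68.27 (without DA).
```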
Aggregate analysis (see Fig. 4A and C) shows that, despite inherent inter-subject differences, CCST attains higher mean accuracy and lower variability than competing models. The box/strip plots in Fig. 4B and D further show that the distribution of CCST accuracies is clustered toward the higher end, indicating better generalization across subjects and robustness to subject-specific EEG variability.
Fig. 4.
Performance comparison of CCST and baseline models on the BCI IV 2a dataset using leave-one-subject-out cross-validation. (A) and (C) Grouped bar chart showing the mean classification accuracy (%) for each model with bars representing the standard deviation across subjects. (B) and (D) Box plot showing the distribution of classification accuracies for each model highlighting inter-subject variability.
Performance on the BCI IV 2b dataset
A similar evaluation was conducted on the BCI IV 2b dataset using the same leave-one-subject-out cross-validation protocol. As shown in Table 7, CCST achieved the best overall performance among all models, with an average accuracy of 75.79% (with DA) and 76.61% (without DA). These results surpass the next-best methods such as ShallowConvNet (73.73% with DA, 73.66% without DA) and EEGNet (74.68% with DA, 73.67% without DA).
Table 7.
Comparison of results of CCST and benchmark models on BCI IV 2b dataset.
| DA Type | Model | Sub 1 | Sub 2 | Sub 3 | Sub 4 | Sub 5 | Sub 6 | Sub 7 | Sub 8 | Sub 9 | Avg ± SD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| With DA | CCST | 70.14 | 58.53 | 58.47 | 86.08 | 83.11 | 79.17 | 81.25 | 79.08 | 86.25 | 75.79 |
| | EEGConformer | 71.53 | 58.24 | 57.08 | 90.81 | 72.84 | 61.25 | 72.36 | 78.95 | 80.14 | 71.47 |
| | ShallowConvNet | 72.08 | 57.06 | 55.83 | 90.81 | 74.46 | 80.14 | 74.31 | 79.87 | 79.03 | 73.73 |
| | EEGNet | 74.17 | 59.71 | 56.53 | 91.22 | 73.38 | 84.03 | 73.19 | 80.00 | 79.86 | 74.68 |
| | DeepConvNet | 73.61 | 58.97 | 58.61 | 90.54 | 73.78 | 73.19 | 74.44 | 79.47 | 80.14 | 73.64 |
| | TCN | 69.86 | 57.65 | 55.14 | 89.46 | 75.27 | 75.97 | 72.50 | 79.34 | 76.11 | 72.37 |
| | TIDNet | 69.17 | 57.94 | 56.25 | 87.43 | 68.51 | 77.92 | 72.92 | 75.26 | 75.28 | 71.19 |
| | MSCFormer | 76.81 | 55.59 | 55.69 | 91.62 | 74.86 | 65.14 | 73.06 | 76.97 | 80.56 | 72.26 |
| | TCANet | 72.08 | 57.21 | 56.53 | 92.43 | 73.38 | 66.81 | 75.69 | 77.24 | 78.19 | 72.17 |
| | CTNet | 73.75 | 57.35 | 55.56 | 91.62 | 71.22 | 77.22 | 74.17 | 78.68 | 79.31 | 73.21 |
| | EEGInception | 56.25 | 52.65 | 53.33 | 82.70 | 65.00 | 65.00 | 69.17 | 78.29 | 72.08 | 66.05 |
| Without DA | CCST | 69.31 | 58.09 | 57.92 | 88.24 | 86.22 | 82.78 | 81.11 | 79.74 | 86.11 | 76.61 |
| | EEGConformer | 70.42 | 54.41 | 55.00 | 88.65 | 70.81 | 77.50 | 71.53 | 78.29 | 79.58 | 71.80 |
| | ShallowConvNet | 71.94 | 57.79 | 54.58 | 91.76 | 73.38 | 78.75 | 74.72 | 79.21 | 80.83 | 73.66 |
| | EEGNet | 70.69 | 57.21 | 57.36 | 88.38 | 73.38 | 81.94 | 74.31 | 80.00 | 79.72 | 73.67 |
| | DeepConvNet | 70.00 | 56.03 | 57.22 | 88.11 | 67.43 | 72.78 | 69.58 | 79.34 | 77.50 | 70.89 |
| | TCN | 68.75 | 56.18 | 54.44 | 86.76 | 73.65 | 79.03 | 73.06 | 78.42 | 77.50 | 71.98 |
| | TIDNet | 68.06 | 56.03 | 56.53 | 87.30 | 67.57 | 71.67 | 69.03 | 79.21 | 73.33 | 69.86 |
| | MSCFormer | 75.00 | 56.03 | 56.53 | 92.84 | 71.22 | 77.92 | 74.17 | 79.34 | 78.47 | 73.50 |
| | TCANet | 68.19 | 60.59 | 57.50 | 89.46 | 69.19 | 71.25 | 70.83 | 78.42 | 76.53 | 71.33 |
| | CTNet | 67.36 | 56.03 | 57.08 | 91.49 | 71.22 | 75.69 | 75.00 | 79.21 | 77.78 | 72.32 |
| | EEGInception | 68.33 | 52.06 | 55.28 | 87.16 | 65.41 | 71.81 | 70.83 | 78.16 | 74.44 | 69.28 |
Significant values are in [bold].
CCST exhibited particularly strong results for Subjects 5, 7, and 9, where it outperformed all baselines both with and without DA, indicating that it can generalize effectively across individuals with diverse EEG patterns.
Fig. 5.
Performance comparison of CCST and baseline models on the BCI IV 2b dataset using leave-one-subject-out cross-validation. (A) and (C) Grouped bar chart showing the mean classification accuracy (%) for each model with bars representing the standard deviation across subjects. (B) and (D) Box plot showing the distribution of classification accuracies for each model highlighting inter-subject variability.
The grouped bar charts (see Fig. 5A and C) for the BCI IV 2b dataset show that CCST not only achieves high mean test accuracy but also maintains low variability across subjects. The box/strip plots (see Fig. 5B and D) show that the distribution of test accuracies for CCST is more concentrated, with the majority of individual results falling toward the higher end of the performance range.
Performance on the PhysioNet MI dataset
To further assess generalization, the CCST model was evaluated on the large-scale PhysioNet MI dataset comprising 109 subjects. As shown in Table 8, CCST achieved the highest overall accuracy among all compared models: 71.89% with data augmentation and 71.70% without. In contrast, most competing models, including EEGConformer, DeepConvNet, and TCN, averaged roughly 49–56%. The exception is CTNet, which reached 70.40% without data augmentation.
Table 8.
Comparison of results of CCST and benchmark models on PhysioNet MI dataset.
| # | Model | Average accuracy with DA (109 subjects) | Average accuracy without DA (109 subjects) |
|---|---|---|---|
| 1 | CCST | 71.89 | 71.70 |
| 2 | EEGConformer | 49.83 | 49.36 |
| 3 | ShallowConvNet | 50.02 | 50.32 |
| 4 | EEGNet | 49.95 | 54.01 |
| 5 | DeepConvNet | 49.97 | 49.96 |
| 6 | TCN | 50.00 | 49.93 |
| 7 | TIDNet | 49.91 | 50.01 |
| 8 | MSCFormer | 50.81 | 55.91 |
| 9 | TCANet | 50.44 | 50.25 |
| 10 | CTNet | 50.84 | 70.40 |
| 11 | EEGInception | 49.91 | 50.17 |
Significant values are in [bold].
These results indicate that CCST generalizes effectively across diverse subjects and performs consistently with and without data augmentation. Figure 6A and B shows that CCST outperforms all baselines in both conditions, by more than 20 percentage points with data augmentation; without it, the margin over CTNet is small (71.70% vs. 70.40%).
Fig. 6.
Performance comparison of CCST and baseline models on the PhysioNet MI dataset using leave-one-subject-out cross-validation. (A) Grouped bar chart showing the mean classification accuracy (%) for each model without data augmentation. (B) Grouped bar chart showing the mean classification accuracy (%) for each model with data augmentation.
Statistical analysis
To assess whether the performance improvements of CCST are statistically significant, paired statistical tests were conducted comparing CCST against each baseline model across the BCI IV 2a, 2b, and PhysioNet MI datasets. For each subject, the difference in test accuracy between CCST and a baseline model was calculated. The normality of these differences was evaluated using the Shapiro–Wilk test74. When normality was not satisfied, the non-parametric Wilcoxon signed-rank test75 was applied; otherwise, a paired t-test was used. To correct for multiple comparisons across baseline models, a Bonferroni-adjusted significance threshold of p < 0.005 was applied76. Figure 7 visualizes the p-values across all comparisons, where darker shades indicate stronger statistical significance (p < 0.005).
Fig. 7.

Statistical Significance (p-value < 0.005) for CCST vs. Baseline Models based on leave-one-subject-out test accuracies. The color gradient represents the significance level of the p-values with dark shades indicating stronger statistical significance (p < 0.005) where CCST demonstrated a significant performance improvement over a given baseline model.
BCI IV 2a (without DA): CCST achieved statistically significant improvements over all benchmark models, including EEGNet, at the Bonferroni-corrected threshold (p < 0.005).
BCI IV 2b (without DA): Only the comparison between CCST and TIDNet reached statistical significance; all other comparisons were not significant under the Bonferroni-corrected threshold. CCST nevertheless achieved the highest average accuracy, exceeding the best baseline by roughly 5 to 13 percentage points for subjects 5, 7, and 9 and by less than 1 point for subject 6 (Table 7).
PhysioNet MI (without DA): CCST significantly outperformed all baseline models except CTNet, for which the difference was not significant. With data augmentation, however, CCST exceeded CTNet by approximately 21 percentage points (71.89% vs. 50.84%).
Overall, the statistical analysis indicates that CCST provides meaningful performance improvements in subject-independent classification settings, particularly on the BCI IV 2a and PhysioNet MI datasets. The lack of statistically significant differences on the BCI IV 2b dataset suggests that performance variations across subjects in this dataset may be more substantial and require further analysis. Nevertheless, the results support CCST's potential as a robust and effective EEG classification model for BCI applications.
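The non-parametric branch of the procedure described in this section can be illustrated with a small exact Wilcoxon signed-rank test written in plain Python. This is a sketch under stated assumptions, not the authors' analysis code: it omits the Shapiro–Wilk check and the paired t-test branch, it enumerates all sign patterns (feasible only for small samples such as nine subjects), and it assumes a family-wise level of 0.05 divided over the ten baseline models. The data are the CCST and EEGNet rows of Table 6 (without DA).

```python
from itertools import product


def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance level under Bonferroni correction."""
    return alpha / n_comparisons


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_exact_p(x, y):
    """Two-sided exact p-value of the Wilcoxon signed-rank test (zero differences dropped)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Null distribution: every sign pattern over the ranks is equally likely.
    sums = [sum(r for r, s in zip(ranks, signs) if s)
            for signs in product((0, 1), repeat=len(ranks))]
    lower = sum(s <= w_plus for s in sums) / len(sums)
    upper = sum(s >= w_plus for s in sums) / len(sums)
    return min(1.0, 2 * min(lower, upper))


# Per-subject accuracies (%) from Table 6, without DA.
ccst = [66.67, 38.89, 86.28, 61.46, 34.90, 65.28, 79.86, 91.49, 89.58]
eegnet = [66.32, 24.83, 65.80, 40.97, 24.65, 26.56, 43.92, 64.58, 61.81]

threshold = bonferroni_threshold(0.05, 10)  # assumed: ten baseline comparisons
p_value = wilcoxon_exact_p(ccst, eegnet)
significant = p_value < threshold
```

Because CCST is higher than EEGNet for all nine subjects, the exact two-sided p-value takes its smallest possible value for n = 9, namely 2/512 ≈ 0.0039, which falls below the 0.005 threshold. This also shows that with nine subjects only a comparison in which CCST wins for every subject can reach significance under this correction.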
Discussion
The Compact Convolutional Swin Transformer (CCST) represents a significant advancement in EEG-based motor imagery (MI) classification by effectively addressing the challenges of inter-subject variability, feature representation, and computational efficiency. The superior performance of CCST can be primarily attributed to its hybrid design, which integrates convolutional layers for localized spatial feature extraction with hierarchical window-based self-attention from the Swin Transformer to capture long-range temporal dependencies. This combination allows CCST to describe both detailed and broad EEG patterns essential for attaining substantial generalization across unseen participants.
Experimental results on the BCI Competition IV 2a, 2b, and PhysioNet MI datasets show that CCST outperforms established CNN- and Transformer-based models in mean accuracy. On the BCI IV 2a dataset, CCST achieved a mean accuracy of 68.27%, while on BCI IV 2b it reached 76.61% (both without data augmentation), surpassing the existing baselines. Similarly, on the large-scale PhysioNet MI dataset, CCST attained an average accuracy of 71.89% with data augmentation (71.70% without), further supporting its ability to generalize across diverse EEG patterns. These results underscore CCST's strength in maintaining stable performance despite considerable inter-subject and dataset variability.
Data augmentation (DA) had mixed effects across datasets and models (Tables 6–8). For CCST, mean accuracy with and without DA was nearly identical on PhysioNet MI (71.89% vs. 71.70%) and BCI IV 2b (75.79% vs. 76.61%), whereas on BCI IV 2a DA lowered it from 68.27% to 54.71%. Several benchmark architectures (e.g., EEGNet, DeepConvNet, TCN) likewise performed worse with DA on BCI IV 2a. Such behavior is plausible in subject-independent LOSO settings, where overly aggressive augmentation can distort the signal characteristics that models exploit through their inductive biases. Our augmentation suite comprised time shifts (±50 ms), Gaussian noise, channel dropout, and amplitude scaling (0.8–1.2). Excessive noise or dropout may perturb compact spatial filters and induce overfitting to augmented samples. To ensure transparency, we report both “with DA” and “without DA” results throughout the paper. Future work may explore adaptive, model-specific, or subject-conditioned augmentation strategies to further improve cross-subject generalization.
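The four augmentations named above can be sketched for a single trial stored as a list of channels. This is an illustrative sketch only: the function name `augment_trial`, the circular form of the time shift, zeroing as the channel-dropout mechanism, the 250 Hz sampling rate, and the noise and dropout parameters are all assumptions, since the text does not give these values.

```python
import random


def augment_trial(trial, fs=250, max_shift_ms=50, noise_std=0.01,
                  drop_prob=0.1, scale_range=(0.8, 1.2), rng=None):
    """Return an augmented copy of `trial`, a list of channels (lists of floats)."""
    rng = rng or random.Random()
    max_shift = int(round(fs * max_shift_ms / 1000))  # +/-50 ms in samples
    shift = rng.randint(-max_shift, max_shift)        # one shift for the whole trial
    scale = rng.uniform(*scale_range)                 # one amplitude factor per trial
    augmented = []
    for channel in trial:
        n = len(channel)
        if rng.random() < drop_prob:
            # Channel dropout: replace the channel with zeros (assumed mechanism).
            augmented.append([0.0] * n)
            continue
        # Circular time shift (assumed; the paper does not specify edge handling).
        shifted = [channel[(i - shift) % n] for i in range(n)]
        augmented.append([scale * v + rng.gauss(0.0, noise_std) for v in shifted])
    return augmented


# Toy trial: 3 channels of 100 samples with made-up values.
toy_trial = [[float(i + c) for i in range(100)] for c in range(3)]
augmented_trial = augment_trial(toy_trial, rng=random.Random(0))
```

Applying one shift and one scale factor per trial keeps the relationship between channels intact, whereas noise and dropout act per channel. Whether the original suite made the same choice is not stated.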
In addition to accuracy gains, CCST is computationally efficient. By using localized attention windows and compact convolutional modules, the model reduces the number of parameters and floating-point operations (FLOPs) compared to full self-attention architectures. This efficiency makes CCST well suited to real-time BCI applications, including deployment in resource-constrained environments such as wearable EEG systems or mobile neurofeedback platforms. The architecture thus offers a favorable trade-off between model complexity and performance, providing both scalability and practical utility.
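A rough operation count shows why windowed attention is cheaper than full self-attention. The sketch below counts only the multiply-accumulate operations of the two attention matrix products (query-key scores and the weighted sum of values) and ignores projections, the MLP, shifted-window masking, and multi-head splitting. The token count of 15, window size of 4, and width of 64 are taken from Table 5; padding the sequence to a whole number of windows is an assumption, and the function names are illustrative.

```python
import math


def full_attention_macs(n_tokens, dim):
    """MACs for QK^T and the attention-weighted sum over all token pairs."""
    return 2 * n_tokens * n_tokens * dim


def windowed_attention_macs(n_tokens, window, dim):
    """Same count when attention is restricted to non-overlapping windows."""
    n_windows = math.ceil(n_tokens / window)  # assumed: sequence padded to full windows
    return n_windows * 2 * window * window * dim


full_cost = full_attention_macs(15, 64)             # 28800
windowed_cost = windowed_attention_macs(15, 4, 64)  # 8192
ratio = full_cost / windowed_cost
```

Under these simplifications the windowed variant needs about 3.5 times fewer attention operations per layer at this sequence length, and the gap grows linearly with the number of tokens.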
Despite these advantages, the observed variation in performance across datasets suggests that CCST's generalization capability may still be influenced by factors such as electrode configuration, sampling rate, and channel count. Future research should focus on adaptive attention windowing and dynamic feature recalibration procedures designed for the spatial-temporal structure of EEG, to improve adaptability across diverse datasets.
Furthermore, integrating domain adaptation techniques and multimodal fusion strategies, such as combining EEG with electromyography (EMG) or functional near-infrared spectroscopy (fNIRS), could extend CCST's applicability to more complex and realistic BCI scenarios. Such advancements would facilitate the translation of CCST from controlled laboratory conditions to real-world assistive, clinical, and neurorehabilitation environments, thereby paving the way toward practical, user-adaptive brain–computer interfaces.
Conclusion
In this study, we introduced the Compact Convolutional Swin Transformer (CCST), a novel hybrid architecture specifically designed to address the challenges of subject-independent EEG-based motor imagery classification. By combining local convolutional feature extraction with global hierarchical self-attention, CCST effectively captures both spatially localized and temporally distributed EEG representations.
Extensive evaluations on three benchmark datasets, BCI Competition IV 2a, 2b, and PhysioNet MI, showed that CCST outperforms established deep learning models in mean accuracy. The model achieved state-of-the-art average accuracies of 68.27% and 76.61% on the BCI IV 2a and 2b datasets (without data augmentation), respectively, and 71.89% on the large-scale PhysioNet MI dataset (with data augmentation; 71.70% without). These results highlight CCST's generalization ability across subjects and its robustness to inter-subject variability.
Overall, CCST provides a computationally efficient and accurate framework that bridges the strengths of convolutional and transformer-based paradigms for EEG decoding. Future research will explore real-time deployment, adaptive domain alignment for cross-session generalization, and multimodal extensions integrating additional physiological signals. Such developments could further advance CCST toward practical, scalable applications in neurorehabilitation, cognitive monitoring, and assistive brain–computer interface systems.
Acknowledgements
This research was funded by Private Institution "Nazarbayev University Research Administration" under the Faculty Development Competitive Research Grants Program for 2025–2027, Grant No. 040225FD4727, B.A.
Author contributions
W.Q. took the lead in writing the manuscript, including creating the figures and tables. B.A. supervised the research and was in charge of overall direction and planning. All authors provided critical feedback and helped shape the research, analysis, and manuscript.
Data availability
The datasets used in this study are publicly available from the BCI Competition IV repository (https://www.bbci.de/competition/iv) and from PhysioNet (https://physionet.org/content/eegmmidb/1.0.0/). The code for this study is available at (https://github.com/WasiUrRehmanQamar/CCST).
Competing Interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Zhang, H. et al. Brain-computer interfaces: the innovative key to unlocking neurological conditions. Int. J. Surg. 110, 5745–5762. 10.1097/js9.0000000000002022 (2024).
- 2. Shih, J. J., Krusienski, D. J. & Wolpaw, J. R. Brain-computer interfaces in medicine. Mayo Clin. Proc. 87, 268–279. 10.1016/j.mayocp.2011.12.008 (2012).
- 3. Abibullaev, B., Zollanvari, A., Saduanov, B. & Alizadeh, T. Design and optimization of a bci-driven telepresence robot through programming by demonstration. IEEE Access 7, 111625–111636 (2019).
- 4. Yadav, H. & Maini, S. Electroencephalogram based brain-computer interface: applications, challenges, and opportunities. Multimedia Tools Appl. 82, 47003–47047. 10.1007/s11042-023-15653-x (2023).
- 5. Abibullaev, B., Dolzhikova, I. & Zollanvari, A. A brute-force cnn model selection for accurate classification of sensorimotor rhythms in bcis. IEEE Access 8, 101014–101023. 10.1109/ACCESS.2020.2997681 (2020).
- 6. Keutayeva, A. & Abibullaev, B. Data constraints and performance optimization for transformer-based models in eeg-based brain-computer interfaces: A survey. IEEE Access 12, 62628–62647. 10.1109/access.2024.3394696 (2024).
- 7. Keutayeva, A., Zollanvari, A. & Abibullaev, B. Evolving Trends and Future Prospects of Transformer Models in EEG-Based Motor-Imagery BCI Systems, 233–256 (Springer Nature, 2024).
- 8. Barmpas, K. et al. A causal perspective on brainwave modeling for brain–computer interfaces. J. Neural Eng. 21, 036001. 10.1088/1741-2552/ad3eb5 (2024).
- 9. Lawhern, V. J. et al. EEGNet: a compact convolutional neural network for EEG-based brain-computer interfaces. J. Neural Eng. 15, 326. 10.1088/1741-2552/aace8c (2018).
- 10. Wang, C. et al. Mi-eeg classification using shannon complex wavelet and convolutional neural networks. Appl. Soft Comput. 130, 109685. 10.1016/j.asoc.2022.109685 (2022).
- 11. Barmpas, K., Panagakis, Y., Adamos, D. A., Laskaris, N. & Zafeiriou, S. Brainwave-scattering net: a lightweight network for eeg-based motor imagery recognition. J. Neural Eng. 20, 056014. 10.1088/1741-2552/acf78a (2023).
- 12. Abibullaev, B., Keutayeva, A. & Zollanvari, A. Deep learning in eeg-based bcis: a comprehensive review of transformer models, advantages, challenges, and applications. IEEE Access 11, 127271–127301. 10.1109/access.2023.3329678 (2023).
- 13. Hamidi, A. & Kiani, K. Motor imagery eeg signals classification using a transformer-gcn approach. Appl. Soft Comput. 2024, 112686. 10.1016/j.asoc.2024.112686 (2024).
- 14. Zhao, Q. & Zhu, W. Tmsa-net: a novel attention mechanism for improved motor imagery eeg signal processing. Biomed. Signal Process. Control 102, 107189. 10.1016/j.bspc.2024.107189 (2024).
- 15. Du, Y., Xu, Y., Wang, X., Liu, L. & Ma, P. Eeg temporal-spatial transformer for person identification. Sci. Rep. 12, 236. 10.1038/s41598-022-18502-3 (2022).
- 16. Hameed, A. et al. Temporal–spatial transformer based motor imagery classification for bci using independent component analysis. Biomed. Signal Process. Control 87, 105359. 10.1016/j.bspc.2023.105359 (2023).
- 17. Tan, X., Wang, D., Chen, J. & Xu, M. Transformer-based network with optimization for cross-subject motor imagery identification. Bioengineering 10, 609. 10.3390/bioengineering10050609 (2023).
- 18. Kim, S., Lee, D., Kwak, H. & Lee, S. Toward domain-free transformer for generalized eeg pre-training. IEEE Trans. Neural Syst. Rehabil. Eng. 32, 482–492. 10.1109/TNSRE.2024.3355434 (2024).
- 19. Wang, P., Gong, P., Zhou, Y., Wen, X. & Zhang, D. Decoding the continuous motion imagery trajectories of upper limb skeleton points for eeg-based brain–computer interface. IEEE Trans. Instrum. Meas. 72, 1–12. 10.1109/tim.2022.3224991 (2022).
- 20. Song, Y., Zheng, Q., Wang, Q., Gao, X. & Heng, P.-A. Global adaptive transformer for cross-subject enhanced eeg classification. IEEE Trans. Neural Syst. Rehabil. Eng. 31, 2767–2777. 10.1109/tnsre.2023.3285309 (2023).
- 21. Song, Y., Zheng, Q., Liu, B. & Gao, X. Convolutional transformer for eeg decoding and visualization: Eeg conformer. IEEE Trans. Neural Syst. Rehabil. Eng. 31, 710–719. 10.1109/tnsre.2022.3230250 (2022).
- 22. Zhang, D., Yao, L., Chen, K. & Monaghan, J. A convolutional recurrent attention model for subject-independent eeg signal analysis. IEEE Signal Process. Lett. 26, 715–719. 10.1109/lsp.2019.2906824 (2019).
- 23. Liu, H., Liu, Y., Wang, Y., Liu, B. & Bao, X. Eeg classification algorithm of motor imagery based on cnn-transformer fusion network. In Proceedings of the 2022 International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) (2022). 10.1109/trustcom56396.2022.00182.
- 24. Deny, P. & Choi, K. W. Hierarchical transformer for brain computer interface. In Proceedings of the 2023 IEEE International Conference on Brain-Computer Interface (BCI) 1–5 (2023). 10.1109/bci57258.2023.10078473.
- 25. Tao, Y. et al. Gated transformer for decoding human brain eeg signals. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 125–130 (2021). 10.1109/embc46164.2021.9630210.
- 26. Luo, J. et al. A shallow mirror transformer for subject-independent motor imagery bci. Comput. Biol. Med. 164, 107254. 10.1016/j.compbiomed.2023.107254 (2023).
- 27. Sun, J., Xie, J. & Zhou, H. Eeg classification with transformer-based models. In 2021 IEEE 3rd Global Conference on Life Sciences and Technologies (LifeTech) 92–93 (2021). 10.1109/lifetech52111.2021.9391844.
- 28. Bai, S., Kolter, J. Z. & Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. In International Conference on Learning Representations (ICLR) Workshop (2018).
- 29. Supratak, A., Dong, H., Wu, C. & Guo, Y. Deepsleepnet: a model for automatic sleep stage scoring based on raw single-channel eeg. IEEE Trans. Neural Syst. Rehabil. Eng. 25, 1998–2008. 10.1109/tnsre.2017.2721116 (2017).
- 30. Schirrmeister, R. T. et al. Deep learning with convolutional neural networks for eeg decoding and visualization. Hum. Brain Mapp. 10.1002/hbm.23730 (2017).
- 31. Kostas, D. & Rudzicz, F. Thinker invariance: enabling deep neural networks for bci across more people. J. Neural Eng. 17, 056008. 10.1088/1741-2552/abb7a7 (2020).
- 32. Santamaria-Vazquez, E., Martinez-Cagigal, V., Vaquerizo-Villar, F. & Hornero, R. Eeg-inception: a novel deep convolutional neural network for assistive erp-based brain-computer interfaces. IEEE Trans. Neural Syst. Rehabil. Eng. 28, 2773–2782. 10.1109/TNSRE.2020.3048106 (2020).
- 33. Liao, W., Miao, Z., Liang, S., Zhang, L. & Li, C. A composite improved attention convolutional network for motor imagery eeg classification. Front. Neurosci. 19, 236. 10.3389/fnins.2025.1543508 (2025).
- 34. Liang, T. et al. Eeg-cdilnet: a lightweight and accurate ccnn network using circular dilated convolution for motor imagery classification. J. Neural Eng. 20, 046031. 10.1088/1741-2552/acee1f (2023).
- 35. Wu, Z., Sun, B. & Zhu, X. Coupling convolution, transformer and graph embedding for motor imagery brain-computer interfaces. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS) 404–408 (2022). 10.1109/iscas48785.2022.9937435.
- 36. Liu, B., Wang, Y., Gao, L. & Cai, Z. Enhanced electroencephalogram signal classification: a hybrid convolutional neural network with attention-based feature selection. Brain Res. 2025, 149484. 10.1016/j.brainres.2025.149484 (2025).
- 37.Ma, Y., Song, Y. & Gao, F. A novel hybrid cnn-transformer model for eeg motor imagery classification. In 2022 International Joint Conference on Neural Networks (IJCNN) 1–8 (2022). 10.1109/ijcnn55064.2022.9892821.
- 38.Khademi, Z., Ebrahimi, F. & Kordy, H. M. A review of critical challenges in MI-BCI: from conventional to deep learning methods. J. Neurosci. Methods 383, 109736. 10.1016/j.jneumeth.2022.109736 (2022).
- 39.Wei, C. et al. Editorial: inter- and intra-subject variability in brain imaging and decoding. Front. Comput. Neurosci. 15, 253. 10.3389/fncom.2021.791129 (2021).
- 40.Huang, G. et al. Discrepancy between inter- and intra-subject variability in EEG-based motor imagery brain-computer interface: evidence from multiple perspectives. Front. Neurosci. 17, 236. 10.3389/fnins.2023.1122661 (2023).
- 41.Altaheri, H., Muhammad, G. & Alsulaiman, M. Physics-informed attention temporal convolutional network for EEG-based motor imagery classification. IEEE Trans. Industr. Inf. 19, 2249–2258. 10.1109/tii.2022.3197419 (2022).
- 42.Keutayeva, A., Fakhrutdinov, N. & Abibullaev, B. Compact convolutional transformer for subject-independent motor imagery EEG-based BCIs. Sci. Rep. 14, 236. 10.1038/s41598-024-73755-4 (2024).
- 43.Keutayeva, A. & Abibullaev, B. Exploring the potential of attention mechanism-based deep learning for robust subject-independent motor-imagery based BCIs. IEEE Access 11, 107562–107580. 10.1109/access.2023.3320561 (2023).
- 44.Vafaei, E. & Hosseini, M. Transformers in EEG analysis: a review of architectures and applications in motor imagery, seizure, and emotion classification. Sensors 25, 1293. 10.3390/s25051293 (2025).
- 45.Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems (2017).
- 46.Liu, Z. et al. Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 10012–10022 (2021).
- 47.Dai, G., Zhou, J., Huang, J. & Wang, N. HS-CNN: a CNN with hybrid convolution scale for EEG motor imagery classification. J. Neural Eng. 17, 016025. 10.1088/1741-2552/ab405f (2019).
- 48.Amin, S. U., Alsulaiman, M., Muhammad, G., Mekhtiche, M. A. & Hossain, M. S. Deep learning for EEG motor imagery classification based on multi-layer CNNs feature fusion. Futur. Gener. Comput. Syst. 101, 542–554. 10.1016/j.future.2019.06.027 (2019).
- 49.Zhao, W. et al. TCANet: a temporal convolutional attention network for motor imagery EEG decoding. Cogn. Neurodyn. 19, 91. 10.1007/s11571-025-10275-5 (2025).
- 50.Zhao, W., Jiang, X., Zhang, B., Xiao, S. & Weng, S. CTNet: a convolutional transformer network for EEG-based motor imagery classification. Sci. Rep. 14, 20237. 10.1038/s41598-024-71118-7 (2024).
- 51.Zhao, W. et al. Multi-scale convolutional transformer network for motor imagery brain-computer interface. Sci. Rep. 15, 12935. 10.1038/s41598-025-96611-5 (2025).
- 52.Deng, X., Huo, H., Ai, L., Xu, D. & Li, C. A novel 3D approach with a CNN and Swin transformer for decoding EEG-based motor imagery classification. Sensors 25, 2922. 10.3390/s25092922 (2025).
- 53.Fournier, Q., Caron, G. M. & Aloise, D. A practical survey on faster and lighter transformers. ACM Comput. Surv. 55, 1–40. 10.1145/3586074 (2023).
- 54.Stolk, A. et al. Electrocorticographic dissociation of alpha and beta rhythmic activity in the human sensorimotor system. ELife 8, 256. 10.7554/elife.48065 (2019).
- 55.Jiang, W.-B., Zhao, L.-M. & Lu, B.-L. Large brain model for learning generic representations with tremendous EEG data in BCI. In The Twelfth International Conference on Learning Representations (2024).
- 56.Huang, Z., Liang, D., Xu, P. & Xiang, B. Improve transformer models with better relative position embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2020 (eds. Cohn, T., He, Y. & Liu, Y.) 3327–3335 (Association for Computational Linguistics, 2020). 10.18653/v1/2020.findings-emnlp.298.
- 57.Sun, C. et al. Learning high-frequency functions made easy with sinusoidal positional encoding. In Proceedings of ICML 2024, ICML’24 (JMLR.org, 2024).
- 58.Morales, S. & Bowers, M. E. Time-frequency analysis methods and their application in developmental EEG data. Dev. Cogn. Neurosci. 54, 101067. 10.1016/j.dcn.2022.101067 (2022).
- 59.Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (ICLR) (2021).
- 60.Gehring, J., Auli, M., Grangier, D., Yarats, D. & Dauphin, Y. N. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17 1243–1252 (JMLR.org, 2017).
- 61.Wang, B. et al. On position embeddings in BERT. In International Conference on Learning Representations (ICLR) (2021).
- 62.Huang, W., Huang, Q., Ma, L., Chen, Z. & Wang, C. SwG-former: sliding-window graph convolutional network integrated with conformer for sound event localization and detection. arXiv. 10.48550/arxiv.2310.14016 (2023).
- 63.Banos, O., Galvez, J.-M., Damas, M., Pomares, H. & Rojas, I. Window size impact in human activity recognition. Sensors 14, 6474–6499. 10.3390/s140406474 (2014).
- 64.Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
- 65.Li, Y. et al. A survey on dropout methods and experimental verification in recommendation. IEEE Trans. Knowl. Data Eng. 2022, 1–20. 10.1109/tkde.2022.3187013 (2022).
- 66.Pauls, C. & Yoder, M. Determining the optimal dropout rate for neural networks. In Proceedings of the Midwest Instruction and Computing Symposium (MICS) (2018).
- 67.Pansambal, B. H. & Nandgaokar, A. B. Integrating dropout regularization technique at different layers to improve the performance of neural networks. Int. J. Adv. Comput. Sci. Appl. 14, 236. 10.14569/ijacsa.2023.0140478 (2023).
- 68.Chen, T. & Li, L. FIT: far-reaching interleaved transformers. 10.48550/arxiv.2305.12689 (2023).
- 69.Zhu, Z. & Soricut, R. H-Transformer-1D: fast one-dimensional hierarchical attention for sequences. 10.48550/arxiv.2107.11906 (2021).
- 70.Tangermann, M. et al. Review of the BCI competition IV. Front. Neurosci. 6, 236. 10.3389/fnins.2012.00055 (2012).
- 71.Goldberger, A. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, e215–e220 (2000).
- 72.Groppe, D. M. et al. Dominant frequencies of resting human brain activity as measured by the electrocorticogram. Neuroimage 79, 223–233. 10.1016/j.neuroimage.2013.04.044 (2013).
- 73.Nie, Y., Nguyen, N. H., Sinthong, P. & Kalagnanam, J. A time series is worth 64 words: long-term forecasting with transformers. In International Conference on Learning Representations (2023).
- 74.Shapiro, S. S. & Wilk, M. B. An analysis of variance test for normality. Biometrika 52, 591–611. 10.1093/biomet/52.3-4.591 (1965).
- 75.Siegel, S. Nonparametric Statistics for the Behavioral Sciences (McGraw-Hill, 1956).
- 76.Dunn, O. J. Multiple comparisons among means. J. Am. Stat. Assoc. 56, 52–64. 10.1080/01621459.1961.10482090 (1961).
Data Availability Statement
The datasets used in this study are publicly available from the BCI Competition IV repository (https://www.bbci.de/competition/iv) and from PhysioNet (https://physionet.org/content/eegmmidb/1.0.0/). The code for this study is available at https://github.com/WasiUrRehmanQamar/CCST.