NPJ Digital Medicine. 2025 Dec 18;8:768. doi: 10.1038/s41746-025-02154-4

An Automated Classifier of Harmful Brain Activities for Clinical Usage Based on a Vision-Inspired Pre-trained Framework

Yulin Sun 1,2,3, Xiaopeng Si 1, Runnan He 1, Xiao Hu 4, Peter Smielewski 5, Wenlong Wang 1, Xiaoguang Tong 6, Wei Yue 6, Meijun Pang 1, Kuo Zhang 1, Xizi Song 1, Dong Ming 1,2,3,, Xiuyun Liu 1,2,3,7,
PMCID: PMC12715239  PMID: 41413226

Abstract

Timely identification of harmful brain activities via electroencephalography (EEG) is critical for brain disease diagnosis and treatment, yet its application remains limited by inter-rater variability, resource constraints, and the poor generalizability of existing artificial intelligence models. In this study, we describe an automated classifier, VIPEEGNet, which leverages transfer learning from ImageNet-pretrained models to distinguish six types of brain activities. In the development cohort, the recall of VIPEEGNet ranges from 36.8% to 88.2% and the precision from 55.6% to 80.4%, performance comparable to that of human experts. Notably, external testing yielded Kullback-Leibler divergence (KLD) values of 0.223 (public) and 0.273 (private), ranking second among 2767 competing algorithms while using only 0.7% of the parameters of the top-ranked algorithm. Its minimal parameter requirements and modular design offer a deployable solution for real-time brain monitoring, potentially expanding access to expert-level EEG interpretation in resource-limited settings.

Subject terms: Computational biology and bioinformatics, Health care, Mathematics and computing, Medical research, Neurology, Neuroscience

Introduction

Globally, an estimated 43% of the population suffers from neurological diseases, which have become the leading cause of overall disease burden in the world1. As the most commonly used tool, electroencephalography (EEG) provides essential information for brain disease diagnosis and treatment guidance2. Timely identification of harmful brain activity via EEG plays a critical role in managing epilepsy patients and critically ill individuals, underpinning crucial decisions in seizure detection, brain deterioration monitoring, treatment outcome prediction, and rehabilitation planning3. Evidence further suggests that other epileptiform patterns, such as lateralized periodic discharges (LPD), generalized periodic discharges (GPD), lateralized rhythmic delta activity (LRDA), and generalized rhythmic delta activity (GRDA), which represent abnormal neuronal activities lying between overt seizures and normal EEG, may also cause neuronal damage4. While these patterns provide a conceptual framework for understanding pathology, we still lack an accurate classifier to distinguish different abnormal brain activities, especially harmful ones, to enable early intervention.

The current gold standard of manual EEG analysis by specialized neurologists presents significant limitations, including high time and cost demands, susceptibility to fatigue-induced errors, and poor assessment consistency5. Although advances in deep learning and the availability of larger EEG datasets have spurred the development of automated algorithms for classifying seizures and related patterns6,7, critical obstacles hinder their practical application and generalization. Most existing models are trained and validated on relatively small datasets, demonstrating poor performance in handling the inherent variability of EEG signals across diverse patient populations8,9. Even more critically, most models utilize one-dimensional EEG signals as input, lacking the robust and versatile pre-trained models that are prevalent in the field of computer vision, such as ResNet, ViT, and Swin Transformer trained on ImageNet10,11. Consequently, existing models optimized for specific datasets show poor transferability to new clinical datasets, making it difficult to achieve a balance between high accuracy, model efficiency, ease of deployment, and broad generalizability. One solution involves converting EEG signals into image representations, such as waveform plots or time-frequency spectrograms, and leveraging transfer learning to utilize weights from models pre-trained on ImageNet, thereby achieving high performance12–14. The primary challenge lies in determining the optimal method for this EEG-to-image conversion, which must avoid reliance on task-specific prior knowledge, minimize additional computational burden, and preserve the majority of the information present in the original EEG signals. This challenge raises a pivotal question: can transfer learning from state-of-the-art ImageNet-pretrained vision models, which dominate image classification in computer vision, be effectively utilized to distinguish harmful brain activities from multichannel one-dimensional EEG signals? Our goal was to propose an AI model, VIPEEGNet (Vision-Inspired Pre-trained Network for EEG Classification), for automatic classification of six critical brain activity patterns: seizure, GPD, LPD, LRDA, GRDA, and “other”15. Our core innovation is a general EEG-to-image conversion module, enabling the direct adaptation of state-of-the-art (SOTA) ImageNet pre-trained vision models for EEG analysis. This strategy, using only 0.7% of the parameters, matches the performance of complex algorithms that use 30 backbone networks as feature extractors, achieving significant efficiency.

We developed and internally validated the deep learning model on a cohort of 1950 subjects, where each EEG recording was annotated by at least three experts. The model’s generalizability was subsequently assessed through external validation using a separate cohort via an online testing platform, with each recording in this external set annotated by a panel of at least ten experts to ensure robust consensus. The newly established model, which we made publicly available on Kaggle (https://www.kaggle.com/code/sunyuri/hms-vipeegnet), exhibits strong generality and features a future-oriented modular design. The findings demonstrate that leveraging powerful ImageNet pre-trained vision models can achieve excellent EEG classification performance. VIPEEGNet represents a significant step towards bridging computer vision and EEG analysis, demonstrating the feasibility and power of transfer learning for this critical medical task and opening a new path for efficient and accurate EEG classification at the clinical bedside.

Results

Datasets

The dataset used in this study is available on Kaggle and can be accessed via https://kaggle.com/competitions/hms-harmful-brain-activity-classification. It was developed by Dr. Westover and his team members from the Massachusetts General Hospital (MGH) and Harvard Medical School (HMS) Department of Neurology5,16,17. A portion of this database has been uploaded to the Kaggle platform for researchers to develop and test their algorithms18. The data can be divided into two cohorts: the development cohort and the online testing cohort.

The development cohort was used for algorithm training and validation and included EEG recordings from 1950 patients (mean age 47.73 years [SD 25.52]; 963 females [49.38%]), with experts annotating 106,800 EEG segments (Table 1). The numbers of EEG segments corresponding to seizure, LPD, GPD, LRDA, GRDA, and “other” events were 20,933, 14,856, 16,702, 16,640, 18,861, and 18,808, respectively. The “other” category functions as a catch-all classification. It was assigned by experts to EEG segments that were not deemed to belong to any of the five specified event types (seizure, LPD, GPD, LRDA, or GRDA). Consequently, this category may encompass a variety of waveforms, including pathological patterns related to other diseases, artifacts or technical noise, and normal background EEG activity. The development cohort can be divided into two subsets based on the number of experts involved in the annotation: Subset 1, composed of EEG recordings from 1794 patients, with 11,900 EEG recordings and 68,854 EEG segments annotated by 1–7 experts, and Subset 2 (referred to as the high-quality annotation subset in this paper), composed of EEG recordings from 1063 patients, with 5939 EEG recordings and 39,946 EEG segments annotated by 10–28 experts.

Table 1.

The EEG segments and expert annotations of the development cohort

| | Whole | Subset 1 (1–7 labels) | Subset 2 (10–28 labels) |
|---|---|---|---|
| Patient number | 1950 | 1794 | 1063 |
| Age, y, mean (SD) | 47.73 (25.52) | ~ | ~ |
| Female, n (%) | 963 (49.38) | ~ | ~ |
| Segments, n (%) | 106,800 (100) | 68,854 (100) | 39,946 (100) |
| Seizure | 20,933 (19.6) | 20,337 (29.54) | 596 (1.49) |
| LPD | 14,856 (13.91) | 7416 (10.77) | 7440 (18.63) |
| GPD | 16,702 (15.64) | 6625 (9.62) | 10,077 (25.23) |
| LRDA | 16,640 (15.58) | 10,410 (15.12) | 6230 (15.6) |
| GRDA | 18,861 (17.66) | 13,478 (19.57) | 5383 (13.48) |
| Other | 18,808 (17.61) | 8588 (12.47) | 10,220 (25.58) |
| EEG recordings, n (%) | 17,089 (100) | 11,900 (100) | 5939 (100) |
| Seizure | 3973 (23.25) | 3694 (31.04) | 279 (4.7) |
| LPD | 3350 (19.6) | 2055 (17.27) | 1295 (21.81) |
| GPD | 2081 (12.18) | 806 (6.77) | 1275 (21.47) |
| LRDA | 1132 (6.62) | 771 (6.48) | 361 (6.08) |
| GRDA | 1930 (11.29) | 1483 (12.46) | 447 (7.53) |
| Other | 7717 (45.16) | 4555 (38.28) | 3162 (53.24) |

LPD denotes lateralized periodic discharge, GPD generalized periodic discharge, LRDA lateralized rhythmic delta activity, GRDA generalized rhythmic delta activity.

The online testing cohort consisted of a subset of EEG segments from an additional 1532 subjects, with each EEG sample annotated by at least 10 experts. Researchers could only upload classification models to classify the test data; the platform returned performance evaluation results without granting direct access to the test set, strictly preventing data leakage.

EEG electrodes were placed according to the 10–20 placement standard, and EEG data were resampled to a rate of 200 Hz. EEG data were re-referenced using longitudinal bipolar/double banana methods. Each EEG segment was 50 seconds long, and experts annotated the event type for the middle 10 seconds16. Experts could refer not only to the 50-second EEG but also to the spectrogram of the preceding and following 5 minutes as auxiliary information for annotation. As shown in Fig. 1a, in a continuous 400-second EEG recording, there are a total of 68 segments with annotation results. Taking the 7th segment as an example, experts can observe the signals within the 50-second time window from 35 to 85 seconds, but only annotate the events present in the 55 to 65-second segment of the EEG recording (1 LPD, 8 LRDA, 4 Other). Although suspected seizure signals are present from 35 to 45 seconds, these are not within the target range for annotation in the 7th sample; instead, they belong to the 3rd segment. Therefore, the annotation results for each 50-second segment apply only to its middle 10 seconds.

Fig. 1. Expert annotation details and model prediction pipeline.

Fig. 1

a Experts’ annotation results on EEG samples. b Workflow of the model for classifying discrete samples and continuous EEG recordings.

Model experiments

During model training and testing, the model’s discrete classification results for EEG segments are compared against expert annotations to calculate loss or evaluate performance. However, in clinical applications, the model must classify continuous data streams. To handle this, we use a sliding window technique (with a 50-second window and a 2-second step) to generate numerous overlapping, discrete classifications. These overlapping results are then aggregated and averaged to produce a final continuous output, as shown in Fig. 1b. In practical deployment scenarios, causal constraint issues exist, which will be analyzed in the discussion section.
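To make this aggregation concrete, the sketch below illustrates the sliding-window averaging under stated assumptions: `predict_fn` is a hypothetical callable standing in for the trained model, and the window length, step, and sampling rate follow the values given above.

```python
import numpy as np

# Illustrative sketch (not the authors' released code): average overlapping
# window predictions into a continuous output. `predict_fn` is a hypothetical
# callable mapping a (n_channels, 50 s x 200 Hz) segment to 6 class probabilities.
FS = 200                 # sampling rate (Hz)
WIN_S, STEP_S = 50, 2    # window length and step (s), as described above

def classify_stream(eeg, predict_fn, n_classes=6):
    """eeg: array of shape (n_channels, n_samples)."""
    win, step = WIN_S * FS, STEP_S * FS
    n_samples = eeg.shape[1]
    prob_sum = np.zeros((n_samples, n_classes))
    counts = np.zeros((n_samples, 1))
    for start in range(0, n_samples - win + 1, step):
        p = predict_fn(eeg[:, start:start + win])     # (n_classes,) probabilities
        # Each prediction describes only the middle 10 s of its 50 s window.
        mid_lo, mid_hi = start + 20 * FS, start + 30 * FS
        prob_sum[mid_lo:mid_hi] += p
        counts[mid_lo:mid_hi] += 1
    return prob_sum / np.maximum(counts, 1)           # per-sample averaged probabilities
```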

In the development cohort, a five-fold cross-validation strategy was employed to reliably assess model performance. Specifically, the entire development dataset comprising 17,089 records was evenly divided into five folds, while striving to keep the proportion of each class of EEG records within each fold as consistent as possible with their distribution in the overall dataset. During the partitioning process, strict measures were taken to ensure that data from the same participant would not be split into different folds, thereby preventing data leakage. A total of five experiments were conducted, each time sequentially selecting one fold as the validation set and the remaining four folds as the training set. The results from the five validation sets across these experiments were averaged to evaluate the model’s performance within the development cohort, and the five trained models were used for testing on external cohorts. As shown in Fig. 2, for each fold, the model underwent a two-stage training process. In the first stage, the model was trained using all samples from four folds. The number of expert annotations per sample was incorporated as a weight in the loss function to balance the contribution of expert knowledge. A cosine annealing schedule was applied to adjust the learning rate: the learning rate was linearly increased from 1 × 10⁻⁴ to 1 × 10⁻³ over the first five epochs, then gradually decayed to 1 × 10⁻⁵ via cosine annealing over the subsequent 10 epochs. The model parameters from the epoch with the lowest validation loss were retained.
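As a minimal illustration of the stage-1 learning-rate schedule (linear warmup followed by cosine annealing), the function below reproduces the per-epoch values described above; the exact interpolation used in the original code is an assumption for the sketch.

```python
import math

# Sketch of the stage-1 learning-rate schedule: linear warmup from 1e-4 to 1e-3
# over 5 epochs, then cosine annealing down to 1e-5 over the following 10 epochs.
def stage1_lr(epoch, warmup=5, total=15, lr_start=1e-4, lr_peak=1e-3, lr_end=1e-5):
    if epoch < warmup:
        return lr_start + (lr_peak - lr_start) * epoch / warmup
    t = (epoch - warmup) / (total - warmup)        # progress through the decay phase, 0 -> 1
    return lr_end + 0.5 * (lr_peak - lr_end) * (1 + math.cos(math.pi * t))

learning_rates = [stage1_lr(e) for e in range(15)]  # one value per epoch
```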

Fig. 2. Training, validation, and testing strategies.

Fig. 2

Five-fold cross-validation was used to evaluate the model’s performance on the development cohort. Each fold involved two-stage training and validation. The optimal model determined by the validation set was used for final assessment on the online test dataset.

In the second stage, the model initialized with parameters from the first stage was fine-tuned on a subset of the available data, specifically samples with 10 or more labels. In this stage, each sample was assigned equal weight in the loss function. The initial learning rate was set to 3 × 10⁻⁴ and then decayed to 1 × 10⁻⁶ using a cosine annealing strategy over five epochs. The model parameters corresponding to the epoch with the lowest validation loss were retained for final evaluation on an external independent test cohort. This two-stage training approach enabled the model to first learn general features and then refine its predictions using a more reliable subset of data. For online testing, predictions from models trained on different folds were averaged to generate the final classification probabilities for each EEG sample. These probabilities were then saved in the required format for submission.

Five-fold cross-validation performance

For each EEG segment, the event type with the highest frequency of annotations among the relevant expert annotations was determined as the expert consensus. Consequently, for each EEG segment, there may exist individual expert annotations that deviate from the established consensus. To facilitate the analysis of these discrepancies, we constructed a confusion matrix (upper part of Fig. 3a, b). In the confusion matrix, each row corresponds to an expert consensus class, and each column corresponds to the class assigned by individual expert votes. This matrix serves as a comprehensive tool for visualizing and quantifying the patterns of disagreement between individual expert annotations and the collective consensus. For Fig. 3a, each confusion matrix is normalized by row, so the values on the diagonal correspond to the recall. For Fig. 3b, each confusion matrix is normalized by column, so the values on the diagonal correspond to the precision.
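The same row and column normalization can be sketched as follows; `consensus` and `predicted` are hypothetical arrays of integer class labels (0–5), and scikit-learn is used here purely for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Sketch of the row/column normalization: one entry per EEG segment,
# comparing individual votes (or model predictions) against the consensus.
def normalized_confusion(consensus, predicted, n_classes=6):
    cm = confusion_matrix(consensus, predicted, labels=list(range(n_classes))).astype(float)
    by_row = cm / cm.sum(axis=1, keepdims=True)   # diagonal = recall per class
    by_col = cm / cm.sum(axis=0, keepdims=True)   # diagonal = precision per class
    return by_row, by_col
```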

Fig. 3. Confusion matrices of VIPEEGNet predictions versus expert annotations.

Fig. 3

Confusion matrices computed with expert majority consensus as the reference. a Each confusion matrix is normalized by row, so the values on the diagonal correspond to the recall. b Each confusion matrix is normalized by column, so the values on the diagonal correspond to the precision. LPD denotes lateralized periodic discharge, GPD generalized periodic discharge, LRDA lateralized rhythmic delta activity, GRDA generalized rhythmic delta activity.

In general, the recall and precision of VIPEEGNet for the six categories were similar to those of human experts (lower part of Fig. 3a, b). For Subset 1 (1794 patients, each sample annotated by seven or fewer experts), the recall and precision of expert voting for seizure, LPD, GPD, LRDA, GRDA, and “other” categories were (87.5, 96.9) %, (78.7, 85.0) %, (80.4, 79.4) %, (77.2, 66.4) %, (90.7, 85.7) %, and (89.3, 81.0) %, respectively. In comparison, VIPEEGNet achieved recall and precision of (63.7, 90.5) %, (74.0, 75.5) %, (73.6, 78.3) %, (32.2, 56.9) %, (64.6, 76.8) %, and (91.2, 65.5) %, respectively. For Subset 2 (1063 patients, each sample annotated by ten or more experts), the recall and precision of expert voting for seizure, LPD, GPD, LRDA, GRDA, and “other” categories were (60.6, 51.8) %, (71.3, 76.6) %, (69.3, 71.2) %, (54.4, 37.0) %, (57.3, 45.8) %, and (78.1, 83.3) %, respectively. VIPEEGNet demonstrated recall and precision of (57.7, 68.2) %, (80.3, 77.5) %, (73.8, 80.4) %, (36.8, 55.6) %, (40.7, 63.0) %, and (88.2, 79.6) %, respectively.

Considering that the number of expert annotations determines the reliability of EEG sample labels, our subsequent analysis primarily focuses on subset 2, which we refer to as the high-quality annotation set. In terms of recall, the model’s performance in distinguishing LRDA and GRDA was significantly lower than that of experts, slightly lower for seizure events, but higher for LPD, GPD, and “other” events compared to experts. Regarding precision, the model’s ability to distinguish LPD and “other” events was lower than that of experts, while it outperformed experts in all other event types.

VIPEEGNet achieved high accuracy, with one-vs-rest AUROCs for the seizure, LPD, GPD, LRDA, GRDA, and “other” categories of 0.972 (95% CI, 0.957–0.988), 0.962 (95% CI, 0.954–0.970), 0.972 (95% CI, 0.960–0.984), 0.938 (95% CI, 0.917–0.959), 0.949 (95% CI, 0.941–0.957), and 0.930 (95% CI, 0.926–0.935), respectively, as shown in Fig. 4a. The AUPRC values for VIPEEGNet were 0.717 (95% CI, 0.607–0.827), 0.869 (95% CI, 0.829–0.910), 0.863 (95% CI, 0.812–0.913), 0.514 (95% CI, 0.417–0.610), 0.605 (95% CI, 0.514–0.696), and 0.940 (95% CI, 0.934–0.947), respectively, as presented in Fig. 4b. The optimal classification thresholds were determined to be 0.400, 0.369, 0.385, 0.299, 0.251, and 0.350. It was observed that the model exhibited a tendency to misclassify EEG segments that were annotated as LRDA and GRDA based on expert consensus into other categories.

Fig. 4. One-vs-rest receiver operating characteristic curves and precision-recall curves.

Fig. 4

a Receiver operating characteristic curves. b Precision-recall curves. Each curve represents the classification performance for one class against the other five, providing a comprehensive view of the model’s ability to distinguish each class from the rest. The pentagram represents the threshold corresponding to the optimal performance of the binary classification. LPD denotes lateralized periodic discharge, GPD generalized periodic discharge, LRDA lateralized rhythmic delta activity, GRDA generalized rhythmic delta activity, TPR denotes true positive rate, FPR false positive rate, AUROC area under the receiver operating characteristic curve, AUPRC area under the precision-recall curve.

Furthermore, when considering the annotation results of multiple experts, traditional binary classification metrics are insufficient to characterize difficult samples; for instance, ten experts may annotate a sample as a seizure while another ten annotate it as GPD. To address this limitation, the expert annotation counts were normalized by the total number of experts annotating the sample to obtain soft labels. The Kullback-Leibler divergence (KLD) was employed to quantify the similarity between the model’s predicted probabilities and the expert annotations. The average KLD across five-fold cross-validation in the development dataset was 0.225 (95% CI, 0.211–0.238). This metric provides a more nuanced evaluation of the model’s performance, taking into account the variability in expert annotations.
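A minimal sketch of this evaluation metric is shown below, assuming `vote_counts` holds the per-class expert vote counts and `pred_probs` the model's predicted probabilities for each sample.

```python
import numpy as np

# Sketch of the sample-level KLD metric: expert vote counts are normalized into
# soft labels P, and KL(P || Q) against the predicted probabilities Q is averaged.
def mean_kld(vote_counts, pred_probs, eps=1e-15):
    p = vote_counts / vote_counts.sum(axis=1, keepdims=True)   # soft labels
    q = np.clip(pred_probs, eps, 1.0)
    p_safe = np.clip(p, eps, 1.0)                              # avoid log(0); zero-vote terms contribute 0
    kld = (p * np.log(p_safe / q)).sum(axis=1)                 # per-sample KL(P || Q)
    return kld.mean()
```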

Performance comparison

We conducted a comparison of existing research methods in the relevant field, including SPaRCNet16, ProtoPMed-EEG17, and EfficientNet19. Comparative evaluation with state-of-the-art methods reveals that VIPEEGNet achieves superior or competitive performance across most event types (Table 2). The average area under the receiver operating characteristic curve (AUROC) for the four methods across six event types is 0.888, 0.900, 0.907, and 0.953, respectively, while the average area under the precision-recall curve (AUPRC) is 0.608, 0.655, 0.655, and 0.752. VIPEEGNet achieves the highest average performance on both metrics. It attains the highest AUROC for all event types, along with the leading AUPRC for LPD and “other” events.

Table 2.

Performance comparison between the VIPEEGNet and existing methods

| Metric | Work | Seizure | LPD | GPD | LRDA | GRDA | Other |
|---|---|---|---|---|---|---|---|
| AUROC | SPaRCNet, 2023 | 0.86 [0.85, 0.88] | 0.90 [0.90, 0.90] | 0.94 [0.94, 0.95] | 0.92 [0.92, 0.92] | 0.92 [0.91, 0.92] | 0.79 [0.79, 0.80] |
| | ProtoPMed-EEG, 2024 | 0.87 [0.86, 0.89] | 0.93 [0.92, 0.93] | 0.96 [0.95, 0.96] | 0.92 [0.92, 0.93] | 0.93 [0.93, 0.94] | 0.80 [0.79, 0.80] |
| | EfficientNet, 2025 | 0.94 [0.93, 0.94] | 0.92 [0.91, 0.92] | 0.94 [0.93, 0.94] | 0.87 [0.86, 0.88] | 0.89 [0.88, 0.90] | 0.88 [0.87, 0.89] |
| | VIPEEGNet (ours) | 0.97 [0.96, 0.99] | 0.96 [0.95, 0.97] | 0.97 [0.96, 0.98] | 0.94 [0.92, 0.96] | 0.95 [0.94, 0.96] | 0.93 [0.93, 0.94] |
| AUPRC | SPaRCNet, 2023 | 0.19 [0.16, 0.23] | 0.73 [0.72, 0.74] | 0.89 [0.88, 0.89] | 0.74 [0.72, 0.75] | 0.63 [0.61, 0.64] | 0.47 [0.46, 0.48] |
| | ProtoPMed-EEG, 2024 | 0.25 [0.21, 0.28] | 0.81 [0.80, 0.82] | 0.92 [0.91, 0.92] | 0.76 [0.75, 0.77] | 0.67 [0.65, 0.68] | 0.52 [0.51, 0.53] |
| | EfficientNet, 2025 | 0.84 [0.82, 0.85] | 0.74 [0.72, 0.76] | 0.73 [0.70, 0.75] | 0.30 [0.27, 0.31] | 0.53 [0.50, 0.55] | 0.79 [0.78, 0.80] |
| | VIPEEGNet (ours) | 0.72 [0.61, 0.83] | 0.87 [0.83, 0.91] | 0.86 [0.81, 0.91] | 0.51 [0.42, 0.61] | 0.61 [0.51, 0.70] | 0.94 [0.93, 0.95] |

LPD denotes lateralized periodic discharge, GPD generalized periodic discharge, LRDA lateralized rhythmic delta activity, GRDA generalized rhythmic delta activity, AUROC area under the receiver operating characteristic curve, AUPRC area under the precision-recall curve. 95% confidence intervals are shown in square brackets. Bold numbers indicate optimal performance.

Online testing results

After demonstrating promising performance on the development datasets, the VIPEEGNet model was tested on an additional online cohort (a subset of EEG segments from an additional 1532 patients). The online testing dataset was hosted on the Kaggle platform and made available in a competition format to 3507 participants from 113 countries, forming 2767 teams18. The public and private leaderboards used 35% and 65% of the testing data, respectively. For each test sample, the prediction results (i.e., the probabilities of six events) generated by the five models trained on the development cohort were averaged. The KLD between this averaged probability and the expert-annotated probability was computed, and the average KLD across all test samples was used as the final score. VIPEEGNet achieved a KLD score of 0.223296 on the public leaderboard and 0.272543 on the private leaderboard (Table 3). Its overall performance ranked second among 2767 algorithms, trailing the top algorithm by only 0.0002 in KLD score. Notably, compared with the first-ranked algorithm, which combined raw EEG and spectrograms using 30 model backbones, VIPEEGNet used only a single EEG-based model, reducing the parameter count to 0.7% of the former and offering higher transparency and interpretability, which is beneficial for practical deployment.

Table 3.

The model’s performance on the online test cohort compared with the top 5 gold medal algorithms out of 2767

| Team (rank) | Public KLD | Private KLD | View | Backbones | Parameters (M) |
|---|---|---|---|---|---|
| Team Sony (1) | 0.2161 | 0.2723 | EEG | ConvNeXt Atto × 4 + InceptionNeXt Tiny | 1762 |
| | | | Spectrogram | MaxViT_base × 11 + SwinV2-Tiny × 14 | |
| VIPEEGNet (ours) | 0.2233 | 0.2725 | EEG | EfficientNetV2-B3 | 13 |
| Holiday (2) | 0.2243 | 0.2738 | EEG | EfficientNet-B5 × 2 + PPHGNetV2_B5 | 283 |
| | | | Spectrogram | X3D-L + EfficientNet-B5 | |
| | | | Multi-view | X3D-L & EfficientNet-B5 | |
| Nvidia/DD (3) | 0.2255 | 0.2741 | EEG | (1D CNN & SqueezeFormer) × 5 | 274 |
| | | | Spectrogram | (mixnet_l or mixnet_xl) × 18 | |
| | | | EEG | 1D CNN + MobileNetV2 | |
| Aillis & GO & bilzard (4) | 0.2186 | 0.2760 | Spectrogram | SwinV2-Large × 4 | 929 |
| | | | Multi-view | ViT-Tiny + CAFormer + ConvNeXt | |
| KTMUD (5) | 0.2259 | 0.2762 | Multi-view | WaveNet + 2D CNN | ~ |
| Silver medal (16–138) | | | | | |
| Mikhail Kotyushev (16) | 0.2377 | 0.2946 | Multi-view | ViT-Tiny | 21 |
| Bronze medal (139–276) | | | | | |
| Yuki Take93 (139) | 0.2849 | 0.3534 | ~ | ~ | ~ |

KLD Kullback-Leibler divergence.

For the details of algorithms, refer to https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/leaderboard.

Ablation experiments

The ablation experiments provided valuable insights into the contribution of each component of VIPEEGNet. The results, as shown in Fig. 5a, indicated that the removal of the EEG-to-image component had the most significant impact on model performance (0.275 vs. 0.216, p < 0.0001), and the removal of the pre-training component also led to a significant decline in performance (0.258 vs. 0.216, p < 0.0001). These findings demonstrate that converting EEG into a quasi-image format and leveraging pre-trained model weights based on images effectively enhances the model’s classification ability. Additionally, the intermediate local selection component helped the model allocate more attention to the middle 10 seconds of 50-second-long samples, consistent with the behavior of human experts during annotation. The removal of this component resulted in performance loss, with a non-significant difference (0.226 vs. 0.216, p = 0.413). Statistical analysis was performed using the Wilcoxon rank-sum test. The statistical sample comprised patient-level KLD results within the development cohort, involving a total of 1063 patients. On the online testing dataset, VIPEEGNet exhibited similar performance. Specifically, the KLD values for VIPEEGNet, VIPEEGNet without central selection, VIPEEGNet without pre-training, and VIPEEGNet without EEG-to-image were 0.255, 0.271, 0.301, and 0.326, respectively. These results underscore the importance of each component in the overall performance of VIPEEGNet.

Fig. 5. Model interpretability analysis.

Fig. 5

a Ablation studies comparing model performance by removing different components (central selection, pretraining, and EEG-to-image). b t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization of model outputs with dimensional reduction, where colors represent expert consensus. KLD denotes Kullback-Leibler divergence, LPD lateralized periodic discharge, GPD generalized periodic discharge, LRDA lateralized rhythmic delta activity, GRDA generalized rhythmic delta activity.

t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization

t-SNE visualization was employed to reduce the dimensionality of the model’s output and visualize the model’s classification of different EEG sample categories20. t-SNE is a nonlinear dimensionality reduction technique specifically designed for visualizing high-dimensional data by preserving local similarities and revealing underlying cluster structures. It operates by converting high-dimensional Euclidean distances between data points into conditional probabilities representing similarities, using a Gaussian distribution in the original space. The t-SNE results resembled a starfish shape, as shown in Fig. 5b. Using the line connecting the seizure area to the “other” area as a dividing line, GPD and GRDA were located above the line, while LPD and LRDA were below. This indicates that the model has a high ability to distinguish between lateralized and generalized events. Moreover, GPD and LPD were closer to the seizure side, while GRDA and LRDA were closer to the “other” side, suggesting that VIPEEGNet tends to associate GPD and LPD more with seizures. This visualization provides an intuitive understanding of how the model differentiates between various EEG patterns and supports the findings from the classification performance metrics.
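A hedged sketch of this visualization step is given below; the perplexity and other t-SNE settings are assumptions, since the exact parameters are not reported in the text.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Sketch of the t-SNE visualization: `features` are model outputs or embeddings
# (n_samples, n_dims), `consensus` the expert-consensus class index per sample.
def plot_tsne(features, consensus, class_names):
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    for c, name in enumerate(class_names):
        mask = np.asarray(consensus) == c
        plt.scatter(emb[mask, 0], emb[mask, 1], s=4, label=name)
    plt.legend()
    plt.show()
```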

Discussion

This study aims to address the automatic classification of harmful brain activities, with the key challenge lying in leveraging pre-trained weights from image-trained algorithms to enhance classification performance. An automated classifier, VIPEEGNet, which enables the direct adaptation of ImageNet pre-trained vision models to one-dimensional EEG signal analysis, was developed. The research validated that employing the EEG-to-image conversion component and pre-trained models improves performance by 21.5% and 16.3%, respectively. In practical applications, the parameter usage of VIPEEGNet is significantly lower than that of existing SOTA algorithms, demonstrating its feasibility for real-world deployment.

Previous researchers have found that different experts exhibit variability when annotating different EEG rhythms and periodic patterns, with reliability only at a moderate level (overall pairwise agreement percentage of 52%)5. The variability in expert annotations can be attributed to several factors. First, EEG signals are complex and may be affected by various factors, such as individual differences in brain anatomy and physiology, changes in recording conditions, and artifacts generated by eye movements, muscle activity, etc., which increase the difficulty of accurate annotation21. Second, the annotation of EEG rhythms and periodic patterns requires extensive expertise and experience. Different experts may have different understandings and judgments of the same EEG signal, leading to inconsistent annotations. Therefore, it is crucial to examine the generalization performance of different architectures. Ideally, the performance of deep learning algorithms should be consistent across datasets22.

t-SNE analysis indicates that the model’s extracted features for LPD and GPD samples are closer to seizure, while LRDA and GRDA are closer to the “other” side. This suggests that VIPEEGNet tends to associate GPD and LPD more with epileptic seizures, consistent with previous research findings23. Furthermore, a study based on 4772 participants revealed that LPDs, regardless of frequency, are most strongly associated with epileptic seizures24. The associations of GPDs and LRDA with epileptic seizures are frequency-dependent: for LPDs and GPDs, the higher the pattern frequency, the greater the risk of epileptic seizures. Typically, diffuse periodic EEG patterns often indicate acute encephalopathy, while non-reactive bilateral rhythmic patterns may suggest status epilepticus25.

VIPEEGNet mainly comprises three components. The results of the ablation experiments indicate that the EEG-to-image component and the pre-training component have the most significant impact on model performance, which is the primary motivation of this work. Previous studies have shown that pre-trained neural networks can effectively extract deep EEG features. Researchers have demonstrated the effectiveness and feasibility of pre-trained models in EEG classification tasks26–28. Model performance generally improves as the scale of pre-training data increases. However, compared to natural language and image data, acquiring EEG data presents significant challenges29. Additionally, EEG data annotation typically requires substantial effort from domain experts, leading to a lack of sufficient EEG datasets for training baseline models for subsequent fine-tuning on downstream tasks. Our study innovatively proposes a data-driven method to convert EEG into image format, enabling the direct application of models pre-trained on large-scale image datasets like ImageNet. This effectively leverages the knowledge embedded in human expert image category annotations. This process involves converting EEG signals into spatiotemporal representations that resemble images. This allows us to use advanced neural networks designed for image data and offers a new perspective for EEG feature extraction. The resulting “EEG images” can capture the temporal dynamics and spatial distribution of EEG signals, providing richer information for model training.

In the data used in this study, for both the development cohort and the 50-second-long samples from the online external test cohort, experts annotated the middle ten-second segment. In clinical applications, the model must classify continuous data streams. A typical approach to this problem is to use a sliding window technique to generate numerous overlapping discrete classifications. During this process, the model does not assume that the event to be classified is always at the center of the sample; instead, the model is only responsible for classifying the event type at the center of the sample. These overlapping results are then aggregated and averaged to produce a final continuous output. In practical deployment scenarios, the use of VIPEEGNet is unrestricted for retrospective analysis or when the detection latency requirement is not stringent (e.g., allowing for a 20–30 second event detection delay). However, in online, real-time detection scenarios, causal constraints must be addressed. For example, when classifying data at the 400th second, only data up to and including the 400th second may be used. In such cases, it is necessary to artificially reverse the time scale of the data from 375 to 400 seconds to simulate the data from 400 to 425 seconds without introducing excessive extraneous events. The augmented EEG data is then input into the model (Fig. 1b). This limitation of VIPEEGNet primarily stems from the non-causal nature of expert EEG annotation used in this study, where experts view a 50-second data segment and annotate the middle 10 seconds.
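The time-reversal workaround can be sketched as follows, assuming a 200 Hz sampling rate; `causal_window` is a hypothetical helper that builds the 50-second pseudo-window from past data only.

```python
import numpy as np

# Sketch of the causal workaround: the last 25 s of available data are
# time-reversed and appended to stand in for the unavailable 25 s of "future",
# yielding the 50 s window the model expects.
FS = 200

def causal_window(eeg, t_end_s, half_win_s=25):
    """eeg: (n_channels, n_samples); t_end_s: current time in seconds (e.g., 400)."""
    end = int(t_end_s * FS)
    half = half_win_s * FS
    past = eeg[:, end - half:end]                 # e.g., 375-400 s
    future_proxy = past[:, ::-1]                  # mirrored past, mimicking 400-425 s
    return np.concatenate([past, future_proxy], axis=1)   # 50 s pseudo-window
```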

When tested on the online testing cohort, numerous top-performing algorithms utilized a combination of raw EEG, spectrograms, and waveform plots to train multi-view models. While this approach can enhance generalization to some extent, it also increases model complexity and opacity, posing deployment challenges30. Moreover, according to the Data Processing Inequality, practical data processing steps such as time-frequency transformation and waveform plotting inevitably result in information loss. Nevertheless, researchers still choose to convert raw EEG into spectrograms and waveform plots to match the input requirements of pre-trained models13. This is because transfer learning models must adhere to the dimensions of pre-trained models and cannot easily modify the neural architecture to address classification problems in other datasets31. To address the challenges of applying image-trained models to EEG classification tasks, VIPEEGNet employs a data-driven approach to automatically learn image representations of raw EEG signals.

VIPEEGNet combines pre-trained models from the field of computer vision, such as EfficientNetV2-B3, with multi-channel one-dimensional EEG signal analysis. By designing a universal EEG-to-image conversion module, 1D EEG signals are converted into image representations, and the feature extraction ability of the ImageNet pre-trained model is directly utilized to improve the robustness of the model. In external validation, compared with the champion algorithm, the model parameter count was reduced by more than 99%, meeting the real-time monitoring needs of clinical practice. The EEG-to-image component can be adapted to other pre-trained vision models, providing a foundation for future extensions. Visualizing the distribution of model features in a manner consistent with medical understanding enhances clinical credibility. At the same time, the algorithm has certain limitations. In reality, patients’ EEG may be more complex, indicating the need for more data and the accumulation of more EEG patterns32. The transferability of the model proposed in this paper to datasets beyond MGH/HMS still requires validation. Whether VIPEEGNet can serve as a general algorithmic foundation for classifying EEG patterns in other databases corresponding to conditions such as sleep disorders, metabolic and toxic encephalopathies, neurodegenerative diseases, and dementia remains to be verified. This necessitates the development of larger-scale, multi-expert annotated databases. In addition, EEG sequences can not only be converted into image representations but also share structural similarities with natural language sequences, both exhibiting temporal dependencies33,34. Future improvements could involve converting EEG into formats suitable for large language models and fine-tuning these models. For disease diagnosis and clinical recommendations, the model could combine existing medical knowledge and EEG feature databases to provide preliminary judgments.

Methods

Preprocessing

EEG signals were bandpass filtered using a third-order Butterworth filter between 0.5 Hz and 45 Hz to remove noise and artifacts outside the frequency range of interest. The EEG signals were then clipped to the range of −1024 µV to 1024 µV to handle outliers and scaled to 0–255 for standardization to suit the input range of image-based pre-trained models. During training, various data augmentation techniques were applied to the EEG signals to enhance the model’s robustness and generalization ability. These techniques included random masking of portions of the EEG signal, channel permutation, signal inversion, time reversal, and channel swapping. Technical implementation details are available in our open-sourced code.
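A minimal sketch of this preprocessing chain is shown below; zero-phase filtering via `filtfilt` is an assumption, and the released code may differ in implementation details.

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Sketch of the preprocessing chain: 3rd-order Butterworth band-pass (0.5-45 Hz),
# clipping to +/-1024 uV, and scaling to 0-255.
FS = 200

def preprocess(eeg):
    """eeg: (n_channels, n_samples) in microvolts."""
    b, a = butter(3, [0.5, 45], btype="bandpass", fs=FS)
    x = filtfilt(b, a, eeg, axis=-1)              # band-limit the signal
    x = np.clip(x, -1024, 1024)                   # suppress outliers
    return (x + 1024) * (255.0 / 2048.0)          # map [-1024, 1024] uV to [0, 255]
```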

Model architecture

The architecture of VIPEEGNet is illustrated in Fig. 6. The input to the model is a preprocessed 50-second EEG sample, comprising 16 bipolar channels with a sampling rate of 200 Hz. Thus, the input shape of the model is (16, 10,000). The output is the classification result for the event occurring during the middle ten seconds of the sample, which corresponds to one of six predefined brain activities.

Fig. 6. The schematic diagram of the VIPEEGNet model.

Fig. 6

The model takes 16-channel scalp EEG signals with a double banana montage as input. The EEG-to-image embedding component converts the EEG signals into image representations. These image representations are then input into a pre-trained computer vision classification backbone for feature extraction. Average pooling is applied to the central regions of the temporal dimension of the feature maps. Finally, the feature representation is input into a fully connected (FC) layer to obtain the probabilities of different EEG patterns. LPD denotes lateralized periodic discharge.

The EEG-to-image component primarily consists of three convolutional layers with identical structures but non-shared weights. These layers, called “R”, “G”, and “B”, process the input signal independently. Although the three convolutional layers are structurally identical, they are initialized independently and learn distinct feature representations from the input EEG signal throughout the training process. The nomenclature “R”, “G”, and “B” is adopted by analogy to the red, green, and blue channels of a natural image. However, in the context of VIPEEGNet, these layers do not process color information. Instead, they function as three parallel feature extractors. The key difference between them lies in their non-shared weights, which allow each layer to specialize in capturing different temporal characteristics or aspects of the EEG signal autonomously. By concatenating their output feature maps along a new dimension, we effectively create a 3-channel “image” that provides a richer and more diverse set of input features for the subsequent pre-trained vision model, compared to using a single convolutional layer. This design encourages the model to learn a more robust representation, akin to how different color channels provide complementary information in visual tasks. Taking the R convolutional layer as an example, it comprises 10 convolutional kernels, each with a length of 10 and a stride of 10. During initialization, the i-th kernel is set to have an initial weight of 1 at the i-th position and 0 elsewhere. Throughout training, constraints are applied to ensure that the weights of each kernel remain positive and sum to 1, with no use of bias or activation functions. This approach minimizes information loss and ensures that the output values remain within the range of 0 to 255. In practice, each kernel slides along the temporal dimension of each EEG channel with a stride of 10 to extract shallow features. Consequently, the temporal dimension of the feature map is reduced to one-tenth of the input, and the output shape for each EEG channel after this convolutional layer is (1000, 10), where 10 corresponds to the number of kernels. The overall feature map extracted by this convolutional layer has dimensions (16, 1000, 10), with 16 representing the number of EEG channels.

Subsequently, the output for each EEG channel is reshaped to (10, 1000) and rearranged sequentially along the spatial dimension. Specifically, the transformed feature map of the first EEG channel, with dimensions (10, 1000), is concatenated with that of the second channel along the spatial axis, updating the shape to (20, 1000). This process continues until all channels are incorporated, resulting in a final shape of (160, 1000). Thus, the EEG-to-image component reduces the temporal resolution of the original EEG signal by a factor of 10 while expanding the spatial dimension by a factor of 10. This can be regarded as a learnable down-sampling process that preserves the overall information content. Finally, the feature maps extracted by the three independent convolutional layers are concatenated to produce the ultimate output of the EEG-to-image component, with a shape of (160, 1000, 3), representing the image representation of the EEG signal. Throughout the training process, the parameters of the three convolutional layers were not forcibly differentiated. Instead, after specifying their final merging order, they were automatically learned in a data-driven manner. During this conversion, temporally adjacent sampling points are transformed into spatially adjacent points in the resulting image. The horizontal axis of the generated image encapsulates temporal information, while the vertical axis incorporates spatiotemporal information, thereby homogenizing the attributes of both axes.
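The following PyTorch sketch illustrates one possible implementation of the EEG-to-image component described above; the original code may use a different framework, and the exact mechanism for enforcing the positivity and sum-to-one constraints is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of the EEG-to-image component. Three structurally identical
# branches ("R", "G", "B") each hold 10 kernels of length 10 applied with
# stride 10; weights are kept positive and normalized to sum to 1, so the
# outputs stay within the 0-255 range of the scaled input.
class EEGToImage(nn.Module):
    def __init__(self, n_kernels=10, kernel_len=10):
        super().__init__()
        init = torch.eye(n_kernels, kernel_len)              # kernel i starts as a one-hot at position i
        self.branches = nn.ParameterList([nn.Parameter(init.clone()) for _ in range(3)])
        self.stride = kernel_len

    def forward(self, x):
        # x: (batch, 16, 10000) preprocessed EEG already scaled to 0-255
        b, c, t = x.shape
        x = x.reshape(b * c, 1, t)                            # treat each EEG channel independently
        maps = []
        for w in self.branches:
            w = w.clamp(min=1e-6)
            w = w / w.sum(dim=1, keepdim=True)                # positive weights summing to 1
            y = F.conv1d(x, w.unsqueeze(1), stride=self.stride)       # (b*c, 10, 1000)
            y = y.reshape(b, c, 10, -1).reshape(b, c * 10, -1)        # stack channels spatially: (b, 160, 1000)
            maps.append(y)
        return torch.stack(maps, dim=-1)                      # (b, 160, 1000, 3) channels-last quasi-image
```

Note that this sketch produces a channels-last tensor; a channels-first permutation (e.g., `out.permute(0, 3, 1, 2)`) would be applied before feeding a PyTorch vision backbone.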

The classic model EfficientNetV2-B3 was employed as the feature extraction backbone, loaded with pre-trained weights from ImageNet to leverage transfer learning35. The output from the EEG-to-image component was fed into this module, producing a feature map with the shape (5, 32, 1536). Subsequently, we retained the intermediate region of the feature map along the temporal scale to align with expert annotation practices (where a 50-second sample is observed and the event types in the middle 10 seconds are annotated). A global average pooling operation was then applied to obtain the feature representation (with a size of 1536) of the EEG sample, followed by a dropout layer with a probability of 0.5 to prevent overfitting. Finally, a dense layer with a SoftMax activation generated classification probabilities for different brain activities.
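A companion sketch of the backbone and classification head is given below, using `timm` to obtain an ImageNet-pretrained EfficientNetV2-B3; the framework, the exact extent of the retained central region, and the feature-map layout are assumptions.

```python
import torch
import torch.nn as nn
import timm

# Hedged sketch of the backbone and head: timm's "tf_efficientnetv2_b3" stands
# in for the EfficientNetV2-B3 used in the paper.
class VIPEEGNetHead(nn.Module):
    def __init__(self, n_classes=6, center_frac=0.2):
        super().__init__()
        self.backbone = timm.create_model(
            "tf_efficientnetv2_b3", pretrained=True, num_classes=0, global_pool=""
        )                                          # returns unpooled feature maps (B, C, H, W)
        self.center_frac = center_frac             # middle 10 s of the 50 s window
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(self.backbone.num_features, n_classes)

    def forward(self, img):
        # img: (B, 3, 160, 1000), channels-first permutation of the EEG-to-image output
        fmap = self.backbone(img)                  # approx. (B, 1536, 5, 32) for this input size
        w = fmap.shape[-1]
        lo = int(w * (0.5 - self.center_frac / 2))
        hi = max(lo + 1, int(w * (0.5 + self.center_frac / 2)))
        feat = fmap[..., lo:hi].mean(dim=(2, 3))   # average-pool the central temporal region -> (B, 1536)
        return self.fc(self.dropout(feat))         # logits for the six brain-activity classes
```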

In terms of model parameters, the EEG-to-image component contains only 300 trainable parameters, with a storage size of 1.17 KB. The EfficientNetV2-B3 component has 12.82 million trainable parameters and 0.11 million non-trainable parameters. The final classification layer contains 9222 trainable parameters. The entire model has a total parameter count of 12.94 million and a storage size of 49.36 MB. This high parameter efficiency, particularly in the EEG-to-image component, is crucial for developing lightweight and potentially real-time systems that could be deployed in clinical settings with limited computational resources.

Evaluating performance

During training, the model’s performance was evaluated using the Kullback-Leibler divergence (KLD) loss function, as defined in Eq. (1).

$$\mathrm{Loss} = D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i=1}^{6} P_i \log \frac{P_i}{Q_i} \quad (1)$$

In a 6-class classification task, we define Pi as the true probability distribution over classes, typically represented as a smoothed label (e.g., [0.01, 0.00, 0.96, 0.01, 0.01, 0.01]). The model’s predicted probability for class i, denoted Qi, is generated by applying Softmax normalization to the model’s raw outputs (logits).
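A minimal PyTorch sketch of this loss is shown below; the optional per-sample weighting by annotator count (used in stage 1 of training) is included as an assumption about how that weighting could be applied.

```python
import torch
import torch.nn.functional as F

# Sketch of the KLD loss in Eq. (1). `logits` are raw model outputs and
# `soft_labels` the normalized expert vote distributions P.
def kld_loss(logits, soft_labels, sample_weights=None, eps=1e-12):
    log_q = F.log_softmax(logits, dim=1)                       # log Q_i
    log_p = torch.log(soft_labels.clamp_min(eps))              # log P_i (zero-probability terms vanish below)
    kld = (soft_labels * (log_p - log_q)).sum(dim=1)           # KL(P || Q) per sample
    if sample_weights is not None:
        return (kld * sample_weights).sum() / sample_weights.sum()
    return kld.mean()
```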

In addition, we used Recall (True Positive Rate, TPR), Precision, and False Positive Rate (FPR) to evaluate the consistency between the model’s classification results and the consensus reached from expert annotations. Using TP, FP, TN, and FN to represent True Positives, False Positives, True Negatives, and False Negatives respectively, the calculation methods for these metrics are as shown in Eq. (2).

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} \quad (2)$$

Furthermore, we computed the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC) to comprehensively evaluate the model’s classification performance on imbalanced samples. The ROC curve plots the TPR against the FPR at various classification thresholds. The PR curve, on the other hand, displays the relationship between Precision and Recall at different thresholds. These metrics provide a robust evaluation of the model’s overall classification capability, especially when dealing with class imbalance.
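For illustration, the one-vs-rest computation of these two summary metrics can be sketched with scikit-learn as follows; `average_precision_score` is used here as a common surrogate for the AUPRC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# One-vs-rest sketch: for each class, the consensus label is binarized and
# scored against the model's predicted probability for that class.
def one_vs_rest_metrics(consensus, pred_probs, n_classes=6):
    aurocs, auprcs = [], []
    for c in range(n_classes):
        y = (np.asarray(consensus) == c).astype(int)
        aurocs.append(roc_auc_score(y, pred_probs[:, c]))
        auprcs.append(average_precision_score(y, pred_probs[:, c]))   # AUPRC surrogate
    return aurocs, auprcs
```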

Acknowledgements

This study was funded by Scientific Research Innovation Capability Support Project for Young Faculty (ZYGXQNJSKYCXNLZCXM-H15), National Science Fund for Excellent Overseas Scholars (0401260011), National Natural Science Foundation of China (82472098, 32300704), Tianjin Natural Science Foundation-Outstanding Youth Project (24JCJQJC00250), Major Science and Technology Special Projects and Engineering-Major Project of National Key Laboratories (24ZXZSSS00510), National Key Technologies Research and Development Program (2021YFF1200602), Non-profit Central Research Institute Fund of Chinese Academy of Medical Sciences (2024-JKCS-16).

Author contributions

Y. S., X. S., R. H., and X. L. were responsible for the study conception. Y. S. contributed to running experiments and drafting the manuscript. X. H., P. S., X. T., W. Y., M. P., K. Z., and X. S. were responsible for the critical review of the manuscript for important intellectual content. Y. S. and W. W. were responsible for the statistical analysis. D. M. and X. L. obtained funding and were responsible for administrative support and supervision. All authors read and approved the final manuscript.

Data availability

The training/validation datasets for this study are available on Kaggle and can be accessed via https://kaggle.com/competitions/hms-harmful-brain-activity-classification.

Code availability

The code for this study is available on Kaggle and can be accessed via this link https://www.kaggle.com/code/sunyuri/hms-vipeegnet.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Dong Ming, Email: richardming@tju.edu.cn.

Xiuyun Liu, Email: xiuyun_liu@tju.edu.cn.

References

1. Steinmetz, J. D. et al. Global, regional, and national burden of disorders affecting the nervous system, 1990-2021: a systematic analysis for the Global Burden of Disease Study 2021. Lancet Neurol. 23, 344–381 (2024).
2. Tveit, J. et al. Automated interpretation of clinical electroencephalograms using artificial intelligence. JAMA Neurol. 80, 805–812 (2023).
3. Bitar, R., Khan, U. M. & Rosenthal, E. S. Utility and rationale for continuous EEG monitoring: a primer for the general intensivist. Crit. Care 28, doi:10.1186/s13054-024-04986-0 (2024).
4. Ge, W. D. et al. Deep active learning for Interictal Ictal Injury Continuum EEG patterns. J. Neurosci. Methods 351, doi:10.1016/j.jneumeth.2020.108966 (2021).
5. Jing, J. et al. Interrater reliability of expert electroencephalographers identifying seizures and rhythmic and periodic patterns in EEGs. Neurology 100, E1737–E1749 (2023).
6. Saab, K. et al. Towards trustworthy seizure onset detection using workflow notes. NPJ Digit. Med. 7, doi:10.1038/s41746-024-01008-9 (2024).
7. Hogan, R. et al. Scaling convolutional neural networks achieves expert-level seizure detection in neonatal EEG. NPJ Digit. Med. 8, doi:10.1038/s41746-024-01416-x (2025).
8. Sun, Y. L. et al. Continuous seizure detection based on transformer and long-term iEEG. IEEE J. Biomed. Health Inform. 26, 5418–5427 (2022).
9. Sun, Y. L. et al. Multi-task transformer network for subject-independent iEEG seizure detection. Expert Syst. Appl. 268, doi:10.1016/j.eswa.2024.126282 (2025).
10. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2017).
11. Russakovsky, O. et al. ImageNet large-scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
12. Khalkhali, V. et al. Low-latency real-time seizure detection using transfer learning. In 2021 IEEE Signal Processing in Medicine and Biology Symposium (SPMB) 1–7 (IEEE, 2021).
13. Nogay, H. S. & Adeli, H. Detection of epileptic seizure using pretrained deep convolutional neural network and transfer learning. Eur. Neurol. 83, 602–614 (2021).
14. Tripathi, P. M., Kumar, A., Kumar, M. & Komaragiri, R. S. Automatic seizure detection and classification using super-resolution superlet transform and deep neural network - a preprocessing-less method. Comput. Methods Prog. Biomed. 240, doi:10.1016/j.cmpb.2023.107680 (2023).
15. Hirsch, L. J. et al. American Clinical Neurophysiology Society’s standardized critical care EEG terminology: 2021 version. J. Clin. Neurophysiol. 38, 1–29 (2021).
16. Jing, J. et al. Development of expert-level classification of seizures and rhythmic and periodic patterns during EEG interpretation. Neurology 100, E1750–E1762 (2023).
17. Barnett, A. J. et al. Improving clinician performance in classifying EEG patterns on the ictal-interictal injury continuum using interpretable machine learning. NEJM AI 1, doi:10.1056/aioa2300331 (2024).
18. Jing, J. et al. HMS - Harmful Brain Activity Classification. https://kaggle.com/competitions/hms-harmful-brain-activity-classification.
19. Degano, G. et al. ICU-EEG pattern detection by a convolutional neural network. Ann. Clin. Transl. Neurol. doi:10.1002/acn3.70164 (2025).
20. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
21. Dan, J. et al. SzCORE: Seizure Community Open-Source Research Evaluation framework for the validation of electroencephalography-based automated seizure detection algorithms. Epilepsia, doi:10.1111/epi.18113 (2024).
22. Handa, P., Lavanya, Goel, N. & Garg, N. Software advancements in automatic epilepsy diagnosis and seizure detection: 10-year review. Artif. Intell. Rev. 57, doi:10.1007/s10462-024-10799-y (2024).
23. Thou, E., Thompson, N. R., Hantus, S. & Punia, V. Investigation of lateralized periodic discharge features associated with epileptogenesis. Clin. Neurophysiol. 172, 17–21 (2025).
24. Ruiz, A. R. et al. Association of periodic and rhythmic electroencephalographic patterns with seizures in critically ill patients. JAMA Neurol. 74, 181–188 (2017).
25. Gélisse, P., Tatum, W. O., Crespel, A. & Kaplan, P. W. Rhythmic EEG patterns: the oldest idea in the EEG world, but without an obvious definition. Clin. Neurophysiol. 171, 76–81 (2025).
26. Liu, S. Q. et al. EEG emotion recognition based on the attention mechanism and pre-trained convolution capsule network. Knowl.-Based Syst. 265, doi:10.1016/j.knosys.2023.110372 (2023).
27. Hammour, G. et al. From scalp to ear-EEG: a generalizable transfer learning model for automatic sleep scoring in older people. IEEE J. Transl. Eng. Health Med. 12, 448–456 (2024).
28. Klein, T., Minakowski, P. & Sager, S. Flexible patched brain transformer model for EEG decoding. Sci. Rep. 15, doi:10.1038/s41598-025-86294-3 (2025).
29. Jiang, W. B., Zhao, L. M. & Lu, B. L. Large brain model for learning generic representations with tremendous EEG data in BCI. In International Conference on Learning Representations (2024).
30. Dong, C. X., Sun, D. D., Yu, Z. D. & Luo, B. Multi-view brain network classification based on Adaptive Graph Isomorphic Information Bottleneck Mamba. Expert Syst. Appl. 267, doi:10.1016/j.eswa.2024.126170 (2025).
31. Yuen, B., Dong, X. D. & Lu, T. A 3D ray-traced biological neural network learning model. Nat. Commun. 15, doi:10.1038/s41467-024-48747-7 (2024).
32. Ruijter, B. J. et al. Treating rhythmic and periodic EEG patterns in comatose survivors of cardiac arrest. N. Engl. J. Med. 386, 724–734 (2022).
33. Zhang, Y. H. et al. Integrating large language model, EEG, and eye-tracking for word-level neural state classification in reading comprehension. IEEE Trans. Neural Syst. Rehabil. Eng. 32, 3465–3475 (2024).
34. Kerr, W. T. et al. Supervised machine learning compared to large language models for identifying functional seizures from medical records. Epilepsia 66, 1155–1164 (2025).
35. Tan, M. & Le, Q. EfficientNetV2: smaller models and faster training. In Proceedings of the 38th International Conference on Machine Learning 10096–10106 (PMLR, 2021).
