Author manuscript; available in PMC: 2026 Feb 24.
Published in final edited form as: Epilepsia. 2025 Oct 7;67(1):437–447. doi: 10.1111/epi.18652

Seizure Detection Using Ultra-Long-Term Subcutaneous EEG: A Deep Learning CNN-BiLSTM Approach

Sihyeong Park 1,*, Jordan S Clark 1,*, Pedro F Viana 3,4,5, Jie Cui 1, Jonas Duun-Henriksen 6, Jay Mandrekar 7, Nicholas Gregg 1, Vaclav Kremen 1,2, Gregory A Worrell 1,2, Mark P Richardson 3,4, Benjamin H Brinkmann 1,2,**
PMCID: PMC12927827  NIHMSID: NIHMS2136115  PMID: 41056137

Abstract

Objective:

This study reports the development and validation of a deep learning seizure detection algorithm for two-channel subscalp EEG recordings. Ultra-long-term monitoring of people with epilepsy may produce new insights into the timing and pattern of their seizures and may pave the way for novel therapeutic options. While EEG is the accepted standard for epilepsy monitoring, ultra-long-term EEG recordings generate a massive quantity of data and are not reasonably reviewable in full by human readers.

Methods:

The CNN-BiLSTM hybrid algorithm uses 9 layers operating on channel spectrograms five minutes in length with 50% overlap. Retrospective subscalp EEG data from 16 patients recorded at three centers were available for algorithm development and testing. EEG was recorded for a median of 63 days (range 9-508), and a median of 17 seizures (range 0-96) was recorded per patient. Training data were augmented with scalp EEG seizures, and performance was benchmarked against a conventional spectral power classifier algorithm. We also evaluated an intra-patient training-testing approach in which the algorithm was trained on the first 45% of each patient's data.

Results:

The CNN-BiLSTM trained on scalp EEG and subscalp EEG achieved an AUROC of 0.98 and an AUPRC of 0.50, which corresponds to 94% sensitivity with 1.11 false detections per day. The same model trained on iEEG only achieved AUROC of 0.94 and AUPRC 0.36. The conventional spectral band power detector achieved AUROC of 0.93 and AUPRC 0.38. The CNN-BiLSTM detector trained on the earliest half of the subscalp EEG data achieved an AUROC of 0.93 and AUPRC of 0.37, corresponding to 87% sensitivity and 5.9 false detections per day.

Significance:

High sensitivity and specificity are achievable in automated seizure detection from two-channel subscalp EEG data using a CNN-BiLSTM framework. Detector performance is superior when subcutaneous EEG data rather than intracranial EEG is used for training, and the addition of scalp EEG seizures to the training set was also helpful.

1. Introduction

Patient-reported seizure diaries have been a primary source of information for clinical epileptologists as well as for trials of medications and devices in epilepsy. However, these self-reported diaries have been shown to be unreliable for half or more of patients with epilepsy in multiple studies 1-4. Previous studies have demonstrated significant discrepancies between patient-reported and objectively measured seizure frequencies recorded with ambulatory EEG 5, 6, with patients often underreporting seizure occurrences despite the help of a dedicated caregiver 7.

While chronic scalp video-EEG or intracranial recordings, including subdural electrodes and stereoelectroencephalography, offer valuable data over days to weeks, and some implanted responsive neurostimulation devices can provide ultra-long-term (months to years) data from a limited set of locations 8-11, there remains an unmet need for ultra-long-term EEG monitoring with minimal invasiveness and regional coverage to provide accurate, quantitative measures of seizure burden. Subcutaneous EEG (sqEEG) has been proposed as a solution capable of recording EEG over the course of months to years without intracranial surgery. Previous work 12 has shown that these devices are safe, that the quality of their EEG signals is stable over considerable timeframes, and that the spectral characteristics of these signals are similar to those of scalp EEG, suggesting these systems may be suitable for chronic implantation and seizure counting.

However, the significant volumes of data generated by recording continuous EEG over multiple months create a significant burden on the physicians or technologists tasked with evaluating and interpreting these recordings. The limited published data suggest that efficient clinical EEG review of long records by experienced readers requires 2.6-5.2% of EEG time (38-76 minutes per 24 hours of EEG) 13-15. While subcutaneous EEG is limited in coverage and channels, and review speed might benefit from acquired familiarity during reading, scaling this rate linearly implies that review of a full year of EEG data could require 231-462 hours (9.6-19.3 continuous days). Reviewer effort of this order of magnitude is clearly not feasible, and automated data reduction is needed to make these types of recordings useful for seizure monitoring.

Much prior work in ultra-long-term subcutaneous EEG seizure detection has focused on conventional feature-based machine learning classification approaches. Given the novelty of limited-channel subscalp EEG recordings and the massive data requirements of deep learning algorithms, it is unclear whether conventional feature-based machine learning methods may presently be the better choice. Furbass et al. described multi-center validation of a feature-based multimodal system using scalp EEG, EMG, and ECG that showed 94% sensitivity with 12.8 false detections per day (FD/D) in 55 TLE patients and 74% sensitivity with 22 FD/D in 37 ETLE patients (11,978 h, 494 seizures in total) 16. An ictal fingerprinting approach combining subcutaneous EEG, an EMG estimate, and chest-mounted accelerometry in 4 PWE, using a random forest algorithm on delta, high- and low-theta EEG spectral power plus EMG and accelerometric features, found clusters of distinctive patterns that may be useful for distinguishing clinical from subclinical seizures 17. Deep learning approaches have also been tried, despite the limited data available for training. Remvig et al. 18 and Hartmann 19 developed a deep neural network approach using TensorFlow that utilized a publicly available scalp EEG repository of 590 patients for training, with the 10-20 system electrodes reduced to a three-electrode, two-channel montage. Testing was performed on similarly decimated scalp EEG records from an independent hospital, achieving 88% sensitivity with 12.9 false positives per day 19. The same algorithm, tested on subcutaneous EEG from nine people with epilepsy (PWE) and 12 neurotypical subjects over 946 days of EEG in total, produced 99.6% and 99.9% data reduction in PWE and neurotypical subjects respectively, with 86% sensitivity and 2.4 false positives per 24 hours 18.

The purpose of this study was to develop a deep learning seizure detection algorithm for ultra-long-term sqEEG recordings in patients with focal epilepsy, in support of a seizure forecasting and alert system. An accurate seizure detection algorithm is necessary to reduce reviewer burden in identifying seizure periodicities in recordings and measuring the accuracy of forecasts. Here, we present a comprehensive system for seizure identification in ultra-long-term sqEEG recordings and benchmark it on the largest such dataset recorded to date. We show typical artifacts that occur in these extended-duration records and present simple preprocessing methods to deal with them, and we show that a CNN-BiLSTM spectrogram classifier is an accurate and reliable method for seizure detection. We also show that this hybrid deep learning approach has superior performance compared to a conventional spectral band power model. In addition, we show that feeding artifact and false positive detections from the training data back into the model's training set can improve seizure detector performance and reduce false detections.

2. Methods

This study tests the hypothesis that a deep learning approach will outperform a rule-based spectral power-in-band approach. The second hypothesis tested is that expanding the available training data by incorporating EEG from other sources with characteristics similar to subscalp EEG can improve detector performance.

2.1. Study Population

Subcutaneous EEG recordings were acquired from 23 patients with temporal lobe epilepsy at three institutions: King's College London (KCL) 20, Zeeland University Hospital (ZUH), and Mayo Clinic Rochester (MCR). All recordings were made with the UNEEG SubQ System (UNEEG Medical A/S, Denmark), with the recording electrodes placed over the patient's primary epileptic focus in the temporal or frontal lobe, as determined by the patient's clinical video-EEG evaluation (see Weisdorf et al. 20 and Viana et al. 21 for detailed descriptions). Subcutaneous EEG seizures were annotated by board-certified neurologists with experience reading EEG, as described previously 20, 21. Subject SB was excluded due to a device connection issue, and subject SD was excluded because their seizures were not clearly distinguishable on EEG by clinicians. Patients who were seizure free during monitoring were omitted from testing, but their interictal data was used for training where applicable. Adverse events were described in Weisdorf et al. 20 and Viana et al. 21 for the Danish and KCL cohorts. One serious adverse event occurred in the KCL cohort related to improper electrode placement. No serious device-related adverse events occurred in the Danish or Mayo cohorts.

2.2. Data acquisition

EEG data was recorded with the 24/7 EEG SubQ system as previously described 22. Briefly, a small (103 mm) silicone and platinum-iridium electrode with three contacts is implanted under the skin on the side of the head over the primary area of seizure onset. The electrode is placed via an incision behind the ear and most naturally covers the temporal lobe, but it can be rotated to cover frontal, parietal, or occipital regions. Following implantation and a 7-10 day healing period, the electrode is connected to the inductive coupling unit (Figure 1) and confirmed to be recording good-quality data. The patient is sent home with two recording units and instructed to wear one while charging the second, exchanging units once daily. Data is automatically offloaded from on-board storage via custom software provided by the manufacturer and uploaded to cloud storage for later review.

Figure 1: Subscalp EEG system and inductive receiver.


A) The three-contact electrode is implanted using a postauricular entry incision. The electrode's orientation can be tailored to cover the patient's seizure onset zone. B) The receiver unit contains a battery and data storage and can be clipped to the user's shirt using a magnetic (C) or physical (D) clip. The physical clip is used for patients with existing subclavicular neuromodulation devices that could be triggered by a magnet swipe. E) The wire and inductive coupling disk are placed on the skin over the electrode's ceramic housing.

2.3. Data preprocessing

Data acquired in an uncontrolled environment, such as a patient's home or workplace, is especially prone to artifacts. To preprocess the signal for algorithm training, disconnection artifacts, produced when the inductive-coupling disk is physically removed from the skin, were first removed from the raw signal (Figure 2). Each seizure in the training set was clipped into a 5-minute data epoch, with ten copies randomly time-shifted within the 5-minute window while keeping onset and offset visible in the frame. In rare instances there were multiple seizure onsets within 5 minutes, and the offset of the second seizure was often not visible in the frame. Varying amounts of low-level noise were added to the seizure copies to promote robustness to noise and help ameliorate the data imbalance problem in training. The raw signal was then transformed into a spectrogram with 0.5-second, 0.5-Hz resolution, lowpass filtered at 23 Hz.
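As an illustrative sketch of this step, the snippet below computes such a spectrogram and generates noise-added seizure copies with SciPy/NumPy. The sampling rate (256 Hz) and the noise scale (5% of the epoch's standard deviation) are assumptions for illustration; the text does not specify them.

```python
import numpy as np
from scipy.signal import spectrogram

FS = 256                  # assumed sampling rate (Hz); the device's true rate may differ
WIN_S, STEP_S = 2.0, 0.5  # a 2 s window gives 0.5 Hz bins; a 0.5 s step gives 0.5 s frames
LOWPASS_HZ = 23.0         # discard spectrogram bins above 23 Hz

def eeg_to_spectrogram(x, fs=FS):
    """Convert one EEG channel to a (freq, time) spectrogram with
    0.5 Hz / 0.5 s resolution, low-passed at 23 Hz."""
    nperseg = int(WIN_S * fs)
    noverlap = nperseg - int(STEP_S * fs)
    f, t, sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    keep = f <= LOWPASS_HZ
    return f[keep], t, sxx[keep]

def augment_seizure(epoch, n_copies=10, noise_scale=0.05, rng=None):
    """Return noise-added copies of a seizure epoch (the random time
    shifting described in the text is omitted for brevity)."""
    if rng is None:
        rng = np.random.default_rng(0)
    sigma = noise_scale * epoch.std()
    return [epoch + rng.normal(0.0, sigma, epoch.shape) for _ in range(n_copies)]
```

Note that the 0.5 Hz frequency resolution forces a 2-second analysis window (1 / 0.5 Hz), so the 0.5-second time resolution comes from overlapping steps rather than shorter windows.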

Figure 2: Processing of disconnection artifacts.


Example of subcutaneous EEG signal before (left) and after (right) removal of disconnection artifacts. This preprocessing step improves the quality of the spectrogram inputs used for classifier training.

Raw signals from interictal segments of the training period were converted to spectrograms (5 minutes, 0.5 s time resolution, 0.5 Hz frequency resolution). A deep learning seizure detector previously trained on a large set of ambulatory intracranial EEG data (NeuroVista SAS, Medtronic RC+S) 8 was applied to the subscalp EEG data, and subcutaneous EEG epochs categorized as interictal were used for training only if that classifier's seizure probability was below 0.4, to eliminate ambiguous segments. Each channel of each spectrogram was z-score normalized, then linearly scaled to the range 0-1. Each 0.5 second of the spectrograms was labeled seizure or interictal.
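The per-channel normalization described above can be sketched as follows (the small epsilon guarding against division by zero is an implementation assumption):

```python
import numpy as np

def normalize_spectrogram(spec, eps=1e-8):
    """Z-score each channel of a (channels, freq, time) spectrogram,
    then rescale it linearly to the range [0, 1]."""
    out = np.empty_like(spec, dtype=float)
    for c in range(spec.shape[0]):
        z = (spec[c] - spec[c].mean()) / (spec[c].std() + eps)
        out[c] = (z - z.min()) / (z.max() - z.min() + eps)
    return out
```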

A two-class model was trained on the initial training dataset and then reapplied to the training samples to identify false detections. These false positives were incorporated into a new training set for a three-class classifier. Spectrogram samples were labeled as containing artifact if any timestep in the frame had a seizure probability greater than 0.8. For those samples, all timesteps with probabilities above 0.4 were labeled as artifact, while the remaining timesteps were labeled as interictal. Manually labeled scalp EEG data obtained from the epilepsy monitoring unit (EMU) were also added to the training set, converted into normalized spectrograms, and augmented with noise-added, time-shifted copies in the same manner as the sqEEG data.
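A minimal sketch of this relabeling rule, using the 0.8 and 0.4 thresholds from the text (the integer class codes are hypothetical):

```python
import numpy as np

INTERICTAL, SEIZURE, ARTIFACT = 0, 1, 2  # hypothetical class codes

def relabel_false_positive_frame(seizure_probs):
    """Relabel the timesteps of a non-seizure training frame that the
    two-class model flagged: if any timestep exceeds probability 0.8,
    timesteps above 0.4 become artifact and the rest stay interictal."""
    probs = np.asarray(seizure_probs)
    if probs.max() <= 0.8:                       # frame not flagged
        return np.full(probs.shape, INTERICTAL)
    return np.where(probs > 0.4, ARTIFACT, INTERICTAL)
```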

For comparison, a conventional power-in-band (PIB) detector was implemented. To preprocess the sqEEG signals for the PIB detector, a one-minute whitening window immediately preceding each 10-second data epoch was used, with a 10-second step size. Ten-second segments were used to calculate the power in each band according to the traditional IFCN definition (delta 1-4 Hz, theta 4-8 Hz, alpha 8-12 Hz, sigma 12-15 Hz, beta 15-30 Hz, and gamma 30-55 Hz) 23. The power in each band of each 10-second window was normalized to the mean band power in the whitening window. Extreme values were clipped below the 0.6th percentile and above the 99.8th percentile. Candidate detection thresholds were taken from the percentiles between 75 and 100.
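The band-power normalization can be sketched as below, using a simple rFFT periodogram; the paper does not state its exact spectral estimator, so the periodogram and the epsilon guard are assumptions.

```python
import numpy as np

# IFCN-style bands from the text (Hz)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12),
         "sigma": (12, 15), "beta": (15, 30), "gamma": (30, 55)}

def band_powers(x, fs):
    """Mean power per band from an rFFT periodogram of segment x."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2
    return {name: psd[(freqs >= lo) & (freqs < hi)].mean()
            for name, (lo, hi) in BANDS.items()}

def pib_ratios(epoch_10s, whitening_60s, fs):
    """Normalize each band's power in a 10 s epoch by the mean band
    power of the preceding one-minute whitening window."""
    p_epoch = band_powers(epoch_10s, fs)
    p_white = band_powers(whitening_60s, fs)
    return {b: p_epoch[b] / (p_white[b] + 1e-12) for b in BANDS}
```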

2.4. Training-Validation-Testing Data Split

Given the limited subscalp EEG data available, we trained and tested the algorithm using two different approaches, both of which maintained a strict separation between training and testing data, such that training data always preceded testing data for any given subject. In the first data split, the initial 45% of each subject's recorded data was used for training, the middle 10% for validation, and the last 45% for testing. This replicated the real-world case where a patient's early recording could be added to the algorithm's training data in an attempt to tailor the algorithm to that person's electrographic pattern. In the second split, subjects from the Danish cohort and all but the last subject from KCL were used for training, while subjects from Mayo Clinic and the last recorded subject from KCL were used for testing. This replicated the real-world condition where a patient's data is processed by an algorithm trained entirely on data from other subjects, without exposure to that individual's seizure pattern. We tested two- and three-class models for both train-validation-test splits. Furthermore, a set of scalp EEG seizures selected from temporal and frontal channels in epilepsy monitoring unit records was used to augment the training data for the deep learning algorithm. No scalp EEG data was used in the testing set.
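The intra-patient split reduces to index arithmetic over each subject's time-ordered samples; a small sketch:

```python
def chronological_split(n_samples, train_frac=0.45, val_frac=0.10):
    """Return (train, validation, test) slices for one subject's
    time-ordered samples: first 45% train, middle 10% validation,
    last 45% test, so training data always precedes testing data."""
    i_train = int(n_samples * train_frac)
    i_val = int(n_samples * (train_frac + val_frac))
    return slice(0, i_train), slice(i_train, i_val), slice(i_val, n_samples)
```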

2.5. Classifier architecture

The model architecture is outlined in Figure 3. The input to the model is sent to a batch normalization layer, then the channels of the EEG spectrogram are convolved into a unified image. The BiLSTM layer then returns an output for each timestep. These values are passed through a feedforward network consisting of a 100-neuron dense layer with sigmoid activation, a 0.2 dropout layer, a 50-neuron dense layer with sigmoid activation, and finally a 2- or 3-neuron softmax layer, depending on the number of classes. Categorical cross-entropy was used as the loss function, with the Adam optimizer. For the 3-class models, loss weights of 1, 20, and 10 were used for the interictal, seizure, and artifact classes respectively. Accuracy, AUROC, and recall were tracked during training, but early stopping was not used. Each model was trained for 200 epochs with a batch size of 64. Various layer sizes and parameters were evaluated and found to have minimal impact on performance.
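A Keras sketch of this stack is shown below. The dense-layer sizes, dropout rate, loss, and optimizer come from the text; the convolution configuration and LSTM width are illustrative assumptions, as the text does not specify them.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_bilstm(n_timesteps=600, n_freqs=47, n_channels=2, n_classes=3):
    """CNN-BiLSTM sketch: batch norm, channel-merging convolution,
    BiLSTM returning per-timestep outputs, then the feedforward head."""
    inp = layers.Input(shape=(n_timesteps, n_freqs, n_channels))
    x = layers.BatchNormalization()(inp)
    # Convolve the EEG channels into a unified image (kernel size assumed).
    x = layers.Conv2D(16, (3, 3), padding="same", activation="relu")(x)
    # Flatten the frequency axis so each timestep becomes a feature vector.
    x = layers.Reshape((n_timesteps, -1))(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    x = layers.Dense(100, activation="sigmoid")(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(50, activation="sigmoid")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC(),
                           tf.keras.metrics.Recall()])
    return model
```

With 5-minute frames at 0.5 s resolution, n_timesteps is 600, and n_freqs of 47 corresponds to 0-23 Hz in 0.5 Hz bins. The 3-class loss weights (1, 20, 10) described above would be applied at training time through a weighted loss or per-timestep sample weights.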

Figure 3: Overview of CNN-BiLSTM classifier architecture.


Schematic of the deep learning model used for seizure detection. The input spectrogram passes through convolutional and bidirectional LSTM layers, followed by fully connected and softmax output layers for classification.

2.6. Statistical analysis

Each epoch of each model was tested on a subset of the testing data that included every seizure and the first 288 interictal samples for each subject. The AUROC was calculated for each epoch and used to choose the best epoch for each model. Models that performed well were then tested on all the data from each testing subject. Half-overlapping windows were used during testing to account for edge cases (Figure 4). AUROC, AUPRC, and Brier score were used to evaluate each model. Sensitivity and false positives per day were used to select the best threshold for each patient. False positives per day was calculated as the false positive rate multiplied by the number of contiguous 5-minute windows per day (288), which assumes a 100% data capture rate. The optimal thresholds for each patient were then averaged to obtain aggregate sensitivity and false positives per day for each model.
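The false-positives-per-day calculation described above is simple enough to state directly:

```python
def false_positives_per_day(n_false_positives, n_negative_windows):
    """FP/day = false-positive rate x 288, where 288 is the number of
    contiguous 5-minute windows in a day (assumes 100% data capture)."""
    fpr = n_false_positives / n_negative_windows
    return fpr * 288
```

For example, 3 false detections over 864 interictal windows (three full days of data) give 1.0 FP/day.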

Figure 4: Half-overlapping testing window strategy.


Visualization of how half-overlapping windows allow multiple detection opportunities per seizure, increasing the likelihood of accurate detection and reducing missed events.

To test the PIB detector, 5-minute windows of the 10-second power-in-band ratio values, with a 10-second step size, were used to determine whether a seizure was present. A detection was positive if two consecutive values exceeded the threshold.
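The two-consecutive-values rule can be sketched as:

```python
import numpy as np

def pib_detect(ratios, threshold):
    """Return True when two consecutive power-in-band ratio values
    (10 s apart) both exceed the threshold."""
    above = np.asarray(ratios) > threshold
    return bool(np.any(above[:-1] & above[1:]))
```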

The methods used to assess model performance included area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and the Brier score. Sensitivity and the daily false alarm rate were also calculated to provide intuitive performance measures. While a random classifier would produce an AUROC of 0.5, a random classifier's AUPRC equals the ratio of positive samples to the total number of samples, which is typically very low in the unbalanced setting of detecting rare seizures. The Brier score is the mean squared error from the correct classification: a perfect classifier achieves a score of 0, while a random classifier scores 0.25. DeLong's test 24 was used to evaluate whether differences between algorithms on the held-out data were statistically significant.
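These metrics map directly onto standard scikit-learn calls; a sketch for one subject's window-level labels and seizure probabilities:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             brier_score_loss)

def evaluate_detector(y_true, y_prob):
    """AUROC, AUPRC, and Brier score for one subject. Note that the
    chance-level AUPRC equals the positive-class prevalence, which is
    very low for rare seizures."""
    return {
        "auroc": roc_auc_score(y_true, y_prob),
        "auprc": average_precision_score(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
        "chance_auprc": float(np.mean(y_true)),
    }
```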

3. Results

3.1. Study Population

Twenty-one subjects with focal epilepsy were included in this study, as outlined in Tables 1 and S1. Median age was 42 years (range: 21-75). The total duration of the recorded data was 3339.9 days, during which a total of 569 seizures were recorded. Across the cohort the data capture rate varied considerably, from 11.2% to 98.4%, with an overall mean of 63.3%, equating to 15.2 hours per day. As described above, no data from seizure-free subjects were used for testing, but their data was used in training.

Table 1: Participant dataset summary.

Calendar duration, length of recorded data, and number of EEG seizures for each participant, grouped by data collection site: King’s College London, Zeeland University Hospital, and Mayo Clinic Rochester.

King's College London
Subject Duration (d) Data (d) Seizures
SA2 404.3 177.9 52
SC 505.9 289.4 53
SE 537.9 493.3 54
SF 438.0 431.9 59
SG 262.7 116.0 96
SH 433.1 150.9 28
SI 328.4 56.7 0
SJ 415.6 396.7 46

Zeeland University Hospital
Subject Duration (d) Data (d) Seizures
E02 94.3 66.5 24
E03 73.7 56.3 1
E04 69.2 63.0 12
E05 83.0 75.0 5
E06 75.1 47.6 17
E07 78.8 52.7 0
E08 91.1 23.3 9
E09 82.0 62.4 25
E10 29.1 25.9 0

Mayo Clinic Rochester
Subject Duration (d) Data (d) Seizures
MSEL_00645 525.5 63.3 17
MSEL_01929 90.1 9.0 9
MSEL_02461 315.9 35.4 4
MSEL_02462 16.4 14.9 0

3.2. Seizure detection performance

The intracranial EEG-trained model, when tested on the held-out subjects, produced a mean AUROC of 0.94 and AUPRC of 0.36 (Table 2a, Figure 5a). The spectral power-in-band model produced an AUROC of 0.93 and AUPRC of 0.38 (Table 2b, Figure 5b) on the same held-out subjects. The CNN-BiLSTM model trained with additional artifacts and scalp EEG data produced an AUROC of 0.94 and AUPRC of 0.37 (Table 2d, Figure 5d) when trained and tested on split data from all subjects, and an AUROC of 0.98 and AUPRC of 0.50 (Table 2c, Figure 5c) when tested on the held-out subjects. Statistical testing found no significant difference in AUROC between the power-in-band and subcutaneous EEG-trained deep learning models (p=0.65), or between the intracranial EEG-trained and subcutaneous EEG-trained deep learning algorithms (p=0.62), when tested on the held-out dataset (Table 4). Full testing results for each algorithm are presented in Table 2a-d, a summary of all the algorithms' performance is presented in Table 3, and ROC and PRC curves are presented in Figure 5a-d.

Table 2a-d: Seizure detection performance for individual subjects across model types.

(a) Intracranial EEG-trained CNN-BiLSTM model.

(b) Power-in-Band (PIB) spectral detector.

(c) CNN-BiLSTM model trained on the full SubQ training set plus scalp EEG, tested on held-out subjects.

(d) Split SubQ CNN-BiLSTM model, trained on the initial 45% of each subject's data and tested on the final 45%.

Metrics include per-subject sensitivity, false positives per day (fps/day), AUROC, AUPRC, Brier score, and number of seizures evaluated. These sub-tables allow direct comparison of model performance on identical test subjects.

a: Intracranial EEG Trained CNN-BiLSTM Model:
Subject Threshold Sensitivity fps/day AUROC AUPRC Brier Score Latency # Seizures

MSEL_00645 0.85 0.93 31.0 0.90 0.39 0.239 15.1 14
MSEL_01929 0.9 0.89 12.3 0.96 0.45 0.192 0.144 9
MSEL_02461 0.5 1.00 58.6 0.93 0.34 0.160 0.267 3
SJ 0.7 0.89 20.9 0.97 0.28 0.131 3.80 46

Mean 0.93 30.8 0.94 0.36 0.181 4.82 72
95% Confidence interval [0.84, 1.00] [0.00, 61.3] [0.91, 0.98] [0.25, 0.53] [0.11, 0.25] [0.00, 16.0]
b: Power-in-Band Model:
Subject Threshold Sensitivity fps/day AUROC AUPRC Brier Score Latency # Seizures

MSEL_00645 0.85 0.85 37.3 0.88 0.12 0.054 16.1 13
MSEL_01929 0.95 1.00 12.5 0.98 0.50 0.078 10.0 8
MSEL_02461 0.9 1.00 24.0 0.94 0.50 0.062 5.00 2
SJ 0.9 0.91 30.2 0.91 0.38 0.003 21.3 45

Mean 0.94 26.0 0.93 0.38 0.049 13.1 68
95% Confidence interval [0.82, 1.00] [9.26, 42.8] [0.86, 0.99] [0.00, 0.62] [0.00, 0.10] [1.80, 24.4]
c: SubQ Trained CNN-BiLSTM Model:
Subject Threshold Sensitivity fps/day AUROC AUPRC Brier Score Latency # Seizures

MSEL_00645 0.85 0.93 0.86 0.96 0.42 0.0083 1.21 14
MSEL_01929 0.8 1.00 0.75 1.00 0.53 0.0137 0.263 9
MSEL_02461 0.9 1.00 0.41 1.00 0.55 0.0206 0.457 3
SJ 0.55 0.83 2.42 0.95 0.49 0.0103 2.06 46

Mean 0.94 1.11 0.98 0.50 0.0132 0.997 72
95% Confidence interval [0.81,1.00] [0.00, 2.53] [0.94, 1.00] [0.41, 0.59] [0.0046, 0.0218] [0.00, 2.30]
d: Split SubQ Trained CNN-BiLSTM Model:
Subject Threshold Sensitivity fps/day AUROC AUPRC Brier Score Latency # Seizures

MSEL_00645 0.7 0.93 1.08 0.96 0.03 0.0071 15.46 14
MSEL_01929 0.25 1.00 1.34 1.00 0.56 0.0034 0.701 9
MSEL_02461 0.85 1.00 0.50 1.00 0.39 0.0052 3.91 3
SA2 0.2 0.96 3.43 0.98 0.47 0.0065 5.00 25
SC 0.05 0.97 11.9 0.98 0.46 0.0118 1.83 39
SE 0.95 0.95 0.54 0.97 0.52 0.0052 2.24 41
SF 0.25 0.93 13.4 0.95 0.34 0.0250 0.258 15
SG 0.25 0.88 2.69 0.94 0.60 0.0082 7.15 85
SH 0.75 0.70 0.69 0.85 0.33 0.0034 0.125 10
SJ 0.05 0.96 6.88 0.98 0.47 0.0081 1.02 46
E02 0.1 0.67 5.26 0.86 0.34 0.0061 13.4 12
E04 0.0026 0.71 24.22 0.82 0.22 0.0037 0.750 7
E06 0.0026 0.75 10.90 0.87 0.13 0.0016 0.000 4
E09 0.95 0.63 0.23 0.83 0.32 0.0032 3.04 16

Mean 0.87 5.94 0.94 0.37 0.0070 3.92 326
95% Confidence interval [0.73, 0.98] [2.78, 8.62] [0.87, 1.00] [0.32, 0.52] [0.0038, 0.0103] [1.13, 6.71]

Figure 5. Comparison of seizure detection model performance.


(A) Intracranial electroencephalography (EEG)-trained CNN-BiLSTM Model. (B) Power-in-band model. (C) CNN-BiLSTM model trained on Danish and King’s College London (KCL) cohort and Mayo scalp EEG data, tested on the Mayo SubQ and final KCL SubQ patients. (D) CNN-BiLSTM model trained on the initial 45% of Danish, KCL, and Mayo cohort data and tested on the final 45%.

Table 3: Summary of performance metrics across all models.

Comparison of average performance metrics across the four tested models, showing sensitivity, false positives per day, AUROC, AUPRC, and Brier score.

Model Sensitivity fps/day AUROC AUPRC Brier
SubQ Trained CNN-BiLSTM 0.94 1.11 0.98 0.50 0.0132
Split SubQ CNN-BiLSTM 0.87 5.94 0.94 0.37 0.0070
Power in Band 0.94 26.0 0.93 0.38 0.0493
Intracranial Trained CNN-BiLSTM 0.93 30.7 0.94 0.36 0.1807

4. Discussion

In this study, we developed and evaluated several seizure detection algorithms for ultra-long-term subcutaneous EEG, with the goal of determining the most effective approach for this novel recording modality. Our best-performing model, the SubQ + scalp-trained 2-class CNN-BiLSTM, achieved the highest AUROC (0.98) and AUPRC (0.50), and the best average sensitivity (0.94) and false positive rate (1.11 per day) among all models tested. These results compare favorably to previously published work, which achieved 86% sensitivity and 2.6 FP/day 18. Furthermore, this suggests that combining sqEEG with additional scalp EEG training data and treating artifact and interictal segments as a unified non-seizure class may improve the generalizability and precision of the model.

The Split SubQ-trained 3-class CNN-BiLSTM, which explicitly modeled artifact as a third class and was trained on the earliest portion of each subject's recording, also performed well (AUROC = 0.94, AUPRC = 0.37). Although its sensitivity (0.87) and precision were lower than those of the 2-class model, it produced the best Brier score (0.007), indicating superior calibration. This highlights a trade-off between sensitivity and output probability reliability, which may be relevant depending on downstream use (e.g., real-time alerting vs. post hoc review).

In contrast, the Power-in-Band (PIB) model and the intracranial-trained CNN-BiLSTM demonstrated good AUROC (0.93 and 0.94 respectively) and sensitivity (0.94 and 0.93), but suffered from higher false positive rates (26.0 and 30.7 per day) and lower precision (AUPRC = 0.38 and 0.36). These results suggest that feature-based and cross-modality detectors may struggle to generalize to the temporal characteristics and artifact environment of sqEEG.

Across all CNN-BiLSTM models, our preprocessing pipeline — including disconnection artifact removal, seizure augmentation, and spectrogram transformation — contributed to improved robustness. The use of half-overlapping testing windows further enhanced robustness by ensuring that seizures near window edges were not missed.

Importantly, our findings emphasize the value of training with data that reflects the real-world characteristics of ultra-long-term ambulatory EEG: noisy environments, diverse artifact types, and changing signal dynamics. Inclusion of scalp EEG data likely provided additional diversity in seizure morphology and background variation, which improved generalizability.

Nevertheless, this study has several limitations. First, the number of subjects remains relatively small, and performance metrics may change with larger or more diverse populations. Future investigations may consider incorporating larger datasets to look for statistically significant differences in performance and to reduce the possibility of model overfitting. The present study should be viewed as an engineering study optimizing approaches to seizure detection with limited available data, and future work in a larger dataset should be done to fully validate the superiority of one approach over another. Second, although the SubQ + scalp model generalized well to held-out subjects, further validation across completely independent datasets is needed to more completely assess overfitting. Third, our models were evaluated retrospectively and offline; future work should explore real-time implementation feasibility, latency of detection, and clinical impact in prospective studies. Our CNN-BiLSTM architecture's use of 5-minute epochs in particular hinders real-time application, as no detection can be made until 5 minutes of data have accumulated for analysis. While we report latency values here for completeness, this measure should be understood as a measure of algorithm accuracy rather than as a real-time latency.

Ultimately, this work provides a foundation for building robust seizure detectors for minimally invasive, ultra-long-term monitoring. By enabling accurate seizure counting over months to years, such algorithms may facilitate individualized seizure forecasting, inform treatment decisions, and reduce reliance on inaccurate patient self-reporting. As sqEEG devices become more widely adopted, continued refinement of automated detection systems will be important for clinical application.

5. Conclusion

Reliability of seizure detection is crucial in ultra-long-term EEG monitoring applications. Here, a CNN-BiLSTM spectrogram-based machine learning algorithm showed numerically better, though not statistically significant, performance than a conventional spectral band power model, and we demonstrated the importance of including sqEEG data in the training set. Future investigations should focus on incorporating larger datasets for further optimization and validation of seizure detector performance, as well as prospective validation.

Supplementary Material

Figure S1

Key Points:

  • A deep learning CNN-BiLSTM algorithm was used to identify seizures in two-channel subscalp EEG data from three centers.

  • The deep learning algorithm trained with scalp and subcutaneous EEG achieved higher AUROC and AUPRC than a conventional spectral band power approach.

  • Machine learning assisted seizure detection is essential to reduce reviewer burden in ultra-long-term subcutaneous EEG monitoring.

Acknowledgements

The authors acknowledge Carey Huebert, Sherry Klingerman, Tal Pal Attia, and Mona Nasseri PhD for technical assistance.

Funding information

This project was supported by NIH NS123066, the Mayo Clinic Disruptive Innovation program, and Epilepsy Foundation of America’s My Seizure Gauge grant, the Mayo Neurology AI Program, and the Mayo Clinic Department of Neurology.

Footnotes

Conflict of interest statement

BHB, GAW, NG, and MPR report research support from UNEEG A/S, and JDH is an employee of UNEEG. BHB has licensed IP to Cadence Neuroscience and Seer Medical Pty, and has received honoraria and travel support from Eisai Inc.

Ethics statement

This work was approved by the Mayo Clinic Institutional Review Board. Analysis results and high level data will be made available upon reasonable request. Raw EEG used in this study is not available for sharing. We confirm that we have read the Journal’s position on issues involved in ethical publication and affirm that this report is consistent with those guidelines.

