Abstract
Brain computer interfaces (BCIs) are a developing technology that serves as a communication interface for people with neuromuscular disorders. Electroencephalography (EEG) and gaze signals are among the commonly used inputs for the user intent classification problem arising in BCIs. Fusing different input modalities, i.e., EEG and gaze, is an obvious but effective way to achieve high performance on this problem. Although some simplistic approaches exist for fusing these two evidences, a more effective method is required to reach classification performance and speed suitable for real-life scenarios. One of the main problems left unaddressed is highly noisy real-life data. In the context of the BCI framework utilized in this work, noisy data stem from user error in the form of tracking a nontarget stimulus, which in turn results in misleading EEG and gaze signals. We propose a method for fusing the aforementioned evidences in a probabilistic manner that is highly robust against noisy data. We show the performance of the proposed method on real EEG and gaze data for different configurations of noise control variables. Compared to the regular fusion method, the robust method achieves up to 15% higher classification accuracy.
Keywords: Multimodal fusion, Brain computer interfaces, M-estimation, c-VEP, Eye-tracking, Bayesian fusion
I. Introduction
A significant number of people struggle with motor control and speech degradation [1]. As recognition of the needs of such people grows, researchers try to develop and improve alternative communication technologies. Brain computer interfaces (BCIs) are one of the emerging technologies providing a non-muscular communication channel for people with neuromuscular disorders, such as amyotrophic lateral sclerosis (ALS), brain-stem stroke, and spinal cord injury [2], [3]. Generally, such BCI systems aim at enabling the user to choose an action from a set of options. For this purpose, electroencephalography (EEG) based BCIs have received more attention due to their noninvasive nature. A particular branch of EEG based BCI systems utilizes Steady State Visually Evoked Potentials (SSVEPs) [2], [4]. SSVEP is the brain response to flickering stimuli and mainly stems from the visual cortex [5]. Different stimuli result in different brain responses that are used to classify user intent. In code based VEP (c-VEP), which is a type of SSVEP, stimuli are chosen to be pseudo-random binary codes, i.e., control sequences [6]. M-sequences are among the most widely used binary codes due to their favorable properties, such as low inter-sequence correlation and approximate orthogonality between shifted versions of a sequence [4].
Some users of BCI maintain the ability to control gaze and blinking [7], [8]. Even though EEG and gaze are individually or jointly used for inferring user intent, real-life scenarios where both signals are highly noisy are disregarded [1], [9]. With c-VEP based BCIs, the main source of highly noisy signals, i.e., outliers, is the user paying attention to a stimulus other than the target stimulus, which occurs because of human error or health problems. Such outliers in EEG and gaze signals degrade the classification performance of the BCI. As a result, a robust scheme for fusing EEG and gaze in BCI systems is required. We present a novel letter by letter BCI typing paradigm where c-VEP and gaze are fused at decision level. More importantly, we propose a probabilistic classification method that is robust against outliers in both EEG and gaze signals. We demonstrate the performance of the robust multimodal framework on real data. Robust fusion outperforms regular fusion by up to 15% in classification accuracy. Compared to using only the regular c-VEP modality, the robust fusion method achieves a 29.8% accuracy gain.
II. Prior Work
Various approaches for fusing predictions in classification have been explored in recent studies. Aceto et al. [10], [11] focus on traffic classification and combine decisions of classifiers. They explore hard fusion approaches involving classifiers providing only the predicted class, and soft fusion approaches involving generative classifiers, where class priors are estimated on a validation set. We employ a soft fusion approach (cf. Section III-D), which does not require a validation set, and combine the decisions obtained from EEG and gaze data. Crucially, we differ from Aceto et al. by learning the parameters of each classifier in a robust manner through M-estimation.
Fusing input modalities for BCIs has been studied in several works. Lim et al. use frequency based VEP, i.e., f-VEP, for user intent prediction and fuse the information from a gaze tracker as a binary trust variable [12]. They assign every letter to a stationary position on screen and divide the screen into 3 horizontal groups. Their method makes a decision only if the horizontal position of the letter estimated to be the target with f-VEP is also correctly estimated with an eye-tracker. [13] and [14] utilize a very similar scheme, where classification is first carried out with gaze data and then verified by f-VEP. To the best of our knowledge, there is no previous study where EEG and gaze data are fused at decision level in a robust manner. Due to the inherent noise and outliers observed in human-sourced data, the proposed robust fusion scheme is more suitable for practical scenarios.
III. Problem Formulation
A. Experiment Design
We focus on a letter by letter BCI typing system. The user interface of this system contains four stimuli placed at the corners of the screen, as in Fig. 1. Each stimulus is presented at a rate of 60 bits/s and is paired with a unique m-sequence containing 63 bits. Accordingly, presentation of a single m-sequence takes 1.05 s. The stimuli are chosen to be two 10 × 6 checkerboards with reversed patterns of green and red, since this combination of colors is reported to result in high classification accuracy [4]. One checkerboard is matched with bit 0 and the reversed version is matched with bit 1, arbitrarily. Users are asked to focus on their target letter during the process called a trial, which takes 2.1 s. Each trial consists of a dynamic period, followed by a static period. During the dynamic period, letters appear on the screen and each letter moves to its respective checkerboard, while each checkerboard flickers through its m-sequence once, i.e., for 1.05 s. Once letters reach their corresponding checkerboards, the static period starts. During the static period, each checkerboard flickers for one m-sequence, while letters are stationary. We expect to see the target response in EEG signals during the static period. For each trial i ∊ {1,…, Nt}, we form two evidences: a single EEG data sample Xi and a set of gaze samples $\mathcal{G}_i$. Fusing the likelihoods from these two evidences, we classify each trial via maximum a-posteriori (MAP) estimation:
$$\hat{S}_i = \arg\max_{s \in \{1, \dots, N_s\}} P(S = s \mid X_i, \mathcal{G}_i) \tag{1}$$
where S is the discrete random variable representing the stimulus, i.e., the class label. Both Xi and $\mathcal{G}_i$ are corrupted by noise when the user tracks a stimulus other than the target stimulus. During the trials where any of the modalities is noisy, classification performance degrades. Hence, in the following, we propose a robust scheme to fuse EEG and gaze for user intent classification. We summarize our notation in Table I.
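As a rough illustration of the stimulus timing described above, the following sketch generates one 63-bit m-sequence with a 6-stage linear feedback shift register and checks its duration at 60 bits/s. The feedback polynomial x^6 + x + 1, the seed, and the function names are assumptions made for illustration; the text does not state which m-sequences were used.

```python
def m_sequence(n_bits=63, seed=(1, 0, 0, 0, 0, 0)):
    """One period of an m-sequence from the recurrence a[n+6] = a[n+1] XOR a[n].

    The recurrence corresponds to the polynomial x^6 + x + 1 (an assumed choice).
    """
    bits = list(seed)
    while len(bits) < n_bits:
        bits.append(bits[-5] ^ bits[-6])
    return bits[:n_bits]


def circular_autocorrelation(bits, shift):
    """Circular autocorrelation of the sequence mapped from {0, 1} to {+1, -1}."""
    signs = [1 - 2 * b for b in bits]
    n = len(signs)
    return sum(signs[k] * signs[(k + shift) % n] for k in range(n))


bits = m_sequence()
bit_rate = 60                       # bits/s, as stated in the text
duration_s = len(bits) / bit_rate   # time needed to present one m-sequence
off_peak = {circular_autocorrelation(bits, k) for k in range(1, len(bits))}
print(f"{len(bits)} bits at {bit_rate} bits/s take {duration_s:.2f} s")
print("off-peak circular autocorrelation values:", off_peak)
```

The off-peak autocorrelation is flat and small compared with the zero-shift peak of 63, which is the near-orthogonality between shifted versions that makes m-sequences attractive as stimulus codes.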
Fig. 1.
User interface of the BCI typing system contains four stimuli placed at the corners of the screen. Users are asked to focus on their target letter, as the letters move and checkerboards flicker.
TABLE I.
Summary of Notation
| Symbol | Description |
|---|---|
| $S$ | stimulus, i.e., discrete class label random variable |
| $N_t$, $N_c$, $N_s$ | number of trials, channels, and stimuli, respectively |
| $x_i^c$ | EEG data sample for trial i in channel c |
| $X_i$ | EEG data sample for trial i |
| $m_s^c$ | template vector for stimulus s in channel c |
| $\rho_{i,s}^c$ | correlation score for trial i and stimulus s in channel c |
| $d$ | feature vector dimension in each channel |
| $r_i^c$ | feature vector for trial i in channel c |
| $r_i$ | feature vector for trial i |
| $\mu_s$, $\Sigma_s$ | mean and covariance of the likelihood of $r_i$ given S = s |
| $g_{i,t}$ | gaze sample t in trial i |
| $P_t$, $C_t$ | state and innovation covariance to track gaze sample t |
| $p_{s,t}$ | probability of stimulus s given gaze samples up to t |
| $\mathcal{G}_i$ | set of gaze samples for trial i |
| $A_s$ | the area on screen where stimulus s appears |
| $\kappa$ | the ratio of outliers in the dataset |
| $\beta$ | the degree of corruption on outlying EEG samples |
| $v$ | the ratio of corrupted gaze samples |
B. Evidence from EEG Data
To synchronize presented m-sequences and corresponding EEG data, EEG signals are recorded simultaneously with a hardware trigger signal marking the onset of each static period. EEG data is acquired with Brain Products’ actiCHamp amplifier with Nc = 32 channels at a sampling rate of 200 samples/s. Every channel’s data are first referenced to channel Fp1 according to the 10–20 system. Then, only channels O1, Oz, and O2 are used for feature extraction, since these channels coincide with the visual cortex and are reported to increase classification accuracy [4]. Fundamental frequency components of m-sequences with 63 bits at a rate of 60 bits/s are 30, 20, 15, 12, 10, and 8.6 Hz [4]. Hence, a band-pass filter in the frequency range 1.5–45 Hz is applied to each channel, not only to remove DC drifts and power line noise but also to capture the frequency band of interest. Finally, time windowing is applied to the EEG signal in each channel, resulting in the data samples $x_i^c$, with c ∊ {1,…, Nc} being the channel index.
To train the classification pipeline, a calibration dataset with known stimuli is used to calculate a template brain response to each stimulus in each channel. Formally, $m_s^c$ is the template vector for stimulus s ∊ {1,…, Ns} in channel c. Templates are calculated using the sample median due to its robustness properties [15]. Then, for every EEG data sample $X_i$, a feature vector ri is formed as:
$$r_i = \left[ (r_i^1)^\top, \dots, (r_i^{N_c})^\top \right]^\top \tag{2}$$
Here, $r_i^c = [\rho_{i,1}^c, \dots, \rho_{i,N_s}^c]^\top$ and $\rho_{i,s}^c$ is the scalar correlation score between the template of stimulus s and $x_i^c$ in channel c. For each class, i.e., stimulus, we fit a multivariate Gaussian distribution on these feature vectors by estimating the class mean and covariance. As a result, we obtain a robust generative model capturing the relationship between the correlation scores of target and non-target stimuli in each channel. The likelihood of ri given its class label S = s becomes:
$$p(r_i \mid S = s) = \mathcal{N}(r_i;\, \mu_s, \Sigma_s) \tag{3}$$
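The feature construction described above can be sketched as follows: per-stimulus templates are taken as the sample median over calibration trials, and each trial is scored by its correlation with every template in every channel. This is a minimal sketch on synthetic data; the use of the Pearson correlation coefficient, the 210-sample window (1.05 s at 200 samples/s), the zero-based stimulus indices, and all function names are assumptions rather than details confirmed by the text.

```python
import numpy as np


def median_templates(X, labels, n_stimuli):
    """X has shape (trials, channels, samples); returns (n_stimuli, channels, samples)."""
    return np.stack([np.median(X[labels == s], axis=0) for s in range(n_stimuli)])


def correlation_features(x, templates):
    """Correlate one trial x (channels, samples) with every template.

    Returns a vector of length channels * n_stimuli, ordered channel by channel.
    """
    n_stimuli, n_channels, _ = templates.shape
    feats = np.empty((n_channels, n_stimuli))
    for c in range(n_channels):
        for s in range(n_stimuli):
            feats[c, s] = np.corrcoef(x[c], templates[s, c])[0, 1]
    return feats.ravel()


# Synthetic usage example (random responses, not the recorded EEG).
rng = np.random.default_rng(0)
n_stimuli, n_channels, n_samples = 4, 3, 210
true_resp = rng.normal(size=(n_stimuli, n_channels, n_samples))
labels = np.repeat(np.arange(n_stimuli), 20)
X = true_resp[labels] + 0.5 * rng.normal(size=(labels.size, n_channels, n_samples))

templates = median_templates(X, labels, n_stimuli)
r = correlation_features(X[0], templates)  # trial 0 was generated from stimulus 0
print(r.reshape(n_channels, n_stimuli).round(2))
```

In this synthetic case the score of the generating stimulus dominates in every channel, which is the structure the class-conditional Gaussian model is fitted to.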
In regular c-VEP studies, maximum likelihood (ML) estimates of the parameters μs and Ʃs are calculated from the calibration dataset using Eq. (3) [4]. Nevertheless, it has been shown that ML estimates of mean and covariance are sensitive to outliers and that a single outlier can arbitrarily inflate these estimates [15], [16]. Defining the breakdown point (BP) as the maximal fraction of outliers in the dataset that an estimator can handle without breaking down, ML estimates of mean and covariance achieve 0% BP, whereas M-estimators achieve 50% [15]. Hence, we estimate means and covariances using the robust M-estimation algorithm presented in [17]. This algorithm iteratively re-estimates the parameters of interest with a weighting function derived from Huber’s loss. Weights become very small for outlying data samples, resulting in parameter estimates that are robust to outliers.
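To make the reweighting idea concrete, the sketch below implements a generic Huber-type M-estimator of multivariate mean and scatter by iterative reweighting and compares it with the sample mean on contaminated synthetic data. It is not claimed to be the algorithm of [17]: the threshold `c`, the initialization, the omission of a consistency correction for the scatter, and the function name are all assumptions.

```python
import numpy as np


def huber_m_estimate(X, c=2.45, n_iter=100, tol=1e-8):
    """Huber-type M-estimate of location and scatter for X of shape (n, p).

    Samples whose Mahalanobis distance d exceeds c receive weight c / d;
    all other samples keep weight 1. The scatter is not rescaled for
    consistency at the Gaussian (assumed simplification).
    """
    n, _ = X.shape
    mu = np.median(X, axis=0)            # robust starting point
    sigma = np.cov(X, rowvar=False)      # non-robust start, refined below
    for _ in range(n_iter):
        diff = X - mu
        d = np.sqrt(np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff))
        w = np.minimum(1.0, c / np.maximum(d, 1e-12))
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu_new
        sigma_new = (w[:, None] ** 2 * diff).T @ diff / n
        converged = np.linalg.norm(mu_new - mu) < tol
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma


# Synthetic check: 200 inliers around the origin plus 20 far outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))
outliers = rng.normal(loc=10.0, scale=0.1, size=(20, 2))
X = np.vstack([inliers, outliers])

mu_rob, sigma_rob = huber_m_estimate(X)
err_rob = np.linalg.norm(mu_rob)           # distance from the true mean (0, 0)
err_ml = np.linalg.norm(X.mean(axis=0))
print(f"sample-mean error: {err_ml:.3f}, M-estimate error: {err_rob:.3f}")
```

The outliers are down-weighted rather than discarded, so the estimate remains slightly biased toward them but much less so than the sample mean.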
C. Evidence from Gaze Data
Starting from the beginning of each trial, the eye gaze of the user is recorded with an eye-tracker at a rate of 60 samples/s. The gaze samples $g_{i,t}$, where t is the gaze sample index in a trial, are collected to form the dataset $\mathcal{G}_i$ for trial i. We track the gaze position and gaze velocity with a Kalman filter to construct a probability distribution of the user’s attention on screen at every t [18]. Denoting the state, which comprises gaze position and velocity, by ξt, the process and measurement equations of the Kalman filter are:
$$\xi_t = F\,\xi_{t-1} + w_t \tag{4}$$
$$g_{i,t} = H\,\xi_t + \eta_t \tag{5}$$
Here, F and H denote the state transition and measurement matrices. Note that wt and ηt are assumed to be white, mutually independent Gaussian noise processes and the posterior state covariance is $P_t$. We denote the posterior probability of stimulus s given gaze observations up to time t as $p_{s,t}$ and calculate it by integrating the posterior distribution of state ξt over the screen area where stimulus s appears. If the user gets distracted and their gaze deviates far from the target stimulus, the corresponding $p_{s,t}$ might lead to misclassification. However, if the distraction occurs for a relatively short time period, we can instead rely on the ensemble of $p_{s,t}$ over t to provide a better estimate of $P(S = s \mid \mathcal{G}_i)$. Formally, defining As as the area on screen where stimulus s appears, the posterior probability of a stimulus given gaze data becomes:
$$P(S = s \mid \mathcal{G}_i) = \operatorname{mean}_t\left(p_{s,t}\right), \qquad p_{s,t} = \int_{A_s} p(\xi_t \mid g_{i,1}, \dots, g_{i,t})\, d\xi_t \tag{6}$$
where the mean is taken over t. To cope with outlying data samples occurring when the user tracks a nontarget stimulus, we calculate the robust mean of $p_{s,t}$ over t with an M-estimator for univariate data [15].
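The effect of replacing the ordinary mean over t with a robust mean can be illustrated with a univariate Huber location estimator. The sketch below is an assumption-laden example: the tuning constant 1.345, the fixed MAD-based scale, the function name, and the synthetic per-sample probabilities (60% of samples on the target, 40% on a nontarget) are illustrative choices, not values taken from the text.

```python
import statistics


def huber_location(values, c=1.345, n_iter=50, tol=1e-9):
    """Huber M-estimate of location with a fixed MAD-based scale."""
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad
    if scale == 0:
        return med  # more than half of the samples coincide with the median
    mu = med
    for _ in range(n_iter):
        # Weight 1 inside the threshold, c * scale / |residual| outside it.
        weights = [
            1.0 if abs(v - mu) <= c * scale else c * scale / abs(v - mu)
            for v in values
        ]
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Synthetic per-sample probabilities of the target stimulus within one trial:
# 36 samples while the user looks at the target, 24 while looking elsewhere.
on_target = [0.9 + 0.01 * ((k % 7) - 3) for k in range(36)]
off_target = [0.05 + 0.01 * ((k % 5) - 2) for k in range(24)]
probs = off_target + on_target

plain_mean = sum(probs) / len(probs)
robust_mean = huber_location(probs)
print(f"plain mean: {plain_mean:.3f}, robust mean: {robust_mean:.3f}")
```

Because the off-target samples are down-weighted, the robust mean stays close to the on-target level, whereas the plain mean is pulled toward the corrupted samples.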
D. Fusing EEG and Gaze
We adopt the Independent Likelihood Pool approach to fuse the evidences from EEG and gaze data [19]. This approach works well when the prior information on the class label is shared between different sources of evidence. Applying Bayes’ theorem on the posterior distribution in Eq. (1), we obtain:
$$P(S = s \mid X_i, \mathcal{G}_i) = \frac{p(X_i, \mathcal{G}_i \mid S = s)\, P(S = s)}{p(X_i, \mathcal{G}_i)} \propto p(r_i \mid S = s)\, p(\mathcal{G}_i \mid S = s)\, P(S = s) \propto p(r_i \mid S = s)\, P(S = s \mid \mathcal{G}_i) \tag{7}$$
Here, we obtain the second step by assuming that EEG and gaze data are conditionally independent, and that $p(X_i \mid S = s) = p(r_i \mid S = s)$. More specifically, as the two evidences, i.e., Xi and $\mathcal{G}_i$, originate from the same stimulus S = s, we model the relationship between the evidences and stimuli by a graphical model, where both evidences are child nodes of the parent node S. Hence, given S = s, EEG and gaze data are conditionally independent. Finally, by applying Bayes’ theorem on $p(\mathcal{G}_i \mid S = s)$ in the third step, we relate our classification objective to Eq. (3) and Eq. (6).
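A decision-level fusion of this kind can be sketched in a few lines, assuming the fused score for each stimulus is the product of the Gaussian EEG likelihood of Eq. (3) and the gaze posterior of Eq. (6), evaluated in the log domain. The class parameters, the feature vector, the gaze posterior, the zero-based stimulus indices, and the function names below are made-up illustrative values and names, not quantities from the experiments.

```python
import numpy as np


def gaussian_log_likelihood(r, mean, cov):
    """Log density of a multivariate Gaussian evaluated at r."""
    diff = r - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(r) * np.log(2 * np.pi) + logdet + maha)


def fuse_and_classify(r, means, covs, gaze_posterior, eps=1e-12):
    """Add EEG log-likelihoods and log gaze posteriors, then take the argmax."""
    log_eeg = np.array(
        [gaussian_log_likelihood(r, m, c) for m, c in zip(means, covs)]
    )
    log_gaze = np.log(np.asarray(gaze_posterior) + eps)
    scores = log_eeg + log_gaze
    return int(np.argmax(scores)), log_eeg, scores


# Illustrative parameters: 4 stimuli, 4-dimensional features.
means = [0.5 * np.eye(4)[s] for s in range(4)]
covs = [0.04 * np.eye(4) for _ in range(4)]

r = np.array([0.30, 0.35, 0.0, 0.0])            # ambiguous EEG feature vector
gaze_posterior = np.array([0.7, 0.1, 0.1, 0.1])  # gaze favours stimulus 0

fused_label, log_eeg, scores = fuse_and_classify(r, means, covs, gaze_posterior)
eeg_only_label = int(np.argmax(log_eeg))
print("EEG only:", eeg_only_label, "fused:", fused_label)
```

In this toy case the EEG evidence alone slightly favours one stimulus, while the gaze evidence is strong enough to change the fused decision.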
IV. Experiments
A calibration dataset of Nt = 360 trials with an equal number of trials from each stimulus is recorded. Then, a validation dataset of the same size is recorded. In order to simulate practical situations where signals get corrupted, three control variables are adopted. κ ∊ {0.1, 0.5} is the ratio of outliers in both calibration and validation datasets. For each outlying trial, a stimulus other than the target stimulus, i.e., s′, is chosen. β ∊ [0, 1] is the weighting variable, i.e., the degree of corruption, that controls how much the EEG samples chosen to be outliers are corrupted. v ∊ {0.4, 0.8} controls the ratio of corrupted gaze samples. We randomly sample a start index and the gaze samples following this index are modified, so that these samples simulate a person looking at a nontarget stimulus s′ before finding the target stimulus. Classification performances of regular and robust methods are compared with respect to accuracy, which is averaged over 30 Monte Carlo runs and reported with the standard deviation.
Figures 2, 3, 4, and 5 show the classification accuracy of regular and robust methods for three different input signals: only c-VEP, only gaze, and c-VEP and gaze fused. For each set of outlier (κ) and corrupted gaze (v) ratios, the corresponding plot demonstrates the classification performance with respect to the degree of corruption (β). For the case of β = 0.5, κ = 0.5, and v = 0.4, as in Fig. 4, we further provide the confusion matrices for regular c-VEP and robust fusion methods in Table II. The robust fusion method mostly outperforms regular fusion. In Figs. 2 and 4, the performance difference between fusion methods reaches 15%. The only exception is the most challenging configuration in Fig. 5, where each control variable results in outlier rates above 50%, i.e., the BP of M-estimators. With other configurations, even though individual modality based classification results are quite similar, there is a significant performance gain of robust fusion over regular fusion. The advantage of fusion over using only individual modalities in both robust and regular methods is also evident, especially for the cases of highly corrupted signals, i.e., Fig. 5. In Fig. 4, the robust fusion method leads to an accuracy gain of 29.8% compared to regular c-VEP. Even though the gaze signals here are highly noisy, the robust fusion method captures the underlying information and achieves better performance than regular fusion.
Fig. 2.
Classification accuracy vs β for κ = 0.1 and v = 0.4
Fig. 3.
Classification accuracy vs β for κ = 0.1 and v = 0.8
Fig. 4.
Classification accuracy vs β for κ = 0.5 and v = 0.4
Fig. 5.
Classification accuracy vs β for κ = 0.5 and v = 0.8.
V. Conclusion
We propose a robust multimodal classification method fusing c-VEP and gaze inputs in an outlier-aware manner. With the proposed method, we are able to robustify the system against the most probable source of error: users tracking a nontarget letter. We conclude that using robust methods in the fusion of c-VEP and gaze results in a significant performance increase and produces a much more reliable BCI system for real-life use. The robust fusion method outperforms regular fusion by up to 15% and the regular method based only on c-VEP by up to 29.8%. Future work could involve other fusion approaches, including the ones introduced by Aceto et al. [10], [11]. A thorough analysis of the choice of fusion method in terms of both classification performance and computational complexity is another direction.
TABLE II.
Confusion Matrices of Robust Fusion and Regular c-VEP
| | $\hat{S}$ = 1 | $\hat{S}$ = 2 | $\hat{S}$ = 3 | $\hat{S}$ = 4 |
|---|---|---|---|---|
| Robust fusion | | | | |
| S = 1 | 84.76 ± 1.64 | 1.16 ± 0.89 | 1.93 ± 1.12 | 1.13 ± 0.8 |
| S = 2 | 1.83 ± 1.21 | 85.26 ± 1.67 | 0.26 ± 0.51 | 0.63 ± 0.87 |
| S = 3 | 4.13 ± 1.96 | 1.6 ± 1.17 | 90.23 ± 1.97 | 2.03 ± 1.01 |
| S = 4 | 1.83 ± 0.89 | 3.63 ± 1.35 | 3.86 ± 1.35 | 75.66 ± 1.57 |
| Regular c-VEP | | | | |
| S = 1 | 67.23 ± 5.09 | 5.7 ± 2.57 | 6.73 ± 3.06 | 9.33 ± 2.48 |
| S = 2 | 8.43 ± 3.48 | 53.83 ± 6.09 | 9.26 ± 3.46 | 16.46 ± 4.44 |
| S = 3 | 6.56 ± 2.55 | 7.9 ± 3.12 | 70.03 ± 4.72 | 13.5 ± 3.12 |
| S = 4 | 4.5 ± 2.45 | 9.76 ± 3.61 | 6.86 ± 3.11 | 63.86 ± 5.6 |
Stimuli predictions ($\hat{S}$) vs. true stimuli (S) of robust fusion and regular c-VEP methods. Here, degree of corruption β = 0.5, outlier ratio κ = 0.5, and corrupted gaze ratio v = 0.4. Numbers of predictions are averaged over 30 Monte Carlo runs and reported along with their standard deviations.
Acknowledgments
This work is supported by NSF (IIS-1149570, CNS-1544895, CNS-1815349, IIS-1715858), DHHS (90RE5017–02-01), and NIH (R01DC009834).
Footnotes
The dataset is available at: “github.com/neu-spiral/Robust-Fusion-of-c-VEP-and-Gaze-Dataset”.
Contributor Information
Berkan Kadıoğlu, Email: kadioglub@ece.neu.edu.
İlkay Yıldız, Email: yildizi@ece.neu.edu.
Pau Closas, Email: closas@northeastern.edu.
Melanie B. Fried-Oken, Email: friedm@ohsu.edu.
Deniz Erdoğmuş, Email: erdogmus@ece.neu.edu.
References
- [1] Orhan U, “RSVP Keyboard: An EEG Based BCI Typing System with Context Information Fusion,” Ph.D. dissertation, Northeastern University, 2013.
- [2] McFarland DJ and Wolpaw JR, “Brain-computer interfaces for communication and control,” Commun. ACM, vol. 54, pp. 60–66, 2011.
- [3] Fishman I, Electronic communication aids: Selection and use. PROED, Incorporated, 1986.
- [4] Nezamfar H, Salehi SSM, and Erdogmus D, “Stimuli with opponent colors and higher bit rate enable higher accuracy for c-VEP BCI,” in SPMB. IEEE, 2015.
- [5] Bin G, Gao X, Wang Y, Hong B, and Gao S, “VEP-based brain-computer interfaces: time, frequency, and code modulations [research frontier],” IEEE Computational Intelligence Magazine, vol. 4, 2009.
- [6] Muller-Putz GR and Pfurtscheller G, “Control of an electrical prosthesis with an SSVEP-based BCI,” IEEE Transactions on Biomedical Engineering, vol. 55, no. 1, pp. 361–364, 2008.
- [7] Abu-Faraj ZO, Sleiman HCB, Al Katergi WM, Heneine J-LD, and Mashaalany MJ, “A rehabilitative eye-tracking based brain-computer interface for the completely locked-in patient,” in Encyclopedia of Healthcare Information Systems. IGI Global, 2008.
- [8] Kathner I, Kubler A, and Halder S, “Comparison of eye tracking, electrooculography and an auditory brain-computer interface for binary communication: a case study with a participant in the locked-in state,” Journal of Neuroengineering and Rehabilitation, vol. 12, p. 76, 2015.
- [9] Nezamfar H, Mohseni SS, Higger M, and Erdogmus D, “Code-VEP vs. eye tracking: A comparison study,” Brain Sciences, vol. 8, 2018.
- [10] Aceto G, Ciuonzo D, Montieri A, and Pescapé A, “Traffic classification of mobile apps through multi-classification,” in GLOBECOM. IEEE, 2017, pp. 1–6.
- [11] Aceto G, Ciuonzo D, Montieri A, and Pescapé A, “Multi-classification approaches for classifying mobile app traffic,” Journal of Network and Computer Applications, vol. 103, pp. 131–145, 2018.
- [12] Lim J-H, Lee J-H, Hwang H-J, Kim DH, and Im C-H, “Development of a hybrid mental spelling system combining SSVEP-based brain-computer interface and webcam-based eye tracking,” Biomedical Signal Processing and Control, vol. 21, pp. 99–104, 2015.
- [13] Vilimek R and Zander TO, “Bc (eye): Combining eye-gaze input with brain-computer interaction,” in International Conference on Universal Access in Human-Computer Interaction. Springer, 2009, pp. 593–602.
- [14] Stawicki P, Gembler F, Rezeika A, and Volosyak I, “A novel hybrid mental spelling application based on eye tracking and SSVEP-based BCI,” Brain Sciences, vol. 7, no. 4, p. 35, 2017.
- [15] Zoubir AM, Koivunen V, Chakhchoukh Y, and Muma M, “Robust estimation in signal processing: A tutorial-style treatment of fundamental concepts,” IEEE Signal Processing Magazine, vol. 29, pp. 61–80, 2012.
- [16] Maronna RA, “Robust M-estimators of multivariate location and scatter,” The Annals of Statistics, pp. 51–67, 1976.
- [17] Kadioglu B, Yildiz I, Closas P, and Erdogmus D, “M-estimation based subspace learning for brain computer interfaces,” IEEE Journal of Selected Topics in Signal Processing, 2018. DOI: 10.1109/JSTSP.2018.2871956.
- [18] Jazwinski AH, Stochastic processes and filtering theory. Courier Corporation, 2007.
- [19] Punska O, “Bayesian approaches to multi-sensor data fusion,” Master’s thesis, University of Cambridge, 1999.





