Abstract
Current approaches to optimal spatio-spectral feature extraction for single-trial BCIs exploit mutual information based feature ranking and selection algorithms. In order to overcome potential confounders underlying feature selection by information theoretic criteria, we propose a non-parametric feature projection framework for dimensionality reduction that utilizes mutual information based stochastic gradient descent. We demonstrate the feasibility of the protocol based on analyses of EEG data collected during execution of open and close palm hand gestures. We further discuss the approach in terms of potential insights in the context of neurophysiologically driven prosthetic hand control.
Keywords: EEG, brain-computer interfaces, information theoretic learning, feature projection, hand gestures
1. INTRODUCTION
Electroencephalogram (EEG) based brain-computer interfaces (BCIs) have shown promise in providing communication and control means for people with neuromuscular disabilities. A variety of statistical signal processing tools exist to extract optimal user-specific features for decoding user intent from neural activity and to further exploit these features in single-trial decision making with BCIs. State-of-the-art methods for discriminative spatial filtering of multichannel EEG include the well-known common spatial patterns (CSP) algorithm [1] and its filter bank CSP (FBCSP) extensions [2, 3]. Current approaches to spatio-spectral feature extraction with such protocols rely on mutual information based feature ranking and selection algorithms [3–7]. At the same time, there exists significant evidence that feature ranking by any criterion, including information theoretic criteria, is potentially sub-optimal [8–10]. Accordingly, to tackle this confounding problem, information theoretic feature projection approaches have been introduced [10–12].
Effective feature extraction relying on information theoretic criteria is likely to provide further insights that go beyond conventional right versus left hand motor imagery protocols [13], towards discriminating more complex gestures of the same hand. A variety of studies indicate that a diverse range of brain rhythms is related to human motor behavior in general [14, 15]. Motivated by existing pioneering work [16–19], we argue that extracting such neural information embedded in EEG would yield important progress towards EEG-assisted robotic prosthetic hand control for amputees [20]. In that regard, we extend current spatio-spectral feature extraction paradigms for potential use in more complex EEG decoding tasks of interest.
In the present study we propose an information theoretic non-parametric feature extraction protocol for single-trial BCIs. The approach exploits dimensionality reduction by a linear feature projection optimized through mutual information based stochastic gradient descent. In an experimental EEG decoding study for binary classification of same hand gestures, we demonstrate the feasibility of the proposed approach with promising insights into user-specific feature extraction for online BCI-assisted prosthetic hand control.
2. NON-PARAMETRIC FEATURE EXTRACTION
The proposed feature extraction protocol initially resembles the conventional FBCSP approach [2, 3] on training data: EEG data are filtered in various frequency bands, and the CSP algorithm obtains discriminative spatial transformations for each band. Differently from FBCSP, after extraction of CSP features in multiple frequency bands, the high dimensional feature vector is subjected to a feature projection that maximizes the mutual information between the projected features and the class labels in the training set. The lower-dimensional projected features, which are expected to carry maximal class information, are then used for classification. An illustration of the proposed approach is provided in Figure 1, and the steps are described in detail within this section.
Fig. 1.
Illustration of the proposed information theoretic feature extraction framework for a single-trial BCI.
2.1. Pre-Processing with Filter Banks
Recorded raw EEG data are first filtered separately with FIR bandpass filters in a total of K frequency bands. The complete filter bank frequency range and the individual band bin widths can vary arbitrarily as analysis-specific parameters. After filtering in each frequency band, the data are spatially transformed with extracted CSP filters to obtain frequency band specific discriminative features.
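As a concrete sketch of this pre-processing step, the snippet below band-pass filters a multichannel recording over a small illustrative filter bank. The band edges, filter order, and function names here are placeholders for exposition, not the exact design used in the study.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

def filter_bank(eeg, fs=256, bands=((4, 8), (8, 12), (12, 16)), numtaps=129):
    """Band-pass a (channels x samples) EEG array in each frequency band.

    Returns a list of K band-filtered arrays, one per band. Band edges
    and filter order are illustrative placeholders.
    """
    out = []
    for lo, hi in bands:
        taps = firwin(numtaps, [lo, hi], pass_zero=False, fs=fs)
        out.append(filtfilt(taps, [1.0], eeg, axis=-1))  # zero-phase filtering
    return out
```

Each band's output would then be passed through the band-specific CSP transformation described next.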
2.2. Spatial Filtering with Common Spatial Patterns
Feature transformation with CSP aims to identify a discriminative basis for a multichannel signal recorded under different conditions, in which signal representations maximally differ in variance between these conditions (i.e. classes). For a two-class paradigm with class labels c1 and c2, the CSP algorithm solves the problem:
$w^{*} = \arg\max_{w} \dfrac{w^{\top} \Sigma_{c_1} w}{w^{\top} \Sigma_{c_2} w}$   (1)
where $\Sigma_{c_1}$ and $\Sigma_{c_2}$ denote the $N \times N$ class covariance matrices of the data matrix $X \in \mathbb{R}^{N \times S}$ with N channels and S samples, for classes c1 and c2. Eq. 1 can be solved through the generalized eigenvalue problem:
$\Sigma_{c_1} w = \lambda \, \Sigma_{c_2} w$   (2)
which has N possible solutions. The eigenvector corresponding to the highest eigenvalue indicates a basis in which the variance of the data corresponding to c1 is highest and that of c2 is lowest, and vice versa for the lowest eigenvalue. Data preprocessing is usually performed by combining the L eigenvectors with the smallest and largest eigenvalues obtained by Eq. 2 to form $W \in \mathbb{R}^{L \times N}$ and compute the transformed time series $\hat{X} = WX$. Following this transformation, features are obtained as normalized signal variances of the transformed time series $\hat{X}$:
$f_l^{k} = \operatorname{var}(\hat{x}_l^{k}) \,\Big/\, \sum_{l'=1}^{L} \operatorname{var}(\hat{x}_{l'}^{k})$   (3)
where the superscript k ∈ {1, 2, …, K} is the filter bank index and the subscript l ∈ {1, 2, …, L} denotes the row index corresponding to one of the L CSP transformations. This process is applied to each filter bank output separately to extract frequency-band specific discriminative features, resulting in a total of NI = L × K features. By concatenating all possible features, a high dimensional feature vector $f \in \mathbb{R}^{N_I}$ is initially constructed for further feature projection.
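The CSP computation of Eqs. 1–3 can be sketched as follows, assuming pre-computed class covariance matrices and keeping L/2 eigenvectors from each end of the eigenvalue spectrum. This is a minimal illustration; in practice the covariance estimates may need regularization.

```python
import numpy as np
from scipy.linalg import eigh

def csp_transform(cov1, cov2, L=2):
    """Solve the generalized eigenvalue problem of Eq. 2 and keep the
    eigenvectors with the L/2 smallest and L/2 largest eigenvalues,
    returning the (L x N) spatial filter matrix W."""
    lam, V = eigh(cov1, cov2)  # generalized eig; eigenvalues ascending
    half = L // 2
    idx = np.r_[np.arange(half), np.arange(len(lam) - half, len(lam))]
    return V[:, idx].T

def csp_features(W, x):
    """Normalized variance features of Eq. 3 for one band's (N x S) data x."""
    v = (W @ x).var(axis=1)
    return v / v.sum()
```

Concatenating the `csp_features` outputs over the K bands then yields the NI = L × K dimensional vector f.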
2.3. Feature Projection based on Non-Parametric Stochastic Mutual Information Gradient
In contrast to ranking features with information theoretic criteria and selecting a specific subset, dimensionality reduction is achieved by obtaining an optimal feature projection, in the information theoretic sense, of the feature matrix $F \in \mathbb{R}^{N_I \times N_T}$, which is formed by the vectors f at each of the $N_T$ instances (i.e. trials) in the training data. In particular, we aim to obtain a linear feature projection model:
$Y = MF$   (4)
where $Y \in \mathbb{R}^{N_O \times N_T}$ consists of the projected features from the training data and $M \in \mathbb{R}^{N_O \times N_I}$ is the linear projection matrix, performing dimensionality reduction if $N_O \ll N_I$. The output Y, with columns $y_i$, is aimed to maximize the mutual information with the corresponding vector of training data class labels with elements $c_i \in \{c_1, c_2\}$, for i = 1, 2, …, NT. Mutual information between the continuous random variable Y with probability density function p(y) and the discrete class label random variable C with probability mass function p(c) is denoted by:
$I(Y; C) = H(Y) - H(Y \mid C)$   (5)
Inserting the definitions of Shannon’s entropy and conditional entropy for Y, given by Eqs. 6 and 7, into Eq. 5 yields the expression for mutual information in Eq. 8.
$H(Y) = -\int p(y) \log p(y) \, dy$   (6)
$H(Y \mid C) = -\sum_{c} p(c) \int p(y \mid c) \log p(y \mid c) \, dy$   (7)
$I(Y; C) = E_Y\!\left[ \sum_{c} p(c \mid Y) \log \dfrac{p(Y \mid c)}{p(Y)} \right]$   (8)
Entropy and mutual information estimation as in Eq. 8 has been successfully applied non-parametrically to practical problems based on kernel density estimates given $N_T$ instances $\{y_j\}_{j=1}^{N_T}$ and their corresponding class labels, where the kernel density estimate of the unknown probability density underlying the instances is given by $\hat{p}(y) = \frac{1}{N_T} \sum_{j=1}^{N_T} \kappa_\sigma(y - y_j)$, with $\kappa_\sigma(\xi)$ being the size-σ multivariate kernel function for an $N_O$ dimensional vector [21]. Here, our main objective is not to accurately estimate mutual information, but to determine the optimal feature projection weights under the maximum mutual information criterion. Hence, for practical manipulations, a stochastic approach is adopted. In order to stochastically estimate Eq. 8, we drop the expectation over Y and evaluate the expression at each instantaneous sample of Y, resembling existing adaptive filtering approaches [22, 23]. After dropping the expectation over Y, applying Bayes’ theorem and the law of total probability, we obtain the following expression with estimated probability density functions:
$\hat{i}_t(Y; C) = \sum_{i} \dfrac{\hat{p}(c_i) \, \hat{p}(y_t \mid c_i)}{\hat{p}(y_t)} \log \dfrac{\hat{p}(y_t \mid c_i)}{\hat{p}(y_t)}, \qquad \hat{p}(y_t) = \sum_{i} \hat{p}(c_i) \, \hat{p}(y_t \mid c_i)$   (9)
$\hat{p}(y_t \mid c_i) = \dfrac{1}{N_{c_i}} \sum_{j \,:\, c_j = c_i} \kappa_\sigma(y_t - y_j)$   (10)
where $y_j$ with $c_j = c_i$ denotes an instance of class $c_i$ and $N_{c_i}$ denotes the number of instances belonging to class $c_i$ in the training set, satisfying $\sum_i N_{c_i} = N_T$. For Eq. 9, we estimate the probability mass functions of the class priors from the training data as $\hat{p}(c_i) = N_{c_i} / N_T$.
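A minimal sketch of the density estimates entering Eq. 9, assuming the projected samples are stored one per row (function names here are illustrative, not from the paper):

```python
import numpy as np

def kernel(u, sigma):
    """Multivariate Gaussian kernel of size sigma for N_O-dim vectors."""
    d = u.shape[-1]
    return np.exp(-0.5 * np.sum(u * u, axis=-1) / sigma**2) \
        / ((2 * np.pi) ** (d / 2) * sigma**d)

def class_conditional(y, Y, labels, c, sigma=0.5):
    """Kernel density estimate of p(y | c) over the projected training
    samples of class c (Eq. 10); Y holds one sample per row."""
    Yc = Y[labels == c]
    return kernel(y - Yc, sigma).mean()

def prior(labels, c):
    """Class prior estimate p(c) = N_c / N_T."""
    return np.mean(labels == c)
```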
The expected value of the expression in Eq. 9 satisfies $E_Y[\hat{i}_t(Y;C)] = \hat{I}(Y;C)$, where $\hat{I}(Y;C)$ is the estimated mutual information. By evaluating the gradient of this expression at every instance t, the weights of the linear feature projection model are updated by $M_{t+1} = M_t + \eta_t \nabla_M \hat{i}_t(Y;C)$, as in an adaptive linear system that generates the samples of Y according to yt = Mft, where ηt is the step size at iteration t. This process is performed over all training instances a total of NE times (i.e. the number of epochs). Note the constraint on the choice of the kernel function for probability density estimation: it must be continuously differentiable for proper evaluation of the gradient. We refer to $\nabla_M \hat{i}_t(Y;C)$ as the Stochastic Mutual Information Gradient (SMIG), analogous to the adaptive system training approaches previously presented in [12, 23].
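One SMIG iteration can be illustrated as below. For brevity, the analytic kernel gradient is replaced by a central finite difference over the entries of M; this numerical stand-in is for illustration only, and eta0, sigma, and eps are placeholder values (the step-size schedule follows Sec. 3.3):

```python
import numpy as np

def smig_update(M, F, labels, t, eta0=0.01, sigma=0.5, eps=1e-5):
    """One stochastic ascent step on the instantaneous mutual information
    estimate of Eq. 9 at training instance t (finite-difference sketch)."""
    classes = np.unique(labels)
    priors = np.array([np.mean(labels == c) for c in classes])

    def inst_mi(Mat):
        Y = Mat @ F                        # project all training trials
        yt = Y[:, t]
        diff = Y.T - yt                    # y_j - y_t for all j
        # unnormalized Gaussian kernel: the normalization constant
        # cancels in the ratios of Eq. 9
        k = np.exp(-0.5 * np.sum(diff * diff, axis=1) / sigma**2)
        cond = np.array([k[labels == c].mean() for c in classes])  # Eq. 10
        marg = priors @ cond               # law of total probability
        post = priors * cond / marg        # Bayes' theorem
        return np.sum(post * np.log(cond / marg))

    grad = np.zeros_like(M)
    for idx in np.ndindex(*M.shape):       # entry-wise numerical gradient
        Mp, Mm = M.copy(), M.copy()
        Mp[idx] += eps
        Mm[idx] -= eps
        grad[idx] = (inst_mi(Mp) - inst_mi(Mm)) / (2 * eps)
    eta = eta0 / (t + 1) ** 0.02           # step size; t+1 avoids t = 0
    return M + eta * grad
```

Looping this update over all training instances, repeated for NE epochs, yields the learned projection matrix M.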
3. EXPERIMENTAL STUDY
3.1. Participants and Experimental Data
Four right-handed healthy subjects (3 male, 1 female; mean age 27.0 ± 2.1) participated in this study. All subjects performed the experiment with their dominant hand. Before the experiments, all participants gave their informed consent after the experimental procedure was explained to them in accordance with guidelines set by the research ethics committee of Northeastern University.
Throughout the experiments, 32-channel EEG data was recorded at 256 Hz sampling frequency, using active EEG electrodes and a g.USBamp biosignal amplifier (g.tec medical engineering GmbH, Austria). Electrodes were placed on the scalp according to the 10–20 system.
3.2. Study Design
Participating subjects performed motor execution and subsequent imagery of open palm (i.e. resting) and close palm (i.e. grasping) gestures with the dominant hand, as cued by images presented on a computer screen placed in front of them. Motivated by experimental evidence that attempted and executed motor movements elicit similar neural signatures in amputees [24, 25], we incorporated motor execution within the experimental protocol rather than the conventional pure motor imagery approach suggested for locked-in stroke patients. We argue that motor execution in parallel with imagery constructs a reliable basis for investigating neural correlates of complex same hand gestures in healthy individuals, towards potential use of neurally controlled prostheses by amputees.
The experiment consisted of four blocks (B1, B2, B3, B4) of 50 trials each, resulting in a total of 200 trials. Trials were evenly divided between the two conditions (i.e. open and close palm), and the condition was randomly selected at each trial. Blocks were separated by brief intermissions of one to two minutes. The first two blocks formed Session 1 and the last two blocks formed Session 2, as referred to in the session-to-session learning data analyses.
Each trial lasted five seconds, and trials were separated by three-second resting intervals. During the resting intervals, a visual cue indicating the gesture to be performed in the upcoming trial was presented to the subject on a computer screen (see Figure 2). At the beginning of each trial, as the resting interval ended, subjects executed the presented hand gesture once (i.e. closed or opened the palm) and imagined the corresponding gesture (e.g., grasping an object or opening the palm to wave) for the rest of the five seconds. Visual cues remained on the screen during the trials as well.
Fig. 2.
Visual cues presented during the experiments.
3.3. EEG Data Analysis
Within the proposed non-parametric feature extraction framework, as illustrated in Figure 1 and presented in Section 2, EEG data of each subject were divided into training and test splits and used to discriminate between the two same hand gestures: open palm and close palm. Data analysis results are presented in two different schemes as follows.
First, the impact of changing the output projection dimensionality (NO) for SMIG is investigated. In particular, for all possible NO ∈ {2, 3, 4, 5, 6, 7, 8}: (1) data from B1 were used for training and B2, B3, B4 for testing; (2) data from B1 and B2 (i.e. Session 1) were used for training to test on data from B3 and B4 (i.e. Session 2); (3) data from B1, B2 and B3 were used for training to classify the last block’s data (B4); (4) data from Session 1 were used to test Session 2 as in case 2, using the conventional mutual information based individual feature ranking and selection approach [2, 3] for comparison with the proposed approach. Note that a single classifier model is used here, which replicates an online experiment paradigm resembling user-specific train-and-use robotic prosthetic hand control to open and close the palm for grasping.
Second, for a fixed output projection dimensionality (NO = 2), the effect of changing the input feature dimensionality (NI = L × K) for SMIG is evaluated. In particular, for the same filter bank (i.e. fixed K), the number of CSP features extracted for each frequency band was chosen to be either L = 2 (i.e. using only the most discriminative CSP features extracted at each band) or L = 32 (i.e. using all CSP features extracted at each band). Classification was performed across sessions (i.e. Session 1 to Session 2 and vice versa). Similarly, this replicates an online paradigm with a single feature extraction step and non-overlapping train and test data splits. All data analyses investigated the feasibility of session-to-session learning with the proposed approach.
In all analyses the filter bank consisted of a total of K = 25 overlapping FIR bandpass filters generated within the 4–40 Hz range, which was split into bands with 4, 6, 9, 12, 18 or 36 Hz bin widths. The 4–40 Hz interval sufficiently covers a wide spectral range representative of motor cortical activity as studied in the literature [2, 3]. Gaussian kernels were used for the density estimations in Eq. 10. The SMIG step size at each iteration t was chosen as ηt = 0.01/t^0.02, and the number of epochs was NE = 100 in the presented results. Features extracted in the training phase were used to construct a linear discriminant analysis classifier, which was used in testing as illustrated in Figure 1.
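The final classification step can be sketched with a minimal numpy-only two-class linear discriminant analysis, operating on the projected features. This is a simplified stand-in (equal-prior midpoint threshold), not the exact implementation used in the study.

```python
import numpy as np

def lda_fit(Y, labels):
    """Two-class LDA on projected features Y of shape (N_O x trials).
    Assumes equal class priors, which matches the balanced trial design."""
    Y0, Y1 = Y[:, labels == 0], Y[:, labels == 1]
    m0, m1 = Y0.mean(axis=1), Y1.mean(axis=1)
    # pooled within-class scatter matrix
    Sw = np.cov(Y0) * (Y0.shape[1] - 1) + np.cov(Y1) * (Y1.shape[1] - 1)
    w = np.linalg.solve(Sw, m1 - m0)      # discriminant direction
    b = -0.5 * w @ (m0 + m1)              # decision threshold at the midpoint
    return w, b

def lda_predict(w, b, Y):
    """Predict 0/1 labels for projected test features Y (N_O x trials)."""
    return (Y.T @ w + b > 0).astype(int)
```

Fitting on Session 1 projections and predicting on Session 2 projections would replicate the session-to-session split described above.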
3.4. Results on Session-to-Session Learning
Increasing the output projection dimensionality NO results in a higher-dimensional feature space for classification; however, it does not necessarily increase binary classification accuracies, as observed in Figure 3. In most conditions, we observe a saturation of accuracies in response to further increases in NO. We argue that this may be a result of inconsistent high-dimensional density estimation incorporated within SMIG. Hence, low dimensional feature projections (NO = 2 or 3) are likely to suffice. Furthermore, except for Subject 2, we observe that increasing the amount of training data results in better accuracies, the highest being 82%, which supports the feasibility of our approach for optimal spatio-spectral feature extraction. Comparison of the solid and dashed blue curves in Figure 3 reveals the potential sub-optimality of the conventional approach, most explicitly for Subjects 1 and 4. In particular, for Subject 4, feature selection performs below chance level (50%).
Fig. 3.
Each subject’s classification accuracies for varying output projection dimensionalities (NO), and three different train-test data splits for session-to-session learning, using L = 2 for CSP feature extraction at each frequency band. Dashed blue curves represent the accuracies with the feature selection approach [2, 3] for Session 1 to Session 2 train-test data split, where NO represents the number of selected features in that case.
Table 1 demonstrates the impact of changing the number of CSP features extracted in each frequency band (i.e. L). Average session-to-session classification accuracies across subjects were around 65% for same hand gesture binary classification using EEG. Due to the linear projection nature of the model, using only the most discriminative CSP features (L = 2) is likely to be sufficient compared to L = 32. On an individual level, subjects reach classification accuracies up to and above 70%, which has been stated to be effective for binary BCI paradigms [26]. In our context, discriminating these same hand gestures would correspond to using a BCI channel for a neurally controlled prosthetic hand.
Table 1.
Session-to-session learning classification accuracies for NO = 2 with SMIG using only the most discriminative CSP features extracted at each band (L = 2), or using all CSP features extracted at each band (L = 32).
|           | Session 1 to 2 |        | Session 2 to 1 |        |
|-----------|----------------|--------|----------------|--------|
|           | L = 2          | L = 32 | L = 2          | L = 32 |
| Subject 1 | 65 %           | 57 %   | 58 %           | 58 %   |
| Subject 2 | 63 %           | 66 %   | 79 %           | 79 %   |
| Subject 3 | 70 %           | 77 %   | 65 %           | 61 %   |
| Subject 4 | 64 %           | 65 %   | 54 %           | 57 %   |
| Average   | 65.5 %         | 66.2 % | 64.0 %         | 63.7 % |
4. DISCUSSION
In the present study we propose a feature extraction protocol for single-trial BCIs based on linear projection of high-dimensional features by mutual information driven stochastic gradient descent. Both the use of information theoretic learning (i.e. entropy as an explicit function of the probability density itself, carrying all higher-order statistical properties) and the non-parametric approach enable reasonable use of overlapping filter banks for a broad range of user-specific spatio-spectral features. Results on pilot experimental data demonstrate the feasibility of the approach to potentially overcome inherent confounders of state-of-the-art feature selection protocols in an online same hand gesture decoding task using EEG. In the context of online BCI-assisted prosthetic hand control, further work can exploit context-aware decision making by fusing other neurophysiological information (e.g., electromyography (EMG)) and external information (e.g., vision-induced object detection), or can use increased amounts of daily user training data for more reliable feature extraction in online use, potentially exploiting transfer learning modalities.
Acknowledgments
Our work is supported by NSF (IIS-1149570, CNS-1544895), NIDLRR (90RE5017–02-01), and NIH (R01DC009834).
REFERENCES
- [1] Ramoser H, Muller-Gerking J, and Pfurtscheller G, “Optimal spatial filtering of single trial EEG during imagined hand movement,” IEEE Transactions on Rehabilitation Engineering, vol. 8, no. 4, pp. 441–446, 2000.
- [2] Ang KK, Chin ZY, Zhang H, and Guan C, “Filter bank common spatial pattern (FBCSP) in brain-computer interface,” in IEEE International Joint Conference on Neural Networks, 2008, pp. 2390–2397.
- [3] Ang KK, Chin ZY, Zhang H, and Guan C, “Mutual information-based selection of optimal spatial–temporal patterns for single-trial EEG-based BCIs,” Pattern Recognition, vol. 45, no. 6, pp. 2137–2144, 2012.
- [4] Hamadicharef B, Zhang H, Guan C, Wang C, Phua KS, Tee KP, and Ang KK, “Learning EEG-based spectral-spatial patterns for attention level measurement,” in IEEE International Symposium on Circuits and Systems, 2009, pp. 1465–1468.
- [5] Jenke R, Peer A, and Buss M, “Feature extraction and selection for emotion recognition from EEG,” IEEE Transactions on Affective Computing, vol. 5, no. 3, pp. 327–339, 2014.
- [6] Grosse-Wentrup M and Buss M, “Multiclass common spatial patterns and information theoretic feature extraction,” IEEE Transactions on Biomedical Engineering, vol. 55, no. 8, pp. 1991–2000, 2008.
- [7] Kwak N and Choi CH, “Input feature selection by mutual information based on Parzen window,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1667–1671, 2002.
- [8] Guyon I and Elisseeff A, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, no. Mar, pp. 1157–1182, 2003.
- [9] Torkkola K, “Information-theoretic methods,” in Feature Extraction, pp. 167–185. Springer, 2008.
- [10] Erdogmus D, Ozertem U, and Lan T, “Information theoretic feature selection and projection,” in Speech, Audio, Image and Biomedical Signal Processing using Neural Networks, pp. 1–22. Springer, 2008.
- [11] Hild KE, Erdogmus D, Torkkola K, and Principe JC, “Feature extraction using information-theoretic learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1385–1392, 2006.
- [12] Torkkola K, “Feature extraction by non-parametric mutual information maximization,” Journal of Machine Learning Research, vol. 3, no. Mar, pp. 1415–1438, 2003.
- [13] Pfurtscheller G and Neuper C, “Motor imagery and direct brain-computer communication,” Proceedings of the IEEE, vol. 89, no. 7, pp. 1123–1134, 2001.
- [14] Nashmi R, Mendonça AJ, and MacKay WA, “EEG rhythms of the sensorimotor region during hand movements,” Electroencephalography and Clinical Neurophysiology, vol. 91, no. 6, pp. 456–467, 1994.
- [15] Ozdenizci O, Yalcin M, Erdogan A, Patoglu V, Grosse-Wentrup M, and Cetin M, “Electroencephalographic identifiers of motor adaptation learning,” Journal of Neural Engineering, vol. 14, no. 4, pp. 046027, 2017.
- [16] Mohamed AK, Marwala T, and John LR, “Single-trial EEG discrimination between wrist and finger movement imagery and execution in a sensorimotor BCI,” in International Conference of the IEEE EMBC, 2011, pp. 6289–6293.
- [17] Shiman F, Irastorza-Landa N, Sarasola-Sanz A, Spüler M, Birbaumer N, and Ramos-Murguialday A, “Towards decoding of functional movements from the same limb using EEG,” in International Conference of the IEEE EMBC, 2015, pp. 1922–1925.
- [18] Edelman BJ, Baxter B, and He B, “EEG source imaging enhances the decoding of complex right-hand motor imagery tasks,” IEEE Transactions on Biomedical Engineering, vol. 63, no. 1, pp. 4–14, 2016.
- [19] Salehi SSM, Moghadamfalahi M, Quivira F, Piers A, Nezamfar H, and Erdogmus D, “Decoding complex imagery hand gestures,” in International Conference of the IEEE EMBC, 2017.
- [20] Lebedev MA and Nicolelis MAL, “Brain-machine interfaces: past, present and future,” Trends in Neurosciences, vol. 29, no. 9, pp. 536–546, 2006.
- [21] Principe JC, Xu D, and Fisher J, “Information theoretic learning,” Unsupervised Adaptive Filtering, vol. 1, pp. 265–319, 2000.
- [22] Erdogmus D, Hild KE, and Principe JC, “Online entropy manipulation: Stochastic information gradient,” IEEE Signal Processing Letters, vol. 10, no. 8, pp. 242–245, 2003.
- [23] Chen B, Hu J, Li H, and Sun Z, “Adaptive filtering under maximum mutual information criterion,” Neurocomputing, vol. 71, no. 16, pp. 3680–3684, 2008.
- [24] Bruurmijn LCM, Vansteensel MJ, and Ramsey NF, “Classification of attempted and executed hand movements from ipsilateral motor cortex in amputees,” in Proceedings of the 6th International Brain-Computer Interface Meeting, 2016.
- [25] Blokland Y, Spyrou L, Bruhn J, and Farquhar J, “Why BCI researchers should focus on attempted, not imagined movement,” in Proceedings of the 6th International Brain-Computer Interface Meeting, 2016.
- [26] Edlinger G, Allison BZ, and Guger C, “How many people can use a BCI system?,” in Clinical Systems Neuroscience, pp. 33–66. Springer, 2015.



