Entropy. 2018 Jan 2;20(1):7. doi: 10.3390/e20010007

Information Theoretic Approaches for Motor-Imagery BCI Systems: Review and Experimental Comparison

Rubén Martín-Clemente 1,*, Javier Olias 1, Deepa Beeta Thiyam 1,2, Andrzej Cichocki 3,4,5, Sergio Cruces 1,*
PMCID: PMC7512284  PMID: 33265109

Abstract

Brain-computer interfaces (BCIs) have attracted great interest in recent years. The common spatial patterns (CSP) technique is a well-established approach to the spatial filtering of electroencephalogram (EEG) data in BCI applications. Even though CSP was originally proposed from a heuristic viewpoint, it can also be built on very strong foundations using information theory. This paper reviews the relationship between CSP and several information-theoretic approaches, including the Kullback–Leibler divergence, the Beta divergence and the Alpha-Beta log-det (AB-LD) divergence. We also review other approaches based on the idea of selecting those features that are maximally informative about the class labels. The performance of all the methods is also compared via experiments.

Keywords: common spatial patterns, generalized divergences, brain computer interfaces

1. Introduction

The electroencephalogram (EEG) is a record over time of the differences of potential that exist between different locations on the surface of the head [1,2]. It originates from the summation of the synchronous electrical activity of millions of neurons distributed within the cortex. In recent years, there has been a growing interest in using the EEG as a new communication channel between humans and computers. Brain-computer interfaces (BCIs) are computer-based systems that enable us to control a device with the mind, without any muscular intervention [3,4,5,6]. This technology, though not yet mature, has a number of therapeutic applications, such as the control of wheelchairs by persons with severe disabilities, but also finds use in fields as diverse as gaming, art or access control.

There are several possible approaches for designing a BCI [1,4,7]. Among them, motor imagery (MI)-based BCI systems seem to be the most promising option [6,8,9,10]. In MI-based BCI systems, the subject is asked to imagine the movement of different parts of his or her body, such as the hands or the feet. The imagined actions are then translated into different device commands (e.g., when the subject imagines the motion of the left hand, the wheelchair is instructed to turn to the left). What makes this possible is that the spatial distribution of the EEG differs between different imagined movements. More precisely, since each brain hemisphere mainly controls the opposite side of the body, the imagination of right and left limb movements produces a change of power over the contralateral left and right brain motor areas. These fluctuations, which are due to a pair of phenomena known as event-related desynchronization (ERD) or power decrease and event-related synchronization (ERS) or power increase [11,12], can be detected and converted into numerical features. By repeating the imagined actions several times, a classifier can be trained to determine which kind of motion the subject is imagining (see [13] for a review). In practice, three classes of MI are used in BCIs, namely the movements of the hands, the feet and the tongue. Left hand movement imagery is more prominent in the vicinity of the electrode C4 (see Figure 1), while right hand imagined actions are detected around electrode C3 [14]. The imagery of feet movements appears in the electrode Cz and its surrounding area; nevertheless, it is not usually possible to distinguish between left foot or right foot motor imagery because the corresponding activation areas are too close in the cortex [11,14]. Finally, imagery of tongue movements can be detected on the primary motor cortex and the premotor cortex [15]. One of the inherent difficulties of designing a BCI is that the EEG features are highly non-stationary and vary over sessions. To cope with this problem, the background state of the subject (i.e., his or her motivation, fatigue, etcetera) and the context of the experiment can be both modeled as latent variables, whose parameters can be estimated using the expectation-maximization (EM) algorithm [16,17]. Overall, current BCI approaches achieve success rates of over 90%, although much depends on the person from whom the EEG data are recorded [14].

Figure 1. Electrode locations of the international 10–20 system for EEG recording. The letters “F”, “T”, “C”, “P” and “O” stand for frontal, temporal, central, parietal and occipital lobes, respectively. Even numbers correspond to electrodes placed on the right hemisphere, whereas odd numbers refer to those on the left hemisphere. The “z” refers to electrodes placed in the midline.

The common spatial patterns (CSP) method [4,18,19,20,21,22] is a dimensionality reduction method that is widely used in BCI systems as a preprocessing step. Basically, assuming two classes of MI-EEG signals (e.g., left-hand and right-hand MI tasks), CSP projects the EEG signals onto a low-dimensional subspace that captures the variability of one of the classes while, at the same time, trying to minimize the variance of the other class. The goal is to enhance the ability of the BCI to discriminate between the different MI tasks, and it has been shown that CSP is able to reduce the dimension of the data significantly without decreasing the classification rate. It is noteworthy that CSP admits an interesting probabilistic interpretation. Under the assumption of Gaussian distributed data, CSP is equivalent to maximizing the symmetric Kullback–Leibler (KL) divergence between the probability distributions of the two classes after the projection onto the low-dimensional space [23,24]. As a generalization of this idea in the context of BCI, it is interesting to investigate the dimensionality reduction ability of other divergence-based criteria, a topic that is drawing a lot of interest in the computational neuroscience community.

The present manuscript is a review of the state of the art of information-theoretic approaches for motor-imagery BCI systems. The article is written as a guideline for researchers and developers in both the fields of information theory and BCI, and the goal is to simplify and organize the ideas. We will present a number of approaches based on the Kullback–Leibler divergence, the Beta divergence (a generalization of the Kullback–Leibler divergence) and the Alpha-Beta log-det (AB-LD) divergence (which includes as special cases Stein’s loss, the S-divergence and the Riemannian metric), as well as their relation to CSP. We will also review a technique based on the idea of selecting those features that are maximally informative about the class labels. Complementarily, for the purpose of comparison, several non-information-theoretic variants of CSP and their different regularization schemes are revised in the paper. The performance of all approaches will be evaluated and compared through simulations using both real and synthetic datasets.

The paper is organized as follows: The CSP algorithm is introduced in Section 2. Section 3 introduces the main characteristics of the Kullback–Leibler divergence, the Beta divergence and the Alpha-Beta log-det divergence, respectively, as well as their application to the problem of designing MI-BCI systems and the algorithms used to optimize them. Section 4 reviews an information-theoretic feature extraction framework. Section 5 presents, as has been said before, several extensions of CSP not based on information-theoretic principles. Finally, Section 6 presents the results of some experiments in which the performances of the above criteria are tested, in terms of their accuracy, computational burden and robustness against errors.

EEG Measurement and Preprocessing

For measuring the EEG, several different standardized electrode placement configurations exist. The most common among them is the International 10–20 system, which uses a set of electrodes placed at locations defined relative to certain anatomical landmarks (see Figure 1). The ground reference electrode is usually positioned at the ears or at the mastoid. To obtain a reference-free system, it is common practice to calculate the average of all the electrode potentials and subtract it from the measurements [1,2].

The EEG is usually contaminated by several types of noise and artifacts. Eye blinks, for example, elicit a large potential difference between the cornea and the retina that can be several orders of magnitude greater than the EEG. In the rest of the paper, it is assumed that the signals have already been pre-processed to remove noise and interference. To this end, several techniques [25], such as autoregressive modeling [26], the more complex independent component analysis (ICA) [27], or the signal space projection (SSP) method [28], have shown good or excellent results (see also [29] and the references therein). Signal preprocessing also includes the division of the EEG into several frequency bands that are analyzed separately [30,31]. The “mu” band (8–15 Hz) and the “beta” band (16–31 Hz) are particularly useful in BCIs, as they originate from the sensorimotor cortex, i.e., the area that controls voluntary movements [2].

2. The Common Spatial Pattern Criterion

In this section, we present the common spatial patterns (CSP) method [4,18,19,20,21,22,32,33]. Consider a two-class classification problem, where the EEG signals belong to exactly one of two classes or conditions (e.g., left-/right-hand movement imagination).

To fix notation, let $X_{i,k} \in \mathbb{R}^{D \times T}$ be the matrix that contains the EEG data of class $i \in \{1,2\}$ in the $k$-th trial or experiment, where $D$ is the number of channels and $T$ the number of samples in a trial. The corresponding sample covariance estimator is defined by:

$\Sigma_{i,k} = \frac{1}{T-1} X_{i,k} X_{i,k}^{\top},$ (1)

where $(\cdot)^{\top}$ denotes transposition. Here, the EEG signals are assumed to have zero mean, which is fulfilled as they are band-pass filtered (see the previous section). If $L$ trials per class are performed, the spatial covariance matrix for class $i$ is usually calculated by averaging the trial covariance matrices as:

$\Sigma_i = \frac{1}{L} \sum_{k=1}^{L} \Sigma_{i,k}$ (2)

In practice, these covariance matrices are often normalized in power with the help of the following transformation:

$\Sigma_i \leftarrow \Sigma_i / \mathrm{tr}\,\Sigma_i,$ (3)

where tr(·) denotes the trace operator.
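For concreteness, the following minimal NumPy sketch (ours, not part of any reviewed toolbox) illustrates how Equations (1)–(3) could be computed from a list of band-pass-filtered trials:

```python
import numpy as np

def trial_covariance(X):
    """Sample covariance of one (D, T) zero-mean EEG trial, Equation (1)."""
    T = X.shape[1]
    return X @ X.T / (T - 1)

def class_covariance(trials):
    """Average the trial covariances, Equation (2), and normalize the
    result in power by its trace, Equation (3)."""
    sigma = sum(trial_covariance(X) for X in trials) / len(trials)
    return sigma / np.trace(sigma)
```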

After the BCI training phase, in which matrices $\Sigma_1$ and $\Sigma_2$ are estimated using training data, suppose that a new, not previously observed, data matrix $X \in \mathbb{R}^{D \times T}$ of an imagined action is captured. The problem that arises is to develop a rule to allocate these new data to one class or the other. A useful approach is to define a weight vector $w \in \mathbb{R}^{D}$ (also known as a ‘spatial filter’) and allocate $X$ to one class if the variance of $w^{\top}X$ exceeds a certain predefined threshold and to the other if not; this relates to the fact that event-related desynchronizations and event-related synchronizations, i.e., the phenomena underlying the MI responses, are associated with power decreases/increases of the ongoing EEG activity [12].

Of course, not just any spatial filter is of value. To enhance the discrimination of the MI tasks, CSP proposes using spatial filters that maximize the variance of the band-pass filtered EEG signals in one class while, simultaneously, minimizing it for the other class. Mathematically, CSP aims at maximizing an objective function based on the following Rayleigh quotient:

$J(w) = \frac{w^{\top}\Sigma_1 w}{w^{\top}\Sigma_2 w} = \frac{\sigma_1^2}{\sigma_2^2},$ (4)

where $\sigma_i^2$ is the variance of the $i$-th projected class and $\Sigma_i$ is the covariance matrix of the $i$-th class.

It is straightforward to show that the spatial filters that hierarchically maximize (4) can be computed by solving the generalized eigenvalue problem:

$\Sigma_1 w = \lambda \Sigma_2 w.$ (5)

Each eigenvector $w_i$ gives a different solution. Observe that:

$w_i^{\top}\Sigma_1 w_i = \lambda_i\, w_i^{\top}\Sigma_2 w_i \;\Longrightarrow\; \lambda_i = \frac{w_i^{\top}\Sigma_1 w_i}{w_i^{\top}\Sigma_2 w_i} = J(w_i),$

where $\lambda_i$ is the generalized eigenvalue corresponding to $w_i$. Therefore, the larger (or smaller) the eigenvalue, the larger the ratio between the variances of the two classes and the better the discrimination accuracy of the filter.

The latter readily suggests selecting the spatial filters among the principal and the minor eigenvectors (i.e., the eigenvectors associated with the largest and smallest eigenvalues, respectively). Let:

$W_{CSP} = [w_1, \ldots, w_d] \in \mathbb{R}^{D \times d}$ (6)

be the matrix that collects these $d \ll D$ top (i.e., most discriminating) spatial filters. Given a data matrix $X \in \mathbb{R}^{D \times T}$ of observations, all of the same class, the outputs of the spatial filters are defined as:

$y_i = w_i^{\top} X, \quad i = 1, \ldots, d, \quad (d \ll D)$ (7)

which can be gathered in the $d \times T$ output matrix $Y = W_{CSP}^{\top} X$. Denoting by $\Sigma$ the sample covariance matrix of $X$, it follows that the covariance matrix of the outputs is given by $W_{CSP}^{\top} \Sigma\, W_{CSP}$, while the variance of the output of the $i$-th spatial filter is equal to $w_i^{\top}\Sigma w_i$. Finally, not the sample variances, but the log-transformed sample variances of the outputs, i.e.,

$F_i = \log\left( w_i^{\top} \Sigma\, w_i \right), \quad i = 1, \ldots, d,$ (8)

are used as features for the classification of the imagined movements. Observe that, as long as $d < D$, the dimensionality of the data is reduced.
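The whole pipeline of Equations (4)–(8) fits in a few lines of NumPy/SciPy. The sketch below is our own illustration and assumes, as is common practice, that half of the $d$ filters are taken from each end of the eigenvalue spectrum:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(sigma1, sigma2, d):
    """Solve the generalized eigenvalue problem of Equation (5) and keep
    the d most discriminating filters (largest and smallest eigenvalues)."""
    eigvals, eigvecs = eigh(sigma1, sigma2)   # eigenvalues in ascending order
    half = d // 2
    picked = np.r_[np.arange(half),
                   np.arange(len(eigvals) - (d - half), len(eigvals))]
    return eigvecs[:, picked]                 # W_CSP of shape (D, d)

def log_variance_features(W, X):
    """Features of Equation (8) for a new (D, T) trial X."""
    Y = W.T @ X                               # filter outputs, Equation (7)
    return np.log(np.var(Y, axis=1))
```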

CSP admits an interesting neurological interpretation. First note that the scalp EEG electrodes measure the addition of numerous sources of neural activity, which are spread over large areas of the neocortical surface, and this does not always allow a reliable localization of the cortical generators of the electrical potentials. It has been suggested that CSP linearly combines the EEG signals so that the sources of interest are enhanced while the others are suppressed [34].

Another interpretation of (4) may be as follows: the basic theory of principal component analysis (PCA) states that maximizing $w^{\top}\Sigma_i w$ (over unit-norm vectors $w$) finds the direction that best fits, in the least-squares sense, the data of class $i$ in the $D$-dimensional space. Similarly, minimizing this quadratic form has the opposite effect. Thus, we can interpret that CSP seeks directions that fit well the data in one class, but are not representative of the data in the other class. By projecting the EEG data onto them, a significant reduction of the variance of one of the classes, while preserving the information content of the other, can thus be obtained.

An interesting generative model perspective has been proposed in [35,36]. Here, the above data matrices are assumed to be generated by a latent variable model:

$X_i(:,k) = A\, Y_i(k) + N_i(k),$

where we have used the notation $X_i(:,k) \in \mathbb{R}^{D}$ for the $k$-th column of the data matrix $X_i$, i.e., it is the observation vector at time $k$ for class $i$, $i = 1,2$; $A \in \mathbb{R}^{D \times s}$ is a mixing matrix, the same for both classes; $Y_i(k) \sim N(0, \Gamma_i)$ is an $s$-dimensional column vector of latent variables ($s$ has to be estimated from the data) and $N_i(k) \sim N(0, \Delta_i)$ is a $D$-dimensional vector of noise, independent of the data. Here, the covariance matrices $\Gamma_i$ and $\Delta_i$ are assumed to be diagonal matrices, implying that the latent factors are also independent of each other. Under this model, the columns of matrix $A$ can be regarded as the “spatial patterns” that explain how the EEG data are formed at each electrode location, where the latent variables represent the degree to which each “spatial pattern” appears in the data. Under the assumptions that the noise is negligible and matrix $A$ is square, it is noteworthy that the CSP spatial filters are precisely the columns of the matrix $A^{-\top}$, i.e., the rows of $A^{-1}$ [35].

3. Divergence-Based Criteria

CSP produces quite good results in general, but also suffers from various shortcomings: e.g., it is sensitive to artifacts [37,38] and its performance is degraded for non-stationary data [39]. For these reasons, CSP is still an active line of research, and a number of variants have been proposed in the literature. In particular, in this paper, we are interested in reviewing CSP-variants based on an information-theoretic framework.

There is a common assumption in the literature that the classes can be modeled by multivariate Gaussian distributions with zero means and different covariance matrices. This assumption is based on the principle of maximum entropy, not on actual measurements of EEG data. By projecting the data onto the principal generalized eigenvectors, CSP transforms them onto a lower dimensional space where the variance of Class 1 is maximized, while the variance of Class 2 is minimized. Conversely, the projection onto the minor generalized eigenvectors has the opposite effect. Since a zero-mean univariate normal variable is completely determined by its variance, we can understand the ratio (4) as a measure of how much the distributions of the projected classes differ from each other (the larger the ratio between the variances, the more different the distributions). By accepting this viewpoint, it is interesting to investigate the ability of other measures of dissimilarity between statistical distributions, rather than the ratio of the corresponding variances, to help in discriminating between the classes. In fact, the most interesting features for classification often belong to those subspaces where there is a large dissimilarity between the conditional densities of the considered classes, which is another justification for proposing a divergence maximization framework in the context of MI-BCI.

In the following sections, we review the main information-theoretic-based approaches.

3.1. Criterion Based on the Symmetric Kullback–Leibler Divergence

Divergences are functions that measure the dissimilarity or separation between two statistical distributions. Given two univariate Gaussian densities $N_1(0, \sigma_1)$ and $N_2(0, \sigma_2)$, their Kullback–Leibler divergence (the KL divergence between two distributions $f_1$ and $f_2$ is defined as $\mathrm{Div}_{KL}(f_1 \| f_2) = \int f_1(x) \log \frac{f_1(x)}{f_2(x)}\, dx$) is easily found to be:

$\mathrm{Div}_{KL}\left( N_1(0,\sigma_1) \,\|\, N_2(0,\sigma_2) \right) = \frac{1}{2}\left( \log \frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} - 1 \right).$

If the densities have interchangeable roles, it is reasonable to consider the use of a symmetrized measure like the one provided by the symmetrized Kullback–Leibler (sKL) divergence. This is defined simply as:

$\mathrm{sDiv}_{KL}\left( N_1 \,\|\, N_2 \right) = \mathrm{Div}_{KL}\left( N_1 \,\|\, N_2 \right) + \mathrm{Div}_{KL}\left( N_2 \,\|\, N_1 \right) = \frac{1}{2}\left( \frac{\sigma_1^2}{\sigma_2^2} + \frac{\sigma_2^2}{\sigma_1^2} \right) - 1.$ (9)

The resemblance to the CSP criterion (4) is quite obvious, as was already noted, e.g., in [24]. In particular, note that, since $z + \frac{1}{z}$ increases when $z$ goes to either infinity or zero, (9) is maximized by either maximizing or minimizing the ratio of the variances $\sigma_1^2$ and $\sigma_2^2$.

The generalization to multivariate data is straightforward. Let $Y = W^{\top}X$, where $X$ is the observed data matrix and $W = [w_1, \ldots, w_d] \in \mathbb{R}^{D \times d}$ denotes an arbitrary matrix of spatial filters with $1 \le d \le D$. Under the assumption that the EEG data are conditionally Gaussian distributed for each class $c_k \in \{1,2\}$, i.e., $X | c_k \sim N(0, \Sigma_k)$, the spatially-filtered data also follow a normal distribution, i.e., $Y | c_k \sim N(0, \bar{\Sigma}_k)$, where:

$\bar{\Sigma}_k = W^{\top} \Sigma_k W \in \mathbb{R}^{d \times d},$

$k = 1,2$. The KL divergence between two $d$-dimensional multivariate Gaussian densities $f_1 = N_1(0, \bar{\Sigma}_1)$ and $f_2 = N_2(0, \bar{\Sigma}_2)$, that is,

$\mathrm{Div}_{KL}\left( f_1 \,\|\, f_2 \right) = \int N_1(0, \bar{\Sigma}_1) \log \frac{N_1(0, \bar{\Sigma}_1)}{N_2(0, \bar{\Sigma}_2)}\, dy,$

can be shown to be (after some algebra):

$\mathrm{Div}_{KL}\left( f_1 \,\|\, f_2 \right) = \frac{1}{2}\left[ \log \frac{|\bar{\Sigma}_2|}{|\bar{\Sigma}_1|} - d + \mathrm{trace}\left( \bar{\Sigma}_2^{-1} \bar{\Sigma}_1 \right) \right],$ (10)

where |·| stands for “determinant”. The symmetrized Kullback–Leibler (sKL) divergence between the probability distributions of the two classes is now defined as:

$\mathrm{Div}_{sKL}\left( f_1 \,\|\, f_2 \right) = \mathrm{Div}_{KL}\left( f_1 \,\|\, f_2 \right) + \mathrm{Div}_{KL}\left( f_2 \,\|\, f_1 \right) = \frac{1}{2}\, \mathrm{trace}\left[ (W^{\top}\Sigma_1 W)^{-1} (W^{\top}\Sigma_2 W) + (W^{\top}\Sigma_2 W)^{-1} (W^{\top}\Sigma_1 W) \right] - d,$

where we show again the explicit dependency on $W$.

We can naturally extend this formula to define the equivalent sKL matrix divergence:

$D_{sKL}\left( W^{\top}\Sigma_1 W \,\|\, W^{\top}\Sigma_2 W \right) = \frac{1}{2}\, \mathrm{trace}\left[ (W^{\top}\Sigma_1 W)^{-1} (W^{\top}\Sigma_2 W) + (W^{\top}\Sigma_2 W)^{-1} (W^{\top}\Sigma_1 W) \right] - d.$ (11)

It has been shown in [23] that the subspace of the filters that maximize the sKL matrix divergence,

$W_{sKL} = \arg\max_{W}\, D_{sKL}\left( W^{\top}\Sigma_1 W \,\|\, W^{\top}\Sigma_2 W \right),$ (12)

coincides with the subspace of those that maximize the CSP criterion, in the sense that the columns of $W_{sKL}$ and $W_{CSP}$ span the same subspace:

$\mathrm{span}(W_{sKL}) = \mathrm{span}(W_{CSP}),$ (13)

that is, every column of $W_{sKL}$ is a linear combination of the top spatial filters of $W_{CSP}$ and vice versa.

In practice, $W_{sKL}$ is first used to project the data onto a lower dimensional subspace, and then, $W_{CSP}$ is determined by applying CSP to the projected data. Some advantage can be gained, compared to using CSP only, if in the first step the optimization of the sKL matrix divergence is also combined with some suitable regularization scheme. For example, to fight against issues caused by the non-stationarity of the EEG data, it has been proposed to maximize the regularized objective function [23]:

$L_{sKL}(W) = (1 - \phi)\, D_{sKL}\left( W^{\top}\Sigma_1 W \,\|\, W^{\top}\Sigma_2 W \right) - \phi\, \Delta(W),$ (14)

where $0 \le \phi < 1$ and:

$\Delta(W) = \frac{1}{2L} \sum_{i=1}^{2} \sum_{k=1}^{L} \mathrm{Div}_{KL}\left( N(0, W^{\top}\Sigma_{i,k} W) \,\|\, N(0, W^{\top}\Sigma_i W) \right)$ (15)

is a regularization term, where we have assumed that $L$ trials per class have been performed and $\Sigma_{c,k}$ is the covariance matrix in the $k$-th trial of class $c \in \{1,2\}$. This regularization term enforces the transformed data in all the trials to have the same statistical distribution. Other ideas have been proposed in [23], and a related approach can be found in [40]. Observe also that (15) is defined on the basis of the KL divergence, not on its symmetrized version. The KL divergence is calculated by a formula similar to (10), giving:

$\mathrm{Div}_{KL}\left( N(0, W^{\top}\Sigma_{i,k} W) \,\|\, N(0, W^{\top}\Sigma_i W) \right) = \frac{1}{2}\left[ \log \frac{|W^{\top}\Sigma_i W|}{|W^{\top}\Sigma_{i,k} W|} - d + \mathrm{trace}\left( (W^{\top}\Sigma_i W)^{-1} (W^{\top}\Sigma_{i,k} W) \right) \right].$ (16)

The inverse of $\Sigma_{i,k}$ does not appear in (16), which makes sense if this matrix is ill-conditioned due to insufficient sample size. For this reason, the KL divergence is preferred to its symmetric counterpart. In addition, the logarithm in (16) downweights the effect of $|W^{\top}\Sigma_{i,k} W|^{-1}$ in case $\Sigma_{i,k}$ is nearly singular.
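As an illustration, a direct NumPy transcription of the Gaussian KL divergence (10)/(16) and the sKL matrix divergence (11) could look as follows (our own sketch; numerically, `np.linalg.solve` is preferred over explicit inversion):

```python
import numpy as np

def kl_gauss(S1, S2):
    """KL divergence between N(0, S1) and N(0, S2), Equations (10)/(16)."""
    d = S1.shape[0]
    S2_inv_S1 = np.linalg.solve(S2, S1)
    _, logdet = np.linalg.slogdet(S2_inv_S1)   # log(|S1| / |S2|)
    return 0.5 * (np.trace(S2_inv_S1) - d - logdet)

def skl_matrix_divergence(W, sigma1, sigma2):
    """Symmetrized KL matrix divergence of Equation (11) for filters W."""
    S1, S2 = W.T @ sigma1 @ W, W.T @ sigma2 @ W
    return kl_gauss(S1, S2) + kl_gauss(S2, S1)
```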

3.2. Criterion Based on the Beta Divergence

The beta divergence, which is a generalization of the Kullback–Leibler divergence, is an obvious alternative measure of the discrepancy between Gaussians. Given two zero-mean multivariate probability density functions $f_1(y)$ and $f_2(y)$, the beta divergence is defined for $\beta > 0$ as:

$\mathrm{Div}_{\beta}\left( f_1(y) \,\|\, f_2(y) \right) = \frac{1}{\beta} \int \left( f_1^{\beta}(y) - f_2^{\beta}(y) \right) f_1(y)\, dy - \frac{1}{\beta+1} \int \left( f_1^{\beta+1}(y) - f_2^{\beta+1}(y) \right) dy.$

As $\lim_{\beta \to 0} \frac{f_1^{\beta} - f_2^{\beta}}{\beta} = \log \frac{f_1}{f_2}$, it can be shown that the beta divergence converges to the KL divergence for $\beta \to 0$.

Let $f_1 = N(0, \bar{\Sigma}_1)$ and $f_2 = N(0, \bar{\Sigma}_2)$, with $\bar{\Sigma}_i = W^{\top}\Sigma_i W \in \mathbb{R}^{d \times d}$, $i = 1,2$, be the zero-mean Gaussian distributions of the spatially-filtered data. In this case, the symmetric beta divergence between them admits the following closed-form formula [41]:

$D_{s\beta}\left( W^{\top}\Sigma_1 W \,\|\, W^{\top}\Sigma_2 W \right) = \gamma \left( |\bar{\Sigma}_1|^{-\beta/2} + |\bar{\Sigma}_2|^{-\beta/2} - (\beta+1)^{d/2} \left[ |\bar{\Sigma}_2|^{\frac{1-\beta}{2}}\, |\beta \bar{\Sigma}_1 + \bar{\Sigma}_2|^{-1/2} + |\bar{\Sigma}_1|^{\frac{1-\beta}{2}}\, |\beta \bar{\Sigma}_2 + \bar{\Sigma}_1|^{-1/2} \right] \right),$ (17)

where $\gamma = \frac{1}{\beta} \frac{1}{\sqrt{(2\pi)^{\beta d} (\beta+1)^{d}}}$. Observe that $D_{s\beta}$ is somewhat protected against possible large increases in the elements of $\Sigma_1$ or $\Sigma_2$ caused by outliers or estimation errors. For example, if $\Sigma_i$ (resp. $\bar{\Sigma}_i$) grows, $i \in \{1,2\}$, then the contribution of all the terms containing $\Sigma_i$ (resp. $\bar{\Sigma}_i$) in (17) tends to vanish. Compared with the previous case, if $\Sigma_1$ (for example) increases, then the term:

$\mathrm{trace}\left( (W^{\top}\Sigma_2 W)^{-1} (W^{\top}\Sigma_1 W) \right)$

may dominate (11).

With the necessary changes of divergences being made, the regularizing framework previously defined by Equations (14) and (15) can be easily adapted to the present case [23]. It has been argued in [23] that small values of β penalize abrupt changes in the covariance matrices caused by single extreme events, such as artifacts, whereas a large β is more suitable to penalize the gradual changes over the dataset from trial to trial.

Alternatively, supposing that $L$ trials per class are performed, it has also been proposed in [41] to use as the objective function the sum of trial-wise divergences:

$\bar{D}_{s\beta}(W) = \sum_{i=1}^{L} D_{s\beta}\left( W^{\top}\Sigma_{1,i} W \,\|\, W^{\top}\Sigma_{2,i} W \right),$

where $\Sigma_{1,i}$ and $\Sigma_{2,i}$ are the covariance matrices in the $i$-th trial of Class 1 and Class 2, respectively.
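The closed form (17) translates directly into code. The following sketch (ours; it assumes well-conditioned inputs and moderate dimensions, since it works with raw determinants) evaluates the symmetric beta divergence between the two projected classes:

```python
import numpy as np

def symmetric_beta_divergence(S1, S2, beta):
    """Symmetric beta divergence between N(0, S1) and N(0, S2), Eq. (17)."""
    d = S1.shape[0]
    det = np.linalg.det
    gamma = 1.0 / (beta * np.sqrt((2 * np.pi) ** (beta * d) * (beta + 1) ** d))
    bracket = (det(S1) ** (-beta / 2) + det(S2) ** (-beta / 2)
               - (beta + 1) ** (d / 2)
               * (det(S2) ** ((1 - beta) / 2) / np.sqrt(det(beta * S1 + S2))
                  + det(S1) ** ((1 - beta) / 2) / np.sqrt(det(beta * S2 + S1))))
    return gamma * bracket
```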

3.3. Criterion Based on the Alpha-Beta Log-Det Divergence

Given the covariance matrices of each class, $\Sigma_1$ and $\Sigma_2$, an extension of the Kullback–Leibler symmetric matrix divergence given in Equation (11) is the Alpha-Beta log-det (AB-LD) divergence, defined as [42,43]:

$D_{LD}^{(\alpha,\beta)}\left( \Sigma_1 \,\|\, \Sigma_2 \right) = \frac{1}{\alpha\beta} \log \left| \frac{\alpha\, (\Sigma_2^{-1/2} \Sigma_1 \Sigma_2^{-1/2})^{\beta} + \beta\, (\Sigma_2^{-1/2} \Sigma_1 \Sigma_2^{-1/2})^{-\alpha}}{\alpha + \beta} \right|_{+} \quad \text{for } \alpha \neq 0,\ \beta \neq 0,\ \alpha + \beta \neq 0,$ (18)

where:

$|x|_{+} = \begin{cases} x, & x \geq 0, \\ 0, & x < 0, \end{cases}$

denotes the non-negative truncation operator. For the singular cases, the definition becomes:

$D_{LD}^{(\alpha,\beta)}\left( \Sigma_1 \,\|\, \Sigma_2 \right) = \begin{cases} \frac{1}{\alpha^2}\left[ \mathrm{tr}\left( (\Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2})^{\alpha} - I \right) - \alpha \log \left| \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2} \right| \right] & \text{for } \alpha \neq 0,\ \beta = 0, \\[4pt] \frac{1}{\beta^2}\left[ \mathrm{tr}\left( (\Sigma_2^{-1/2} \Sigma_1 \Sigma_2^{-1/2})^{\beta} - I \right) - \beta \log \left| \Sigma_2^{-1/2} \Sigma_1 \Sigma_2^{-1/2} \right| \right] & \text{for } \alpha = 0,\ \beta \neq 0, \\[4pt] \frac{1}{\alpha^2} \log \left| (\Sigma_2^{-1/2} \Sigma_1 \Sigma_2^{-1/2})^{\alpha} \left( I + \log (\Sigma_2^{-1/2} \Sigma_1 \Sigma_2^{-1/2})^{\alpha} \right)^{-1} \right|_{+} & \text{for } \alpha = -\beta \neq 0, \\[4pt] \frac{1}{2} \left\| \log\left( \Sigma_2^{1/2} \Sigma_1^{-1} \Sigma_2^{1/2} \right) \right\|_F^2 & \text{for } \alpha = \beta = 0. \end{cases}$ (19)

It can be easily checked that $D_{LD}^{(\alpha,\beta)}(\Sigma_1 \| \Sigma_2) = 0$ iff $\Sigma_1 = \Sigma_2$. The interest in the AB-LD divergence is motivated by the fact that, as can be observed in Figure 2, it generalizes several existing log-det matrix divergences, such as Stein’s loss (the Kullback–Leibler matrix divergence), the S-divergence, the Alpha and Beta log-det families of divergences and the geodesic distance between covariance matrices (the squared Riemannian metric), among others [43].

Figure 2. Illustration of the Alpha-Beta log-det (AB-LD) divergence $D_{LD}^{(\alpha,\beta)}(\Sigma_1 \| \Sigma_2)$ in the $(\alpha,\beta)$-plane. Note that the position of each divergence is specified by the value of the hyperparameters $(\alpha, \beta)$. This parameterization smoothly connects several positive definite matrix divergences, such as the squared Riemannian metric $(\alpha = 0, \beta = 0)$, the KL matrix divergence or Stein’s loss $(\alpha = 1, \beta = 0)$, the dual KL matrix divergence $(\alpha = 0, \beta = 1)$ and the S-divergence $(\alpha = \frac{1}{2}, \beta = \frac{1}{2})$, among others.
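Since the AB-LD divergence depends on $\Sigma_2^{-1/2}\Sigma_1\Sigma_2^{-1/2}$ only through its eigenvalues, which coincide with the generalized eigenvalues of the pencil $(\Sigma_1, \Sigma_2)$, the general case (18) can be evaluated with a single generalized eigendecomposition. A possible sketch (ours) follows; for $\alpha, \beta > 0$, the argument of the logarithm is at least one by the weighted AM-GM inequality, so the truncation $|\cdot|_+$ is inactive in that range:

```python
import numpy as np
from scipy.linalg import eigvalsh

def ab_logdet_divergence(sigma1, sigma2, alpha, beta):
    """AB log-det divergence of Equation (18), for alpha, beta != 0 and
    alpha + beta != 0, between two symmetric positive definite matrices."""
    lam = eigvalsh(sigma1, sigma2)   # eigenvalues of S2^{-1/2} S1 S2^{-1/2}
    terms = (alpha * lam ** beta + beta * lam ** (-alpha)) / (alpha + beta)
    # log|.|_+ : keep only the positive part of each factor, as in Eq. (18)
    return np.sum(np.log(np.maximum(terms, np.finfo(float).tiny))) / (alpha * beta)
```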

There is a close relationship between the AB-LD divergence criterion and CSP: it has been shown in [42] that the sequence of Courant-like minimax divergence optimization problems

$w_{\pi_i} = \arg \min_{\dim\{\mathcal{W}\} = D-i+1}\ \max_{w \in \mathcal{W}}\ D_{LD}^{(\alpha,\beta)}\left( w^{\top}\Sigma_1 w \,\|\, w^{\top}\Sigma_2 w \right), \quad i = 1, \ldots, D,$ (20)

yields spatial filters $w_{\pi_i}$ that essentially coincide (i.e., up to a permutation $\pi_i$ in the order) with the CSP spatial filters $w_i$, i.e., with the generalized eigenvectors defined by (5). The permutation ambiguity can actually be avoided if we introduce a suitable scaling $\kappa \in \mathbb{R}^{+}$ in one of the arguments of the divergence, so (20) becomes

$w_i = \arg \min_{\dim\{\mathcal{W}\} = D-i+1}\ \max_{w \in \mathcal{W}}\ D_{LD}^{(\alpha,\beta)}\left( w^{\top}\Sigma_1 w \,\|\, \kappa\, w^{\top}\Sigma_2 w \right), \quad i = 1, \ldots, D,$ (21)

where $\kappa$ is typically close to unity.

For $W = [w_1, \ldots, w_d] \in \mathbb{R}^{D \times d}$ with $1 \le d \le D$, a criterion based on the AB-LD divergence takes the following form [42]

$L_{LD}(W) = D_{LD}^{(\alpha,\beta)}\left( W^{\top}\Sigma_1 W \,\|\, \kappa\, W^{\top}\Sigma_2 W \right) - \eta \left[ P(c_1) R_1 + P(c_2) R_2 \right],$ (22)

where $P(c_1)$ and $P(c_2)$ are the prior probabilities of Class 1 and Class 2,

$R_1 = \frac{1}{L} \sum_{i=1}^{L} D_{LD}^{(\alpha,\beta)}\left( W^{\top}\Sigma_{1,i} W \,\|\, W^{\top}\Sigma_1 W \right),$ (23)
$R_2 = \frac{1}{L} \sum_{i=1}^{L} D_{LD}^{(\alpha,\beta)}\left( W^{\top}\Sigma_{2,i} W \,\|\, W^{\top}\Sigma_2 W \right),$ (24)

and where $L$ is the number of trials per class and $\Sigma_{1,i}$ and $\Sigma_{2,i}$ are the covariance matrices in the $i$-th trial of Class 1 and Class 2, respectively.

The regularization term:

$P(c_1) R_1 + P(c_2) R_2$

may be interpreted as a sort of within-class scatter measure, which is reminiscent of that used in Fisher’s linear discriminant analysis. The parameter $\eta$ thus controls the balance between the maximization of the between-class scatter and the minimization of the within-class scatter. Observe that when both classes are equiprobable, $P(c_1) = P(c_2) = 1/2$, this regularization term is the equivalent of the one defined in Equation (15).

3.4. Algorithms for Maximizing the Divergence-Based Criteria

To give some idea of what the objective functions look like, Figure 3 depicts the divergences defined in Section 3.1, Section 3.2 and Section 3.3 assuming two-dimensional data in the particular case $d = 1$ (so that the projected data are one-dimensional). These divergence-based criteria can be optimized in several ways. In practice, a two-step procedure seems convenient, in which a first “whitening” of the observed EEG data is followed by a maximization whose search space is the set of orthogonal matrices.

Figure 3. Evolution of the common spatial patterns (CSP) criterion function (blue line), the symmetrized Kullback–Leibler (sKL) divergence (red line), the symmetrized beta divergence (purple line) and the AB-LD divergence (yellow line), all of them as a function of the components of the spatial filter $w = [w_1, w_2]^{\top}$ in the two-dimensional case, where it is assumed that $\|w\|_2^2 = w_1^2 + w_2^2 = 1$. All the divergences are normalized with respect to their maximum values, and no regularization has been applied. Observe the coincidence of all the critical points. The covariance matrices were generated at random in this experiment.

The rationale is as follows. Observe first that the CSP filters, i.e., the solutions to Equation (5), which is rewritten next for the reader’s convenience,

$\Sigma_1 w = \lambda \Sigma_2 w \;\Longleftrightarrow\; \Sigma_2^{-1} \Sigma_1 w = \lambda w,$

are also the eigenvectors of the matrix $\Sigma_2^{-1}\Sigma_1$. Since this matrix is not necessarily symmetric, it follows that these eigenvectors do not form an orthogonal set. A well-posed problem can be obtained by transforming the covariance matrices $\Sigma_i$ into $\hat{\Sigma}_i \triangleq P \Sigma_i P^{\top}$, where $P \in \mathbb{R}^{D \times D}$ is chosen in such a way as to ensure the whitening of the sum of the expected sample observations, i.e.,

$P (\Sigma_1 + \Sigma_2) P^{\top} = I.$

Let $W$ be the matrix that contains the eigenvectors of $\Sigma_2^{-1}\Sigma_1$ in its columns, and let $V$ be the matrix with the eigenvectors of $\hat{\Sigma}_2^{-1}\hat{\Sigma}_1$. It can be shown that matrix $V$ is orthogonal. Furthermore,

$W = P^{\top} V \Lambda \;\Longleftrightarrow\; W^{\top} = \Lambda V^{\top} P,$

where $\Lambda$ is a diagonal matrix (up to elementary column operations) that contains scale factors. In practice, since only the directions of the spatial filters (i.e., not their magnitudes) are of interest, we can ignore the above-defined scale matrix $\Lambda$. Then, when only $d \le D$ filters are retained, it can be assumed that $W^{\top}$ can be decomposed into two components, $W^{\top} = \tilde{R} P$, that successively transform the observations. The first matrix $P \in \mathbb{R}^{D \times D}$ is chosen in such a way as to ensure the whitening of the sum of the expected sample observations, i.e., $P(\Sigma_1 + \Sigma_2)P^{\top} = I$, as was previously explained. The second transformation $\tilde{R} \in \mathbb{R}^{d \times D}$ is performed by a semi-orthogonal projection matrix, which rotates and reflects the whitened observations and projects the result onto a reduced $d$-dimensional subspace. This is better seen through the decomposition $\tilde{R} = I_d R$, where $R$ is a full-rank orthogonal matrix ($R R^{\top} = I$) and $I_d \in \mathbb{R}^{d \times D}$ is the identity matrix truncated to have only its first $d$ rows.

Let $D(\cdot \,\|\, \cdot)$ denote any of the previously-studied divergences. The above discussion suggests maximizing the criterion:

$D\left( W^{\top}\Sigma_1 W \,\|\, W^{\top}\Sigma_2 W \right) = D\left( \tilde{R} \tilde{\Sigma}_1 \tilde{R}^{\top} \,\|\, \tilde{R} \tilde{\Sigma}_2 \tilde{R}^{\top} \right) = D\left( I_d R \tilde{\Sigma}_1 R^{\top} I_d^{\top} \,\|\, I_d R \tilde{\Sigma}_2 R^{\top} I_d^{\top} \right) \triangleq J(R)$ (25)

under the constraint that $R$ is an orthogonal matrix, where $\tilde{\Sigma}_i = P \Sigma_i P^{\top}$.

Now, we face the problem of optimizing $J(R)$ under the orthogonality constraint $R R^{\top} = I$. This problem can be addressed in several ways, and here, we review two particularly significant approaches.

3.4.1. Tangent Methods

First of all, it has been shown that the gradient of J at R on the group of orthogonal matrices is given by [44,45]:

$\tilde{\nabla} J(R) = \nabla J(R) - R\, \nabla J(R)^{\top} R,$ (26)

where $\nabla J(R)$ is the matrix of partial derivatives of $J$ with respect to the elements of $R$, i.e.,

$\left( \nabla J(R) \right)_{ij} = \frac{\partial J(R)}{\partial r_{ij}},$ (27)

where $r_{ij}$ is the $(i,j)$-th entry of matrix $R$. Therefore, for a steepest ascent search, consider small deviations of $R$ in the direction of $\tilde{\nabla} J(R)$ as follows:

$R \rightarrow \bar{R} = R + \mu\, \tilde{\nabla} J(R),$ (28)

with $\mu > 0$. If $R$ is orthogonal, this update direction maintains the orthogonality condition to first order, in the sense that $\bar{R} \bar{R}^{\top} = I + O(\mu^2)$. Furthermore, since the first-order Taylor expansion of $J(R)$ is:

$J(R + \Delta R) = J(R) + \langle \nabla J(R) \,|\, \Delta R \rangle + o(\|\Delta R\|),$ (29)

where $\langle A \,|\, B \rangle = \mathrm{trace}(A^{\top} B)$ represents the inner product of two matrices, if $R$ is modified into $\bar{R}$, it follows that:

$J(\bar{R}) = J(R) + \mu\, \langle \nabla J(R) \,|\, \tilde{\nabla} J(R) \rangle + o(\mu).$ (30)

Some algebra shows that:

$\langle \nabla J(R) \,|\, \tilde{\nabla} J(R) \rangle = \frac{1}{2} \langle \tilde{\nabla} J(R) \,|\, \tilde{\nabla} J(R) \rangle,$ (31)

which is always positive, and therefore, J always increases. The steepest ascent method thus becomes:

$R_{t+1} = R_t + \mu\, \tilde{\nabla} J(R_t) = \left( I + \mu\, H(R_t) \right) R_t,$ (32)

where:

$H(R_t) = \nabla J(R_t)\, R_t^{\top} - R_t\, \nabla J(R_t)^{\top}.$ (33)

A drawback of this approach is that, as the algorithm iterates, the orthogonality constraint may be no longer satisfied. One possible solution is to re-impose the constraint from time to time by projecting R back to the constraint surface, which may be performed using an orthogonalization method such as the Gram–Schmidt technique. This approach has been used, e.g., in [42].

3.4.2. Optimization on the Lie Algebra

Alternatively, R can be forced to remain always on the constraint surface using an iteration of the form [44]:

$R_{t+1} = Q_t R_t,$ (34)

where $Q_t = \exp(M_t)$ and $M_t$ is skew-symmetric, i.e., $M_t^{\top} = -M_t$. As the exponential of a skew-symmetric matrix is always orthogonal, we ensure that $R_{t+1}$ is orthogonal as well, supposing $R_t$ to be. Technically speaking, the set of skew-symmetric matrices is called a Lie algebra, and the idea is to optimize $J$ by moving along it. As the update rule for $R$ given in (34) may also be considered as an update for $M$ from the zero matrix to its actual value $M_t$, the algorithm is as follows:

  1. Start at the zero matrix $0$.

  2. Move from $0$ to
     $M_t = \mu\, \nabla_M J \big|_{M=0},$ (35)
     where $\nabla_M J$ is the gradient of $J$ with respect to $M$ in the Lie algebra:
     $\nabla_M J = \nabla J(R)\, R^{\top} - R\, \nabla J(R)^{\top}.$ (36)

  3. Define $Q_t = \exp(M_t)$, and use it to come back into the space of the orthogonal matrices.

  4. Update $R_{t+1} = Q_t R_t$.

Note that, for small enough $\mu$, we have that $\exp(M_t) = \exp(\mu\, \nabla_M J) \approx I + \mu\, \nabla_M J$, so that (34) coincides with (32). From this viewpoint, it may seem that (34), which is used in [23,41], is superior to (32), in the sense that it includes (32) as a particular case. Nevertheless, the main drawback of (34) is that it is necessary to calculate the exponential of a matrix, which is a somewhat “tricky” operation [46]. In both approaches, the optimal value of $\mu$ can be chosen by a line search along the direction of the gradient.
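A minimal sketch of one multiplicative update of this kind is given below (our own illustration; the Euclidean gradient $\nabla J(R)$ depends on the chosen divergence and is passed in as an argument):

```python
import numpy as np
from scipy.linalg import expm

def lie_algebra_step(R, grad_J, mu):
    """One iteration of Equations (34)-(36): move along the Lie algebra of
    skew-symmetric matrices and map back to the orthogonal group.

    R      : current orthogonal matrix
    grad_J : partial derivatives of J w.r.t. the entries of R, Eq. (27)
    mu     : step size (e.g., selected by line search)
    """
    M = mu * (grad_J @ R.T - R @ grad_J.T)   # skew-symmetric gradient, Eq. (36)
    return expm(M) @ R                       # exp of skew-symmetric is orthogonal
```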

More advanced optimization techniques, like the standard quasi-Newton algorithms based on the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [24], have recently been extended to work on Riemannian manifolds [47]. The algorithm used in Section 6 for the optimization of the AB-LD divergence criterion [42], which we will denote in this paper as the Sub-LD algorithm, is based on the BFGS implementation on the Stiefel manifold of semi-orthogonal matrices [48]. Finally, note that the spatial filters can be computed all at once, yielding the so-called subspace approach, or one after the other by a sequential procedure, which is called the deflation approach. In the latter case, the problem is repeatedly solved for $d = 1$, and a projection mechanism is used to prevent the algorithms from converging to previously found solutions [23].

3.4.3. Post-Processing

Finally, it has to be pointed out that, by maximizing any divergence, we may not obtain the CSP filters, i.e., the vectors $w_i$ computed by the CSP method, but a linear combination of them [23,42]. The filters are actually determined by applying CSP to the projected data in a final step.

4. The Information Theoretic Feature Extraction Framework

Information theory can play a key role in the dimensionality reduction step that extracts the relevant subspaces for classification. Inspired by some other papers in machine learning, the authors of [49] adopted an information theoretic feature extraction (ITFE) framework based on the idea of selecting those features that are maximally informative about the class labels. Let $X$ be the $D$-dimensional random variable describing the observed EEG data. In this way, the desired spatial filters are the ones that maximize the mutual information between the output random variable $Y = w^{\top}X$ and a class random variable $C$ that represents the true intention of the BCI user, i.e.,

$w^{*} = \arg\max_{w}\, I(C;\, w^{\top}X).$ (37)

As was noted in [49], this criterion can also be linked with the minimization of an upper bound on the probability of classification error. Consider the entropy $H(C)$ and a function:

$U(\gamma) = \frac{1}{2}\left( H(C) - \gamma \right),$ (38)

which was used in [50] to obtain an upper bound for the probability of error:

$P_e \leq U\left( I(C;\, Y) \right).$ (39)

Since $U(\gamma)$ is a strictly decreasing function, the minimization of the upper bound on $P_e$ is simply obtained through the maximization of the mutual information criterion:

$J_{ITFE}(w) = I(C;\, w^{\top}X).$ (40)

Although the samples in each class are assumed to be conditionally Gaussian distributed, the evaluation of this criterion also requires one to obtain $h(w^{\top}X)$, the differential entropy of the output of the spatial filter, which is non-trivial to evaluate, and therefore, it has to be approximated. The procedure starts by choosing the scale of the filter that normalizes the random variable $w^{\top}X$ to unit variance. Assuming that $w^{\top}X$ is nearly Gaussian distributed, the differential entropy of this variable is approximated with the help of a truncated version of the Edgeworth expansion for a symmetric density [51]:

$h(w^{\top}X) \approx h_g(w^{\top}X) - \frac{1}{48}\, k_4(w^{\top}X)^2,$ (41)

where $h_g(w^{\top}X)$ denotes the entropy of a Gaussian random variable with power $E[|w^{\top}X|^2] = 1$ and $k_4(w^{\top}X)$ is the kurtosis. By expressing the kurtosis of a mixture of conditional Gaussian densities in terms of the conditional variances of the output for each class, and after substituting these values in (41), the authors of [49] arrive at the approximated mutual information criterion that they propose to maximize:

$\tilde{J}_{ITFE}(w) \triangleq -\frac{1}{2} \sum_{k=1}^{n_c} P(c_k) \log\left( w^{\top}\Sigma_k w \right) - \frac{3}{16} \left[ \sum_{k=1}^{n_c} P(c_k) \left( (w^{\top}\Sigma_k w)^2 - 1 \right) \right]^2 \approx J_{ITFE}(w),$ (42)

where $n_c$ is the number of classes and $\Sigma_k$ denotes the conditional covariance matrix of the $k$-th class.

On the one hand, for only two classes ($n_c = 2$), the exact solution of the ITFE criterion can be shown to coincide with that of CSP. On the other hand, for multiclass scenarios ($n_c > 2$), it is proposed to use a Joint Approximate Diagonalization (JAD) procedure (which is no longer exact) to obtain the independent sources of the observations and then retain only those sources that maximize the approximated mutual information with the class labels.
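For reference, the approximate criterion (42) is cheap to evaluate once the conditional covariance matrices are available. A possible sketch (ours; it assumes $w$ has already been scaled so that the overall output variance is one, as described above):

```python
import numpy as np

def itfe_criterion(w, class_covs, priors):
    """Approximate mutual information of Equation (42) for one spatial filter.

    class_covs : list of conditional covariance matrices, one per class
    priors     : class probabilities P(c_k)
    """
    v = np.array([w @ S @ w for S in class_covs])  # conditional output variances
    p = np.asarray(priors)
    return -0.5 * np.sum(p * np.log(v)) - (3.0 / 16.0) * np.sum(p * (v**2 - 1)) ** 2
```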

5. Non-Information-Theoretic Variants of CSP

In this section we review, for the purposes of comparison, some variants of CSP that are not based on information-theoretic principles. Although CSP is considered to be the most effective algorithm for the discrimination of motor imagery movements, it is also sensitive to outliers. Several approaches have been proposed to improve the robustness of the algorithm.

Using the sample estimates of the covariance matrices, the CSP criterion (4) can be rewritten as:

$\hat{J}(w) = \frac{w^{\top}\Sigma_1 w}{w^{\top}\Sigma_2 w} = \frac{w^{\top} X_1 X_1^{\top} w}{w^{\top} X_2 X_2^{\top} w} = \frac{\| w^{\top} X_1 \|_2^2}{\| w^{\top} X_2 \|_2^2},$ (43)

where $X_i$ denotes the data matrix of class $i$. Therefore, CSP is not a robust criterion, as large outliers are favored over small data values by the square in Equation (43). To fix this problem, some approaches use robust techniques for the estimation of the covariance matrices [37]. Alternatively, as presented in [52], a natural extension of CSP that eliminates the square operation, replacing it by the absolute value, is given by:

$\hat{J}_1(w) = \frac{\| w^{\top} X_1 \|_1}{\| w^{\top} X_2 \|_1}.$ (44)

This $l_1$-norm-based CSP criterion is more robust against outliers than the original $l_2$-norm-based formula (43). However, $l_1$-norm CSP does not explicitly consider the effects of other types of noise, such as those caused by ocular movements, eye blinks or muscular activity, supposing that they are not completely removed in the preprocessing step [53,54]. To take them into account, [55] added a penalty term to the denominator of the CSP-$l_2$ objective function, obtaining:

$\hat{J}_{1r}(w) = \frac{\| w^{\top} X_1 \|_2^2}{\| w^{\top} X_2 \|_2^2 + \rho\, R(w)},$ (45)

where $R(w)$ is some measure of the intraclass scattering of the filtered data in each of the classes, so the maximization of $\hat{J}_{1r}(w)$ encourages the minimization of $R(w)$, and $\rho$ is a positive tuning parameter. Finally, a generalization of the $l_1$-norm-based approach has been proposed in [56,57], which explores the use of $l_p$ norms through the following criterion:

$\hat{J}_{p}(w) = \frac{\| w^{\top} X_1 \|_p^{1/p}}{\| w^{\top} X_2 \|_p^{1/p}}.$ (46)

Other approaches for regularizing the original $l_2$-norm-based CSP algorithm include performing a robust estimation of the covariance matrices $\Sigma_i$ or adding a penalty term $\Delta$ to the objective function. With regard to the first approach, [58] proposes the use of information from various subjects as a regularization term, so the sample covariance matrices $\Sigma$ are substituted in the formulas by:

$\tilde{\Sigma} = (1 - \psi)\, \Sigma + \psi\, \frac{1}{|S|} \sum_{k \in S} \Sigma_k,$

where $S$ is a set of subjects whose data have been previously recorded, $\Sigma_k$ is the sample covariance matrix of the $k$-th subject and $\psi \in (0,1)$ is a regularization parameter. Related approaches can be found in [21,37,38,59,60,61,62,63]. Finally, in [7], the covariance matrices are estimated using data originating from specific regions of interest within the brain.

The second regularization approach consists of including a penalty term in the CSP objective function [64]. The regularized CSP objective functions can be represented as:

$\tilde{J}_1(w) = \frac{w^{\top}\Sigma_1 w}{w^{\top}\Sigma_2 w + \alpha\, \Delta(w)}$ (47)
$\tilde{J}_2(w) = \frac{w^{\top}\Sigma_2 w}{w^{\top}\Sigma_1 w + \alpha\, \Delta(w)}$ (48)

where $\alpha$ is the regularization parameter. The regularized Tikhonov-CSP approach (RTCSP) penalizes the solutions with large weights by using a penalty term $\Delta(w)$ of the form:

$\Delta(w) = \| w \|^2 = w^{\top} w.$

The filters $w$ can be computed by solving an eigenvalue problem similar to that of the standard CSP algorithm. Specifically, the stationary points of $\tilde{J}_1(w)$ verify [64]:

$(\Sigma_2 + \alpha I)^{-1} \Sigma_1 w = \lambda w.$

Similarly, the stationary points of $\tilde{J}_2(w)$ are the eigenvectors of the matrix $(\Sigma_1 + \alpha I)^{-1} \Sigma_2$. Observe that it is necessary to optimize both objective functions, as the stationary points of either of them alone maximize the variance of one class, but do not minimize the variance of the other class.
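The two regularized eigenvalue problems can be solved with the same routine used for plain CSP. A possible sketch (ours), which returns the top filters of each of the criteria (47) and (48):

```python
import numpy as np
from scipy.linalg import eigh

def tikhonov_csp_filters(sigma1, sigma2, alpha, n_per_class):
    """Tikhonov-regularized CSP: top eigenvectors of (Sigma2 + alpha*I)^{-1} Sigma1
    and of (Sigma1 + alpha*I)^{-1} Sigma2, cf. the criteria (47) and (48)."""
    I = np.eye(sigma1.shape[0])
    filters = []
    for num, den in ((sigma1, sigma2), (sigma2, sigma1)):
        _, eigvecs = eigh(num, den + alpha * I)    # eigenvalues in ascending order
        filters.append(eigvecs[:, -n_per_class:])  # keep the largest ones
    return np.hstack(filters)
```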

Finally, all the previous approaches admit the following generalization: in traditional CSP, the EEG data are usually band-pass pre-filtered using one single filter between 8 and 30 Hz, a range that covers the so-called “alpha”, “beta” and “mu” EEG bands. A straightforward extension, known as the filter bank CSP (FBCSP) technique, was proposed in [30], where the input MI-EEG signals are band-pass filtered in different frequency bands (4–8 Hz, 8–12 Hz, …, 36–40 Hz) and the CSP algorithm, or any of its variants, is applied to each band for the computation of the spatial filters. The results of all the analyses are then combined to form the final response (see Figure 4). Similar approaches have been proposed in [10,65,66]. An extension to the multiclass problem can be found in [67]. Since the optimal frequency bands can vary from subject to subject, several alternative approaches have been proposed that combine the time-frequency characteristics of the EEG data [68,69] for improving the classification accuracy and reducing the number of electrodes [70].

Figure 4. Architecture of filter bank CSP. LDA is shorthand for Linear Discriminant Analysis.

6. Experimental Results

Initially, we will test the algorithms using real datasets obtained from the BCI competition III (dataset 3a) and BCI competition IV (dataset 2a), which are publicly available at [71]. On the one hand, the dataset 3a from BCI competition III consists of EEG data acquired from three subjects (k3b, k6b and l1b) at a sampling frequency of 250 Hz using a 60-channel EEG system. In each trial, an arrow to the left, right, up or down was shown on a display for a few seconds, and in response to the stimulus, the subject was asked to respectively perform left hand, right hand, tongue and foot MI movements. The dataset consists of 90 trials per class for Subject k3b and 60 trials per class for Subjects k6b and l1b. On the other hand, the dataset 2a from BCI competition IV was acquired by using 22 channels from nine subjects (A01–A09) while also performing left hand, right hand, tongue and foot MI movements following a similar procedure. The signals were also sampled at 250 Hz and were recorded in two sessions on different days, each of them with 72 trials per each class.

For a total of four possible motor-imagery (MI) movements, $\binom{4}{2} = 6$ different combinations of pairs of MI movements (i.e., left hand-right hand, left hand-foot, left hand-tongue, right hand-foot, right hand-tongue, foot-tongue) can be formed. The experiments below consider all possible combinations: since 12 users are available and for nine of them we have recordings performed on two different days, this makes a total of $3 \times 6 + 9 \times 6 \times 2 = 126$ different experiments. We repeated each of these 126 possible experiments eight times, and the results were averaged. For each repetition, 60 trials were selected at random from each MI movement, which were split into 40 trials for training and 20 trials for testing. Additionally, in the case of the BCI competition IV, we averaged over the two sessions conducted for each user to avoid biasing the statistical tests. As a result, $3 \times 6 + 9 \times 6 \times 2/2 = 72$ averaged performance measures are finally available for each algorithm. The data have been initially bandpass filtered between the cut-off frequencies of 8–30 Hz, except before using the FBCSP method, which, as we explained in Section 5, considers four bands covering the frequency range between 4 and 40 Hz. The information of the classes in each trial is summarized by their respective covariance matrices. These matrices are estimated, normalized by their trace and used as input to the algorithms that carry out the calculation of the spatial filters prior to the MI classification, which is performed by using linear discriminant analysis (LDA).

The only parameter of the CSP algorithm is the number of spatial filters that one would like to consider. Although this number $d$ is usually fixed a priori for each dataset, it is advantageous to automatically estimate the best number of spatial filters for each user by using the combination of cross-validation and hypothesis testing proposed in [72]. Figure 5a illustrates this fact. The figure represents the scatter plot of the accuracies, expressed as a percentage, that have been respectively obtained by the CSP algorithm for a fixed value of $d = 8$ (x-axis) and for the estimation of the best value of $d$ (y-axis). These estimated accuracies have been obtained by averaging eight test samples, as explained above. The accuracies obtained for different individuals or for different pairs of conditions can reasonably be considered approximately independent and nearly Gaussian. Under this hypothesis, a one-sided paired t-test of statistical significance can be used to compare the results obtained by both alternatives. Let $\delta f(m) = f_y(m) - f_x(m)$ be the paired differences of accuracy ((y-axis value) vs. (x-axis value)) for $m = 1, \ldots, M$, where $M = 72$ is the number of samples. Then, the averaged difference is:

$\overline{\Delta f} = \frac{1}{M} \sum_{m=1}^{M} \delta f(m)$ (49)

and the unbiased estimate of its variance is:

$s^2 = \frac{s_{\Delta f}^2}{M},$ (50)

where $s_{\Delta f}^2 = \frac{1}{M-1} \sum_{m=1}^{M} \left( \delta f(m) - \overline{\Delta f} \right)^2$. Under the null hypothesis ($H_0$) that the expected performance values coincide, i.e., $E[f_y(m)] = E[f_x(m)]$, the t-statistic:

$T_{STAT} = \frac{\overline{\Delta f}}{s_{\Delta f} / \sqrt{M}}$ (51)

follows a Student’s t distribution with $M - 1$ degrees of freedom. Thus, the probability that the null hypothesis can generate a t-statistic larger than $T_{STAT}$ gives the p-value of the right-sided test:

$P_{VAL} = \mathrm{Prob}\left( t > T_{STAT} \,|\, H_0 \right).$ (52)
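This right-sided paired test is straightforward to compute with SciPy; the sketch below (ours) mirrors Equations (49)–(52):

```python
import numpy as np
from scipy import stats

def right_sided_paired_ttest(f_y, f_x):
    """Paired one-sided t-test of Equations (49)-(52): is method y better
    than method x on average, given M paired accuracy measurements?"""
    delta = np.asarray(f_y) - np.asarray(f_x)    # paired differences
    M = delta.size
    t_stat = delta.mean() / (delta.std(ddof=1) / np.sqrt(M))   # Eq. (51)
    p_val = stats.t.sf(t_stat, df=M - 1)         # Prob(t > T_STAT | H0)
    return t_stat, p_val
```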

Figure 5. Illustration of the advantages in performance of using an automatic cross-validation method to estimate the best even number of features $d$ with respect to using an a priori fixed value of $d$. The automatic method relies on the technique proposed in [72], which was implemented here using one-sided t-tests of significance instead of the original two-sided tests. (a) Scatter plot comparison of the accuracies (in percentage) obtained by the CSP algorithm for fixed $d = 8$ (x-axis) and for the automatic estimation of $d$ (y-axis); (b) histogram of the estimated best even number of features $d$.

The more positive $T_{STAT}$ is, the smaller $P_{VAL}$ becomes, i.e., the less probable it is to observe a t-statistic larger than $T_{STAT}$ under the null hypothesis. When the p-value falls below the 0.05 threshold of significance, the hypothesis of no performance improvement with the alternative procedure can be rejected, because it would correspond to a quite improbable situation. On the contrary, if the p-value of the right-sided test is above 0.05, the null hypothesis cannot be rejected.

In this particular case, the p-value of the test in Figure 5a is below 0.05; therefore, one can reject the hypothesis that the automatic estimation of d does not improve the results over the method that a priori selects d=8 filters.

We briefly name and describe below some of the implementations that optimize the already mentioned criteria for dimensionality reduction in MI-BCIs. Because of the substantially higher computational complexity of most of the alternatives to CSP (see Table 1), it is not practical to develop a specific automatic estimation procedure of the number of spatial filters for each of them. For this reason, we will consider in their implementations the same number of spatial filters that was automatically estimated for CSP.

Table 1.

Computational burden of the considered algorithms, which are sorted in increasing value of their respective execution times without using cross-validation. FBCSP, filter bank CSP; ITFE, information theoretic feature extraction.

Algorithm Time (s)
CSP 0.0017
FBCSP 0.0050
ITFE 0.3070
Sub-LD 1.0538
DivCSP 4.6696
  • CSP (see Section 2) and ITFE (see Section 4): apart from the number of spatial filters, these two methods do not have hyper-parameters to tune. Their respective algorithms have been implemented according to the specifications given in [18,49].

  • RTCSP (see Section 5): RTCSP has a regularization parameter, which has been selected by five-fold cross-validation in {0,0.1,0.2,,1}. The MATLAB implementation of this algorithm has been obtained from [73].

  • FBCSP (see Section 5): In this case, we have used a variation of the algorithm in [30]. The selected frequency bands correspond to the brainwaves theta (4–7 Hz), alpha (8–15 Hz), beta (16–31 Hz) and low gamma (32–40 Hz), where five-fold cross-validation has been used to select the best combination of these frequency bands. We extract d features from each band, where d is selected using the method in [72].

  • DivCSP (see Section 3.2 and Section 3.4). The values of $\beta$ and $\phi$ (the regularization parameter) have been selected by five-fold cross-validation, with $\beta \in [0,1]$ and $\phi \in [0,0.5]$. This divergence includes the KL divergence as a particular case when $\beta = 0$. MATLAB code of the algorithm has been downloaded from [74] and used without any modification. Optimization has been performed using the so-called subspace method (see Section 3.4).

  • Sub-LD (sub-space log-det): this algorithm, which also belongs to the class of subspace methods, is based on the criterion in [42] to maximize the Alpha-Beta log-det divergence (see Section 3.3 and Section 3.4). In this paper, the implementation of the algorithm is based on the BFGS method on the Stiefel manifold of semi-orthogonal matrices and takes as its initialization point the solution obtained by the CSP algorithm. The regularization parameter $\eta$ has been chosen by five-fold cross-validation in the range of values $(-0.2, 0.2)$, which are not far from zero. The negative values of $\eta$ favor the expansion of the clusters, while the positive values favor their contraction. For $\eta$ close to zero, the solution of this criterion should not be far from that of CSP, which improves the convergence time of the algorithm and reduces the impact of the values of $\alpha, \beta$ on the results, so both parameters have been fixed to $0.5$.

Table 1 shows the typical execution time of a single run of each algorithm, programmed in the MATLAB language, on a PC with an Intel I7-6700 CPU @ 3.4 GHz and 16 GB of RAM. The algorithms that use cross-validation for selecting the hyper-parameters need more iterations; hence, the run time has to be multiplied by the number of hyper-parameter combinations that are evaluated.

Figure 6 represents the boxplot of the accuracy of the algorithms, considering together all the combinations of the motor imagery movements from all subjects in datasets III 3a and IV 2a. The p-values shown below the box-plots of Figure 6 are all above the 5% threshold of significance, revealing that, in this experiment, one cannot reject the null hypotheses. It follows that the expected accuracies of the alternative algorithms are not significantly higher than the expected accuracies obtained with CSP. Supporting this conclusion, Figure 7 represents the specific boxplots that correspond to MI movements involving the right hand. Additionally, we have tested, in the case “left hand versus right hand”, whether the improvement obtained by using the alternative algorithms is significant or not. The accuracy in the classification and the corresponding p-values of the tests are shown in Figure 8. The results reveal that, in general and except in a few isolated cases, the null hypothesis that the other methods do not significantly improve performance over CSP cannot be discarded.

Figure 6. Comparison of the expected accuracy percentages obtained by each of the considered algorithms. The figure shows box-plot illustrations where the median is shown as a red line, while the 25% and 75% percentiles are respectively at the bottom and top of each box. Larger positive values $T_{STAT} \gg 0$ and smaller $P_{VAL} \ll 1/2$ would correspond to greater expected improvements over CSP. However, none of the p-values, which are shown below their respective box-plots, is able to attain the 5% threshold level of significance ($P_{VAL} < 0.05$), so the possible improvements cannot be claimed to be statistically significant with respect to those obtained by CSP.

Figure 7. Performance of the algorithms for different motor imagery combinations involving the right hand. (a) Right-hand versus left-hand motor imagery classification; (b) right-hand versus feet motor imagery classification; (c) right-hand versus tongue motor imagery classification.

Figure 8. Accuracy percentages and p-values for the testing of an improvement in performance over CSP when right-hand versus left-hand movement imaginations are discriminated. The results reveal that, in general and except in a few isolated cases, the null hypothesis that the other methods do not significantly improve the performance over CSP cannot be discarded. (a) Average accuracy obtained by the algorithms for each subject; (b) p-values of the t-tests that compare whether the performance of the alternative algorithms is significantly better than the one obtained by CSP. The horizontal dashed line represents the threshold level of significance of 5%.

The results of Figure 6 were obtained by choosing through cross-validation the best possible values of the different parameters of the algorithms. Figure 9 and Figure 10 show how many times each value of the parameters has been selected after cross-validation. They also show the number of times that CSP outperformed the corresponding algorithm, the number of times that the algorithm outperformed CSP and the cases in which both of them were equivalent. It must also be remarked that the alternative algorithms perform better than CSP for some subjects and MI movements.

Figure 9. Histogram of the values of the regularization parameter in the Sub-LD algorithm that have been chosen by cross-validation.

Figure 10. Histogram of the hyper-parameters of the DivCSP algorithm selected by cross-validation. (a) Case with $\beta \in [0, 0.5]$ and $\phi = 0$; (b) case with $\beta = 0.5$ and $\phi \in [0, 0.5]$.

Results on Artificially Perturbed Data

In order to study the performance of the algorithms under artificial perturbations of the datasets we have conducted two experiments. The first one consists of introducing random label changes in the real datasets, while the second one defines sample EEG covariance matrices for each condition and artificially introduces outlier covariance matrices in the training procedure to quantify the resulting deterioration in performance.

Exchanging labels of the training set at random is one of the most harmful perturbations that one can consider in a real experiment. It models the failure of the subjects to imagine the correct target MI movements due to fatigue or lack of concentration. For this experiment, we selected a subject with a relatively good performance in the absence of perturbations. Figure 11 presents the progressive degradation of the accuracy of the algorithms as the percentage of mismatched labels increases.

Figure 11. Comparison of the accuracy percentages obtained by each of the considered algorithms with respect to the percentage of mismatched labels in the training set. This experiment illustrates the deterioration of the performance of the algorithms as the percentage of randomly switched labels of the motor imagery movements increases.

In the second experiment, we have created artificial EEG data and considered the effect of adding random outliers. The artificial data were generated starting from two auxiliary covariance matrices $C_k$, $k = 1,2$, for the construction of the conditional covariance matrices of each class. These covariances were generated randomly by drawing two random Gaussian matrices $A^{(k)}$ with i.i.d. elements $a_{ij}^{(k)} \sim N(0,1)$ and forming the covariance matrices as $C_k = A^{(k)} (A^{(k)})^{\top}$, $k = 1,2$. In order to control the difficulty of the classification problem, we introduce a dissimilitude parameter $\delta \in [0,1]$ that interpolates between the two auxiliary covariance matrices as follows:

$\Sigma_1 = C_1^{1/2} \left( C_1^{-1/2} C_2\, C_1^{-1/2} \right)^{\frac{1-\delta}{2}} C_1^{1/2}$ (53)
$\Sigma_2 = C_2^{1/2} \left( C_2^{-1/2} C_1\, C_2^{-1/2} \right)^{\frac{1-\delta}{2}} C_2^{1/2}$ (54)

In this way, when $\delta = 0$, the two interpolated covariance matrices coincide, $\Sigma_1 = \Sigma_2$, and it is impossible to distinguish between them. On the contrary, when $\delta = 1$, we obtain the original randomly generated matrices $\Sigma_1 = C_1$ and $\Sigma_2 = C_2$. The matrices $\Sigma_k$ are used as the expected covariance matrices of the observations for class $k$, while the sample covariance matrices for each trial are generated from a Wishart distribution with scale matrix $\frac{1}{T} \Sigma_k$ and $T$ degrees of freedom (where $T$ denotes the trial length). The outlier matrices have been generated following a similar scheme, though interpolation is not used and the resulting covariances are scaled by a factor of five.
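A possible way to reproduce this data-generation scheme is sketched below (our own illustration; the function names and the seeding are hypothetical choices, not taken from the paper):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow
from scipy.stats import wishart

def interpolate(Ca, Cb, delta):
    """Equations (53)-(54): for delta = 0 both classes meet at the geodesic
    midpoint of (Ca, Cb); for delta = 1 the original matrix Ca is recovered."""
    Ca_h, Ca_ih = mpow(Ca, 0.5), mpow(Ca, -0.5)
    return Ca_h @ mpow(Ca_ih @ Cb @ Ca_ih, (1 - delta) / 2) @ Ca_h

def synthetic_trial_covariances(D, T, delta, n_trials, seed=0):
    """Per-trial sample covariances for the two classes, drawn from Wishart
    distributions with scale Sigma_k / T and T degrees of freedom."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((D, D)); C1 = A @ A.T
    B = rng.standard_normal((D, D)); C2 = B @ B.T
    sigma1, sigma2 = interpolate(C1, C2, delta), interpolate(C2, C1, delta)
    draw = lambda S: wishart.rvs(df=T, scale=S / T, size=n_trials, random_state=rng)
    return draw(sigma1), draw(sigma2)
```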

In our simulations with artificial data, we set the dissimilarity parameter to $\delta = 0.1$. The results obtained for different percentages of outlier covariance matrices in the training set are shown in Figure 12. One can observe that the performance deteriorates progressively, and similarly for all the methods, as the number of outliers grows, although at a slower rate than for the same percentage of mismatched labels. The parameters of the algorithms were selected by cross-validation.

Figure 12. Accuracy percentages versus the percentage of training trials with outliers in a synthetic classification experiment.

7. Conclusions

In this paper, we have reviewed several information theoretic approaches for motor-imagery BCI systems. In particular, we have focused on those based on the Kullback–Leibler divergence, the Beta divergence, the Alpha-Beta log-det divergence and information theoretic feature extraction, exploring their links with common spatial patterns, a widely used technique for spatial filtering in BCI applications. The performance of all these methods has been evaluated through experiments with real and synthetic data. In general, the results obtained on real data from BCI competitions reveal a similar performance, in terms of accuracy, for all the considered criteria. However, CSP clearly outperforms the other methods in terms of computational burden. In the case of synthetic data with outliers, the divergence-based methods with small regularization parameters slightly increase the frequency with which a better performance is obtained, although the average accuracies remain similar to those of CSP. Therefore, although the divergence-based methods are not yet a practical alternative to CSP, this line of research is still in its infancy, and these methods may hold an underlying potential for performance improvements that remains to be explored.

Acknowledgments

Part of this work was supported by the Spanish Government under MICINN project TEC2014-53103-P. Andrzej Cichocki was partially supported by the MES Russian Federation grant 14.756.31.0001. We also thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions.

Author Contributions

Rubén Martín-Clemente and Sergio Cruces collaborated in writing the paper and coordinating the study. Andrzej Cichocki critically revised the manuscript by providing inspiring comments. Javier Olias and Deepa Beeta Thiyam conducted the experimental work. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.
