eLife. 2021 Apr 30;10:e56481. doi: 10.7554/eLife.56481

EEG-based detection of the locus of auditory attention with convolutional neural networks

Servaas Vandecappelle 1,2, Lucas Deckers 1,2, Neetha Das 1,2, Amir Hossein Ansari 2, Alexander Bertrand 2, Tom Francart 1
Editor: Barbara G Shinn-Cunningham, Carnegie Mellon University, United States
PMCID: PMC8143791  PMID: 33929315

Abstract

In a multi-speaker scenario, the human auditory system is able to attend to one particular speaker of interest and ignore the others. It has been demonstrated that it is possible to use electroencephalography (EEG) signals to infer to which speaker someone is attending by relating the neural activity to the speech signals. However, classifying auditory attention within a short time interval remains the main challenge. We present a convolutional neural network-based approach to extract the locus of auditory attention (left/right) without knowledge of the speech envelopes. Our results show that it is possible to decode the locus of attention within 1–2 s, with a median accuracy of around 81%. These results are promising for neuro-steered noise suppression in hearing aids, in particular in scenarios where per-speaker envelopes are unavailable.

Research organism: Human

Introduction

In a multi-speaker scenario, the human auditory system is able to focus on just one speaker, ignoring all other speakers and noise. This situation is called the 'cocktail party problem' (Cherry, 1953). However, elderly people and people suffering from hearing loss have particular difficulty attending to one person in such an environment. In current hearing aids, this problem is mitigated by automatic noise suppression systems. When multiple speakers are present, however, these systems have to rely on heuristics such as the speaker volume or the listener's look direction to determine the relevant speaker, which often fail in practice.

The emerging field of auditory attention decoding (AAD) tackles the challenge of directly decoding auditory attention from neural activity, which may replace such unreliable and indirect heuristics. This research finds applications in the development of neuro-steered hearing prostheses that analyze brain signals to automatically decode the direction or speaker to whom the user is attending, to subsequently amplify that specific speech stream while suppressing other speech streams and surrounding noise. The desired result is increased speech intelligibility for the listener.

In a competing two-speaker scenario, it has been shown that the neural activity (as recorded using electroencephalography [EEG] or magnetoencephalography [MEG]) consistently tracks the dynamic variation of an incoming speech envelope during auditory processing, and that the attended speech envelope is typically more pronounced than the unattended speech envelope (Ding and Simon, 2012; O'Sullivan et al., 2015). This neural tracking of the stimulus can then be used to determine auditory attention. A common approach is stimulus reconstruction, where the poststimulus brain activity is used to decode and reconstruct the attended stimulus envelope (O'Sullivan et al., 2015; Pasley et al., 2012). The reconstructed envelope is then correlated with the original stimulus envelopes, and the one yielding the highest correlation is then considered to belong to the attended speaker. Other methods for attention decoding include the forward modeling approach: predicting EEG from the auditory stimulus (Akram et al., 2016; Alickovic et al., 2016), canonical correlation analysis (CCA)-based methods (de Cheveigné et al., 2018), and Bayesian state-space modeling (Miran et al., 2018).

All studies mentioned above are based on linear decoders. However, since the human auditory system is inherently nonlinear (Faure and Korn, 2001), nonlinear models (such as neural networks) could be beneficial for reliable and quick AAD. In Taillez et al., 2017, a feedforward neural network for EEG-based speech stimulus reconstruction was presented, showing that artificial neural networks are a feasible alternative to linear decoding methods.

Recently, convolutional neural networks (CNNs) have become the preferred approach for many recognition and detection tasks, in particular in the field of image classification (LeCun et al., 2015). Recent research on CNNs has also shown promising results for EEG classification: in seizure detection (Acharya et al., 2018a; Ansari et al., 2018a), depression detection (Acharya et al., 2018b), and sleep stage classification (Liu et al., 2017; Ansari et al., 2018b). In terms of EEG-based AAD, Ciccarelli et al., 2019 recently showed that a (subject-dependent) CNN using a classification approach can outperform linear methods for decision windows of 10 s.

Current state-of-the-art models are thus capable of classifying auditory attention in a two-speaker scenario with high accuracy (75–85%) over a data window with a length of 10 s, but their performance drops drastically when shorter windows are used (e.g., de Cheveigné et al., 2018; Ciccarelli et al., 2019). However, to achieve sufficiently fast AAD-based steering of a hearing aid, short decision windows (down to a few seconds) are required. This inherent trade-off between accuracy and decision window length was investigated by Geirnaert et al., 2020, who proposed a method to combine both properties into a single metric, by searching for the optimal trade-off point to minimize the expected switch duration in an AAD-based volume control system with robustness constraints. The robustness against AAD errors can be improved by using smaller relative volume changes for every new AAD decision, while the decision window length determines how often an AAD decision (volume step) is made. It was found that such systems favor short window lengths (<< 10 s) with mediocre accuracy over long windows (10–30 s) with high accuracy.

Apart from decoding which speech envelope corresponds to the attended speaker, it may also be possible to decode the spatial locus of attention. That is, not decoding which speaker is attended to, but rather which location in space. The benefit of this approach for neuro-steered auditory prostheses is that no access to the clean speech stimuli is needed. This has been investigated based on differences in EEG entropy features (Lu et al., 2018), but the performance was insufficient for practical use (below 70% for 60 s windows). However, recent research (Wolbers et al., 2011; Bednar and Lalor, 2018; Patel et al., 2018; O'Sullivan et al., 2019; Bednar and Lalor, 2020) has shown that the direction of auditory attention is neurally encoded, indicating that it could be possible to decode the attended sound position or trajectory from EEG. A few studies employing MEG have suggested that in particular the alpha power band could be tracked to determine the locus of auditory attention (Frey et al., 2014; Wöstmann et al., 2016). Another study, employing scalp EEG, found the beta power band to be related to selective attention (Gao et al., 2017).

The aim of this paper is to further explore the possibilities of CNNs for EEG-based AAD. As opposed to Taillez et al., 2017 and Ciccarelli et al., 2019, who aim to decode the attended speaker (for a given set of speech envelopes), we aim to decode the locus of auditory attention (left/right). When the locus of attention is known, a hearing aid can steer a beamformer in that direction to enhance the attended speaker.

Materials and methods

Experiment setup

The dataset used for this work was gathered previously (Das et al., 2016). EEG data was collected from 16 normal-hearing subjects while they listened to two competing speakers and were instructed to attend to one particular speaker. Every subject signed an informed consent form approved by the KU Leuven ethical committee.

The EEG data was recorded using a 64-channel BioSemi ActiveTwo system, at a sampling rate of 8192 Hz, in an electromagnetically shielded and soundproof room. The auditory stimuli were low-pass filtered with a cutoff frequency of 4 kHz and presented at 60 dBA through Etymotic ER3 insert earphones. APEX 3 was used as stimulation software (Francart et al., 2008).

The auditory stimuli comprised four Dutch stories, narrated by three male Flemish speakers (DeBuren, 2007). Each story was 12 min long and split into two parts of 6 min each. Silent segments longer than 500 ms were shortened to 500 ms. The stimuli were set to equal root-mean-square intensities and were perceived as equally loud.

The experiment was split into eight trials, each 6 min long. In every trial, subjects were presented with two parts of two different stories. One part was presented in the left ear, while the other was presented in the right ear. Subjects were instructed, via a monitor positioned in front of them, to attend to one of the two stories. The symbol '<' was shown on the left side of the screen when subjects had to attend to the story in the left ear, and the symbol '>' was shown on the right side of the screen when subjects had to attend to the story in the right ear. They did not receive instructions on where to focus their gaze.

In subsequent trials, subjects attended either to the second part of the same story (so they could follow the story line) or to the first part of the next story. After each trial, subjects completed a multiple-choice quiz about the attended story. In total, there was 8 × 6 min = 48 min of data per subject. For an example of how stimuli were presented, see Table 1. (The original experiment [Das et al., 2016] contained 12 additional trials of 2 min each, collected at the end of every measurement session. These trials were repetitions of earlier stimuli and were not used in this work.)

Table 1. First eight trials for a random subject.

Trials are numbered according to the order in which they were presented to the subject. Which ear was attended to first was determined randomly. After that, the attended ear was alternated. Presentation (dichotic/HRTF) was balanced over subjects with respect to the attended ear. Adapted from Das et al., 2016. HRTF = head-related transfer function.

| Trial | Left stimulus | Right stimulus | Attended ear | Presentation |
| --- | --- | --- | --- | --- |
| 1 | Story1, part1 | Story2, part1 | Left | Dichotic |
| 2 | Story2, part2 | Story1, part2 | Right | HRTF |
| 3 | Story3, part1 | Story4, part1 | Left | Dichotic |
| 4 | Story4, part2 | Story3, part2 | Right | HRTF |
| 5 | Story2, part1 | Story1, part1 | Left | Dichotic |
| 6 | Story1, part2 | Story2, part2 | Right | HRTF |
| 7 | Story4, part1 | Story3, part1 | Left | Dichotic |
| 8 | Story3, part2 | Story4, part2 | Right | HRTF |

The attended ear alternated over consecutive trials to get an equal amount of data per ear (and per subject), which is important to avoid the lateralization bias described by Das et al., 2016. Stimuli were presented in the same order to each subject, and either dichotically or after head-related transfer function (HRTF) filtering (simulating sound coming from ±90 deg). As with the attended ear, the HRTF/dichotic condition was randomized and balanced within and over subjects. In this work, we do not distinguish between dichotic and HRTF to ensure there is as much data as possible for training the neural network.

Data preprocessing

The EEG data was filtered with an equiripple FIR bandpass filter, and its group delay was compensated for. For use with linear models, the EEG was filtered between 1 and 9 Hz, which has been found to be an optimal frequency range for linear attention decoding (Pasley et al., 2012; Ding and Simon, 2012). For the CNN models, a broader bandwidth of 1–32 Hz was used, as Taillez et al., 2017 showed this to work better. In both cases, the maximal passband attenuation was 0.5 dB, while the stopband attenuation was 20 dB (at 0–1 Hz) and 15 dB (at 32–64 Hz). After the bandpass filtering, the EEG data was downsampled to 20 Hz (linear model) or 128 Hz (CNN). Artifacts were removed with the generic MWF-based removal algorithm described in Somers et al., 2018.
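
The sketch below illustrates this kind of filtering and downsampling pipeline in Python (the original preprocessing was implemented in MATLAB). It is a simplified, illustrative version: it downsamples first and then applies an equiripple band-pass filter at the target rate, with transition-band edges chosen for illustration rather than copied from the original design, and it does not include the MWF-based artifact removal. The default parameters correspond to the CNN configuration (1–32 Hz, 128 Hz).

```python
import numpy as np
from scipy.signal import lfilter, remez, resample_poly

def preprocess_eeg(eeg, fs_in, fs_out=128, band=(1.0, 32.0), numtaps=251):
    """Downsample multi-channel EEG and band-pass filter it with a linear-phase
    (equiripple) FIR filter, compensating the filter's group delay.

    eeg : array of shape (channels, samples), sampled at the integer rate fs_in.
    """
    # Rational-rate downsampling; resample_poly applies its own anti-alias filter.
    x = resample_poly(eeg, fs_out, int(fs_in), axis=-1)

    # Equiripple band-pass: stop below 1 Hz, pass 1-32 Hz, stop above 32 Hz.
    lo, hi = band
    taps = remez(numtaps,
                 bands=[0.0, 0.5 * lo, lo, hi, hi + 8.0, fs_out / 2],
                 desired=[0.0, 1.0, 0.0],
                 fs=fs_out)
    x = lfilter(taps, 1.0, x, axis=-1)

    # A linear-phase FIR delays the signal by (numtaps - 1) / 2 samples; shift back.
    delay = (numtaps - 1) // 2
    x = np.roll(x, -delay, axis=-1)
    x[..., -delay:] = 0.0  # discard samples wrapped around by the shift
    return x
```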

Data of each subject was divided into a training, validation, and test set. Per set, data segments were generated with a sliding window equal in size to the chosen window length and with an overlap of 50%. Data was normalized on a subject-by-subject basis, based on statistics of the training set only, and in such a way that proportions between EEG channels were maintained. Concretely, for each subject we calculated the power per channel, based on the 10% trimmed mean of the squared samples. All channels were then divided by the square root of the median of those 64 values (one for each EEG channel). Data of each subject was thus normalized based on a single (subject-specific) value.
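
As an illustration of the windowing and normalization described above, the following Python sketch cuts a subject's EEG into 50%-overlapping decision windows and computes the single per-subject scaling factor from the training set. The trimmed-mean convention (here 10% cut from each tail, via scipy.stats.trim_mean) is an assumption, as the text does not specify it.

```python
import numpy as np
from scipy.stats import trim_mean

def sliding_windows(eeg, win_len):
    """Cut (channels, samples) EEG into decision windows with 50% overlap.
    Returns an array of shape (n_windows, channels, win_len)."""
    step = win_len // 2
    starts = range(0, eeg.shape[-1] - win_len + 1, step)
    return np.stack([eeg[:, s:s + win_len] for s in starts])

def subject_scale(train_eeg, trim=0.10):
    """Single normalization constant per subject, computed on the training set only:
    the square root of the median of the 64 per-channel powers, where the power of a
    channel is a trimmed mean of its squared samples."""
    powers = np.array([trim_mean(ch ** 2, trim) for ch in train_eeg])
    return np.sqrt(np.median(powers))

# Dividing the training, validation, and test EEG of a subject by the same scale
# preserves the relative proportions between the channels:
#   scale = subject_scale(train_eeg)
#   train_eeg, val_eeg, test_eeg = train_eeg / scale, val_eeg / scale, test_eeg / scale
```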

Convolutional neural networks

A convolutional neural network (CNN) consists of a series of convolutional layers and nonlinear activation functions, typically followed by pooling layers. In convolutional layers, one or more convolutional filters slide over the data to extract local data features. Pooling layers then aggregate the output by computing, for example, the mean. Similar to other types of neural networks, a CNN is optimized by minimizing a loss function, and the optimal parameters are estimated with an optimization algorithm such as stochastic gradient descent.

Our proposed CNN for decoding the locus of auditory attention is shown in Figure 1. The input is a 64 × T matrix, where 64 is the number of EEG channels in our dataset and T is the number of samples in the decision window. (We tested multiple decision window lengths, as discussed later.) The first step in the model is a convolutional layer, indicated in blue. Five independent 64 × 17 spatio-temporal filters are shifted over the input matrix; since the first dimension of each filter equals the number of channels, each filter yields a time series of dimension 1 × T. Note that 17 samples correspond to 130 ms at 128 Hz, and 130 ms was found to be an optimal filter width – that is, longer or shorter filter widths gave a higher loss on a validation set. A rectified linear unit (ReLU) activation function is used after the convolution step.

Figure 1. CNN architecture (windows of T samples).

Figure 1.

Input: T time samples of a 64-channel EEG signal, at a sampling rate of 128 Hz. Output: two scalars that determine the attended direction (left/right). The convolution, shown in blue, considers 130 ms of data over all channels. EEG = electroencephalography, CNN = convolutional neural network, ReLU = rectified linear unit, FC = fully connected.

In the average pooling step, data is averaged over the time dimension, thus reducing each time series to a single number. After the pooling step, there are two fully connected (FC) layers. The first layer contains five neurons (one for each time series) and is followed by a sigmoid activation function, and the second layer contains two (output) neurons. These two neurons are connected to a cross-entropy loss function. Note that with only two directions (left/right), a single output neuron (coupled with a binary cross-entropy loss) would have sufficed; however, the two-output setup is easier to extend to more than two locations. The full CNN consists of approximately 5500 parameters.
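
For concreteness, the architecture can be written down as follows. This is an illustrative PyTorch re-implementation, not the authors' MATLAB/MatConvNet code; the convolution here is unpadded, which is immaterial after the average pooling over time.

```python
import torch
import torch.nn as nn

class LocusCNN(nn.Module):
    """CNN of Figure 1: spatio-temporal convolution, ReLU, average pooling over
    time, FC (5 -> 5, sigmoid) and FC (5 -> 2), roughly 5500 parameters in total.

    Input:  (batch, 1, 64, T) EEG windows sampled at 128 Hz.
    Output: (batch, 2) scores for the left/right locus of attention.
    """
    def __init__(self, n_channels=64, n_filters=5, filter_len=17):
        super().__init__()
        # Five 64 x 17 filters; each spans all channels (~130 ms at 128 Hz),
        # so each produces a single time series.
        self.conv = nn.Conv2d(1, n_filters, kernel_size=(n_channels, filter_len))
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(n_filters, n_filters)
        self.fc2 = nn.Linear(n_filters, 2)

    def forward(self, x):              # x: (batch, 1, 64, T)
        h = self.relu(self.conv(x))    # (batch, 5, 1, T - 16)
        h = h.mean(dim=(2, 3))         # average pooling over time -> (batch, 5)
        h = torch.sigmoid(self.fc1(h))
        return self.fc2(h)             # feed into nn.CrossEntropyLoss
```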

The implementation was done in MATLAB 2016b and MatConvNet (version 1.0-beta25), a CNN toolbox for MATLAB (Vedaldi and Lenc, 2015). The source code is available at https://github.com/exporl/locus-of-auditory-attention-cnn (copy archived at swh:1:rev:3e5e21a7e6072182e076f9863ebc82b85e7a01b1; Vandecappelle, 2021).

CNN training and evaluation

The model was trained on data of all subjects, including the subject it was tested on (but without using the same data for both training and testing). This means we are training a subject-specific decoder, where the data of the other subjects can be viewed as a regularization or data augmentation technique to avoid overfitting on the (limited) amount of training data of the subject under test.

To prevent the model from overfitting to one particular story, we cross-validated over the four stories (resulting in four folds). That is, we held out one story and trained on the remaining three stories (illustrated in Table 2). Such overfitting is not an issue for simple linear models, but may be an issue for the CNN we propose here. Indeed, even showing only the EEG responses to a part of a story could result in the model learning certain story-specific characteristics, which could then lead to overly optimistic results when the model is presented with the EEG responses to a different part of the same story. Similarly, since each speaker has their own 'story-telling' characteristics (e.g., speaking rate or intonation) and a distinct voice timbre, EEG responses to different speakers may differ. Therefore, it is possible that the model gains an advantage by having 'seen' the EEG response to a specific speaker, so we retained only the folds in which the same speaker was never part of both the training and the test set. In the end, only two folds remained (see Table 2). We refer to this combined cross-validation approach as leave-one-story+speaker-out.

Table 2. Cross-validating over stories and speakers.

With the current dataset, there are only two folds that do not mix stories and speakers across training and test sets. Top: story 1 as test data; stories 2, 3, and 4 as training and validation data (85/15% division, per story). Bottom: similar, but now with a different story and speaker as test data. In both cases, the story and speaker are completely unseen by the model. The model is trained on the same training set for all subjects and tested on a unique, subject-specific test set.

| Story | Speaker | Subject 1 | Subject 2 | … | Subject 16 |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | test | test | … | test |
| 2 | 2 | train/val | train/val | … | train/val |
| 3 | 3 | train/val | train/val | … | train/val |
| 4 | 3 | train/val | train/val | … | train/val |

| Story | Speaker | Subject 1 | Subject 2 | … | Subject 16 |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | train/val | train/val | … | train/val |
| 2 | 2 | test | test | … | test |
| 3 | 3 | train/val | train/val | … | train/val |
| 4 | 3 | train/val | train/val | … | train/val |

In an additional experiment, we investigated the subject dependency of the model, where, in addition to cross-validating over story and speaker, we also cross-validated over subjects. That is, we no longer trained and tested on N subjects, but instead trained on N-1 subjects and tested on the held-out subject. Such a paradigm has the advantage that new users do not require a potentially expensive and time-consuming data collection and retraining stage, making it more suitable for real-life applications. Whether it is actually a better choice than subject-specific retraining depends on the difference in performance between the two paradigms. If the difference is sufficiently large, subject-dependent retraining might be a price one is willing to pay.

We trained the network by minimizing the cross-entropy between the network outputs and the corresponding labels (the attended ear). We used mini-batch stochastic gradient descent with an initial learning rate of 0.09 and a momentum of 0.9. We applied a step decay learning schedule that decreased the learning rate after epochs 10 and 35 to 0.045 and 0.0225, respectively, to ensure convergence. The batch size was set to 20, partly because of memory constraints, and partly because we did not see much improvement with larger batch sizes. Weights and biases were initialized by drawing randomly from a normal distribution with a mean of 0 and a standard deviation of 0.5. Training ran for 100 epochs, as early experiments showed that the optimal decoder was usually found between epochs 70 and 95. Regularization consisted of weight decay with a value of 5 × 10⁻⁴, and, after training, of selecting the decoder from the iteration at which the validation loss was minimal. Note that the addition of data of the other subjects can also be viewed as a regularization technique that further reduces the risk of overfitting.
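
The training procedure described above can be sketched as follows (again in PyTorch rather than the original MatConvNet; `train_ds` and `val_ds` are assumed to be datasets yielding (EEG window, label) pairs). The hyperparameter values follow the text; everything else is illustrative.

```python
import copy
import torch
from torch.utils.data import DataLoader

def train_model(model, train_ds, val_ds, epochs=100, batch_size=20):
    """Mini-batch SGD with momentum, step-decay learning rate, weight decay,
    and selection of the model with the lowest validation loss."""
    # Draw all weights and biases from N(0, 0.5).
    for p in model.parameters():
        torch.nn.init.normal_(p, mean=0.0, std=0.5)

    opt = torch.optim.SGD(model.parameters(), lr=0.09, momentum=0.9,
                          weight_decay=5e-4)
    # Halve the learning rate after epochs 10 and 35: 0.09 -> 0.045 -> 0.0225.
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[10, 35], gamma=0.5)
    loss_fn = torch.nn.CrossEntropyLoss()
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=batch_size)

    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        sched.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)
        if val_loss < best_loss:  # keep the decoder with the lowest validation loss
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)
    return model
```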

All hyperparameters given above were determined by running a grid search over a set of reasonable values. Performance during this grid search was measured on the validation set.

Note that in this work the decoding accuracy is defined as the percentage of correctly classified decision windows on the test set, averaged over the two folds mentioned earlier (one for each story narrated by a different speaker).

Linear baseline model (stimulus reconstruction)

A linear stimulus reconstruction model (Biesmans et al., 2017) was used as baseline. In this model, a spatio-temporal filter was trained and applied to the EEG data and its time-shifted versions up to a 250 ms delay, based on least-squares regression, in order to reconstruct the envelope of the attended stimulus. The reconstructed envelope was then correlated (Pearson correlation coefficient) with each of the two speaker envelopes over a data window with a predefined length, denoted as the decision window (different lengths were tested). The classification was made by selecting the position corresponding to the speaker that yielded the highest correlation in this decision window. The envelopes were calculated with the 'powerlaw subbands' method proposed by Biesmans et al., 2017; that is, a gammatone filter bank was used to split the speech into subbands, and per subband the envelope was calculated with a power law compression with exponent 0.6. The different subbands were then added together (each with a coefficient of 1) to form the broadband envelope. Envelopes were filtered and downsampled in the same way as the EEG recordings.
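
A minimal Python sketch of such a backward (stimulus reconstruction) decoder is given below. It is not the exact implementation of Biesmans et al., 2017: the optional ridge term, the lag convention (6 taps of 50 ms at 20 Hz, i.e., delays of 0–250 ms), and the edge handling are illustrative choices.

```python
import numpy as np

def build_lagged(eeg, n_lags=6):
    """Time-lagged EEG matrix: row t holds all channels at times t, t+1, ...,
    t+n_lags-1 (the neural response lags the stimulus).
    Returns shape (samples, channels * n_lags)."""
    n_ch, n_samp = eeg.shape
    X = np.zeros((n_samp, n_ch * n_lags))
    for lag in range(n_lags):
        X[:n_samp - lag, lag * n_ch:(lag + 1) * n_ch] = eeg[:, lag:].T
    return X

def train_decoder(eeg, attended_env, n_lags=6, ridge=0.0):
    """Least-squares spatio-temporal filter reconstructing the attended envelope."""
    X = build_lagged(eeg, n_lags)
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ attended_env)

def classify_window(decoder, eeg_win, env_left, env_right, n_lags=6):
    """Reconstruct the envelope in one decision window and pick the side whose
    speaker envelope correlates best with the reconstruction."""
    rec = build_lagged(eeg_win, n_lags) @ decoder
    corr_left = np.corrcoef(rec, env_left)[0, 1]
    corr_right = np.corrcoef(rec, env_right)[0, 1]
    return "left" if corr_left > corr_right else "right"
```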

For a fairer comparison with the CNN, the linear model was also trained in a leave-one-story+speaker-out way. In contrast to the CNN, however, the linear model was not trained on any other data than that of the subject under testing, since including data of other subjects harms the performance of the linear model.

Note that the results of the linear model here merely serve as a representative baseline, and that a comparison between the two models should be treated with care – in part because the CNN is nonlinear, but also because the linear model is only able to relate the EEG to the envelopes of the recorded audio, while the CNN is free to extract any feature it finds optimal (though only from the EEG, as no audio is given to the CNN). Additionally, the preprocessing is slightly different for the two models. However, that preprocessing was chosen such that each model would perform optimally – using the same preprocessing would, in fact, negatively impact one of the two models.

Minimal expected switch duration

For some of the statistical tests below, we use the minimal expected switch duration (MESD) proposed by Geirnaert et al., 2020 as a relevant metric to assess AAD performance. The goal of the MESD metric is to have a single value as measure of performance, resolving the trade-off between accuracy and the decision window length. The MESD was defined as the expected time required for an AAD-based gain control system to reach a stable volume switch between both speakers, following an attention switch of the user. The MESD is calculated by optimizing a Markov chain as a model for the volume control system, which uses the AAD decision time and decoding accuracy as parameters. As a by-product, it provides the optimal volume increment per AAD decision.

One caveat is that the MESD metric assumes that all decisions are taken independently of each other, but this may not be true when the window length is very small, for example, smaller than 1 s. In that case, the model behind the MESD metric may slightly underestimate the time needed for a stable switch to occur. However, it can still serve as a useful tool for comparing models.

Results

Decoding performance

Seven different decision window lengths were tested: 10, 5, 2, 1, 0.5, 0.25, and 0.13 s. The decision window defines the amount of data that is used to make a single left/right decision. In the AAD literature, decision windows range from approximately 60 s down to 5 s. In this work, the focus lies on shorter decision windows. This is done for practical reasons: in neuro-steered hearing aid applications, the detection time should ideally be short enough to quickly detect attention switches of the user.

To capture the general performance of the CNN, the reported accuracy for each subject is the mean accuracy of 10 different training runs of the model, each with a different (random) initialization. All MESD values in this work are based on these mean accuracies.

The linear model was not evaluated at a decision window length of 0.13 s since its kernel has a width of 0.25 s, which places a lower bound on the possible decision window length.

Figure 2 shows the decoding accuracy at 1 and 10 s for the CNN and the linear model. For both decision windows, the CNN had a higher median decoding accuracy, but a larger intersubject variability. Two subjects had a decoding accuracy lower than 50% at a window length of 10 s, and were therefore not considered in the subsequent analysis, nor are they shown in the figures in this section.

Figure 2. Auditory attention detection performance of the CNN for two different window lengths.

Figure 2.

Linear decoding model shown as baseline. Blue dots: per-subject results, averaged over two test stories. Gray lines: same subjects. Red triangles: median accuracies. CNN = convolutional neural network.

For 1 s decision windows, a Wilcoxon signed-rank test yielded significant differences in detection accuracy between the linear decoder model and the CNN (W = 3, p < 0.001), with an increase in median accuracy from 58.1 to 80.8%. Similarly, for 10 s decision windows, a Wilcoxon signed-rank test showed a significant difference between the two models (W = 16, p = 0.0203), with the CNN achieving a median accuracy of 85.1% compared to 75.7% for the linear model.
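
For reference, such paired comparisons of per-subject accuracies can be run with scipy's Wilcoxon signed-rank test, as sketched below; the accuracy values in the example are placeholders, not the per-subject results of this study.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-subject accuracies (%) at the 1 s decision window, one value
# per retained subject and per model; substitute the actual per-subject results.
acc_cnn = np.array([82.1, 79.5, 91.0, 64.2, 85.3, 77.8, 88.4,
                    70.1, 93.2, 61.5, 84.0, 76.3, 89.9, 68.7])
acc_linear = np.array([59.0, 57.2, 63.1, 55.4, 60.8, 56.9, 62.3,
                       54.1, 64.0, 53.7, 61.2, 58.5, 60.0, 55.9])

stat, p = wilcoxon(acc_cnn, acc_linear)  # paired, two-sided by default
print(f"W = {stat:.0f}, p = {p:.4g}")
```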

The minimal expected switch duration (MESD) (Geirnaert et al., 2020) outputs a single number for each subject, given a set of window lengths and corresponding decoding accuracies. This allows for a direct comparison between the linear and the CNN model, independent of window length. As shown in Figure 3, the linear model achieves a median MESD of 22.6 s, while the CNN achieves a median MESD of only 0.819 s. A Wilcoxon signed-rank test shows this difference to be significant (W = 105, p < 0.001). The extremely low MESD for the CNN is the result of the median accuracy still being 68.7% at only 0.13 s, and the fact that the MESD typically chooses the optimal operation point at short decision window lengths (Geirnaert et al., 2020).

Figure 3. Minimal expected switch durations (MESDs) for the CNN and the linear baseline.

Figure 3.

Dots: per-subject results, averaged over two test stories. Gray lines: same subjects. Vertical black bars: median MESD. As before, two poorly performing subjects were excluded from the analysis. CNN = convolutional neural network.

It is not entirely clear why the CNN fails for 2 of the 16 subjects. Our analysis shows that the results depend heavily on the story that is being tested on: for the two subjects with accuracies below 50%, the CNN performed poorly on stories 1 and 2, but performed well on stories 3 and 4 (80% and higher). Our results are based on stories 1 and 2, however, since stories 3 and 4 are narrated by the same speaker and we wanted to avoid having the same speaker in both the training and test set. It is possible that the subjects did not comply with the task in these conditions.

Effect of decision window length

Shorter decision windows contain less information and should therefore result in poorer performance compared to longer decision windows. Figure 4 visualizes the relation between window length and detection accuracy.

Figure 4. Auditory attention detection performance as a function of the decision window length.

Figure 4.

Blue dots: per-subject results, averaged over two test stories. Gray lines: same subjects. Red triangles: median accuracies. CNN = convolutional neural network.

A linear mixed-effects model fit for decoding accuracy, with decision window length as fixed effect and subject as random effect, shows a significant effect of window length for both the CNN model (df = 96, p < 0.001) and the linear model (df = 94, p < 0.001). The analysis was based on the decision window lengths shown in Figure 4; that is, seven window lengths for the CNN and six for the linear model.
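
A model of this form can be fitted, for example, with statsmodels, as in the sketch below. The data frame here is filled with synthetic accuracies purely to make the example runnable; the actual analysis used the per-subject accuracies at the window lengths shown in Figure 4.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic long-format data: one row per subject x decision-window length.
rng = np.random.default_rng(0)
win_lengths = [0.13, 0.25, 0.5, 1, 2, 5, 10]
rows = []
for subj in range(14):                      # 14 retained subjects (illustrative)
    base = rng.uniform(65, 90)              # hypothetical subject-specific level
    for w in win_lengths:
        rows.append({"subject": f"S{subj:02d}", "win_len": w,
                     "accuracy": base + 1.5 * np.log10(w) + rng.normal(0, 3)})
df = pd.DataFrame(rows)

# Decision window length as fixed effect, subject as random intercept.
fit = smf.mixedlm("accuracy ~ win_len", df, groups=df["subject"]).fit()
print(fit.summary())
```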

Interpretation of results

Interpreting the mechanisms behind a neural network remains a challenge. In an attempt to understand which frequency bands of the EEG the network uses, we retested (without retraining) the model in two ways: (1) by filtering out a certain frequency range (Figure 5, left); (2) by filtering out everything except a particular frequency range (Figure 5, right). The frequency ranges are defined as follows: δ = 1–4 Hz; θ = 4–8 Hz; α = 8–14 Hz; β = 14–32 Hz.
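
The paper does not specify which filters were used for this retest; the sketch below shows one way to implement the two variants with zero-phase Butterworth filters before re-evaluating the already-trained model (the `evaluate` call in the final comment is a hypothetical helper).

```python
from scipy.signal import butter, sosfiltfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14), "beta": (14, 32)}

def remove_band(eeg, band, fs=128, order=4):
    """Return the EEG with one frequency band suppressed (band-stop filter)."""
    sos = butter(order, BANDS[band], btype="bandstop", fs=fs, output="sos")
    return sosfiltfilt(sos, eeg, axis=-1)

def keep_band(eeg, band, fs=128, order=4):
    """Return the EEG containing only one frequency band (band-pass filter)."""
    sos = butter(order, BANDS[band], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, eeg, axis=-1)

# The trained model is then re-tested, without retraining, on the filtered EEG,
# e.g. acc_beta_only = evaluate(model, keep_band(test_eeg, "beta")).
```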

Figure 5. Auditory attention detection performance of the CNN when one particular frequency band is removed (left) and when only one band is used (right).

Figure 5.

The original results are also shown for reference. Each box plot contains results for all window lengths and for the two test stories.

Figure 5 shows that the CNN uses mainly information from the beta band, in line with Gao et al., 2017. Note that the poor results for the other frequency bands (Figure 5, right) do not necessarily mean that the network does not use those bands, but rather that, if it does, it uses them in combination with other bands.

We additionally investigated the weights of the filters of the convolutional layer, as they give an indication of which channels the model finds important. We calculated the power of the filter weights per channel, and to capture the general trend, we calculated a grand average over all models (i.e., all window lengths, stories, and runs). Moreover, we normalized the results with the per-channel power of the EEG in the training set, to account for the fact that what comes out of the convolutional layer is a function of both the filter weights and the magnitude of the input.
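
The computation can be sketched as follows. One ambiguity is how the filter-weight power is 'normalized with' the EEG channel power; the sketch below multiplies the two, which estimates the contribution of each channel to the convolution output, but this interpretation is an assumption on our part.

```python
import numpy as np

def channel_importance(conv_weights_list, train_eeg):
    """Grand-average per-channel power of the convolutional filter weights,
    combined with the per-channel power of the training EEG.

    conv_weights_list : list of arrays of shape (n_filters, 64, filter_len),
                        one per trained model (window length x story x run).
    train_eeg         : array of shape (64, samples) with the training EEG.
    Returns one value per channel, to be visualized as a topographic map."""
    # Power of the filter weights per channel, averaged over filters and taps,
    # then grand-averaged over all trained models.
    per_model = [np.mean(w ** 2, axis=(0, 2)) for w in conv_weights_list]
    weight_power = np.mean(per_model, axis=0)        # shape (64,)
    # Per-channel power of the EEG the filters operate on (assumed combination:
    # multiplication, i.e., the channel's contribution to the convolution output).
    eeg_power = np.mean(train_eeg ** 2, axis=-1)     # shape (64,)
    return weight_power * eeg_power
```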

The results are shown in Figure 6. We see activations primarily in the frontal and temporal regions, and to a lesser extent also in the occipital lobe. Activations also appear to be slightly stronger on the right side. This result is in line with Ciccarelli et al., 2019, who also saw stronger activations in the frontal channels (mostly for the 'Wet 18 CH' and 'Dry 18 CH' systems). Additionally, Gao et al., 2017 also found the frontal channels to differ significantly from the other channels within the beta band (Figure 3 and Table 1 in Gao et al., 2017). The prior (eye) artifact removal step in the EEG preprocessing and the importance of the beta band in the decision-making (Figure 5) suggest that the focus on the frontal channels is not necessarily attributable to eye artifacts. Note that the filters of the network act as backward decoders, and care should therefore be taken when interpreting topoplots related to the decoder coefficients. As opposed to a forward (encoding) model, the coefficients of a backward (decoding) model are not necessarily predictive of the strength of the neural response in those channels. For example, the network may perform an implicit noise reduction transformation, thereby also involving channels with a low SNR.

Figure 6. Grand-average topographic map of the normalized power of convolutional filters.

Figure 6.

Effect of validation procedure

In all previous results, we used a leave-one-story+speaker-out scheme to prevent the CNN from gaining an advantage by already having seen EEG responses elicited by the same speaker or different parts of the same story. However, it is noted that in the majority of the AAD literature, training and test sets often do contain samples from the same speaker or story (albeit from different parts of the story).

To investigate the impact of cross-validating over speaker and story, we trained the CNN again, but this time using data of every trial (later referred to as 'Every trial'). Here, the training set consisted of the first 70% of each trial, the validation set of the next 15%, and the test set of the last 15%. We performed this experiment twice – once using data preprocessed in the manner explained in the 'Data preprocessing' section, and once with the artifact removal (MWF) stage excluded.

Figure 7 shows the results of all three experiments for decision windows of 1 s. Other window lengths show similar results.

Figure 7. Impact of the model validation strategy on the performance of the CNN (decision windows of 1 s).

Figure 7.

In Leave-one-story+speaker-out, the training set does not contain examples of the speakers or stories that appear in the test set. In Every trial (unprocessed), the training, validation, and test sets are extracted from every trial (although always disjoint), and no spatial filtering takes place. In Every trial (per-trial MWFs), data is again extracted from every trial, but this time per-trial MWF filters are applied. CNN = convolutional neural network.

For decision windows of 1 s, using data from all trials, in addition to applying a per-trial MWF filter, results in a median decoding accuracy of 92.8% (Figure 7, right), compared to only 80.8% when leaving out both story and speaker (Figure 7, left). A Wilcoxon signed-rank test shows this difference to be significant (W = 91, p = 0.0134). There is, however, no statistically significant difference in decoding accuracy between leaving out both story and speaker and when using data of all trials, but without applying any spatial filtering for artifact removal (W = 48, p = 0.8077).

It appears that having the same speaker and story in both the training and test set is less problematic than we had anticipated: the classical scheme, in which both sets draw from the same trials (though from different parts), is acceptable, provided the data is preprocessed in a trial-independent way.

Subject-independent decoding

In a final experiment, we investigated how well the CNN performs on subjects that were not part of the training set. Here, the CNN is trained on N – 1 subjects and tested on the held-out subject – but still in a leave-one-story+speaker-out manner, as before. The results are shown in Figure 8. For windows of 1 s, a Wilcoxon signed-rank test shows that leaving out the test subject results in a significant decrease in decoding accuracy, from 80.8% to 69.3% (W = 14, p = 0.0134). Surprisingly, for one subject the network performs better when that subject's data was not included during training. Other window lengths show similar results.

Figure 8. Impact of leaving out the test subject on the accuracy of the CNN model (decision windows of 1 s).

Figure 8.

Blue dots: per-subject results, averaged over two test stories. Gray lines: same subjects. Red triangles: median accuracies. CNN = convolutional neural network.

Discussion

We proposed a novel CNN-based model for decoding the direction of attention (left/right) without access to the stimulus envelopes, and found it to significantly outperform a linear decoder that was trained to reconstruct the envelope of the attended speaker.

Decoding accuracy

The CNN model resulted in a significant increase in decoding accuracy compared to the linear model: for decision windows as short as 1 s, the CNN's median performance is around 81%. This is also better than the entropy-based direction classification presented in the literature (Lu et al., 2018), in which the average decoding performance proved to be insufficient for real-life use (less than 80% for decision windows of 60 s). Moreover, our network achieves an unprecedented median MESD of only 0.819 s, compared to 22.6 s for the linear method, allowing for robust neuro-steered volume control with a practically acceptable latency.

Despite the impressive median accuracy of our CNN, there is clearly more variability between subjects in comparison to the linear model. Figure 4, for example, shows that some subjects have an accuracy of more than 90%, while others are at chance level – and two subjects even perform below chance level. While this increase in variance could be due to our dataset being too small for the large number of parameters in the CNN, we observed that the poorly performing subjects do better on stories 3 and 4, which were excluded as test sets in our cross-validation. Why our system performs poorly on some stories, and why this effect differs from subject to subject, is not clear, but it nevertheless impacts the per-subject results. This story effect is not present in the linear model, probably because that model has far fewer parameters and is unable to pick up certain intricacies of stories or speakers.

As expected, we found a significant effect of decision window length on accuracy. This effect is, however, clearly different for the two models: the performance of the CNN is much less dependent on window length than is the case for the linear model. For the CNN, going from 10 s to 1 s, the median accuracy decreases by only 4.3 percentage points (from 85.1% to 80.8%), while for the linear model it decreases by 17.6 percentage points (from 75.7% to 58.1%). Moreover, even at 0.25 s the CNN still achieves a median accuracy of 74.0%, compared to only 53.4% for the linear model. We hypothesize that this difference arises because the CNN does not know the stimulus and is only required to decode the locus of attention. As opposed to traditional AAD techniques, it does not have to relate the neural activity to the underlying speech envelopes. The latter requires computing correlation coefficients between the stimulus and the neural responses, which are only sufficiently reliable and discriminative when computed over long windows.

As usual with deep neural networks, it is hard to pinpoint exactly which information the system uses to achieve attention decoding. Potential information sources are spatial patterns of brain activity related to auditory attention, but also eye gaze or (ear) muscle activity, which can be reflected in the EEG. While the subjects most likely focused on a screen in front of them and were instructed to sit still, and we conducted a number of control experiments such as removing the frontal EEG channels, none of these arguments or experiments was fully conclusive, so we cannot exclude the possibility that information from sources other than the brain was used to decode attention.

Lastly, we evaluated our system using a leave-one-story+speaker-out approach, which is not commonly done in the literature. The usual approach is to leave out a single trial without consideration for speaker and/or story. This is probably fine for linear models, but we wanted to see whether the same would hold for a more complex model such as a CNN. Our results demonstrate that, when properly preprocessing the data, there is no significant difference in decoding accuracy between the leave-one-story+speaker-out approach and the classical approach. However, strong overfitting effects were observed when a per-trial (data-driven) preprocessing is performed, for example, for artifact removal. This implies that the data-driven procedure generates intertrial differences in the spatio-temporal data structure that can be exploited by the network. We conclude that one should be careful when applying data-driven preprocessing methods such as independent component analysis, principal component analysis, or MWF in combination with spatio-temporal decoders. In such cases, it is important not to run the preprocessing on a per-trial basis, but to run it only once on the entire recording, to avoid adding per-trial fingerprints that can be discovered by the network.

Future improvements

We hypothesize that much of the variation within and across subjects and stories currently observed is due to the small size of the dataset. The network probably needs more examples to learn to generalize better. However, a sufficiently large dataset, one which also allows for the strict cross-validation used in this work, is currently not available.

Partly as a result of the limited amount of data available, the CNN proposed in this work is relatively simple. With more data, more complex CNN architectures would become feasible. Such architectures may benefit more from techniques that improve generalization, such as dropout and batch normalization, which were not explored in this work.

Also, for a practical neuro-steered hearing aid, it may be beneficial to make soft decisions. Instead of translating the continuous softmax outputs into binary decisions, the system could output the probability that the left or right speaker is attended, and the corresponding noise suppression system could adapt accordingly. In this way, the integrated system could exploit temporal relations or knowledge of the current state to predict future states. The CNN could, for example, be extended with a long short-term memory (LSTM) network.

Applications

The main bottleneck in the implementation of neuro-steered noise suppression in hearing aids thus far has been the detection speed (state-of-the-art algorithms only achieve reasonable accuracies when using long decision windows). This can be quantified through the MESD metric, which captures both the effect of detection speed and decoding accuracy. While our linear baseline model achieves a median MESD of 22.6 s, our CNN achieves a median MESD of only 0.819 s, which is a major step forward.

Moreover, our CNN-based system has an MESD of 5 s or less for 11 out of 16 subjects (eight subjects even have an MESD below 1 s), which is what we assume to be the minimum for an auditory attention detection system to be feasible in practice. Note that while a latency of 5 s may at first sight still seem long for practical use, it should not be confused with the time it takes to actually start steering toward the attended speaker: the user will already hear the effect of switching attention sooner. Instead, the MESD corresponds to the total time it takes to switch an AAD-steered volume control system from one speaker to the other in a reliable fashion, by introducing an optimized amount of 'inertia' in the volume control system to avoid spurious switches due to false positives (Geirnaert et al., 2020). (For reference, an MESD of 5 s corresponds to a decoding accuracy of 70% at 1 s.) On the other hand, one subject does have an MESD of 33.4 s, and two subjects have an infinitely high MESD due to below-50% performance. The intersubject variability thus remains a challenge, since the goal is to create an algorithm that is both robust and able to quickly decode attention within the assumed limits for all subjects.

Another difficulty in neuro-steered hearing aids is that the clean speech envelopes are not available. This has so far been addressed using sophisticated noise suppression systems (Van Eyndhoven et al., 2017; O'Sullivan et al., 2017; Aroudi et al., 2018). If the speakers are spatially separated, our CNN might elegantly solve this problem by steering a beamformer toward the direction of attention, without requiring access to the envelopes of the speakers at all. Note that in a practical system, the system would need to be extended to more than two possible directions of attention, depending on the desired spatial resolution.

For application in hearing aids, a number of other issues need to be investigated, such as the effect of hearing loss (Holmes et al., 2017), acoustic circumstances (e.g., background noise, speaker locations, and reverberation [Das et al., 2018; Das et al., 2016; Fuglsang et al., 2017; Aroudi et al., 2019]), mechanisms for switching attention (Akram et al., 2016), etc. The computational complexity would also need to be reduced. Especially if deeper, more complex networks are designed, CNN pruning will be necessary (Anwar et al., 2017). A hardware DNN implementation, or computation on an external device such as a smartphone, could then be considered. Another practical obstacle is the large number of electrodes used for the EEG measurements. Similar to the work of Mirkovic et al., 2015; Mundanad and Bertrand, 2018; Fiedler et al., 2016; Montoya-Martínez et al., 2019, it should be investigated how many and which electrodes are minimally needed for adequate performance.

In addition to potential use in future hearing devices, fast and accurate detection of the locus of attention can also be an important tool in future fundamental research. Thus far, it has not been possible to measure whether subjects comply with the instruction to direct their attention to one ear. The proposed CNN approach may not only enable this, but also allow the locus of attention to be tracked in almost real time, which can be useful to study attention in dynamic situations and its interplay with other elements such as eye gaze, speech intelligibility, and cognition.

In conclusion, we proposed a novel EEG-based CNN for decoding the locus of auditory attention (based only on the EEG), and showed that it significantly outperforms a commonly used linear model for decoding the attended speaker. Moreover, we showed that the way the model is trained, and the way the data is preprocessed, impacts the results significantly. Although there are still some practical problems, the proposed model approaches the desired real-time detection performance. Furthermore, as it does not require the clean speech envelopes, this model has potential applications in realistic noise suppression systems for hearing aids.

Acknowledgements

The work was funded by KU Leuven Special Research Fund C14/16/057 and C24/18/099, Research Foundation Flanders (FWO) project nos. 1.5.123.16N and G0A4918N, the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grants no. 637424 [T Francart] and no. 802895 [A Bertrand]) and the Flemish Government under the 'Onderzoeksprogramma Artificiele Intelligentie (AI) Vlaanderen' program. A Ansari is a postdoctoral fellow of the Research Foundation Flanders (FWO). We thank Simon Geirnaert for his constructive criticism and for help with some of the technical issues we encountered.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Servaas Vandecappelle, Email: servaas.vandecappelle@gmail.com.

Tom Francart, Email: tom.francart@kuleuven.be.

Barbara G Shinn-Cunningham, Carnegie Mellon University, United States.

Funding Information

This paper was supported by the following grants:

  • KU Leuven C14/16/057 to Tom Francart.

  • KU Leuven C24/18/099 to Alexander Bertrand.

  • Research Foundation Flanders 1.5.123.16N to Alexander Bertrand.

  • Research Foundation Flanders G0A4918N to Alexander Bertrand.

  • European Research Council 637424 to Tom Francart.

  • European Research Council 802895 to Alexander Bertrand.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing.

Conceptualization, Resources, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft.

Conceptualization, Resources, Data curation, Writing - review and editing.

Conceptualization, Software, Supervision, Validation, Writing - review and editing.

Conceptualization, Resources, Supervision, Funding acquisition, Validation, Methodology, Project administration, Writing - review and editing.

Conceptualization, Resources, Supervision, Funding acquisition, Validation, Methodology, Project administration, Writing - review and editing.

Ethics

Human subjects: The experiment was approved by the Ethics Committee Research UZ/KU Leuven (S57102) and every participant signed an informed consent form approved by the same committee.

Additional files

Transparent reporting form

Data availability

Code used for training and evaluating the network has been made available at https://github.com/exporl/locus-of-auditory-attention-cnn (copy archived at https://archive.softwareheritage.org/swh:1:rev:3e5e21a7e6072182e076f9863ebc82b85e7a01b1). The CNN models used to generate the results shown in the paper are also available at that location. The dataset used in this study had been made available earlier at https://zenodo.org/record/3377911.

The following previously published dataset was used:

Vandecappelle S, Deckers L, Das N, Ansari AH, Bertrand A, Francart T. 2019. Auditory Attention Detection Dataset KULeuven. Zenodo.

References

  1. Acharya UR, Oh SL, Hagiwara Y, Tan JH, Adeli H. Deep convolutional neural network for the automated detection and diagnosis of seizure using EEG signals. Computers in Biology and Medicine. 2018a;100:270–278. doi: 10.1016/j.compbiomed.2017.09.017. [DOI] [PubMed] [Google Scholar]
  2. Acharya UR, Oh SL, Hagiwara Y, Tan JH, Adeli H, Subha DP. Automated EEG-based screening of depression using deep convolutional neural network. Computer Methods and Programs in Biomedicine. 2018b;161:103–113. doi: 10.1016/j.cmpb.2018.04.012. [DOI] [PubMed] [Google Scholar]
  3. Akram S, Presacco A, Simon JZ, Shamma SA, Babadi B. Robust decoding of selective auditory attention from MEG in a competing-speaker environment via state-space modeling. NeuroImage. 2016;124:906–917. doi: 10.1016/j.neuroimage.2015.09.048. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Alickovic E, Lunner T, Gustafsson F. A system identification approach to determining listening attention from EEG signals. 24th European Signal Processing Conference (EUSIPCO); 2016. pp. 31–35. [Google Scholar]
  5. Ansari AH, Cherian PJ, Caicedo A, Naulaers G, De Vos M, Van Huffel S. Neonatal seizure detection using deep convolutional neural networks. International Journal of Neural Systems. 2018a;29:1850011. doi: 10.1142/S0129065718500119. [DOI] [PubMed] [Google Scholar]
  6. Ansari AH, De Wel O, Lavanga M, Caicedo A, Dereymaeker A, Jansen K, Vervisch J, De Vos M, Naulaers G, Van Huffel S. Quiet sleep detection in preterm infants using deep convolutional neural networks. Journal of Neural Engineering. 2018b;15:066006. doi: 10.1088/1741-2552/aadc1f. [DOI] [PubMed] [Google Scholar]
  7. Anwar S, Hwang K, Sung W. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems. 2017;13:1–18. doi: 10.1145/3005348. [DOI] [Google Scholar]
  8. Aroudi A, Marquardt D, Daclo S. EEG-based auditory attention decoding using steerable binaural superdirective beamformer. International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018. pp. 851–855. [Google Scholar]
  9. Aroudi A, Mirkovic B, De Vos M, Doclo S. Impact of different acoustic components on EEG-Based auditory attention decoding in noisy and reverberant conditions. IEEE Transactions on Neural Systems and Rehabilitation Engineering. 2019;27:652–663. doi: 10.1109/TNSRE.2019.2903404. [DOI] [PubMed] [Google Scholar]
  10. Bednar A, Lalor EC. Neural tracking of auditory motion is reflected by Delta phase and alpha power of EEG. NeuroImage. 2018;181:683–691. doi: 10.1016/j.neuroimage.2018.07.054. [DOI] [PubMed] [Google Scholar]
  11. Bednar A, Lalor EC. Where is the cocktail party? decoding locations of attended and unattended moving sound sources using EEG. NeuroImage. 2020;205:116283. doi: 10.1016/j.neuroimage.2019.116283. [DOI] [PubMed] [Google Scholar]
  12. Biesmans W, Das N, Francart T, Bertrand A. Auditory-Inspired speech envelope extraction methods for improved EEG-Based auditory attention detection in a cocktail party scenario. IEEE Transactions on Neural Systems and Rehabilitation Engineering. 2017;25:402–412. doi: 10.1109/TNSRE.2016.2571900. [DOI] [PubMed] [Google Scholar]
  13. Cherry EC. Some experiments on the recognition of speech, with one and with two ears. The Journal of the Acoustical Society of America. 1953;25:975–979. doi: 10.1121/1.1907229. [DOI] [Google Scholar]
  14. Ciccarelli G, Nolan M, Perricone J, Calamia PT, Haro S, O'Sullivan J, Mesgarani N, Quatieri TF, Smalt CJ. Comparison of Two-Talker attention decoding from EEG with nonlinear neural networks and linear methods. Scientific Reports. 2019;9:11538. doi: 10.1038/s41598-019-47795-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Das N, Biesmans W, Bertrand A, Francart T. The effect of head-related filtering and ear-specific decoding Bias on auditory attention detection. Journal of Neural Engineering. 2016;13:056014. doi: 10.1088/1741-2560/13/5/056014. [DOI] [PubMed] [Google Scholar]
  16. Das N, Bertrand A, Francart T. EEG-based auditory attention detection: boundary conditions for background noise and speaker positions. Journal of Neural Engineering. 2018;15:066017. doi: 10.1088/1741-2552/aae0a6. [DOI] [PubMed] [Google Scholar]
  17. de Cheveigné A, Wong DDE, Di Liberto GM, Hjortkjær J, Slaney M, Lalor E. Decoding the auditory brain with canonical component analysis. NeuroImage. 2018;172:206–216. doi: 10.1016/j.neuroimage.2018.01.033. [DOI] [PubMed] [Google Scholar]
  18. DeBuren Radioboeken Voor Kinderen. 2007 http://www.radioboeken.eu/kinderradioboeken.php?lang=NL
  19. Ding N, Simon JZ. Emergence of neural encoding of auditory objects while listening to competing speakers. PNAS. 2012;109:11854–11859. doi: 10.1073/pnas.1205381109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Faure P, Korn H. Is there Chaos in the brain? I. concepts of nonlinear dynamics and methods of investigation. Comptes Rendus De l'Académie Des Sciences - Series III - Sciences De La Vie. 2001;324:773–793. doi: 10.1016/S0764-4469(01)01377-4. [DOI] [PubMed] [Google Scholar]
  21. Fiedler L, Obleser J, Lunner T, Graversen C. Ear-EEG allows extraction of neural responses in challenging listening scenarios—a future technology for hearing aids?. 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); 2016. pp. 5697–5700. [DOI] [PubMed] [Google Scholar]
  22. Francart T, van Wieringen A, Wouters J. APEX 3: a multi-purpose test platform for auditory psychophysical experiments. Journal of Neuroscience Methods. 2008;172:283–293. doi: 10.1016/j.jneumeth.2008.04.020. [DOI] [PubMed] [Google Scholar]
  23. Frey JN, Mainy N, Lachaux JP, Müller N, Bertrand O, Weisz N. Selective modulation of auditory cortical alpha activity in an audiovisual spatial attention task. Journal of Neuroscience. 2014;34:6634–6639. doi: 10.1523/JNEUROSCI.4813-13.2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Fuglsang SA, Dau T, Hjortkjær J. Noise-robust cortical tracking of attended speech in real-world acoustic scenes. NeuroImage. 2017;156:435–444. doi: 10.1016/j.neuroimage.2017.04.026. [DOI] [PubMed] [Google Scholar]
  25. Gao Y, Wang Q, Ding Y, Wang C, Li H, Wu X, Qu T, Li L. Selective attention enhances Beta-Band cortical oscillation to speech under "Cocktail-Party" Listening Conditions. Frontiers in Human Neuroscience. 2017;11:34. doi: 10.3389/fnhum.2017.00034. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Geirnaert S, Francart T, Bertrand A. An interpretable performance metric for auditory attention decoding algorithms in a context of Neuro-Steered gain control. IEEE Transactions on Neural Systems and Rehabilitation Engineering. 2020;28:307–317. doi: 10.1109/TNSRE.2019.2952724. [DOI] [PubMed] [Google Scholar]
  27. Holmes E, Kitterick PT, Summerfield AQ. Peripheral hearing loss reduces the ability of children to direct selective attention during multi-talker listening. Hearing Research. 2017;350:160–172. doi: 10.1016/j.heares.2017.05.005. [DOI] [PubMed] [Google Scholar]
  28. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–444. doi: 10.1038/nature14539. [DOI] [PubMed] [Google Scholar]
  29. Liu N, Lu Z, Xu B, Liao Q. Learning a convolutional neural network for sleep stage classification. Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), 2017 10th International Congress.2017. [Google Scholar]
  30. Lu Y, Wang M, Zhang Q, Han Y. Identification of auditory Object-Specific attention from Single-Trial electroencephalogram signals via entropy measures and machine learning. Entropy. 2018;20:386. doi: 10.3390/e20050386. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Miran S, Akram S, Sheikhattar A, Simon JZ, Zhang T, Babadi B. Real-Time tracking of selective auditory attention from M/EEG: a bayesian filtering approach. Frontiers in Neuroscience. 2018;12:262. doi: 10.3389/fnins.2018.00262. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Mirkovic B, Debener S, Jaeger M, De Vos M. Decoding the attended speech stream with multi-channel EEG: implications for online, daily-life applications. Journal of Neural Engineering. 2015;12:046007. doi: 10.1088/1741-2560/12/4/046007. [DOI] [PubMed] [Google Scholar]
  33. Montoya-Martínez J, Bertrand A, Francart T. Optimal number and placement of EEG electrodes for measurement of neural tracking of speech. bioRxiv. 2019. doi: 10.1101/800979. [DOI] [PMC free article] [PubMed]
  34. Mundanad AN, Bertrand A. The effect of miniaturization and galvanic separation of EEG sensor devices in an auditory attention detection task. 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); 2018. pp. 77–80. [DOI] [PubMed] [Google Scholar]
  35. O'Sullivan JA, Power AJ, Mesgarani N, Rajaram S, Foxe JJ, Shinn-Cunningham BG, Slaney M, Shamma SA, Lalor EC. Attentional selection in a cocktail party environment can be decoded from Single-Trial EEG. Cerebral Cortex. 2015;25:1697–1706. doi: 10.1093/cercor/bht355. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. O'Sullivan J, Chen Z, Herrero J, McKhann GM, Sheth SA, Mehta AD, Mesgarani N. Neural decoding of attentional selection in multi-speaker environments without access to clean sources. Journal of Neural Engineering. 2017;14:056001. doi: 10.1088/1741-2552/aa7ab4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. O'Sullivan AE, Lim CY, Lalor EC. Look at me when I'm talking to you: selective attention at a multisensory cocktail party can be decoded using stimulus reconstruction and alpha power modulations. European Journal of Neuroscience. 2019;50:3282–3295. doi: 10.1111/ejn.14425. [DOI] [PubMed] [Google Scholar]
  38. Pasley BN, David SV, Mesgarani N, Flinker A, Shamma SA, Crone NE, Knight RT, Chang EF. Reconstructing speech from human auditory cortex. PLOS Biology. 2012;10:e1001251. doi: 10.1371/journal.pbio.1001251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Patel P, Long LK, Herrero JL, Mehta AD, Mesgarani N. Joint representation of spatial and phonetic features in the human core auditory cortex. Cell Reports. 2018;24:2051–2062. doi: 10.1016/j.celrep.2018.07.076. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Somers B, Francart T, Bertrand A. A generic EEG artifact removal algorithm based on the multi-channel Wiener filter. Journal of Neural Engineering. 2018;15:036007. doi: 10.1088/1741-2552/aaac92. [DOI] [PubMed] [Google Scholar]
  41. de Taillez T, Kollmeier B, Meyer BT. Machine learning for decoding listeners’ attention from electroencephalography evoked by continuous speech. European Journal of Neuroscience. 2017;51:1234–1241. doi: 10.1111/ejn.13790. [DOI] [PubMed] [Google Scholar]
  42. Van Eyndhoven S, Francart T, Bertrand A. EEG-Informed attended speaker extraction from recorded speech mixtures with application in Neuro-Steered hearing prostheses. IEEE Transactions on Biomedical Engineering. 2017;64:1045–1056. doi: 10.1109/TBME.2016.2587382. [DOI] [PubMed] [Google Scholar]
  43. Vandecappelle S. EEG-based detection of the locus of auditory attention with convolutional neural networks. Software Heritage. 2021. swh:1:rev:8c485f2e1d3a79b55b71b3195cdf0235af488d95. doi: 10.7554/eLife.56481. https://archive.softwareheritage.org/swh:1:dir:8901ca73c9ef6f86de11719af6d410a02e7eb291 [DOI] [PMC free article] [PubMed]
  44. Vedaldi A, Lenc K. MatConvNet: convolutional neural networks for MATLAB. Proceedings of the 23rd ACM International Conference on Multimedia. ACM; 2015. pp. 689–692. [Google Scholar]
  45. Wolbers T, Zahorik P, Giudice NA. Decoding the direction of auditory motion in blind humans. NeuroImage. 2011;56:681–687. doi: 10.1016/j.neuroimage.2010.04.266. [DOI] [PubMed] [Google Scholar]
  46. Wöstmann M, Herrmann B, Maess B, Obleser J. Spatiotemporal dynamics of auditory attention synchronize with speech. PNAS. 2016;113:3873–3878. doi: 10.1073/pnas.1523357113. [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision letter

Editor: Barbara G Shinn-Cunningham1
Reviewed by: James O'Sullivan, Andrew Dimitrijevic

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This paper aims to assess how well attention to a speaker can be decoded from EEG using convolutional neural networks (CNNs). In particular, the authors train a CNN on EEG data from a "cocktail party" attention experiment and demonstrate impressive decoding performance, better than many prior related efforts. Though effects of eye gaze cannot be completely ruled out, the authors acknowledge this potential confound and do a diligent job of addressing this concern. These provocative results are likely to impact future research in the use of EEG to decode the focus of attention in auditory tasks.

Decision letter after peer review:

[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]

Thank you for submitting your work entitled "EEG-based detection of the attended speaker and the locus of auditory attention with convolutional neural networks" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Barbara G Shinn-Cunningham as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by a Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Andrew Dimitrijevic (Reviewer #2).

Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife.

All of the reviewers felt that the work has the potential to appear in eLife. However, there were substantial concerns about some of the technical details. Without some significant additional work to address potential limitations of the findings and confounds of the experiments that were conducted, we felt the manuscript was not ready for publication in eLife. At eLife, the standard for asking for a revision (rather than rejecting) is that any additional work should not be likely to take two months or more. Given this, we must reject the manuscript: we believe that the additional work required will take more than two months.

Reviewer #1:

This is an interesting paper that addresses a very timely and interesting question. Given the attention (pun intended) to real-time decoding of attention in the field today, the approach described is likely to be influential.

However, as written, I am not sure how general the findings are, based on the experiments described. Reviewer 3 does an excellent job of articulating the concerns I had, as well, so I am not reiterating them here. With additional controls that demonstrate the robustness of the findings, this work will be of high impact.

Overall, the paper is very clearly written. However, there are a few phrasings that are grammatically proper, but that sound awkward to a native English speaker's ear. (For instance, "Especially the elderly and people suffering from hearing loss have difficulties attending to one person in a noisy environment." is more natural when written as "Both the elderly and people suffering from hearing loss have particular difficulty attending to one person in a noisy environment.") If the paper were being revised at this point, I would offer a more complete list of such sentences and suggested edits, but I don't believe it makes sense to do so at this juncture.

Reviewer #2:

The manuscript "EEG-based detection of the attended speaker and the locus of auditory attention with convolutional neural networks" describes a study where the authors used a convolutional neural network (CNN) to identify auditory attend locations while the EEG was recorded. The data indicated that CNNs can classify attend locations and accuracy and speed of detection increases when the stimulus envelope is included in the CNN.

As written, the manuscript may appeal to engineering or computer science audiences, however, I feel that more needs to be incorporated to appeal to a broader scope/readership of eLife. Although this may detract from the practical or real-world application of the CNN, including and relating more physiological/neuroscience aspects of the CNN may make the manuscript more palatable to a general audience. It may also demonstrate that this technique can also be used to inform how the brain operates. Two current theories relating to auditory selective attention, as the authors mention, are enhancement of envelope encoding schemes and α lateralization. What features is the CNN using? The use of filtered EEG (low frequency for envelope and band-pass 8-12 Hz, for α) may provide some indication. Some detail on the inferred neural generators, perhaps a topography of the feature weights (similar to de Taillez) would be informative. Also, more detail on the filters used for the spatial-temporal feature map would be helpful. The authors may also consider using a "control" condition to estimate false positive rates. This might be implemented as random EEG shuffling (left and right) for the final testing phase, which would have an accuracy of 50%. Some discussion on the behavioral aspect of the subject performance would also be desirable. Were there content questions about the attended speakers, and did the subjects indeed listen to the appropriate target? In cases where CNN performance was not 100%, was the subject "peeking a listen" to the other side?

Overall, the CNN is a novel application in this domain and determining the attended location within 1-2 sec is a remarkable feat.

Reviewer #3:

This is a very nice study, and well written. The applications are very relevant, and the work is timely. However, I have a number of concerns which need to be addressed before I can believe these very impressive results.

The classification performance for the CNN:D model is very high, with accuracy using 1 second of data almost as high as that at 10 seconds. One potential downfall of CNNs (and DNNs in general) is that they might be hyper-sensitive to the particular EEG setup that they're trained on. I.e., if you tested the same subject on another day, would the performance be the same? Or are they learning to optimize performance with a particular setup of electrode locations and noise conditions? I understand that the data set was collected a few years ago, but is it possible to run the experiment again on a small subset of subjects, and use the CNN that was trained on the previous experiment to classify the data from the new experiment? This would address the concern of the CNN overfitting to the precise experimental setup of the day.

The benefit of the linear stimulus reconstruction approach is that we know how it works, and it can generalize to unseen speakers. The authors state that they tried training a DNN to perform stimulus reconstruction, but its performance was not as impressive as the CNN:S+D approach. However, the CNN:S+D specifically requires a binary decision between 2 speakers. Is it possible that the network is over-fitting to the specific speakers in the training set? If 2 new speakers were introduced, could it handle that? Is it possible for the authors to test this with the current data-set? If not, an additional experiment would be required.

In addition, the linear stimulus reconstruction approach allows for a generic subject-independent model that can decode the attention of an unseen subject. The authors do show results from a generic CNN, but this was trained on all subjects. Can the authors perform an additional analysis using a generic decoder but ensure that the test subject has been completely unseen by the network?

On a similar note, the training, cross-validation, and test data were all obtained from the same trials. I.e., in a single 6 minute trial, the first part was chosen as the training set, followed by cross-validation and test sets. This could lead to overly optimistic results. Can the authors perform an additional analysis where the training, validation, and test sets are all taken from different trials?

Can the authors provide any insight into what the network is learning, and how it can perform so well? As the authors mention in the introduction, perhaps it is α power. They could test this hypothesis by providing the CNN with different frequency bands of the neural data.

In summary, I would require to see a lot more proof that the CNN is not just overfitting to the particular subject, EEG setup, and day of recording, and that these results are generalizable.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "EEG-based detection of the locus of auditory attention with convolutional neural networks" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by Barbara Shinn-Cunningham as the Senior and Reviewing Editor. The following individuals involved in review of your submission have agreed to reveal their identity: James O'Sullivan (Reviewer #1); Andrew Dimitrijevic (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require a modest amount of additional new data, as they do with your paper, we are asking that the manuscript be revised to either limit claims to those supported by data in hand, or to explicitly state that the relevant conclusions require additional supporting data.

Our expectation is that the authors will eventually carry out the additional experiments and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.

Summary:

This manuscript presents research aimed at assessing how well attention to a speaker can be decoded from EEG using convolutional neural networks. In particular, the authors train a convolutional neural network directly on EEG data during a "cocktail party" attention experiment and compare it to an approach based on reconstructing an estimate of the speech envelope from the EEG using linear regression. The authors demonstrate decoding performance with accuracies of ~80% using just 1-2 s of data, which is much better than the state of the art.

The reviewers all believe that this work may be appropriate for a Tools and Methods paper in eLife. However, there remain a few critical questions and concerns that need to be addressed for the paper to make its contribution to the field clear.

There are some potential strengths of this technical report comparing the CNNs and linear models for decoding auditory spatial attention using EEG. This research opens new avenues of exploration of auditory attention methods that can be used for real-time decoding applications such as neurally steered hearing aids. The authors claim that it is possible to decode the locus of attention with accuracies of ~80% using just 1-2 s of data, which is much better than the state of the art.

Because we could not obtain assessments from all of the original reviewers, one of the reviewers is new to the paper. This reviewer read the paper and wrote their own comments before going back and looking at the earlier reviews. They noted that some of the points that concerned them had been raised before. Still, the reviewers who saw your earlier submission do appreciate the changes you made.

Revisions for this paper:

The remaining critical issues that must be addressed for the paper to be published are:

1. Comparing current results to those obtained using envelope reconstruction is useful, but it is somewhat unfair. That is something that you should acknowledge. Specifically, the envelope reconstruction approach is not just a linear approach, it is a linear approach that is constrained to relating EEG responses to the envelopes of the two speech streams. No such constraint is placed on the CNN; it trains on the EEG and settles on whatever features are best for solving the question. Related to this, even the EEG preprocessing (filtering) is different for the CNN and the envelope reconstruction approaches. While this makes sense (the filters chosen for the envelope reconstruction seemed reasonable based on the literature), it also means that the information in the EEG differs in the two analyses. These issues should be acknowledged.

2. Some explanations of what features drive the CNN performance would greatly increase the impact of the paper. As a Tools and Methods paper, there are not significant expectations for demonstration of important neuroscience findings. Still, without some information about what is happening in the neural responses, readers cannot judge the likely usefulness and replicability of this "tool." Is there any way to know this? For example, some of the cited literature (e.g., Bednar; Wostmann) shows that α power is important for decoding spatial attention. Alpha frequencies are included in your CNN analysis and might be responsible for the results you describe. You could check this by seeing how the CNN performance drops if you exclude α frequencies, for instance.

Relatedly, it is almost worrying how good the performance gets when you train on the other examples from the same story and speaker (Figure 5). Why would this be? Is the CNN picking up on some weird features in the EEG that are very specific to these speakers? Without having a sense of what drives the exceptional performance, it makes one wonder what the CNN relies on.

3. The results presented in the manuscript show no effect of window size on performance. This must, in the limit, not be true. More data must be presented to show this dependence and determine the limits of the method.

4. For 3 subjects, with a 10s window, the performance of the CNN was lower than the linear model (Figure 2). How is it possible then, that every subject had a better MESD when using the CNN (Figure 3)? I know you've excluded 1 subject from the figure, but what about the other 2 subjects?

5. You talk about the idea that future work can address some unanswered questions, like whether or not performance will drop with fewer EEG channels. However, related to the idea that the results might be driven by decoding of spatial attention, it would be interesting to know if spatial patterns are driving the CNN decoding.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your article "EEG-based detection of the locus of auditory attention with convolutional neural networks" for consideration by eLife. Your revised article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by Barbara Shinn-Cunningham as the Senior and Reviewing Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Andrew Dimitrijevic (Reviewer #2); Behtash Babadi (Reviewer #4).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

As the editors have judged that your manuscript is of interest, but as described below that additional experiments are required before it is published, we would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is "in revision at eLife". Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)

Summary

This is a very interesting and provocative paper, which demonstrates decoding from EEG of the directional focus of auditory attention in a dichotic or HRTF-emulated competing-speaker setting. Using a CNN-based decoder to jointly extract the relevant features and classify the locus of attention, you show significant decoding improvements compared to the common linear decoding techniques; moreover, the decoding is rapid, and is thus able to track attentional switches. Analyses implicate the β band as well as frontal EEG channels in decoding. The paper is well-written and clear, the methods are described carefully and transparently, the results are impressive, and the discussion is thorough and inspiring. Your cross-validation scheme for training the CNN to avoid overfitting is admirable; this is very often overlooked.

In addition, we would like to note how thoughtfully you revised your original submission. Two of the three original reviewers read this revision, along with one new reviewer. It was clear to all that your revision and your reply to the previous criticisms were responsive and thorough. We want to thank and commend you for the work you put in on this revision. That said, the revision raised a new concern, discussed below.

Essential revision

1. The reviewers are not convinced that eye movements are not a substantial contributor to decoding accuracy. Specifically, the frontal topography of the convolution filters in Figure 6 looks suspiciously like an EOG signature. We think it is critical for you to clarify what features of the EEG are being used for classification. One way to test this would be to look at the raw data (attend left vs right) and examine the time-frequency profile.

1a. Saccade-related ERP profiles tend to have a positive peak near 0 ms followed by a negative peak around 20 ms. The attention-related ERPs using EEG, however, have key peaks in the 100-200 ms range. Given this, the temporal profile of the filters may inform the arguments for and against eye movements contributing.

1b. Relatedly, if you found that the filters were tuned to γ band activity, this would suggest that small saccades are influencing performance. The fact that the network weights the β band as much as it does suggests that it may even like γ band more. On the other hand, if the filters are tuned to α or high δ, that would argue against saccades being the cause.

1c. Your MWF algorithm should remove large gaze artifacts. However, even very small (but consistent) gaze changes could be responsible for some of the effects you see. You should also consider the literature on micro saccades and γ, and about whether small but consistent drifts of gaze during long trials contribute.

1d. We are aware of your recent arXiv paper (Geirnaert et al) in which the CNN fails on another data set. Were subjects asked to fixate in that study, but not this? A better description of how subjects were instructed in the current study should be included, no matter what. Given the Geirnaert results, we think it is especially critical to figure out whether the results in the current paper really are attention effects in neural responses, rather than due to eye movement. It would be unfortunate to have to publish a correction if the results in the current study are attributed to attentional effects when they are actually due to gaze differences.

Given these issues, we would like you to undertake some of the above analyses to address the concerns, and consider in the Discussion the evidence for and against eye gaze contributing to the exceptional performance of your algorithm.

eLife. 2021 Apr 30;10:e56481. doi: 10.7554/eLife.56481.sa2

Author response


[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]

Reviewer #1:

(1) This is an interesting paper that addresses a very timely and interesting question. Given the attention (pun intended) to real-time decoding of attention in the field today, the approach described is likely to be influential.

However, as written, I am not sure how general the findings are, based on the experiments described. Reviewer 3 does an excellent job of articulating the concerns I had, as well, so I am not reiterating them here. With additional controls that demonstrate the robustness of the findings, this work will be of high impact.

We understand the concerns that Reviewer 1 and Reviewer 3 have regarding the robustness of our findings. We have made extensive changes to our experimental paradigm to address this. Because it was mostly Reviewer 3 who articulated the concerns, our answers are given below in Reviewer 3’s section. In this section we limit ourselves to additional comments made by Reviewer 1.

(2.1) Overall, the paper is very clearly written. However, there are a few phrasings that are grammatically proper, but that sound awkward to a native English speaker's ear. (For instance, "Especially the elderly and people suffering from hearing loss have difficulties attending to one person in a noisy environment." is more natural when written as "Both the elderly and people suffering from hearing loss have particular difficulty attending to one person in a noisy environment." )

We very much welcome and appreciate comments regarding the readability of our paper; we confess that we are not native speakers of English. We have revised the language to the best of our ability, but admit that improvement is undoubtedly still possible.

(2.2) If the paper were being revised at this point, I would offer a more complete list of such sentences and suggested edits but don't believe it makes sense to do so at this juncture.

Thank you, we look forward to any further comments you may have.

Reviewer #2:

(1) The manuscript "EEG-based detection of the attended speaker and the locus of auditory attention with convolutional neural networks" describes a study where the authors used a convolutional neural network (CNN) to identify attended auditory locations while the EEG was recorded. The data indicated that CNNs can classify attended locations and that accuracy and speed of detection increase when the stimulus envelope is included in the CNN.

We thank the reviewer for the positive remark. To avoid misunderstanding, we note that with the adjustments made to the way the network is trained (which we elaborately explain in our comments to Reviewer 3), there was no longer a significant statistical difference between the CNN that incorporates envelopes (previously called “CNN:S+D”) and the CNN that does not (previously called “CNN:D”). It is for that reason that we decided to no longer include CNN:S+D and instead focus on CNN:D. Nevertheless, the speed of detection of this CNN:D network is a major step forward compared to the reported detection times in the recent literature.

(2.1) As written, the manuscript may appeal to engineering or computer science audiences, however, I feel that more needs to be incorporated to appeal to a broader scope/readership of eLife. Although this may detract from the practical or real-world application of the CNN, including and relating more physiological/neuroscience aspects of the CNN may make the manuscript more palatable to a general audience. It may also demonstrate that this technique can also be used to inform how the brain operates. Two current theories relating to auditory selective attention, as the authors mention, are enhancement of envelope encoding schemes and α lateralization. What features is the CNN using? The use of filtered EEG (low frequency for envelope and band-pass 8-12 Hz, for α) may provide some indication. Some detail on the inferred neural generators, perhaps a topography of the feature weights (similar to de Taillez) would be informative. Also, more detail on the filters used for the spatial-temporal feature map would be helpful.

We agree with the reviewer that insight into how the network operates, and what it learns exactly, would be informative both for the further development of neural network-based decoders and for neuroscience in general. We have done some elementary analysis to try to have a rough idea of what the network actually does by investigating the spatial and spectral topology of the convolution kernels, but we could not find clear trends. We feel that adding more advanced analyses to try to further open up the black box is beyond the scope and would also lead to an overloaded paper (both in terms of methodology and results). We also respectfully point out that the manuscript was submitted to the Tools and Resources category, for which the author guide states “This category highlights tools or resources that are especially important for their respective fields and have the potential to accelerate discovery. […] Tools and Resources articles do not have to report major new biological insights or mechanisms”. Though very interesting, we would prefer to keep the paper crisp and stick to the analysis of the performance of the network.

(3) The authors may also consider using a "control" condition to estimate false positive rates. This might be implemented as random EEG shuffling (left and right) for the final testing phase, which would have an accuracy of 50%.

Thank you for the comment. We agree that a control condition is useful. However, given that the D network (which is now the exclusive focus of the new version of the manuscript) only uses EEG and no stimulus information, each EEG segment has a “correct answer”, so shuffling the EEG in time relative to the stimulus would not work.

(4) Some discussion on the behavioral aspect of the subject performance would also be desirable. Were there content questions about the attended speakers, and did the subjects indeed listen to the appropriate target? In cases where CNN performance was not 100%, was the subject "peeking a listen" to the other side?

In the Das et al. 2016 dataset (which we use in this study), attention was measured behaviorally with a multiple-choice quiz given after every 6 min trial. We recognize this was not clearly explained in the manuscript and have added a more extensive description of the experimental setup. Note that there was no significant correlation found between performance on this quiz and the attention decoding performance.

Reviewer #3:

(1) This is a very nice study, and well written. The applications are very relevant, and the work is timely. However, I have a number of concerns which need to be addressed before I can believe these very impressive results.

The classification performance for the CNN:D model is very high, with accuracy using 1 second of data almost as high as that at 10 seconds. One potential downfall of CNNs (and DNNs in general) is that they might be hyper-sensitive to the particular EEG setup that they're trained on. I.e., if you tested the same subject on another day, would the performance be the same? Or are they learning to optimize performance with a particular setup of electrode locations and noise conditions? I understand that the data set was collected a few years ago, but is it possible to run the experiment again on a small subset of subjects, and use the CNN that was trained on the previous experiment to classify the data from the new experiment? This would address the concern of the CNN overfitting to the precise experimental setup of the day.

Reviewer 3 makes very relevant comments about the generalization of our findings. We understand why this is a cause of concern and have since made changes to our paradigm to improve the robustness of our results. Below, we reiterate the points above and explain how we addressed them.

Before doing so, however, we would like to point out that the manner in which we trained our model was not unusual. To reiterate, we partitioned each trial (6 minutes of the same story, attended to by the same ear) into a training, validation and testing set. The CNN was therefore never tested on data it had already seen, but one could argue that having already seen a different part of the EEG elicited by the same story could lead the model to gain an unfair advantage. The same argument holds for the narrator. As far as we know, this dependency has never been taken into account in other peer-reviewed AAD algorithm papers, though we admit this is probably much less of an issue for linear models than for non-linear models.

On the other hand, a recent peer-reviewed paper also proposes a non-linear model (Ciccarelli et al., 2019), and they adopt a similar training scheme, though they do not partition individual trials, but instead use a leave-one-trial-out scheme.

Nonetheless, we agree that this is a cause of concern and we have made steps to eliminate this potential dependency.

(2.1) i.e., if you tested the same subject on another day, would the performance be the same? […] I understand that the data set was collected a few years ago, but is it possible to run the experiment again on a small subset of subjects, and use the CNN that was trained on the previous experiment to classify the data from the new experiment? This would address the concern of the CNN overfitting to the precise experimental setup of the day.

This is an excellent point. Although we are also curious as to how this would impact the model performance, we regret to say we are unable to repeat the experiment. Our main results are based on a subject-dependent model that requires example data of the test subject—and unfortunately we can no longer retest the same subjects. Due to a large time gap we were unable to recruit the same students from the Das et al. (2016) study to do a re-test, as they have since moved on. Note that, as opposed to the previous version of the manuscript, the network is now trained on data from all subjects, which in itself acts as a regularizer, preventing the network from overfitting to one particular experiment on one particular day. The subject/experiment-dependent post-training has now been omitted as it was observed not to improve performance. If the network were to benefit from learning experiment/day-specific features, an improvement would be expected here, which is not the case.

(2.2) Or are they learning to optimize performance with a particular setup of electrode locations and noise conditions?

Properly testing these confounds would entail repeating the experiment on the same subject. We refer to (2.1) for our rationale as to why we are regrettably unable to do so. Note that we do expect that the network is indeed dependent on the electrode locations, as the initial convolutional layer is a spatial filter, whose performance would deteriorate if the test and training data used different electrode locations on the scalp. However, the same holds for all multi-channel (backwards) decoders in the current literature on AAD. We believe it is reasonable to assume a pre-fixed montage.

(2.3) The benefit of the linear stimulus reconstruction approach is that we know how it works, and it can generalize to unseen speakers. The authors state that they tried training a DNN to perform stimulus reconstruction, but its performance was not as impressive as the CNN:S+D approach. However, the CNN:S+D specifically requires a binary decision between 2 speakers. Is it possible that the network is over-fitting to the specific speakers in the training set? If 2 new speakers were introduced, could it handle that? Is it possible for the authors to test this with the current data-set? If not, an additional experiment would be required.

To be clear, we had four stories in total; two were narrated by two different speakers, and two by the same speaker. In our original train/val/test setup, both the train and test sets contained EEG elicited by speech of the same speaker (although never the same part of the story). We agree that having already seen (EEG elicited by) the same speaker could lead to overly optimistic model performance. We have therefore made changes so that the model is trained in a leave-one-story+speaker-out way. That means:

1. For leave-one-story-out: Per subject, we partitioned the data into four subsets, one for each (attended) story. During training we then iterated over the four stories, taking the current story as the test story, while the other three were used for training. That way the network was tested on a story it had never seen before. The performance was defined as the average performance over the four folds.

Author response table 1. Leave-one-story-out scheme.

Example of one out of four folds. In this particular fold, the test set consists of story 1, and the training and validation sets consist of stories 2, 3, and 4. Training and validation sets are completely separate from the test set. Per-subject accuracies are based on a subject-specific test set (noted by multiple mentions of "test" in Author response table 1). The model is trained on data of all subjects (noted by a single mention of "train/val").

Story    Subject 1    Subject 2    …    Subject 16
1        test         test         …    test
2        train/val
3        train/val
4        train/val

2. For leave-one-speaker-out, we note that stories 3 and 4 were narrated by the same speaker. As a consequence, in two of the four folds, the test story and one of the three training stories were narrated by the same speaker. To also exclude the effects of speaker dependency, we discarded those two folds. The performance was then defined as the average of the two other folds. (This had no consequences for the amount of training data: in each fold the network was still trained on three stories and tested on one.) A sketch of this combined fold construction is given below.

We decided on this combined speaker+story scheme because it was the only way to eliminate both confounds without having to collect new data.

(Note that the Das et al. (2016) dataset is balanced in terms of attended ear—also on the level of stories—and that, hence, the folds are also balanced in that regard.)
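As a rough illustration (not the authors' published code), the combined leave-one-story+speaker-out fold construction described above could look as follows in Python. The segment container, its keys, and the story-to-narrator mapping are hypothetical placeholders.

```python
# Sketch only: hypothetical data structures, not the published training code.
# `segments` is assumed to be a list of dicts with keys "eeg", "label"
# (0 = left, 1 = right), "subject", and "story" (1-4).
STORY_SPEAKER = {1: "A", 2: "B", 3: "C", 4: "C"}  # assumed map; stories 3 and 4 share a narrator

def make_folds(segments):
    """Build leave-one-story+speaker-out folds: the test story and its narrator
    never appear in the training/validation data of that fold."""
    folds = []
    for test_story in sorted(STORY_SPEAKER):
        train_stories = [s for s in STORY_SPEAKER if s != test_story]
        # Discard folds in which a training story shares the test story's narrator
        # (this removes the folds that have stories 3 and 4 as test story).
        if any(STORY_SPEAKER[s] == STORY_SPEAKER[test_story] for s in train_stories):
            continue
        train_val = [seg for seg in segments if seg["story"] != test_story]
        test = [seg for seg in segments if seg["story"] == test_story]
        folds.append((train_val, test))
    return folds  # accuracy = average over the remaining folds
```

Consistent with the table above, the training/validation portion of each fold pools all subjects, while test accuracies are computed per subject on the held-out story.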

(2.4) In addition, the linear stimulus reconstruction approach allows for a generic subject-independent model that can decode the attention of an unseen subject. The authors do show results from a generic CNN, but this was trained on all subjects. Can the authors perform an additional analysis using a generic decoder but ensure that the test subject has been completely unseen by the network?

Thank you for the suggestion. Note that the generic CNN mentioned by the reviewer is in fact a subject-specific decoder, as training data from the subject under test was included (yet other subjects were included in the training set to increase the training data and avoid overfitting). To avoid confusion with a subject-independent decoder, we avoid the term “generic” to describe such a decoder.

As suggested by the reviewer, we have added a section where we show results of a model trained on N− 1 subjects and tested on the unseen subject. We show that there is a significant drop in median accuracy, but that the decoding accuracy remains above 70% for 7 out of 16 subjects. We feel that this is an additional strength of our model, and certainly something we would like to further explore in the future.

(3) On a similar note, the training, cross-validation, and test data were all obtained from the same trials. I.e., in a single 6 minute trial, the first part was chosen as the training set, followed by cross-validation and test sets. This could lead to overly optimistic results. Can the authors perform an additional analysis where the training, validation, and test sets are all taken from different trials?

Thank you for pointing this out. This was mainly answered in our response to (2.3), but we would like to point out again that while our test set now comes from held-out stories/speakers, our training and validation are still taken from the same trials. This should result in a model that does well on the “average” of the three training stories, rather than on one particular story (which would be the case when the validation set consists of one story only). This does not cause an issue with generalizability or over-optimistic test results, however, because the test story and speaker are still completely unseen. In that sense we feel we have satisfied the reviewer’s request.

(4) Can the authors provide any insight into what the network is learning, and how it can perform so well? As the authors mention in the introduction, perhaps it is α power. They could test this hypothesis by providing the CNN with different frequency bands of the neural data.

We certainly acknowledge this would be interesting, but for reasons explained in Reviewer 2’s section, we would rather not extend the scope at this time.

[Editors’ note: what follows is the authors’ response to the second round of review.]

Revisions for this paper:

The remaining critical issues that must be addressed for the paper to be published are:

1. Comparing current results to those obtained using envelope reconstruction is useful, but it is somewhat unfair. That is something that you should acknowledge. Specifically, the envelope reconstruction approach is not just a linear approach, it is a linear approach that is constrained to relating EEG responses to the envelopes of the two speech streams. No such constraint is placed on the CNN; it trains on the EEG and settles on whatever features are best for solving the question. Related to this, even the EEG preprocessing (filtering) is different for the CNN and the envelope reconstruction approaches. While this makes sense (the filters chosen for the envelope reconstruction seemed reasonable based on the literature), it also means that the information in the EEG differs in the two analyses. These issues should be acknowledged.

Thank you for the insightful comment. We agree that the comparison is not obvious and that the reader should fully understand the assumptions that are being made. We have added the following paragraph at the end of Section II.E to make this more clear:

“Note that the results of the linear model here merely serve as a representative baseline, and that a comparison between the two models should be treated with care—in part because the CNN is non-linear, but also because the linear model is only able to relate the EEG to the envelopes of the recorded audio, while the CNN is free to extract any feature it finds optimal (though only from the EEG, as no audio is given to the CNN). Additionally, the preprocessing is slightly different for both models. However, that preprocessing was chosen such that each model would perform optimally—using the same preprocessing would in fact negatively impact one of the two models.”

2. Some explanations of what features drive the CNN performance would greatly increase the impact of the paper. As a Tools and Methods paper, there are not significant expectations for demonstration of important neuroscience findings. Still, without some information about what is happening in the neural responses, readers cannot judge the likely usefulness and replicability of this "tool." Is there any way to know this? For example, some of the cited literature (e.g., Bednar; Wostmann) shows that α power is important for decoding spatial attention. Alpha frequencies are included in your CNN analysis and might be responsible for the results you describe. You could check this by seeing how the CNN performance drops if you exclude α frequencies, for instance.

We very much agree with the reviewer that an analysis of how the network works would make the paper more impactful. We do feel that in order to answer this question in full, a thorough and non-trivial analysis is required, one that we certainly want to do in the future but is out of scope for this particular paper.

We acknowledge that an experiment such as the one suggested by the reviewer could already provide some insight. To that end, we have performed two experiments, one per the suggestion and one variation thereupon:

1. For each frequency band X (δ, θ, α, β, with ranges taken from the literature):

– (Original) We filtered the original data such that all frequency bands except X were present.

– (Variation) We filtered the original data such that only X was present.

2. We loaded the original models and evaluated them again on the filtered data.

We have added the results to the "Interpretation of the results" section. In short, we found that our network primarily uses the β-band, rather than the α-band. There is literature that also reports on the importance of the β-band in spatial decoding (e.g., Gao et al., 2017), which is now also discussed in the paper.
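As a rough sketch of this band-removal procedure (again, not the authors' code), one could band-stop or band-pass filter the preprocessed EEG and re-evaluate the already-trained models on it. The band edges, filter order, and 128 Hz sampling rate below are assumptions for illustration, not necessarily the values used in the paper.

```python
# Sketch only: band-removal evaluation of already-trained models.
from scipy.signal import butter, sosfiltfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12), "beta": (12, 30)}
FS = 128  # assumed EEG sampling rate (Hz)

def remove_band(eeg, band, fs=FS):
    """Band-stop filter `eeg` (channels x samples) so that band X is removed (original experiment)."""
    low, high = BANDS[band]
    sos = butter(4, [low, high], btype="bandstop", fs=fs, output="sos")
    return sosfiltfilt(sos, eeg, axis=-1)

def keep_band(eeg, band, fs=FS):
    """Band-pass filter `eeg` so that only band X remains (the variation)."""
    low, high = BANDS[band]
    sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, eeg, axis=-1)

# Hypothetical usage with an already-trained `model` and labeled `segments`:
# acc = sum(model.predict(remove_band(seg["eeg"], "alpha")) == seg["label"]
#           for seg in segments) / len(segments)
```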

Relatedly, it is almost worrying how good the performance gets when you train on the other examples from the same story and speaker (Figure 5). Why would this be? Is the CNN picking up on some weird features in the EEG that are very specific to these speakers? Without having a sense of what drives the exceptional performance, it makes one wonder what the CNN relies on.

Thank you for the comment. It is indeed true that our experiments show that providing EEG data of the same story and speaker provides a significant and unrealistic benefit, given that in a real-life situation we want our models to generalize to unknown stories and speakers. Previously, we simply made sure to keep testing and training data separate (as is done in other literature in our field), but clearly this does not suffice: knowing the characteristics of the speaker and/or story in advance helps.

Exactly what drives this is not entirely clear to us. The original experiment (Das et al. 2016) was not designed to investigate this, and due to the way it was set up we feel we can at best only establish this fact. For starters, there were only 3 speakers, all male, and there were only 4 stories. As mentioned in the paper, that resulted in only two unique speaker/story combinations. We could include the other combinations, but then it is not clear what the interaction between story and speaker is. To properly investigate this we would need to run an experiment with many short stories, each narrated by a different speaker.

3. The results presented in the manuscript show no effect of window size on performance. This must, in the limit, not be true. More data must be presented to show this dependence and determine the limits of the method.

Thank you for the comment. We agree that this is worthwhile to have in the paper and have retrained the model on the following window sizes: 0.5 s, 0.25 s, and 0.13 s. We chose 0.13 s as the lowest value since the CNN kernel is also 0.13 s wide, which puts a lower bound on the size of the decision window. For the same reason, the linear model was not rerun at 0.13 s, since its kernel width is 0.25 s.

We have added the new results to Figure 4, in the "Interpretation of the results" section. The statistical analysis has also been rerun to take into account the new window sizes, and now does show a significant effect of decision window length on performance for the CNN (previously: no effect). The previous result for the linear model remained unchanged.

As a consequence of adding these extra window sizes, the MESD values in the paper have changed slightly.
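To make the kernel-width constraint concrete, here is a minimal sketch (not the authors' code) of cutting a trial into non-overlapping decision windows; the 128 Hz sampling rate is an assumption.

```python
import numpy as np

FS = 128            # assumed EEG sampling rate (Hz)
KERNEL_SEC = 0.13   # temporal width of the CNN's convolution kernel (s)

def cut_windows(eeg, window_sec):
    """Split `eeg` (channels x samples) into non-overlapping decision windows."""
    if window_sec < KERNEL_SEC:
        raise ValueError("decision window cannot be shorter than the CNN kernel")
    win = int(round(window_sec * FS))
    n_windows = eeg.shape[1] // win
    return [eeg[:, i * win:(i + 1) * win] for i in range(n_windows)]

# Example: 120 half-second windows from a hypothetical one-minute, 64-channel trial.
windows = cut_windows(np.random.randn(64, FS * 60), 0.5)
```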

4. For 3 subjects, with a 10s window, the performance of the CNN was lower than the linear model (Figure 2). How is it possible then, that every subject had a better MESD when using the CNN (Figure 3)? I know you've excluded 1 subject from the figure, but what about the other 2 subjects?

We are grateful to the reviewer for checking our work in such detail. However, this is not an inconsistency. The MESD takes into account all window sizes (initially 10s, 5s, 2s and 1s, but now also even shorter windows), but Figure 2, on the other hand, shows only the results for 10s and 1s. (2s and 5s were not shown because the results were similar to either 10s or 1s.) The particular subjects that the reviewer refers to did indeed have worse results for the CNN than for the linear model, but that was only true for 10s; it was actually the other way around for 1s. And because the MESD metric places more importance on smaller window sizes, those subjects still had better MESD values with the CNN.

5. You talk about the idea that future work can address some unanswered questions, like whether or not performance will drop with fewer EEG channels. However, related to the idea that the results might be driven by decoding of spatial attention, it would be interesting to know if spatial patterns are driving the CNN decoding.

Again an excellent point. We think that spatial patterns must be involved due to the very short window lengths and the fact that no temporal information about the stimulus (envelope) is given as an input to the network. However, similarly to remark (2), it is hard to open the black box and fully understand what spatial patterns the CNN uses. In an attempt to answer this question, we investigated the weights of the convolutional filters by computing a grand-average topographic map, as follows:

1. Calculated the power of each channel in the training set. Normalized exactly as was done during training. This resulted in 64 values, one for each EEG channel.

2. Per model and per filter, calculated the power of the filter coefficients in each channel. Normalized those values by multiplying with the power in the training set, calculated in the previous step. Applied the sqrt to those values, to account for the fact that we want to show power.

3. Performed the above step for each model and each filter, and averaged the results. Normalized the values to lie in the interval [0, 1].

The resulting figure was added to the paper in the "Interpretation of the results" section.
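The three steps above could be sketched roughly as follows. This is an illustration under assumptions: the filter array layout and variable names are hypothetical, and the training EEG is assumed to be normalized exactly as during training.

```python
# Sketch only: grand-average channel map from the convolution filter weights.
# `filters` is assumed to have shape (n_models, n_filters, n_channels, kernel_len);
# `train_eeg` has shape (n_channels, n_samples), normalized as in training.
import numpy as np

def grand_average_topomap(filters, train_eeg):
    chan_power = np.mean(train_eeg ** 2, axis=1)             # step 1: power per channel (64 values)
    coef_power = np.mean(filters ** 2, axis=-1)              # step 2: power of filter coefficients per channel
    weighted = np.sqrt(coef_power * chan_power)              # scale by training-set power, then square root
    topo = weighted.mean(axis=(0, 1))                        # step 3: average over models and filters
    topo = (topo - topo.min()) / (topo.max() - topo.min())   # normalize to [0, 1]
    return topo                                              # one value per EEG channel
```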

We see activations primarily in the frontal and temporal channels, plus some smaller activations in the occipital lobe. Although that still does not provide us with concrete information regarding the inner workings of the network, it is somewhat in line with other studies in the literature. Ciccarelli et al. (2019), for example, included similar heatmaps of the weights, and also demonstrated strong activity on frontal and temporal channels (although this network also had access to the speech envelope). However, Ciccarelli et al. do not provide any discussion as to what may cause those activations. Additionally, Gao et al. (2017) also found the frontal channels to significantly differ from the other channels within the β band (Figure 3 and Table 1 in Gao et al. (2017)). The prior MWF artefact removal step in the EEG preprocessing and the importance of the β band in the decision making (Figure 5 in the paper) imply that the focus on the frontal channels is not attributable to eye artifacts. It is noted that the filters of the network act as backward decoders, and therefore care should be taken when interpreting topoplots related to the decoder coefficients. As opposed to a forward (encoding) model, the coefficients of a backward (decoding) model are not necessarily predictive of the strength of the neural response in these channels. For example, the network may perform an implicit noise reduction transformation, thereby involving channels with low SNR as well.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Essential revision

1. The reviewers are not convinced that eye movements are not a substantial contributor to decoding accuracy. Specifically, the frontal topography of the convolution filters in Figure 6 looks suspiciously like an EOG signature. We think it is critical for you to clarify what features of the EEG are being used for classification. One way to test this would be to look at the raw data (attend left vs right) and examine the time-frequency profile.

While frontal topographies have also been found in other AAD papers (see our Discussion section for a comparison), we wholeheartedly agree that this is an important point and that it might indicate that the network (partially) uses EOG information.

We had a thorough look at the raw data as suggested, but we could not see anything that would suggest eye movement.

In addition, we also investigated the time-frequency profile of the filters— please see our answer to the next question (1a). Also our answers to comments (1b)-(1h) relate to the possible influence of eye-related activity.

1a. Saccade-related ERP profiles tend to have a positive peak near 0 ms followed by a negative peak around 20 ms. The attention-related ERPs using EEG, however, have key peaks in the 100-200 ms range. Given this, the temporal profile of the filters may inform the arguments for and against eye movements contributing.

Thank you for the thoughtful suggestion. However, we kindly note that in this particular case the filters are not time-locked with the stimulus (we are not decoding a stimulus-following response as in traditional speech envelope reconstruction methods). That is, in our experiment, subjects continuously direct their attention to one ear, and for each x-second segment we determine the direction of attention, without relating/correlating the EEG to the stimulus waveform. We therefore don’t think a temporal profile as suggested would yield the desired result.

Nonetheless, per the suggestion, we have calculated the frequency response of the filters in the convolutional layer. We did so in a grand-average fashion, similar to the topoplot that was added in the last revision (Figure 6 in the paper). That is, we first estimated the PSD of a single filter, averaged over all 64 channels, and subsequently averaged again over all five filters and over all runs and all window sizes. The result is a single, grand-average magnitude response of the filters in the convolutional layer, shown in Author response image 1. The relevant EEG bands are indicated in the figure as well.

Author response image 1. Grand-average temporal profile of the filters in the convolutional layer.


One can see that it is mostly the β band that is being targeted, which is also in correspondence with the results of the band-removal experiment that was also added in the previous revision (Figure 5 in the paper).

We feel that the temporal profile shown in Author response image 1 does not tell us anything new regarding the possibility that the model may in part be driven by eye movement, at least compared to what we already knew from the band-removal experiment. Even when relatively high frequency components are targeted, it does not automatically follow that these are saccades, or any other type of eye movement.
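A rough sketch of this grand-average magnitude response computation is given below, with assumed array shapes and sampling rate, and a simple periodogram as the PSD estimator (the authors' actual estimator is not specified here).

```python
# Sketch only: grand-average magnitude response of the convolution filters.
# `all_filters` is assumed to be a list of arrays, one per trained model
# (run x window size), each with shape (n_filters, n_channels, kernel_len).
import numpy as np
from scipy.signal import periodogram

FS = 128  # assumed EEG sampling rate (Hz)

def grand_average_response(all_filters, fs=FS, nfft=256):
    psds = []
    for filt in all_filters:                   # one trained model at a time
        f, pxx = periodogram(filt, fs=fs, nfft=nfft, axis=-1)
        psds.append(pxx.mean(axis=(0, 1)))     # average over filters and channels
    avg_psd = np.mean(psds, axis=0)            # average over runs and window sizes
    return f, np.sqrt(avg_psd)                 # magnitude response (arbitrary units)
```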

1b. Relatedly, if you found that the filters were tuned to γ band activity, this would suggest that small saccades are influencing performance. The fact that the network weights the β band as much as it does suggests that it may even like γ band more. On the other hand, if the filters are tuned to α or high δ, that would argue against saccades being the cause.

Please refer to our answer to (1a). In short, we do not feel that the fact that the filters are mainly tuned to the β band tells us much regarding the presence or non-presence of saccades.

1c. Your MWF algorithm should remove large gaze artifacts. However, even very small (but consistent) gaze changes could be responsible for some of the effects you see. You should also consider the literature on micro saccades and γ, and about whether small but consistent drifts of gaze during long trials contribute.

Thank you for the suggestion. We kindly note that a spatial filtering method such as MWF that attempts to remove large gaze artifacts will also remove smaller eye movements, as they originate from the same dipole as larger eye movements (the filter only uses spatial information).

1d. We are aware of your recent arXiv paper (Geirnaert et al) in which the CNN fails on another data set. Were subjects asked to fixate in that study, but not this? A better description of how subjects were instructed in the current study should be included, no matter what. Given the Geirnaert results, we think it is especially critical to figure out whether the results in the current paper really are attention effects in neural responses, rather than due to eye movement. It would be unfortunate to have to publish a correction if the results in the current study are attributed to attentional effects when they are actually due to gaze differences.

We agree that we could have been more clear regarding the instructions subjects received. We have added the following text to the “Experiment” section:

“The experiment was split into eight trials, each 6min long. In every trial, subjects were presented with two parts of two different stories. One part was presented in the left ear, while the other was presented in the right ear. Subjects were instructed to attend to one of the two via a monitor positioned in front of them. The symbol “<” was shown on the left side of the screen when subjects had to attend to the story in the left ear, and the symbol “>” was shown on the right side of the screen when subjects had to attend to the story in the right ear. They did not receive instructions on where to focus their gaze.”

The other dataset in Geirnaert et al. is the one published by Fuglsang et al. (2017). In this dataset, subjects fixated on a crosshair. However, in pilot experiments with other datasets from our own lab we found that fixating on a point did not affect whether our DNN approach worked or not, so there must be another unknown difference between the Das et al. and Fuglsang et al. datasets.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Citations

    1. Vandecappelle S, Deckers L, Das N, Ansari AH, Bertrand A, Francart T. 2019. Auditory Attention Detection Dataset KULeuven. Zenodo. [DOI] [PMC free article] [PubMed]

    Supplementary Materials

    Transparent reporting form

    Data Availability Statement

    Code used for training and evaluating the network has been made available at https://github.com/exporl/locus-of-auditory-attention-cnn (copy archived at https://archive.softwareheritage.org/swh:1:rev:3e5e21a7e6072182e076f9863ebc82b85e7a01b1). The CNN models used to generate the results shown in the paper are also available at that location. The dataset used in this study had been made available earlier at https://zenodo.org/record/3377911.

    The following previously published dataset was used:

    Vandecappelle S, Deckers L, Das N, Ansari AH, Bertrand A, Francart T. 2019. Auditory Attention Detection Dataset KULeuven. Zenodo.


