Heliyon. 2024 Jul 19;10(15):e34860. doi: 10.1016/j.heliyon.2024.e34860

The impact of face masks on face-to-face neural tracking of speech: Auditory and visual obstacles

M Fantoni a, A Federici a, I Camponogara b, G Handjaras a, A Martinelli c, E Bednaya a, E Ricciardi a, F Pavani d,e, D Bottari a
PMCID: PMC11328033  PMID: 39157360

Abstract

Face masks provide fundamental protection against the transmission of respiratory viruses but hamper communication. We estimated auditory and visual obstacles generated by face masks on communication by measuring the neural tracking of speech. To this end, we recorded the EEG while participants were exposed to naturalistic audio-visual speech, embedded in 5-talker noise, in three contexts: (i) no-mask (audio-visual information was fully available), (ii) virtual mask (occluded lips, but intact audio), and (iii) real mask (occluded lips and degraded audio). Neural tracking of lip movements and of the sound envelope of speech was measured through backward modeling, that is, by reconstructing stimulus properties from neural activity. Behaviorally, face masks increased perceived listening difficulty and phonological errors in speech content retrieval. At the neural level, we observed that the occlusion of the mouth abolished lip tracking and dampened neural tracking of the speech envelope at the earliest processing stages. By contrast, degraded acoustic information related to face mask filtering altered neural tracking of speech envelope at later processing stages. Finally, a consistent link emerged between the increment of perceived listening difficulty and the drop in reconstruction performance of speech envelope when attending to a speaker wearing a face mask. Results clearly dissociated the visual and auditory impact of face masks on the neural tracking of speech. While the visual obstacle related to face masks hampered the ability to predict and integrate audio-visual speech, the auditory filter generated by face masks impacted neural processing stages typically associated with auditory selective attention. The link between perceived difficulty and neural tracking drop also provides evidence of the impact of face masks on the metacognitive levels subtending face-to-face communication.

1. Introduction

The natural statistics of audio-visual speech are characterized by correlations and temporal correspondences between the visual changes in the mouth area and the sound envelope [1]. These signals are coupled to the neural processes of the listener, who exploits all available information [2], particularly when facing potential ambiguities in the incoming signals [3]. Seeing the speaker's facial movements, notably lip movements, enhances speech processing [4]. This boost is particularly relevant in challenging listening conditions, as in the case of noisy environments [5,6] or when listeners experience hearing impairment [7,8]. When visual cues are lacking, such as when talking to a person wearing a face mask (like during the pandemic of COVID-19), comprehension becomes more challenging, requiring higher cognitive effort [9,10]. Noteworthy is that face masks act both as a visual obstacle and an acoustic filter. They occlude the visibility of the mouth area and alter the sound emitted by the speaker [11,12]. Thus, face masks represent a valuable natural context to estimate the relative contribution of visual and acoustic cues during audio-visual speech processing.

Behavioral studies have revealed that face masks impact speech intelligibility, especially when listening to speech in noise, whereas in quiet, there is limited interference [13]. Different studies tested the degree to which various face masks impact speech comprehension and how this changes according to the background noise level, e.g., Refs. [13,14]. Moreover, it was documented that face masks affect the metacognitive dimensions of the listening experience. Specifically, they increase perceived listening effort and reduce confidence in what has been heard. In addition, they affect meta-cognitive monitoring (i.e., the ability to correctly determine one's own performance) for speech recognition [10,15]. In sum, a face mask covering the speaker's lips leads to reduced speech comprehension, as well as other difficulties related to meta-cognitive assessments of the listening experience.

These behavioral difficulties are associated with measurable changes in brain responses. Recent MEG investigations revealed that surgical face masks impact listeners' neural tracking of acoustic features, such as the sound envelope and spectral features (i.e., pitch and formant frequencies) [12]. Moreover, neural tracking of higher-level speech features, such as phoneme and word onsets, was found to be significantly hampered when listening to a speaker wearing a face mask in challenging auditory contexts (i.e., in the presence of additional speakers in the background). Nevertheless, these results could not be causally linked to the visual or acoustic filtering role exerted by the face mask, as the experimental design did not allow disambiguation between the two. Recently, Haider and colleagues [16] further investigated this issue, showing that face masks primarily affect speech processing by hiding visual speech cues rather than by acoustic degradation. Their experiment comprised natural speech in unmasked and masked contexts, presented either auditorily only or audio-visually.

The present study aimed to selectively investigate acoustic and visual obstacles, isolating their roles while always employing audio-visual speech stimuli. First, we compared two audio-visual conditions that differed exclusively in their acoustic component (clean or degraded). Second, we contrasted two audio-visual conditions that differed only in their visual component (mouth area visible vs. not visible) without manipulating the acoustic properties. We estimated the visual and auditory obstacles generated by face masks by measuring the neural tracking of face-to-face speech. To this aim, we recorded the electroencephalogram (EEG) while participants were exposed to naturalistic audio-visual speech stimuli in three conditions: (i) no-mask, (ii) virtual mask, and (iii) real mask. In the no-mask condition, visual and acoustic information was fully available to the listener. In the virtual mask condition, the acoustic information was fully preserved, but a digital face mask hid the mouth area. Finally, in the real mask condition, the acoustic information was degraded by the physical filter constituted by a surgical face mask, and the speaker's lips were hidden. Importantly, speech was always embedded in babble noise (5 overlapping talkers in the background). Since listening to speech in noise benefits from having concurrent audio-visual information, adding babble noise in the background could enhance the integration of auditory and visual speech signals [17–19].

We measured participants’ accuracy in answering content comprehension questions and error types as indicators of performance. In addition, we measured confidence in the response and perceived difficulty when attending speech in noise as indicators of metacognitive listening experience. Since behavior represents the outcome of many processes originating in the brain, we employed multivariate data analyses of the EEG to unveil the temporal dynamics subtending speech processing that could be unseen from performance alone. Neural tracking was estimated through backward modeling, aiming to reconstruct the visual or auditory input from neural data [20,21]. By measuring the neural tracking of lip movements and sound envelope across conditions, we assessed the perceptual cost of attending speech in noise when the speaker wore a face mask, distinguishing between the effects of the occlusion of the lips, degraded audio, and their combination. We predicted a better reconstruction of lip movements and sound envelope in the no-mask condition, in which both the acoustic and visual cues were accessible, and a progressive dampening of neural tracking in the virtual and real mask conditions. We could selectively unravel the visual masking effect by contrasting the neural tracking of speech in the no-mask and virtual mask conditions (since these conditions share the same acoustic information). Instead, the comparison of the speech neural tracking in the virtual vs. real mask conditions allowed us to objectively disentangle the effect of the acoustic filter (present only in the real mask condition). Finally, we could estimate the combined effect of visual and acoustic obstacles by comparing the speech neural tracking between the no-mask and the real mask conditions. 
Results clearly revealed that the visual occlusion of the lips prevented lip tracking and hampered the neural tracking of the sound envelope in early processing stages, which benefited from the integration of audio-visual speech cues. Auditory filtering instead affected the neural tracking of the sound envelope at later stages, which are typically linked to higher-level auditory processing. Finally, the neural costs of listening to a speaker wearing a face mask were associated with behavioral costs measured at the metacognitive level (i.e., the perceived difficulty). Our results unequivocally revealed distinct effects of visual and auditory obstacles induced by face masks on the neural tracking of speech signals.

2. Materials and methods

2.1. Participants

Thirty adults, all native Italian speakers, were enrolled in the study (N = 30, mean age = 27.7 years, SD = 2.1, range: 21.3–31.3; 15 females). No participant reported a history of neurological disorders. The local Ethics Committee approved the study (protocol number: 1485/2017), which was conducted following the Declaration of Helsinki (2013). All participants signed informed consent before starting the experiment and received monetary compensation for their participation. We excluded one participant from the dataset due to the high number of interpolated channels (final sample EEG data: N = 29 participants, mean age = 27.6 years; 14 females). Due to technical issues, the behavioral data from one participant were not recorded (final sample behavioral data: N = 28 participants, mean age = 27.6 years; 13 females).

2.2. Stimuli

Speech stimuli consisted of 9 stories of approximately 3 min each. All stories were in Italian and chosen to be self-contained. Stories were selected from “Fiabe Italiane” by Italo Calvino [22] (“Italian tales”), “I tacchini non ringraziano” by Andrea Camilleri [23] (“Turkeys do not give thanks”), and “Il momento è delicato” by Niccolò Ammaniti [24] (“The delicate moment”).

A native Italian speaker read the stories while she was video recorded. Recordings were performed in a sound-attenuated chamber (BOXY, B-Beng s.r.l., Italy) at the IMT School for Advanced Studies Lucca using an iPhone 7 (camera with 12 MP, video resolution in HD, 720p with 30 fps, at a sampling frequency of 48000 Hz) and a condenser microphone (YC-LM10 II, Yichuang). The video frames comprised the speaker's face with a homogeneous grey wall in the background. All stories were recorded twice, once while the speaker wore a surgical face mask (real mask condition) and once without it (no-mask condition). The no-mask recordings were also employed to create a third condition in which the speaker's face was partially covered with a digital face mask artificially superimposed (virtual mask condition). The face mask used in the real mask condition was a Type IIR, three-layer, single-use medical face mask.

Each video began with silence, during which the speaker looked into the camera, then started reading the story's title and began the storytelling. While reading, the speaker kept her eyes down (reading the text) and looked at the camera again only once the story was over. The speaker was asked to keep the head as still as possible. All audio-video recordings were imported in iMovie® (version 10.3.1; 1600 x 900 pixels resolution). To create the virtual mask condition for each video of the no-mask condition, we uploaded recordings on Instagram® (version 237.0.0.0.35) and employed a filter named “Black Mask” on each recording. Finally, with a custom Matlab script, we ensured that each pixel within the virtual mask area was black for the entire video duration.

We extracted the audio of each video, which was set to mono, down-sampled to 44100 Hz, and set to 32-bit sample definition using Audacity® (version 2.4.2, https://www.audacityteam.org/). The resultant audio files were imported in Matlab (version: R2019b) and equalized to a target Root Mean Square (RMS). Following a pilot experiment conducted on three participants, we selected the RMS of one story as the target value for all of them (RMS value = 0.03). A separate audio stream of 5-talker babble noise (five voices: two females) was also created. We selected five speakers since it was shown that with this number, it is rather difficult to extract meaningful information from the background [25,26]. Each babble talker's voice was recorded in a soundproof room (like the target stimuli) while reading unrelated passages from the fiction book “La Strada” [27]. These single recordings were equalized to the same mean RMS and then superimposed to obtain the 5-talker babble noise. The 5-talker babble noise was then added to the target speech stories. The first 5 s of the babble were muted (i.e., set to zero), and then the volume linearly increased to generate 5 s of fade-in. This allowed participants to clearly identify the target story to follow. For the EEG data analysis, we removed the first 11 s of each recording to exclude the reading of the story title, the 5 s in which the babble was muted, and the following 5 s in which the babble was fading in. Different segments of 5-talker babble were randomly selected and added to the target stories to avoid adaptation to a specific babble noise. Each target story with babble noise lasted approximately 3 min. After piloting, we chose an SNR level of 13.97 dB, which was suitable for reducing speech intelligibility without causing complete disruption. iMovie was used to re-pair each preprocessed audio with the video file of the same story and save the final audio-video files in “.mp4” format.
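The RMS equalization and babble mixing described above can be illustrated with a minimal Python sketch. This is not the authors' Matlab script: the function names are hypothetical, the SNR is assumed to be the dB ratio between the RMS of the whole speech and babble waveforms (the text does not state how it was computed), and the babble is assumed to be at least as long as the speech, which in turn must be longer than the mute plus fade-in period.

```python
import numpy as np

def rms(x):
    """Root mean square of a waveform."""
    return float(np.sqrt(np.mean(np.square(x))))

def equalize_rms(x, target_rms=0.03):
    """Scale a waveform so that its RMS equals target_rms (0.03 in the text)."""
    return x * (target_rms / rms(x))

def add_babble(speech, babble, fs, snr_db, mute_s=5.0, fade_s=5.0):
    """Mix speech with babble at snr_db (assumed RMS-based definition).

    The babble is silent for mute_s seconds and then ramps up linearly
    over fade_s seconds, as described for the stimuli.
    """
    babble = babble[: len(speech)]
    # Scale the babble so that 20*log10(rms(speech)/rms(babble)) == snr_db.
    babble = babble * (rms(speech) / (rms(babble) * 10 ** (snr_db / 20)))
    gain = np.ones(len(speech))
    n_mute, n_fade = int(mute_s * fs), int(fade_s * fs)
    gain[:n_mute] = 0.0
    gain[n_mute : n_mute + n_fade] = np.linspace(0.0, 1.0, n_fade)
    return speech + gain * babble
```

With these defaults, the first 10 s of the mixture contain the muted and fading babble, which corresponds to the portion (together with the title) discarded from the EEG analysis.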

2.3. Task and experimental procedure

2.3.1. Control to verify participant's lip reading ability with silenced videos

We performed a behavioral experiment to assess the ability of our participants to understand speech content when looking at the videos of the no-mask, virtual mask, and real mask conditions in the absence of sound. For each subject, we presented sixty silent video recordings of the speaker pronouncing single words extracted from our stories. For each participant, twenty video recordings out of sixty were randomly selected for each condition (i.e., no-mask, virtual mask, and real mask). In each trial, a black screen was shown for 2 s, followed by the video recording of a single word presented twice, interleaved by a black screen lasting 2 s. At the end of each trial, participants were required to verbally identify the word they perceived from the video previously shown, always in silence. They were not provided with a list of potential responses; instead, they had to articulate what they comprehended from the silent video presentation. Once the participant performed the behavioral test, the EEG experiment started.

This control experiment was performed to verify whether participants could identify words in the virtual mask and real mask conditions (in the absence of sound) by extracting information from the movements of the mask covering the lips.

2.3.2. EEG experimental session

After the behavioral experiment, we started the EEG experiment; each participant attended to audio-visual stories, that is, videos of a speaker talking across no-mask, virtual mask, and real mask conditions. Participants listened to 3 stories per condition, presented in random order (see Fig. 1). Each story was presented to each participant only once and was pseudo-randomly assigned to the no-mask, virtual mask, or real mask condition. Participants were instructed to listen carefully and maintain fixation at the center of the screen where the speaker's face was displayed. To help them, a small white fixation cross appeared at the center of the screen before each video. As soon as the experimenter pressed the spacebar, the fixation cross disappeared, and the video started showing the face of the speaker looking at the participant. None of the participants reported knowing the stories.

Fig. 1.


Schematic representation of the three conditions used in the EEG experiment. Participants were attending to continuous audio-visual speech embedded in babble noise (multiple talkers in the background) to promote multisensory integration. In the no-mask condition, visual and acoustic information was fully available to the listener. In the virtual mask condition, the acoustic information was preserved (i.e., identical to no-mask), but a digital face mask hid the lips. In the real mask condition, the acoustic information was degraded by the physical filter constituted by a real-surgical face mask, and the lips were occluded.

Participants were required to answer thirteen questions at the end of each story: (a) one question on how difficult it was to follow the story, rated from 1 to 7 (1 = not difficult at all, 7 = extremely difficult); (b) six questions concerning the content of the story to evaluate retrieval performance, i.e., the participant's accuracy in performing the task; and (c) six confidence ratings, one following each content question, on a 1 to 7 scale (1 = not confident at all; 7 = extremely confident), to assess how confident participants felt when answering. The content questions used a four-alternative forced-choice (4AFC) format. Each question had a correct answer, a wrong answer semantically related to the correct one, a wrong answer phonologically related to the correct one, and an incongruent answer; the order of presentation of the four answers was randomized. The experiment was performed in a sound-attenuated chamber at the IMT School for Advanced Studies Lucca. Speech stimuli were presented to participants using PsychoPy® software (PsychoPy3, v2020.1.3). The sound was delivered using a single front-facing loudspeaker (Bose Companion® Series III multimedia speaker system, USA) placed behind the computer screen where the video was displayed. The participant sat approximately 60 cm from the screen. Stimuli were delivered at ∼80 dB, measured in front of the loudspeaker (Meterk MK09 Sound Level Meter).

3. Analysis

3.1. EEG recording

Participants were asked to remain relaxed and avoid unnecessary movements during the recording session to limit muscle-related artifacts. Blinking was permitted whenever they wanted. The EEG data were recorded continuously during the entire experimental session at a sampling rate of 500 Hz, using a Brain Products system (BrainVision actiCHamp Plus) and elastic caps with 32 active electrodes (Easycap Standard 32Ch actiCAP snap). Electrode impedances were kept below 30 kΩ. The experiment lasted approximately 1.5 h per participant, including instructions, EEG Net application, EEG recordings, and breaks when needed by the participants. The alignment of stimulus timing and EEG marker was measured using the AV device (EGI).

3.2. EEG preprocessing

We analyzed only EEG segments in which stories were presented to participants. These segments were concatenated and preprocessed offline with the EEGLAB toolbox, version 14.1.2 [28], and a validated preprocessing pipeline [29,30]. Continuous EEG data were low-pass filtered (cut-off at 40 Hz, Hanning filter, order 50), downsampled to 250 Hz to reduce computational time, and high-pass filtered (cut-off at 1 Hz, Hanning filter, order 500). Then, we segmented the data into consecutive 1-s epochs. Noisy epochs were removed with the joint probability algorithm (threshold = 3 SD; Delorme et al., 2007). These filtered and cleaned data were submitted to Independent Component Analysis (ICA, using an EEGLAB function based on the extended Infomax [31–35]), and the resultant ICA weights were applied to the continuous raw data [29,30]. Subsequently, we used CORRMAP, a semi-automatic ICA clustering tool, to identify and remove the components associated with stereotypical artifacts such as blinks or eye movements [36]. CORRMAP correlates ICA inverse weights to find independent components (ICs) similar to a user-defined template. Components whose correlation with the template exceeds a threshold (r = 0.95; [36]) are identified as similar. We also inspected the selected ICs using the ICLabel toolbox provided in EEGLAB. Across participants, the mean number of ICs removed by CORRMAP was 1.8 (SD = 0.4).

Afterward, the cleaned data were low-pass filtered (40 Hz, filter order 50), downsampled to 250 Hz, and high-pass filtered (0.1 Hz, Hanning filter, order 5000). Noisy channels were detected by applying the automatic bad channel detection algorithm of EEGLAB (correlation threshold = 0.9). Deleted channels were interpolated with spherical interpolation (channels interpolated across participants: mean = 1.76, SD = 1.58). Data were re-referenced to an average reference and band-pass filtered between 2 and 8 Hz (filter order 250 for the 2 Hz cut-off and 126 for the 8 Hz cut-off; see Refs. [37–40]). Preprocessed EEG data for each story were then epoched. Epochs lasted 3 min and started when both target speech and babble noise were present and at stable volume (see the Stimuli section). Finally, epochs were downsampled to 64 Hz and re-segmented into trials of 1 min, obtaining 9 trials for each condition and participant. Trials were necessary for the cross-validation procedure; all data were scaled to the max during the evaluation of the regularization parameter [20].
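As an illustration of the final segmentation step, the sketch below cuts a 3-min epoch sampled at 64 Hz into 1-min trials, so that three stories yield nine trials per condition. The function names are hypothetical, and the scaling helper assumes that "scaled to the max" means division by the maximum absolute value, which the text does not specify.

```python
import numpy as np

def split_into_trials(epoch, fs=64, trial_s=60):
    """Cut a (samples x channels) epoch into consecutive, non-overlapping trials."""
    n = int(fs * trial_s)
    n_trials = epoch.shape[0] // n
    return [epoch[i * n : (i + 1) * n] for i in range(n_trials)]

def scale_to_max(x):
    """Assumed scaling: divide by the maximum absolute value of the data."""
    return x / np.max(np.abs(x))

# Three 3-min stories at 64 Hz with 32 channels give nine 1-min trials.
rng = np.random.default_rng(0)
stories = [rng.standard_normal((180 * 64, 32)) for _ in range(3)]
trials = [trial for story in stories for trial in split_into_trials(story)]
```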

3.3. Extraction of acoustic and visual speech features

We extracted two speech features related to the visual and the auditory features from our audio-visual stimuli: (i) the lip movements from the video and (ii) the sound envelope from speech (see Supplementary Materials section 3.2 “Pre-processing of the Motion energy and the auditory-only speech features” for additional speech features extracted as control features: the Motion and the auditory-only). Each speech feature has been extracted and analyzed separately for each condition.

3.3.1. Lip movements

Lip movements were tracked using DeepLabCut [41]. We tracked twelve points: the oral commissures (two) and five points each on the upper and lower lips. For the upper lip, we considered the philtrum, the Cupid's bow peaks (two), and the mid-points between the peaks of the Cupid's bow and the lateral commissures. For the lower lip, we considered the mirror counterparts of the upper lip points along the lower vermilion border (see Supplementary Materials section 1 “Stimulus Reconstruction: The Backward Model” and Fig. S1). For each point, we extracted the X and Y positions (in pixels) for each frame and calculated the mouth's area. In this way, we measured the change in lip movements over time. The lip movements were downsampled to 64 Hz and epoched according to the EEG preprocessing (3 min for each story, including only the portions of the recordings where the audio of both target speech and babble noise were present and at stable volume). Epochs were then cut into 1-min-long trials, obtaining three trials for each story, i.e., nine trials in total, in each condition. Signals were then normalized. Normalization was carried out for neural and stimulus data, following Crosse et al. [21]. Normalizing the stimulus features prevents differences in feature scale from affecting the magnitude of the feature weights.
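The text does not state how the mouth area was computed from the tracked points. One plausible approach, shown here only as a sketch under that assumption, is the shoelace formula applied to the points ordered along the lip contour; the function names and the array layout are hypothetical.

```python
import numpy as np

def polygon_area(xy):
    """Shoelace formula; xy is (n_points, 2), ordered along the contour."""
    x, y = xy[:, 0], xy[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def mouth_area_series(frames):
    """frames: (n_frames, n_points, 2) tracked coordinates -> area per frame (px^2)."""
    return np.array([polygon_area(frame) for frame in frames])
```

The resulting one-dimensional time series, one value per video frame, would then be resampled to 64 Hz and cut into trials like the EEG data.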

3.3.2. Sound envelope

To extract the sound envelope, we imported the audio streams in MATLAB. As for the lip movements, we epoched each story and cut files into 1-min-long trials (N = 9 trials in total). Then, the audio envelope was extracted for each trial by taking the absolute value of the Hilbert transform of the original stories (i.e., the envelope was estimated using the clean stories without 5-talker babble noise; [42]), applying a third-order low-pass Butterworth filter (8 Hz cut-off; filtfilt MATLAB function), and downsampling to 64 Hz to match the EEG data resolution [39,40]. Finally, we normalized the envelope by dividing each value by its maximum value to optimize the cross-validation necessary to estimate the regularization parameter [20,37].
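The envelope extraction steps can be sketched in Python with SciPy as follows. This is an approximate analogue of the MATLAB procedure, not the authors' code: the function name is hypothetical, zero-phase filtering uses second-order sections (sosfiltfilt) in place of MATLAB's filtfilt, and polyphase resampling is an assumed choice for the 44100 Hz to 64 Hz conversion.

```python
from math import gcd

import numpy as np
from scipy.signal import butter, hilbert, resample_poly, sosfiltfilt

def speech_envelope(audio, fs_in=44100, fs_out=64, cutoff_hz=8.0, order=3):
    """Hilbert envelope, zero-phase low-pass, downsample, divide by maximum."""
    env = np.abs(hilbert(audio))  # magnitude of the analytic signal
    sos = butter(order, cutoff_hz, btype="low", fs=fs_in, output="sos")
    env = sosfiltfilt(sos, env)  # forward-backward (zero-phase) filtering
    g = gcd(fs_out, fs_in)
    env = resample_poly(env, fs_out // g, fs_in // g)  # 44100 Hz -> 64 Hz
    return env / np.max(env)  # normalize to a maximum of 1
```

Because the envelope is computed on the clean story audio, the babble noise never enters this feature; only the EEG reflects the noisy listening condition.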

3.4. Stimulus Reconstruction

To estimate the reconstruction of different speech features (lip movements and sound envelope) from the neural data, we employed a backward model using the mTRF Toolbox ([20]; for details, see Supplementary Materials section 1, “Stimulus Reconstruction”). This linear model reconstructs the speech feature of interest from the EEG activity recorded at all electrodes; importantly, the sensors' activity is weighted according to their informativeness [20,43].

By applying this model, we wanted to estimate associations between the EEG signals and the stimuli speech features, including typical predictions occurring before the unfolding of the stimulus in the case of continuous language [44,45].

In this model, a time lag of zero represents exact synchrony between brain activity and speech features. At negative or positive lags, the model estimates the association between the speech features and brain activity that, respectively, precedes or follows them.

To train the model, a leave-one-out cross-validation was performed to find the optimal regularization parameter, λ (for more details, see Supplementary Materials section 2, “Regularization Parameter Estimation”). Once we tuned the model parameter, λ, the model was tested on the same data set [20]. We trained the decoder with the chosen λ using leave-one-out cross-validation, keeping apart one trial at each iteration, and then averaged the models obtained across iterations. Afterward, we tested the averaged model on each left-out trial and computed the reconstructed speech feature (lip movements or sound envelope). Then, we correlated the reconstructed speech feature with the original one; the result of this Pearson correlation is the reconstruction performance (Decoding r).
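The logic of a backward model with leave-one-trial-out evaluation can be sketched with plain ridge regression in NumPy. This is a generic illustration, not the mTRF Toolbox implementation: all function names are hypothetical, the regularization is a simple identity penalty, lags are expressed in samples, and the sketch trains on the concatenated remaining trials and tests on the left-out one rather than averaging per-iteration models as described above.

```python
import numpy as np

def lagged_design(eeg, lags):
    """Stack time-shifted copies of the EEG (samples x channels).

    Column block k holds eeg[t + lags[k]], zero-padded at the edges, so a
    positive lag pairs the stimulus with later brain activity.
    """
    n, c = eeg.shape
    X = np.zeros((n, c * len(lags)))
    for k, lag in enumerate(lags):
        shifted = np.zeros_like(eeg)
        if lag > 0:
            shifted[:-lag] = eeg[lag:]
        elif lag < 0:
            shifted[-lag:] = eeg[:lag]
        else:
            shifted[:] = eeg
        X[:, k * c : (k + 1) * c] = shifted
    return X

def fit_decoder(X, y, lam):
    """Ridge solution w = (X'X + lam*I)^-1 X'y."""
    XtX = X.T @ X
    return np.linalg.solve(XtX + lam * np.eye(XtX.shape[0]), X.T @ y)

def loo_reconstruction(eeg_trials, stim_trials, lags, lam):
    """Mean Pearson r between reconstructed and actual left-out stimulus."""
    X = [lagged_design(e, lags) for e in eeg_trials]
    rs = []
    for i in range(len(X)):
        X_train = np.vstack([x for j, x in enumerate(X) if j != i])
        y_train = np.concatenate([s for j, s in enumerate(stim_trials) if j != i])
        w = fit_decoder(X_train, y_train, lam)
        rs.append(np.corrcoef(X[i] @ w, stim_trials[i])[0, 1])
    return float(np.mean(rs))
```

Running loo_reconstruction once per lag window, with the same λ, would give one reconstruction value per window, which is the quantity plotted over time lags.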

This procedure was performed for each decoding model calculated across time lags (sliding window of 45 ms in steps of 15 ms) in a range between −115 and +575 ms. We obtained a correlation coefficient (r-value) for each 45-ms time-lag window. Because the window slid in 15-ms steps, we obtained an r-value every 15 ms [39]. For example, the first r-value was calculated for the time-lag window [−115, −70 ms], the second for the window 15 ms later, [−100, −55 ms], and so on. Thus, the first r-value refers to the window centered at −92.5 ms, the second to the window centered at −77.5 ms, and so on. For visualization purposes, we selected three windows of interest of approximately 50 ms each according to the peaks of reconstruction performance. For these windows, we fitted the decoding model and then transformed the decoder weights into the forward direction so that they became neurophysiologically interpretable and could be shown in the results figures. Indeed, the topography of the raw decoder weights cannot be interpreted neurophysiologically [46]. Transformed decoder weights have a more straightforward interpretation; their value and sign are directly related to the strength of the signal at different channels [21].

3.4.1. Estimation of the null effect (null decoding)

We computed a null decoding model for each participant and condition to test whether the decoder's performance was above chance. To this end, we permuted the order of the 1-min trials of the original speech feature (lip movements or sound envelope) to obtain mismatched pairs of speech feature and EEG response, on which the decoder was fitted to estimate the null decoding (mTRFpermute function with 100 iterations; [20,21]). Then, the correlation coefficients of all these decoding models, computed across iterations, were averaged to obtain the ‘null decoding.’ This procedure was done separately for each participant.
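The permutation logic can be sketched as follows. This does not reproduce mTRFpermute: the function name and the score_fn callback (any function returning a reconstruction score for a list of EEG trials and a list of stimulus trials) are hypothetical, and rejecting permutations that leave a trial correctly paired is an assumption about how "mismatched" pairs were obtained. At least two trials are required.

```python
import numpy as np

def null_decoding(score_fn, eeg_trials, stim_trials, n_iter=100, seed=0):
    """Average score over n_iter shuffles of the stimulus trial order."""
    rng = np.random.default_rng(seed)
    n = len(stim_trials)
    scores = []
    for _ in range(n_iter):
        perm = rng.permutation(n)
        # Redraw until no trial keeps its original EEG partner.
        while np.any(perm == np.arange(n)):
            perm = rng.permutation(n)
        shuffled = [stim_trials[j] for j in perm]
        scores.append(score_fn(eeg_trials, shuffled))
    return float(np.mean(scores))
```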

3.5. Statistical analysis

3.5.1. Report of the p-values

For the behavioral data, which involved fewer repeated measures, the results were Bonferroni corrected (pcorr); we reported both Bonferroni corrected and uncorrected p-values (p) as exact p-values. Instead, for the comparisons of the EEG-decoding data, since we performed a larger number of tests (in the time domain) and modeled noisy data (the EEG), all results were FDR corrected (pFDR), as it is a less strict correction approach. In case of significant results, both FDR corrected and uncorrected p-values (p) have been reported as <0.05. The non-statistically significant p-values have been reported as >0.05.

For significant effects, the effect size (Cohen's d [47,48]) and the confidence intervals (CIs) have been reported to evaluate the magnitude of the observed effects. The CIs were computed using the bootstrap method with N = 1000 bootstrap replicas.
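A bootstrap confidence interval for the effect size can be sketched as below. The text does not specify which variant of Cohen's d or which bootstrap interval was used; this sketch assumes a paired-samples d (mean difference divided by the standard deviation of the differences) and a percentile interval obtained by resampling participants. Function names are hypothetical.

```python
import numpy as np

def cohens_d_paired(a, b):
    """Assumed variant: mean of the paired differences over their SD."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.mean(diff) / np.std(diff, ddof=1))

def bootstrap_ci(a, b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the paired Cohen's d."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = len(a)
    ds = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample participants with replacement
        ds[i] = cohens_d_paired(a[idx], b[idx])
    lo, hi = np.percentile(ds, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```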

3.5.2. Behavioral control: lip reading ability of silenced videos

The behavioral test on lip-reading allowed us to evaluate the participants’ ability to comprehend silent speech across the different conditions (no-mask, virtual mask, and real mask). A non-parametric Friedman test (one factor, three levels) was employed to verify that the occlusion of the lips reduced lip-reading ability. Multiple comparison post-hoc tests were performed with Bonferroni correction.

3.5.3. Behavioral test within the EEG experiment

A Friedman test was used as the primary statistic to test the behavioral outcomes of the EEG experiment. Post-hoc tests were performed by applying the multiple comparison tests with Bonferroni correction; the corresponding p-values were reported as pcorr when statistically significant.

We evaluated (1) the participant's accuracy in answering content retrieval questions and the relative (2) committed errors (which will be described in detail subsequently), (3) the confidence of the participants when answering the content questions, (4) the perceived difficulty by participants when listening to speech embedded in babble noise.

As mentioned, we wanted to explore further the type of errors committed by participants when answering the content questions. Participants had to select among four alternative forced choices (4AFC), and the answers were differentiated into (1) correct answers, (2) answers semantically related to the correct one, (3) answers phonologically related, and (4) incongruent answers.

3.5.4. Statistical analysis of the decoding outcomes

First, we assessed the existence of neural tracking of the lip movements and the sound envelope at the group level. To this end, we compared the reconstruction performance at the group level of each of our speech features against their null decoding performance, i.e., decoding vs. null decoding within each condition. Analyses were conducted by transforming the correlation coefficients (r) of each condition through Fisher's transformation, obtaining z-transformed values. This was done for both the real decoding (the correlation between the reconstructed and the original stimulus) and the null decoding (the correlation between the reconstructed and the mismatched stimulus). The z-values of each condition were then compared with one-tailed paired t-tests at each time lag (from −115 to +575 ms). P-values were corrected across all time lags (N = 47) using the False Discovery Rate (FDR) procedure to control for multiple comparisons [49].
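The comparison against the null decoding can be sketched as follows. The sketch assumes that the FDR procedure is the Benjamini-Hochberg step-up method (the text only cites [49]) and that scipy.stats.ttest_rel is available with the alternative argument (SciPy 1.6 or later); function names and the participants-by-lags array layout are hypothetical.

```python
import numpy as np
from scipy import stats

def fdr_bh(p):
    """Benjamini-Hochberg adjusted p-values (assumed FDR variant)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

def decoding_vs_null(r_real, r_null):
    """r_real, r_null: (participants x lags) -> t-values and adjusted p per lag."""
    z_real, z_null = np.arctanh(r_real), np.arctanh(r_null)  # Fisher transformation
    t, p = stats.ttest_rel(z_real, z_null, axis=0, alternative="greater")
    return t, fdr_bh(p)
```

The same two functions would apply to the between-condition contrasts, with the two conditions' z-values in place of the real and null decoding.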

Second, we contrasted the decoding z-values across conditions. Here, we directly compared the reconstruction performance values between conditions with a series of one-tailed paired t-tests across all time lags in which we measured reliable reconstruction performance (where the real decoding was significantly higher than the null decoding). We computed each t-test under the specific hypotheses that reconstruction performance would have the following directionality: no-mask (full cues) > virtual mask (visual obstacle) > real mask (visual and acoustic obstacle). P-values were corrected for multiple comparisons (time lags) with the FDR procedure [49].
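The restriction to reliable time lags followed by FDR correction can be illustrated as follows. This sketch assumes a Benjamini-Hochberg step-up procedure (reference [49] is not reproduced in this section) and assumes that the correction runs only over the reliable lags. All lags, flags, and p-values are hypothetical.

```python
def bh_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns one boolean per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # largest rank k such that p_(k) <= k / m * q
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            reject[idx] = True
    return reject


def significant_lags(lags_ms, reliable, p_values, q=0.05):
    """Correct only over lags with reliable decoding; return surviving lags."""
    kept = [i for i, ok in enumerate(reliable) if ok]
    if not kept:
        return []
    flags = bh_reject([p_values[i] for i in kept], q)
    return [lags_ms[i] for i, flag in zip(kept, flags) if flag]


# Hypothetical inputs for six lags of a between-condition contrast.
lags_ms = [0, 15, 30, 45, 60, 75]
reliable = [True, True, True, True, True, False]  # real decoding > null decoding
p_values = [0.004, 0.012, 0.020, 0.030, 0.400, 0.001]

print(significant_lags(lags_ms, reliable, p_values))
```

In this toy example the last lag has the smallest p-value but is excluded, because decoding at that lag was not reliable in the first place.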

3.5.5. Analysis of the correlation between neural and behavioural data

After establishing whether neural tracking of the sound envelope and the behavioral measures varied across conditions, we were interested in analyzing potential correlations between neural tracking and behavioral outcomes. Thus, the analysis was limited to the data that were indicative of neural and behavioral changes. To achieve this, we conducted a correlation analysis between the significant behavioral measure and the reconstruction performance (neural tracking). These correlations were computed at each time lag. Subsequently, we examined both uncorrected and corrected p-values, the latter applying a correction for multiple comparisons across time lags [49].
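A lag-by-lag correlation of this kind could look like the sketch below. The text does not state the type of correlation coefficient, so Pearson's r is an assumption here, and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_by_lag(z_by_lag, behavior):
    """z_by_lag maps a lag (ms) to one Fisher z-value per participant."""
    return {lag: pearson_r(z, behavior) for lag, z in z_by_lag.items()}


# Hypothetical data: four participants, two time lags.
difficulty = [2.0, 3.0, 4.0, 5.0]
z_by_lag = {
    185: [0.10, 0.12, 0.09, 0.11],
    200: [0.14, 0.12, 0.10, 0.08],
}
r_by_lag = correlation_by_lag(z_by_lag, difficulty)
```

A negative coefficient at a given lag would indicate that participants reporting higher difficulty showed lower reconstruction performance at that lag.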

4. Results

4.1. Control on lip reading ability of silenced videos

As expected, the Friedman test revealed that participants were severely impaired at recognizing words when the lips were hidden (χ2Friedman (2) = 51.37, p < 0.001; N = 30 participants, mean age = 27.7 years; 15 females). Post-hoc tests (Bonferroni corrected) confirmed a significant difference between no-mask and the two face mask conditions (no-mask vs. virtual mask: pcorr < 0.001; d’Cohen = 1.86, CI [1.35 2.39]; no-mask vs. real mask: pcorr < 0.001; d’Cohen = 1.83, CI [1.35 2.44]); no significant difference emerged between virtual and real mask conditions (pcorr = 1, see Supplementary Materials Fig. S2).

4.2. Behavioral test results of the EEG experiment

Participants' accuracy in answering content questions did not differ significantly across conditions, suggesting that even in the mask conditions participants were able to attend to the auditory input efficiently. Indeed, individuals' ability to process and retrieve speech content embedded in 5-talker babble noise did not differ across conditions (χ2Friedman (2) = 5.3, p = 0.07). Although not statistically significant, there was the expected trend indicating decreasing accuracy from the no-mask to the real mask condition (see Fig. 2). Furthermore, the mask condition modulated the kind of errors participants committed when answering the content comprehension questions and their confidence in answering those questions. Participants chose among four alternatives (4AFC): the correct answer and three wrong answers. The wrong answers could be phonologically or semantically related to the correct answer or completely unrelated. A statistically significant difference emerged in phonological errors (χ2Friedman (2) = 10.14, p = 0.006); the post-hoc test (Bonferroni corrected) revealed a selective difference between no-mask and real mask conditions (pcorr = 0.004; d’Cohen = 0.75, CI [0.39 1.18]; see Fig. 2b). This result suggested that the presence of a surgical face mask covering the lips and filtering the auditory information increased errors in which wrong answers were phonologically associated with correct responses (no significant effects for the comparisons across the other conditions; no-mask vs. virtual mask: p = 0.33; virtual mask vs. real mask: p = 0.33). Conversely, neither semantic errors (χ2Friedman (2) = 1.77, p = 0.41) nor incongruent errors (χ2Friedman (2) = 1.15, p = 0.56; see Fig. 2b) were modulated by the type of condition.
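As a quick arithmetic check, the p-values above can be recovered from the reported statistics. With three conditions the Friedman statistic is referred to a chi-square distribution with 2 degrees of freedom, whose upper-tail probability has the closed form exp(−χ²/2). The function name below is illustrative only.

```python
import math


def friedman_p_three_conditions(chi2):
    """Upper-tail p-value of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-chi2 / 2.0)


# Friedman statistics reported in the paragraph above.
reported = {
    "accuracy": 5.3,
    "phonological errors": 10.14,
    "semantic errors": 1.77,
    "incongruent errors": 1.15,
}

for measure, chi2 in reported.items():
    p = friedman_p_three_conditions(chi2)
    print(f"{measure}: chi2(2) = {chi2}, p = {p:.3f}")
```

The resulting values round to 0.07, 0.006, 0.41, and 0.56, matching the p-values given in the text.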

Fig. 2.


(a) Results of the three behavioral measures acquired during the EEG experiment (accuracy, confidence, and perceived difficulty) shown across conditions. Listening to a person wearing a face mask (virtual or real) increased participants' perceived difficulty. (b) Types of errors committed in the content questions: phonological errors at the top, semantic errors in the middle, and incongruent errors at the bottom. In the real mask condition, the number of errors increased significantly when the answers were phonologically associated with the correct responses.

Participants’ confidence when answering each content question was not affected by the mask condition (χ2Friedman (2) = 5.39, p = 0.06). Nonetheless, as for the accuracy, even in this case, we had a trend of responses. Specifically, confidence was highest in the no-mask condition and progressively lower in the virtual mask and real mask conditions.

Finally, results on the perceived difficulty highlighted a significant main effect of the condition (χ2Friedman (2) = 17.11, p < 0.001). Post-hoc tests (Bonferroni corrected) revealed a statistically significant difference between the no-mask and the virtual mask condition (pcorr = 0.006; d’Cohen = 0.44, CI [0.16 0.71]) and between the no-mask and the real mask condition (pcorr < 0.001; d’Cohen = 0.57, CI [0.31 0.85]), but no difference between virtual and real mask conditions (pcorr = 1). Results indicated that the presence of a mask hiding the lips, whether virtual or real, substantially increased the perceived difficulty of following speech embedded in 5-talker babble noise.

4.3. Decoding Model Results

4.3.1. Lip movements decoding results

First, we evaluated the decoding of lip movements in the no-mask and virtual mask conditions (the virtual mask videos were created by applying a virtual mask on the actor's face recorded in the no-mask condition videos; thus, the lip movements, despite being invisible, were precisely the same). We obtained successful lip movement reconstruction only in the no-mask condition. Indeed, reconstruction performance exceeded the null decoding at time lags from 50 to 575 ms (pFDR<0.05; d’Cohen = 1.03, CI [0.71 1.43]; see Fig. 3). Conversely, we obtained no significant results in the virtual mask condition (pFDR>0.05); that is, we found no evidence for participants reconstructing lip movements from acoustic information. Additionally, we investigated the decoding of a control visual model in the no-mask condition (the Motion Energy [50]) to verify whether it was possible to decode the information related to the displacement of the mouth area. In this case, we found that it was possible to decode general motion information, but the lip movements were indeed more informative (see Supplementary Materials Fig. S3).

Fig. 3.


Neural decoding of the lip movements in no-mask (left) and virtual mask conditions (right). The correlation coefficients of the lip movements represent the reconstruction performance - Decoding (r), on the y-axis - calculated as the correlation between the reconstructed and original lip movements every 15 ms; the corresponding null decoding is depicted in grey. The x-axis comprises time lags between ∼ −100 and + 600 ms. The colored continuous line represents the group mean, and the shaded areas represent the SE. Grey horizontal lines under the null decoding highlight the statistically significant time lags in which the reconstruction performance exceeded the null decoding (FDR corrected statistics). Only in the no-mask condition did we observe a reliable reconstruction of the lip movements from the EEG data. Under each plot, forward topographies for each condition are depicted at representative 50 ms time windows.

4.3.2. Sound envelope decoding results

For each condition, we estimated the neural tracking of the sound envelope by comparing the reconstruction performance of auditory stimulus features against their corresponding null reconstruction, also defined as null decoding. The reconstruction performance values exceeded the null decoding in each condition (no-mask condition: 115 to 410 ms, pFDR<0.05; d’Cohen = 2.58, CI [1.48 3.34]; virtual mask condition: 115 to 425 ms, pFDR<0.05; d’Cohen = 2.54, CI [1.91 3.32]; real mask condition: 115 ms up to 320 ms, pFDR<0.05; d’Cohen = 2.01, CI [1.47 2.52]). Results clearly provided evidence of successful decoding of sound envelope irrespective of whether the mouth was visible and/or speech was filtered by a face mask (see Fig. 4a).

Fig. 4.


(a) Sound envelope reconstruction in no-mask, virtual, and real mask conditions. The correlation coefficients represent the reconstruction performances - Decoding (r), on the y-axis - calculated as the correlation between the reconstructed and original sound envelope every 15 ms; the corresponding null decoding performances are depicted in grey. The x-axis comprises time lags between ∼ −100 and + 600 ms. The colored continuous line represents the group mean, and the shaded areas represent the SE. Grey horizontal lines under the null decoding highlight the statistically significant time lags in which the reconstruction performance exceeded the null decoding of the envelope (FDR corrected statistics). In each condition, we measured reliable reconstruction performance. Below each plot, forward projected topographies of decoding models at representative 50 ms time windows. (b) The visual obstacle effect was investigated by comparing the no-mask vs. the virtual mask condition, the acoustic obstacle effect was observed by analyzing the virtual mask vs. the real mask condition, and the combined audio-visual obstacle effect was obtained by contrasting the no-mask vs. the real mask condition. For all comparisons, the statistically significant time lags, that is, the time points in which the neural tracking differed between conditions, are represented by a grey line (FDR corrected statistics).

Having demonstrated reliable sound envelope reconstructions, we investigated the impact of auditory and visual obstacles provided by face masks. To this aim, we contrasted the decoding accuracy across conditions. First, we tested the no-mask vs. virtual face mask condition to estimate the specific impact of covering the lips on the neural tracking of the sound envelope (the acoustic information was identical across these two conditions), that is, the impact of the visual obstacle. The decoding of the sound envelope in the no-mask condition exceeded the virtual mask condition at the earliest time lags, between −55 and −10 ms (pFDR<0.05; d’Cohen = 0.43, CI [0.08 0.79]; see Fig. 4b), suggesting that available lip movements contributed to the neural tracking by anticipating the sound envelope tracking.

We then specifically estimated the acoustic obstacle determined by the face mask. To do so, we compared the reconstruction performance of the sound envelope between the virtual mask and the real mask conditions. In both conditions, the lips were not visible due to the presence of a face mask, but only in the real mask condition was the auditory input filtered by the surgical mask, as the speech was recorded while the speaker was wearing it. Sound envelope reconstruction was higher in the virtual mask as compared to the real mask at several time lags (between 5 and 50 ms and between 125 and 320 ms; pFDR<0.05; d’Cohen = 0.49, CI [0.21 0.81]; see Fig. 4b), suggesting that the physical barrier constituted by the surgical mask hampered the auditory neural tracking at later processing stages as compared to the visual obstacle (mouth occlusion).

Finally, we contrasted the reconstruction performance of the sound envelope between no-mask and real mask conditions to estimate the combined impact of lips occlusion and acoustic filter. Results revealed a protracted dampening of reconstruction performance between −115 and 290 ms (pFDR<0.05; d’Cohen = 0.57, CI [0.31 0.91]; see Fig. 4b). This effect revealed the combined audio-visual obstacle generated by listening to a speaker wearing a surgical face mask.

Having measured the different obstacles on neural tracking generated by face masks, we further investigated the importance of the shared information between lip movements and sound envelope in the absence of any obstacle. To this aim, we created a new speech feature, the auditory-only, in which we regressed out all the common information between the lip movements and the sound envelope (this was possible only in the no-mask condition). By comparing the decoding of the auditory-only speech feature and the standard sound envelope in the no-mask condition, we selectively described the auditory neural tracking net of the information provided by audio-visual correspondences (between lip movements and sound envelope). A significant difference emerged, with a long-lasting decrease in reconstruction performance from −115 to 380 ms (see Supplementary Materials Fig. S4), highlighting the role of the common information between lip movements and sound envelope dynamics for the neural tracking of speech. The decoding drop occurred at time lags substantially overlapping with the combined audio-visual obstacle generated by listening to a speaker wearing a surgical face mask.
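One simple way to regress out shared information is to keep the residuals of a least-squares fit of the sound envelope on the lip signal. The sketch below assumes a single-predictor ordinary least-squares fit on time-aligned samples. The exact procedure used for the auditory-only feature is described in the Supplementary Materials and may differ, and the sample values are hypothetical.

```python
def regress_out(target, predictor):
    """Return the residuals of target after a least-squares fit on predictor."""
    n = len(target)
    mean_t = sum(target) / n
    mean_p = sum(predictor) / n
    sxy = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictor, target))
    sxx = sum((p - mean_p) ** 2 for p in predictor)
    slope = sxy / sxx
    intercept = mean_t - slope * mean_p
    return [t - (intercept + slope * p) for p, t in zip(predictor, target)]


# Hypothetical, time-aligned samples of the lip signal and the sound envelope.
lip = [0.1, 0.4, 0.8, 0.5, 0.2, 0.6]
envelope = [0.2, 0.5, 0.9, 0.7, 0.1, 0.6]

# Residual envelope: the part not linearly explained by the lip signal.
auditory_only = regress_out(envelope, lip)
```

By construction the residuals are linearly uncorrelated with the predictor, so any remaining tracking of `auditory_only` cannot be attributed to linear audio-visual correspondence.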

4.4. Correlation between neural and behavioural data

At the behavioral level, face masks significantly affected only the perceived difficulty in attending speech. Here, we aimed to explore whether a relationship existed between this behavioral measure and neural tracking. We hypothesized that the increase in perceived difficulty would be associated with a decrease in neural tracking and performed a one-tailed analysis. In each condition, we analyzed the correlation between the reconstruction performance of the sound envelope at each time lag (at the single participant level, for each time lag, decoding r values were transformed into z-values with the Fisher transform) and the related perceived difficulty. An association emerged in the two mask conditions. In the virtual mask condition, correlation coefficients were statistically significant between 200 and 245 ms (p < 0.05); in the real mask condition, between 170 and 230 ms (pFDR<0.05; see Fig. 5).

Fig. 5.


Correlations at each time lag between the neural tracking of the sound envelope and the perceived difficulty in each condition (left panel). A negative correlation between the neural tracking and the perceived difficulty was found in the two mask conditions (virtual and real mask) at time lags ∼200 ms.

5. Discussion

In the present work, we aimed to dissociate auditory and visual obstacles generated by face masks during face-to-face communication. Face masks increased perceived difficulty in attending the speech. In addition, when listeners were asked to report communication content, they made more phonological errors when the speaker wore a real surgical mask. Sound envelope reconstruction was possible in all conditions, whereas covering the mouth area prevented the neural tracking of lips. Yet, when contrasting the decoding accuracy of sound envelope across conditions, we observed a progressive reduction of reconstruction performance depending on whether we measured the visual, the acoustic, or the combined filtering effect. Finally, we found that the difficulty of listening when the speaker wears a face mask was mirrored by the costs measured at the neural level.

5.1. Behavioral effects of face masks on speech processing

No significant changes in accuracy in answering speech content questions emerged across the different conditions, suggesting that the SNR was high enough to allow efficient speech processing in all tested contexts. This outcome aligns with previous studies showing that face masks do not substantially affect speech comprehension [12,14]. Nonetheless, even if not statistically significant, there was a noticeable trend indicating decreasing accuracy from the no-mask to the real mask condition. Similarly, participants' confidence was unaffected when answering content questions, again with a decreasing trend. Notably, the mask condition modulated the type of errors when answering the content comprehension questions. Phonological errors increased when listening to a speaker wearing a surgical face mask, that is, when both visual and auditory obstacles were combined.

We attribute this effect to the attenuation of frequencies above 1–2 kHz caused by the use of surgical face masks (see also [11,51]). The reduced sound quality made the distinction between phonetically (i.e., acoustically) related answers in the 4AFC task more challenging. Indeed, no such effect emerged when only visual cues were missing (virtual mask). This result aligns with previous literature that has explored the influence of face masks on the acoustic properties of voice, such as the alteration of formant frequency, focusing on how this phenomenon changed across different types of masks [52,53].

When listening to a speaker, we process visual and acoustic information in parallel to support speech comprehension [54,55]. Seeing the articulatory movements of the speaker's face sustains intelligibility [56], especially in challenging listening conditions [57–59]. Moreover, speech processing is hampered when the acoustic information is degraded, such as by the presence of face masks [11,51], which are known to dampen frequencies above 1–2 kHz (with different effects depending on the material used for the mask [11]).

We also found a progressive increase of perceived difficulty across conditions, depending on whether only the lip information was absent (no-mask vs. virtual mask) or both visual and auditory inputs were altered (no-mask vs. real mask). This result is consistent with previous findings highlighting the impact of face masks on metacognitive levels of communication [10]. In recent years, various studies used face masks as a model to investigate audio-visual speech integration [13,14,51,60,61], but the metacognitive dimensions of speech perception remained largely overlooked. Giovanelli and colleagues [10] revealed that compared to listening to an unmasked talker, listening to a speaker wearing a face mask increased listening effort and reduced confidence in speech understanding. In fact, the face mask condition was comparable to an auditory-only condition of the same study.

5.2. The neural tracking of lip movements and sound envelope

First, we analyzed the neural tracking of the visual information in the no-mask and virtual mask conditions. Lip movements could be successfully reconstructed only in the no-mask condition, between 50 and 575 ms, highlighting activity at occipital sensors. Conversely, we measured neural tracking of the sound envelope for all conditions at central sensors. In the no-mask and virtual mask conditions, we observed robust and prolonged sound envelope tracking, spanning −115 to 425 ms. In the real mask condition, tracking instead spanned −115 to 320 ms.

Time lags at which decoding of lip movements was successful were delayed compared to the sound envelope reconstruction dynamic. This agrees with results showing that lip movement processing is more protracted than sound envelope processing [56]. A recent MEG study showed that the brain can reconstruct acoustic information from silent lip movements [62]. Had it been possible to decode the lip movements even in the virtual mask condition, this would have indicated that the brain reconstructs the unseen lip movements from heard speech. However, we found no such evidence. The present paradigm did not include a visual-only condition that could potentially evoke a higher neural tracking and, thus, more sensitivity. Bourguignon and colleagues [63] recently conducted a MEG experiment, demonstrating how the brain utilizes lip-reading signals in the absence of any sound to generate a basic auditory speech representation in early auditory cortices. Their findings emphasized that such activation represents a rapid synthesis of the auditory stimulus rather than mental imagery of unrelated sounds.

When listening to speech in noise, like in a 5-talker situation, paying attention to the target speech and filtering out unattended voices implies greater effort for the listener [19,40]. Especially in noise, the availability of both acoustic and visual speech signals is fundamental for efficient tracking and comprehension [54]. Indeed, looking at the lips greatly supports speech tracking and intelligibility [4,55]. A recent MEG study showed that surgical face masks negatively affect listeners' neural tracking of acoustic features (i.e., sound envelope and spectral features; [12]). The authors interpreted these results as the consequence of the missing visual input and the subsequent impossibility of integrating acoustic and visual information. In a subsequent experiment, the same authors [16] expanded their previous findings and described the mechanisms underlying the contributions of acoustic and visual cues, further highlighting the pivotal role of visual speech in multi-speaker scenarios. Nonetheless, to fully understand the mechanisms underlying this effect, it is crucial to disentangle the impact of face masks’ acoustic and visual filters. The main novelty of this study lies in the clear separation of the contributions provided by audio and visual speech features for the neural tracking of speech. Based on previous findings [12,63,64], we expected a decreasing reconstruction performance of speech envelope along a continuum spanning from the no-mask to the virtual mask to the real mask condition.

5.3. Masking the lips impairs early reconstruction of the sound envelope

We analyzed the specific visual obstacle on the sound envelope reconstruction, that is, how the visual filter of the face mask undermined speech processing. To this aim, we compared the no-mask vs. virtual mask conditions and measured a drop in the reconstruction performance of the envelope at time lags between −55 and −10 ms.

Acoustic and visual information are correlated in continuous speech, with a peak in frequency bands corresponding to the syllabic rate between 4 and 8 Hz (e.g., Ref. [4]). Evidence of the influence of visual cues on acoustic processing is given by the cross-modal phase modulation of auditory tracking by visual speech signals. Power and colleagues [65] evaluated the neural tracking of speech signals and analyzed the role of congruent visual speech cues on auditory tracking. The phase of auditory tracking was altered in the presence of congruent visual stimulation. Visual speech also represents a critical complement to auditory speech signals by providing information on temporal markers of the upcoming acoustic stimulus and the relative content. Coherent visual cues support the neural tracking of the auditory cortex, enhancing speech processing [66]. This phenomenon is an example of multimodal facilitation of speech processing [67]. When available, visual cues decrease the listener's cognitive load and enhance the precision in predicting speech [66]. Moreover, visual cues convey information about the position of the speaker's articulators, which becomes fundamental when listening to speech in noise [66], such as in the present experiment. The brain can efficiently integrate audiovisual signals of speech despite lip movement onsets preceding the sound onsets by about 100–300 ms [1]. Our findings are consistent with these observations. Specifically, we observed an early detrimental effect of the lack of visual information on sound envelope tracking (no-mask vs. virtual mask) at the earliest time lags, in line with the role of visual cues in the early stages of speech processing.

5.4. Uncovering the effect of the acoustic filter

By contrasting the virtual vs. real mask condition, we assessed the impact of the degraded auditory information on the sound envelope tracking, namely the acoustic obstacle. In this comparison, we selectively evaluated the sound filtering effect of a real face mask since the mouth area was covered in both conditions. Given that listening to speech in noisy environments demands focused attention on target speech and suppression of distractors [68,69], we predicted more favorable neural tracking in the virtual mask condition compared to the real mask condition, as the latter represented a more compromised scenario. Consistently, we found that the envelope tracking in the virtual mask condition exceeded the one in the real mask condition at several time lags, from 5 to 50 ms and from 125 to 320 ms. The main difference between the speech envelope reconstructions occurred around 200 ms, a typical time scale of neural tracking associated with auditory selective attention (e.g., Ref. [40]).

Protective face masks have been of fundamental importance during the pandemic of COVID-19; despite that, consequent difficulties in face-to-face communication have been underlined, especially for listeners with hearing impairment [11,70]. Previous behavioral studies highlighted the impact of face masks in filtering acoustic information, leading to decreased comprehension both in normal hearing and hearing-impaired participants [51,70,71]. Also, face masks challenge auditory attention, intelligibility, and memory recall, especially when listening to speech in noise [72,73]. According to the mask's fabric, the detrimental effect on comprehension and subjective listening effort can vary [13,61]. Overall, the present results confirmed and expanded such evidence by providing their possible neural correlates [13,14].

5.5. Real surgical masks: combined acoustic and visual obstacles

The contrast between the no-mask vs. real mask condition allowed us to evaluate the effect of the combined acoustic-visual obstacles. Here, we found that the decoding of the sound envelope in the no-mask condition exceeded the real mask within time lags between −115 and 290 ms. As described earlier, face masks affect speech at multiple levels: by covering the mouth area [11] and preventing lip reading, by hampering the acoustic propagation of the voice [61], and affecting the subjective perceived difficulty and metacognitive performance monitoring [10]. Coherently, we observed that the acoustic-visual obstacle led to the greatest and most prolonged dampening of speech neural tracking. To further dissect the role of audio-visual information on the tracking of the sound envelope, we also developed a predictor (auditory-only) from which we regressed out shared information between the lip movements and the sound envelope. This allowed estimating neural tracking without the influence of audio-visual correspondences. Coherently with the previously described combined audio-visual obstacle, the decoding drop associated with the removal of shared audio-visual information occurred at similar times, as in the case of listening to a speaker wearing a real surgical face mask (see Supplementary Material section 3.4, Decoding Model Results of the auditory-only speech feature).

5.6. Link between neural and behavioral data

We explored potential associations between the perceived difficulty in attending speech and the decoding of the sound envelope as a function of the different face mask conditions. Notably, negative correlations between behavioral and neural measures were observed in the virtual mask condition from 200 to 245 ms and in the real mask condition between 170 and 230 ms time lags. Conversely, no significant results were found in the no-mask condition. Results demonstrated a link between the progressive decline of neural tracking depending on the perceptive load induced by the mask conditions and the increase in the perceived difficulty in listening to speech. The association between neural activity and metacognition emerged at time lags at around 200 ms, typically indicative of auditory attention processes (e.g., Ref. [40]). While a positive relationship between speech intelligibility and the tracking of acoustic features is well known (see Ref. [74]), these results are consistent with recent evidence illustrating a connection between neural tracking of sound envelope and perceived difficulty [75].

6. Conclusions

In the current study, we developed a naturalistic paradigm based on continuous speech to investigate how the brain synchronizes with speech signals in detrimental listening conditions (face masks and 5-talker noise). The results allowed us to disentangle the acoustic and visual impacts of face masks on speech processing. These obstacles had impacts at partially distinct processing stages. Findings substantiated the role of audio-visual congruent information in speech processing and showed an increased load on metacognition when audio-visual processing of speech is hindered. These findings highlight the adverse impact of face masks on social communication while introducing objective metrics for the development and validation of new mask materials. Ultimately, our results provide evidence that could be useful for alleviating perceptual barriers and metacognitive load.

Funding

D.B. and F.P. were supported by a PRIN 2017 grant from the Italian Ministry for University (Prot. 20177894 ZH) and by a COVID-19 grant from the University of Trento. F.P. was also supported by a grant from Velux Stiftung (n.1439).

Data availability

Preprocessed and anonymized EEG data together with the behavioral data will be made available upon request to the corresponding author.

CRediT authorship contribution statement

M. Fantoni: Writing – review & editing, Writing – original draft, Validation, Software, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. A. Federici: Writing – review & editing, Writing – original draft, Visualization, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. I. Camponogara: Writing – review & editing, Methodology, Formal analysis. G. Handjaras: Writing – review & editing, Methodology, Formal analysis. A. Martinelli: Writing – review & editing, Methodology, Data curation. E. Bednaya: Writing – review & editing, Methodology, Formal analysis, Data curation. E. Ricciardi: Writing – review & editing, Funding acquisition. F. Pavani: Writing – review & editing, Writing – original draft, Funding acquisition, Conceptualization. D. Bottari: Writing – review & editing, Writing – original draft, Visualization, Supervision, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization.

Declaration of competing interest

The authors declare no competing financial interests.

Acknowledgments

The authors want to thank Dr. Chiara Valzogher for her contribution to the first draft of the manuscript.

Footnotes

Appendix A

Supplementary data to this article can be found online at https://doi.org/10.1016/j.heliyon.2024.e34860.


References

  • 1.Chandrasekaran C., Trubanova A., Stillittano S., Caplier A., Ghazanfar A.A. The natural statistics of audiovisual speech. PLoS Comput. Biol. 2009;5(7) doi: 10.1371/journal.pcbi.1000436.
  • 2.Lakatos P., Gross J., Thut G. 2019. A New Unifying Account of the Roles of Neuronal Entrainment.
  • 3.Holmes N.P. The principle of inverse effectiveness in multisensory integration: some statistical considerations. Brain Topogr. 2009 doi: 10.1007/s10548-009-0097-2.
  • 4.Park H., Kayser C., Thut G., Gross J. Lip movements entrain the observers' low-frequency brain oscillations to facilitate speech intelligibility. Elife. 2016;5 doi: 10.7554/eLife.14521.
  • 5.Grant K.W., Seitz P.-F. The use of visible speech cues for improving auditory detection of spoken sentences. J. Acoust. Soc. Am. 2000;108(3) doi: 10.1121/1.1288668.
  • 6.Ross L.A., Saint-Amour D., Leavitt V.M., Javitt D.C., Foxe J.J. Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments. Cerebr. Cortex. 2007;17(5) doi: 10.1093/cercor/bhl024.
  • 7.Song J.J., Lee H.J., Kang H., Lee D.S., Chang S.O., Oh S.H. Effects of congruent and incongruent visual cues on speech perception and brain activity in cochlear implant users. Brain Struct. Funct. 2015;220(2) doi: 10.1007/s00429-013-0704-6.
  • 8.Moradi S., Lidestam B., Danielsson H., Ng E.H.N., Rönnberg J. Visual cues contribute differentially to audiovisual perception of consonants and vowels in improving recognition and reducing cognitive demands in listeners with hearing impairment using hearing aids. J. Speech Lang. Hear. Res. 2017;60(9) doi: 10.1044/2016_JSLHR-H-16-0160.
  • 9.Blackburn C.L., Kitterick P.T., Jones G., Sumner C.J., Stacey P.C. Visual speech benefit in clear and degraded speech depends on the auditory intelligibility of the talker and the number of background talkers. Trends Hear. 2019;23 doi: 10.1177/2331216519837866.
  • 10.Giovanelli E., Valzolgher C., Gessa E., Todeschini M., Pavani F. Unmasking the difficulty of listening to talkers with masks: lessons from the COVID-19 pandemic. Iperception. 2021;12(2) doi: 10.1177/2041669521998393.
  • 11.Corey R.M., Jones U., Singer A.C. Acoustic effects of medical, cloth, and transparent face masks on speech signals. J. Acoust. Soc. Am. 2020;148(4) doi: 10.1121/10.0002279.
  • 12.Haider C.L., Suess N., Hauswald A., Park H., Weisz N. Masking of the mouth area impairs reconstruction of acoustic speech features and higher-level segmentational features in the presence of a distractor speaker. Neuroimage. 2022;252 doi: 10.1016/j.neuroimage.2022.119044.
  • 13.Brown V.A., Van Engen K.J., Peelle J.E. Face mask type affects audiovisual speech intelligibility and subjective listening effort in young and older adults. Cogn Res Princ Implic. 2021;6(1) doi: 10.1186/s41235-021-00314-0.
  • 14.Toscano J.C., Toscano C.M. Effects of face masks on speech recognition in multi-talker babble noise. PLoS One. 2021;16(2 February) doi: 10.1371/journal.pone.0246842.
  • 15.Giovanelli E., et al. The effect of face masks on sign language comprehension: performance and metacognitive dimensions. Conscious. Cognit. 2023;109 doi: 10.1016/j.concog.2023.103490.
  • 16.Haider C.L., Park H., Hauswald A., Weisz N. Neural speech tracking highlights the importance of visual speech in multi-speaker situations. J. Cognit. Neurosci. Jan. 2024;36(1):128–142. doi: 10.1162/jocn_a_02059.
  • 17.Zion Golumbic E., Cogan G.B., Schroeder C.E., Poeppel D. Visual input enhances selective speech envelope tracking in auditory cortex at a ‘Cocktail Party’. J. Neurosci. 2013;33(4) doi: 10.1523/JNEUROSCI.3675-12.2013.
  • 18.Crosse M.J., Butler J.S., Lalor E.C. Congruent visual speech enhances cortical entrainment to continuous auditory speech in noise-free conditions. J. Neurosci. 2015;35(42) doi: 10.1523/JNEUROSCI.1829-15.2015.
  • 19.Ahmed F., Nidiffer A.R., O'Sullivan A.E., Zuk N.J., Lalor E.C. The integration of continuous audio and visual speech in a cocktail-party environment depends on attention. Neuroimage. 2023;274(Jul) doi: 10.1016/j.neuroimage.2023.120143.
  • 20.Crosse M.J., Di Liberto G.M., Bednar A., Lalor E.C. The multivariate temporal response function (mTRF) toolbox: a MATLAB toolbox for relating neural signals to continuous stimuli. Front. Hum. Neurosci. 2016;10 doi: 10.3389/fnhum.2016.00604.
  • 21.Crosse M.J., Zuk N.J., Di Liberto G.M., Nidiffer A.R., Molholm S., Lalor E.C. 2021. Linear Modeling of Neurophysiological Responses to Speech and Other Continuous Stimuli: Methodological Considerations for Applied Research.
  • 22.Calvino I. 1956. Fiabe Italiane. Einaudi.
  • 23.Camilleri A. 2018. I Tacchini non Ringraziano. Salani.
  • 24.Ammaniti N. 2012. Il Momento È Delicato.
  • 25.Brungart D.S. Informational and energetic masking effects in the perception of two simultaneous talkers. J. Acoust. Soc. Am. 2001;109:1101–1109. doi: 10.1121/1.1345696.
  • 26.Wang X., Xu L. Speech perception in noise: masking and unmasking. J. Otolaryngol. Apr. 2021;16(2):109–119. doi: 10.1016/j.joto.2020.12.001.
  • 27.McCarthy C. 2014. La Strada.
  • 28.Delorme A., Makeig S. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods. 2004;134(1) doi: 10.1016/j.jneumeth.2003.10.009.
  • 29.Stropahl M., Bauer A.K.R., Debener S., Bleichner M.G. Source-modeling auditory processes of EEG data using EEGLAB and brainstorm. Front. Neurosci. 2018;12(MAY) doi: 10.3389/fnins.2018.00309.
  • 30.Bottari D., et al. EEG frequency-tagging demonstrates increased left hemispheric involvement and crossmodal plasticity for face processing in congenitally deaf signers. Neuroimage. 2020;223 doi: 10.1016/j.neuroimage.2020.117315.
  • 31.Delorme A., Sejnowski T., Makeig S. Enhanced detection of artifacts in EEG data using higher-order statistics and independent component analysis. Neuroimage. 2007;34(4) doi: 10.1016/j.neuroimage.2006.11.004.
  • 32.Bell A.J., Sejnowski T.J. An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 1995;7(6) doi: 10.1162/neco.1995.7.6.1129.
  • 33.Lee T.W., Girolami M., Sejnowski T.J. Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural Comput. Feb. 1999;11(2):417–441. doi: 10.1162/089976699300016719.
  • 34.Jung T.P., Makeig S., Westerfield M., Townsend J., Courchesne E., Sejnowski T.J. Removal of eye activity artifacts from visual event-related potentials in normal and clinical subjects. Clin. Neurophysiol. 2000;111(10) doi: 10.1016/S1388-2457(00)00386-2.
  • 35.Jung T.P., et al. Removing electroencephalographic artifacts by blind source separation. Psychophysiology. 2000;37(2) doi: 10.1017/S0048577200980259.
  • 36.Campos Viola F., Thorne J., Edmonds B., Schneider T., Eichele T., Debener S. Semi-automatic identification of independent components representing EEG artifact. Clin. Neurophysiol. 2009;120(5) doi: 10.1016/j.clinph.2009.01.015.
  • 37.Bednaya E., et al. Early visual cortex tracks speech envelope in the absence of visual input. bioRxiv. 2022
  • 38.Legendre G., Andrillon T., Koroma M., Kouider S. 2019. Sleepers Track Informative Speech in a Multitalker Environment.
  • 39.Mirkovic B., Debener S., Jaeger M., De Vos M. Decoding the attended speech stream with multi-channel EEG: implications for online, daily-life applications. J. Neural. Eng. 2015;12(4) doi: 10.1088/1741-2560/12/4/046007.
  • 40.O'Sullivan J.A., et al. Attentional selection in a cocktail party environment can be decoded from single-trial EEG. Cerebr. Cortex. 2015;25(7) doi: 10.1093/cercor/bht355.
  • 41.Mathis A., et al. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. Sep. 2018;21(9):1281–1289. doi: 10.1038/s41593-018-0209-y.
  • 42.Kortelainen J., Vayrynen E. Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society. EMBS; 2015. Assessing EEG slow wave activity during anesthesia using Hilbert-Huang Transform.
  • 43.Holdgraf C.R., Rieger J.W., Micheli C., Martin S., Knight R.T., Theunissen F.E. 2017. Encoding and Decoding Models in Cognitive Electrophysiology.
  • 44.de Lange F.P., Heilbron M., Kok P. 2018. How Do Expectations Shape Perception?
  • 45.Heilbron M., Armeni K., Schoffelen J.M., Hagoort P., De Lange F.P. A hierarchy of linguistic predictions during natural language comprehension. Proc. Natl. Acad. Sci. U. S. A. 2022;119(32) doi: 10.1073/pnas.2201968119.
  • 46.Haufe S., et al. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage. 2014;87 doi: 10.1016/j.neuroimage.2013.10.067.
  • 47.Cousineau D., Goulet-Pelletier J.C. A study of confidence intervals for Cohen's d in within-subject designs with new proposals. The Quantitative Methods for Psychology. Mar. 2021;17(1):51–75.
  • 48.Cohen J. Statistical Power Analysis for the Behavioral Sciences. second ed.
  • 49.Benjamini Y., Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B. 1995;57(1) doi: 10.1111/j.2517-6161.1995.tb02031.x.
  • 50.Nishimoto S., Vu A.T., Naselaris T., Benjamini Y., Yu B., Gallant J.L. Reconstructing visual experiences from brain activity evoked by natural movies. Curr. Biol. Oct. 2011;21(19):1641–1646. doi: 10.1016/j.cub.2011.08.031.
  • 51.Choi J.H., Choi H.J., Kim D.H., Park J.H., An Y.H., Shim H.J. Effect of face masks on speech perception in noise of individuals with hearing aids. Front. Neurosci. 2022;16 doi: 10.3389/fnins.2022.1036767.
  • 52.Gama R., Castro M.E., van Lith-Bijl J.T., Desuter G. Springer Science and Business Media Deutschland GmbH; Apr. 01, 2022. Does the Wearing of Masks Change Voice and Speech Parameters?
  • 53.Latoszek B.B.v., Jansen V., Watts C.R., Hetjens S. The impact of protective face coverings on acoustic markers in voice: a systematic review and meta-analysis. J. Clin. Med. Sep. 2023;12(18) doi: 10.3390/jcm12185922.
  • 54.Crosse M.J., Di Liberto G.M., Lalor E.C. Eye can hear clearly now: inverse effectiveness in natural audiovisual speech processing relies on long-term crossmodal temporal integration. J. Neurosci. Sep. 2016;36(38):9888–9895. doi: 10.1523/JNEUROSCI.1396-16.2016.
  • 55.Giordano B.L., Ince R.A.A., Gross J., Schyns P.G., Panzeri S., Kayser C. 2017. Contributions of Local Speech Encoding and Functional Connectivity to Audio-Visual Speech Perception.
  • 56.Holler J., Levinson S.C. 2019. Multimodal Language Processing in Human Communication.
  • 57.Sumby W.H., Pollack I. Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am. Mar. 1954;26(2):212–215. doi: 10.1121/1.1907309.
  • 58.Moradi S., Lidestam B., Rönnberg J. Gated audiovisual speech identification in silence vs. noise: effects on time and accuracy. Front. Psychol. 2013;4 doi: 10.3389/fpsyg.2013.00359.
  • 59.Puschmann S., et al. Hearing-impaired listeners show increased audiovisual benefit when listening to speech in noise. Neuroimage. 2019;196 doi: 10.1016/j.neuroimage.2019.04.017.
  • 60.Rahne T., Fröhlich L., Plontke S., Wagner L. Influence of surgical and N95 face masks on speech perception and listening effort in noise. PLoS One. Jul. 2021;16(7 July) doi: 10.1371/journal.pone.0253874.
  • 61.Thibodeau L.M., Thibodeau-Nielsen R.B., Tran C.M.Q., Jacob R.T. de S. Communicating during COVID-19: the effect of transparent masks for speech recognition in noise. Ear Hear. Jul. 2021;42(4):772–781. doi: 10.1097/AUD.0000000000001065.
  • 62.Hauswald A., Lithari C., Collignon O., Leonardelli E., Weisz N. A visual cortical network for deriving phonological information from intelligible lip movements. Curr. Biol. May 2018;28(9):1453–1459.e3. doi: 10.1016/j.cub.2018.03.044.
  • 63.Bourguignon M., Baart M., Kapnoula E.C., Molinaro N. Lip-reading enables the brain to synthesize auditory features of unknown silent speech. J. Neurosci. Jan. 2020;40(5):1053–1065. doi: 10.1523/JNEUROSCI.1101-19.2019.
  • 64.Tan S.H.J., Kalashnikova M., Di Liberto G.M., Crosse M.J., Burnham D. Seeing a talking face matters: the relationship between cortical tracking of continuous auditory‐visual speech and gaze behaviour in infants, children and adults. Neuroimage. 2022;256(Aug) doi: 10.1016/j.neuroimage.2022.119217.
  • 65.Power A.J., Mead N., Barnes L., Goswami U. Neural entrainment to rhythmically presented auditory, visual, and audio-visual speech in children. Front. Psychol. 2012;3 doi: 10.3389/fpsyg.2012.00216.
  • 66.Peelle J.E., Sommers M.S. 2015. Prediction and Constraint in Audiovisual Speech Perception.
  • 67.Drijvers L., Holler J. The multimodal facilitation effect in human communication. Psychon. Bull. Rev. Apr. 2023;30(2):792–801. doi: 10.3758/s13423-022-02178-x.
  • 68.Mesgarani N., Chang E.F. May 09, 2012. Selective Cortical Representation of Attended Speaker in Multi-Talker Speech Perception.
  • 69.Kim S., Emory C., Choi I. Neurofeedback training of auditory selective attention enhances speech-in-noise perception. Front. Hum. Neurosci. 2021;15(Jun) doi: 10.3389/fnhum.2021.676992.
  • 70.Homans N.C., Vroegop J.L. The impact of face masks on the communication of adults with hearing loss during COVID-19 in a clinical setting. Int. J. Audiol. 2022;61(5) doi: 10.1080/14992027.2021.1952490.
  • 71.Moon I.J., et al. How does a face mask impact speech perception? Healthcare (Switzerland) 2022;10(9) doi: 10.3390/healthcare10091709.
  • 72.Rimmele J.M., Zion Golumbic E., Schröger E., Poeppel D. The effects of selective attention and speech acoustics on neural speech-tracking in a multi-talker scene. Cortex. Jul. 2015;68:144–154. doi: 10.1016/j.cortex.2014.12.014.
  • 73.Smiljanic R., Keerstock S., Meemann K., Ransom S.M. Face masks and speaking style affect audio-visual word recognition and memory of native and non-native speech. J. Acoust. Soc. Am. Jun. 2021;149(6):4013–4023. doi: 10.1121/10.0005191.
  • 74.Ding N., Simon J.Z. Frontiers Media S. A; May 28, 2014. Cortical Entrainment to Continuous Speech: Functional Roles and Interpretations.
  • 75.Reisinger P., et al. Neural Speech Tracking Benefit of Lip Movements Predicts Behavioral Deterioration when the Speaker's Mouth Is Occluded. doi: 10.1101/2023.04.17.536524.

Data Availability Statement

Preprocessed and anonymized EEG data together with the behavioral data will be made available upon request to the corresponding author.


Articles from Heliyon are provided here courtesy of Elsevier
