Perceptual uncertainty explains activation differences between audiovisual congruent speech and McGurk stimuli

Chenjie Dong; Uta Noppeney; Suiping Wang

doi:10.1002/hbm.26653

Hum Brain Mapp. 2024 Mar 15;45(4):e26653. doi: 10.1002/hbm.26653

PMCID: PMC10964917 PMID: 38488460

Abstract

Face‐to‐face communication relies on the integration of acoustic speech signals with the corresponding facial articulations. In the McGurk illusion, an auditory /ba/ phoneme presented simultaneously with a facial articulation of a /ga/ (i.e., viseme), is typically fused into an illusory ‘da’ percept. Despite its widespread use as an index of audiovisual speech integration, critics argue that it arises from perceptual processes that differ categorically from natural speech recognition. Conversely, Bayesian theoretical frameworks suggest that both the illusory McGurk and the veridical audiovisual congruent speech percepts result from probabilistic inference based on noisy sensory signals. According to these models, the inter‐sensory conflict in McGurk stimuli may only increase observers' perceptual uncertainty. This functional magnetic resonance imaging (fMRI) study presented participants (20 male and 24 female) with audiovisual congruent, McGurk (i.e., auditory /ba/ + visual /ga/), and incongruent (i.e., auditory /ga/ + visual /ba/) stimuli along with their unisensory counterparts in a syllable categorization task. Behaviorally, observers' response entropy was greater for McGurk compared to congruent audiovisual stimuli. At the neural level, McGurk stimuli increased activations in a widespread neural system, extending from the inferior frontal sulci (IFS) to the pre‐supplementary motor area (pre‐SMA) and insulae, typically involved in cognitive control processes. Crucially, in line with Bayesian theories these activation increases were fully accounted for by observers' perceptual uncertainty as measured by their response entropy. Our findings suggest that McGurk and congruent speech processing rely on shared neural mechanisms, thereby supporting the McGurk illusion as a valid measure of natural audiovisual speech perception.

Keywords: Bayesian causal inference, fMRI, McGurk illusion, perceptual uncertainty

This study investigated the neurocognitive mechanisms underlying the McGurk illusion. We found that McGurk stimuli increased activations in a widespread neural system, and the activation level of this system varied across unisensory and audiovisual congruent stimuli. The activation differences between McGurk and congruent stimuli can be attributed to perceptual uncertainty.

Practitioner points.

Both McGurk illusion and natural audiovisual congruent speech perception result from inference based on noisy audiovisual signals and rely on shared neural mechanisms.
McGurk stimuli increase activations in a widespread neural system that is typically involved in cognitive control processes, while the level of activation in this system also varies across unisensory and audiovisual congruent syllables.
The activation differences between McGurk and audiovisual congruent syllables can be directly attributed to observers' degree of perceptual uncertainty.

1. INTRODUCTION

Effective face‐to‐face communication relies on the integration of auditory speech with the corresponding facial articulations. In laboratory settings, the McGurk illusion is often employed to study audiovisual integration of speech (McGurk & MacDonald, 1976). This illusion arises when a discrepancy is introduced between the visual and auditory speech signals. For instance, when an auditory /ba/ phoneme is presented simultaneously with a facial articulation of a /ga/ (i.e., viseme), observers frequently perceive an illusory auditory ‘da’ percept.

Despite its widespread use, the relevance of the McGurk illusion for understanding the mechanisms of natural audiovisual speech comprehension has recently been questioned (Getz & Toscano, 2021; Van Engen et al., 2022). Critics argue that McGurk stimuli categorically differ from natural speech stimuli because they introduce a conflict between auditory and visual signals (Erickson et al., 2014; Getz & Toscano, 2021; van Engen et al., 2022)—thereby invoking additional conflict monitoring and cognitive control processes. In line with this view, neuroimaging studies showed that McGurk stimuli activated not only posterior superior temporal gyri and sulci (pSTG/S) (Baum et al., 2012; Benoit et al., 2010; Bernstein et al., 2008; Luttke et al., 2016; Szycik et al., 2012), areas traditionally associated with audiovisual speech processing, but also anterior cingulate cortex (ACC) /pre‐supplementary motor area (pre‐SMA) (Moris Fernandez et al., 2017; Moris Fernandez et al., 2018; Murakami et al., 2018), and inferior frontal gyri/sulci (IFG/S) (Gau & Noppeney, 2016; Hasson et al., 2007; Murakami et al., 2018; Skipper et al., 2007; Tse et al., 2015), that is, regions implicated in conflict monitoring and cognitive control.

In contrast, Bayesian theoretical frameworks propose that veridical percepts for audiovisual congruent stimuli and illusory percepts for McGurk stimuli emerge from common computational mechanisms (Noppeney, 2021; Noppeney & Lee, 2018; Shams & Beierholm, 2022). In both instances, observers need to infer the phoneme‐viseme pair and its underlying causal structure from audiovisual signals that are corrupted by noise (Magnotti & Beauchamp, 2015, 2017). Various internal and external noise sources may thus introduce discrepancies between audiovisual inputs on congruent trials, while on McGurk trials they may eliminate discrepancies. Thus, the noisy inputs for McGurk and congruent stimuli may even be identical on a particular trial, even though the true underlying audiovisual phoneme‐viseme pairs differ. Indeed, recent research shows that when observers perceive an illusory ‘da’ percept on McGurk trials, they often incorrectly infer a congruent relationship between these physically conflicting audiovisual signals. Conversely, they can incorrectly infer incongruent relationships between a phoneme and a viseme, when they miscategorized the phoneme on congruent trials (Meijer & Noppeney, 2023).

These findings highlight the inherent uncertainty observers face when making perceptual decisions about phonemes, visemes and their causal relationship on both congruent and McGurk trials. For both congruent and McGurk stimuli, perceptual inference relies on the integration of noisy multisensory information with prior knowledge. Yet, despite the shared computational mechanisms, the perceptual decisions on McGurk trials may on average be associated with greater uncertainty compared to trials on which audiovisual information is congruent (Kimmet et al., 2023; Meijer & Noppeney, 2023). Based on this theoretical framework we hypothesized that the increased SMA and IFS activations during McGurk trials may be explained by observers' greater perceptual uncertainty when presented with conflicting than congruent audiovisual signals.

To test this hypothesis, this fMRI study presented observers with audiovisual congruent (AVc), incongruent (AVi, auditory /ga/ + visual /ba/), and McGurk (AVm, auditory /ba/ + visual /ga/) phoneme‐viseme pairs along with their unisensory counterparts in a syllable categorization task. First, we assessed activation differences for McGurk stimuli compared to congruent and incongruent phoneme‐viseme pairs. Second, we examined whether blood oxygen level‐dependent (BOLD) responses varied for different syllables within each sensory context (i.e., auditory, visual, audiovisual congruent). Third, we investigated whether variations in observers' perceptual uncertainty, as reflected in the entropy of their response distributions, can account for the activation differences between McGurk and congruent/incongruent pairs as well as across different syllables.

2. MATERIALS AND METHODS

2.1. Participants

Forty‐four healthy participants (24 women, mean age = 20.86, SD = 2.27) were recruited from South China Normal University. Two participants (2 women) were excluded from the data analysis as they were unable to follow the experimental instruction. All participants had normal or corrected‐to‐normal vision, no history of neurological or psychiatric disorders, and provided informed written consent before the experiment. The protocol for this study was approved by the Human Research Ethics Committee of the School of Psychology at South China Normal University.

2.2. Materials

Short clips recorded from a male actor were presented to the participants. The clips included three auditory stimuli (A: A_/ba/, A_/da/, A_/ga/), three visual stimuli (V: V_/ba/, V_/da/, V_/ga/) and three audiovisual congruent stimuli (AVc: AVc_/ba/, AVc_/da/, AVc_/ga/). In addition, we included two classes of audiovisual incongruent stimuli. In the McGurk fusion stimulus (henceforth McGurk stimulus, AVm = A_/ba/ + V_/ga/), an auditory /ba/ phoneme paired with the facial articulation of a /ga/ is typically fused into a ‘da’ percept. In the incongruent combination stimulus (henceforth incongruent stimulus, AVi = A_/ga/ + V_/ba/), an auditory /ga/ phoneme is paired with the facial articulation of a /ba/. These incongruent phoneme‐viseme pairs are typically not fused into a ‘da’ percept, but perceived as combination illusions (e.g., ‘bga’) or simply as the auditory stimulus component (i.e., ‘ga’). The latter percept (i.e., ‘ga’) is observed particularly in Chinese speakers (Sekiyama, 1997; Sekiyama & Tohkura, 1991).

The auditory stimuli were generated by removing the video of the audiovisual congruent stimuli; the visual stimuli were generated by removing the audio of the audiovisual congruent stimuli; the AVm stimuli were generated by dubbing an auditory /ba/ over the video of facial articulations of /ga/; the AVi stimuli were generated by dubbing an auditory /ga/ over the video of facial articulations of /ba/ in Adobe Premiere.

The video clips had a duration of 2 s, were recorded at a frame rate of 25 frames per second and had a resolution of 640 × 480 pixels. The audio stimuli were recorded at a sampling rate of 48 kHz and were presented to the participants at an approximate sound level of 70 dB using MRI‐compatible earphones. To verify the intelligibility of the stimuli inside the MRI scanner, a subset of participants (n = 28) rated the intelligibility of the stimuli on a 6‐point scale (1 = very unclear, 3 = clear, 6 = very clear) after scanning. The mean intelligibility of auditory stimuli was 5.15 (SD = 0.87), and the mean intelligibility of visual stimuli was 5.58 (SD = 0.61).

2.3. Experimental design and procedures

In the syllable categorization task, participants were presented with auditory, visual, audiovisual congruent, incongruent, or McGurk stimuli and indicated the syllable they heard (or saw on visual trials) by pressing one of four buttons (‘ba’, ‘da’, ‘ga’, and ‘others’) with the index and middle fingers of their left and right hands (Figure 1). The response fingers were counterbalanced both within and between participants. Each trial started with a fixation cross (0.5 s), followed by the A, V or AV stimulus (2 s), a blank screen (1.5 s), the four‐alternative forced choice (4AFC) screen (2 s) and an inter‐trial interval randomly sampled from 2, 4, or 6 s. A baseline condition (a white fixation cross at the center of the screen, 14 s) was presented at the beginning and end of each run. The A and V stimuli were presented in separate unisensory runs in counterbalanced order; the AV stimuli were presented in a randomized order in each audiovisual run. Each unisensory run included 30 trials (i.e., 10 repetitions per syllable, run duration = 8 min); each of the four audiovisual runs included 30 congruent trials (i.e., 10 repetitions per syllable), 20 McGurk trials, and 10 incongruent trials (run duration = 12 min).

FIGURE 1. Experimental procedure and stimuli. (a) Experimental design. Participants were presented with auditory, visual, and audiovisual syllables. (b) Example trial. Participants were presented with an audiovisual movie (2 s) followed by a blank and response period. In a four‐alternative forced choice task they categorized the syllable as ‘ba’, ‘da’, ‘ga’, or ‘other’. (c) Set of unisensory and audiovisual stimuli. A_/ba/, auditory /ba/; A_/da/, auditory /da/; A_/ga/, auditory /ga/; V_/ba/, visual /ba/; V_/da/, visual /da/; V_/ga/, visual /ga/; AVc_/ba/, audiovisual congruent /ba/; AVc_/da/, audiovisual congruent /da/; AVc_/ga/, audiovisual congruent /ga/; AVm, McGurk (i.e., auditory /ba/ with visual /ga/); AVi, audiovisual incongruent (i.e., auditory /ga/ with visual /ba/).

2.4. Behavioral data analyses

Behavioral data were analyzed using the JASP toolbox (https://jasp-stats.org) based on R. For each stimulus, we calculated the syllable categorization accuracy and Shannon entropy over the response distribution.
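The entropy formula itself is not reproduced in this text. For reference, the standard definition of Shannon entropy over the four response options can be written as follows. The base‐2 logarithm (entropy in bits) is an assumption; the text does not state the logarithm base.

```latex
H = -\sum_{i=1}^{4} p_i \log_2 p_i , \qquad \sum_{i=1}^{4} p_i = 1
```

Here p_i denotes the proportion of trials on which response option i (‘ba’, ‘da’, ‘ga’, or ‘other’) was chosen for a given stimulus, with the usual convention that terms with p_i = 0 contribute zero. Under this definition, H = 0 when an observer always gives the same response and H = 2 bits when all four options are chosen equally often.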

The Shannon entropy is maximal when observers randomly choose each of the four response options with equal probability (i.e., 25% for ‘ba’, ‘da’, ‘ga’, and ‘other’). While entropy is computed over trials, it is related to participants' uncertainty on a particular trial. Hence, in this study we use response entropy as an index for observers' perceptual uncertainty associated with different types of stimuli.
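As an illustration of this measure, the following minimal Python sketch computes the entropy of an empirical response distribution for one stimulus. The function name, the base‐2 logarithm, and the example response lists are assumptions made for illustration only; they are not taken from the study's analysis code or data.

```python
import math
from collections import Counter

# The four response options of the categorization task.
OPTIONS = ("ba", "da", "ga", "other")


def response_entropy(responses, options=OPTIONS):
    """Shannon entropy (in bits) of the empirical response distribution."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    n = len(responses)
    entropy = 0.0
    for option in options:
        p = counts.get(option, 0) / n
        if p > 0:  # convention: a term with p = 0 contributes nothing
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical response lists (not real data).
uniform = ["ba", "da", "ga", "other"] * 5   # every option chosen equally often
consistent = ["da"] * 20                    # always the same response

print(response_entropy(uniform))     # 2.0 (maximal for four options)
print(response_entropy(consistent))  # 0.0 (no uncertainty)
```

Because the entropy is computed per stimulus over all of its trials, every trial of a given stimulus shares the same value.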

Categorization accuracy: To assess whether observers benefitted from audiovisual integration we entered categorization accuracy into a 3 (modality: A, V, and AVc) × 3 (syllable: /ba/, /da/, and /ga/) repeated measures ANOVA.

Entropy: We examined whether the audiovisual congruent stimuli reduced the perceptual uncertainty compared to the unisensory stimuli using a 3 (modality: A, V, and AVc) × 3 (syllable: /ba/, /da/, and /ga/) repeated measures ANOVA on entropy.

Exploring intersubject variability: Capitalizing on the substantial inter‐subject variability we investigated whether participants' McGurk illusion rate (on auditory /ba/ + visual /ga/ stimuli) could be predicted by (i) their percentage of veridical ‘da’ percepts for AVc_/da/ stimuli, (ii) their response entropy, and (iii) their percentage of ‘da’ responses on the unisensory auditory /ba/ and visual /ga/ stimuli in three linear regression models (each included a constant term).
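A single‐predictor version of such a regression (as in model i) can be sketched as below. This is a plain least‐squares fit written for illustration, not the JASP procedure used in the study; the function name and the toy values are hypothetical.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, r_squared, f_statistic), where the F statistic
    has 1 and n - 2 degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - ss_res / syy
    ss_reg = syy - ss_res
    f_statistic = float("inf") if ss_res == 0 else ss_reg / (ss_res / (n - 2))
    return intercept, slope, r_squared, f_statistic


# Hypothetical per-participant values (not real data):
# proportion of 'da' responses on AVc_/da/ and McGurk illusion rate.
da_rate_congruent = [0.2, 0.4, 0.5, 0.7, 0.9, 1.0]
illusion_rate = [0.1, 0.3, 0.2, 0.6, 0.8, 0.9]

intercept, slope, r2, f_stat = simple_ols(da_rate_congruent, illusion_rate)
print(f"slope={slope:.2f}, R2={r2:.2f}, F(1,{len(illusion_rate) - 2})={f_stat:.2f}")
```

Model (iii) has two predictors and would need a multiple‐regression fit, which this sketch does not cover.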

2.5. MRI data acquisition

MRI data were collected at the Brain Imaging Center at South China Normal University on a Siemens 3T Prisma fit scanner with a 20‐channel head coil. High‐resolution T1‐weighted anatomical images were collected using a multi‐echo MPRAGE pulse sequence (repetition time [TR] = 2.53 s, echo time [TE] = 1.94 ms, flip angle = 7°, field of view [FOV] = 256 mm, matrix = 256 × 256, slice thickness = 0.5 mm, number of slices = 176). Functional data were collected using a T2*‐weighted echo planar imaging (EPI) pulse sequence sensitive to BOLD contrast (TR = 2 s, TE = 30 ms, flip angle = 90°, FOV = 192 × 192 mm, matrix = 64 × 64, slice thickness = 2 mm, number of slices = 62).

2.6. MRI data analysis

2.6.1. fMRI data preprocessing

Data were analyzed using statistical parametric mapping (SPM12, Wellcome Department of Imaging Neuroscience, University College London; http://www.fil.ion.ucl.ac.uk/spm) (Friston et al., 1994) running on MATLAB 2021b. Scans from each participant were realigned using the first as a reference, spatially normalized into MNI standard space using parameters from segmentation of the T1 structural image (Ashburner & Friston, 2005), resampled to 2 × 2 × 2 mm³ voxels, and spatially smoothed with a Gaussian kernel of 6 mm FWHM. The time‐series in each voxel was high‐pass filtered with a cutoff of 1/128 Hz.
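To make the smoothing and filtering parameters concrete, the short sketch below converts the 6 mm FWHM into the standard deviation of the Gaussian kernel and expresses the high‐pass cutoff as a period. The conversion formula is the standard one for a Gaussian; the variable names are illustrative and nothing here reproduces SPM's internal code.

```python
import math


def fwhm_to_sigma(fwhm):
    """Convert full width at half maximum to the Gaussian standard deviation."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


FWHM_MM = 6.0               # smoothing kernel reported above
VOXEL_MM = 2.0              # resampled isotropic voxel size
HIGH_PASS_HZ = 1.0 / 128.0  # high-pass cutoff frequency

sigma_mm = fwhm_to_sigma(FWHM_MM)
print(round(sigma_mm, 3))             # 2.548 (mm)
print(round(sigma_mm / VOXEL_MM, 3))  # 1.274 (voxels)
print(1.0 / HIGH_PASS_HZ)             # 128.0 (s): slower drifts are removed
```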

2.6.2. General linear model analysis

In all general linear models, data were modeled in an event‐related fashion with regressors entering the design matrix after convolving each event‐related boxcar (representing a single trial, duration of trial = 2 s) with a canonical hemodynamic response function (HRF). Realignment parameters were included as nuisance covariates to account for residual motion artifacts.
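The construction of such a regressor can be sketched as follows. This is a simplified stand‐alone illustration, not SPM's implementation: the double‐gamma shape (gamma shape parameters 6 and 16, undershoot ratio 1/6) only resembles SPM's default canonical HRF, and the 0.1 s time grid, the function names, and the example onsets are assumptions.

```python
import math

TR = 2.0        # s, repetition time of the functional scans
DT = 0.1        # s, fine time grid for the convolution (assumed)
STIM_DUR = 2.0  # s, boxcar duration of a single trial
HRF_LEN = 32.0  # s, length of the modeled response (assumed)


def gamma_pdf(t, shape, scale=1.0):
    """Gamma probability density; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * math.exp(-t / scale) / (math.gamma(shape) * scale ** shape)


def double_gamma_hrf(t):
    """Approximate canonical HRF: positive response minus a late undershoot."""
    return gamma_pdf(t, 6.0) - gamma_pdf(t, 16.0) / 6.0


def build_regressor(onsets, n_scans):
    """Convolve trial boxcars with the HRF and sample once per scan."""
    n_fine = int(round(n_scans * TR / DT))
    boxcar = [0.0] * n_fine
    for onset in onsets:
        start = int(round(onset / DT))
        stop = int(round((onset + STIM_DUR) / DT))
        for i in range(start, min(stop, n_fine)):
            boxcar[i] = 1.0
    hrf = [double_gamma_hrf(k * DT) for k in range(int(round(HRF_LEN / DT)))]
    convolved = [0.0] * n_fine
    for i, b in enumerate(boxcar):
        if b == 0.0:
            continue
        for k, h in enumerate(hrf):
            if i + k < n_fine:
                convolved[i + k] += b * h * DT
    step = int(round(TR / DT))
    return convolved[::step]  # one value per scan, sampled at scan onset


# Hypothetical onsets (s) of three trials of one condition.
regressor = build_regressor(onsets=[0.0, 12.0, 26.0], n_scans=30)
print([round(v, 3) for v in regressor[:8]])
```

Each condition contributes one such column to the design matrix; the realignment parameters enter as additional, unconvolved columns.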

Overall, we generated five first level general linear models: GLM‐1A investigated the effects of sensory context by comparing A, V, AVc, AVm, and AVi (pooling over syllable categories). GLM‐2A assessed the effect of syllable categories separately for A, V, and AV sensory modalities. GLM‐1B and GLM‐2B investigated whether activation differences across sensory contexts or syllables can be explained away by differences in perceptual uncertainty as measured by Shannon entropy for each stimulus. GLM‐3 assessed the overall effect of entropy (irrespective of sensory modality or syllable).

GLM‐1A included six regressors (fixation baseline, A, V, AVc, AVm, and AVi). At the first (i.e., within subject) level, we computed the following contrasts. (1) A > baseline, V > baseline, (2) AVm > AVc, AVm > AVi, and (3) AVm < AVc, AVm < AVi. The contrast images were entered into one sample t‐tests at the 2nd, that is, group level.
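The contrasts listed above can be expressed as weight vectors over the six condition regressors. The sketch below assumes the column order given in the text and omits the nuisance (realignment) columns and run constants, which would receive zero weights; the helper name is hypothetical.

```python
# Assumed column order of the condition regressors in GLM-1A.
REGRESSORS = ["baseline", "A", "V", "AVc", "AVm", "AVi"]


def contrast(positive, negative):
    """Return weights with +1 for `positive` and -1 for `negative`."""
    weights = [0] * len(REGRESSORS)
    weights[REGRESSORS.index(positive)] = 1
    weights[REGRESSORS.index(negative)] = -1
    return weights


contrasts = {
    "A > baseline": contrast("A", "baseline"),
    "V > baseline": contrast("V", "baseline"),
    "AVm > AVc": contrast("AVm", "AVc"),
    "AVm > AVi": contrast("AVm", "AVi"),
    "AVm < AVc": contrast("AVc", "AVm"),
    "AVm < AVi": contrast("AVi", "AVm"),
}

for name, weights in contrasts.items():
    print(f"{name:13s} {weights}")
```

Each weight vector sums to zero, so a contrast tests a difference between two conditions rather than overall activation.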

GLM‐2A included 12 regressors (fixation baseline, A_/ba/, A_/da/, A_/ga/, V_/ba/, V_/da/, V_/ga/, AVc_/ba/, AVc_/da/, AVc_/ga/, AVm and AVi). At the first level, we computed contrasts for each condition relative to baseline. We entered the nine contrast images for three syllables (/ba/, /da/, /ga/) × (A, V, and AVc) into a second level repeated‐measures ANOVA. In this repeated measures ANOVA, we assessed differences across the three syllables by computing F‐tests across syllables separately for the auditory (A_/ba/ vs. A_/da/ vs. A_/ga/), visual (V_/ba/ vs. V_/da/ vs. V_/ga/), and audiovisual congruent (AVc_/ba/ vs. AVc_/da/ vs. AVc_/ga/) stimuli.

GLM‐3 included only two regressors: one regressor modeling the onsets of all stimuli irrespective of sensory modality or syllable category, and a second regressor, a parametric modulator, encoding the Shannon entropy associated with the stimulus presented on each trial. The parameter estimates of the parametric modulator were entered into a second‐level one‐sample t‐test to test for the effect of entropy.

GLM‐1B and GLM‐2B were equivalent to GLM‐1A and 2A, except that they additionally included a single parametric modulator that modeled the stimulus‐specific entropy for each trial (i.e., each trial was assigned the entropy of the stimulus on this trial). We replicated the contrasts and analyses of GLM‐1A and GLM‐2A to investigate whether the effects of sensory context and syllable categories can be explained away by modeling the effect of entropy.
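The trial‐wise assignment of stimulus‐specific entropy can be sketched as follows. Mean‐centering the modulator is common practice for parametric modulation and is assumed here; the text does not state how the modulator was scaled. The entropy values and trial sequence are hypothetical.

```python
def entropy_modulator(trial_stimuli, stimulus_entropy):
    """Assign each trial its stimulus-specific entropy and mean-center it."""
    values = [stimulus_entropy[stimulus] for stimulus in trial_stimuli]
    mean_value = sum(values) / len(values)
    return [value - mean_value for value in values]


# Hypothetical per-stimulus entropies in bits for one participant (not real data).
stimulus_entropy = {"AVc_/ba/": 0.5, "AVc_/ga/": 0.5, "AVm": 1.5}

# Hypothetical order of trials within a run.
trial_stimuli = ["AVc_/ba/", "AVm", "AVm", "AVc_/ga/"]

print(entropy_modulator(trial_stimuli, stimulus_entropy))  # [-0.5, 0.5, 0.5, -0.5]
```

The centered values would scale the height of each trial's boxcar before convolution with the HRF, yielding one additional regressor that captures entropy‐related variance.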

Exploring inter‐subject variability: Given the extensive variability in the McGurk illusion rate we investigated whether the activation level for McGurk stimuli relative to baseline covaries with observers' illusion rate. For this, we generated an additional second level GLM that used observers' illusion rates as regressor (+ constant term) to predict the activation levels for McGurk stimuli relative to baseline (as estimated by GLM‐1A) across participants.

Further, we investigated whether the activation level for McGurk stimuli relative to baseline covaries with observers' entropy on McGurk stimuli. For this, we generated an additional second level GLM that used observers' entropy on McGurk stimuli as regressor (+ constant term) to predict the activation levels for McGurk stimuli relative to baseline (as estimated by GLM‐1A) across participants.

At the second (random effects) level, we report results at p_FWE < 0.05, cluster‐level corrected for the whole brain, using an auxiliary voxel threshold of p < 0.001 uncorrected. For completeness, we also present results using only a voxel threshold of p < 0.001 (uncorrected) in the supplementary materials (Figure S4).

3. RESULTS

3.1. Behavioral results

3.1.1. Response distributions over the four choice options

Figure 2 shows the distribution of participants' responses over the four choice options: ‘ba’, ‘da’, ‘ga’, ‘other’. An auditory /ba/ stimulus is perceived mainly as a ‘ba’ percept, but on a fraction of trials also as a ‘da’ percept. Conversely, an auditory /ga/ stimulus mainly evokes a ‘ga’ percept. Crucially, a ‘da’ percept is thus a possible perceptual interpretation for both auditory /ba/ and visual /ga/ stimuli (Figure 2a). This perceptual uncertainty explains why participants can merge the conflicting McGurk signals (i.e., A_/ba/ + V_/ga/) into an illusory ‘da’ percept: it is the perceptual interpretation that is possible for both the auditory and the visual signal. By contrast, a visual /ba/ is almost exclusively perceived as a ‘ba’ and an auditory /ga/ as a ‘ga’ percept. As a result, participants cannot fuse these signals into a joint perceptual interpretation for both signals (Figure 2b). Hence, on incongruent audiovisual trials (i.e., A_/ga/ + V_/ba/), participants segregate audiovisual signals and report the task‐relevant ‘ga’ percept.
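The logic of a shared interpretation can be illustrated with a deliberately simplified fusion sketch, in which the unisensory response distributions are multiplied and renormalized. This is not the authors' model and not a full Bayesian causal inference model. All probabilities below are made‐up illustrative values, not the distributions shown in Figure 2.

```python
OPTIONS = ("ba", "da", "ga", "other")


def overlap(dist_a, dist_v):
    """Total probability mass shared by the two unisensory distributions."""
    return sum(dist_a[o] * dist_v[o] for o in OPTIONS)


def fuse(dist_a, dist_v):
    """Multiply the unisensory distributions and renormalize."""
    total = overlap(dist_a, dist_v)
    return {o: dist_a[o] * dist_v[o] / total for o in OPTIONS}


# Made-up unisensory response distributions (illustration only).
A_BA = {"ba": 0.70, "da": 0.22, "ga": 0.05, "other": 0.03}
V_GA = {"ba": 0.02, "da": 0.45, "ga": 0.50, "other": 0.03}
A_GA = {"ba": 0.02, "da": 0.06, "ga": 0.90, "other": 0.02}
V_BA = {"ba": 0.94, "da": 0.02, "ga": 0.02, "other": 0.02}

mcgurk = fuse(A_BA, V_GA)
print(max(mcgurk, key=mcgurk.get))  # the interpretation both signals allow
print(overlap(A_BA, V_GA) > overlap(A_GA, V_BA))  # McGurk pair overlaps more
```

With these illustrative numbers, ‘da’ dominates the fused distribution for the McGurk pair, whereas the incongruent pair shares very little probability mass, which is the situation in which segregating the signals becomes more plausible than fusing them.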

FIGURE 2. Behavioral results. (a) Response distribution for auditory /ba/, visual /ga/, and McGurk stimuli. (b) Response distribution for auditory /ga/, visual /ba/, and incongruent stimuli. (c) McGurk illusion rate predicted by the rate of veridical ‘da’ percepts for the audiovisual congruent /da/ stimulus. (d) Response entropy for audiovisual congruent syllables (pooled over /ba/, /da/, and /ga/ stimuli), audiovisual congruent /ba/, audiovisual congruent /da/, audiovisual congruent /ga/, McGurk, and incongruent stimuli.

3.1.2. Response accuracy and illusion susceptibilities

A 3 (modality: A, V, and AVc) × 3 (syllable: /ba/, /da/, and /ga/) repeated measures ANOVA revealed significant main effects of modality (F = 42.25, p < 0.001, η² = 0.09) and syllable (F = 58.13, p < 0.001, η² = 0.31) and a significant interaction effect between modality and syllable (F = 18.45, p < 0.001, η² = 0.11) (Table S1). Overall, participants showed higher accuracy for the audiovisual congruent syllables than for the auditory or visual syllables. Likewise, they exhibited lower accuracy for /da/ syllables than for /ba/ and /ga/ syllables. Post‐hoc analysis showed that the categorization accuracy was higher for the /ba/ syllable under AV congruent (AVc_/ba/, mean = 0.90, SD = 0.12) than unisensory auditory presentation (mean = 0.67, SD = 0.32); the accuracy for the /da/ syllable was higher for the audiovisual congruent (mean = 0.69, SD = 0.26) than the auditory (mean = 0.36, SD = 0.34) and visual modalities (mean = 0.43, SD = 0.27); the accuracy for /ga/ syllable was higher for the AV congruent (mean = 0.98, SD = 0.05) than the visual condition (mean = 0.67, SD = 0.25).

3.1.3. Response entropy for unisensory and audiovisual stimuli

Figure 2d shows the response entropy for congruent stimuli (pooled over categories), for congruent stimuli separately for /ba/, /da/ and /ga/ syllables, and for McGurk and incongruent audiovisual stimuli. While the overall entropy for congruent stimuli was lower than that for McGurk stimuli (t = −5.66, p < 0.001, Cohen's d = −0.87), there was no significant difference in entropy between the McGurk stimulus and the corresponding audiovisual congruent /da/ stimulus (t = −0.79, p = 0.43, Cohen's d = −0.12). This pattern suggests that observers successfully merge incongruent audiovisual McGurk signals into a unified ‘da’ percept whose perceptual uncertainty is comparable to that of their ‘da’ percept on congruent trials. It aligns with previous findings that observers can be equally confident about their percept on McGurk and audiovisual congruent trials (Meijer & Noppeney, 2023).
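The reported effect sizes can be related to the t statistics with a small worked example. The sketch assumes paired‐samples t‐tests across the 42 retained participants (44 recruited minus 2 excluded) and the paired‐samples effect size d = t / √n; the text does not state the test type, but the reported values are consistent with this relation.

```python
import math


def cohens_d_from_paired_t(t_value, n_pairs):
    """Effect size of a paired-samples t-test: d = t / sqrt(n)."""
    return t_value / math.sqrt(n_pairs)


N_PARTICIPANTS = 44 - 2  # recruited minus excluded (assumed to be the test sample)

print(round(cohens_d_from_paired_t(-5.66, N_PARTICIPANTS), 2))  # -0.87
print(round(cohens_d_from_paired_t(-0.79, N_PARTICIPANTS), 2))  # -0.12
```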

Figure S2 shows the response entropy separately for the 3 (syllable: /ba/, /da/, /ga/) × 3 (sensory modality: A, V, AV) conditions, in addition to the McGurk and AV incongruent stimuli. A 3 (modality: A, V, and AVc) × 3 (syllable: /ba/, /da/, and /ga/) repeated measures ANOVA revealed significant main effects of modality (F = 20.19, p < 0.001, η² = 0.08) and syllable (F = 32.88, p < 0.001, η² = 0.15) and a significant interaction effect between modality and syllable (F = 30.35, p < 0.001, η² = 0.20) (Table S2). Follow‐up t‐tests for the simple main effects indicate a significant decrease in entropy for the audiovisual congruent /ba/ relative to the auditory /ba/ syllable (t = −4.36, p_bonf < 0.001, Cohen's d = −0.93), for the AV congruent /da/ relative to the visual /da/ syllable (t = −5.37, p_bonf < 0.001, Cohen's d = −1.15), and for the AV congruent /ga/ relative to the visual /ga/ syllable (t = −8.43, p_bonf < 0.001, Cohen's d = −1.80).

3.1.4. Intersubject variability correlating McGurk illusion rate with response accuracy and entropy

The susceptibility to the McGurk illusion showed large interindividual variability, ranging from 0% to 100% ‘da’ percepts across observers. This substantial inter‐subject variability allowed us to assess whether observers' McGurk illusion rate was predicted by (i) their percentage of veridical ‘da’ percepts for AVc_/da/ stimuli, (ii) their response entropy, and (iii) their percentage of ‘da’ responses on the unisensory auditory /ba/ and visual /ga/ stimuli in three linear regression models.

Participants' McGurk illusion rate could be predicted by their percentage of veridical ‘da’ percepts for AVc_/da/ stimuli (R² = 0.19, F = 8.22, p = 0.007, Figure 2c and Table S3). We also observed a trend for the percentage of ‘da’ responses on the auditory /ba/ and visual /ga/ stimuli (R² = 0.14, F = 2.77, p = 0.076, Table S4). However, we did not observe a significant effect for the response entropy on the auditory /ba/ and visual /ga/ stimuli (R² = 0.004, F = 0.07, p = 0.93, Table S5).

3.2. fMRI results

3.2.1. Auditory and visual activations

Our results confirmed that auditory and visual stimuli increased activations along either the auditory (GLM‐1A, Figure 3a, regions coded in red) or visual (GLM‐1A, Figure 3a, blue) processing pathways that then converged in an extensive frontoparietal neural system shared across sensory modalities (GLM‐1A, Figure 3a, pink).

FIGURE 3. Effects of sensory contexts in GLM‐1A. (a) Increased activations for auditory (A, red) and visual (V, blue) stimuli relative to baseline, and their intersection (pink). (b) Middle: increased activations for McGurk stimuli (AVm) relative to audiovisual congruent (AVc, red) and incongruent (AVi, blue) stimuli, and their intersection (pink). Left and right columns: violin plots showing the distribution over the subject‐specific parameter estimates for AVm, AVc, and AVi relative to baseline at the MNI peak coordinate in left IFS (X = −46, Y = 26, Z = 28), right IFS (X = 38, Y = 36, Z = 20), right pre‐SMA (X = 2, Y = 18, Z = 52), left insula (X = −40, Y = 16, Z = 0), and right insula (X = 30, Y = 24, Z = 6) defined by the AVm > AVc contrast. (c) Decreased activations for McGurk (AVm) relative to audiovisual congruent (AVc, red) and audiovisual incongruent (AVi, blue), and their intersection (pink). Activations are shown at p_FWE < 0.05 at the cluster level corrected for multiple comparisons within the entire brain, using an auxiliary uncorrected voxel threshold of p < 0.001. Thresholded images were rendered on an inflated canonical brain. AG, angular gyri; IFS, inferior frontal sulci; L, left; pre‐SMA, pre‐supplementary motor area; R, right.

3.2.2. McGurk relative to audiovisual congruent stimuli

Compared to congruent stimuli, McGurk stimuli increased activations in IFS, pre‐SMA extending into medial frontal gyrus, and insulae bilaterally, i.e., a network of regions involved in conflict processing and cognitive control (Brown & Braver, 2005; Duncan & Owen, 2000; Kerns et al., 2004). Further, they decreased activations in bilateral angular gyri and right posterior middle temporal gyrus (GLM‐1A, Figure 3b and Table 1). In a follow‐up analysis, we compared the BOLD‐response for the McGurk stimulus that typically elicits an illusory ‘da’ percept selectively to the audiovisual congruent /da/ stimulus that typically elicits a congruent ‘da’ percept. This more constrained comparison did not reveal any significant activations—thereby mirroring the behavioral results showing comparable response entropy for McGurk stimuli and audiovisual congruent /da/ stimuli.

TABLE 1.

Brain activation differences between McGurk and congruent stimuli.

Comparisons	Brain regions	Cluster size	MNI X	MNI Y	MNI Z	Z‐score (peak)	p_FWE value (cluster)
AVm > AVc	Right pre‐supplementary motor area	1381	2	18	52	5.52	<0.001
	Left insula	394	−40	16	0	4.90	<0.001
	Left inferior frontal sulcus	1209	−46	26	28	4.85	<0.001
	Right inferior frontal sulcus	904	38	36	20	4.84	<0.001
	Right insula	177	30	24	6	4.83	<0.001
AVm < AVc	Right angular gyrus	250	48	−64	18	5.06	<0.001
	Right middle temporal gyrus	250	54	−58	14	3.84	<0.001
	Right temporal pole	183	54	10	−32	4.59	0.001
	Left angular gyrus	416	−40	−54	28	4.36	<0.001
	Left superior frontal gyrus	192	−8	48	42	4.19	0.001

Note: GLM‐1A, activations are shown at p_FWE < 0.05 at the cluster level corrected for multiple comparisons within the entire brain, using an auxiliary uncorrected voxel threshold of p < 0.001.

Abbreviations: AVm, McGurk stimulus (i.e., visual /ga/ with auditory /ba/); AVc, audiovisual congruent stimuli.

3.2.3. McGurk relative to audiovisual incongruent stimuli

Compared to incongruent stimuli, McGurk stimuli increased activations in pre‐SMA extending into medial frontal gyrus and right insula and decreased activations in bilateral angular gyri and middle temporal gyri (GLM‐1A, Figure 3c and Table 2).

TABLE 2.

Brain activation differences between McGurk and incongruent stimuli.

Comparisons	Brain regions	Cluster size	MNI X	MNI Y	MNI Z	Z‐score (peak)	p_FWE value (cluster)
AVm > AVi	Left pre‐supplementary motor area	751	−2	14	58	4.33	<0.001
	Right pre‐supplementary motor area	751	8	22	42	4.21	<0.001
	Right insula	183	30	22	−2	4.05	0.001
AVm < AVi	Right middle temporal gyrus	1045	56	−42	16	4.75	<0.001
	Right angular gyrus	1045	60	−56	22	4.44	<0.001
	Left angular gyrus	464	−54	−62	28	4.20	<0.001
	Left middle temporal gyrus	106	−56	−18	−14	3.97	0.022

Note: GLM‐1A, activations are shown at p_FWE < 0.05 at the cluster level corrected for multiple comparisons within the entire brain, using an auxiliary uncorrected voxel threshold of p < 0.001.

Abbreviations: AVm, McGurk stimulus (i.e., visual /ga/ with auditory /ba/); AVi, audiovisual incongruent stimuli (i.e., visual /ba/ with auditory /ga/).

3.2.4. Activation differences across syllables

We observed activation differences across /ba/, /ga/ and /da/ syllables in widespread neural systems that were largely shared across auditory, visual and audiovisual stimuli. Key regions included the IFS/IFG, pre‐SMA extending into medial frontal gyrus, and insulae bilaterally (for detailed results, see Table S6 and Figure 4). Thus, activation differences across syllables arose in a network of regions that also exhibited differences between McGurk and congruent stimuli.

FIGURE 4. Effects of syllables in GLM‐2A. (a) Activation differences across syllables (/ba/, /ga/ and /da/) separately for auditory (A, yellow), visual (V, blue), and audiovisual congruent (AVc, red) stimuli, intersections between A and V (green), intersections between A and AVc (dark red), intersections between V and AVc (pink), and intersections between A, V, and AVc (white). (b) Parameter estimates for auditory, visual, and audiovisual congruent syllables relative to baseline at the MNI peak coordinate in left IFS (X = −46, Y = 26, Z = 28), right pre‐SMA (X = 2, Y = 18, Z = 52), and left insula (X = −40, Y = 16, Z = 0) defined by the AVm > AVc contrast in GLM‐1A. Activations are shown at p_FWE < 0.05 at the cluster level corrected for multiple comparisons within the entire brain, using an auxiliary uncorrected voxel threshold of p < 0.001. Thresholded images were rendered on an inflated canonical brain. L, left; R, right; IFS, inferior frontal sulci; pre‐SMA, pre‐supplementary motor area.

3.2.5. Effect of entropy over response distribution

Response entropy predicted activations in a widespread network of regions (GLM‐3, Figure 5). The BOLD response increased with greater entropy in bilateral IFS, pre‐SMA extending into MFG, and insulae (Figure 5, red and Table 3) and decreased in left angular gyrus and bilateral anterior MTG (Figure 5, blue and Table 3).
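As a minimal sketch of how such a response entropy can be computed, assume Shannon entropy in bits over the four response options (‘ba’, ‘da’, ‘ga’, and ‘other’). The function name `response_entropy`, the logarithm base, and the counts below are hypothetical illustrations, not data or code from this study.

```python
import math


def response_entropy(counts):
    """Shannon entropy (in bits) of a response distribution given raw counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("need at least one response")
    entropy = 0.0
    for count in counts:
        if count > 0:  # by convention, 0 * log(0) contributes nothing
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical response counts over ('ba', 'da', 'ga', 'other'); not study data.
consistent_responses = [38, 1, 1, 0]   # nearly always the same answer
mixed_responses = [14, 20, 5, 1]       # answers spread over several options

print(response_entropy(consistent_responses))
print(response_entropy(mixed_responses))
```

Under this definition, entropy is zero when an observer always gives the same response and reaches its maximum of 2 bits when the four options are chosen equally often.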

FIGURE 5. Effects of perceptual uncertainty (measured by entropy, GLM‐3). Increased activation with greater entropy over the response distribution (red), decreased activation with greater entropy over the response distribution (blue). Activations are shown at p_FWE < 0.05 at the cluster level corrected for multiple comparisons within the entire brain, using an auxiliary uncorrected voxel threshold of p < 0.001. Thresholded images were rendered on an inflated canonical brain. AG, angular gyrus; IFS, left inferior frontal sulci; pre‐SMA, pre‐supplementary motor area.

TABLE 3.

Brain activations predicted by entropy.

Modulator	Brain regions	Cluster size	MNI X	MNI Y	MNI Z	Z‐score (peak)	p_FWE value (cluster)
Positive prediction	Right pre‐supplementary motor area	2260	4	18	50	7.32	<0.001
	Left insula	5089	−28	22	4	6.98	<0.001
	Left inferior frontal sulcus	5089	−46	18	24	6.01	<0.001
	Right insula	2244	32	24	−6	6.63	<0.001
	Right inferior frontal sulcus	2244	46	8	24	5.90	<0.001
Negative prediction	Left angular gyrus	882	−38	−64	33	5.06	<0.001
	Left middle temporal gyrus	463	−58	−16	−26	4.63	<0.001


Note: GLM‐3, activations are shown at p_FWE < 0.05 at the cluster level corrected for multiple comparisons within the entire brain, using an auxiliary uncorrected voxel threshold of p < 0.001.

Further, we repeated the analyses and statistical comparisons described above (both whole brain and small volume corrected) using GLM‐1B and GLM‐2B, which included entropy as an additional regressor. After accounting for variation in response entropy, none of the statistical comparisons yielded significant results. This suggests that variation in response entropy across stimuli explains away the activation differences between McGurk and congruent or incongruent stimuli, as well as those across syllables.

3.2.6. Intersubject variability: Predicting brain activation for McGurk stimuli by observers' illusion rate or response entropy on McGurk trials

Given the extensive variability in the McGurk illusion rate, we investigated whether the activation level for McGurk stimuli relative to baseline covaried with observers' illusion rate or their response entropy on McGurk trials. The activation level for McGurk stimuli relative to baseline in left STG covaried negatively with observers' response entropy (Figure S5 and Table S8) but not with their illusion rate on McGurk trials.

4. DISCUSSION

The McGurk illusion, a perceptual phenomenon that arises from merging conflicting audiovisual speech signals, has been a valuable tool for investigating how we understand speech (Alsius et al., 2018; Peelle, 2019; Sams et al., 1991; Sekiyama et al., 2003; Tiippana, 2014; Tiippana et al., 2011; Tuomainen et al., 2005). Some researchers however have recently argued that the neural and cognitive mechanisms underlying the McGurk illusion differ fundamentally from those involved in congruent audiovisual speech perception (Getz & Toscano, 2021; Rosenblum, 2019; van Engen et al., 2017; van Engen et al., 2022). Our study initially seems to support this view by showing activation increases for McGurk stimuli relative to congruent audiovisual stimuli in a network of areas typically associated with conflict monitoring and cognitive control processes (Brown & Braver, 2005; Duncan & Owen, 2000; Kerns et al., 2004). Intriguingly, these areas also exhibited greater activation for McGurk than incongruent trials, even though the latter are thought to place greater demands on conflict monitoring and control. Moreover, the level of activation also varied across /ba/, /da/, and /ga/ syllables, even when the auditory and visual signals were congruent. This finding suggests that even congruent speech can place varying executive demands. It aligns with our observation that perceptual uncertainty, measured by Shannon entropy over observers' response distributions, strongly predicts the activation level in these areas, effectively explaining away differences between McGurk and normal congruent speech as well as differences between syllable classes.

Bayesian models view perception as inference based on noisy sensory signals (Helmholtz, 1867; Noppeney, 2021; Yuille & Bülthoff, 1993). Applied to speech perception, they propose that the brain infers a syllable from noisy acoustic signals and the corresponding facial movements (Kimmet et al., 2023; Lindborg & Andersen, 2021; Magnotti et al., 2018; Magnotti et al., 2020; Magnotti & Beauchamp, 2017; Meijer & Noppeney, 2023). Critically, the brain should integrate audiovisual signals from common sources, but segregate those from independent sources. Bayesian causal inference models deal with this so‐called causal inference by computing estimates that fuse and segregate the signals. To account for observers' uncertainty about the signals' causal structure, they compute a final perceptual estimate by combining the fusion and segregation estimates weighted by the probabilities of common or independent sources. Bayesian causal inference models thereby intimately link observers' causal and perceptual uncertainty (Kording et al., 2007; Noppeney, 2021; Noppeney & Lee, 2018; Shams & Beierholm, 2010). In particular at intermediate levels of audiovisual conflicts, ambiguity about whether signals come from one or two sources will increase observers' perceptual uncertainty about their perceived phoneme.
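The weighting of fusion and segregation estimates described above can be sketched for a discrete set of syllables. This is only an illustration under stated assumptions: the function names, the likelihood values, the uniform prior, and the choice to take the auditory estimate as the segregated report are all hypothetical and are not the model fitted in any of the cited studies.

```python
import math


def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def entropy_bits(probs):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def causal_inference_posterior(lik_a, lik_v, prior, p_common):
    """Return P(common source | signals) and the model-averaged syllable posterior.

    lik_a, lik_v: likelihood of the auditory / visual signal under each syllable.
    prior: prior probability of each syllable.
    p_common: prior probability that both signals share one source.
    """
    # Evidence for one common source: the same syllable generated both signals.
    ev_common = sum(a * v * p for a, v, p in zip(lik_a, lik_v, prior))
    # Evidence for independent sources: each signal has its own syllable.
    ev_indep = (sum(a * p for a, p in zip(lik_a, prior))
                * sum(v * p for v, p in zip(lik_v, prior)))
    post_common = (ev_common * p_common
                   / (ev_common * p_common + ev_indep * (1.0 - p_common)))
    # Fusion estimate combines both signals; segregation uses audition alone
    # (assumption: the observer reports the syllable they heard).
    fusion = normalize([a * v * p for a, v, p in zip(lik_a, lik_v, prior)])
    segregation = normalize([a * p for a, p in zip(lik_a, prior)])
    averaged = [post_common * f + (1.0 - post_common) * s
                for f, s in zip(fusion, segregation)]
    return post_common, averaged


# Hypothetical likelihoods over ('ba', 'da', 'ga') for an auditory /ba/
# paired with a visual /ga/; these numbers are made up for illustration.
lik_auditory = [0.60, 0.35, 0.05]
lik_visual = [0.02, 0.50, 0.48]
uniform_prior = [1 / 3, 1 / 3, 1 / 3]

post_common, posterior = causal_inference_posterior(
    lik_auditory, lik_visual, uniform_prior, p_common=0.5)
print(post_common, posterior, entropy_bits(posterior))
```

In this sketch, the entropy of the averaged posterior depends jointly on the unisensory likelihoods and on how certain the observer is about the causal structure, which is the link between causal and perceptual uncertainty described in the text.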

Moreover, in a syllable categorization task, perceptual uncertainty arises from two distinct sources: (i) the internal and external noise that corrupts the sensory signals, and (ii) the variability in how a particular phoneme or viseme is produced (Bejjanki et al., 2011). This latter variability explains why the viseme /ba/ is almost exclusively categorized as a ‘ba’, while the viseme /ga/ is often confused with a ‘da’. Conversely, the auditory phoneme /ga/ is almost always perceived as a ‘ga’, whereas the phoneme /ba/ can also be perceived as a ‘da’ (see Figure 2a, b). Hence, the McGurk illusion arises because a ‘da’ percept is a possible interpretation for both the auditory /ba/ and the visual /ga/ signal (Massaro, 1998; Oden & Massaro, 1978). Further, this illusory ‘da’ percept is especially likely in participants who predominantly report a veridical ‘da’ percept for the audiovisual congruent /da/ phoneme‐viseme pairs (i.e., significant Pearson correlation). By contrast, the reverse pairing, such as a visual /ba/ with an auditory /ga/ in the incongruent trials, seldom results in a fused ‘da’ percept, because such a ‘da’ interpretation is near‐impossible for either auditory /ga/ or visual /ba/ inputs.
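The asymmetry between the two pairings can be illustrated with a toy calculation. The sketch below assumes a simple multiplicative combination of the support that each modality lends to each syllable, followed by renormalization. The rule, the function name `fuse`, and all probabilities are hypothetical choices made for illustration and do not come from the study's data.

```python
SYLLABLES = ("ba", "da", "ga")


def fuse(auditory, visual):
    """Multiply the support for each syllable across modalities and renormalize."""
    products = [a * v for a, v in zip(auditory, visual)]
    total = sum(products)
    return [p / total for p in products]


# Hypothetical unisensory response probabilities over (ba, da, ga).
A_BA = [0.70, 0.28, 0.02]  # auditory /ba/: sometimes heard as 'da'
A_GA = [0.01, 0.04, 0.95]  # auditory /ga/: almost always heard as 'ga'
V_BA = [0.96, 0.03, 0.01]  # visual /ba/: almost always seen as 'ba'
V_GA = [0.02, 0.45, 0.53]  # visual /ga/: often confused with 'da'

mcgurk = dict(zip(SYLLABLES, fuse(A_BA, V_GA)))       # auditory /ba/ + visual /ga/
incongruent = dict(zip(SYLLABLES, fuse(A_GA, V_BA)))  # auditory /ga/ + visual /ba/

print(mcgurk)
print(incongruent)
```

With these made‐up values, ‘da’ receives most of the combined support for the McGurk pairing, whereas it receives very little for the reverse pairing, because neither the auditory /ga/ nor the visual /ba/ is compatible with a ‘da’ interpretation.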

Crucially, understanding perception as inference based on noisy sensory signals dispenses with the categorical distinction between congruent, McGurk, and incongruent phoneme‐viseme pairs. This perspective acknowledges that perceptual inference always carries a degree of uncertainty (Li & Ma, 2020). Even congruent phoneme‐viseme pairs can generate seemingly conflicting sensory signals due to sensory noise and within‐category variability. However, perceptual uncertainty is typically increased for McGurk stimuli because of their small non‐noticeable intersensory conflict which invokes causal and perceptual uncertainty. Consistent with Bayesian principles, we indeed observed larger response entropies for McGurk than audiovisual congruent stimuli when pooled over all syllable categories. However, the response entropy is only slightly greater for the McGurk than the audiovisual congruent /da/ stimulus. Moreover, the perceptual uncertainty was greater for McGurk than audiovisual incongruent conditions that introduce an even larger inter‐sensory conflict. This difference most likely results from the fact that in the unisensory context the auditory /ba/ syllable in McGurk stimuli is associated with a greater perceptual uncertainty than the /ga/ syllable that is the auditory component in the incongruent stimuli.

In short, McGurk stimuli are associated with greater perceptual uncertainty for two reasons. First, the auditory /ba/ and visual /ga/ stimuli are associated with greater perceptual uncertainty than, for example, auditory /ga/ and visual /ba/ stimuli, even when presented in unisensory contexts. Second, the intersensory conflict between auditory /ba/ and visual /ga/ introduces additional causal and perceptual uncertainty. While both factors jointly determine the posterior distribution and hence observers' perceptual uncertainty, the greater response entropy for the McGurk than the incongruent stimuli suggests that the former is the driving factor. Our results align with recent studies (Gonzales et al., 2021; Iqbal et al., 2023) showing that the McGurk illusion arises in particular when visual stimuli, even in unisensory contexts, are associated with substantial perceptual uncertainty and ambiguity. From the perspective of Bayesian causal inference, the auditory /ba/ and visual /ga/ signals also need to be compatible with a ‘da’ interpretation, so that the brain can integrate them into a single ‘da’ percept. Hence, perceptual ambiguity in the unisensory context is a driving factor for the McGurk illusion.

In line with this entropy profile, we observed enhanced activations for McGurk relative to congruent trials in IFS, pre‐SMA and insulae bilaterally—regions typically involved in conflict and cognitive control processes (Adam & Noppeney, 2010; Noppeney et al., 2008; Noppeney et al., 2010). Previous studies often attributed these activation increases to the intersensory conflict in McGurk stimuli and concluded that McGurk stimuli must therefore rely on distinct neural mechanisms (Perrachione & Ghosh, 2013; Wiersinga‐Post et al., 2010). However, in our study activation differences were no longer significant when we compared McGurk stimuli to the congruent stimuli selectively from the corresponding /da/ syllable. Further, activations in the same regions were also increased for McGurk relative to incongruent phoneme‐viseme pairs (Table 4), even though the latter should theoretically be associated with a greater perceived intersensory conflict. Activation levels in the same regions also varied across syllable categories of audiovisual congruent trials. Collectively, this pattern suggests that rather than being sensitive to inter‐sensory conflict these regions respond more generically to perceptual uncertainty that differs not only across congruent, McGurk and incongruent stimuli, but also across syllable categories (Figures S1 and S2). Indeed, in support of this conjecture, our follow‐up analysis revealed a positive correlation between perceptual uncertainty (i.e., response entropy) and activation levels in exactly this network of regions (Figure 5). Moreover, including response entropy as a predictor in our regression models effectively explained away activation differences between McGurk and congruent/incongruent phoneme‐viseme pairs.

TABLE 4.

Conjunctions over McGurk vs. congruent stimuli and McGurk vs. incongruent stimuli.

Comparisons	Brain regions	Cluster size	MNI X	MNI Y	MNI Z	Z‐score (peak)	p_FWE value (cluster)
AVm > AVc ∩ AVm > AVi	Left pre‐supplementary motor area	695	−2	14	58	4.33	<0.001
	Right pre‐supplementary motor area	695	8	22	42	4.22	<0.001
	Right insula	173	30	22	−2	4.05	0.002
AVm < AVc ∩ AVm < AVi	Right angular gyrus	202	48	−64	18	4.20	0.001
	Right middle temporal gyrus	202	54	−58	14	3.84	0.001
	Left angular gyrus	231	−52	−62	26	3.98	<0.001


Note: GLM‐1A, activations are shown at p_FWE < 0.05 at the cluster level corrected for multiple comparisons within the entire brain, using an auxiliary uncorrected voxel threshold of p < 0.001.

Abbreviations: AVm, McGurk stimulus (i.e., visual /ga/ with auditory /ba/); AVc, audiovisual congruent stimuli; AVi, audiovisual incongruent stimulus (i.e., visual /ba/ with auditory /ga/).

Taken together, our results suggest that McGurk and congruent phoneme‐viseme pairs rely on shared neural processing systems. As McGurk trials are often associated with greater perceptual uncertainty than congruent speech, they place greater demands on conflict and cognitive control processes and the associated network of regions. However, McGurk and audiovisual congruent /da/ stimuli were associated with comparable response entropy and neural activity levels, even though the latter was physically congruent. These findings suggest that at least for our McGurk stimuli observers successfully fused the conflicting audiovisual signals into ‘da’ percepts that are comparable to their audiovisual congruent /da/ counterparts. These behavioral and neuroimaging results dovetail nicely with a recent psychophysics and computational modeling study showing comparable perceptual and causal confidence for McGurk and congruent stimuli (Meijer & Noppeney, 2023).

In addition, the role of ‘classical multisensory areas’ such as the posterior superior temporal gyri/sulci (pSTG/S) in the processing of McGurk stimuli has been debated (Beauchamp, 2016; Beauchamp et al., 2010; Brang et al., 2020; Erickson et al., 2014; Hickok et al., 2018; Olson et al., 2002). Our data demonstrate that pSTG/S is sensitive to the incongruency of auditory and visual signals, exhibiting increased activation for the incongruent compared to the congruent stimuli (Figure S3 and Table S7, GLM‐1A). Similar results have previously been reported in studies focusing on linguistic (Benoit et al., 2010; Bernstein et al., 2008; Jones & Callan, 2003; Moris Fernandez et al., 2018; Murakami et al., 2018; Nath & Beauchamp, 2012; Szycik et al., 2009) and non‐linguistic stimuli (Davies‐Thompson et al., 2019; Watson et al., 2013). Specifically, one study using audiovisual emotional stimuli revealed that, while the increased pre‐SMA/SMA activations were associated with task difficulty, the increased pSTG/S activations were related to audiovisual incongruency (Watson et al., 2013). Despite the substantial individual variation in susceptibility to the McGurk illusion, we did not find a significant correlation between the activation level of pSTG/S and observers' illusion rate (Benoit et al., 2010; Nath et al., 2011; Nath & Beauchamp, 2012), which may be explained by differences in participants, stimuli (e.g., speaker), and tasks used in our experiment (Alsius et al., 2018; Brown et al., 2018; Feng et al., 2019; Magnotti et al., 2015; Mallick et al., 2015).

5. CONCLUSION

In conclusion, our research indicates that both the McGurk illusion and natural audiovisual speech perception result from inference based on noisy audiovisual signals. Consequently, both come with an inherent degree of uncertainty. Furthermore, both engage shared neural systems encompassing STS, IFS, pre‐SMA and insulae. Notably, the McGurk stimuli increase activations in the latter IFS, pre‐SMA and insular regions that form part of a wider cognitive control system. Critically, however, the activation differences between McGurk and congruent audiovisual stimuli within this cognitive control system can be directly attributed to observers' degree of perceptual uncertainty. Collectively, our behavioral and neural results suggest that the McGurk illusion and natural speech perception lie on a continuous spectrum rather than being categorically different. From a practical viewpoint they support the validity of the McGurk illusion as a tool for studying natural audiovisual speech perception.

AUTHOR CONTRIBUTIONS

Chenjie Dong, conceptualization, data collection, data analysis, writing manuscript; Uta Noppeney, conceptualization, resources, writing manuscript, supervision; Suiping Wang, conceptualization, resources, writing manuscript, supervision, funding acquisition, project administration.

CONFLICT OF INTEREST STATEMENT

All authors declare that they have no conflicts of interest.

Supporting information

FIGURE S1. Distribution of responses over the four choice options (‘ba’, ‘da’, ‘ga’, and ‘other’) across the 11 stimulus classes. A_/ba/, auditory /ba/; A_/da/, auditory /da/; A_/ga/, auditory /ga/; V_/ba/, visual /ba/; V_/da/, visual /da/; V_/ga/, visual /ga/; AVc_/ba/, audiovisual congruent /ba/; AVc_/da/, audiovisual congruent /da/; AVc_/ga/, audiovisual congruent /ga/; AVm, McGurk (i.e., visual /ga/ with auditory /ba/), AVi, audiovisual incongruent (i.e., visual /ba/ with auditory /ga/).

FIGURE S2. Response entropy for the 11 stimuli. A_/ba/, auditory /ba/; A_/da/, auditory /da/; A_/ga/, auditory /ga/; V_/ba/, visual /ba/; V_/da/, visual /da/; V_/ga/, visual /ga/; AVc_/ba/, audiovisual congruent /ba/; AVc_/da/, audiovisual congruent /da/; AVc_/ga/, audiovisual congruent /ga/; AVm, McGurk (i.e., visual /ga/ with auditory /ba/), and AVi, audiovisual incongruent (i.e., visual /ba/ with auditory /ga/).

FIGURE S3. Increased activations for audiovisual incongruent (AVi) relative to audiovisual congruent stimuli (AVc, GLM‐1A). Activations are shown at p_FWE < 0.05 at the cluster level corrected for multiple comparisons within the entire brain, using an auxiliary uncorrected voxel threshold of p < 0.001.

FIGURE S4. Effects of sensory contexts and phonemes at voxel threshold of p < 0.001 (uncorrected for illustrational purposes). (A) GLM‐1A: increased activations for McGurk stimulus (AVm) relative to audiovisual congruent (AVc, red) and incongruent (AVi, blue) stimuli, and their intersection (pink). (B) GLM‐2A: activation differences across syllables (/ba/, /ga/, and /da/) for auditory (A, yellow), visual (V, blue), audiovisual congruent (AVc, red) stimuli, intersections between A and V (green), intersections between A and AV (wine dark red), intersections between V and AV (pink), and intersections between A, V, and AVc (white).

FIGURE S5. Brain areas that covary with the response entropy on the McGurk stimulus across participants. Activations are shown at p_FWE < 0.05 at the cluster level corrected for multiple comparisons within the entire brain, using an auxiliary uncorrected voxel threshold of p < 0.001.

TABLE S1. Statistical results of the 3 (modality: A, V, and AVc) × 3 (syllable: /ba/, /da/, and /ga/) repeated measures ANOVA on categorization accuracy.

TABLE S2. Statistical results of the 3 (modality: A, V, and AVc) × 3 (syllable: /ba/, /da/, and /ga/) repeated measures ANOVA on response entropy.

TABLE S3. Predicting McGurk illusion rate by percentage of veridical ‘da’ percepts for audiovisual congruent /da/ stimulus across participants.

TABLE S4. Predicting McGurk illusion rate by the fraction of ‘da’ responses on auditory /ba/ and visual /ga/ stimuli across participants.

TABLE S5. Predicting McGurk illusion rate by entropy on auditory /ba/ and visual /ga/ stimuli across participants.

TABLE S6. Brain activation differences across syllables separately for AV, A and V modalities.

TABLE S7. Brain activations for audiovisual incongruent > congruent stimuli.

TABLE S8. Brain areas that covary with the response entropy on the McGurk stimulus across participants.

HBM-45-e26653-s001.docx (1.5 MB, docx)

ACKNOWLEDGMENTS

This research was funded by the Key Research and Development Program of Guangdong, China (2023B0303010004) and the National Natural Science Foundation of China (No. 32171051). We thank Lizhen Qiu and Qi Yao for helping with the data acquisition.

Dong, C., Noppeney, U., & Wang, S. (2024). Perceptual uncertainty explains activation differences between audiovisual congruent speech and McGurk stimuli. Human Brain Mapping, 45(4), e26653. 10.1002/hbm.26653

Uta Noppeney and Suiping Wang are senior authors and also contributed equally to this study.

DATA AVAILABILITY STATEMENT

Data are publicly available on the Open Science Framework at: https://osf.io/9pbh7/.

REFERENCES

Adam, R., & Noppeney, U. (2010). Prior auditory information shapes visual category‐selectivity in ventral occipito‐temporal cortex. NeuroImage, 52(4), 1592–1602. 10.1016/j.neuroimage.2010.05.002
Alsius, A., Pare, M., & Munhall, K. G. (2018). Forty years after hearing lips and seeing voices: The McGurk effect revisited. Multisensory Research, 31(1–2), 111–144. 10.1163/22134808-00002565
Ashburner, J., & Friston, K. J. (2005). Unified segmentation. NeuroImage, 26(3), 839–851. 10.1016/j.neuroimage.2005.02.018
Baum, S. H., Martin, R. C., Hamilton, A. C., & Beauchamp, M. S. (2012). Multisensory speech perception without the left superior temporal sulcus. NeuroImage, 62(3), 1825–1832. 10.1016/j.neuroimage.2012.05.034
Beauchamp, M. S. (2016). Audiovisual speech integration: Neural substrates and behavior. In Neurobiology of language (pp. 515–526). Elsevier.
Beauchamp, M. S., Nath, A. R., & Pasalar, S. (2010). fMRI‐guided transcranial magnetic stimulation reveals that the superior temporal sulcus is a cortical locus of the McGurk effect. The Journal of Neuroscience, 30(7), 2414–2417. 10.1523/JNEUROSCI.4865-09.2010
Bejjanki, V. R., Clayards, M., Knill, D. C., & Aslin, R. N. (2011). Cue integration in categorical tasks: Insights from audio‐visual speech perception. PLoS One, 6(5), e19812. 10.1371/journal.pone.0019812
Benoit, M. M., Raij, T., Lin, F. H., Jääskeläinen, I. P., & Stufflebeam, S. (2010). Primary and multisensory cortical activity is correlated with audiovisual percepts. Human Brain Mapping, 31(4), 526–538.
Bernstein, L. E., Lu, Z. L., & Jiang, J. (2008). Quantified acoustic‐optical speech signal incongruity identifies cortical sites of audiovisual speech processing. Brain Research, 1242, 172–184. 10.1016/j.brainres.2008.04.018
Brang, D., Plass, J., Kakaizada, S., & Hervey‐Jumper, S. L. (2020). Auditory‐visual speech behaviors are resilient to left pSTS damage. bioRxiv, 2020.09.26.314799.
Brown, J. W., & Braver, T. S. (2005). Learned predictions of error likelihood in the anterior cingulate cortex. Science, 307(5712), 1118–1121. 10.1126/science.1105783
Brown, V. A., Hedayati, M., Zanger, A., Mayn, S., Ray, L., Dillman‐Hasso, N., & Strand, J. F. (2018). What accounts for individual differences in susceptibility to the McGurk effect? PLoS One, 13(11), e0207160. 10.1371/journal.pone.0207160
Davies‐Thompson, J., Elli, G. V., Rezk, M., Benetti, S., van Ackeren, M., & Collignon, O. (2019). Hierarchical brain network for face and voice integration of emotion expression. Cerebral Cortex, 29(9), 3590–3605. 10.1093/cercor/bhy240
Duncan, J., & Owen, A. M. (2000). Common regions of the human frontal lobe recruited by diverse cognitive demands. Trends in Neurosciences, 23(10), 475–483. 10.1016/s0166-2236(00)01633-7
Erickson, L. C., Zielinski, B. A., Zielinski, J. E. V., Liu, G., Turkeltaub, P. E., Leaver, A. M., & Rauschecker, J. P. (2014). Distinct cortical locations for integration of audiovisual speech and the McGurk effect. Frontiers in Psychology, 5, 534.
Feng, G., Zhou, B., Zhou, W., Beauchamp, M. S., & Magnotti, J. F. (2019). A laboratory study of the McGurk effect in 324 monozygotic and dizygotic twins. Frontiers in Neuroscience, 13, 1029. 10.3389/fnins.2019.01029
Friston, K. J., Holmes, A. P., Worsley, K. J., Poline, J.‐P., Frith, C. D., & Frackowiak, R. S. J. (1994). Statistical parametric maps in functional imaging: A general linear approach. Human Brain Mapping, 2(4), 189–210. 10.1002/hbm.460020402
Gau, R., & Noppeney, U. (2016). How prior expectations shape multisensory perception. NeuroImage, 124, 876–886. 10.1016/j.neuroimage.2015.09.045
Getz, L. M., & Toscano, J. C. (2021). Rethinking the McGurk effect as a perceptual illusion. Attention, Perception, & Psychophysics, 83(6), 2583–2598.
Gonzales, M. G., Backer, K. C., Mandujano, B., & Shahin, A. J. (2021). Rethinking the mechanisms underlying the McGurk illusion. Frontiers in Human Neuroscience, 15, 616049. 10.3389/fnhum.2021.616049
Hasson, U., Skipper, J. I., Nusbaum, H. C., & Small, S. L. (2007). Abstract coding of audiovisual speech: Beyond sensory representation. Neuron, 56(6), 1116–1126. 10.1016/j.neuron.2007.09.037
Helmholtz, H. J. H. (1867). Handbuch der physiologischen Optik. Leopold Voss.
Hickok, G., Rogalsky, C., Matchin, W., Basilakos, A., Cai, J., Pillay, S., Ferrill, M., Mickelsen, S., Anderson, S. W., & Love, T. (2018). Neural networks supporting audiovisual integration for speech: A large‐scale lesion study. Cortex, 103, 360–371.
Iqbal, Z. J., Shahin, A. J., Bortfeld, H., & Backer, K. C. (2023). The McGurk illusion: A default mechanism of the auditory system. Brain Sciences, 13(3), 510. 10.3390/brainsci13030510
Jones, J. A., & Callan, D. E. (2003). Brain activity during audiovisual speech perception: An fMRI study of the McGurk effect. Neuroreport, 14(8), 1129–1133.
Kerns, J. G., Cohen, J. D., MacDonald, A. W., 3rd, Cho, R. Y., Stenger, V. A., & Carter, C. S. (2004). Anterior cingulate conflict monitoring and adjustments in control. Science, 303(5660), 1023–1026. 10.1126/science.1089910
Kimmet, F., Pedersen, S., Cardenas, V., Rubiera, C., Johnson, G., Sans, A., Baldwin, M., & Odegaard, B. (2023). Metacognition and causal inference in audiovisual speech. Multisensory Research, 36(3), 289–311. 10.1163/22134808-bja10094
Kording, K. P., Beierholm, U., Ma, W. J., Quartz, S., Tenenbaum, J. B., & Shams, L. (2007). Causal inference in multisensory perception. PLoS One, 2(9), e943. 10.1371/journal.pone.0000943
Li, H. H., & Ma, W. J. (2020). Confidence reports in decision‐making with multiple alternatives violate the Bayesian confidence hypothesis. Nature Communications, 11(1), 2004. 10.1038/s41467-020-15581-6
Lindborg, A., & Andersen, T. S. (2021). Bayesian binding and fusion models explain illusion and enhancement effects in audiovisual speech perception. PLoS One, 16(2), e0246986. 10.1371/journal.pone.0246986
Luttke, C. S., Ekman, M., van Gerven, M. A., & de Lange, F. P. (2016). Preference for audiovisual speech congruency in superior temporal cortex. Journal of Cognitive Neuroscience, 28(1), 1–7. 10.1162/jocn_a_00874
Magnotti, J. F., Basu Mallick, D., Feng, G., Zhou, B., Zhou, W., & Beauchamp, M. S. (2015). Similar frequency of the McGurk effect in large samples of native Mandarin Chinese and American English speakers. Experimental Brain Research, 233(9), 2581–2586. 10.1007/s00221-015-4324-7
Magnotti, J. F., & Beauchamp, M. S. (2015). The noisy encoding of disparity model of the McGurk effect. Psychonomic Bulletin & Review, 22(3), 701–709. 10.3758/s13423-014-0722-2
Magnotti, J. F., & Beauchamp, M. S. (2017). A causal inference model explains perception of the McGurk effect and other incongruent audiovisual speech. PLoS Computational Biology, 13(2), e1005229. 10.1371/journal.pcbi.1005229
Magnotti, J. F., Dzeda, K. B., Wegner‐Clemens, K., Rennig, J., & Beauchamp, M. S. (2020). Weak observer‐level correlation and strong stimulus‐level correlation between the McGurk effect and audiovisual speech‐in‐noise: A causal inference explanation. Cortex, 133, 371–383. 10.1016/j.cortex.2020.10.002
Magnotti, J. F., Smith, K. B., Salinas, M., Mays, J., Zhu, L. L., & Beauchamp, M. S. (2018). A causal inference explanation for enhancement of multisensory integration by co‐articulation. Scientific Reports, 8(1), 18032. 10.1038/s41598-018-36772-8
Mallick, D. B., Magnotti, J. F., & Beauchamp, M. S. (2015). Variability and stability in the McGurk effect: Contributions of participants, stimuli, time, and response type. Psychonomic Bulletin & Review, 22(5), 1299–1307. 10.3758/s13423-015-0817-4
Massaro, D. W. (1998). Perceiving talking faces: From speech perception to a behavioral principle. The MIT Press.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588), 746–748.
Meijer, D., & Noppeney, U. (2023). Metacognition in the audiovisual McGurk illusion: Perceptual and causal confidence. Philosophical Transactions of the Royal Society B: Biological Sciences, 378(1886), 20220348. 10.1101/2023.03.21.533540
Moris Fernandez, L. , Macaluso, E. , & Soto‐Faraco, S. (2017). Audiovisual integration as conflict resolution: The conflict of the McGurk illusion. Human Brain Mapping, 38(11), 5691–5705. 10.1002/hbm.23758 [DOI] [PMC free article] [PubMed] [Google Scholar]
Moris Fernandez, L. , Torralba, M. , & Soto‐Faraco, S. (2018). Theta oscillations reflect conflict processing in the perception of the McGurk illusion. The European Journal of Neuroscience, 48(7), 2630–2641. 10.1111/ejn.13804 [DOI] [PubMed] [Google Scholar]
Murakami, T. , Abe, M. , Wiratman, W. , Fujiwara, J. , Okamoto, M. , Mizuochi‐Endo, T. , Iwabuchi, T. , Makuuchi, M. , Yamashita, A. , & Tiksnadi, A. (2018). The motor network reduces multisensory illusory perception. Journal of Neuroscience, 38(45), 9679–9688. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nath, A. R. , & Beauchamp, M. S. (2012). A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. NeuroImage, 59(1), 781–787. 10.1016/j.neuroimage.2011.07.024 [DOI] [PMC free article] [PubMed] [Google Scholar]
Nath, A. R. , Fava, E. E. , & Beauchamp, M. S. (2011). Neural correlates of interindividual differences in children's audiovisual speech perception. Journal of Neuroscience, 31(39), 13963–13971. [DOI] [PMC free article] [PubMed] [Google Scholar]
Noppeney, U. (2021). Perceptual inference, learning, and attention in a multisensory world. Annual Review of Neuroscience, 44, 449–473. 10.1146/annurev-neuro-100120-085519 [DOI] [PubMed] [Google Scholar]
Noppeney, U. , Josephs, O. , Hocking, J. , Price, C. J. , & Friston, K. J. (2008). The effect of prior visual information on recognition of speech and sounds. Cerebral Cortex, 18(3), 598–609. 10.1093/cercor/bhm091 [DOI] [PubMed] [Google Scholar]
Noppeney, U. , & Lee, H. L. (2018). Causal inference and temporal predictions in audiovisual perception of speech and music. Annals of the new York Academy of Sciences, 1423, 102–116. 10.1111/nyas.13615 [DOI] [PubMed] [Google Scholar]
Noppeney, U., Ostwald, D., & Werner, S. (2010). Perceptual decisions formed by accumulation of audiovisual evidence in prefrontal cortex. The Journal of Neuroscience, 30(21), 7434–7446. 10.1523/JNEUROSCI.0455-10.2010
Oden, G. C., & Massaro, D. W. (1978). Integration of featural information in speech perception. Psychological Review, 85(3), 172–191. 10.1037/0033-295x.85.3.172
Olson, I. R., Gatenby, J. C., & Gore, J. C. (2002). A comparison of bound and unbound audio‐visual information processing in the human cerebral cortex. Brain Research. Cognitive Brain Research, 14(1), 129–138. 10.1016/s0926-6410(02)00067-8
Peelle, J. E. (2019). The neural basis for auditory and audiovisual speech perception. In The Routledge handbook of phonetics (pp. 193–216). Routledge.
Perrachione, T. K., & Ghosh, S. S. (2013). Optimized design and analysis of sparse‐sampling FMRI experiments. Frontiers in Neuroscience, 7, 55. 10.3389/fnins.2013.00055
Rosenblum, L. D. (2019). Audiovisual speech perception and the McGurk effect. Oxford Research Encyclopedia of Linguistics. 10.1093/acrefore/9780199384655.013.420
Sams, M., Aulanko, R., Hamalainen, M., Hari, R., Lounasmaa, O. V., Lu, S. T., & Simola, J. (1991). Seeing speech: Visual information from lip movements modifies activity in the human auditory cortex. Neuroscience Letters, 127(1), 141–145. 10.1016/0304-3940(91)90914-f
Sekiyama, K. (1997). Cultural and linguistic factors in audiovisual speech processing: The McGurk effect in Chinese subjects. Perception & Psychophysics, 59(1), 73–80. 10.3758/Bf03206849
Sekiyama, K., Kanno, I., Miura, S., & Sugita, Y. (2003). Auditory‐visual speech perception examined by fMRI and PET. Neuroscience Research, 47(3), 277–287. 10.1016/s0168-0102(03)00214-1
Sekiyama, K., & Tohkura, Y. I. (1991). McGurk effect in non‐English listeners: Few visual effects for Japanese subjects hearing Japanese syllables of high auditory intelligibility. The Journal of the Acoustical Society of America, 90(4), 1797–1805. 10.1121/1.401660
Shams, L., & Beierholm, U. (2022). Bayesian causal inference: A unifying neuroscience theory. Neuroscience and Biobehavioral Reviews, 137, 104619. 10.1016/j.neubiorev.2022.104619
Shams, L., & Beierholm, U. R. (2010). Causal inference in perception. Trends in Cognitive Sciences, 14(9), 425–432. 10.1016/j.tics.2010.07.001
Skipper, J. I., van Wassenhove, V., Nusbaum, H. C., & Small, S. L. (2007). Hearing lips and seeing voices: How cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex, 17(10), 2387–2399.
Szycik, G. R., Jansma, H., & Münte, T. F. (2009). Audiovisual integration during speech comprehension: An fMRI study comparing ROI‐based and whole brain analyses. Human Brain Mapping, 30(7), 1990–1999. 10.1002/hbm.20640
Szycik, G. R., Stadler, J., Tempelmann, C., & Münte, T. F. (2012). Examining the McGurk illusion using high‐field 7 tesla functional MRI. Frontiers in Human Neuroscience, 6, 95.
Tiippana, K. (2014). What is the McGurk effect? [opinion]. Frontiers in Psychology, 5, 725. 10.3389/fpsyg.2014.00725
Tiippana, K., Puharinen, H., Mottonen, R., & Sams, M. (2011). Sound location can influence audiovisual speech perception when spatial attention is manipulated. Seeing and Perceiving, 24(1), 67–90. 10.1163/187847511x557308
Tse, C. Y., Gratton, G., Garnsey, S. M., Novak, M. A., & Fabiani, M. (2015). Read my lips: Brain dynamics associated with audiovisual integration and deviance detection. Journal of Cognitive Neuroscience, 27(9), 1723–1737. 10.1162/jocn_a_00812
Tuomainen, J., Andersen, T. S., Tiippana, K., & Sams, M. (2005). Audio‐visual speech perception is special. Cognition, 96(1), B13–B22. 10.1016/j.cognition.2004.10.004
van Engen, K. J., Dey, A., Sommers, M. S., & Peelle, J. E. (2022). Audiovisual speech perception: Moving beyond McGurk. The Journal of the Acoustical Society of America, 152(6), 3216–3225.
van Engen, K. J., Xie, Z., & Chandrasekaran, B. (2017). Audiovisual sentence recognition not predicted by susceptibility to the McGurk effect. Attention, Perception, & Psychophysics, 79(2), 396–403. 10.3758/s13414-016-1238-9
Watson, R., Latinus, M., Noguchi, T., Garrod, O., Crabbe, F., & Belin, P. (2013). Dissociating task difficulty from incongruence in face‐voice emotion integration. Frontiers in Human Neuroscience, 7, 744. 10.3389/fnhum.2013.00744
Wiersinga‐Post, E., Tomaskovic, S., Slabu, L., Renken, R., de Smit, F., & Duifhuis, H. (2010). Decreased BOLD responses in audiovisual processing. Neuroreport, 21(18), 1146–1151. 10.1097/WNR.0b013e328340cc47
Yuille, A. L., & Bülthoff, H. H. (1993). Bayesian decision theory and psychophysics (2). Max Planck Institute for Biological Cybernetics.

Supplementary Materials