The Journal of Neuroscience. 2024 Aug 21;44(34):e0767242024. doi: 10.1523/JNEUROSCI.0767-24.2024

The Contribution of Motion-Sensitive Brain Areas to Visual Speech Recognition

Jirka Liessens, Simon Ladouce
PMCID: PMC11340274  PMID: 39168648

The integration of multiple sources of sensory information, particularly visual information such as lip movements, can substantially facilitate comprehension of spoken language. Accordingly, although research on verbal comprehension initially focused on the processing of auditory information, the field has since expanded to study a wider range of input modalities and now encompasses the integration of auditory and visual information (Bernstein and Liebenthal, 2014). Considering the processing of visual information is essential for addressing questions about speech comprehension, such as how individuals compensate for unreliable auditory input (e.g., in a noisy environment) or for hearing impairments. The growing evidence that visual processing contributes to speech recognition prompts investigations into the functional and structural neural underpinnings of this contribution.

Visual speech recognition involves interpreting mouth and lip movements to comprehend spoken words. The middle temporal visual area V5 (V5/MT), known for its role in motion processing, has also been linked to visual speech perception. Indeed, both positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) studies have consistently reported V5/MT activation during lip reading (Paulesu et al., 2003; Santi et al., 2003). However, this association provides no direct evidence that V5/MT is necessary for visual speech recognition. Moreover, Mather et al. (2016) found that disruptive transcranial magnetic stimulation (TMS) over V5/MT impaired the processing of nonbiological motion but not of biological motion, raising the question of whether V5/MT is causally required to process the biological motion that carries visual speech. Further functional studies were therefore needed to clarify the role of V5/MT in motion processing and speech comprehension.

In a recent study, Jeschke et al. (2023) tested whether V5/MT is necessary for visual speech recognition by applying inhibitory TMS pulses over V5/MT while volunteers performed tasks that rely on motion processing and visual speech recognition. Participants received bilateral inhibitory TMS, either over V5/MT or over the vertex (a control site), to disrupt neural processing in a focal area akin to a temporary virtual lesion. Participants performed two tasks, one assessing nonbiological motion processing and the other assessing visual speech recognition. In the first task, participants viewed two displays of randomly moving dots presented sequentially and indicated whether the overall direction of movement matched across the two displays. In the second task, participants viewed two muted videos of a female speaker uttering visually distinct vowel–consonant–vowel combinations and, similarly, indicated whether the combinations in the two videos matched. Accuracy and response times were recorded before TMS was applied, to establish baseline performance; participants then performed both tasks during inhibitory TMS and again after the stimulation ended.

As expected, TMS-mediated inhibition of V5/MT was associated with significantly increased response times in both the visual speech recognition task and the nonbiological motion task relative to stimulation of the vertex. In addition, TMS inhibition of V5/MT was associated with a significantly smaller practice effect (indicated by the ratio of pre-stimulation-to-post-stimulation response times) in both tasks compared with stimulation of the vertex, likely as a result of reduced task performance during stimulation. Taken together, these results suggest that V5/MT causally contributes to visual speech recognition and nonbiological motion processing.
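To make the practice-effect measure concrete, the following minimal sketch shows how such a pre-to-post response-time ratio could be computed. The values and variable names are illustrative assumptions, not data or analysis code from Jeschke et al. (2023).

```python
import numpy as np

# Hypothetical median response times (ms) per participant, before and after stimulation.
# Illustrative values only; not data from Jeschke et al. (2023).
rt_pre = np.array([812.0, 794.0, 856.0, 780.0])   # baseline block (before TMS)
rt_post = np.array([731.0, 752.0, 803.0, 769.0])  # block after stimulation

# Practice effect expressed as the ratio of pre- to post-stimulation response times:
# values > 1 indicate faster responses after practice; values near 1 indicate a
# smaller practice effect, as reported after V5/MT (vs vertex) stimulation.
practice_effect = rt_pre / rt_post
print(practice_effect.round(3))  # [1.111 1.056 1.066 1.014]
print(practice_effect.mean())    # group-level summary of the practice effect
```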

These findings help establish V5/MT as part of the interconnected neural network underlying visual speech recognition (see Fig. 1 for a simplified illustration). Visual speech recognition relies on a network of visual areas involved in processing both form and motion (Arnal et al., 2011). Previous work showed that V5/MT is part of the dorsal visual stream and is thus associated with motion processing. Other studies have linked visual speech recognition to activation in the posterior superior temporal sulcus (pSTS) and in a region inferior and posterior to the pSTS known as the temporal visual speech area (TVSA; Paulesu et al., 2003; Arnal et al., 2011; Bernstein and Liebenthal, 2014). The TVSA is primarily associated with the ventral visual stream for form. Bernstein and Liebenthal (2014) proposed that the pSTS integrates visual information from both V5/MT and the TVSA to facilitate the processing of auditory signals by modulating primary auditory cortical areas within Heschl's gyrus. Future work should elucidate the functional connections between subareas of the pSTS and V5/MT using imaging methods with high spatial resolution, such as fMRI at 7 tesla or above.

Figure 1.

Illustration of the connections between V5/MT, the TVSA, the pSTG/S, and Heschl's gyrus within the visual speech recognition functional network. V5/MT and the TVSA (in blue) have feedforward connections to the pSTG/S (associative area in yellow). The pSTG/S has an efferent modulatory connection (dotted arrow) with primary auditory cortical areas located within Heschl's gyrus (auditory areas in green), which facilitates the processing of auditory input based on previously integrated visual information. Only brain areas relevant to this Journal Club article are depicted in this illustration of the visual speech recognition network.
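For readers who want the pathways in the figure spelled out explicitly, the same connectivity can be written as a small directed graph. This sketch merely restates the caption; representing it as a Python dictionary is an illustrative choice, not part of the original studies.

```python
# Directed connections of the simplified visual speech recognition network in Figure 1.
# Each (source, target) pair maps to the connection type described in the caption.
visual_speech_network = {
    ("V5/MT", "pSTG/S"): "feedforward",          # motion information (dorsal stream)
    ("TVSA", "pSTG/S"): "feedforward",           # form information (ventral stream)
    ("pSTG/S", "Heschl's gyrus"): "modulatory",  # efferent modulation of primary auditory cortex
}

for (source, target), connection_type in visual_speech_network.items():
    print(f"{source} -> {target}: {connection_type}")
```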

An aspect of visual speech recognition that Jeschke et al. (2023) did not address is cross-linguistic differences in how this network functions and how V5/MT contributes to it. Most research on visual speech recognition has involved native speakers of Germanic languages. However, Sekiyama and Burnham (2008) found behavioral differences between Japanese and English native speakers in the use of visual speech information: English native speakers exhibit shorter reaction times for visually congruent speech information than their Japanese counterparts. To further investigate these behavioral differences, Shinozaki et al. (2016) studied functional connectivity during a visual speech recognition task and found that V5/MT activity correlated more strongly with activity in Heschl's gyrus in English speakers than in Japanese speakers. This finding hints at a contribution of V5/MT to the network that depends on the language. It is also possible that motion information from V5/MT is weighted more strongly during multisensory integration in the STS in English speakers than in Japanese speakers. A TMS study spanning multiple languages could reveal whether the effect of V5/MT stimulation depends on the language, for example, by comparing languages that differ in the number of visual speech signals (i.e., visemes), as suggested by Shinozaki et al. (2016). English and Japanese would provide a suitable contrast for future studies.

In conclusion, Jeschke et al. (2023) found that inhibitory TMS over V5/MT increased response times and reduced practice effects during visual speech recognition, providing causal evidence for a contribution of V5/MT to this ability. Future research should examine the exact contribution of V5/MT within the larger visual speech recognition network and whether cross-linguistic differences in the use of visual speech information are reflected in the structure and/or function of this network. More broadly, a better understanding of the role of V5/MT in speech recognition could contribute to building a more comprehensive model of speech processing that applies across different languages.

References

  1. Arnal LH, Wyart V, Giraud A-L (2011) Transitions in neural oscillations reflect prediction errors generated in audiovisual speech. Nat Neurosci 14:797–801. 10.1038/nn.2810
  2. Bernstein LE, Liebenthal E (2014) Neural pathways for visual speech perception. Front Neurosci 8:386. 10.3389/fnins.2014.00386
  3. Jeschke L, Mathias B, Von Kriegstein K (2023) Inhibitory TMS over visual area V5/MT disrupts visual speech recognition. J Neurosci 43:7690–7699. 10.1523/JNEUROSCI.0975-23.2023
  4. Mather G, Battaglini L, Campana G (2016) TMS reveals flexible use of form and motion cues in biological motion perception. Neuropsychologia 84:193–197. 10.1016/j.neuropsychologia.2016.02.015
  5. Paulesu E, Perani D, Blasi V, Silani G, Borghese NA, De Giovanni U, Sensolo S, Fazio F (2003) A functional-anatomical model for lipreading. J Neurophysiol 90:2005–2013. 10.1152/jn.00926.2002
  6. Santi A, Servos P, Vatikiotis-Bateson E, Kuratate T, Munhall K (2003) Perceiving biological motion: dissociating visible speech from walking. J Cogn Neurosci 15:800–809. 10.1162/089892903322370726
  7. Sekiyama K, Burnham D (2008) Impact of language on development of auditory-visual speech perception. Dev Sci 11:306–320. 10.1111/j.1467-7687.2008.00677.x
  8. Shinozaki J, Hiroe N, Sato M, Nagamine T, Sekiyama K (2016) Impact of language on functional connectivity for audiovisual speech integration. Sci Rep 6:31388. 10.1038/srep31388
