Abstract
Masapollo, Polka, and Ménard (2017) recently reported a robust directional asymmetry in unimodal visual vowel perception: adult perceivers discriminate a change from an English /u/ viseme to a French /u/ viseme significantly better than a change in the reverse direction. This asymmetry replicates a frequent pattern found in unimodal auditory vowel perception that points to a universal bias favoring more extreme vocalic articulations, which lead to acoustic signals with increased formant convergence. In the present article, we report five experiments designed to investigate whether this asymmetry in the visual realm reflects a speech-specific or general processing bias. We successfully replicated the directional effect using Masapollo et al.'s dynamically-articulating faces, but failed to replicate the effect when the faces were shown under static conditions. Asymmetries also emerged during discrimination of canonically-oriented point-light stimuli that retained the kinematics and configuration of the articulating mouth. In contrast, no asymmetries emerged during discrimination of rotated point-light stimuli or Lissajou patterns that retained the kinematics, but not the canonical orientation or spatial configuration, of the labial gestures. These findings suggest that the perceptual processes underlying asymmetries in unimodal visual vowel discrimination are sensitive to speech-specific motion and configural properties, and raise foundational questions concerning the role of specialized and general processes in vowel perception.
Keywords: visual speech perception, natural referent vowel framework, focal vowels, eye-tracking, point-light stimuli, non-speech
1. Introduction
Throughout the years, considerable research has shown that during face-to-face conversational interactions, humans make use of sensory information across the auditory and visual modalities (see, e.g., Fowler, 2004; Rosenblum, 2005). For all talkers, except perhaps the very best ventriloquists, the production of speech sounds is accompanied by corresponding facial movements (see, e.g., Vatikiotis-Bateson, Munhall, Kasahara, Garcia, & Yehia, 1996a; Yehia, Rubin, & Vatikiotis-Bateson, 1998; Munhall & Vatikiotis-Bateson, 2004). Perceivers are highly sensitive to these visual correlates of speech: everyone can engage in lip reading to some degree, and the well-known “McGurk effect” (McGurk & MacDonald, 1976) shows that the perception of speech sounds can be altered when the acoustic signal is dubbed with phonetically-incongruent facial movements. Even young infants show evidence of detecting cross-modal correspondences between the acoustic and optic structures of speech (e.g., Kuhl & Meltzoff, 1982; Yeung & Werker, 2013; Guellai, Streri, Chopin, Rider, & Kitamura, 2016).
Investigating how perceivers use visual information from a talker's face provides an opportunity to inform our understanding of the processes underlying speech perception, as well as the nature of the information that those processes operate on (Fowler, 2004; Rosenblum, 2005, 2008). However, the majority of research on speech perception has focused on identifying aspects of the acoustic signal that underlie speaker-to-perceiver communication. Relatively little is known about how perceivers track and extract dynamic visual-facial features specified by the optical speech signal and map them onto the phonetic categories of their language. In the present study, we address this issue in the domain of vowel perception, focusing on a well-established perceptual bias that human perceivers display when processing auditory as well as visual vowel stimuli.
Numerous experimental studies using unimodal acoustic vowel stimuli with both infants and adults have repeatedly demonstrated that discrimination of the same vowel pair varies depending on the order in which the vowels are presented. Specifically, discrimination is better when the direction of change is from a relatively less to a relatively more peripheral vowel within the standard phonetic vowel space (defined by the first and second formant frequencies) compared to the reverse direction (i.e., more to less peripheral; for reviews and meta-analytic findings, see Polka & Bohn, 2003, 2011; Tsuji & Cristia, 2017). To explain these findings, Polka and Bohn (2011) recently outlined the Natural Referent Vowel framework (NRV), in which they propose that these asymmetries reflect a universal perceptual bias favoring “focal” vowels, whose adjacent formants converge in frequency, concentrating acoustic energy into a narrow spectral region (described further below; Polka & Bohn, 2003, 2011; see also Schwartz, Abry, Boë, Ménard, & Vallée, 2005). However, recent studies reveal that this same asymmetric pattern occurs in discrimination tasks where vowel pairs are not heard, but are only perceived visually (Masapollo, Polka, Molnar, & Ménard, 2017; Masapollo, Polka, & Ménard, 2017). These findings indicate that these directional effects are based on detecting and processing phonetic information that is specified across perceptual modalities, not just via the acoustic speech signal. Here, we report the results of five experiments designed to explore the conditions under which asymmetries obtain in the perception of the visible articulatory movements that normally accompany audiovisual vowel production. The results have important implications for the NRV framework, which seeks to explicate the perceptual processes underlying these asymmetries, as well as the type of information that those processes are sensitive to.
According to NRV, asymmetries in vowel perception reveal a bias that is phonetically grounded, i.e., it reflects our sensitivity, as humans, to the way that articulatory movements shape the physical speech signal. Specifically, it is argued that speech perceivers are biased toward vowels produced with extreme articulatory maneuvers, which give rise to salient acoustic signals with well-defined spectral prominences due to formant frequency convergence, or “focalization.” When spectrally adjacent formants move close together in frequency there is a mutual reinforcement of their acoustic energy, such that the amplitude of each formant increases; this results in a concentration (a.k.a. a focalization) of energy into a narrower spectral region (see Stevens, 1989; Kent & Read, 2002, for a discussion). Focalization and the associated acoustic-phonetic salience increase in a graded fashion for vowels closer to the corners of the vowel space. Accordingly, focalization maxima are observed for vowels found at the corners of the vowel space, which require the most extreme vocal-tract maneuvers (relative to a neutral or resting position of the vocal tract). For example, F2, F3, and F4 converge maximally for /i/ (the highest and most fronted close vowel), and F1 and F2 converge in a mid-frequency range for /a/ (the lowest and most back open vowel) and in a low-frequency range for /u/ (the highest and most back closed vowel). According to NRV, the salience of more focal vowels makes them easier to detect and encode. As phonetic units, focal vowels are favored in perception and in the composition of vocalic inventories; virtually all human languages include the focal vowels /i/, /a/, and /u/. The central premise of the NRV framework, then, is that the perceptual bias favoring focal vowels is universal and that this bias, interacting with language experience, plays an important role in the development of functional vowel perception skills.
Recent work confirms the NRV claim that discrimination asymmetries in auditory speech perception are related to formant convergence differences and reveal a universal perceptual bias that is independent of language experience (Masapollo et al., 2017a, 2017b). The vowel /u/ was chosen for use in these studies because previous cross-linguistic vowel production studies have consistently demonstrated that French speakers produce more extreme /u/ gestures, with a greater degree of lip rounding (lip compression and protrusion) and tongue backness, compared to English speakers (Escudero & Polka, 2003; MacLeod et al., 2009; Noiray, Cathiard, Ménard, & Abry, 2011). Consequently, French /u/ has been phonetically described as more focal than English /u/ (i.e., the frequency distance between F1 and F2 is smaller in French /u/ than in English /u/).
In Masapollo et al. (2017a), we constructed an array of /u/ vowels that systematically differed in their degree of formant convergence and category-goodness ratings, and established that French adults select a more focal /u/ as a prototypic French /u/, whereas English adults select a less focal /u/ as a prototypic English /u/. Next, we showed that when discriminating these /u/ variants in an AX task, both English and French adults displayed the same directional asymmetry, consistent with a focalization bias: discrimination was better when they heard a less-focal/English-prototypic /u/ vowel followed by the more-focal/French-prototypic /u/ vowel than in the reverse direction. Despite the differences in category-goodness ratings, we found no evidence that the focalization bias was modulated by language experience. These findings confirm that adults display a universal vowel perception bias that can be linked to formant convergence/focalization (see also Schwartz & Escudier, 1989). They also raise questions regarding the interpretation of discrimination asymmetries reported in earlier research that were attributed to the native language magnet effect documented by Kuhl and colleagues (Kuhl, 1991; Kuhl et al., 1992; see Masapollo et al., 2017a, for discussion).
As noted earlier, new data (Masapollo et al., 2017b) have revealed that directional asymmetries occur not only during auditory vowel discrimination, but during visual and audiovisual vowel discrimination as well. Specifically, we presented monolingual English-speaking and French-speaking adults with naturally-spoken variants of /u/ produced in English and in French by a simultaneous bilingual speaker in an AX discrimination task. Critically, the labial constriction gestures used in the production of French /u/ and English /u/ are optically, as well as acoustically, distinct. Thus, if the perceptual processes underlying asymmetries reflect a phonetic bias, then the visible facial gestures that accompany these acoustic vowel signals should also be capable of modulating directional asymmetries in perceivers. If, however, asymmetries arise from a general auditory processing bias, we might not find this pattern when subjects discriminate the stimuli in the visual-alone condition. Our findings showed that subjects who could only lip-read the vowels showed the same directional asymmetry as subjects who could only hear the same vowels: specifically, discriminating the change from a less-focal/English /u/ to a more-focal/French /u/ resulted in significantly better performance than a change in the reverse direction. Furthermore, the same asymmetry was observed when subjects were tested in a bimodal condition using AV stimuli that combined phonetically-congruent video and audio signals (e.g., English /u/ viseme dubbed onto audio English /u/). However, when subjects were tested using bimodal AV stimuli that combined phonetically-incongruent video and audio signals (e.g., French /u/ viseme dubbed onto audio English /u/), a different pattern emerged. In this condition, an asymmetry was observed that appears to reflect a focalization bias, assuming that the perceiver favors use of the visual information over the acoustic information.
It is important to note that the NRV framework is not based on or limited to auditory speech information and allows for multimodal information that is naturally accessible to also drive phonetic biases. Thus, the observation of a directional asymmetry in visual speech discrimination is entirely compatible with the NRV view that such asymmetries index a phonetic bias that reflects our sensitivity to how articulation shapes the physical speech signal. However, to date, NRV has not proposed a specific mechanism behind this phonetic perception bias; this is open to several interpretations. Within a gestural approach to speech perception (e.g., Fowler, 2004; Rosenblum, 2005, 2008; Best, Goldstein, Nam, & Tyler, 2016), information presented acoustically or optically may specify certain constriction gestures made by the vocal-tract articulators, such as the labial gestures for vowels involving lip rounding. Under this view, the visually influenced responses in Masapollo et al.'s (2017b) incongruent bimodal condition might indicate that the perceptual processes underlying the focal vowel bias operate on articulatory information available across perceptual modalities.
However, these findings can also be taken to support an “analysis by synthesis” approach to speech perception (e.g., Halle & Stevens, 1967; Liberman & Mattingly, 1985; Poeppel & Monahan, 2011). According to this view, speech perceivers use visual, as well as acoustic, information in speech to generate a forward model to synthesize and mimic the phonemes that a speaker might have been attempting to produce, and feedback from the motor system (in the form of an efference copy) influences phonetic perception. Consistent with this premise, findings from cognitive neuroscience research reveal that the perception of unimodal acoustic or visual speech, with no explicit motor task, evokes activity in neural regions known to be involved in speech production and motor control, including the left inferior frontal gyrus, ventral premotor cortex, and cerebellum (e.g., Wilson, Saygin, Sereno, & Iacoboni, 2004; Skipper, Nusbaum, & Small, 2005; Imada, Zhang, Cheour, Taulu, Ahonen, & Kuhl, 2006; Skipper, van Wassenhove, Nusbaum, & Small, 2007; Kuhl, Rameríz, Bosseler, Lotus Lin, & Imada, 2014). Interestingly, though, this network of motor areas does not appear to be activated by the presentation of non-speech stimuli that acoustically resemble speech (see, e.g., Wilson et al., 2004; Imada et al., 2006). Thus, in this account, the influence of the visual display on asymmetries in Masapollo et al.'s incongruent condition arises because the visible facial movements of the more extreme French /u/ gestures (which temporally precede the onset of the acoustic signal) lead perceivers to internally synthesize and experience a more focal vowel percept, regardless of the identity of the co-occurring acoustic signal.
An alternative possibility is that these visually influenced responses reflect a general processing bias favoring the more dynamic visible facial movements that normally accompany the production of more extreme vowel articulations. In other words, the oral-facial kinematics (i.e., the spatial direction and motion of the lips) corresponding to the more extreme French /u/ gestures may have been more visually salient (relative to the English /u/ gestures) because they specified larger, more rapid articulatory movements. Indeed, analyses of the visual vowel stimuli used in Masapollo et al. (2017b) confirm that their model speaker produced more extreme labial gestures, with a greater degree of lip rounding and faster rates of lip movement, when producing French /u/ compared to English /u/. Similarly, differences in the formant dynamics of these stimuli may contribute to the asymmetric discrimination of these vowels in the auditory-only condition.
The present study was designed to gain further insights into the perceptual processes underlying vowel perception biases by investigating the precise nature of the information that elicits directional asymmetries during visual-only vowel discrimination. To accomplish this, we used the natural visual French /u/ and English /u/ stimuli from our recent work to create non-speech analogs that allowed us to isolate and manipulate the oral-facial kinematics, spatial configuration, and orientation of the stimuli. Using these visual non-speech analogs, we can begin to identify which stimulus features are required to maintain an asymmetry in visual vowel discrimination performance. In doing so, we can address the question of whether speech-specific or general perceptual processes are responsible for directional asymmetries in visual vowel perception. A finding that asymmetries require more than simple kinematic differences in the visual displays would constrain possible interpretations of these effects. Specifically, it may lend support to the articulatory-gesture or analysis by synthesis hypotheses discussed above.
We conducted five experiments using the same task with the same stimuli as Masapollo et al. (2017b) and also with non-speech analogs of them. Experiment 1 was designed to establish that perceivers in fact selectively attend to the lip movements of the model speaker during visual vowel discrimination tasks that elicit perceptual biases. Detailed analyses of intramuscular recordings and kinematic data indicate that visible speech movements are spatially distributed across the whole face (e.g., Vatikiotis-Bateson et al., 1996a; Vatikiotis-Bateson, Munhall, Hirayama, Lee, & Terzopoulos, 1996b; Munhall & Vatikiotis-Bateson, 2004; Lucero & Munhall, 1999; Lucero, Maciel, Johns, & Munhall, 2005). Furthermore, perception studies employing eye-tracking methodology indicate that perceivers frequently gaze at facial regions that extend beyond the lips while they watch and listen to a speaker talk (e.g., Vatikiotis-Bateson, Eigsti, Yano, & Munhall, 1998; Paré, Richler, ten Hove, & Munhall, 2003; Everdell, Marsh, Yurick, Munhall, & Paré, 2007; Irwin & Brancazio, 2013). Thus, it was necessary to establish that perceivers do indeed selectively attend to the mouth when watching the model speaker in our vowel perception task before pursuing experiments in which we create non-speech analogs that are restricted to features of the model speaker's lip/mouth movements.1
Toward this end, Experiment 1 was undertaken to partially replicate Masapollo et al.'s earlier finding of an asymmetry in visual vowel discrimination (with English-speaking adults only), and to extend that finding by using eye-tracking technology to assess the gaze patterns that subjects use when processing the less-focal/English /u/ versus the more-focal/French /u/ visemes. Specifically, we examined whether perceivers would deploy selective attention to the oral region of the model speaker in this test condition, and if so, whether articulatory peripherality would modulate relative attention to the oral region. If perceivers selectively attend to the mouth, then they should look at it proportionately longer, relative to the eyes, while viewing the dynamic visemes. In addition, if the oral-facial kinematics of the French /u/ gestures are visually more salient, then attention to the mouth might be enhanced during the perception of the French /u/ visemes compared to the English /u/ visemes. Such a bias in gaze behavior might contribute to asymmetries in visual vowel perception.
Experiment 2 addressed the question of whether asymmetries in visual vowel discrimination are due to the more extreme mouth-shapes characteristic of the French /u/ visemes, independent of a kinematic form. To do so, we tested whether perceivers show directional asymmetries in visual vowel discrimination, consistent with the NRV framework, while discriminating single-frame static face stimuli extracted from Masapollo et al.'s original dynamically-articulating facial displays.
Experiments 3-5 examined whether differences in the labial kinematics between the English /u/ and French /u/ visemes are sufficient to elicit an asymmetry in visual vowel perception. To do so, we tested whether perceivers show asymmetries while discriminating schematic visual analogs that isolate the kinematics of the model speaker's lips, but that are not identified as visible speech movements. Specifically, we created dynamic “point-light mouth” stimuli and more abstractly-shaped Lissajou displays whose motions preserve the labial kinematic parameters of Masapollo et al.'s model speaker. Previous studies have observed speech effects using point-light speaker stimuli that simulate realistic facial movements that accompany speech production (see, e.g., Rosenblum & Saldaña, 1996, 1998; Rosenblum, Johnson, & Saldaña, 1996).
Critically, if asymmetries in visual vowel perception are driven solely by differences in oral-facial motion, independent of spatial configuration information specifying a talking mouth, then one can make two predictions. First, asymmetries should emerge during discrimination of the dynamic facial displays (Experiment 1), but not the static facial displays (Experiment 2). Second, asymmetries should also emerge during discrimination of the artificial point-light stimuli (Experiments 3 and 4) and Lissajou stimuli (Experiment 5) that have kinematic properties similar to those of the visual vowel stimuli, regardless of whether perceivers interpret them as actual visual speech. If, however, asymmetries are elicited only by optical stimuli that resemble a mouth-like shape depicting speech-like movements, then we would expect asymmetric performance with point-light stimuli, but not with the Lissajou stimuli, which retain the kinematic properties of the talkers' vowel movements without the spatial configuration of a mouth.
2. Experiment 1
2.1 Materials and Methods
All experiments complied with the principles of research involving human subjects as stipulated by Brown University and McGill University.
2.1.1 Subjects
Sixteen students2 from Brown University served as participants (mean age = 22.3 years [SD = 6.3]; 9 males). All were monolingual American English speakers who reported normal hearing and normal (or corrected-to-normal) vision. The experiment took approximately one hour, and subjects received course credit for their participation.
2.1.2 Stimuli
The visual stimulus materials for Experiment 1 were identical to those in Masapollo et al. (2017b, Experiment 2). A simultaneous bilingual speaker of Canadian English and Canadian French was video recorded while producing the nonsense syllable /gu/ containing either English /u/ or French /u/.3 The video was recorded using a digital camcorder (Panasonic AG-DVX100B). The video stream was digitized at the standard NTSC frame rate (29.97 images per second) and the audio signal was digitized at a sampling frequency of 44,100 Hz. Five different video tokens of /gu/ in each language were selected as stimuli based on their visual similarity in head position and facial expression. To create the audio-only (AO) stimuli, the video track was removed from the audio-visual (AV) videos using Adobe Premiere (San Jose, CA).
In an acoustic analysis, the duration, fundamental frequency (F0), and first (F1) and second (F2) formant frequencies of the vocalic portions of the selected tokens were measured using Praat (Boersma & Weenink, 2013). As shown in Figure 1, the results confirmed that the French /u/'s were more acoustically peripheral and more focal (i.e., F1 and F2 were closer in frequency) than the English /u/'s. Additional analyses of the initial stop portion of the recorded syllables revealed differences across the French and English /g/'s along multiple acoustic-phonetic dimensions (e.g., stop closure duration, voice onset time, and amplitude of pre-voicing). To control for this aspect of the stimuli in their auditory-only and bimodal auditory-visual discrimination tests, Masapollo et al. cross-spliced the stop portion from a prototypic English /gu/ with the vocalic portion from each of the selected acoustic tokens. This way, each acoustic token of /gu/ had the same acoustic specification of the stop consonant. Pilot data reported in Masapollo et al. (2017b) showed that monolingual adults (French and English) identified all of the stimuli as intelligible instances of /u/; English adults rated the less focal English /u/ tokens as better examples of their native /u/ category, and French adults rated the more focal French /u/ tokens as better examples of their native /u/ category.
Figure 1.

Plot of F1/F2 frequencies for the unimodal acoustic /u/ stimuli used in Masapollo et al. (2017b). These stimuli were produced by a female bilingual speaker; the English /u/'s are indicated by the open circles and the French /u/'s are indicated by the filled circles. As the plot shows, the French /u/'s are more acoustically peripheral and more focal (between F1 and F2). The arrow points in the direction that was found to be easier to discriminate by both English and French adults in the present study (see text for explanation). Example video frames (from the midpoint of the vocalic trajectory) of the corresponding visemes are also shown.
To create the visual-only stimuli, the video track was digitally excised from the audio-visual video recordings of the model speaker's productions. The visual stimuli were 1,440 × 1,056 pixels in size, resulting in a width of 38.1 cm and a height of 27.9 cm on screen. Figures 1 and 2 show example video frames of the visual tokens during the English /u/ and French /u/ gestures, taken from the midpoint of the vocalic trajectory (see also Figure 3 [top row]).
Figure 2.

(A) Frames of the visible vocal-tract configuration of the model speaker during the production of less-focal/English /u/ (top) and more-focal/French /u/ gestures (bottom) at 20%, 50%, and 80% of the vocalic trajectory. Note that the speaker produces a more extreme labial constriction (i.e., lip compression and protrusion) while articulating the French /u/ gesture.
Figure 3.

Sample images of the speech and non-speech stimuli used in Experiments 1-5 (see text for explanation). Panels A, B and C show, respectively, the video frame that occurred at 20%, 50%, and 80% of the duration of the corresponding stimulus. Note that, in Experiment 2, the stimuli only consisted of (single-frame) static images of the model speaker's face taken at 50% of the vocalic trajectory (as shown in Panel B for visual speech).
A computer-vision analysis of the frames showed that the model speaker consistently produced a greater degree of lip rounding (i.e., lip compression and protrusion) while articulating the French /u/ gestures compared to the English /u/ gestures, which, in turn, resulted in an overall smaller inter-labial area. The speaker also constricted her lips at a faster rate when articulating the French /u/'s compared to the English /u/'s.
To examine further how these differences in the visual dynamics of the model speaker's face related to the formant convergence patterns in the resulting acoustic signals, Masapollo et al. then measured the first two formants (F1 and F2) of the acoustic tokens throughout the vocalic trajectories. Articulatory and acoustic models of speech production indicate that lip compression and protrusion will lower F1 and F2 (see, e.g., Stevens, 1989). As well, moving the tongue dorsum back will also contribute to lowering F2. Indeed, the results of the acoustic analysis showed that F1 and F2 converged more rapidly and to a greater extent (i.e., were closer in frequency) during the production of the French /u/'s compared to the production of the English /u/'s.
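For readers who wish to quantify this kind of convergence pattern themselves, the sketch below shows one way to express the F2-F1 distance across a vocalic trajectory, assuming formant tracks have already been extracted (e.g., with Praat, as in the analysis above). The bark conversion (Traunmüller's approximation) and the formant values in the usage example are our own illustrative additions, not measurements from the original stimuli.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Traunmüller's approximation of the bark scale."""
    f = np.asarray(f_hz, dtype=float)
    return 26.81 * f / (1960.0 + f) - 0.53

def f2_f1_convergence(f1_track_hz, f2_track_hz):
    """Return the F2-F1 distance (in bark) at each analysis frame.

    Smaller values indicate greater formant convergence, i.e., a more
    'focal' /u/ in the sense used by the NRV framework.
    """
    return hz_to_bark(f2_track_hz) - hz_to_bark(f1_track_hz)

# Illustrative (not measured) formant values near the vocalic midpoint:
english_u = f2_f1_convergence([310, 305, 300], [1150, 1100, 1080])
french_u = f2_f1_convergence([290, 285, 280], [750, 720, 700])
print(english_u.mean(), french_u.mean())  # the more focal /u/ yields the smaller distance
```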
2.1.3 Procedure and Design
The experimental protocol matched the procedures used by Masapollo et al. (2017b, Experiment 2), except that eye movements were also measured. Subjects completed a same-different (AX) discrimination task. On each trial, subjects watched a sequence of two unimodal visual /u/ tokens, and then judged whether they were the “same” or “different” with an eye movement to a visual target (see below). For each same trial, different tokens of the same vowel type were paired (i.e., two different English /u/ tokens or two different French /u/ tokens were paired). For each different trial, tokens from the two different vowel types were paired (i.e., an English /u/ token was paired with a French /u/ token). Thus, subjects had to indicate whether pairs of physically different stimuli were members of the same vowel set or members of the two different vowel sets.
During the AX task, we recorded subjects' gaze fixations to the model speaker's face with a remote binocular eye-tracking system (SensoMotoric Instruments [SMI] Red 500; Boston, MA) paired with iViewX software. The visual stimuli appeared on a 22-in. flat screen monitor about 45 cm in front of the subject, on a height-adjustable mount. The stimuli were presented using Experiment Center software (SMI). Subjects were seated in a chair while an experimenter positioned the infrared light beam and adjusted it for optimal tracking. When necessary, the height of the monitor was adjusted so that the participant's eye level was in the middle of the stimulus display screen. All subjects were tested in a sound-treated laboratory room that was dimly lit. Before the experiment began, the experimenter performed a standardized tracking calibration procedure, which was repeated if necessary until criterion was reached. This calibration process was repeated after each break (see below).
Each trial consisted of the following sequence of events. First, a fixation cross appeared in the center of the screen. Subjects' gaze had to dwell on the fixation cross for 1000 ms to trigger the presentation of a vowel pair. Immediately after fixation-point offset, subjects watched a sequence of two visual /u/ tokens, separated by an ISI of 1500 ms.4 During the ISI, a blank black screen was presented. Immediately after the second visual token was presented, two text stimuli, labeled “Same” and “Different,” appeared simultaneously in the center of the left and right halves of the screen, respectively. Subjects were instructed to indicate their perceptual judgment by looking at the appropriate response target word. Gaze responses recorded on each trial were used to determine the subject's perceptual judgment by examining which text stimulus (“Same” or “Different”) they fixated proportionately longer (see below).
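The gaze-contingent trigger described above amounts to requiring an uninterrupted 1000-ms dwell inside the fixation-cross region. A minimal sketch of that logic is given below; `get_gaze_sample` is a hypothetical stand-in for whatever call the eye tracker's SDK actually provides, and the polling interval is assumed to match the 60-Hz analysis rate reported later.

```python
import time

DWELL_MS = 1000            # required dwell time on the fixation cross
SAMPLE_MS = 1000.0 / 60.0  # assumed polling interval (60 Hz)

def in_roi(x, y, roi):
    """roi = (left, top, right, bottom) in screen pixels."""
    left, top, right, bottom = roi
    return left <= x <= right and top <= y <= bottom

def wait_for_fixation(get_gaze_sample, cross_roi):
    """Block until gaze has dwelt continuously on the fixation cross.

    `get_gaze_sample` is a hypothetical callable returning the current
    (x, y) gaze position in screen pixels.
    """
    dwell = 0.0
    while dwell < DWELL_MS:
        x, y = get_gaze_sample()
        # Accumulate dwell time inside the ROI; reset if gaze leaves it.
        dwell = dwell + SAMPLE_MS if in_roi(x, y, cross_roi) else 0.0
        time.sleep(SAMPLE_MS / 1000.0)
```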
Subjects saw every possible type of pairing of the 10 stimuli, 4 times, in both presentation orders. The test trials were organized into four blocks. Each block had 90 trials, which consisted of each possible pairing (i.e., 50 different-type trials and 40 same-type trials). This resulted in a total of 360 test trials (200 different-type, 160 same-type). Of the 200 different-type trials, half presented a less-focal/English /u/ first followed by a more-focal/French /u/, while the remaining half followed the reverse order. Note that in this task, there were no trials with a stimulus being paired with itself. Because these within-category pairs did not consist of visually identical pairings, subjects had to generalize across small optical differences to perceptually group the stimuli. Several practice trials were included at the start of the experiment to confirm that subjects understood the instructions and were able to perform the task. Subjects took a short break after completing each block. No feedback was provided.
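The trial counts above follow directly from the pairing scheme; the snippet below, using hypothetical token labels, enumerates the ordered pairings of the ten stimuli and reproduces the per-block breakdown (40 same-type, 50 different-type, 90 total) and the 360-trial total across four blocks.

```python
from itertools import permutations

# Hypothetical token labels: five English /u/ and five French /u/ videos.
tokens = [f"E{i}" for i in range(1, 6)] + [f"F{i}" for i in range(1, 6)]

pairs = list(permutations(tokens, 2))                 # ordered pairs, no token paired with itself
same = [(a, b) for a, b in pairs if a[0] == b[0]]     # two tokens of the same vowel type
diff = [(a, b) for a, b in pairs if a[0] != b[0]]     # tokens from different vowel types

print(len(pairs), len(same), len(diff))               # 90, 40, 50 trials per block
print(len(pairs) * 4)                                 # 360 trials across the four blocks
```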
It should be noted that during pilot testing with their visual-only stimuli, Masapollo et al. observed that some subjects produced vowel production-like articulatory movements while performing the discrimination task. Other researchers have suggested that the sensorimotor feedback that perceivers receive from their own self-generated articulatory movements can influence their speech perceptual performance (see, e.g., Ito, Tiede, & Ostry, 2009; Yeung & Werker, 2013; Bruderer, Danielson, Kandhadai, & Werker, 2015). Thus, to control for potential sensorimotor influences, we instructed our subjects to inhibit any urge to mirror the oral-facial movements of the model speaker.5
2.1.4 Fixation analyses
Monocular (left eye) gaze was recorded at 120 Hz and down-sampled to 60 Hz. Proportions were computed for 16-ms temporal bins, beginning at the onset of each trial. Raw gaze data were preliminarily analyzed using in-house software to ascertain gaze fixation time to distinct regions of interest (ROIs), defined in pixel coordinates, on each trial.
During presentation of the visual /u/ stimuli, we assessed subjects' fixations on the face of the model speaker with respect to two ROIs: the ocular region and the oral region (which extended from the bottom of the nose to the jaw, as shown in Figure 5). We calculated the proportion of total time spent fixating the eye and mouth ROIs, respectively, by dividing the total amount of time fixating each ROI by the total amount of time fixating any location on the screen.
Figure 5.

(A) Box plot of mean proportion of fixation duration (relative to the entire visual display) to the ocular region and the oral region of the model speaker's face for the English /u/ (left) and French /u/ gestures (right). (B) The ocular and oral regions of interest are superimposed on the image of the speaker's face for illustration (these did not appear during the presentation of the experimental stimuli).
To index subjects' responses on each trial, we also measured fixation on two ROIs defined by the text stimuli (i.e., “Same” and “Different”). Subjects' eye movements to the text stimuli were monitored during the interval starting from the onset of the presentation of the text stimuli and lasting 1000 ms. We calculated the proportion of total time spent fixating the “Same” and “Different” ROIs, respectively, by dividing the total amount of time fixating each ROI by the total amount of time fixating any location on the screen. Responses were classified based on whichever text stimulus subjects fixated proportionately longer. Subjects' response choices were clearly indicated: they looked almost exclusively at the text stimuli during the response interval, and there were no trials in which the two text stimuli were fixated in roughly equal proportions.
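The proportion-of-fixation measure used for both the face ROIs and the response ROIs can be computed as follows. This is a minimal sketch assuming gaze samples are already expressed in screen pixel coordinates and equally spaced in time; the ROI boundaries and variable names are illustrative, not those used by the SMI software.

```python
import numpy as np

def fixation_proportions(gaze_xy, on_screen, rois):
    """Proportion of on-screen gaze time spent in each rectangular ROI.

    gaze_xy   : (n_samples, 2) array of gaze positions in screen pixels
    on_screen : boolean array marking samples that landed on the display
    rois      : dict mapping ROI name -> (left, top, right, bottom)

    Because samples are equally spaced in time, time proportions reduce
    to sample-count proportions.
    """
    x, y = gaze_xy[:, 0], gaze_xy[:, 1]
    total = on_screen.sum()
    props = {}
    for name, (left, top, right, bottom) in rois.items():
        in_roi = on_screen & (x >= left) & (x <= right) & (y >= top) & (y <= bottom)
        props[name] = in_roi.sum() / total if total else np.nan
    return props

# Illustrative ROI boundaries (pixels); the actual regions were drawn
# around the speaker's eyes and mouth as shown in Figure 5.
rois = {"eyes": (420, 180, 1020, 380), "mouth": (520, 620, 920, 900)}
```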
2.2 Results
Our analysis focused on comparing subjects' discrimination for each direction of vowel change, i.e., from a less-focal/English /u/ to a more-focal/French /u/ compared to the reverse direction. Subjects achieved an overall score of 68% correct. However, discrimination performance varied significantly depending on the order of the vowel change, indicating a focalization bias. Specifically, percent-correct scores were significantly higher when discriminating a change from a less-focal/English /u/ to a more-focal/French /u/ (M = 72.8; SD = 12.3) than in the reverse direction, from a more focal to a less focal /u/ (M = 63.7; SD = 11.4) [t(15) = 3.865, p = 0.002, r2 = 0.70].
To ensure that these differences in discrimination performance did not reflect an inherent bias to respond “same” or “different,” we employed a signal detection analysis (Grier, 1971; MacMillan & Creelman, 2005). Each subject's performance on the different pairs was converted to an A′ score, which is an unbiased index of perceptual sensitivity ranging from .50 (chance) to 1.0 (perfect discrimination). The following formula (from Grier, 1971) was used: A′ = .5 + [(H − FA)(1 + H − FA)]/[4H(1 − FA)], where H = proportion of hits and FA = proportion of false alarms. The false alarm rate was the combined error rate observed on same trials involving each vowel within the stimulus pair. For each subject, an A′ score was computed for each direction of vowel change. Figure 4 (far left) displays the mean A′ scores for each direction of vowel change, illustrating that we successfully replicated earlier findings: subjects performed better at discriminating changes from dynamic visemes corresponding to relatively less to relatively more focal /u/'s (M = 0.80; SD = 0.09), compared to the reverse direction (M = 0.76; SD = 0.10) [t(15)= 3.602, p = .003, r2 = 0.67].
Figure 4.

Box plot of mean A′ scores for Experiments 1-5. The means are plotted for each order of vowel change (less to more focal [i.e., English /u/ to French /u/] vs. more to less focal [i.e., French /u/ to English /u/]). Error bars represent standard errors; single asterisks = p < .05, double asterisks = p < .01.
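For concreteness, Grier's (1971) A′ formula quoted above can be implemented as shown below; the hit and false-alarm rates in the usage lines are illustrative values, not the rates actually observed in Experiment 1.

```python
def a_prime(hits, false_alarms):
    """Grier's (1971) non-parametric sensitivity index.

    `hits` is the proportion of 'different' responses on different trials
    for a given order of vowel change; `false_alarms` is the combined
    error rate on the corresponding same trials. This is the standard
    form of the formula for hits >= false alarms.
    """
    h, fa = hits, false_alarms
    return 0.5 + ((h - fa) * (1 + h - fa)) / (4 * h * (1 - fa))

# Illustrative (not actual) rates for the two orders of vowel change:
print(a_prime(hits=0.73, false_alarms=0.15))  # less-to-more focal
print(a_prime(hits=0.64, false_alarms=0.15))  # more-to-less focal
```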
In addition, we quantified gaze fixation locations by determining the proportion of total time in which subjects fixated the eye and mouth regions of the model speaker's face. Figure 5 summarizes the results of this analysis. We found greater overall fixation to the mouth region (M = 0.68; SD = 0.21) compared to the eye region of the speaker's face (M = .03; SD = .03) [t(15) = 12.509, p < .001, r2 = 0.95]. As well, gaze fixations to the mouth were significantly higher when subjects viewed the more-focal/French /u/'s (M = 0.72; SD = .20) compared to the less-focal/English /u/'s (M = .64; SD = .23) [t(15) = -4.622, p < .001, r2 = 0.74]. Thus, subjects looked longer at the speaker's mouth when they were presented the more focal French /u/ articulations compared to the less focal English /u/ stimuli.
2.3 Discussion
Experiment 1 successfully replicated the finding in Masapollo et al. (2017b, Experiment 2), using the same vowel stimuli. Subjects showed directional asymmetries, such that they performed better at discriminating a change in a dynamic viseme from one associated with a relatively less-focal/English /u/ to one associated with a relatively more-focal/French /u/ than in the reverse direction, confirming that the perceptual processes underlying the focal vowel bias operate on visual, as well as acoustic, speech information. Experiment 1 also confirmed that perceivers selectively attend to the talker's mouth during this task: the fixation analyses examining the distribution of gaze positions showed more fixation to the mouth than to the eyes, the two dominant gaze foci during visual speech processing. The observation that subjects deployed selective attention to the mouth (relative to the eyes) of the model speaker is not surprising given the nature of the vowel stimuli. Because the speech tokens themselves were brief, subjects did not have much time to scan the face of the speaker. Rather, they had to quickly home in on the oral-facial movements that served to distinguish the two types of vocalic gestures. Apart from the transient nature of the stimuli, foveal fixation on the mouth may also have been necessary for subjects to successfully detect the differences in the degree of lip constriction between the English /u/ and French /u/ visemes. In other words, the resolution of the speaker's mouth provided by peripheral vision may not have been sufficient to discriminate the two vowel types.
Gaze analyses also revealed that subjects adjusted their gaze patterns depending on the type of vowel they were viewing: the proportion of mouth fixations was higher when subjects were viewing the more-focal/French /u/ gestures, compared to the less-focal/English /u/ gestures. Thus, visual focalization cues modulated subjects' relative attention to the mouth of the speaker. This bias in subjects' pattern of gaze behavior may contribute to the directional asymmetries in visual vowel perception reported by Masapollo et al. (2017b).
What, then, underlies this pattern of eye movements? One account is that it reflects a general visual processing bias favoring more dynamic optical motion patterns, which is being applied in the present context where the stimuli are being interpreted as speech. Indeed, video analyses confirmed that the French /u/ gestures featured larger, more rapid lip movements than the English /u/ gestures (Masapollo et al., 2017b). Alternatively, this perceptual pattern might derive from the more extreme mouth positions of the French /u/ gestures, independent of a kinematic form. The next experiment was conducted to test these two competing hypotheses.
3. Experiment 2
Experiment 2 tested whether the extreme mouth-shapes of Masapollo et al.'s (2017b, Experiment 2) dynamic facial stimuli played a role in driving the directional asymmetries observed in Experiment 1. If this is the case, then perceivers should display directional asymmetries while attempting to discriminate pairs of static faces consisting of single-frame images of Masapollo et al.'s bilingual speaker producing the English /u/'s and French /u/'s. If, on the other hand, perceivers do not show asymmetries while discriminating such pictorial facial stimuli, this would suggest that the processes underlying the asymmetries found in Experiment 1 are sensitive to some type of time-varying kinematic information.
3.1 Materials and Methods
3.1.1 Subjects
Sixteen students from McGill University served as participants (mean age = 20.8 years [SD = 3.6]; 2 males). All were monolingual Canadian English speakers who reported normal hearing and normal (or corrected-to-normal) vision. The experiment took approximately one hour, and subjects were paid for their participation.
3.1.2 Stimuli
The stimuli consisted of static single-frame images of Masapollo et al.'s (2017b) model bilingual speaker producing the English /u/'s and French /u/'s. The stimuli were created by taking a screen shot of the visual vowel tokens at vowel midpoint. The images were presented for the same duration as the corresponding video tokens in Experiment 1. Thus, any differences in task performance could not be attributed to shorter stimulus presentation.
3.1.3 Procedure and Design
The experimental protocol for Experiment 2 matched the procedures used in Experiment 1, except that subjects were instructed to discriminate static images, as opposed to dynamically articulating visual displays, of the model speaker producing the two different types of /u/ vowels. The subjects were explicitly told that the mouth-shapes observed in the static images corresponded to speech movements, and that they should interpret them as speech. Importantly, however, subjects were not specifically told that the speaker was producing different variants of the vowel /u/.
3.2 Results
The critical question in Experiment 2 was whether subjects would show directional asymmetries in visual vowel discrimination, consistent with NRV, while discriminating static, as opposed to dynamically-articulating, faces recorded during vowel production. Overall, subjects were highly accurate at discriminating the differences between pairs of static vowels. Across both stimulus presentation orders, subjects achieved an overall percent-correct score of 84.6%. However, discrimination performance did not vary significantly depending on the order of the vowel change (English-to-French: M = 83.1, SD = 8.1 vs. French-to-English: M = 86.1, SD = 11.6). A paired samples t-test revealed no significant difference between the two stimulus orders, t(15) = -1.629, p = 0.124, r2 = 0.374.
In a second analysis, we examined subjects' mean A′ scores, which are displayed in Figure 4. As shown, for the less to more focal /u/ changes, subjects achieved a mean A′ score of 0.83 (SD = 0.05), whereas, for more to less focal /u/ changes, subjects achieved a mean A′ score of 0.85 (SD = 0.04). A paired samples t-test was used to compare the mean A′ scores for each stimulus presentation order. The analysis revealed that this difference was non-significant, t(15) = -1.326, p = 0.205, r2 = 0.101. This finding suggests that the directional asymmetry documented in Experiment 1 does not reflect a bias favoring more extreme oral postures, independent of a kinematic form.
An alternative interpretation of this finding is that discrimination of the static faces was more difficult because the static stimuli are impoverished compared to the dynamic faces. If this were the case, however, then one might expect overall task performance to be poorer during discrimination of the static visual stimuli, compared to the dynamic visual stimuli. To address this concern, we compared the mean A′ scores across Experiments 1 and 2. An independent-samples t-test revealed that overall task performance was higher for the static face stimuli (M = 0.84; SD = 0.04) than for the dynamic stimuli (M = 0.78; SD = 0.09; t(30) = -2.463, p = 0.020, r2 = 0.43), indicating that this explanation cannot account for our results.
3.3 Discussion
The results of Experiment 2 demonstrate that asymmetries do not emerge when visemes are presented without any dynamic facial cues, and that overall perceptual sensitivity is enhanced during discrimination of static visemes compared to dynamic visemes (Experiment 1). This enhanced sensitivity may reflect two alternative, although not mutually exclusive, possibilities. One possibility is that direct visual comparison of the visemes was made easier by the removal of the oral-facial kinematic cues. More specifically, since subjects did not have to track visual articulatory movements in the static condition, they had more time to focus on the relevant differences in the “target” visible vocal-tract postures between the two types of /u/ gestures. The other possibility is that the Canadian-English subjects in Experiment 2 displayed enhanced discrimination performance because they had some passive exposure to spoken French, whereas the American-English subjects in Experiment 1 did not.
The lack of a directional effect in Experiment 2 suggests that asymmetries do not derive from the more extreme mouth positions of the French gestures. There are several possible interpretations of this difference in response patterns between dynamic and static visual vowels. First, as outlined above, the asymmetries found by Masapollo et al. may in fact be driven by a general processing bias favoring more dynamic kinematic patterns. On this view, asymmetries result from differences in the temporal aspects of oral-facial motion that normally accompany vowel production.
However, this view may need to be qualified. An alternative interpretation is that the static vowels produced different results because they recruited different perceptual processes from those used to process the spatiotemporal cues of the dynamic vowels, rather than because of the lack of oral-facial movement itself. Consistent with this view, Munhall, Servos, Santi, and Goodale (2002) reported a case study of visual agnosia, in which a patient showed normal performance on a phonetic identification task with dynamic speech stimuli, but impaired performance with static speech stimuli. However, these results are inconsistent with other behavioral findings showing that perceivers can lip-read from photographs of faces (Rosenblum, 2005), as well as neuro-imaging findings showing that the processing of dynamic and static visual speech is carried out by similar neural substrates (Calvert & Campbell, 2003).
If indeed the asymmetries found by Masapollo et al. are driven by oral-facial kinematic differences per se, then they should also emerge during the discrimination of optic stimuli that retain isolated kinematic information about the lips of the model speaker in the absence of any other facial information. To address this issue, Experiments 3-5 further explore asymmetries in visual vowel discrimination using schematic visual analogs that simulate the kinematic features of the English /u/'s and French /u/'s, but which are not identified as real visual speech.
4. Experiment 3
Experiment 3 examined whether asymmetries would emerge during the discrimination of “point-light mouth” displays that retained only the isolated kinematics (spatial direction and motion) of the lips of Masapollo et al.'s model speaker. As mentioned earlier, previous studies demonstrate that untrained subjects can lip-read phonemes, syllables and words from point-light faces (see, e.g., Jordan & Beven, 1997; Rosenblum & Saldaña, 1998; Rosenblum, 2005, 2008). In addition, there is evidence that point-light speech influences the perception of acoustic speech (as in the “McGurk effect”). The point-light stimuli for the present experiment were carefully constructed to provide sparse information about the displacement, velocity, and acceleration of individual points on the speaker's lips.
On the basis of the findings from Experiments 1 and 2, we reasoned that if asymmetries in visual vowel perception are triggered solely by oral-facial kinematic differences between the English /u/ and French /u/ visemes, then perceivers should show asymmetric perceptual responses comparable to those observed in Experiment 1 while discriminating these schematic analogs, regardless of whether they are aware that the stimuli are based on the configuration and motion of a mouth. Indeed, pilot testing revealed that these dynamic displays (described below) were not identified as a mouth by naïve perceivers when they were presented without information on how they were generated. Alternatively, if asymmetries derive from speech-specific information, then they may only emerge when perceivers are informed about the point-light technique.
4.1 Materials and Methods
4.1.1 Subjects
Thirty-two students from Brown University served as participants (mean age = 19.6 years [SD = 3.0]; 12 males). All were monolingual American English speakers who reported normal hearing and normal (or corrected-to-normal) vision. The experiment took approximately one hour, and subjects received course credit for their participation.
4.1.2 Stimuli
The point-light stimuli consisted of four small dots that moved in synchrony with the horizontal and vertical lip openings of Masapollo et al.'s model speaker. The location of the four dots corresponded to the speaker's maximal horizontal lip (spread) and vertical lip (opening) apertures. These locations were near (but did not necessarily correspond to) the center of the upper and lower lips and the corners of the mouth. The locations were chosen so as to depict isolated information about the lip movements of the speaker. Sample images of the dot configurations are shown in Figure 3; the dots appeared black against a white background. There were no lines connecting the individual dots. Consequently, the only information linking the individual points was their common motion, which corresponded to the movements of the speaker's lips.
To create the stimuli, we extracted measurements (from the 2D video tokens used in Experiment 1) corresponding to the speaker's inter-labial area for each frame of each video token using an in-house MATLAB script (MathWorks). For each video frame, the pixels were first automatically labeled as being inside or outside of the inter-labial area based on color segmentation of the image. This labeling was reviewed and manually corrected as needed by the first author. The maximal 2D distance between the upper and lower lips and between the two corners of the mouth was then computed for each frame of each token. These measurements were used to generate point-light visual speech motion. An animated schema of the mouth was generated by displaying the positions of the four dots marking the extreme lip apertures in MATLAB using a marker visualization tool.
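The aperture measurements described above could be approximated with standard image-processing tools. The sketch below (in Python with OpenCV, rather than the authors' in-house MATLAB script) illustrates the general approach of color-segmenting the inter-labial area and taking its maximal horizontal and vertical extents per frame; the HSV thresholds are placeholders that would need tuning, and the hand-correction step of the original procedure is not reproduced.

```python
import cv2
import numpy as np

def lip_apertures(video_path, lower_hsv, upper_hsv):
    """Estimate horizontal and vertical inter-labial apertures per frame (in pixels).

    lower_hsv/upper_hsv are illustrative color-segmentation thresholds for
    the dark inter-labial area; in practice they would be tuned per video.
    """
    cap = cv2.VideoCapture(video_path)
    apertures = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, lower_hsv, upper_hsv)  # nonzero = inter-labial area
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            apertures.append((0, 0))
            continue
        width = xs.max() - xs.min()    # maximal horizontal (spread) aperture
        height = ys.max() - ys.min()   # maximal vertical (opening) aperture
        apertures.append((width, height))
    cap.release()
    return np.array(apertures)

# The four dot positions for a frame can then be placed at
# (cx - w/2, cy), (cx + w/2, cy), (cx, cy - h/2), (cx, cy + h/2)
# around a fixed center (cx, cy) and animated frame by frame.
```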
4.1.3 Procedure and Design
The general procedure was the same as for Experiment 1, except for the choice of stimuli; namely, subjects were instructed to discriminate the point-light displays instead of the original dynamically articulating faces. Prior to the start of the experiment, subjects were informed that they would be presented with pairs of videos showing patches of dots that moved in either the same way or in different ways, and that their task was to indicate (by pressing a button on a response pad) whether each pair of dot patterns moved in the same way or in different ways.
Subjects were assigned randomly to one of two conditions, sixteen in each. The subjects in condition 1 were not informed of the technique used to create the point-light displays, while the subjects in condition 2 were told that the dots corresponded to oral speech movements. The subjects in the former group were told that the purpose of the experiment was to examine the nature of visual motion perception. After the test session, the experimenter then pointed out to these subjects that the placement and motion of the dots corresponded to lip movements produced during speech. All of the subjects were surprised to learn this and reported that they did not interpret the displays as actual visual speech during the discrimination task.
4.2 Results
Mean percent-correct and A′ scores were calculated for each subject for all different stimulus pairs contrasting a point-light display simulating a less-focal/English /u/ labial gesture with a point-light display simulating a more-focal/French /u/ labial gesture; separate scores were computed for each condition (1 vs. 2) and order of stimulus presentation (less to more focal vs. more to less focal). Examining first the overall discrimination performance, the mean percent-correct score achieved by subjects (across both conditions) was 60.6%. Percent-correct scores were submitted to a two-way analysis of variance (ANOVA) with condition (1 vs. 2) as a between-subjects factor, and order of vowel change (less to more focal vs. more to less focal) as a within-subjects factor. There was no main effect of condition [F(1, 30) = 1.296, p = 0.264, η2p = 0.041], indicating that subjects' knowledge of the point-light technique did not influence overall task performance. There was, however, a significant main effect of order of vowel change [F(1, 30) = 9.699, p = 0.004, η2p = 0.244], such that subjects performed better at discriminating a change from a point-light display simulating the lip movements of a less-focal/English /u/ to one simulating the lip movements of a more-focal/French /u/ (M = 63.6, SD = 16.7), compared to the reverse (M = 57.7, SD = 15.3). There was no order by condition interaction [F(1, 30) = 0.056, p = 0.815, η2p = 0.002].
In a second analysis, A′ scores were submitted to a two-way analysis of variance (ANOVA) with condition (1 vs. 2) as a between-subjects factor, and order of vowel change (less to more focal vs. more to less focal) as a within-subjects factor. The results of this analysis are shown in Figure 4. Again, there was no main effect of condition [F(1, 30) = 0.061, p = 0.807, η2p = 0.002]. There was, however, a significant main effect of order of vowel change [F(1, 30) = 12.494, p = 0.001, η2p = 0.294], such that subjects performed better at discriminating a change from a point-light display simulating the lip movements of a less focal /u/ to one simulating the lip movements of a more focal /u/ (M = 0.67, SD = 0.07), compared to the reverse (M = 0.62, SD = 0.07). There was no order by condition interaction [F(1, 30) = 0.106, p = 0.747, η2p = 0.004].
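The 2 (condition, between-subjects) × 2 (order of vowel change, within-subjects) analysis reported here corresponds to a standard mixed ANOVA. The sketch below shows one way to run it, assuming long-format data in a pandas DataFrame and the availability of the pingouin package; the file name and column names are illustrative, not those of the authors' data set.

```python
import pandas as pd
import pingouin as pg

# Long-format data: one row per subject x order-of-change cell, with columns
# for subject id, between-subjects condition (informed vs. uninformed of the
# point-light technique), within-subjects order of vowel change, and A' score.
df = pd.read_csv("exp3_aprime_long.csv")  # hypothetical file name

aov = pg.mixed_anova(data=df, dv="a_prime", within="order",
                     subject="subject", between="condition")
print(aov.round(3))

# Follow-up paired comparison collapsing over condition, mirroring the
# reported main effect of order of vowel change.
wide = df.pivot_table(index="subject", columns="order", values="a_prime")
print(pg.ttest(wide["less_to_more"], wide["more_to_less"], paired=True))
```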
In a third analysis, task performance was directly compared across Experiments 1 and 3 to examine whether the magnitude of the asymmetry differed across the visual speech and point-light display conditions. Mean A′ scores were submitted to a two-way ANOVA with experiment (1 vs. 3 [condition 1] vs. 3 [condition 2]) as a between-subjects factor, and order of vowel change (less to more focal vs. more to less focal) as a within-subjects factor. There were significant main effects of experiment [F(2, 45) = 14.722, p < .001, η2p = 0.396] and order of vowel change [F(1, 45) = 22.038, p < .001, η2p = 0.329]. Critically, however, there was no interaction [F(1, 45) = 0.088, p = .916, η2p = 0.004]. Post-hoc LSD t-tests indicated that the subjects in Experiment 1 displayed significantly greater overall task performance than the subjects in both conditions of Experiment 3 (condition 1: p < .001; condition 2: p < .001), presumably because the visual vowel stimuli containing the entire face of the model speaker were richer and more natural than the minimalist mouth-like configuration of the point-light displays.
4.3 Discussion
The results of Experiment 3 demonstrate that both point-light conditions successfully elicited a directional asymmetry consistent with the predictions of NRV. This result is even more compelling when it is noted that the magnitude of the directional effect was statistically comparable to that observed with the dynamically-articulating face stimuli (Experiment 1). These results suggest that perception of the optic stimuli as a talking mouth is not needed to induce an asymmetry, thus lending support to the view that the asymmetries found by Masapollo et al. (2017b) may reflect a general visual processing bias favoring more dynamic kinematic patterns.
However, the findings of Experiment 3 are not necessarily inconsistent with an account that invokes perceivers' sensitivity to more extreme articulatory movements to explain the observed discrimination asymmetry. Visible movements on the face that naturally accompany audiovisual vowel production are highly familiar, ecologically-relevant stimuli that are known to be salient to (normally-sighted) perceivers. Previous work demonstrates that adults are sensitive to visual-phonetic information from lip movements, even in the absence of any other facial information (e.g., Jordan & Beven, 1997). Moreover, recent developmental research reveals that even newborns can detect cross-modal correspondences between point-light speech and acoustic speech (Guellai et al., 2016). It is possible, then, that human cognition has evolved to rapidly fine-tune our perception to detect the movements and features of faces that give rise to conspecific communicative signals. In this way, the focalization bias could still occur even if perceivers did not consciously interpret the point-light stimuli as speech. If this is the case, then perhaps disrupting the canonical orientation and/or global configuration of the point-light stimuli is necessary to disrupt the bias that drives directional asymmetries. Experiments 4 and 5 were conducted to test this hypothesis.
5. Experiment 4
Experiment 4 examined whether the orientation, as well as the kinematic form, of the point-light speech (utilized in Experiment 3) is critical for eliciting a directional asymmetry. To do so, we tested whether asymmetries would be observed with point-light stimuli that were spatially rotated 45° (clockwise); this rotation disturbs the spatial orientation in which we typically encounter human faces. If the asymmetries observed in Experiment 3 were triggered solely by the kinematic differences between the English /u/ and French /u/ visemes, independent of orientation, then perceivers should still show an asymmetry with rotated point-light displays. Alternatively, if the perceptual processes underlying asymmetries are sensitive to both the typical orientation and kinematics of a talking mouth, then the asymmetry should be weaker or absent when the stimuli do not correspond to the usual orientation of a mouth.
5.1 Materials and Methods
5.1.1 Subjects
Sixteen students from Brown University served as participants (mean age = 20.2 years [SD = 3.09]; 7 males). All were monolingual American English speakers who reported normal hearing and normal (or corrected-to-normal) vision. The experiment took approximately one hour, and subjects received course credit for their participation.
5.1.2 Stimuli
The visual stimuli for Experiment 4 were the same point-light displays used in Experiment 3, except that they were spatially rotated 45° (clockwise), so that they resembled a rhombus-like shape. All other aspects of the stimuli, including the size and position of the dots, remained the same. Several example frames of the rotated stimuli are shown in Figure 3.
5.1.3 Procedure and Design
The procedure for Experiment 4 was the same as for Experiment 3a.
5.2 Results
Paralleling the previous analyses, mean percent-correct and A′ scores were calculated for each subject for all stimulus pairs contrasting a point-light display simulating an English /u/ labial gesture with one simulating a French /u/ labial gesture; separate scores were computed for each order of vowel change (less to more focal vs. more to less focal). Overall, subjects achieved a mean percent-correct score of 67.9%. Although we observed a trend suggesting an order effect, paired samples t-tests revealed no significant differences between subjects' percent-correct scores (MEnglish to French = 69.0, SD = 9.5 vs. MFrench to English = 66.8, SD = 16.3; t(15) = 0.615, p = .548, r2 = .141) or mean A′ scores across the two stimulus presentation orders (MEnglish to French = 0.68, SD = 0.06 vs. MFrench to English = 0.66, SD = 0.11; t(15) = 0.775, p = .450, r2 = .194). Subjects' mean A′ scores for each stimulus presentation order are shown in Figure 4.
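For reference, A′ is the nonparametric sensitivity index of Grier (1971). The sketch below shows one conventional way to compute it in R from a hit rate and a false-alarm rate; the function name and the example rates are purely illustrative and are not values from the present data.

    # Conventional formula for the nonparametric sensitivity index A'
    # (Grier, 1971), computed from a hit rate (h) and false-alarm rate (f).
    # Function name and example values are illustrative only.
    a_prime <- function(h, f) {
      if (h >= f) {
        0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
      } else {
        0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))
      }
    }
    a_prime(h = 0.80, f = 0.30)  # example: returns approximately 0.83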
To assess the impact of stimulus rotation more directly, we compared performance with the rotated stimuli (Experiment 4) and the non-rotated stimuli (Experiment 3, condition 1). Mean A′ scores were submitted to a two-way ANOVA with experiment as a between-subjects factor and order of vowel change as a within-subjects factor. There was no significant main effect of experiment [F(1, 30) = 1.709, p = .201, η2p = 0.054], indicating that overall task performance was not affected by stimulus orientation. However, when the experiments were combined, the main effect of order of vowel change was only marginally significant [F(1, 30) = 4.183, p = .050, η2p = 0.122], and there was no experiment by order interaction [F(1, 30) = 0.494, p = 0.613, η2p = 0.021]. Collectively, the cross-experiment analyses, along with the clear difference in effect size across experiments, suggest that the rotated point-light displays may not be capable of eliciting a perceptual asymmetry comparable to that elicited by the non-rotated stimuli.
5.3 Discussion
The goal of Experiment 4 was to test whether asymmetries would be observed when the point-light stimuli (utilized in Experiment 3) were spatially rotated in a way that disturbed the canonical orientation of the model speaker's labial gestures. While previous studies have found that manipulating the spatial orientation of a talking face does not completely disrupt the identification of visual speech (Jordan & Bevan, 1997) or the McGurk effect (Rosenblum, Yakel, & Green, 2000), it is not entirely clear from the present results whether the orientation of the point-light speech is critical for eliciting a directional asymmetry. Whether the rotated point-light displays elicit an asymmetry remains an open question, as shown by the contrast between the within- and across-experiment analyses. However, the clear difference in effect size across experiments indicates that if the rotated displays are capable of eliciting such an asymmetry, it is weaker than the bias elicited by the non-rotated stimuli. As a next step, Experiment 5 was designed to provide a more rigorous test of whether dynamic facial-kinematic information drives asymmetries, independent of configural information about the model speaker's mouth.
6. Experiment 5
Experiment 5 investigated whether asymmetries would emerge during the discrimination of non-speech analogs that retained the isolated kinematics of the lips of the model speaker, but without the configuration information that specifies mouth shape. We used Lissajou patterns, which resemble a figure-eight oriented horizontally, to create dynamic displays that increased and decreased in size to track the displacement, velocity, and acceleration of specific points on the speaker's lips. Critically, this display does not trace the outline or shape of a human mouth. Similar stimuli have been utilized in other studies comparing the perception of visual speech versus non-speech (Irwin & Brancazio, 2014). We predicted that if the asymmetries observed with the point-light displays used in Experiments 3 and 4 are driven by kinematic cues, independently of configural information about mouth shape, then perceivers should show a comparable asymmetry while discriminating the Lissajou stimuli. Alternatively, if the asymmetries observed with the point-light stimuli are driven, at least in part, by information about mouth shape, then perceivers are expected to fail to show an asymmetry (or at least show a weaker directional effect) while discriminating the Lissajou stimuli.
6.1 Materials and Methods
6.1.1 Subjects
Sixteen students from Brown University served as participants (mean age = 19.1 years [SD = 0.89]; 6 males). All were monolingual American English speakers who reported normal (or corrected-to-normal) vision. The experiment took approximately one hour, and subjects received course credit for their participation.
6.1.2 Stimuli
The stimuli consisted of Lissajou curves that dynamically opened and closed in synchrony with the maximal horizontal and vertical lip apertures of Masapollo et al.'s model speaker. Thus, they retained the labial kinematic cues of the point-light stimuli, but did not exhibit a configuration that resembled a mouth. These animated schemas were created using the same procedures used to create the point-light displays (see Experiment 3, Methods). Sample images of the Lissajou stimuli are shown in Figure 2.
6.1.3 Procedure and Design
The experimental protocol for Experiment 5 was the same as for Experiment 3a, except that subjects were instructed to discriminate pairs of the Lissajou curves instead of the point-light displays. As in Experiment 3a, subjects were not made aware of the technique used to create the visual displays. Subjects were told that they would be presented with pairs of “figure-eight shapes” that increased and decreased in size, and that their task was to indicate (by pressing a button on a response pad) whether a given pair of shapes moved in the same way or in different ways.
6.2 Results
As shown in Figure 4, the results of Experiment 5 differ from those of Experiments 1, 3, and 4. With the Lissajou displays, discrimination performance was numerically higher for stimuli presented in the more to less focal direction, which is the opposite of the direction predicted on the basis of focalization. As in all of the previous experiments, we compared subjects' mean percent-correct and A′ scores for each order of vowel change. Here, subjects achieved an overall percent-correct score of 60%. Percent-correct scores were very similar across both orders: a paired samples t-test revealed no significant difference between the two stimulus presentation orders (MEnglish to French = 58.8, SD = 12.8 vs. MFrench to English = 60.7, SD = 13.7; t(15) = -0.630, p = .538, r2 = .158). Figure 4 shows that the mean A′ scores were also quite similar across both orders; a paired samples t-test revealed no significant difference between the two orders (MEnglish to French = 0.61, SD = 0.08 vs. MFrench to English = 0.62, SD = 0.09; t(15) = -0.517, p = .613, r2 = .130).
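As an illustration of this comparison, the following minimal R sketch runs a paired-samples t-test on per-subject A′ scores for the two orders of vowel change; the vectors shown are randomly generated placeholders, not the study data.

    # Paired-samples t-test comparing per-subject A' scores across the two
    # orders of vowel change. The vectors are random placeholders (n = 16),
    # not the data reported above.
    set.seed(2)
    aprime_less_to_more <- runif(16, 0.50, 0.75)
    aprime_more_to_less <- runif(16, 0.50, 0.75)
    t.test(aprime_less_to_more, aprime_more_to_less, paired = TRUE)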
To assess the effect of configuration on asymmetries more directly, we compared performance with the point-light displays (Experiment 3, condition 1) and the Lissajou displays (Experiment 5). Mean A′ scores were submitted to a two-way ANOVA with experiment (3 [condition 1] vs. 5) as a between-subjects factor, and order of vowel change (less to more focal vs. more to less focal) as a within-subjects factor. There was no significant main effect of experiment [F(1, 30) = 2.367, p = .134, η2p = 0.073], indicating that subjects' overall task performance did not differ between the point-light speech and the Lissajou curves. There was also no significant main effect of order of vowel change [F(1, 30) = 1.573, p = .219, η2p = 0.050]. There was, however, a marginally significant interaction [F(1, 30) = 3.965, p = 0.056, η2p = 0.117]. Taken together, these findings suggest that the directional asymmetries found with the point-light stimuli are influenced by the global shape of the visual stimuli, as well as by their kinematics.
6.3 Discussion
The findings of Experiment 5 provide critical data demonstrating that directional asymmetries in visual vowel perception are not driven solely by oral-facial kinematics, independent of other visual attributes. Manipulating optical properties other than the kinematics of the stimuli disrupted the directional effect. Specifically, Experiment 5, which used a more abstractly shaped non-speech visual analog, failed to replicate the results obtained with the point-light displays in Experiment 3, even though the kinematic properties were the same across these conditions. Taken together, these results indicate that, in order to elicit a directional asymmetry in visual vowel discrimination, perceivers must be presented with an optical stimulus that contains information about both the movements and the spatial configuration of a talking mouth. This finding provides strong evidence against a general processing account of asymmetries in which perceivers are simply biased toward more dynamic visual motion patterns.
7. General Discussion
The goal of the present experiments was to determine whether asymmetries in visual vowel perception reflect a speech-specific bias favoring vowels produced with more extreme articulatory maneuvers or a general processing bias favoring more dynamic visual stimuli. The results of Experiment 2 demonstrated that the asymmetries observed with dynamic speech in Experiment 1 are not observed with static visual speech, indicating that these effects do not simply reflect a bias favoring more extreme mouth positions. The results of Experiments 3 and 4 further demonstrated that asymmetries are preserved with point-light mouth stimuli that retain both the canonical orientation and the kinematic form of an articulating mouth, suggesting that speech-specific facial movement patterns may be critical for eliciting a directional asymmetry. Further, contrary to the general processing account of asymmetries outlined in the Introduction, the results of Experiments 4 and 5 (which used spatially rotated point-light displays and dynamic displays of Lissajou curves, respectively) suggest that perceivers are not simply biased toward more dynamic patterns of motion, independent of other visual attributes. Specifically, changes to the spatial orientation or global configuration of the point-light animations disrupted asymmetries, even when the kinematic properties of the stimuli were controlled. Overall, the picture that emerges is that asymmetries are elicited by optical stimuli that depict both lip-motion and configural information, which is more consistent with a gestural or analysis-by-synthesis account of this perceptual asymmetry.
That said, the results from Experiment 3 may seem troubling for a speech-specific account in that, even when perceivers did not interpret the (canonically-oriented) point-light animations as visible speech movements, they still showed directional asymmetries comparable to those observed with speech. Previous studies comparing the perception of speech with non-speech analogs (i.e., sine wave or point-light speech) have yielded mixed results. In some cases, similar response patterns are found for speech and non-speech (e.g., Mann, 1986; Rosenblum, Johnson, & Saldaña, 1996; Rosenblum & Saldaña, 1996, 1998; Holt, 2005; Viswanathan, Magnuson, & Fowler, 2014), whereas in other cases, different response patterns are found (e.g., Miyawaki, Strange, Verbrugge, Liberman, & Jenkins, 1975). There are also some reports in which the same physical stimuli are perceived differently depending on whether they are interpreted as speech or non-speech (e.g., Best, Morrongiello, & Robson, 1981; Brancazio, Best, & Fowler, 2006). Explanations for these different patterns of results cover a range of theoretical perspectives (see Fowler, 1990, for discussion). In the present case, it could be that because perceivers have extensive experience viewing talking faces, they have ample opportunity to develop both abstract and highly-detailed memory representations of the spatiotemporal properties of facial movements during face-to-face speech communication. It is possible, then, that the point-light stimuli unconsciously and automatically have the same causal effects on perception as more ecologically-valid visual speech stimuli.
An important limitation of the present study, however, is that there is a potential confound in comparing visual speech (Experiment 1) with the non-speech point-light (Experiments 3-4) and Lissajou displays (Experiment 5). Aside from not having been caused by a distal vocal-tract event, the non-speech signals also have an ambiguous source. Most studies comparing speech and non-speech suffer from this limitation (but see Fowler & Rosenblum, 1990; Brancazio, Best, & Fowler, 2006). Because of this confound, the present findings do not provide definitive evidence that the perceptual processes underlying the focal vowel bias operate exclusively on speech information, and not on generic spatiotemporal stimulus properties that also specify other familiar natural events. Accordingly, from an ecological psychological perspective (see, e.g., Fowler, 1990), future research that attempts to determine whether the focal vowel bias is in fact gestural in nature needs to examine whether asymmetries are also observed in the perception of other familiar (non-speech) “real-event-in-the-world” actions that exhibit less versus more extreme motions (e.g., stepping motions of legs and feet).
Future studies designed to examine whether young infants also show asymmetries in visual and audiovisual vowel perception will shed light on the nature and development of the focal vowel bias. Over the years, considerable developmental research has shown that speech perception in early infancy is not unimodal, and that babies possess some capacity to integrate information about speech cross-modally (see, e.g., Kuhl & Meltzoff, 1982; Rosenblum, Schmuckler, & Johnson, 1997; Yeung & Werker, 2013). Indeed, Guellai et al. (2016) recently reported that, within hours of being born, infants can already detect some cross-modal correspondences between acoustic and visual speech, even when the facial dynamics are specified by point-light stimuli.
There is good reason to think that the focal vowel bias could be induced by exposure to infant-directed speech very early in development (see Polka & Bohn, 2011, for discussion). Acoustic analyses of infant-directed speech in many languages have consistently demonstrated that caregivers tend to produce more exaggerated (i.e., hyper-articulated) peripheral vowels when talking to their infants than when talking to other adults (e.g., Kuhl, Andruski, Chistovich, Chistovich, Kozhevnikova, Ryskina, et al., 1997). Recent articulatory data also suggest that the vocalic gestures produced in infant-directed speech exhibit more exaggerated facial movements than those produced in adult-directed speech (Green, Nip, Wilson, Mefferd, & Yunusova, 2010; but see Kalashnikova, Carignan, & Burnham, 2017).
Further investigations of sensorimotor integration processes in vowel perception may also be informative for phonetic theories. As mentioned above, during pilot testing with their unimodal visual stimuli, Masapollo et al. (2017b) observed that some subjects spontaneously produced covert vowel production-like oral-facial movements while performing the discrimination task; thus, subjects' articulators were active during vowel perception. As in Masapollo et al., we minimized the involvement of the speech motor system by instructing our subjects to refrain from producing articulatory movements during the present visual perception experiments. However, the propensity to produce speech movements in this context deserves further exploration. Functional neuro-imaging data reveal that visual speech information activates cortical motor and somatosensory areas involved in speech production to a greater degree than acoustic speech information (e.g., Skipper et al., 2005, 2007). A complementary result was obtained by Sundara et al. (2001), who demonstrated using transcranial magnetic stimulation that visual speech, but not acoustic speech, elicits motor-evoked potentials in the muscles recruited to produce speech. Venezia, Fillmore, Matchin, Isenberg, Hickok, and Fridriksson (2016) have also recently identified a visuomotor pathway for speech motor control. These results are directly relevant to the observation of visual vowel imitation: activation of the speech articulators during the present visual vowel discrimination task might reflect the generation of an internal motor model, and this “synthesis” could be used to perceive speech (see Skipper et al., 2007, for discussion). Thus, it will be informative to directly examine the effects of articulator activation versus suppression on directional asymmetries in unimodal visual vowel perception.
There is one final point to make concerning the implications of the present study. While the findings reported here shed some light on the nature of the information driving asymmetries in visual vowel perception, we are still left with the question of why tracking this information gives rise to perceptual asymmetries. As reviewed in Polka and Bohn (2003), directional asymmetries are widespread in human cognition and perception, having been well-documented in a number of stimulus domains, including consonants, lexical tones, musical patterns, colors, geometric figures, and numbers. For example, in the perception of Roman capitalized letters, perceivers are more likely to misidentify “E” as “F” or “R” as “P” than the reverse (e.g., Gilmore, Hersch, Caramazza, & Griffin, 1979). Despite widespread interest in these phenomena, the perceptual processes behind them are not yet fully understood. However, perceptual salience likely plays a central role in explaining directional asymmetries.
The visual asymmetries documented in the present study may ultimately be explained as the application of a general cognitive bias to distinct domain-specific information, i.e., information specifying speech movements. The finding that certain kinematic and configural information is crucial for eliciting asymmetries in the domain of vowel perception indicates that this bias reflects a sensitivity to the information generated by distal speech movements. The eye-tracking findings from Experiment 1, showing increased attention to more dynamic facial movements, also align with this view. Importantly, this is the first study to show that there are visual correlates to acoustic focalization. It is not yet known whether other focalized vowel phonemes are also visually more salient, and if so, what the appropriate metrics would be for quantifying differences in optical salience between visemes. Future experiments utilizing eye-tracking can provide further insights into the attentional processes and stimulus properties that shape the focalization bias.
In sum, the present findings provide critical information on the nature of the focal vowel bias, in keeping with the fundamental auditory-visual nature of everyday speech (Fowler, 2004; Rosenblum, 2005, 2008). A complete understanding of the processes involved in this perceptual bias is not only important to vowel perception theories, but provides a critical basis for the study of phonetic development as well as the perceptual factors that shape and constrain vowel inventories across human languages.
Supplementary Material
Public Significance Statement.
The present research explores the precise nature of the visual speech information that adults perceive during face-to-face conversational interactions by examining how adults discriminate unimodal visual vowels and schematic visual analogs of them that track lip movements during vowel production but are not perceived as speech. We examined a well-established perceptual bias favoring vowels produced with extreme vocal tract maneuvers, which can be measured as a directional asymmetry in auditory or visual vowel discrimination performance. Results show that adults display this bias when discriminating visual-only vowels and attend closely to the mouth during the discrimination task. Adults also display the bias when discriminating visual stimuli that capture both the shape and movement of a talking mouth producing vowels but not when either the shape or the movement is disturbed. Collectively, the findings demonstrate that adults are sensitive to the observable shape and movement patterns that occur when a person talks.
Acknowledgments
The research reported here was supported in part by NSERC Grant 105397 to Linda Polka, NSERC Grant 312395 to Lucie Ménard, and NIH Grant 5-27025 to James Morgan. We are grateful to Alexina Hicks, Angela Chang, Fiona Higgins (McGill University), Lori Rolfe, Ellen Macaruso, and Leah Mann (Brown University) for assistance with subject recruitment and data collection, and to Laureline Arnaud and Paméla Trudeau-Fisette (University of Quebec at Montreal) for their help with stimulus preparation. We thank Alexina Hicks, who served as the model speaker in Masapollo et al. (2017b) and the present study. Finally, this work benefited from helpful discussions with, or comments from, Sheila Blumstein, Frank Guenther, Carol Fowler, Julia Irwin, Douglas Whalen, and three anonymous reviewers.
Footnotes
Note that eye-tracking was not used in any of the subsequent experiments.
An a priori power analysis for a paired t-test was conducted in R (R Development Core Team, 2008) to determine a sufficient sample size using an alpha of 0.05, a power of 0.80, a medium effect size (r2 = 0.58), and a two-tailed test. This effect size was observed in our earlier research (Masapollo et al., 2017b, Experiment 2), which is nearly identical in design to the present experiments. Based on these assumptions, the desired sample size is 16. An additional post-hoc power analysis supported this result; in Experiment 1 of the present study (n = 16), the power was 0.96 at a significance level of 0.05 with a large effect size (r2 = 0.67).
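The exact R function used for this calculation is not specified here; as a rough illustration, the sketch below uses base R's power.t.test for a paired, two-tailed design at alpha = .05 and power = .80, with a hypothetical Cohen's d supplied in place of the reported r2 value (converting that effect size to d depends on assumptions not stated in this footnote).

    # Illustrative a priori power calculation for a paired, two-tailed t-test
    # (alpha = .05, power = .80) in base R. The effect size is a hypothetical
    # Cohen's d passed via 'delta' with sd = 1; it is a placeholder, not the
    # value used in the original analysis.
    power.t.test(delta = 0.75, sd = 1, sig.level = 0.05, power = 0.80, type = "paired", alternative = "two.sided")
    # The returned 'n' is the required number of subject pairs.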
Masapollo et al. (2017b) recorded the model speaker producing stop-initial CV syllables instead of isolated vowels to facilitate cross-stimulus splicing for other bimodal (auditory-visual) vowel perception experiments.
In a previous study using unimodal acoustic vowel stimuli, Polka and Bohn (2011) found a directional asymmetry with a relatively long ISI (i.e., 1500 ms), but not a relatively short ISI (i.e., 500 ms), suggesting that, along with factors related to stimulus salience (e.g., formant proximity), auditory working memory and attention also contribute to the focalization bias (i.e., as ISIs increase, demands on attention and auditory working memory also increase [see, e.g., Werker & Logan, 1985; Cowan & Morse, 1986; Strange, 2011]). An ISI of 1500 ms was used in Masapollo et al.'s (2017b) visual vowel experiments to be consistent with prior work in the auditory realm. It is possible that the influence of ISI on asymmetries may differ across sensory modalities; however, this issue was not the focus of the current research.
We also video recorded subjects to ensure that they did not produce covert articulatory movements during the discrimination task. No trials or participants were discarded in any of the present experiments due to covert vowel production-like articulatory movements.
References
- Best CT, Morrongiello B, Robson R. Perceptual equivalence of acoustic cues in speech and nonspeech signals. American Journal of Psychology. 1981;74:17–26. doi: 10.3758/bf03207286.
- Best CT, Goldstein LM, Nam H, Tyler MD. Articulating what infants attune to in native speech. Ecological Psychology. 2016;28(4):216–261. doi: 10.1080/10407413.2016.1230372.
- Boersma P, Weenink D. Praat: doing phonetics by computer [Computer program]. Version 6.0.19. 2016. http://www.praat.org/
- Brancazio L, Best CT, Fowler CA. Visual influences on perception of speech and non-speech vocal-tract events. Language and Speech. 2006;49(1):21–53. doi: 10.1177/00238309060490010301.
- Bruderer AG, Danielson DK, Kandhadai P, Werker JF. Sensorimotor influences on speech perception in infancy. Proceedings of the National Academy of Sciences. 2015;112(44):13531–13536. doi: 10.1073/pnas.1508631112.
- Calvert GA, Campbell R. Reading speech from still and moving faces: The neural substrates of visible speech. Journal of Cognitive Neuroscience. 2003;15(1):57–70. doi: 10.1162/089892903321107828.
- Cowan N, Morse P. The use of auditory and phonetic memory in vowel discrimination. Journal of the Acoustical Society of America. 1986;79:500–507. doi: 10.1121/1.393537.
- Escudero P, Polka L. A cross-language study of vowel categorization and vowel acoustics. In: Sole MJ, Recansens D, Romero J, editors. Proceedings of the International Congress of Phonetic Sciences; Barcelona. 2003. pp. 861–864.
- Everdell IT, Marsh H, Yurick MD, Munhall KG, Paré M. Gaze behavior in audiovisual speech perception: Asymmetrical distribution of face-directed fixations. Perception. 2007;36:1535–1545. doi: 10.1068/p5852.
- Fowler CA. Sound-producing sources as objects of perception: Rate normalization and nonspeech perception. Journal of the Acoustical Society of America. 1990;88(3):1236–1249. doi: 10.1121/1.399701.
- Fowler CA, Rosenblum LD. Duplex perception: A comparison of monosyllables and slamming doors. Journal of Experimental Psychology: Human Perception and Performance. 1990;16(4):742–754. doi: 10.1037//0096-1523.16.4.742.
- Fowler CA. Speech as a supramodal or amodal phenomenon. In: Calvert GA, Spence C, Stein BE, editors. The Handbook of Multisensory Processes. Cambridge, MA: MIT Press; 2004. pp. 189–201.
- Gilmore GC, Hersch H, Caramazza A, Griffin J. Multidimensional letter similarity derived from recognition errors. Perception & Psychophysics. 1979;25(5):425–431. doi: 10.3758/bf03199852.
- Green J, Nip I, Wilson E, Mefferd A, Yunusova Y. Lip movement exaggerations in infant-directed speech. Journal of Speech, Language, and Hearing Research. 2010;53(6):1529–1542. doi: 10.1044/1092-4388(2010/09-0005).
- Grier JB. Nonparametric indexes for sensitivity and bias: Computing formulas. Psychological Bulletin. 1971;75:424–429. doi: 10.1037/h0031246.
- Guellai B, Streri A, Chopin A, Rider D, Kitamura C. Newborns' sensitivity to the visual aspects of infant-directed speech: Evidence from point-light displays of talking faces. Journal of Experimental Psychology: Human Perception and Performance. 2016;42(9):1275–1281. doi: 10.1037/xhp0000208.
- Holt L. Temporally non-adjacent non-linguistic sounds affect speech categorization. Psychological Science. 2005;16:305–316. doi: 10.1111/j.0956-7976.2005.01532.x.
- Imada T, Zhang Y, Cheour M, Taulu S, Ahonen A, Kuhl PK. Infant speech perception activates Broca's area: A developmental magnetoencephalography study. Neuroreport. 2006;17(10):957–962. doi: 10.1097/01.wnr.0000223387.51704.89.
- Irwin JR, Brancazio L. Seeing to hear? Patterns of gaze to speaking faces in children with autism spectrum disorders. Frontiers in Psychology. 2014;5:1–10. doi: 10.3389/fpsyg.2014.00397.
- Iskarous K. Vowel constrictions are recoverable from formants. Journal of Phonetics. 2010;38:375–387. doi: 10.1016/j.wocn.2010.03.002.
- Ito T, Tiede M, Ostry DJ. Somatosensory function in speech perception. Proceedings of the National Academy of Sciences. 2009;106:1245–1248. doi: 10.1073/pnas.0810063106.
- Jordan TR, Bevan K. Seeing and hearing rotated faces: Influences of facial orientation on visual and audiovisual speech recognition. Journal of Experimental Psychology: Human Perception and Performance. 1997;23:388–403. doi: 10.1037//0096-1523.23.2.388.
- Kalashnikova M, Carignan C, Burnham D. The origins of babytalk: Smiling, teaching, or social convergence? Royal Society Open Science. 2017;4:170306. doi: 10.1098/rsos.170306.
- Kent RD, Read C. The Acoustic Analysis of Speech. Singular/Thomson Learning; 2002.
- Kuhl PK, Meltzoff AN. The bimodal perception of speech in infancy. Science. 1982;218:1138–1141. doi: 10.1126/science.7146899.
- Kuhl PK. Human adults and human infants show a “Perceptual Magnet Effect” for the prototypes of speech categories: Monkeys do not. Perception & Psychophysics. 1991;50:93–107. doi: 10.3758/bf03212211.
- Kuhl PK, Williams KA, Lacerda F, Stevens KN, Lindblom B. Linguistic experience alters phonetic perception in infants by 6 months of age. Science. 1992;255:606–608. doi: 10.1126/science.1736364.
- Kuhl PK, Andruski JE, Chistovich IA, Chistovich LA, Kozhevnikova EV, Ryskina VL, et al. Cross-language analysis of phonetic units in language addressed to infants. Science. 1997;277:684–686. doi: 10.1126/science.277.5326.684.
- Kuhl PK, Ramírez R, Bosseler A, Lin JF, Imada T. Infants' brain responses to speech suggest analysis by synthesis. Proceedings of the National Academy of Sciences. 2014;111(31):11238–11245. doi: 10.1073/pnas.1410963111.
- Liberman AM, Mattingly IG. The motor theory of speech perception revised. Cognition. 1985;21:1–36. doi: 10.1016/0010-0277(85)90021-6.
- Lucero JC, Munhall KG. A model of facial biomechanics for speech production. Journal of the Acoustical Society of America. 1999;106(5):2834–2842. doi: 10.1121/1.428108.
- Lucero J, Maciel S, Johns D, Munhall KG. Empirical modeling of human face kinematics during speech using motion clustering. Journal of the Acoustical Society of America. 2005;118:405–409. doi: 10.1121/1.1928807.
- MacLeod A, Stoel-Gammon C, Wassink AB. Production of high vowels in Canadian English and Canadian French: A comparison of early bilingual and monolingual speakers. Journal of Phonetics. 2009;37:374–387.
- Macmillan NA, Creelman CD. Detection theory: A user's guide. 2nd ed. New York: Cambridge University Press; 2005.
- Mann VA. Distinguishing universal and language-dependent levels of speech perception: Evidence from Japanese listeners' perception of English “l” and “r”. Cognition. 1986;24:169–196. doi: 10.1016/s0010-0277(86)80001-4.
- Miyawaki K, Strange W, Verbrugge R, Liberman A, Jenkins J. An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of Japanese and English. Perception & Psychophysics. 1975;18(5):331–340.
- Masapollo M, Polka L, Molnar M, Ménard L. Directional asymmetries reveal a universal bias in adult vowel perception. Journal of the Acoustical Society of America. 2017a;141(4):2857–2869. doi: 10.1121/1.4981006.
- Masapollo M, Polka L, Ménard L. A universal bias in adult vowel perception – by ear or by eye. Cognition. 2017b;166:358–370. doi: 10.1016/j.cognition.2017.06.001.
- McGurk H, MacDonald J. Hearing lips and seeing voices. Nature. 1976;264:746–748. doi: 10.1038/264746a0.
- Munhall KG, Servos P, Santi A, Goodale MA. Dynamic visual speech perception in a patient with visual form agnosia. NeuroReport. 2002;13(4):1793–1796. doi: 10.1097/00001756-200210070-00020.
- Munhall KG, Vatikiotis-Bateson E. Spatial and temporal constraints on audiovisual speech perception. In: Calvert G, Spence C, Stein B, editors. Handbook of Multisensory Processing. Cambridge, MA: MIT Press; 2004.
- Noiray A, Cathiard MA, Ménard L, Abry C. Test of the Movement Expansion Model: Anticipatory vowel lip protrusion and constriction in French and English speakers. Journal of the Acoustical Society of America. 2011;129(1):340–349. doi: 10.1121/1.3518452.
- Paré M, Richler RC, ten Hove M, Munhall KG. Gaze behavior in audiovisual speech perception: The influence of ocular fixations on the McGurk effect. Perception & Psychophysics. 2003;65(4):553–567. doi: 10.3758/bf03194582.
- Polka L, Bohn O. Asymmetries in vowel perception. Speech Communication. 2003;41:221–231.
- Polka L, Bohn O. Natural Referent Vowel (NRV) framework: An emerging view of early phonetic development. Journal of Phonetics. 2011;39:467–478.
- Poeppel D, Monahan PJ. Feedforward and feedback in speech perception: Revisiting analysis by synthesis. Language and Cognitive Processes. 2011;26(7):935–951.
- R Development Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2008. http://www.R-project.org
- Rosenblum LD, Yakel DA, Green KP. Face and mouth inversion effects on visual and audiovisual speech perception. Journal of Experimental Psychology: Human Perception and Performance. 2000;26:806–819. doi: 10.1037//0096-1523.26.2.806.
- Rosenblum LD, Saldaña HM. An audiovisual test of kinematic primitives for visual speech perception. Journal of Experimental Psychology: Human Perception and Performance. 1996;22(2):318–331. doi: 10.1037//0096-1523.22.2.318.
- Rosenblum LD, Johnson JA, Saldaña HM. Visual kinematic information for embellishing speech in noise. Journal of Speech and Hearing Research. 1996;39(6):1159–1170. doi: 10.1044/jshr.3906.1159.
- Rosenblum LD, Schmuckler MA, Johnson JA. The McGurk effect in infants. Perception & Psychophysics. 1997;59(3):347–357. doi: 10.3758/bf03211902.
- Rosenblum LD, Saldaña HM. Time-varying information for visual speech perception. In: Campbell R, Dodd B, Burnham D, editors. Hearing by Eye: Part 2, The Psychology of Speech-reading and Audiovisual Speech. Hillsdale, NJ: Erlbaum; 1998. pp. 61–81.
- Rosenblum LD. The primacy of multimodal speech perception. In: Pisoni D, Remez R, editors. Handbook of Speech Perception. Malden, MA: Blackwell; 2005. pp. 51–78.
- Rosenblum LD. Speech perception as a multimodal phenomenon. Current Directions in Psychological Science. 2008;18:405–409. doi: 10.1111/j.1467-8721.2008.00615.x.
- Schwartz JL, Escudier P. A strong evidence for the existence of a large-scale integrated spectral representation in vowel perception. Speech Communication. 1989;8:235–259.
- Schwartz JL, Boë LJ, Vallée N, Abry C. The Dispersion-Focalization Theory of vowel systems. Journal of Phonetics. 1997;25:255–286.
- Schwartz JL, Abry C, Boë LJ, Ménard L, Vallée N. Asymmetries in vowel perception, in the context of the Dispersion-Focalization Theory. Speech Communication. 2005;45:425–434.
- Tsuji S, Cristia A. Which acoustic and phonological factors shape infants' vowel discrimination? Exploiting natural variation in InPhonDB. Proceedings of Interspeech 2017. 2017:2108–2112. doi: 10.21437/Interspeech.2017-1468.
- Skipper JI, van Wassenhove V, Nusbaum HC, Small SL. Hearing lips and seeing voices: How cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex. 2007;17(10):2387–2399. doi: 10.1093/cercor/bhl147.
- Stevens KN, Halle M. Remarks on analysis by synthesis and distinctive features. In: Wathen-Dunn W, editor. Models for the Perception of Speech and Visual Form. Cambridge, MA: MIT Press; 1967. pp. 88–102.
- Stevens KN. On the quantal nature of speech. Journal of Phonetics. 1989;17:3–46.
- Strange W. Automatic selective perception (ASP) of first and second language speech: A working model. Journal of Phonetics. 2011;39:456–466.
- Sundara M, Namasivayam AK, Chen R. Observation-execution matching system for speech: A magnetic stimulation study. Neuroreport. 2001;12(7):1341–1344. doi: 10.1097/00001756-200105250-00010.
- Vatikiotis-Bateson E, Munhall KG, Kasahara Y, Garcia F, Yehia H. Characterizing audiovisual information during speech. Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP-96). 1996a:1485–1488.
- Vatikiotis-Bateson E, Munhall KG, Hirayama M, Lee YV, Terzopoulos D. Dynamics of facial motion in speech: Kinematic and electromyographic studies of orofacial structures. In: Stork DG, Hennecke M, editors. Speechreading by Humans and Machines: Models, Systems & Applications. Kluwer Academic Publishers; 1996b.
- Vatikiotis-Bateson E, Eigsti IM, Yano S, Munhall KG. Eye movement of perceivers during audiovisual speech perception. Perception & Psychophysics. 1998;60:926–940. doi: 10.3758/bf03211929.
- Venezia JH, Fillmore P, Matchin W, Isenberg AL, Hickok G, Fridriksson J. Perception drives production across sensory modalities: A network for sensorimotor integration of visual speech. Neuroimage. 2016;126:196–207. doi: 10.1016/j.neuroimage.2015.11.038.
- Viswanathan N, Magnuson JS, Fowler CA. Information for coarticulation: Static signal properties or formant dynamics? Journal of Experimental Psychology: Human Perception and Performance. 2014;40(3):1228–1236. doi: 10.1037/a0036214.
- Werker JF, Logan JS. Cross-language evidence for three factors in speech perception. Perception & Psychophysics. 1985;37:35–44. doi: 10.3758/bf03207136.
- Wilson SM, Saygin AP, Sereno MI, Iacoboni M. Listening to speech activates motor areas involved in speech production. Nature Neuroscience. 2004;7(7):701–702. doi: 10.1038/nn1263.
- Yehia H, Rubin P, Vatikiotis-Bateson E. Quantitative association of vocal-tract and facial behavior. Speech Communication. 1998;26:23–43.
- Yeung HH, Werker JF. Lip movements affect infant audiovisual speech perception. Psychological Science. 2013;24(5):603–612. doi: 10.1177/0956797612458802.
