Abstract
While perceiving speech, people see mouth shapes that are systematically associated with sounds. In particular, a vertically stretched mouth produces a /woo/ sound, whereas a horizontally stretched mouth produces a /wee/ sound. We demonstrate that hearing these speech sounds alters how we see aspect ratio, a basic visual feature that contributes to perception of 3D space, objects and faces. Hearing a /woo/ sound increases the apparent vertical elongation of a shape, whereas hearing a /wee/ sound increases the apparent horizontal elongation. We further demonstrate that these sounds influence aspect ratio coding. Viewing and adapting to a tall (or flat) shape makes a subsequently presented symmetric shape appear flat (or tall). These aspect ratio aftereffects are enhanced when associated speech sounds are presented during the adaptation period, suggesting that the sounds influence visual population coding of aspect ratio. Taken together, these results extend previous demonstrations that visual information constrains auditory perception by showing the converse – speech sounds influence visual perception of a basic geometric feature.
Keywords: Auditory–visual, Aspect ratio, Crossmodal, Shape perception, Speech perception
1. Introduction
Mouth shapes are systematically associated with sounds due to the anatomy of vocalization (e.g., Liberman & Mattingly, 1985; Sapir, 1929; Yehia, Rubin, & Vatikiotis-Bateson, 1998). Experiencing these crossmodal associations may lead to neural connectivity or multimodal tuning for visual processing of mouth shapes and auditory processing of speech sounds (Nath & Beauchamp, 2011; Wilson, 2002). Indeed, patches of temporal cortex are activated more strongly by combinations of faces and voices than by either alone (Beauchamp, Argall, Bodurka, Duyn, & Martin, 2004), and silent lip reading activates the auditory cortex (Calvert et al., 1997). Behaviorally, presenting a talking face influences speech perception in infants (Kuhl & Meltzoff, 1982) and improves speech recognition in adults (von Kriegstein et al., 2008). In a classic study, McGurk and MacDonald (1976) demonstrated that auditory perception of a phoneme was altered by a concurrently presented face pronouncing a different phoneme.
Because audition is usually regarded as the primary modality for speech perception, speech-related auditory–visual interactions have been evaluated in terms of how looking at the mouth influences hearing of speech. Here we investigated the converse. Does hearing speech sounds alter how we see shapes? We examined the perception of aspect ratio (horizontal or vertical elongation) for two reasons. First, aspect ratio is a fundamental visual feature that is population-coded in the ventral visual pathway (see Suzuki (2005) for a review), contributing to perception of 3D space, objects and faces (e.g., Biederman, 2001; Knill, 1998a, 1998b; Young & Yamane, 1992). Second, horizontal and vertical mouth elongations are ubiquitous in speech production, with a horizontally elongated mouth typically producing a /wee/ sound and a vertically elongated mouth typically producing a /woo/ sound. We thus used flat (horizontally elongated) and tall (vertically elongated) ellipses as the visual stimuli and /wee/ and /woo/ sounds as the auditory stimuli. We used simple ellipses rather than images of mouths so that observers would be unaware of the relationship between the sounds and aspect ratios. This was important because we wanted to test the hypothesis that consistent auditory–visual coincidences during speech perception give rise to general auditory–visual associations that influence visual perception at the level of basic shape coding. If the experience of looking at mouth shapes while listening to speech establishes associations between auditory representations of phonemes and visual representations of associated shapes, hearing a /wee/ sound may make a flat ellipse appear even flatter and hearing a /woo/ sound may make a tall ellipse appear even taller (Fig. 1a).
Fig. 1.
Speech sounds increased the perceived elongation of consistent ellipses when observers were unaware of the associations between the sounds and shapes. (a) A schematic (not-to-scale) illustration of our primary finding: hearing a /wee/ sound (produced by a horizontally stretched mouth) made a flat ellipse appear flatter, whereas hearing a /woo/ sound (produced by a vertically stretched mouth) made a tall ellipse appear taller. (b) Perceived elongation of the ellipses (in the direction of the stimulus elongation, in log aspect ratio) is shown for the consistent-sound, inconsistent-sound, and environmental-sound conditions for the single-stimulus trials (open bars; a display example shown in the top right panel) and the multi-stimulus trials (dark bars; a display example shown in the bottom right panel). The error bars represent ±1 SEM adjusted for the repeated-measures design of the experiment. *p < .05.
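The error bars in this caption (and in Fig. 2) are SEMs adjusted for the repeated-measures design. One common way to compute such within-subject error bars is the Cousineau-style normalization sketched below; the data matrix and its values are hypothetical, and the authors' exact adjustment may differ.

```python
import numpy as np

# Hypothetical perceived elongations (log aspect ratio): rows = observers,
# columns = sound conditions (consistent, inconsistent, environmental).
data = np.array([
    [0.16, 0.12, 0.12],
    [0.20, 0.15, 0.16],
    [0.11, 0.09, 0.08],
    [0.18, 0.13, 0.14],
])

# Cousineau normalization: remove each observer's own mean and add back the
# grand mean, so stable between-observer offsets do not inflate the error bars.
normalized = data - data.mean(axis=1, keepdims=True) + data.mean()

within_subject_sem = normalized.std(axis=0, ddof=1) / np.sqrt(data.shape[0])
print("within-subject SEM per condition:", within_subject_sem)
```

Morey's correction would additionally multiply these values by sqrt(k / (k - 1)) for k conditions; whether such a correction was applied here is not stated.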
2. Experiment 1: Speech sounds exaggerate appearances of associated visual aspect ratios
We examined perception of a briefly flashed ellipse in three conditions. In the consistent-sound condition, an ellipse was presented with a consistent speech sound (a flat ellipse with a /wee/ sound or a tall ellipse with a /woo/ sound). In the inconsistent-sound condition, an ellipse was presented with an inconsistent speech sound (a flat ellipse with a /woo/ sound or a tall ellipse with a /wee/ sound). In the control (environmental-sound) condition, an ellipse was presented with an environmental sound with no relation to speech or mouth shape (a door shutting or ice cracking). A critical aspect of this experiment was that none of the observers reported awareness of any association between the /wee/ and /woo/ sounds and the flat and tall ellipses during the post-experiment interview. This is not surprising because the ellipses did not resemble mouths and the experimenter did not mention any crossmodal associations during the instructions. Thus, any effect of auditory speech on the visual perception of aspect ratio would likely have occurred implicitly, without an explicit strategy or response bias.
2.1. Method
2.1.1. Observers
In all experiments, undergraduate students from Northwestern University with normal or corrected-to-normal vision and normal hearing gave informed consent to participate, and they were tested individually in a dimly lit room. Seventeen observers participated in this experiment for partial course credit.
2.1.2. Stimuli
We generated ellipses (drawn with dark [54 cd/m2] 0.057°-thick lines against a white [110 cd/m2] background) with 11 different aspect ratios (the vertical extent divided by the horizontal extent) ranging from flat to tall. The aspect ratios were symmetrically distributed (in log scale) around the circle, −0.485 (1.59° × 0.52°), −0.387 (1.44° × 0.59°), −0.271 (1.25° × 0.67°), −0.201 (1.19° × 0.75°), −0.091 (1.05° × 0.85°), 0.0 (circle; 0.95° × 0.95°), 0.091 (0.85° × 1.05°), 0.201 (0.75° × 1.19°), 0.271 (0.67° × 1.25°), 0.387 (0.59° × 1.44°), and 0.485 (0.52° × 1.59°). Each ellipse was treated with a Gaussian blur of 2.0-pixel radius to reduce aliasing. The least flat and least tall ellipses (log aspect ratios = ±0.091) were used as the target stimuli because perceived aspect ratios of briefly presented ellipses tend to be exaggerated (e.g., Suzuki & Cavanagh, 1998; Sweeny, Kim, Grabowecky, & Suzuki, 2011). Circles were used as distractors on multi-stimulus trials (see below). Ten ellipses excluding the circle were presented in the response display for reporting the perceived shape of the flashed ellipse via matching. All ellipses had equivalent areas.
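Because each ellipse is specified by its log aspect ratio with area matched across shapes, the axis lengths can be derived from the reference circle. The sketch below assumes area is held exactly equal to that of the 0.95° circle; the published dimensions may reflect rounding or a slightly different matching rule.

```python
import math

def ellipse_axes(log_aspect_ratio, circle_diameter=0.95):
    """Width and height (deg) for a log10 aspect ratio (height / width),
    holding area equal to that of the reference circle."""
    ratio = 10 ** log_aspect_ratio
    width = circle_diameter / math.sqrt(ratio)
    height = circle_diameter * math.sqrt(ratio)
    return width, height

for lar in (-0.485, -0.091, 0.0, 0.091, 0.485):
    w, h = ellipse_axes(lar)
    print(f"log aspect ratio {lar:+.3f}: {w:.2f} deg wide x {h:.2f} deg tall")
```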
Four sounds were used in the experiment. We created a “flat-mouth” (/wee/) sound by recording a man’s voice as he produced a phoneme while stretching the corners of his mouth to make a flat shape. We created a “tall-mouth” (/woo/) sound by recording the same man’s voice as he produced a phoneme while bringing the corners of his mouth together to make a tall shape. Two environmental sounds (a door shutting and ice cracking, obtained from a personal sound collection) were used as non-speech control sounds. All sounds were presented via Sennheiser-HD265 headphones, matched for perceived loudness (approximately 53 dB SPL), and differed only slightly in duration (/wee/ = 708 ms, /woo/ = 840 ms, door shutting = 1098 ms, ice cracking = 1156 ms).
Ellipses were presented along an invisible circular orbit (17.9° diameter) around the central fixation cross (0.10°, 62 cd/m2). The six stimulus locations (top, upper right, lower right, bottom, lower left, and upper left) were symmetrically arranged and evenly spaced (separated by 60° in rotation angle or 4.2° in visual angle) (see Fig. 1b). On each trial, a flat or tall ellipse was presented at one of the six locations with equal probability (16 times at each location, 96 trials total). On half of the trials, the target ellipse was presented alone—single-stimulus trials (e.g., Fig. 1b, top right panel), and on the remaining trials, the ellipse was presented among five circles (e.g., Fig. 1b, bottom right panel)—multi-stimulus trials. We included multi-stimulus trials because crossmodal interactions have been shown to depend on whether a visual feature is presented alone or in a crowd (Sherman, Sweeny, Grabowecky, & Suzuki, 2012). The multi-stimulus trials also provided a control for response bias (i.e., responding based on sounds irrespective of what is seen); a bias should be more pervasive when perception of the ellipse is less certain, such as in a multi-stimulus trial where the location of the ellipse was difficult to determine in a brief display.
On each trial, a consistent sound (a /wee/ sound for a flat ellipse or a /woo/ sound for a tall ellipse), an inconsistent sound (a /wee/ sound for a tall ellipse or a /woo/ sound for a flat ellipse), or an environmental sound (a door-shutting sound or an ice-cracking sound) was presented along with a flat or tall ellipse presented alone or among circles; all combinations of the four sounds, tall and flat ellipses, and single-stimulus and multi-stimulus displays were randomly intermixed and equally frequent across trials. The experiment was controlled by an Apple MacBook running OS X, using MATLAB (version 2009b) with the Psychophysics Toolbox extensions (Brainard, 1997; Kleiner, Brainard, & Pelli, 2007; Pelli, 1997). The visual stimuli were presented on a 19″ CRT monitor at a viewing distance of 115 cm. Approximately 10 practice trials were given prior to the experiment.
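A minimal sketch of how such a balanced, randomized trial list could be constructed is shown below. The factor names, the six-repeats-per-cell count, and the way locations are cycled are illustrative assumptions; they reproduce the 96-trial total and equal cell frequencies described above, not the authors' actual randomization code.

```python
import itertools
import random

sounds = ["wee", "woo", "door", "ice"]          # 4 sounds
ellipses = ["flat", "tall"]                     # 2 target shapes
displays = ["single", "multi"]                  # alone vs. among five circles
locations = ["top", "upper right", "lower right",
             "bottom", "lower left", "upper left"]

# Fully cross the factors, repeat each of the 16 cells equally often
# (6 times here, giving 96 trials), and shuffle the order.
cells = list(itertools.product(sounds, ellipses, displays))
trials = cells * 6
random.shuffle(trials)

# Cycle the six target locations so each is used 16 times across the session.
trial_list = [
    {"sound": s, "ellipse": e, "display": d, "location": locations[i % 6]}
    for i, (s, e, d) in enumerate(trials)
]
print(len(trial_list), "trials; first trial:", trial_list[0])
```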
2.1.3. Procedure
Each trial began with the presentation of a fixation cross. The experimenter instructed observers to fixate the central cross (though strict central fixation was not crucial for this experiment). Observers were told to focus on the visual task and that the sounds were uninformative. Following 1000 ms of the fixation display, a sound (a /wee/, /woo/, or environmental sound) was initiated, which was immediately followed by a brief presentation of a flat or tall ellipse (alone or among five circles). The sound (950 ms, on average) was started slightly (33 ms) before the ellipse onset to allow processing of the sound’s acoustic property prior to presentation of an ellipse. This brief asynchrony, however, was well within the range known to create audio–visual fusion (e.g., Miller & D’Esposito, 2005). The ellipse (alone or among circles) was briefly presented (33 ms). In addition to preventing influences from saccades and high-level deliberative processes, brief presentations make visual perception sensitive to contextual effects (e.g., Suzuki & Cavanagh, 1998; Wolfe, 1984), providing a means to measure crossmodal interactions with increased sensitivity. The fixation cross remained on the screen after the offset of the ellipse for 1000 ms, and was followed by a response display containing a horizontal array of 10 sample ellipses gradually changing from flat to tall. The observer chose the ellipse that appeared most similar to the flashed ellipse by pressing the corresponding button.
2.1.4. Analysis
To reveal crossmodal effects specific to the sound-shape association beyond general effects of a sound or bias in aspect-ratio perception, we averaged the perceived ellipse elongations (in log aspect ratio, with elongations in the veridical direction as positive) across the flat and tall ellipses for each sound condition. Thus, in the consistent-sound condition we averaged the perceived elongations between the tall ellipse presented with the /woo/ sound and the flat ellipse presented with the /wee/ sound, whereas in the inconsistent-sound condition we averaged the perceived elongations between the tall ellipse presented with the /wee/ sound and the flat ellipse presented with the /woo/ sound. In the environmental-sound condition, we averaged the perceived elongations between the tall and flat ellipses presented with the same environmental sound. Because the perceived elongations did not differ between the two environmental sounds, we combined those data. Perception in this condition provided a baseline against which to evaluate the specificity of the effects with speech sounds, because the environmental sounds had no speech relevance or association with the ellipses (see Section 3.1, Stimuli, for a discussion of why this condition provides a more appropriate baseline than a no-sound condition).
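The sketch below illustrates this averaging on a hypothetical set of trial records (the data values and field names are made up): reported elongations are signed so that elongation in the target's veridical direction is positive, then averaged within each sound condition.

```python
import numpy as np

# Hypothetical trial records: (target ellipse, sound condition, reported log
# aspect ratio). Reports are positive for tall percepts, negative for flat.
trials = [
    ("tall", "consistent",    +0.20), ("flat", "consistent",    -0.18),
    ("tall", "inconsistent",  +0.12), ("flat", "inconsistent",  -0.11),
    ("tall", "environmental", +0.13), ("flat", "environmental", -0.12),
]

def signed_elongation(ellipse, reported):
    # Elongation in the stimulus's veridical direction is coded positive,
    # so reports for flat targets are sign-flipped before averaging.
    return reported if ellipse == "tall" else -reported

by_condition = {}
for ellipse, condition, reported in trials:
    by_condition.setdefault(condition, []).append(signed_elongation(ellipse, reported))

for condition, values in by_condition.items():
    print(f"{condition:13s} mean perceived elongation: {np.mean(values):+.3f}")
```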
2.2. Results
On the single-stimulus trials, perceived elongation was larger with the consistent sounds relative to both the inconsistent sounds, t(16) = 2.457, p < .03, d = 0.596, and environmental sounds, t(16) = 2.802, p < .02, d = 0.679 (Fig. 1b). Perceived elongations with the inconsistent and environmental sounds did not differ, t(16) = .281, n.s. The speech sounds thus increased the perceived elongation of the consistent ellipses (Fig. 1a) without affecting the inconsistent ellipses. No observers reported awareness of the sound-shape associations or knowledge that the shapes could have been interpreted as mouths. This suggests that the crossmodal shape exaggeration occurs implicitly and argues against a response-bias account.
Note that this cross-modal effect occurred in the context of an overall exaggeration effect. The ellipses appeared more elongated than their veridical aspect ratios (0.091 on average) independently of any effect from the sounds, especially on the single-stimulus trials (Fig. 1b). This overall exaggeration of briefly flashed ellipses was expected based on prior results showing that briefly flashed visual features tend to appear exaggerated (e.g., Suzuki and Cavanagh (1998) for aspect ratio, and Sweeny et al. (2011) for curvature; see Sweeny et al. (2011) for a discussion of possible mechanisms of this exaggeration).
Sounds produced no significant effects on the multi-stimulus trials (t’s < 1.213, n.s., with a marginal consistency-by-display-size [one vs. multiple] interaction, F[2,32] = 3.093, p = .059), when the location of the target ellipse was difficult to determine and reported elongations were small (Fig. 1b, dark bars). This result is consistent with our prior finding that the crossmodal effect of laughter that enhanced the perception of a happy expression for a single face disappeared when the happy face was presented among a crowd of neutral faces (Sherman et al., 2012). This may make sense in the context of the load theory of attention (e.g., Lavie, 2005), which states that when selective attention is not strongly engaged to a target feature, task-irrelevant features receive a relative increase in processing (Pinsk, Doniger, & Kastner, 2004) and are more likely to interfere with perception of the target feature. Thus, the brief presentation and the uncertainty of the location of the target made it unlikely that selective attention was strongly engaged to the target on multi-stimulus trials. It is possible that auditory effects on visual shape coding might depend on attention being focused on the relevant visual target. Note that the absence of the crossmodal effect in the multi-stimulus condition provides further evidence against response bias because if the sounds simply biased observers’ responses, they would have equivalently affected responses in the single-stimulus and multiple-stimulus conditions.
3. Experiment 2: Speech sounds influence the population coding of aspect ratio
We have demonstrated that speech sounds associated with tall and flat mouth shapes implicitly (i.e., with no explicit awareness of the auditory–visual associations) exaggerate visual aspect ratios of simple ellipses. A potential mechanism of this crossmodal effect is that hearing speech sounds enhances responses of visual neurons tuned to the associated aspect ratios. To psychophysically evaluate this hypothesis, we investigated the speech sounds’ influences on aspect-ratio aftereffects; when a tall (or flat) adaptor shape is followed by a symmetric test shape, the test shape appears to be elongated in the orthogonal direction. Previous research suggested that this repulsive aspect-ratio aftereffect reflects an activation (and adaptation) of aspect-ratio-tuned neurons in the ventral visual pathway (see Section 3.3). Conveniently, aspect-ratio aftereffects occur with brief adaptation, comparable to the brief speech sounds, when the aftereffects are measured with brief test stimuli (e.g., Suzuki, 2005; Suzuki & Cavanagh, 1998). If hearing a speech sound enhances the activation (and thus adaptation) of neurons tuned to the associated aspect ratio, the aspect-ratio aftereffect should be larger when a consistent speech sound is presented during adaptation than when an inconsistent speech sound or an environmental sound is presented. Note that a response bias would predict the opposite pattern. For example, if a /wee/ sound increased the responses and adaptation of flat-tuned neurons while adapting to a flat ellipse, the test stimulus would appear taller. In contrast, if the /wee/ sound simply produced a bias to respond “flat,” observers would report the test stimulus as flatter. As in Experiment 1, none of the observers reported awareness of any auditory–visual associations during the post-experiment interview.
3.1. Method
3.1.1. Observers
Eleven new observers were paid to participate after giving informed consent.
3.1.2. Stimuli
Visual and auditory stimuli were similar to those used in Experiment 1 with the following exceptions. The black ellipses (33 cd/m2) and the white background (109 cd/m2) were slightly darker. An ellipse was always presented at the center of the screen and had one of 21 aspect ratios symmetrically distributed (in log scale) around the circle, −0.419 (1.67° × 0.63°), −0.374 (1.61° × 0.68°), −0.343 (1.56° × 0.73°), −0.311 (1.51° × 0.78°), −0.285 (1.46° × 0.83°), −0.221 (1.41° × 0.89°), −0.176 (1.35° × 0.94°), −0.131 (1.30° × 0.99°), −0.087 (1.25° × 1.04°), −0.043 (1.20° × 1.09°), 0.0 (circle; 1.15° × 1.15°), 0.043 (1.09° × 1.20°), 0.087 (1.04° × 1.25°), 0.131 (0.99° × 1.30°), 0.176 (0.94° × 1.35°), 0.221 (0.89° × 1.41°), 0.285 (0.83° × 1.46°), 0.311 (0.78° × 1.51°), 0.343 (0.73° × 1.56°), 0.374 (0.68° × 1.61°), 0.419 (0.63° × 1.67°).
The flat or tall adaptor had one of three amounts of elongation, ±0.043, ±0.131, and ±0.311, and the test stimulus was always the circle, backward masked by a random-dot pattern (consisting of 50% black and 50% white pixels covering the central 10° [horizontal] by 7° [vertical] region). We used three different adaptor elongations for the following reasons. On the one hand, the aspect-ratio aftereffect (on a circle) increases with increased adaptor elongation (e.g., Suzuki, 2005), and a larger aftereffect might provide greater sensitivity for detecting crossmodal modulation. On the other hand, a crossmodal boost of the aftereffect might be more salient when the baseline magnitude of the aftereffect is smaller. The use of different adaptor aspect ratios allowed us to demonstrate effects of speech sounds on the population coding of aspect ratio regardless of these influences.
The sounds (/wee/, /woo/, and environmental sounds, at ~62 dB SPL) were presented via a pair of JBL speakers (10–25,000 Hz frequency response) placed symmetrically just in front of the visual display screen; each sound was played through both speakers and was perceived to be colocalized with the centrally presented adaptor ellipse. Each sound was approximately exponentially attenuated after the first 200 ms, becoming inaudible within approximately 500 ms of onset. In addition to trials with the three sound types, trials with no sounds were intermixed to make sure that the presence of a sound per se did not disrupt aspect-ratio aftereffects, and it did not (see footnote 1). Note that the environmental-sound condition is a more appropriate control than the no-sound condition because it is matched to the sound conditions in terms of the potential arousing, alerting, and temporal cueing effects of a sound coincident with the brief visual adaptor. We will thus compare aspect-ratio aftereffects among the consistent-sound (/wee/ with a flat adaptor and /woo/ with a tall adaptor), inconsistent-sound (/wee/ with a tall adaptor and /woo/ with a flat adaptor), and environmental-sound conditions as in Experiment 1. Each block included 24 trials (2 adaptor orientations [tall or flat], 3 amounts of elongation, and 4 sound conditions). Each observer was tested in 12 blocks. Sixteen practice trials were given prior to the experiment.
3.1.3. Procedure
Each trial began with a fixation point (0.10° diameter, 62 cd/m2) lasting 1760 ms. A visual adaptor lasting 176 ms and a sound lasting ~500 ms (consistent, inconsistent, environmental, or none) were simultaneously initiated. The adaptor was followed by a blank display lasting 470 ms (the sound was audible through the first ~324 ms of the blank display), and then by a test circle lasting 47 ms, which was immediately followed by the random-dot mask lasting 294 ms. After a 1760 ms blank interval (we chose this duration so that the test ellipse would be temporally distinct from the preceding sequence of stimuli), a method of adjustment (e.g., Sweeny et al., 2011) began; a circle appeared in the center of the screen and observers pressed the left or right arrow key to gradually change the aspect ratio of the circle to be flatter (coded as negative aspect ratios) or taller (coded as positive aspect ratios), stepping through the 21 aspect ratios indicated above. Once observers satisfactorily matched the image on the screen with their percept of the test shape, they pressed the space bar and the next trial started after 1 s.
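The stepping logic of this adjustment procedure can be summarized with a small sketch (not the authors' code); keyboard handling is abstracted as a list of key names, and the starting shape is the circle, as described above.

```python
# The 21 log aspect ratios available during adjustment (flat negative, tall positive).
STEPS = [-0.419, -0.374, -0.343, -0.311, -0.285, -0.221, -0.176, -0.131,
         -0.087, -0.043, 0.0, 0.043, 0.087, 0.131, 0.176, 0.221, 0.285,
         0.311, 0.343, 0.374, 0.419]

def adjust(key_presses):
    """Start at the circle and step flatter ('left') or taller ('right')
    until 'space' confirms; return the matched log aspect ratio."""
    index = STEPS.index(0.0)
    for key in key_presses:
        if key == "left":
            index = max(0, index - 1)
        elif key == "right":
            index = min(len(STEPS) - 1, index + 1)
        elif key == "space":
            break
    return STEPS[index]

# Example: three presses toward "taller", then confirm.
print(adjust(["right", "right", "right", "space"]))   # 0.131
```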
3.2. Results
In order to analyze the magnitude of aspect-ratio aftereffects beyond the variability due to individual differences in the baseline bias (i.e., a tendency to see briefly presented shapes as horizontally or vertically elongated), we computed an aftereffect index (in log-aspect-ratio units). Specifically, we subtracted the mean perceived aspect ratio of the test shape following adaptation to a tall ellipse from that following adaptation to the flat ellipse for each observer for each amount of adaptor elongation and for each sound condition. A larger positive value of this index indicates a larger magnitude of the aspect-ratio aftereffect.
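As a concrete illustration, the index can be computed as below from a table of mean perceived test aspect ratios; the data values here are hypothetical, and only the arithmetic mirrors the description above.

```python
# Hypothetical mean perceived test aspect ratios (log units) for one observer,
# keyed by (sound condition, adaptor elongation, adaptor direction).
perceived = {
    ("consistent",    0.311, "flat"): +0.06, ("consistent",    0.311, "tall"): -0.07,
    ("inconsistent",  0.311, "flat"): +0.04, ("inconsistent",  0.311, "tall"): -0.04,
    ("environmental", 0.311, "flat"): +0.04, ("environmental", 0.311, "tall"): -0.05,
}

def aftereffect_index(condition, elongation):
    """Flat-adaptor mean minus tall-adaptor mean; larger positive values
    indicate a stronger repulsive aspect-ratio aftereffect."""
    return (perceived[(condition, elongation, "flat")]
            - perceived[(condition, elongation, "tall")])

for condition in ("consistent", "inconsistent", "environmental"):
    print(condition, round(aftereffect_index(condition, 0.311), 3))
```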
A two-factor ANOVA with sound condition (consistent, inconsistent, and environmental) and adaptor elongation (three magnitudes) as the independent variables and the aftereffect index as the dependent variable yielded significant main effects of sound condition, F(2,20) = 4.783, p < .02, and adaptor elongation, F(2,20) = 33.408, p < .0001. The latter indicates that a more elongated adaptor produced a larger aspect-ratio aftereffect (e.g., Suzuki, 2005). The former indicates that the sounds significantly influenced the magnitude of aspect-ratio aftereffects. Follow-up analyses showed that the aspect-ratio aftereffect was significantly larger with the consistent sounds relative to both the inconsistent sounds, t(10) = 2.701, p < .023, d = 0.814, and environmental sounds, t(10) = 2.653, p < .025, d = 0.800, while the aftereffect was equivalent with the inconsistent and environmental sounds, t(10) = .035, n.s. (Fig. 2a).1
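For readers who want to reproduce this style of analysis, the sketch below shows how a two-factor repeated-measures ANOVA and follow-up paired t-tests could be run in Python with statsmodels and scipy on a long-format table of aftereffect indices. The simulated data, column names, and effect sizes are placeholders; this is not the authors' analysis code.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
sounds = ["consistent", "inconsistent", "environmental"]
elongations = [0.043, 0.131, 0.311]

# Simulated long-format data: one aftereffect index per observer and cell.
rows = []
for observer in range(1, 12):
    for sound in sounds:
        for elongation in elongations:
            boost = 0.03 if sound == "consistent" else 0.0
            rows.append({"observer": observer, "sound": sound,
                         "elongation": elongation,
                         "index": 0.3 * elongation + boost + rng.normal(0, 0.02)})
df = pd.DataFrame(rows)

# Two-factor repeated-measures ANOVA (sound x adaptor elongation).
print(AnovaRM(df, depvar="index", subject="observer",
              within=["sound", "elongation"]).fit())

# Follow-up paired t-tests on sound conditions, collapsed over elongation.
means = df.groupby(["observer", "sound"])["index"].mean().unstack()
print(stats.ttest_rel(means["consistent"], means["inconsistent"]))
print(stats.ttest_rel(means["consistent"], means["environmental"]))
```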
Fig. 2.

Speech sounds presented during adaptation to consistent ellipses increased aspect ratio aftereffects. The aftereffect index (see main text for details) is shown for the consistent-sound, inconsistent-sound and environmental-sound conditions. A larger positive value indicates a stronger aftereffect. (a) The results averaged across all amounts of adaptor elongation. (b) The results for the most elongated adaptors. Error bars represent ±1 SEM adjusted for the repeated-measures design of the experiment. *p < .05, and **p < .01.
The effect of the consistent sound appears to be especially strong for the most elongated adaptors (Fig. 2b), as reflected in a marginal interaction between sound condition and adaptor elongation, F(4,40) = 2.297, p < .08. For these adaptors, the consistent sound enhanced aspect-ratio aftereffects relative to both the inconsistent sound, t(10) = 5.067, p < .001, d = 1.528, and environmental sounds, t(10) = 2.403, p < .037, d = 0.724, with no significant difference between the inconsistent and environmental sounds, t(10) = 1.311, n.s.
Overall, consistent speech sounds presented during adaptation enhanced aspect-ratio aftereffects relative to inconsistent speech sounds and environmental sounds.
3.3. Discussion
Our results complement the classic McGurk effect by showing that hearing speech sounds distorts visual perception of shape. Importantly, our results demonstrate that associations between auditory and visual features are not merely metaphorical (e.g., Köhler, 1947; Marks, 1996; Sapir, 1929) or limited to influencing response times, accuracy (e.g., Bernstein & Edelstein, 1971; Gallace & Spence, 2006; Marks, 1987) or temporal and spatial integration (e.g., Parise & Spence, 2009), but that they also change a visual feature’s appearance.
Aspect ratio is a fundamental visual feature presumably coded by relative activation of a population of neurons tuned to different aspect ratios in the ventral visual pathway (e.g., Kayaert, Biederman, & Vogels, 2003; Regan & Hamstra, 1992; Suzuki, 2005). Our results suggest that audiovisual speech experience facilitates feature-specific interactions between auditory processing of spectral patterns and ventral-visual processing of aspect ratio. Specifically, a /wee/ sound (typically associated with a horizontally elongated mouth) and a /woo/ sound (typically associated with a vertically elongated mouth) might make a shape appear flatter or taller (Experiment 1) by crossmodally boosting the activity of neurons tuned to flat and tall aspect ratios, respectively. We behaviorally tested this possibility in Experiment 2 by evaluating the effects of sounds on the aspect-ratio aftereffect.
Viewing a tall (or flat) adaptor shape makes a subsequently presented symmetric shape appear elongated in the orthogonal direction. Comparison between the psychophysical properties of this aftereffect and known physiological properties of cortical visual neurons suggests that this aftereffect reflects an adaptive population coding of aspect ratio in the ventral visual pathway (e.g., Regan & Hamstra, 1992; Suzuki, 2003, 2005; Suzuki & Cavanagh, 1998). For example, viewing a tall shape would strongly activate tall-tuned neurons but only weakly activate flat-tuned neurons, thus strongly adapting (desensitizing) tall-tuned neurons but only weakly adapting flat-tuned neurons. When a symmetric shape is subsequently presented, the strongly adapted tall-tuned neurons would be less activated than the weakly adapted flat-tuned neurons so that the symmetric shape would appear flat. Stronger activation (causing stronger adaptation) of the tall-tuned neurons would produce a larger aftereffect (e.g., making the symmetric shape appear flatter to a greater degree). The magnitude of the aspect-ratio aftereffect therefore provides a behavioral measure of activation of aspect-ratio tuned neurons (Regan & Hamstra, 1992; Suzuki, 2003, 2005; Suzuki & Cavanagh, 1998). A speech sound presented during adaptation to a consistently elongated adaptor increased the aftereffect, suggesting that a /wee/ sound increases activation and adaptation of flat-tuned neurons and a /woo/ sound increases activation and adaptation of tall-tuned neurons.
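This population-coding account can be made concrete with a toy simulation (our illustration, not a model reported in the paper): a bank of units tuned to log aspect ratio responds to the adaptor, each unit's gain is reduced in proportion to its adaptor-driven response, and a consistent sound is modeled as a multiplicative boost of that response; the perceived aspect ratio of a symmetric test is read out as the response-weighted average of the preferred values. All tuning parameters below are arbitrary.

```python
import numpy as np

preferred = np.linspace(-0.6, 0.6, 25)     # preferred log aspect ratios of the units
sigma = 0.25                               # tuning width (arbitrary)

def responses(stimulus, gains):
    return gains * np.exp(-(preferred - stimulus) ** 2 / (2 * sigma ** 2))

def decode(resp):
    # Population readout: response-weighted average of preferred aspect ratios.
    return float(np.sum(resp * preferred) / np.sum(resp))

def perceived_test_shape(adaptor, sound_gain=1.0, adapt_strength=0.5):
    """Perceived log aspect ratio of a circle (0.0) after adapting to `adaptor`.
    sound_gain > 1 mimics a consistent speech sound multiplicatively boosting
    the adaptor-driven responses, which deepens adaptation."""
    adaptor_resp = sound_gain * responses(adaptor, np.ones_like(preferred))
    gains = 1.0 - adapt_strength * adaptor_resp   # stronger response -> more adaptation
    return decode(responses(0.0, gains))

# Adapting to a tall shape (+0.3) makes the circle look flat (negative readout),
# and the repulsion grows when a consistent sound boosts adaptation.
print("silent adaptation:", perceived_test_shape(+0.3, sound_gain=1.0))
print("consistent sound: ", perceived_test_shape(+0.3, sound_gain=1.3))
```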
Although a psychophysical investigation cannot directly reveal how auditory processing might influence the activity of visual neurons, the data can inform and constrain some possibilities. Auditory neurons could boost the responses of visual neurons through excitatory connections, either directly from auditory to visual areas, or indirectly, through a multisensory integration area with connections to auditory and visual cortices (e.g., Nath & Beauchamp, 2011). Another equally plausible account is provided by the motor theory of speech perception (Galantucci, Fowler, & Turvey, 2006; Liberman & Mattingly, 1985). Listening to speech sounds is known to recruit motor areas underlying speech production (Fadiga, Craighero, Buccino, & Rizzolatti, 2002; Watkins, Strafella, & Paus, 2003). Here, covert motor simulation of mouth shapes consistent with speech sounds could have influenced the visual encoding of associated aspect ratios through feedback connectivity (Skipper, Nusbaum, & Small, 2005; Skipper, van Wassenhove, Nusbaum, & Small, 2007). While our data cannot discriminate among these alternatives, a future investigation that directly addresses them is warranted.
Interestingly, consistent sounds increased perceived elongation (Experiment 1) and adaptation (Experiment 2), but inconsistent sounds had little effect on either measure. This may suggest that auditory or motor-simulation input boosts responses of associated visual neurons that are already activated, but that it does not drive visual neurons on its own or inhibit responses of unassociated visual neurons. Thus, the underlying crossmodal influences may be excitatory and multiplicative.
In conclusion, what we see is shaped by what we hear, not only when seeing meaningful objects such as faces (Smith, Grabowecky, & Suzuki, 2007), but also when seeing a simple geometric feature, adding to growing evidence that perceptual reality is fundamentally multimodal (e.g., Schroeder & Foxe, 2005).
Acknowledgments
This research was supported by National Institutes of Health Grants R01 EY018197-02S1, EY018197, EY021184, and T32 EY007043, and National Science Foundation Grant BCS 0643191.
Footnotes
1. The magnitude of the aspect-ratio aftereffect in the no-sound condition was intermediate (0.067, SE = 0.003). This was confirmed by a significant planned contrast testing whether the no-sound condition was exactly intermediate, with weights {+5, −3, −3, +1} for {consistent, inconsistent, environmental, no-sound}, t(10) = 3.597, p < .005, d = .990. However, as discussed in the Methods section, the interpretation of the no-sound condition is ambiguous because it differs from the sound conditions on several factors, such as arousal, alertness, and temporal cueing, each of which may increase or decrease the aftereffect.
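Computationally, such a planned contrast amounts to taking a weighted sum of each observer's condition means and testing the resulting scores against zero, as sketched below with hypothetical per-observer values.

```python
import numpy as np
from scipy import stats

# Hypothetical per-observer mean aftereffect indices, columns ordered as
# (consistent, inconsistent, environmental, no-sound).
indices = np.array([
    [0.095, 0.060, 0.061, 0.068],
    [0.102, 0.071, 0.070, 0.075],
    [0.088, 0.055, 0.058, 0.064],
    [0.110, 0.065, 0.066, 0.072],
])

weights = np.array([+5, -3, -3, +1])      # contrast weights from the footnote
scores = indices @ weights                # one contrast score per observer

# One-sample t-test of the contrast scores against zero.
print(stats.ttest_1samp(scores, popmean=0.0))
```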
References
- Beauchamp MS, Argall BD, Bodurka J, Duyn JH, Martin A. Unraveling multisensory integration: Patchy organization within human STS multisensory cortex. Nature Neuroscience. 2004;7:1190–1192. doi: 10.1038/nn1333.
- Bernstein IH, Edelstein BA. Effects of some variations in auditory input upon visual choice reaction time. Journal of Experimental Psychology. 1971;87:241–247. doi: 10.1037/h0030524.
- Biederman I. Recognizing depth-rotated objects: A review of recent research and theory. Spatial Vision. 2001;13:241–253. doi: 10.1163/156856800741063.
- Brainard DH. The psychophysics toolbox. Spatial Vision. 1997;10:433–436.
- Calvert GA, Bullmore ET, Brammer MJ, Campbell R, Williams SCR, McGuire PK, et al. Activation of auditory cortex during silent lip reading. Science. 1997;276:593–596. doi: 10.1126/science.276.5312.593.
- Fadiga L, Craighero L, Buccino G, Rizzolatti G. Speech listening specifically modulates the excitability of tongue muscles: A TMS study. European Journal of Neuroscience. 2002;15:399–402. doi: 10.1046/j.0953-816x.2001.01874.x.
- Galantucci B, Fowler CA, Turvey MT. The motor theory of speech reviewed. Psychonomic Bulletin & Review. 2006;13:361–377. doi: 10.3758/bf03193857.
- Gallace A, Spence C. Multisensory synesthetic interactions in the speeded classification of visual size. Perception & Psychophysics. 2006;68:1191–1203. doi: 10.3758/bf03193720.
- Kayaert G, Biederman I, Vogels R. Shape tuning in macaque inferior temporal cortex. The Journal of Neuroscience. 2003;23:3016–3027. doi: 10.1523/JNEUROSCI.23-07-03016.2003.
- Kleiner M, Brainard D, Pelli D. What’s new in Psychtoolbox-3? Perception. 2007;36 (ECVP Abstract Supplement).
- Knill DC. Surface orientation from texture: Ideal observers, generic observers and the information content of texture cues. Vision Research. 1998a;38:1655–1682. doi: 10.1016/s0042-6989(97)00324-6.
- Knill DC. Discriminating surface slant from texture: Comparing human and ideal observers. Vision Research. 1998b;38:1683–1711. doi: 10.1016/s0042-6989(97)00325-8.
- Köhler W. Gestalt psychology. 2nd ed. New York: Liveright; 1947.
- Kuhl PK, Meltzoff AN. The bimodal perception of speech in infancy. Science. 1982;218:1138–1141. doi: 10.1126/science.7146899.
- Lavie N. Distracted and confused? Selective attention under load. Trends in Cognitive Sciences. 2005;9:75–82. doi: 10.1016/j.tics.2004.12.004.
- Liberman AM, Mattingly IG. The motor theory of speech revised. Cognition. 1985;21:1–36. doi: 10.1016/0010-0277(85)90021-6.
- Marks LE. Auditory–visual interactions in speeded discrimination. Journal of Experimental Psychology: Human Perception and Performance. 1987;13:384–394. doi: 10.1037//0096-1523.13.3.384.
- Marks LE. On perceptual metaphors. Metaphor and Symbolic Activity. 1996;11:39–66.
- McGurk H, MacDonald J. Hearing lips and seeing voices. Nature. 1976;264:746–748. doi: 10.1038/264746a0.
- Miller LM, D’Esposito M. Perceptual fusion and stimulus coincidence in the cross-modal integration of speech. The Journal of Neuroscience. 2005;25:5884–5893. doi: 10.1523/JNEUROSCI.0896-05.2005.
- Nath AR, Beauchamp MS. Dynamic changes in superior temporal sulcus connectivity during perception of noisy audiovisual speech. The Journal of Neuroscience. 2011;31:1704–1714. doi: 10.1523/JNEUROSCI.4853-10.2011.
- Parise CV, Spence C. When birds of a feather flock together: Synesthetic correspondences modulate audiovisual integration in non-synesthetes. PLoS ONE. 2009;4:e5664. doi: 10.1371/journal.pone.0005664.
- Pelli DG. The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spatial Vision. 1997;10:437–442.
- Pinsk MA, Doniger GM, Kastner S. Push–pull mechanism of selective attention in human extrastriate cortex. Journal of Neurophysiology. 2004;92:622–629. doi: 10.1152/jn.00974.2003.
- Regan D, Hamstra SJ. Shape discrimination and the judgment of perfect symmetry: Dissociation of shape from size. Vision Research. 1992;32:1845–1864. doi: 10.1016/0042-6989(92)90046-l.
- Sapir E. A study in phonetic symbolism. Journal of Experimental Psychology. 1929;12:225–239.
- Schroeder CE, Foxe J. Multisensory contributions to low-level, ‘unisensory’ processing. Current Opinion in Neurobiology. 2005;15:454–458. doi: 10.1016/j.conb.2005.06.008.
- Sherman A, Sweeny TD, Grabowecky M, Suzuki S. Laughter exaggerates happy and sad faces depending on visual context. Psychonomic Bulletin & Review. 2012. doi: 10.3758/s13423-011-0198-2.
- Skipper JI, Nusbaum HC, Small SL. Listening to talking faces: Motor cortical activation during speech perception. NeuroImage. 2005;25:76–89. doi: 10.1016/j.neuroimage.2004.11.006.
- Skipper JI, van Wassenhove V, Nusbaum HC, Small SL. Hearing lips and seeing voices: How cortical areas supporting speech production mediate audiovisual speech perception. Cerebral Cortex. 2007;17:2387–2399. doi: 10.1093/cercor/bhl147.
- Smith E, Grabowecky M, Suzuki S. Auditory–visual crossmodal integration in perception of face gender. Current Biology. 2007;17:1680–1685. doi: 10.1016/j.cub.2007.08.043.
- Suzuki S. Attentional selection of overlapped shapes: A study using brief aftereffects. Vision Research. 2003;43:549–561. doi: 10.1016/s0042-6989(02)00683-1.
- Suzuki S. High-level pattern coding revealed by brief shape aftereffects. In: Clifford C, Rhodes G, editors. Fitting the mind to the world: Adaptation and aftereffects in high-level vision. Advances in Visual Cognition Series, Vol. 2. Oxford University Press; 2005.
- Suzuki S, Cavanagh P. A shape-contrast effect for briefly presented stimuli. Journal of Experimental Psychology: Human Perception and Performance. 1998;24(5):1315–1341. doi: 10.1037//0096-1523.24.5.1315.
- Sweeny TD, Kim YJ, Grabowecky M, Suzuki S. Internal curvature signal and noise in low- and high-level vision. Journal of Neurophysiology. 2011;105:1236–1257. doi: 10.1152/jn.00061.2010.
- von Kriegstein K, Dogan O, Giraud AL, Kell CA, Gruter T, Kleinschmidt A, et al. Simulation of talking faces in the human brain improves auditory speech recognition. Proceedings of the National Academy of Sciences, USA. 2008;105:6747–6752. doi: 10.1073/pnas.0710826105.
- Watkins KE, Strafella AP, Paus T. Seeing and hearing speech excites the motor system involved in speech production. Neuropsychologia. 2003;41:989–994. doi: 10.1016/s0028-3932(02)00316-0.
- Wilson M. Six views of embodied cognition. Psychonomic Bulletin & Review. 2002;9:625–636. doi: 10.3758/bf03196322.
- Wolfe J. Short test flashes produce large tilt aftereffects. Vision Research. 1984;24:1959–1964. doi: 10.1016/0042-6989(84)90030-0.
- Yehia H, Rubin P, Vatikiotis-Bateson E. Quantitative association of vocal tract and facial behavior. Speech Communication. 1998;26:23–43.
- Young MP, Yamane S. Sparse population coding of faces in the inferotemporal cortex. Science. 1992;256:1327–1331. doi: 10.1126/science.1598577.

