Abstract
Because of the slow speed of sound relative to light, acoustic and visual signals from a distant event will often be received asynchronously. Here, using acoustic signals with a robust cue to sound source distance, we show that judgments of perceived temporal alignment with a visual marker depend on the depth simulated in the acoustic signal. For distant sounds, a large delay of sound relative to vision is required for the signals to be perceived as temporally aligned. For nearer sources, the time lag corresponding to audiovisual alignment is smaller and scales at a rate approximating the speed of sound. Thus, when robust cues to auditory distance are present, the brain can synchronize disparate audiovisual signals to external events despite considerable differences in time of arrival at the perceiver. This ability is functionally important as it allows auditory and visual signals to be synchronized to the external event that caused them.
Keywords: audiovisual interactions, auditory distance perception, auditory psychophysics
Studies of audiovisual temporal alignment have generally found that an auditory stimulus needs to be delayed to be perceptually aligned with a visual stimulus (1–7). This temporal offset, on the order of several tens of milliseconds, is thought to reflect the slower processing times for visual stimuli. The offset arises because acoustic transduction between the outer and inner ears is a direct mechanical process and is extremely fast, taking 1 ms or less (8, 9), whereas phototransduction in the retina is a relatively slow photochemical process followed by several cascading neurochemical stages and lasts ≈50 ms (10–14). These differential latencies between auditory and visual processing agree well with the common finding that auditory signals must lag visual signals by ≈40–50 ms if they are to be perceived as temporally aligned.
Most studies of audiovisual alignment, however, are based on experiments done in the near field, meaning that auditory travel time is a negligible factor. Studies of audiovisual alignment conducted over greater distances have examined whether the brain can compensate for the slow travel time of sound, but these have produced contradictory results (15–17). Here, we test whether subjective audiovisual alignment reflects only the relatively stable internal latency differences that are well documented or whether knowledge of the external distance of an auditory source can be used to compensate for the slow travel time of sound relative to light. In the experiments presented here, we use a powerful cue to auditory source distance, the ratio of direct-to-reverberant energy (18), to vary the apparent distance of the acoustic signal's origin. By manipulating this cue, we find that increases in sound source distance cause sounds to be perceptually aligned with earlier visual events. When large auditory distances are simulated, the auditory lag required for vision to be aligned with sound becomes exaggerated far beyond what can be accounted for by differential neural latencies. These results suggest that the perceptual system is capable of compensating for the relatively slow speed of sound when reliable auditory depth information is available to guide it.
Methods
All measurements were made in a large (64 m3), darkened high-fidelity anechoic chamber (>99% absorption down to 200 Hz) with insertion loss >60 dB above 500 Hz. Subjects sat 57 cm from a laptop computer screen that displayed the visual stimulus. The screen was dark except for a 13-ms presentation of a circular Gaussian-profile luminance blob 4° in width. Two loudspeakers (Yamaha MSP 5) flanked the screen and played the auditory signal, an impulse response function (IRF) recorded by our laboratory in the concert hall of the Sydney Opera House. The IRF was convolved with white noise and began with a direct (i.e., anechoic) portion lasting 13 ms, followed by a long reverberant tail that dissipated over 1,350 ms. The initial and reverberant portions of the signal were windowed with raised cosines overlapping at half-height at 13 ms, so that the direct onset burst could be scaled smoothly in amplitude relative to the reverberant tail (using the "overlap and add" method) with no detectable discontinuity. The reverberant tail was identical for all simulated depths. The IRF was recorded 5 m from a source on the auditorium stage. Scaling down the direct portion (originally 84-dB peak sound pressure level) by 6 dB created a stimulus simulating a 10-m sound source distance, and further 6-dB attenuations produced stimuli simulating 20- and 40-m source distances, respectively (see Fig. 1A). Thus, auditory source distance was simulated using the powerful depth cue of direct-to-reverberant energy ratio. In enclosed reverberant environments, the ratio of direct-to-reverberant energy is the strongest cue to auditory sound source distance because the incident level decreases by 6 dB with each doubling of distance, whereas the level of the reverberant tail is approximately invariant (18–20). In the darkened anechoic chamber, each stimulus sounded like a brief white-noise burst heard in a large reverberant space at approximately the specified depth.
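The manipulation described above can be sketched as follows. This is a minimal reconstruction, not the original stimulus code: the sample rate, the exact crossfade geometry around 13 ms, and the variable names are all assumptions for illustration.

```python
import numpy as np

fs = 44100                       # sample rate (assumed)
n = int(0.013 * fs)              # samples in the 13-ms direct portion

# Raised-cosine crossfade: the direct envelope is 1 over the first 13 ms,
# then falls to 0 while the reverberant envelope (its complement) rises.
# The exact overlap geometry is a simplification of the paper's windowing.
def envelopes(total_len):
    ramp = 0.5 * (1 + np.cos(np.pi * np.arange(n) / n))  # 1 -> 0
    direct = np.zeros(total_len)
    direct[:n] = 1.0
    direct[n:2 * n] = ramp
    return direct, 1.0 - direct

def simulate_distance(stim, atten_db):
    """Attenuate only the direct portion by atten_db, leaving the
    reverberant tail untouched (overlap-and-add crossfade)."""
    direct_env, tail_env = envelopes(stim.size)
    gain = 10 ** (-atten_db / 20.0)   # 6 dB -> ~0.501 amplitude
    return stim * (gain * direct_env + tail_env)

# Attenuations of 0, 6, 12, and 18 dB simulate 5-, 10-, 20-, and 40-m
# sources: the direct-to-reverberant ratio halves per doubling of distance.
```

Because the tail envelope is exactly the complement of the direct envelope, the two scaled portions sum smoothly with no discontinuity at the crossfade, mirroring the overlap-and-add construction described in the text.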
The visual stimulus was also fixed in distance, 57 cm from the observer. It did not need to be colocated in depth with the auditory signal because its role was primarily to provide a reference point in time rather than a fusible audiovisual event.
Subjects were instructed to listen to the entire auditory stimulus and then judge whether the auditory burst had occurred before or after the visual stimulus. Different auditory depths were paired in a given block of trials and were randomly interleaved. Initially, visual and auditory stimuli were played simultaneously and, depending on the subject's response, the onset time of the auditory stimulus was then advanced or retarded according to an adaptive staircase routine [QUEST (21)] designed to home in on the point of subjective audiovisual simultaneity. Six QUEST staircases of 25 trials each were run, and the data were pooled into a single data set and fitted with a cumulative Gaussian by using a maximum-likelihood method to model the psychometric function. The auditory-onset asynchrony (relative to the visual flash) at the half-height of the psychometric function defined the point of subjective audiovisual alignment. Each psychometric function was resampled by using a bootstrapping procedure to generate a distribution of 1,000 functions, the standard deviation of which was used as an error bar for the point of subjective alignment. Slopes of linear fits to subjective alignments across auditory distance were tested for significance by using a Monte Carlo procedure simulating 1,000 replications of the experiment.
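The fitting and bootstrapping stage can be illustrated with a short sketch. The trial data below are simulated stand-ins (the true PSE and slope values are arbitrary), and the optimizer and rep count are implementation choices, not details from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated "sound after flash?" responses standing in for pooled trials.
rng = np.random.default_rng(1)
soa = np.repeat(np.linspace(-100.0, 200.0, 13), 12)   # audio onset re flash (ms)
true_pse, true_sd = 60.0, 40.0                        # arbitrary illustration values
resp = (rng.random(soa.size) < norm.cdf(soa, true_pse, true_sd)).astype(float)

def neg_log_lik(params, x, r):
    """Negative log-likelihood of a cumulative-Gaussian psychometric function."""
    mu, sigma = params
    p = np.clip(norm.cdf(x, mu, abs(sigma) + 1e-9), 1e-6, 1 - 1e-6)
    return -np.sum(r * np.log(p) + (1 - r) * np.log(1 - p))

fit = minimize(neg_log_lik, x0=[0.0, 50.0], args=(soa, resp),
               method="Nelder-Mead")
pse = fit.x[0]   # half-height point = point of subjective alignment

# Bootstrap: resample trials with replacement, refit, and take the SD of
# the resampled half-height points as the error bar (1,000 reps in the
# paper; fewer here for speed).
boot = []
for _ in range(200):
    i = rng.integers(0, soa.size, soa.size)
    b = minimize(neg_log_lik, x0=fit.x, args=(soa[i], resp[i]),
                 method="Nelder-Mead")
    boot.append(b.x[0])
err = float(np.std(boot))
```

Maximizing the binomial likelihood rather than least-squares-fitting the proportions weights each asynchrony level by its trial count, which is the standard rationale for the maximum-likelihood fit the Methods describe.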
Results
The point of subjective alignment of auditory and visual stimuli depended on the source distance simulated in the auditory stimulus. As shown for four observers in Fig. 2C, sound-onset times had to be increasingly delayed to produce alignment with the visual stimulus as perceived acoustic distance increased. Best-fitting linear functions describe the data well, with slopes varying between observers from 2.51 to 4.17 ms·m⁻¹. The intersubject variation is consistent with findings showing auditory distance estimates vary considerably between observers (15, 22), and the data also show the typical compressive function seen in studies of auditory depth (23); however, the average slope of 3.43 ms·m⁻¹ is approximately consistent with the speed of sound (≈343 m·s⁻¹, i.e., ≈2.9 ms of travel time per metre). These results suggest two conclusions: first, that in making their audiovisual alignments observers were taking account of the distance of the sound source, as defined by the direct-to-reverberant energy ratio, and second, that they were attempting to compensate for the travel time from that source with a subjective estimate of the speed of sound.
To verify that the depth cue underlying the delaying of audiovisual alignment really was the ratio of direct-to-reverberant energies, the audiovisual alignment experiment was repeated by using just the initial 13-ms direct portion of the sound (with no reverberant tail) at the same four amplitude levels used in the first experiment. The results of this control experiment (Fig. 3, ▪) demonstrate first that the reverberant tail is necessary if the point of audiovisual alignment is to be shifted in time, and second that differences in the amplitude of the brief initial portion did not influence audiovisual alignment in our first experiment. Rather, it was the ratio of those initial amplitude changes with respect to the reverberant tail that was crucial to the delaying of audiovisual alignment as auditory depth increased.
A second control experiment tested whether changes in loudness, caused by the differing amplitudes of the early portion of the signals, may have determined the delaying of audiovisual alignment with increasing auditory distance. Loudness can be used by the auditory system to estimate depth, although without a context to indicate the likely amplitude at source it is a rather poor cue (22), just as sheer size in vision is a poor indicator of depth (24, 25). In generating loudness (perceived sound level), the auditory system integrates signals over a period of ≈200 ms (26, 27). We therefore calculated the average sound level of the first 200 ms of the four conditions used in the original experiment (see Fig. 1A) and expressed these time-averaged levels as ratios relative to that of the 5-m condition. These ratios estimate the perceptual changes in loudness in the early part of the signal across the four auditory depths. The stimuli for this second control experiment were then produced by scaling the entire original 5-m stimulus in amplitude (using the time-averaged level ratios), to create four stimuli whose time-averaged levels match those of the original 5-, 10-, 20-, and 40-m stimuli but whose direct-to-reverberant energy ratios are identical (because all are scaled versions of the 5-m stimulus). The audiovisual alignment task conducted with these control stimuli produced a flat function with a nonsignificant slope (Fig. 3, ▪), indicating that the intensity differences in our original stimuli (and the corresponding changes in perceived loudness) did not produce the shifts in temporal alignment we obtained. Rather, it was the systematic variation in perceived distance, as determined by the direct-to-reverberant energy ratio, that was the essential determinant of our results.
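The level-matching construction just described can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions: the stimuli are arrays at an assumed sample rate, and mean power over the first 200 ms stands in for the 200-ms loudness integration window cited in the text.

```python
import numpy as np

fs = 44100                    # sample rate (assumed)
win = int(0.200 * fs)         # 200-ms loudness-integration window

def level_ratio(stim, ref):
    """Amplitude-domain ratio equating mean power over the first 200 ms."""
    p_stim = np.mean(stim[:win] ** 2)
    p_ref = np.mean(ref[:win] ** 2)
    return np.sqrt(p_stim / p_ref)

def make_control(ref_5m, depth_stim):
    """Rescale the entire 5-m stimulus so that its 200-ms time-averaged
    level matches a given depth stimulus. Because the whole waveform is
    scaled by one factor, the direct-to-reverberant ratio is unchanged."""
    return ref_5m * level_ratio(depth_stim, ref_5m)
```

Applying `make_control` to each of the four original depth stimuli yields four scaled copies of the 5-m stimulus that match them in early-signal loudness while holding the direct-to-reverberant energy ratio fixed, which is exactly the dissociation the control condition requires.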
In a final manipulation, we repeated the audiovisual alignment task with the original stimuli in a "speeded attention" condition. Participants were told to direct their attention to the onset burst (ignoring the reverberant tail) and to respond as fast as possible while maintaining accuracy. For this reason, responses were rejected as being too slow if their reaction times exceeded 650 ms. Because this time window must include both the alignment decision and the motor response indicating it, it ensures that only the early part of the auditory signal can be processed before an alignment response is required. By responding before the reverberant signal is fully encoded, the direct-to-reverberant ratio as a depth cue should be compromised or even eliminated. Audiovisual alignments for this speeded attention condition (Fig. 3, ○) show no systematic variation across auditory depth, although they do appear to be shifted vertically downward with respect to the other control conditions. These vertical shifts probably are caused by a phenomenon known as the modality shifting effect (28), which describes the considerable delay involved in switching an attentional focus from one sensory modality to another. In previous conditions, attention would have been divided between the visual and auditory modalities to optimally perform the alignment task. In the present condition, because attention was focused on audition, the time required to shift the focus of attention to vision would effectively mean that an apparently simultaneous visual stimulus would actually have a later onset than if vision and audition had been jointly attended. The consequence of this would be downward-shifted curves (as plotted in Fig. 3, ○).
Discussion
The essential finding in this article is that the auditory lag that results in perceived audiovisual alignment increases as the simulated auditory source distance increases. This finding implies that the brain is able to compensate for the fact that, with increasing source distance, the acoustic signal arising from a real bimodal event will arrive at the perceiver's head at progressively later times than the corresponding visual signal. Together, the results argue against an account of audiovisual alignment based solely on passive and low-level differential delays. On this model, subjective timing is determined simply by the times of physical arrival at the sensory epithelia summed with the respective processing latencies for each sensory modality. The low-level model therefore would predict a common auditory lag for all simulated source distances, one determined by the differential neural processing latencies for vision and audition. Our first experiment clearly demonstrates this is not so: the point of subjective alignment became systematically delayed as simulated auditory distance increased. Thus, the data suggest an active, interpretative process capable of exploiting auditory depth cues to temporally align auditory and visual signals at the moment they occur at their external sources.
The delaying of audiovisual alignment with increasing auditory distance only occurred when the depth cue was the ratio of direct-to-reverberant energy. In the first two control conditions, where loudness was the depth cue and the energy ratio was not available, no change in the point of temporal alignment was observed. Instead, the moment when the auditory and visual signals became available perceptually determined audiovisual alignment (“internal” alignment). By contrast, the shifting alignment point observed in our first experiment demonstrates “external” alignment. That is, as perceived auditory depth increased, observers delayed the point of subjective alignment (effectively, waiting for the acoustic signal to arrive) so that the auditory and visual signals would remain synchronous at the source. Because external alignment requires the brain to ignore a considerable temporal asynchrony between two neural signals (specifically, the late arrival of the auditory signal), it is unlikely to do so unless there is a robust depth cue to guide it, so that spurious alignments are avoided. The reason external alignment could be accomplished in the first experiment was likely because the ratio of direct-to-reverberant energy is a very powerful cue to auditory depth (perhaps bolstered by the naturally covarying level of the direct portion), a far more robust cue than loudness, and a suitable basis on which to attempt external alignment.
The perceptual system therefore appears to have several strategies available for determining audiovisual synchrony. Without a reliable depth cue, the brain will default to aligning signals internally. However, even given reliable depth information, external alignment is not mandatory. Rather, external alignment must be “task relevant” or it will not occur. As our final speeded attention condition demonstrates, focusing attention on the early part of the auditory signal eliminated the external alignment originally obtained with exactly the same stimuli. Thus, the perceptual system exhibits a flexibility in determining audiovisual alignment. The available depth information partly constrains which type of alignment can occur, but there is also a strategic, attentional component guided by the task needs of the perceiver.
The capacity to align external auditory and visual signals is important functionally as it ties them to the environmental event that caused them. Realizing an appropriate perceptual compensation requires knowledge of source distance and speed of sound, as well as of the effects of other more subtle factors that may affect the speed of neural signal processing (for discussion, see ref. 17). The direct-to-reverberant energy ratio provides a reliable auditory distance cue. As for speed of sound, listeners presumably derive an experience-based estimate of the speed of sound that is validated and refined through interaction with the environment. To a first approximation, then, estimates of both quantities required for external alignment are available. The fact that audiovisual alignment varies systematically with simulated auditory distance demonstrates that the brain can make use of these estimates when it is task-relevant to do so.
Acknowledgments
This work was supported by Australian Research Council Grant DP0345797 (to D.A.).
Author contributions: D.A. and S.C. designed research; and D.A. performed research, analyzed data, and wrote the paper.
This paper was submitted directly (Track II) to the PNAS office.
Abbreviation: IRF, impulse response function.
References
- 1. Bald, L., Berrien, F. K., Price, J. B. & Sprague, R. O. (1942) J. Appl. Psychol. 26, 382-388.
- 2. Bushara, K. O., Grafman, J. & Hallett, M. (2001) J. Neurosci. 21, 300-304.
- 3. Hamlin, A. J. (1895) Am. J. Psychol. 6, 564-575.
- 4. Hirsh, I. J. & Sherrick, C. E. (1961) J. Exp. Psychol. 62, 423-432.
- 5. Lewkowicz, D. J. (1996) J. Exp. Psychol. Hum. Percept. Perform. 5, 1094-1106.
- 6. Rutschmann, J. & Link, R. (1964) Percept. Mot. Skills 18, 345-352.
- 7. Smith, W. F. (1933) J. Exp. Psychol. 16, 239-257.
- 8. Corey, D. P. & Hudspeth, A. J. (1979) Biophys. J. 26, 499-506.
- 9. King, A. J. & Palmer, A. R. (1985) Exp. Brain Res. 60, 492-500.
- 10. Lamb, T. D. & Pugh, E. N. (1992) J. Physiol. (London) 449, 719-758.
- 11. Lennie, P. (1981) Vision Res. 21, 815-824.
- 12. Schnapf, J. L., Kraft, T. W. & Baylor, D. A. (1987) Nature 325, 439-441.
- 13. Bolz, J., Rosner, G. & Wassle, H. (1982) J. Physiol. (London) 328, 171-190.
- 14. Rodieck, R. W. (1998) The First Steps in Seeing (Sinauer, Sunderland, MA).
- 15. Lewald, J. & Guski, R. (2004) Neurosci. Lett. 357, 119-122.
- 16. Sugita, Y. & Suzuki, Y. (2003) Nature 421, 911.
- 17. Kopinska, A. & Harris, L. R. (2004) Perception 33, 1049-1060.
- 18. Bronkhorst, A. W. & Houtgast, T. (1999) Nature 397, 517-520.
- 19. Mershon, D. H. & Bowers, J. N. (1979) Perception 8, 311-322.
- 20. Zahorik, P. (2002) J. Acoust. Soc. Am. 112, 2110-2117.
- 21. Watson, A. B. & Pelli, D. G. (1983) Percept. Psychophys. 33, 113-120.
- 22. Moore, B. C. J. (2003) An Introduction to the Psychology of Hearing (Academic, New York), 5th Ed.
- 23. Zahorik, P. (2002) J. Acoust. Soc. Am. 111, 1832-1846.
- 24. Ittelson, W. H. (1951) Am. J. Psychol. 64, 54-67.
- 25. Kaufman, L. (1974) Sight and Mind (Oxford Univ. Press, New York).
- 26. Scharf, B. (1978) in Handbook of Perception: Hearing, eds. Carterette, E. C. & Friedman, M. P. (Academic, New York), Vol. IV, pp. 187-242.
- 27. Buus, S., Florentine, M. & Poulsen, T. (1997) J. Acoust. Soc. Am. 101, 669-680.
- 28. Spence, C., Nicholls, M. & Driver, J. (2000) Percept. Psychophys. 63, 330-336.