Proceedings of the National Academy of Sciences of the United States of America
2005 Jan 24;102(6):2244–2247. doi: 10.1073/pnas.0407034102

Synchronizing to real events: Subjective audiovisual alignment scales with perceived auditory depth and speed of sound

David Alais 1,*, Simon Carlile 1
PMCID: PMC548526  PMID: 15668388

Abstract

Because of the slow speed of sound relative to light, acoustic and visual signals from a distant event often will be received asynchronously. Here, using acoustic signals with a robust cue to sound source distance, we show that judgments of perceived temporal alignment with a visual marker depend on the depth simulated in the acoustic signal. For distant sounds, a large delay of sound relative to vision is required for the signals to be perceived as temporally aligned. For nearer sources, the time lag corresponding to audiovisual alignment is smaller and scales at a rate approximating the speed of sound. Thus, when robust cues to auditory distance are present, the brain can synchronize disparate audiovisual signals to external events despite considerable differences in time of arrival at the perceiver. This ability is functionally important as it allows auditory and visual signals to be synchronized to the external event that caused them.

Keywords: audiovisual interactions, auditory distance perception, auditory psychophysics


Studies of audiovisual temporal alignment generally have found that an auditory stimulus needs to be delayed to be perceptually aligned with a visual stimulus (1–7). This temporal offset, on the order of several tens of milliseconds, is thought to reflect the slower processing times for visual stimuli. It arises because acoustic transduction between the outer and inner ears is a direct mechanical process and is extremely fast, taking just 1 ms or less (8, 9), whereas phototransduction in the retina is a relatively slow photochemical process followed by several cascading neurochemical stages and lasts ≈50 ms (10–14). Thus, differential latencies between auditory and visual processing agree well with the common finding that auditory signals must lag visual signals by ≈40–50 ms if they are to be perceived as temporally aligned.

Most studies of audiovisual alignment, however, are based on experiments done in the near field, meaning that auditory travel time is a negligible factor. Studies of audiovisual alignment conducted over greater distances have examined whether the brain can compensate for the slow travel time of sound, but these have produced contradictory results (15–17). Here, we test whether subjective audiovisual alignment reflects only the relatively stable internal latency differences that are well documented or whether knowledge of the external distance of an auditory source can be used to compensate for the slow travel time of sound relative to light. In the experiments presented here, we use a powerful cue to auditory source distance, the ratio of direct-to-reverberant energy (18), to vary the apparent distance of the acoustic signal's origin. By manipulating this cue, we find that increases in sound source distance cause sounds to be perceptually aligned with earlier visual events. When large auditory distances are simulated, the auditory lag required for vision to be aligned with sound becomes exaggerated far beyond what can be accounted for by differential neural latencies. These results suggest that the perceptual system is capable of compensating for the relatively slow speed of sound when reliable auditory depth information is available to guide it.

Methods

All measurements were made in a large (64 m3), darkened high-fidelity anechoic chamber (>99% absorption down to 200 Hz) with insertion loss >60 dB above 500 Hz. Subjects sat 57 cm from a laptop computer screen that displayed the visual stimulus. The screen was dark except for a 13-ms presentation of a circular Gaussian-profile luminance blob 4° in width. Two loudspeakers (Yamaha MSP 5) flanked the screen and played the auditory signal, an impulse response function (IRF) recorded by our laboratory in the concert hall of the Sydney Opera House. The IRF was convolved with white noise and began with a direct (i.e., anechoic) portion lasting 13 ms, followed by a long reverberant tail that dissipated over 1,350 ms. The initial and reverberant portions of the signal were windowed with raised cosines overlapping at half-height at 13 ms, so that the direct onset burst could be scaled smoothly in amplitude relative to the reverberant tail (using the "overlap and add" method) with no detectable discontinuity. The reverberant tail was identical for all simulated depths. The IRF was recorded 5 m from a source on the auditorium stage. Scaling down the direct portion (originally 84-dB peak sound pressure level) by 6 dB created a stimulus simulating a 10-m sound source distance, and further 6-dB attenuations produced stimuli simulating 20- and 40-m source distances, respectively (see Fig. 1A). Thus, auditory source distance was simulated by using the powerful depth cue of direct-to-reverberant energy ratio. In enclosed reverberant environments, the ratio of direct-to-reverberant energy is the strongest cue to auditory sound source distance because the incident level decreases by 6 dB with each doubling of distance, whereas the level of the reverberant tail is approximately invariant (18–20). In the darkened anechoic chamber, each stimulus sounded like a brief white-noise burst heard in a large reverberant space at approximately the specified depth.
The visual stimulus was also fixed in distance, 57 cm from the observer. It did not need to be colocated in depth with the auditory signal because its role was primarily to provide a reference point in time rather than a fusible audiovisual event.
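The distance-scaling scheme can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the sample rate and the exact extent of the crossfade are assumptions, since the paper specifies only the 6-dB steps and raised-cosine windows meeting at half-height at 13 ms.

```python
import numpy as np

FS = 44_100  # sample rate in Hz; an assumed value, not stated in the paper


def direct_gain(distance_m, reference_m=5.0):
    """Linear gain applied to the direct portion: -6 dB per doubling
    of distance relative to the 5-m recording (inverse-square law)."""
    gain_db = -6.0 * np.log2(distance_m / reference_m)
    return 10 ** (gain_db / 20)


def simulate_distance(stimulus, distance_m, fs=FS):
    """Attenuate only the direct (first 13 ms) portion of the stimulus,
    crossfading raised-cosine windows (half-height at 13 ms) so the
    scaled burst joins the unchanged reverberant tail smoothly."""
    n = int(fs * 0.013)          # the 13-ms crossover point, in samples
    half = n // 2                # assumed fade extent: +/- 6.5 ms around it
    fade = 0.5 * (1 + np.cos(np.linspace(0.0, np.pi, 2 * half)))  # 1 -> 0

    direct_win = np.zeros(len(stimulus))
    direct_win[: n - half] = 1.0
    direct_win[n - half : n + half] = fade
    reverb_win = 1.0 - direct_win            # complementary window

    # overlap-and-add: scaled direct burst plus unchanged reverberant tail
    g = direct_gain(distance_m)
    return g * stimulus * direct_win + stimulus * reverb_win
```

Because the two windows sum to one everywhere, only the relative level of the direct burst changes; the reverberant tail, and hence the denominator of the direct-to-reverberant ratio, is identical across simulated depths.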

Fig. 1.

Illustrations of the stimuli and procedures used in these experiments. (A) The impulse response function on the top row (5 m) was the original function recorded in the Sydney Opera House convolved with white noise. The direct sound is the initial portion of high amplitude. The long tail is the reverberant signal, which lasted 1,350 ms and was identical for all four stimuli. Because the ratio of direct-to-reverberant energy is a very strong cue to auditory source distance, attenuating the direct portion by 6 dB (a halving of amplitude) simulates a source distance of 10 m (see Methods). Further 6-dB attenuations simulated auditory distances of 20 and 40 m. (B) The visual stimulus was similar to that shown (Left), a circular luminance patch that was presented for 13 ms. The spatial profile of the stimulus (Right) was Gaussian with a full width at half-height of 4° of visual angle. (C) The onset of the auditory stimulus (Upper) was varied by an adaptive procedure to find the point of subjective alignment with the visual stimulus (Lower). A variable random period preceded the stimuli after the subject initiated each trial.

Subjects were instructed to listen to the entire auditory stimulus and then judge whether the auditory burst had occurred before or after the visual stimulus. Different auditory depths were paired in a given block of trials and were randomly interleaved. Initially, visual and auditory stimuli were played simultaneously and, depending on the subject's response, the onset time of the auditory stimulus was then advanced or retarded according to an adaptive staircase routine [quest (21)] designed to home in on the point of subjective audiovisual simultaneity. Six quests of 25 trials each were run, and the data were pooled into a single data set and fitted with a cumulative Gaussian by using a maximum-likelihood method to model the psychometric function. The auditory-onset asynchrony (relative to the visual flash) at the half-height of the psychometric function defined the point of subjective audiovisual alignment. Each psychometric function was resampled by using a bootstrapping procedure to generate a distribution of 1,000 functions, the standard deviation of which was used as an error bar for the point of subjective alignment. Slopes of linear fits to subjective alignments across auditory distance were tested for significance by using a Monte Carlo procedure simulating 1,000 replications of the experiment.
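The fitting stage can be sketched as below. This is a minimal sketch under stated assumptions: the paper specifies a maximum-likelihood cumulative-Gaussian fit but not the optimizer, so a coarse grid search stands in here, and `fit_psychometric` is an illustrative helper rather than the authors' analysis code.

```python
import numpy as np
from math import erf, sqrt


def cum_gauss(x, mu, sigma):
    """Cumulative Gaussian psychometric function, evaluated pointwise."""
    return np.array([0.5 * (1 + erf((xi - mu) / (sigma * sqrt(2)))) for xi in x])


def fit_psychometric(delays_ms, n_vision_first, n_trials):
    """Maximum-likelihood fit of a cumulative Gaussian to the counts of
    'vision first' responses at each auditory delay.  Returns (mu, sigma);
    mu, the half-height of the function, is the point of subjective
    audiovisual alignment."""
    best_ll, best = -np.inf, (None, None)
    for mu in np.linspace(min(delays_ms) - 20, max(delays_ms) + 20, 141):
        for sigma in np.linspace(5.0, 150.0, 30):
            p = np.clip(cum_gauss(delays_ms, mu, sigma), 1e-9, 1 - 1e-9)
            # binomial log-likelihood summed over delay levels
            ll = np.sum(n_vision_first * np.log(p)
                        + (n_trials - n_vision_first) * np.log(1 - p))
            if ll > best_ll:
                best_ll, best = ll, (mu, sigma)
    return best
```

The bootstrapped error bars described in the text would follow by resampling trial outcomes, refitting 1,000 times, and taking the standard deviation of the resulting mu values.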

Results

The point of subjective alignment of auditory and visual stimuli depended on the source distance simulated in the auditory stimulus. As shown for four observers in Fig. 2C, sound-onset times had to be increasingly delayed to produce alignment with the visual stimulus as perceived acoustic distance increased. Best-fitting linear functions describe the data well, with slopes varying between observers from 2.51 to 4.17 ms·m⁻¹. The intersubject variation is consistent with findings showing that auditory distance estimates vary considerably between observers (15, 22), and the data also show the typical compressive function seen in studies of auditory depth (23); however, the average slope of 3.43 ms·m⁻¹ is approximately consistent with the speed of sound. These results suggest two conclusions: first, that in making their audiovisual alignments observers were taking account of the distance of the sound source, as defined by the direct-to-reverberant energy ratio, and, second, that they were attempting to compensate for the travel time from that source with a subjective estimate of the speed of sound.
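As a sanity check on the reported slopes, the delay per metre implied by the physical speed of sound (assuming ≈343 m/s at room temperature) can be computed directly:

```python
# Expected delay-per-metre if observers fully compensated for acoustic
# travel time: sound covers 1 m in 1000/343 ms at roughly 20 degrees C.
SPEED_OF_SOUND_M_PER_S = 343.0
physical_slope_ms_per_m = 1000.0 / SPEED_OF_SOUND_M_PER_S
print(f"{physical_slope_ms_per_m:.2f} ms per metre")  # -> 2.92 ms per metre

# The per-observer extremes reported in the paper bracket this value.
observed_slopes = [2.51, 4.17]
assert observed_slopes[0] < physical_slope_ms_per_m < observed_slopes[1]
```

The average slope of 3.43 ms·m⁻¹ thus corresponds to an implied propagation speed of roughly 290 m/s, a modest underestimate consistent with the compressive distance perception noted above.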

Fig. 2.

Data from the experimental conditions. (A) Psychometric functions for one observer at each of the four simulated auditory distances plotting the proportion of trials in which the visual stimulus was judged to have occurred before the auditory stimulus, as a function of the delay of the auditory stimulus. From left to right, the curves represent the 5-, 10-, 20-, and 40-m conditions. The abscissa shows time measured from the onset of the visual stimulus. (B) The same data as in A, replotted on a logarithmic scale. It is clear from the linear plot (A) that the temporal precision of the audiovisual temporal order judgment decreases with auditory distance. However, the slopes are very similar when plotted on a logarithmic scale, indicating that the precision limit is a constant proportion of auditory distance (i.e., a Weber fraction). (C) The points of subjective audiovisual alignment (the half-height of the psychometric functions) for four observers at each of the four auditory distances. As auditory distance simulated by the direct-to-reverberant energy ratio increased, the auditory stimulus was perceptually aligned with earlier visual events, consistent with subjects using the energy ratio in their alignment judgments. The slopes of the best-fitting linear functions are shown for each observer. The average slope of 3.43 ms·m⁻¹ is approximately consistent with the speed of sound.

To verify that the depth cue underlying the delaying of audiovisual alignment really was the ratio of direct-to-reverberant energies, the audiovisual alignment experiment was repeated by using just the initial 13-ms direct portion of the sound (with no reverberant tail) at the same four amplitude levels used in the first experiment. The results of this control experiment (Fig. 3, ▵) demonstrate first that the reverberant tail is necessary if the point of audiovisual alignment is to be shifted in time, and second that differences in the amplitude of the brief initial portion did not influence audiovisual alignment in our first experiment. Rather, it was the ratio of those initial amplitude changes with respect to the reverberant tail that was crucial to the delaying of audiovisual alignment as auditory depth increased.

Fig. 3.

The results of three control conditions shown for four observers. ▵ indicate audiovisual alignment for the first control in which only the onset burst of the four auditory stimuli was presented (the reverberant tail was removed). The slope of the best-fitting straight line to the averaged data was not significantly different from zero, showing that the reverberant tail is necessary to produce the shifts in subjective alignment seen in the first experiment. ▪ indicate the results of a second condition designed to control for loudness differences in the original experiment by scaling the amplitude of the original 5-m stimulus to match that of the other depths over the first 200 ms (see text). The best-fitting line to the averaged data did not differ significantly from zero in slope, indicating that loudness differences between the four original stimuli did not determine the shifts in subjective alignment. ○ show results from a speeded attention condition. The best-fitting line to the averaged data was not significantly different from zero in slope, showing that focusing attention on the early part of the signal to make a speeded response leads observers to discount the direct-to-reverberant energy ratio. This finding suggests that use of this depth cue in audiovisual alignment is not mandatory and must be task-relevant.

A second control experiment tested whether changes in loudness, caused by the differing amplitudes of the early portion of the signals, may have determined the delaying of audiovisual alignment with increasing auditory distance. Loudness can be used by the auditory system to estimate depth, although without a context to indicate the likely amplitude at source it is a rather poor cue (22), just as sheer size in vision is a poor indicator of depth (24, 25). In generating loudness (perceived sound level), the auditory system integrates signals over a period of ≈200 ms (26, 27). We therefore calculated the average sound level of the first 200 ms of the four conditions used in the original experiment (see Fig. 1A) and expressed these time-averaged levels as ratios relative to that of the 5-m condition. These ratios estimate the perceptual changes in loudness in the early part of the signal across the four auditory depths. The stimuli for this second control experiment were then produced by scaling the entire original 5-m stimulus in amplitude (using the time-averaged level ratios), to create four stimuli whose time-averaged levels match those of the original 5-, 10-, 20-, and 40-m stimuli but whose direct-to-reverberant energy ratios are identical (because all are scaled versions of the 5-m stimulus). The audiovisual alignment task conducted with these control stimuli produced a flat function with a nonsignificant slope (Fig. 3, ▪), indicating that the intensity differences in our original stimuli (and the corresponding changes in perceived loudness) did not produce the shifts in temporal alignment we obtained. Rather, it was the systematic variation in perceived distance, as determined by the direct-to-reverberant energy ratio, which was the essential determinant of our results.
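The loudness-matching construction can be sketched as follows, assuming RMS averaging over the 200-ms loudness integration window; `loudness_matched` is an illustrative helper, not the authors' code.

```python
import numpy as np


def level_db_first_200ms(signal, fs):
    """Time-averaged (RMS) level of the first 200 ms of a signal,
    approximating the ~200-ms integration window of loudness perception."""
    n = int(fs * 0.2)
    return 20 * np.log10(np.sqrt(np.mean(signal[:n] ** 2)))


def loudness_matched(stim_5m, target, fs):
    """Scale the entire 5-m stimulus so its first-200-ms level matches
    that of `target`.  Because one factor multiplies the whole waveform,
    the direct-to-reverberant energy ratio is left unchanged."""
    delta_db = level_db_first_200ms(target, fs) - level_db_first_200ms(stim_5m, fs)
    return stim_5m * 10 ** (delta_db / 20)
```

Applying this to each of the original 10-, 20-, and 40-m stimuli yields four signals that reproduce the early loudness differences of the original set while all carrying the 5-m depth cue, which is exactly the dissociation the control requires.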

In a final manipulation, we repeated the audiovisual alignment task with the original stimuli in a “speeded attention” condition. Participants were told to direct their attention to the onset burst (ignoring the reverberant tail) and to respond as fast as possible while maintaining accuracy. For this reason, responses were rejected as being too slow if their reaction times exceeded 650 ms. Because this time window must include both the alignment decision and the motor response indicating it, it ensures that only the early part of the auditory signal can be processed before an alignment response is required. By responding before the reverberant signal is fully encoded, the direct-to-reverberant ratio as a depth cue should be compromised or even eliminated. Audiovisual alignments for this speeded attention condition (Fig. 3, ○) show no systematic variation across auditory depth, although they do appear to be shifted vertically downward with respect to the other control conditions. These vertical shifts probably are caused by a phenomenon known as the modality shifting effect (28), which describes the considerable delay involved in switching an attentional focus from one sensory modality to another. In previous conditions, attention would have been divided between the visual and auditory modalities to optimally perform the alignment task. In the present condition, because attention was focused on audition, the time required to shift the focus of attention to vision would effectively mean that an apparently simultaneous visual stimulus would actually have a later onset than if vision and audition had been jointly attended. The consequence would be downward-shifted curves (as plotted in Fig. 3, ○).

Discussion

The essential finding in this article is that the auditory lag that results in perceived audiovisual alignment increases as the simulated auditory source distance increases. This finding implies that the brain is able to compensate for the fact that, with increasing source distance, the acoustic signal arising from a real bimodal event will arrive at the perceiver's head at progressively later times than the corresponding visual signal. Together, the results argue against an account of audiovisual alignment based solely on passive and low-level differential delays. On this model, subjective timing is determined simply by the times of physical arrival at the sensory epithelia summed with the respective processing latencies for each sensory modality. The low-level model therefore would predict a common auditory lag for all simulated source distances, one determined by the differential neural processing latencies for vision and audition. Our first experiment clearly demonstrates this is not so: the point of subjective alignment became systematically delayed as simulated auditory distance increased. Thus, the data suggest an active, interpretative process capable of exploiting auditory depth cues to temporally align auditory and visual signals at the moment they occur at their external sources.

The delaying of audiovisual alignment with increasing auditory distance only occurred when the depth cue was the ratio of direct-to-reverberant energy. In the first two control conditions, where loudness was the depth cue and the energy ratio was not available, no change in the point of temporal alignment was observed. Instead, the moment when the auditory and visual signals became available perceptually determined audiovisual alignment (“internal” alignment). By contrast, the shifting alignment point observed in our first experiment demonstrates “external” alignment. That is, as perceived auditory depth increased, observers delayed the point of subjective alignment (effectively, waiting for the acoustic signal to arrive) so that the auditory and visual signals would remain synchronous at the source. Because external alignment requires the brain to ignore a considerable temporal asynchrony between two neural signals (specifically, the late arrival of the auditory signal), it is unlikely to do so unless there is a robust depth cue to guide it, so that spurious alignments are avoided. The reason external alignment could be accomplished in the first experiment was likely because the ratio of direct-to-reverberant energy is a very powerful cue to auditory depth (perhaps bolstered by the naturally covarying level of the direct portion), a far more robust cue than loudness, and a suitable basis on which to attempt external alignment.

The perceptual system therefore appears to have several strategies available for determining audiovisual synchrony. Without a reliable depth cue, the brain will default to aligning signals internally. However, even given reliable depth information, external alignment is not mandatory. Rather, external alignment must be “task relevant” or it will not occur. As our final speeded attention condition demonstrates, focusing attention on the early part of the auditory signal eliminated the external alignment originally obtained with exactly the same stimuli. Thus, the perceptual system exhibits a flexibility in determining audiovisual alignment. The available depth information partly constrains which type of alignment can occur, but there is also a strategic, attentional component guided by the task needs of the perceiver.

The capacity to align external auditory and visual signals is important functionally as it ties them to the environmental event that caused them. Realizing an appropriate perceptual compensation requires knowledge of source distance and speed of sound, as well as of the effects of other more subtle factors that may affect the speed of neural signal processing (for discussion, see ref. 17). The direct-to-reverberant energy ratio provides a reliable auditory distance cue. As for speed of sound, listeners presumably derive an experience-based estimate of the speed of sound that is validated and refined through interaction with the environment. To a first approximation, then, estimates of both quantities required for external alignment are available. The fact that audiovisual alignment varies systematically with simulated auditory distance demonstrates that the brain can make use of these estimates when it is task-relevant to do so.

Acknowledgments

This work was supported by Australian Research Council Grant DP0345797 (to D.A.).

Author contributions: D.A. and S.C. designed research; and D.A. performed research, analyzed data, and wrote the paper.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviation: IRF, impulse response function.

References

