Abstract
Listeners localized the free-field sources of either one or two simultaneous, independently generated noise bursts. Localization performance was better with one sound source than with two. With two sound sources, localization performance was better when the listener was given prior information about the location of one of them. Listeners also localized two simultaneous noise bursts to which sinusoidal amplitude modulation (AM) was applied, with the modulation envelope either in-phase across the two source locations or 180° out-of-phase. The AM was employed to investigate a hypothesis about the process listeners might use to localize multiple sound sources. The results supported the hypothesis that localization of two sound sources may be based on temporal-spectral regions of the combined waveform in which the sound from one source is more intense than that from the other. The interaural information extracted from such temporal-spectral regions might provide reliable estimates of the location of the source that produced the more intense sound in that region.
INTRODUCTION
Although a great deal is known about the ability of listeners to locate a single sound source (see Blauert, 1997, for a thorough review of the spatial hearing literature), far less is known about listeners' abilities in localizing two simultaneous sound sources. This paper deals with human listeners' localization performance when two sound sources in the azimuth plane in the free field produce simultaneous sound.
In order for the auditory system to determine the location of the sources of two simultaneous sounds, the sounds from the sources must differ in some way. Two identical sounds presented from two different azimuth locations interact acoustically to produce the perception of a single source (a phantom source) located midway between the two originating sound sources (see Bauer, 1961). Even when an identical sound from a different source location follows the sound from the first location, only a single sound source is perceived under many conditions (the effects of precedence; see Litovsky et al., 1999).
In this study we used independently generated wideband noise bursts (at least 200 ms in duration) that were presented at the same time from two loudspeakers located at different azimuth angles in the free field. The noise bursts were identical in all other respects. Two independently generated noise bursts cannot easily be distinguished from one another; in fact, it is difficult to make same-different discrimination judgments between two independently generated noise bursts that are 200 ms or longer (for instance, see Hanna, 1984). The stimuli in the present study are not ones that allow experience with the sounds to aid in identification (i.e., as might happen if something like speech were used). The goal was to make the task as difficult as possible, with the rationale that it could be made easier if needed by making the stimuli differ along other dimensions.
Conditions in which independently generated noises are presented simultaneously from different source locations are often referred to as producing a “diffuse” sound field or the perception of a diffuse sound, i.e., the perception of a sound that has a broad spatial extent (see Gardner, 1969; Santala and Pulkki, 2011). Recently, Santala and Pulkki (2011) showed that listeners can locate up to five different loudspeaker locations when independently generated noise is presented from each loudspeaker. While detailed measures of localization acuity were not obtained in that study, their data and those from other studies cited in their paper suggest that listeners can localize two sound sources producing independent samples of noise at the same time. Best et al. (2004) showed in a virtual anechoic environment that listeners could determine whether there were one or two sources when the sounds were concurrently generated independent noise bursts located in a “virtual” azimuth plane. These authors did not have listeners determine the location of the sound sources. The major purpose of the current paper is to document listeners' performance in localizing two sound sources in the azimuth plane producing simultaneously and independently generated noise bursts and to test a hypothesis of how the auditory system might accomplish locating multiple sound sources.
Most sound source localization studies that have used multiple sound sources (see, for instance, Braasch and Hartung, 2002; Croghan and Grantham, 2010; Langendijk et al., 2001; Good and Gilkey, 1996; Good et al., 1997; Hawley et al., 1999; Kopčo et al., 2007; Lee et al., 2009) have investigated the localization of one sound source in the presence of one or more additional sound sources. These studies indicate that performance for locating a target sound source in the presence of competing sound sources is poorer than when the task is to locate just the target sound source. The amount of the sound source localization performance decrement varies from study to study, most likely due to the fairly large differences in stimuli and sound source location procedures used across studies. There are very few data indicating listeners' performance in locating more than just one sound source when multiple sources produce simultaneous sounds, especially if the sounds are independently generated noise bursts. Providing such data is the aim of this study.
The stimulus conditions studied in this paper are similar to those used to study spatial release from masking. Spatial release from masking is the reduction in masking that occurs when a target sound is at a different location from a masking sound source as compared to conditions in which the target and masking sounds originate from the same source. In spatial release from masking studies there are two spatially separated sound sources (target and masker) producing simultaneous sounds as is the case for the conditions of the present study. It is usually assumed that some aspect of processing interaural cues (interaural time differences, ITDs, and/or interaural level differences, ILDs) is responsible for spatial release from masking. These are the same cues that one would use to locate azimuthal sound sources in multisource situations. Thus, better understanding sound source localization for multiple sources might provide useful information about spatial release from masking.
In spatial release from masking studies the listener detects, discriminates a difference in, or recognizes/identifies the target sound in the presence of distractor/masking sound sources. The target and masker/distractor stimuli usually differ significantly (e.g., the target is a sentence and the masker is a speech-shaped noise). The task in this study is sound source location identification, and the stimuli are very similar (in some sense as similar as possible while still being acoustically different). Thus, caution is warranted in generalizing from the conditions of this study to those used in spatial release from masking studies. The focus of this paper is on sound source localization performance; no aspect of signal detection, discrimination, or sound identification was measured in this study.
In spatial release from masking studies, listeners are usually asked to make a response regarding the target sound source, and they do not generally make a response regarding the masker(s). In a sound source localization task the listener could be asked to indicate the source of a target sound in the presence of a distractor sound (similar to spatial release from masking studies and to most multisource localization experiments reported in the literature) or the listener could be asked to indicate the location of both sound sources when two sources produce sound. Both conditions were tested in the present study.
In experiment I listeners were asked to localize in the azimuth plane either a single sound source, a sound source in the presence of another source at a known location, or two sound sources. In the last condition, listeners had no prior knowledge about the location of either sound source, and they were to determine the location of both sound sources.
EXPERIMENT I
Listening environment
Experiments were conducted in an echo-reduced listening room (11 ft × 12 ft) lined with 4-in. acoustic foam (Noise Reduction Coefficient, NRC = 0.9) on all six surfaces, along with special sound treatment on the floor and ceiling. The room contained a 13-loudspeaker (Boston Acoustics 110 x) array arranged in an arc in the front hemifield, 1.67 m from the listening position and at the height of the seated listener's pinnae. Loudspeakers were positioned from −90° to +90° with 15° between adjacent loudspeakers. Loudspeakers 1 and 13 did not produce any sound, but the listeners were not told this. Thus, the loudspeakers that presented sound spanned the range from −75° to +75° in 15° steps.
A small control room adjacent to the listening room housed the control computer, Echo Gina 12-channel DA/AD converters, amplifiers and attenuators for 11 of the 13 loudspeakers, and video monitoring of the subject. Loudspeaker calibration was performed at the location of the subject in the room. All 11 loudspeakers were within ±8 dB of one another across 100 to 15 000 Hz. Additional digital equalization, done on-line for all experiments, reduced the variation to ±2 dB across all frequencies and loudspeakers. Reverberation times (RT60) were determined for each of the 11 loudspeakers at the location of the subject, using broadband noise bursts (500 ms) and 1-ms transients. Broadband RT60 ranged from 90 to 122 ms across the 11 loudspeakers and the two measurement signals; on average, RT60 was 97 ms for the noise and 101 ms for the transient. In an octave band centered at 1000 Hz, RT60 for the noise was on average 324 ms, while in an octave band centered at 4000 Hz the average RT60 was 56 ms.
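The paper does not state how RT60 was computed from the recorded noise bursts and transients. For readers who want to reproduce such a measurement, a minimal sketch of one standard approach (Schroeder backward integration of a recorded impulse response, with a line fit over the −5 to −25 dB portion of the decay) is given below in Python; the function name and fit range are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def rt60_schroeder(ir, fs, fit_range=(-5.0, -25.0)):
    """Estimate RT60 from a recorded impulse response via Schroeder
    backward integration (a sketch, not the authors' stated method)."""
    ir = np.asarray(ir, dtype=float)
    edc = np.cumsum(ir[::-1] ** 2)[::-1]             # energy remaining after t
    edc_db = 10.0 * np.log10(edc / edc[0])           # decay curve in dB
    t = np.arange(ir.size) / fs
    hi, lo = fit_range                               # fit the -5 to -25 dB span
    mask = (edc_db <= hi) & (edc_db >= lo)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # dB per second (negative)
    return -60.0 / slope                             # time to decay by 60 dB
```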
Subjects
Eight listeners (five females and three males, all under the age of 30 years) who reported having normal hearing served as subjects. All procedures used in this study were approved by the Arizona State University Institutional Review Board (IRB) for the protection of human subjects.
Stimuli
All stimuli were generated in MATLAB and presented via the 12-channel Echo Gina DA system at 44 100 Hz per DA channel. Noise bursts were 200 ms in duration, with 20-ms cosine-squared rise/decay times, bandpass filtered between 125 and 6000 Hz with an 8-pole (∼48 dB/octave) Butterworth filter, and presented at 65 dBA (measured at the position of the listener with a Type 1 sound level meter on the slow setting) with a ±2-dB random level rove over loudspeakers and presentations (this 4-dB rove was used to obscure any cues associated with the slight level differences across loudspeakers). A particular noise burst was never presented more than once: noise bursts were independently generated across trials, across presentations within a trial, and across loudspeakers within a presentation.
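As a rough illustration of this stimulus recipe (the authors generated stimuli in MATLAB; the Python/NumPy sketch below is a stand-in, and any detail not stated in the text, such as the exact ramp implementation, is an assumption):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def make_burst(fs=44100, dur=0.200, ramp=0.020, band=(125.0, 6000.0), rng=None):
    """One independently generated noise burst per the experiment I description:
    bandpass noise, cosine-squared ramps, and a random +/-2-dB level rove."""
    rng = rng if rng is not None else np.random.default_rng()
    n = int(dur * fs)
    noise = rng.standard_normal(n)                # fresh noise sample every call
    # order 4 per band edge -> 8 poles total, ~48 dB/octave
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    x = sosfilt(sos, noise)
    nr = int(ramp * fs)                           # 20-ms cosine-squared ramps
    env = np.ones(n)
    env[:nr] = np.sin(0.5 * np.pi * np.arange(nr) / nr) ** 2
    env[-nr:] = env[:nr][::-1]
    x *= env
    rove_db = rng.uniform(-2.0, 2.0)              # +/-2-dB rove (4-dB range)
    return x * 10.0 ** (rove_db / 20.0)
```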
Procedure
General
Listeners were instructed to face straight ahead and look at a red dot fixed to the center loudspeaker (7) at the start of each stimulus presentation, and compliance was monitored via closed-circuit video. Listeners pressed keys on a computer keyboard to initiate stimulus presentations and to make responses. The listeners' task was to identify the loudspeaker or loudspeakers presenting sound, with possible responses in the range of 1 to 13. For consistency across conditions, each trial consisted of two stimulus presentations, even in conditions containing only one sound source. No trial-by-trial feedback was provided. One 10-trial practice block was run prior to testing. Listeners were told that when there were two sounds they would always be presented from different loudspeakers.
One sound source, one source is localized (1S-1L)
After two presentations from the same loudspeaker, listeners indicated the loudspeaker number that corresponded to the perceived sound source location. A total of 275 trials were run (25 trials for each of the 11 loudspeaker locations that presented sound) in five 55-trial blocks.
Two sound sources, one source is localized (2S-1L)
Listeners were told that two loudspeakers would present sound at the same time and that one of the sound sources would always be the center (7) loudspeaker. After two presentations, they were to indicate the location of the loudspeaker that presented the other sound (loudspeakers 1 to 13). In addition to loudspeakers 1 and 13 (see above) loudspeakers 6 and 8 (i.e., those immediately adjacent to the center loudspeaker, 7) also did not produce sound, although listeners were not told this. There were 200 trials (25 trials for each of the eight loudspeakers that presented sound; 2, 3, 4, 5, 9, 10, 11, 12), divided into four, 50-trial blocks.
Two sound sources, two sources are localized (2S-2L)
For each trial, one of eight combinations of loudspeaker locations (see Fig. 1) was chosen at random, and stimuli were presented twice from this loudspeaker pair. Listeners were instructed to indicate the location of one sound after the first presentation and the location of the other sound after the second presentation. There were 200 trials (25 trials for each of the eight loudspeaker pairings) presented in four 50-trial blocks.
Results
The data are plotted as histograms in Figs. 2–4 as the percent of the total trials (across conditions and listeners) in which a perceived loudspeaker location (X axis, 1–13) was indicated (Y axis). Figure 2 shows data for the 1S-1L conditions, Fig. 3 for the 2S-1L conditions, and Fig. 4 for the 2S-2L conditions. Bars with circles indicate correct responses (i.e., the position of the loudspeaker presenting sound). The histogram scale of percent (%) responses is shown at the lower left of the figures. For the 1S-1L condition (Fig. 2) and the 2S-1L condition (Fig. 3) the number of trials was the same as the maximum number of responses, since only one loudspeaker location was reported on each trial. However, for the 2S-2L condition (Fig. 4) there were twice as many possible responses as there were trials, as there were two responses per trial. In all three conditions (Figs. 2–4) percent (%) responses was calculated and displayed by dividing the total number of responses for any particular perceived location by the total number of trials (not responses) for the particular actual loudspeaker pair. For the 2S-2L condition (Fig. 4) this means that the percent responses for any one actual loudspeaker pair can total a maximum of 200%.
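The normalization just described amounts to dividing response counts by trial counts. A minimal sketch (hypothetical helper; responses coded as loudspeaker numbers 1–13):

```python
import numpy as np

def percent_responses(responses, n_trials, n_speakers=13):
    """Percent responses per perceived loudspeaker location, normalized by the
    number of trials (not responses), so a 2S-2L histogram can sum to 200%."""
    counts = np.bincount(np.asarray(responses) - 1, minlength=n_speakers)
    return 100.0 * counts / n_trials
```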
Figure 5 indicates the mean (across subjects and loudspeaker locations) root-mean-square (rms) error (using the “D” calculation of Rakerd and Hartmann, 1986) for the three conditions.1 A one-way repeated-measures analysis of variance (ANOVA) with condition as the factor indicated a statistically significant main effect [F(2,5); p ≪ 0.01], and repeated-measures a priori t tests indicated that the rms error for 2S-2L was statistically greater than the rms error for 1S-1L (p ≪ 0.01) and statistically greater than the rms error for 2S-1L (p < 0.01). Figure 6 indicates the percent of trials in the 2S-2L condition in which listeners correctly located both loudspeaker locations (both correct) or at least one of the two loudspeaker locations (one correct).
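Given the definitions in footnote 1, the D statistic can be computed as below (a sketch under our reading of the footnote: responses and targets are loudspeaker numbers, and A = 15° is the loudspeaker spacing):

```python
import numpy as np

def rms_error_D(responses_by_speaker, A=15.0):
    """Rakerd and Hartmann (1986) 'D' statistic (see footnote 1):
    D_k = A * sqrt(mean((r_i - k)^2)), averaged over loudspeakers k."""
    d_k = [A * np.sqrt(np.mean((np.asarray(r) - k) ** 2))
           for k, r in responses_by_speaker.items()]
    return float(np.mean(d_k))

# e.g., loudspeaker 7 judged as 7, 7, 8, 6 and loudspeaker 4 as 4, 5, 4, 4:
# rms_error_D({7: [7, 7, 8, 6], 4: [4, 5, 4, 4]})
```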
Discussion
Sound source localization performance for locating a single noise source (1S-1L condition) was similar to that obtained by other investigators using listeners with normal hearing (e.g., Grantham et al., 2007; Wightman and Kistler, 1989). The rms errors of approximately 5°–8° and the fact that localization performance is best for locations directly in front of the listener are common findings. All eight listeners performed similarly in the 1S-1L condition.
When two sources presented simultaneous independent noise bursts and the listener knew that one of the sources was the center loudspeaker (2S-1L condition), performance was better than when the listener did not have any prior information about which loudspeaker would present either of the two sounds (2S-2L condition). In the 2S-1L condition sound source localization performance was worse than when listeners were asked to locate only one sound source (1S-1L condition). Thus, it appears that having prior information about the source of one sound when there are two sound sources aids localization. In the 2S-2L condition listeners correctly located both sources on slightly less than half of the trials and correctly located at least one of the two sources on slightly more than 80% of the trials. These data appear to agree qualitatively with those of Santala and Pulkki (2011) in indicating that listeners can localize the sources of two simultaneous and independent noises in the free field, but not as well as they can localize a single sound source.
In many experiments involving multiple sound sources the sounds from the various sources can be identified, making it possible to assign a particular sound to a particular source. For instance, if two words from two different sources were used, the data could be tabulated as the percent perceived loudspeaker location for word one and for word two. This cannot be done in this experiment, since the two sounds (independently generated noise bursts) are barely discriminable (Hanna, 1984); i.e., any one noise burst is not identifiably different from any other noise burst.2 One of the motivations for using independent noise bursts in this study was to provide a baseline for evaluating how being able to identify the sound from a source might influence multiple sound source localization. For instance, how does multiple sound source localization performance compare for speech versus noise, and to what extent are any differences in performance attributable to the ability to identify different speech stimuli but not different noise stimuli?
Given that two simultaneous noise bursts interact acoustically, it might seem surprising that listeners do as well as they do in localizing two independent noise sources. Our results showing that listeners can locate sound sources under these conditions are similar to the results obtained by Santala and Pulkki (2011) in the free field and by Best et al. (2004) in a virtual-listening condition. Several investigators (e.g., Keller and Takahashi, 2005; Meffin and Grothe, 2009; Woodruff and Wang, 2010) have suggested that localization of multiple sound sources might occur because, in the combined waveform, some proportion of temporal-spectral regions might contain high relative levels of the sound from one of the sources. The interaural differences (ILDs and ITDs) in these temporal-spectral regions may provide reliable information about the interaural differences of the sound from that source. When the levels of the sounds from the two sources are about the same within a temporal-spectral region, the interaural cues would not reflect those of either source (the interaural cues would be spurious), in that the interaction of the sound waveforms would obscure the interaural cues associated with the originating sound sources. Perhaps the ability to localize simultaneous sounds from two sources occurs when there are enough temporal-spectral regions in the combined waveform with reliable interaural cues relative to spurious interaural cues. Experiment II was designed to investigate sound source localization when there were differences in level over time from different sound sources.
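The dominance idea can be illustrated numerically. The sketch below is our illustration, not an analysis from the paper; the 20-ms frame and 6-dB criterion are arbitrary choices. It estimates how often one of two independent noise bursts exceeds the other by a sizable margin within cells of a temporal-spectral grid:

```python
import numpy as np
from scipy.signal import stft

fs = 44100
rng = np.random.default_rng(0)
a = rng.standard_normal(fs // 5)        # two independent 200-ms noise bursts
b = rng.standard_normal(fs // 5)

# Short-time spectra on a coarse temporal-spectral grid (20-ms frames)
_, _, A = stft(a, fs=fs, nperseg=882)
_, _, B = stft(b, fs=fs, nperseg=882)
ratio_db = 20.0 * np.log10((np.abs(A) + 1e-12) / (np.abs(B) + 1e-12))

# Fraction of cells in which one source exceeds the other by more than 6 dB
dominant = np.mean(np.abs(ratio_db) > 6.0)
print(f"cells with a >6-dB level advantage for one source: {100*dominant:.0f}%")
```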
In experiment II, the two independent noise bursts were sinusoidally amplitude modulated (SAM). In one case the noise bursts presented to both loudspeakers were modulated with the same envelope phase (in phase). In the other condition the modulation at one loudspeaker was 180° out of phase (out of phase) with that occurring at the other loudspeaker. In the out of phase condition, when the overall level at one loudspeaker was high, the level at the other loudspeaker was low. This is in contrast to the in phase condition in which the overall level at both loudspeakers was always the same. In experiment II listeners were asked to determine the location of the two sound sources as was done for the 2S-2L condition of experiment I. Both in phase and out of phase amplitude modulation between the two loudspeakers of independently and simultaneously generated noise were randomly mixed within a block of trials. The goal of experiment II was to determine if the out of phase condition leads to better localization performance than the in phase condition, and if so how performance changes with modulation rate. If level differences in different temporal regions of the combined waveform from two independently generated noise bursts (unmodulated) are a basis for sound source localization, the temporal regions would probably have to be fairly short, since it is unlikely that there would be long periods of time when two independently generated noises had significant level differences. If so, we hypothesized that fairly high SAM rates would produce better localization performance for the out of phase conditions as compared to the in phase conditions.
EXPERIMENT II
Subjects
Six listeners (four females and two males, all under the age of 30) who reported normal hearing served as subjects in experiment II. Two of the subjects (one male and one female) had also participated in experiment I.
Stimuli
The independent noise bursts were generated as in experiment I, except that the bursts were 500 ms in duration and shaped with 50-ms cosine-squared rise/decay times. The noise burst duration was increased to 500 ms in experiment II to allow for the use of slow modulation rates. The noise bursts were sinusoidally amplitude modulated (SAM) at rates of 5, 50, 200, and 500 Hz (depth of modulation was always 100%). In the in phase condition the envelope phase for both sounds was 0°, whereas in the out of phase condition one sound was generated with 0° envelope phase and the other with 180° envelope phase. The 50-ms rise/decay times were used to reduce the effect of onset cues.
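A minimal sketch of the modulation itself (a Python stand-in for the authors' MATLAB generation; the modulator's starting phase within the burst is not specified in the text and is an assumption here):

```python
import numpy as np

def apply_sam(x, fs, rate_hz, phase_deg=0.0):
    """100%-depth sinusoidal amplitude modulation; phase_deg=180 yields the
    envelope used at the second loudspeaker in the out of phase condition."""
    t = np.arange(x.size) / fs
    phi = np.deg2rad(phase_deg)
    return x * (1.0 + np.cos(2.0 * np.pi * rate_hz * t + phi))

# left  = apply_sam(noise_left,  fs=44100, rate_hz=50, phase_deg=0)    # in phase
# right = apply_sam(noise_right, fs=44100, rate_hz=50, phase_deg=180)  # out of phase
```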
Procedure
The same procedure used in experiment I for the 2S-2L condition was used in experiment II. The same eight loudspeaker pairs were used as in experiment I, and the pairs and the in phase and out of phase conditions were randomly mixed within a block of 64 trials (4 repetitions of the 8 loudspeaker pairs by two modulation phase conditions). Six of these 64-trial blocks (384 trials) were run, providing 24 observations for each loudspeaker pair by modulation phase condition. This procedure was repeated in separate blocks for each of the four SAM rates (5, 50, 200, 500 Hz). As in experiment I, each pair of noise bursts was presented once and the listener indicated the loudspeaker number (1–13) of the source presenting either one of the sounds; then the noise bursts were repeated (with a different set of independent noise bursts) and the listener indicated the other loudspeaker location. The process for monitoring head position at the start of each stimulus presentation and the absence of feedback were the same as in experiment I. One 10-trial practice block using a SAM rate of 5 Hz was run at the very beginning of experiment II.
Results
Figures 7–10 present the histograms of loudspeaker location responses using the same format as Fig. 4. Figure 7 presents data for a SAM rate of 5 Hz, Fig. 8 for a rate of 50 Hz, Fig. 9 for a rate of 200 Hz, and Fig. 10 for a rate of 500 Hz. In Figs. 7–10 the dark bars are the responses for the in phase envelope conditions and the lighter bars for the out of phase conditions. The circles indicate the loudspeaker location of the presenting source for the in phase conditions and the triangles for the out of phase conditions. Figure 11 presents the rms error (see Fig. 5) for the in phase and out of phase conditions as a function of SAM rate. For the data of Fig. 11, a two-factor repeated-measures ANOVA [F(1,3,5)] with modulation rate and in phase vs out of phase conditions as the factors indicated that both main effects were significant (p < 0.01) and that there was a significant interaction (p < 0.05). A priori repeated-measures t tests indicated that the in phase errors were statistically greater than the out of phase errors for 5-Hz (p ≪ 0.01), 50-Hz (p ≪ 0.01), and 200-Hz (p < 0.05) rates of modulation, but not at 500 Hz (p > 0.05). Figure 12 indicates, for the in phase and out of phase conditions, the proportion of correct responses as a function of SAM rate in correctly locating both (both) or at least one (one) source on each trial (see Fig. 6). For the data of Fig. 12, a three-way repeated-measures ANOVA was computed with modulation rate, out of phase vs in phase conditions, and both vs at least one correct as the factors [F(1,4,1,5)]. The three main effects were significant (p < 0.01), and there was one significant interaction (p < 0.05), of modulation rate with in phase vs out of phase modulation. In calculating repeated-measures t tests, the proportion correct data were averaged across the both and one correct conditions in order to determine how the statistical differences between the out of phase vs in phase conditions varied with modulation rate. These a priori repeated-measures t tests indicated that proportion correct was statistically greater for the out of phase than for the in phase conditions at 5-Hz (p ≪ 0.01), 50-Hz (p ≪ 0.01), and 200-Hz (p < 0.05) modulation rates, but not at 500 Hz (p > 0.05).
Discussion
All measures of localization performance (Figs. 7–12) suggest that localization is more accurate for the out of phase conditions than for the in phase conditions at the 5, 50, and 200-Hz SAM rates. By a SAM rate of 500 Hz, performance for locating two sound sources is essentially the same for the out of phase and in phase conditions, and the data from experiment II at a 500-Hz SAM rate are similar to those obtained in experiment I for the unmodulated 2S-2L condition (see Fig. 5). These results would suggest that at a 500-Hz modulation rate the level differences produced by out of phase modulation are too fast to aid sound source localization. However, the data for the 500-Hz rate of modulation (and to a limited extent the 200-Hz rate) need to be considered in light of the fact that spectral components of the carrier below the modulation rate (here, below 500 Hz) are not sinusoidally amplitude modulated. As a result, changing the phase of the envelope (in phase vs out of phase, as was done in experiment II) may not produce the same amplitude differences over time in this spectral region that occur where the carrier is sinusoidally amplitude modulated. Thus, for spectral components of the carrier lower in frequency than the modulation rate, the out of phase conditions of experiment II will usually not result in the level from one loudspeaker being more intense at the same time the level from the other loudspeaker is less intense, and there would be no advantage for localization in the out of phase condition relative to the in phase condition in this spectral region. If localization performance in these tasks is dominated by information in the low-frequency region below 500 Hz, then the poor performance at a 500-Hz modulation rate may not be due solely to the overall fast modulation of the level from the two sources; it may be due, at least in part, to the lack of sinusoidal amplitude modulation below 500 Hz. Below 500 Hz the main differences in level between the sounds from the two sources in temporal-spectral cells are those due to the carrier noise bursts themselves and are not a function of the additional modulation provided by the 500-Hz sinusoidal modulator. While modulation in level due to the modulator would be present above 500 Hz, the data of Figs. 11 and 12 suggest that modulation in these spectral regions does not assist sound source localization when the sounds from the two sources are modulated out of phase at a rate of 500 Hz. But perhaps the key spectral region for modulation to provide maximum benefit for sound source localization is below 500 Hz.
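This point about components below the modulation rate can be checked numerically. In the sketch below (our demonstration; the band edges are arbitrary), a 500-Hz SAM noise is filtered into a narrow band below the modulation rate and a wider band above it, and the Hilbert envelope of each band is correlated with the 500-Hz modulator; the imposed envelope survives in the band above the modulation rate but not in the narrow band below it:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 44100
t = np.arange(int(0.5 * fs)) / fs
rng = np.random.default_rng(1)
fm = 500.0
modulator = 1.0 + np.cos(2.0 * np.pi * fm * t)
modulated = rng.standard_normal(t.size) * modulator   # 100% SAM noise

def band(sig, lo, hi):
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, sig)

for lo, hi, label in [(150, 400, "band below fm"), (2000, 3000, "band above fm")]:
    env = np.abs(hilbert(band(modulated, lo, hi)))    # Hilbert envelope
    r = np.corrcoef(env, modulator)[0, 1]
    print(f"{label}: correlation with 500-Hz modulator r = {r:.2f}")
```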
In order to address this issue, and before additional discussion of experiment II is provided, experiment III will be described. In experiment III, the independent noise bursts were high-pass filtered to the range 1500 to 6000 Hz, and all other conditions were the same as those used in experiments I and II. Since there is very little spectral energy below 500 Hz in these filtered noise bursts, all spectral components in the passband (above 1500 Hz) would be sinusoidally modulated when the modulator was a 500-Hz sinusoid. Three conditions were tested: sound source localization of unmodulated, high-pass noise in the 2S-2L condition used in experiment I, and sound source localization of 500-Hz sinusoidally amplitude modulated high-pass noise in the out of phase and in phase conditions used in experiment II. The logic of experiment III is that if information below 500 Hz is crucial for sound source localization of two independently generated noise bursts, then localization performance for the high-pass noise bursts should be worse than that for the wideband condition that includes information below 500 Hz. If performance is poorer in the high-pass conditions of experiment III, then the poor performance measured in experiment II for the 500-Hz modulation rate out of phase condition is likely due, at least in part, to the lack of sinusoidal amplitude modulation in the spectral region below 500 Hz for the wideband noise used in experiment II.
EXPERIMENT III
Subjects
Four of the six subjects of experiment II, including the two from experiment I, served in experiment III. There were two males and two females.
Stimuli
All of the stimulus conditions of experiments I and II were used in experiment III, with the exception that the noise bursts were filtered between 1500 and 6000 Hz using the same 8-pole Butterworth filter described in experiment I. As in experiment I, the unmodulated noise in experiment III was 200 ms in duration with 20-ms rise/decay times, while the amplitude modulated noises were 500 ms in duration with 50-ms rise/decay times, as in experiment II.
Procedure
The same procedures used in experiment I for the 2S-2L condition and those used for the in phase and out of phase conditions of experiment II were used in experiment III.
Results
The data for experiment III are shown in Fig. 13 as mean rms error for the three conditions tested in experiment III (white bars), compared to the data for these four listeners from experiments I (unmodulated) and II (in phase and out of phase 500-Hz modulation), represented by the black bars. A two-way repeated-measures ANOVA (with experiment, III vs I and II, and condition as the factors) was conducted, and there were no statistically significant main effects or interactions at the 0.05 level of significance [F(1,2,3)].
Discussion
The data of Fig. 13 suggest that localization performance for locating two sound sources does not differ between noise bursts that are wideband (125 to 6000 Hz) and those that are high-pass filtered (1500 to 6000 Hz). This implies that spectral regions below 1500 Hz do not provide more important information for sound source localization than spectral regions above 1500 Hz. Thus, the data of experiment II at the 200 and 500-Hz rates of modulation are probably not overly influenced by the fact that spectral components of the noise carrier below the modulation rate would not have been sinusoidally amplitude modulated. Overall the data suggest that, up to rates of at least 200 Hz, listeners are better able to localize two independent and simultaneously presented noise bursts when there are brief moments when the levels of the two sources differ (out of phase) than when the levels are the same (in phase). This suggests that time “windows” as brief as 5 ms may be sufficient to provide reliable estimates of the interaural differences used to localize two sound sources. In fact, these temporal windows may be even shorter than 5 ms, since the levels were changing as a sinusoidal function of time and localization performance was not measured for rates between 200 and 500 Hz.
Comparing the data from the unmodulated 2S-2L conditions (experiment I) to the modulated in phase conditions (experiments II and III) indicates that there were few, if any, differences in localization performance between conditions in which there was no amplitude modulation (Figs. 4–6) and conditions in which the amplitude modulation was in phase at both loudspeakers (Figs. 7–13). Sinusoidally modulating the amplitude of the broadband noise bursts identically at both loudspeakers did not alter localization performance relative to conditions in which the sounds were not amplitude modulated. This suggests that, for these wideband noise bursts, there is no additional benefit for localization in the free field associated with amplitude modulation per se (see Bernstein and Trahiotis, 2003, for a discussion of amplitude modulation and binaural processing, and Eberle et al., 2000, for evidence that amplitude modulation may not assist sound source localization in the free field).
At slow rates of SAM in the out of phase condition, the perceived location of the sound moves from one of the loudspeakers presenting the sound to the other: as the intensity of the sound at the two loudspeakers alternates, so does the perceived location. We conducted a pilot study to address this, in which four of the listeners from experiment II were asked to indicate whether their perception of the position of the loudspeaker presenting a sound changed from one loudspeaker to another during a trial. SAM rates of 3, 6, 9, 12, 18, 21, and 24 Hz were used. All listeners indicated that for SAM rates of 18 Hz or less the perceived source of the sound did appear to change location, two of the four listeners indicated that the perceived location changed at 21 Hz, and no listener indicated that the perceived location changed at 24 Hz. The finding that perceived changes in location do not appear above about 20–25 Hz is consistent with the “binaural sluggishness” literature (see, for instance, Grantham and Wightman, 1978a,b), which indicates that changes in perceived laterality due to changes in ITD and/or ILD cues do not occur at rates higher than approximately 25 Hz. Such a change in perceived location may have aided the listener in locating the two sources in the out of phase condition at the 5-Hz SAM rate, but not at the three higher SAM rates (50, 200, and 500 Hz).
In most investigations of sound source localization of a single source, the sound is presented only once. Localizing two sources is a far more challenging task. In pilot work we found that listeners' performance improved when two presentations of the sound sources were provided as compared to only one presentation, and that presenting the sounds more often than twice did not tend to lead to additional improvement. Asking the listeners to respond with one location after the first presentation and the other location following the second presentation resulted in better performance than when the two responses were made after the second presentation. Listeners reported that they had difficulty switching their attention from one source to another when there was only one presentation or when they had to respond with both locations at the end of the second presentation. They reported that the two presentation procedure we used improved their ability to attend to one and then the other source. In our experience, the results of this study would most likely have indicated poorer localization performance for locating two sound sources had only a single presentation been used.
While head movements were not tightly controlled in these experiments, listeners were very good at looking forward at the red dot on the center loudspeaker before each presentation and rarely had to be reminded to do so; the few reminders that were given occurred during the second presentation in the two-sound-source conditions. Once the stimulus was presented the listeners often did move their heads, but with durations of 500 ms or less there is not enough time to turn the head a full ±90° to face a sound source while it is still presenting sound. However, it is possible that for the SAM conditions, in which the signal duration was 500 ms, listeners would have performed slightly more poorly had head movements been more stringently controlled.
In experiment I the duration of the noise bursts was 200 ms and the rise/decay times were 20 ms. In experiments II and III the duration was lengthened to 500 ms and the rise/decay times to 50 ms when the noise bursts were amplitude modulated. Pilot work indicated that maximal performance in the out of phase modulation conditions occurred for modulation rates up to around 10 Hz, and in order to measure sound source localization performance in the out of phase modulation conditions we wanted to be sure there was more than one period of modulation; this necessitated durations greater than 200 ms. In the out of phase amplitude modulation conditions, the sound from one source comes on before that from the other source, temporal conditions similar to those used in studies of precedence. In order to avoid effects of precedence (see Litovsky et al., 1999) influencing the results, we used a long (50-ms) rise/decay time, since the precedence literature suggests that precedence effects are minimal at delays of 50 ms or longer.
As mentioned in the Introduction, extracting ITD and ILD cues from two simultaneous sounds presented from different source locations is related to issues of spatial release from masking. In spatial release from masking experiments the listener is almost always asked to detect, discriminate, or identify some aspect of a target sound presented from a known target location. The location of masking sound sources may or may not be known to the listener. It is often argued, especially in cases of informational masking (see Freyman et al., 1999), that a listener's ability to localize the sources of the target and/or maskers plays a role in any spatial release from masking that is measured. To the extent that sound source localization is important for spatial release from masking, the data of the present study suggest that, in localizing two sound sources, performance is better when a listener knows where one of the sources is located than when the listener has no prior information about the location of either source.
The data from the current study are consistent with the hypothesis that localization of two sound sources occurs when there are moments in time and regions of the spectrum in which the levels of the two sounds are not the same. Meffin and Grothe (2009) made this argument and suggested that in the real world the modulation of the sound level due to differences in the sounds from the sources would vary at a faster rate than changes in level resulting from movement of a source or of the animal itself. They suggested that a circuit in the dorsal nucleus of the lateral lemniscus (DNLL) might filter out the fast changes in the rate at which interaural cues vary across the sources (changes that are spurious with respect to source location), leaving only the slower rates associated with source or self motion. Woodruff and Wang (2010) demonstrated that good sound source segregation for speech in a reverberant environment could be achieved when a computational algorithm was used to extract interaural time and level differences in the cells of a temporal-spectral matrix of the target speech mixed with the reverberation. Liu et al. (2000) demonstrated that when multiple sound sources are present, temporal-spectral processing and directional microphones improve localization performance in their proposed hearing aid. There have been other computational approaches to “tracking” multiple sound sources producing speech (e.g., Faller and Merimaa, 2004; Dietz et al., 2011). In some of these approaches a measure of interaural correlation (coherence) is determined in different spectral-temporal regions; in those regions where the interaural correlation is high, a form of ITD and/or ILD analysis may be carried out to indicate the position of one sound source or another. In a study of detection/identification, Kopčo and Shinn-Cunningham (2008) suggested using peaks in the modulation envelope to find periods of time when the interaural values are likely to be informative about the location of one source or the other in a spatial release from masking context. Thus, the general idea of using interaural cues in spectral-temporal regions of the combined waveform when two or more sources present simultaneous sound has been suggested by several investigators.
Figure 14 indicates that there is usable localization information in cells of a temporal-spectral matrix of the combined waveform of independent noise bursts from two sources presented with 5-Hz SAM that is 180° out of phase between the two loudspeakers (the out of phase, 5-Hz SAM condition of experiment II). In Fig. 14 the temporal-spectral cells had time widths3 of 20 ms (non-overlapping) and spectral widths (non-overlapping) of 1 ERB (center frequencies from 100 to 8385 Hz). The matrix was extracted for the sound presented alone from each loudspeaker and for the combined waveform when the noise bursts were presented simultaneously. The matrices were extracted from left and right KEMAR channel recordings when the sources were ±45° left and/or right of midline (loudspeakers 4 and 10 in our setup). The ILDs were based on the left and right KEMAR recorded rms levels computed for each temporal-spectral cell for the two sources and for the combined waveform. The ITD was computed by finding the interaural delay that generated the highest cross-correlation value (see Woodruff and Wang, 2010); the time shift yielding the maximal cross-correlation value for each cell was the estimated ITD for that cell. If the ITD or ILD value in a cell for the combined waveform was within ±10% of that for the corresponding cell for the right sound source, that cell in the temporal-spectral matrix of Fig. 14 was made white (right). If there was ±10% agreement for the left sound source, the cell was made black (left). Otherwise the cell was gray. Thus, black and white cells indicate temporal-spectral cells in the combined waveform that contain “reliable” estimates of the ITD and ILD cues associated with the left or right sound source, while gray cells indicate cells with “spurious” ITD and ILD cues. ITD values were only tabulated for spectral regions below about 1350 Hz and ILD values for spectral regions above about 1600 Hz. The out of phase SAM waveforms are shown between the two temporal-spectral matrices. As can be seen, the black and white cells align with the moments when the level at the corresponding loudspeaker is high, and spurious (gray) cells occur when the levels from the two sources are nearly the same.
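A sketch of this cell-by-cell analysis is given below. It is a reconstruction from the description above, not the authors' code: the ERB formula is the Glasberg and Moore approximation, the ITD sign convention and filter order are our choices, and `classify` implements the ±10% rule used for Fig. 14.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def erb_edges(f_lo=100.0, f_hi=8385.0):
    """Non-overlapping 1-ERB band edges (Glasberg and Moore ERB-number scale)."""
    def hz_to_erb(f):
        return 21.4 * np.log10(4.37e-3 * f + 1.0)
    def erb_to_hz(e):
        return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    return erb_to_hz(np.arange(hz_to_erb(f_lo), hz_to_erb(f_hi), 1.0))

def cell_cues(left, right, fs, band, t0, t1):
    """ILD (dB, from rms levels) and ITD (s, from the peak of the
    cross-correlation) for one 20-ms by 1-ERB cell of a binaural recording."""
    sos = butter(2, band, btype="bandpass", fs=fs, output="sos")
    l = sosfiltfilt(sos, left)[int(t0 * fs):int(t1 * fs)]
    r = sosfiltfilt(sos, right)[int(t0 * fs):int(t1 * fs)]
    ild = 20.0 * np.log10(np.sqrt(np.mean(l ** 2)) / np.sqrt(np.mean(r ** 2)))
    xcorr = np.correlate(l, r, mode="full")
    itd = (np.argmax(xcorr) - (l.size - 1)) / fs  # sign convention illustrative
    return ild, itd

def classify(mix_cue, left_cue, right_cue, tol=0.10):
    """Label a cell by the +/-10% agreement rule of Fig. 14."""
    if abs(mix_cue - left_cue) <= tol * abs(left_cue):
        return "left"        # black cell
    if abs(mix_cue - right_cue) <= tol * abs(right_cue):
        return "right"       # white cell
    return "spurious"        # gray cell
```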
While Fig. 14 does not provide a prediction (or a model) of human listeners' ability to localize two sound sources, it does demonstrate that a temporal-spectral representation of the combined waveform contains interaural information similar to that associated with the spectral-temporal interaural cues of one or the other sound source presented alone. Any model of how the human auditory system processes multiple sound source localization based on an analysis of spectral-temporal regions of the combined waveform would need to address several issues. For instance: (1) How do differences across the spectrum (as opposed to across time, as measured in the present study) influence sound source localization when more than one source produces simultaneous sound? (2) What are the durations and spectral widths of the spectral-temporal regions, and are they the same duration and width for all stimulus conditions? (3) Are all spectral-temporal regions analyzed, or is there a process/mechanism that indicates which regions should be analyzed (e.g., using interaural correlation/coherence)? (4) How is the information in the spectral-temporal regions combined (integrated) across time and/or frequency to arrive at an estimate of the interaural time and level differences that may be associated with one of the sound sources? (5) For a single broadband sound such as noise, the interaural differences are not constant across time and frequency (this is especially true of interaural level differences measured across frequency, i.e., the result of the head shadow). In order for interaural measures of the combined waveform to be informative about the location of one source or the other, some way must be provided of establishing that the pattern of interaural changes measured for the combined waveform across time and frequency “matches” those of a single source at a particular location. Simply knowing the ITD or ILD in a particular spectral-temporal cell cannot by itself always predict the location of a sound source. For instance, head shadow produces a frequency-dependent ILD, such that multiple azimuthal sound source locations can produce the same ILD in different frequency regions (see Kuhn, 1987; Macaulay et al., 2010); thus, one needs information about both the ILD and the spectrum to predict an azimuth location. Answers to most of these questions would be required to provide a model of how human listeners locate the sources of noise bursts as used in the current experiments. However, the analysis shown in Fig. 14 suggests that interaural information in spectral-temporal cells of the combined waveform in which the level of sound from one source is significantly greater than that from the other source may provide reliable information about the interaural values for the same spectral-temporal region of one source or the other.
SUMMARY
In summary, listeners can localize the sources of independently and simultaneously generated noise bursts, but not as well as they can localize a noise burst from a single source. Providing prior information about the source of one of the two sounds leads to better localization performance than when no such prior information is provided. This paper presents data based on amplitude modulated noise bursts that support the hypothesis that multiple sound source localization might be based on using temporal-spectral regions in which the level from one source is greater than that from another source to extract reliable estimates of the interaural values of one source or the other.
ACKNOWLEDGMENTS
This research was partially supported by an NIDCD grant awarded to Michael Dorman (W.A.Y. is an investigator on this grant), by an NIDCD grant awarded to Sid Bacon and C.A.B., and by an AFOSR grant awarded to W.A.Y. The hard work of ASU undergraduates Britta Martinez and Brian Shock, as well as that of Farris Walling, a former AuD student at ASU, is much appreciated, as is the interaction with Michael Dorman, Tony Spahr, Kate Helms-Tillery, and Sid Bacon.
Footnotes
1. The D calculation of Rakerd and Hartmann (1986) was used: $D_k = A\left[\frac{1}{M}\sum_{i=1}^{M}(r_i - k)^2\right]^{1/2}$, where A is the angular separation of adjacent loudspeakers (15°), M is the number of responses per condition, $r_i$ is the response (1–13) on the ith trial, and k is the loudspeaker location (2–12). D is the average of $D_k$ over all k loudspeakers (2–12).
2. In calculating the rms error for the 2S-2L conditions, an assumption was made that errors were minimized. For instance, if the actual loudspeaker locations were 4 and 10 and the responses were 3 and 9, we assumed that perceived location 3 corresponded to actual location 4, and not 10. But, given that the stimuli were unidentifiable independently generated noise bursts, it is not possible to know for sure whether response 3 did correspond to actual location 4.
3. A temporal window of 20 ms was chosen since a SAM rate of 50 Hz (the reciprocal of 20 ms) produced excellent localization performance in all conditions (see Fig. 8).
References
- Bauer, B. B. (1961). “Phasor analysis of some stereophonic phenomena,” J. Acoust. Soc. Am. 33, 1536–1539. 10.1121/1.1908492
- Bernstein, L. R., and Trahiotis, C. (2003). “Enhancing interaural-delay based extents of laterality at high frequencies by using ‘transposed stimuli,’ ” J. Acoust. Soc. Am. 113, 3335–3347. 10.1121/1.1570431
- Best, V., van Schaik, A., and Carlile, S. (2004). “Separation of concurrent broadband sound sources by human listeners,” J. Acoust. Soc. Am. 115, 324–336. 10.1121/1.1632484
- Blauert, J. (1997). Spatial Hearing (MIT Press, Cambridge, MA), 494 pp.
- Braasch, J., and Hartung, K. (2002). “Localization in the presence of a distractor and reverberation in the frontal horizontal plane I. Psychoacoustic data,” Acta Acust. Acust. 88, 942–955.
- Croghan, N. B. H., and Grantham, D. W. (2010). “Binaural interference in the free field,” J. Acoust. Soc. Am. 127, 3085–3091. 10.1121/1.3311862
- Dietz, M., Ewert, S. D., and Hohmann, V. (2011). “Auditory model based direction estimation of concurrent speakers from binaural signals,” Speech Commun. 53, 592–605. 10.1016/j.specom.2010.05.006
- Eberle, G., McAnally, K. I., Martin, R. L., and Flanagan, P. (2000). “Localization of amplitude modulated high-frequency noise,” J. Acoust. Soc. Am. 107, 3568–3571. 10.1121/1.429428
- Faller, C., and Merimaa, J. (2004). “Source localization in complex listening situations: Selection of binaural cues based on interaural coherence,” J. Acoust. Soc. Am. 116, 3075–3081. 10.1121/1.1791872
- Freyman, R. L., Helfer, K. S., McCall, D. D., and Clifton, R. K. (1999). “The role of perceived spatial separation in the unmasking of speech,” J. Acoust. Soc. Am. 106, 3578–3588. 10.1121/1.428211
- Gardner, M. B. (1969). “Image fusion, broadening, and displacement in sound localization,” J. Acoust. Soc. Am. 46, 339–349. 10.1121/1.1911695
- Good, M. D., and Gilkey, R. H. (1996). “Sound localization in noise: The effect of signal-to-noise ratio,” J. Acoust. Soc. Am. 99, 1108–1117. 10.1121/1.415233
- Good, M. D., Gilkey, R. H., and Ball, J. M. (1997). “The relation between detection in noise and localization in noise in the free field,” in Binaural and Spatial Hearing in Real and Virtual Environments, edited by R. H. Gilkey and T. R. Anderson (Lawrence Erlbaum Associates, Mahwah, NJ), pp. 349–376.
- Grantham, D. W., Ashmead, D. H., Ricketts, T. A., Labadie, R. F., and Haynes, D. S. (2007). “Horizontal-plane localization of noise and speech signals by postlingually deafened adults fitted with bilateral cochlear implants,” Ear Hear. 28, 524–541. 10.1097/AUD.0b013e31806dc21a
- Grantham, D. W., and Wightman, F. L. (1978a). “Detectability of varying interaural temporal differences,” J. Acoust. Soc. Am. 63, 511–521. 10.1121/1.381751
- Grantham, D. W., and Wightman, F. L. (1978b). “Detectability of a pulsed tone in the presence of a masker with time-varying interaural correlation,” J. Acoust. Soc. Am. 65, 1509–1518. 10.1121/1.382915
- Hanna, T. E. (1984). “Discrimination of reproducible noise as a function of bandwidth and duration,” Percept. Psychophys. 36, 409–416. 10.3758/BF03207495
- Hawley, M. L., Litovsky, R. Y., and Colburn, H. S. (1999). “Speech intelligibility and localization in a multi-source environment,” J. Acoust. Soc. Am. 105, 3436–3448. 10.1121/1.424670
- Keller, C. H., and Takahashi, T. T. (2005). “Localization and identification of concurrent sounds in the owl's auditory space,” J. Neurosci. 25, 10446–10461. 10.1523/JNEUROSCI.2093-05.2005
- Kopčo, N., Best, V., and Shinn-Cunningham, B. G. (2007). “Sound localization with a preceding distractor,” J. Acoust. Soc. Am. 121, 420–432. 10.1121/1.2390677
- Kopčo, N., and Shinn-Cunningham, B. G. (2008). “Influences of modulation and spatial separation on detection of a masked broadband target,” J. Acoust. Soc. Am. 124, 2236–2250. 10.1121/1.2967891
- Kuhn, G. F. (1987). “Physical acoustics and measurements pertaining to directional hearing,” in Directional Hearing, edited by W. A. Yost and G. Gourevitch (Springer-Verlag, New York), pp. 3–26.
- Lee, A. K. C., Deane-Pratt, A., and Shinn-Cunningham, B. G. (2009). “Localization interference between components in an auditory scene,” J. Acoust. Soc. Am. 126, 2543–2555. 10.1121/1.3238240
- Litovsky, R. Y., Colburn, H. S., Yost, W. A., and Guzman, S. (1999). “The precedence effect,” J. Acoust. Soc. Am. 106, 1633–1654. 10.1121/1.427914
- Liu, C., Wheeler, B. C., O'Brien, W. D., Bilger, R. C., Lansing, C. R., and Feng, A. S. (2000). “Localization of multiple sources with two microphones,” J. Acoust. Soc. Am. 108, 1888–1905. 10.1121/1.1290516
- Macaulay, E. J., Hartmann, W. M., and Rakerd, B. (2010). “The acoustical bright spot and mislocalization of tones by human listeners,” J. Acoust. Soc. Am. 127, 1440–1449. 10.1121/1.3294654
- Meffin, H., and Grothe, B. (2009). “Selective filtering to spurious localization cues in the mammalian auditory brainstem,” J. Acoust. Soc. Am. 126, 2437–2454. 10.1121/1.3238239
- Rakerd, B., and Hartmann, W. M. (1986). “Localization of sound in rooms III: Onset and duration effects,” J. Acoust. Soc. Am. 80, 1695–1706. 10.1121/1.394282
- Santala, O., and Pulkki, V. (2011). “Directional perception of distributed sound sources,” J. Acoust. Soc. Am. 129, 1522–1530. 10.1121/1.3533727
- Wightman, F. L., and Kistler, D. J. (1989). “Headphone simulation of free-field listening. II: Psychophysical validation,” J. Acoust. Soc. Am. 85, 868–887. 10.1121/1.397558
- Woodruff, J., and Wang, D. L. (2010). “Sequential organization of speech in reverberant environments by integrating monaural grouping and binaural localization,” IEEE Trans. Audio, Speech, Lang. Process. 18, 1856–1866. 10.1109/TASL.2010.2050087