Discrimination and streaming of speech sounds based on differences in interaural and spectral cues

Marion David; Mathieu Lavandier; Nicolas Grimault; Andrew J Oxenham

2017 Sep 27;142(3):1674–1685. doi: 10.1121/1.5003809

PMCID: PMC5617732 PMID: 28964066

Abstract

Differences in spatial cues, including interaural time differences (ITDs), interaural level differences (ILDs) and spectral cues, can lead to stream segregation of alternating noise bursts. It is unknown how effective such cues are for streaming sounds with realistic spectro-temporal variations. In particular, it is not known whether the high-frequency spectral cues associated with elevation remain sufficiently robust under such conditions. To answer these questions, sequences of consonant-vowel tokens were generated and filtered by non-individualized head-related transfer functions to simulate the cues associated with different positions in the horizontal and median planes. A discrimination task showed that listeners could discriminate changes in interaural cues both when the stimulus remained constant and when it varied between presentations. However, discrimination of changes in spectral cues was much poorer in the presence of stimulus variability. A streaming task, based on the detection of repeated syllables in the presence of interfering syllables, revealed that listeners can use both interaural and spectral cues to segregate alternating syllable sequences, despite the large spectro-temporal differences between stimuli. However, only the full complement of spatial cues (ILDs, ITDs, and spectral cues) resulted in obligatory streaming in a task that encouraged listeners to integrate the tokens into a single stream.

I. INTRODUCTION

Understanding speech in complex auditory backgrounds relies on our ability to perceptually organize competing sound sources into streams. In the case of speech, the sounds emanating from a target speaker must be grouped together (integration) and separated from the competing background (segregation) to be intelligible (Bregman, 1990). In an early study, Cherry (1953) demonstrated that spatial separation between a target speaker and a masker can improve speech recognition. Cherry used dichotic presentation, with the target presented to one ear and the masker presented to the other. In real auditory environments, localization in both the median and horizontal planes is achieved via more subtle cues, such as interaural time and level differences (ITDs and ILDs, respectively) and monaural spectral differences (Blauert, 1997; Wightman and Kistler, 1992). These cues can be characterized via the head-related transfer function (HRTF; e.g., Gardner and Martin, 1995).

Many studies have investigated streaming using ITDs and ILDs (Hartmann and Johnson, 1991; Darwin and Hukin, 1999; Gockel et al., 1999; Oxenham, 2000; Roberts et al., 2002; Sach and Bailey, 2004; Kidd et al., 2005; Stainsby et al., 2011; Füllgrabe and Moore, 2012). Fewer studies have investigated the effect of spectral cues produced by simulating sounds from different locations. However, those that have studied the effects of spectral spatial cues, independent of binaural cues, have found that alternating sequences of broadband noise bursts can be perceptually segregated based on small spectral differences between the stimuli (Middlebrooks and Onsan, 2012). Stream segregation based on these spectral cues can also be obligatory (David et al., 2014; David et al., 2015), in that segregation occurs even in situations where listeners are instructed to integrate the sequences into a single stream; for a discussion of voluntary and obligatory streaming, see Micheyl and Oxenham (2010).

Although subtle spectral cues may be sufficient to segregate spectrally uniform noise bursts, it is not clear if this finding generalizes to more realistic stimuli, such as speech. First, the spectral variations in speech might make the spectral cues from spatial location less reliable. Second, the voiced portions of speech contain primarily low-frequency information, which will be less affected by the high-frequency spectral cues associated with spatial differences.

In the present study, the stimuli were speech sounds consisting of both unvoiced (fricative consonant) and voiced (vowel) parts. These consonant-vowel (CV) tokens were naturally uttered and randomly concatenated to form interleaved sequences. David et al. (2017) used the same stimuli to show that differences in fundamental frequency (F0), which affected primarily the lower-frequency voiced part of the stimulus, could induce streaming of the entire CV. To avoid a potentially confounding effect of F0 differences in the present study, all the stimuli had the same F0, while maintaining the natural variations in the spectral and temporal envelopes of speech. One question posed by the present study was whether the spectral cues that primarily affect the higher-frequency portions of the stimulus can also lead to streaming of the entire CV. Another question was the extent to which binaural cues in the horizontal plane contribute to stream segregation, over and above the monaural spectral cues that are also available in the horizontal plane (David et al., 2014). The experiments related to these questions were preceded by a discrimination task to ensure that listeners could perceive the differences induced by imposing different spatial or spectral cues on the stimuli. Depending on the cues available, these differences could be differences in spectrum (coloration) and/or perceived position.

II. EXPERIMENT 1: DISCRIMINATION TASK

A. Rationale

The aim of the discrimination task was to assess the extent to which listeners can perceive a difference in spatial or spectral cues between successive speech tokens, with and without between-token variability. In the horizontal plane, all the spatial cues (spectral differences, ILD and ITD) were available for the listener to discriminate the stimuli. Neither ILD nor ITD would be substantially affected by variability in the spectra of the tokens, so we predicted that listeners' discrimination performance should not be substantially affected. However, changes in source location within the median plane produce only spectral differences, which are more likely to be susceptible to interference by spectral variability between the tokens themselves. We used non-individualized HRTFs to produce changes in spectral cues that are representative of those elicited by stimuli presented at different elevations.

B. Method

1. Stimuli

The stimuli used were a subset of those used by David et al. (2017). The naturally uttered CV tokens (male voice) consisted of four different fricative consonants ([f], [s], [th] and [sh]) combined with nine different vowels ([æ], [e], [iː], [I], [ə], [ɛ], [ʌ], [ɑ] and [uː]). The stimuli were truncated to 160 ms by shortening both the fricative consonant and the vowel, so that each portion was approximately equal in length. The truncated segment was then gated on and off with 10-ms raised-cosine ramps. The F0 of the voiced portions was flattened to 110 Hz using the software Praat (Boersma and Weenink, 2017) and then the stimuli were resynthesized using a pitch synchronous overlap-add technique (PSOLA), widely used for F0 manipulations of speech sounds, which has minimal effect on the spectral shape of the CV tokens. This process equalized the F0, while preserving the natural spectral- and temporal-envelope variations of the speech stimuli.
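The gating step described above can be illustrated with a minimal sketch. It assumes that the tokens were sampled at the 44 100-Hz playback rate given in the Procedure and that the ramps are the usual half-period raised cosine; the function name `raised_cosine_gate` is hypothetical and is not taken from the authors' code.

```python
import math

def raised_cosine_gate(n_samples, ramp_samples):
    """Return per-sample gains: raised-cosine onset, flat middle, mirrored offset."""
    if 2 * ramp_samples > n_samples:
        raise ValueError("ramps are longer than the signal")
    gains = [1.0] * n_samples
    for i in range(ramp_samples):
        # Half a period of a raised cosine, rising from 0 towards 1.
        g = 0.5 * (1.0 - math.cos(math.pi * i / ramp_samples))
        gains[i] = g                  # onset ramp
        gains[n_samples - 1 - i] = g  # offset ramp (time-reversed copy)
    return gains

SAMPLE_RATE_HZ = 44100  # assumption: same as the playback rate
TOKEN_MS = 160          # token duration after truncation
RAMP_MS = 10            # onset and offset ramp duration

n_token = round(SAMPLE_RATE_HZ * TOKEN_MS / 1000)  # 7056 samples
n_ramp = round(SAMPLE_RATE_HZ * RAMP_MS / 1000)    # 441 samples
gate = raised_cosine_gate(n_token, n_ramp)
```

Multiplying a truncated token sample by sample with `gate` would apply the 10-ms onset and offset ramps while leaving the central 140 ms untouched.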

The stimuli were filtered with non-individualized HRTFs (Gardner and Martin, 1995) to simulate different positions in the horizontal and median planes. It is worth noting that the spectral cues associated with elevation might not have been necessarily attributed to clear perceived positions by the listeners due to the use of non-individualized HRTFs. Nevertheless, the spectral differences introduced by these HRTFs should be representative of those experienced by normal-hearing listeners.
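Filtering with an HRTF amounts to convolving the monaural signal with a left-ear and a right-ear head-related impulse response (HRIR). The sketch below shows that operation under stated assumptions: the toy impulse responses are hypothetical placeholders, not measured responses, and the direct-form convolution is chosen for clarity rather than speed.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def spatialize(mono, hrir_left, hrir_right):
    """Return (left, right) ear signals for one simulated source position."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Hypothetical toy HRIRs: the right ear receives the sound two samples
# later (an ITD) and at half the amplitude (an ILD of about 6 dB).
TOY_HRIR_LEFT = [1.0]
TOY_HRIR_RIGHT = [0.0, 0.0, 0.5]

left, right = spatialize([1.0, -1.0, 0.5], TOY_HRIR_LEFT, TOY_HRIR_RIGHT)
```

With measured HRIRs, the frequency-dependent shape of each response would additionally impose the monaural spectral cues that are the focus of the median-plane conditions.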

The excitation patterns (Glasberg and Moore, 1990) of three processed tokens with the same vowel but different consonants ([sha], [fa] and [tha]) simulated at 0° azimuth and 0° elevation are presented in the left panel of Fig. 1. The right panel of Fig. 1 illustrates the mean excitation patterns of the spectrum, averaged across all tokens used in the study, simulated at three different positions in the median plane (0°, 30°, and 70°). The spectra in the left panel illustrate the large high-frequency variability from token to token, even when they share the same vowel and are presented with the same fixed F0. Indeed, comparing the left and right panels of Fig. 1, the spectral differences from token to token are often larger than the spectral differences induced by a difference in simulated position in the median plane.

FIG. 1. Excitation patterns produced by different combinations of tokens and simulated spatial positions. The left panel shows the excitation patterns of three different tokens used in this study, simulated at 0° azimuth, 0° elevation. The dotted line corresponds to [sha], and the solid black and grey lines correspond to [fa] and [tha], respectively. The right panel shows mean excitation patterns of all the tokens used in this study, simulated at three different positions in the median plane. The black and grey solid lines correspond to 0° and 70°, respectively, and the dotted line corresponds to 30° elevation.

2. Listeners

Sixteen listeners participated in the experiment (12 females, 4 males, aged from 18 to 28 years, median = 21). All of them were native speakers of American English, had normal hearing (i.e., pure-tone audiometric thresholds better than 20 dB hearing level (HL) at octave frequencies between 250 and 8000 Hz), and were paid for their participation. All listeners provided written informed consent and the protocol was approved by the Institutional Review Board of the University of Minnesota.

3. Procedure

A three-interval forced-choice procedure was used in which two stimuli were presented from a simulated location directly ahead (0° azimuth and elevation) and one stimulus was presented at a different simulated location in either the horizontal or median plane. The order of the three stimuli was selected at random on each trial and the stimuli were separated by 500-ms inter-stimulus intervals. The task involved indicating which of the three stimuli came from a different location. Six angles were tested in both planes: ±5°, ±10°, and ±30° in the horizontal plane, and ±10°, ±30°, +50°, and +70° in the median plane. In the constant-token condition, one speech token was selected at random on each trial and the same speech token was presented in all three intervals. In the different-token condition, three different speech tokens were selected at random (without replacement) on each trial and presented in the three intervals. Thus, in the constant-token condition, any change in the stimulus signified a change in simulated location, whereas in the different-token condition each interval involved spectral changes. Correct-answer feedback was provided after each trial.
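The structure of a single trial can be sketched as follows. This is an illustration of the procedure as described, not the AFC implementation used in the study; the token labels, the dictionary layout, and the function names are hypothetical.

```python
import random

HORIZONTAL_ANGLES = [-30, -10, -5, 5, 10, 30]  # degrees azimuth
MEDIAN_ANGLES = [-30, -10, 10, 30, 50, 70]     # degrees elevation
CONSONANTS = ["f", "s", "th", "sh"]
N_VOWELS = 9
# Hypothetical labels: (consonant, vowel index) for the 36 CV tokens.
TOKENS = [(c, v) for c in CONSONANTS for v in range(N_VOWELS)]

def make_trial(angle, constant_token, rng):
    """Build one three-interval trial: two reference intervals at 0 degrees
    and one target interval at `angle`, in random order."""
    target_interval = rng.randrange(3)
    if constant_token:
        tokens = [rng.choice(TOKENS)] * 3  # same token in all intervals
    else:
        tokens = rng.sample(TOKENS, 3)     # three tokens, no replacement
    positions = [0, 0, 0]
    positions[target_interval] = angle
    return {"tokens": tokens, "positions": positions, "target": target_interval}

def is_correct(trial, response):
    """Score a response given as an interval index (0, 1, or 2)."""
    return response == trial["target"]
```

Because one of three intervals is the target, a listener who guesses would be correct on one third of the trials, which is the chance level used in the Results.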

The listeners completed two sessions of two hours each. Each session contained two separate blocks, one with constant tokens and one with different tokens. One session was used to test all conditions in the horizontal plane, and the other session was used to test all conditions in the median plane. The orders of the two sessions (horizontal/median) and two blocks within each session (constant/different) were counterbalanced across the 16 listeners. For the constant-token conditions, four repetitions of each position and each token were presented, so that each listener completed 864 trials (4 repetitions × 6 angles × 36 tokens) in total. Listeners completed the same number of trials (864) for the different-token conditions, but the tokens were selected at random on each trial. Both sessions took place in a sound-attenuating booth. The stimulus presentation and response collection were controlled using the AFC software package (Ewert, 2013) under MATLAB (MathWorks, Natick, MA). The stimuli were converted to analog signals using a Lynx22 (Lynx Studio Technology, Costa Mesa, CA) 24-bit soundcard at a sampling rate of 44 100 Hz and were presented at 65 dB sound pressure level (SPL) via HD 650 headphones (Sennheiser, Old Lyme, CT).
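The trial count and the counterbalancing can be checked with a short sketch. The exact assignment scheme is not stated in the text, so the rotation through the four orderings below is an assumption, as is the simplification that a listener keeps the same block order in both sessions.

```python
from itertools import product

N_REPETITIONS = 4
N_ANGLES = 6
N_TOKENS = 36

# Every (repetition, angle, token) combination in the constant-token condition.
constant_token_trials = list(
    product(range(N_REPETITIONS), range(N_ANGLES), range(N_TOKENS))
)

SESSION_ORDERS = [("horizontal", "median"), ("median", "horizontal")]
BLOCK_ORDERS = [("constant", "different"), ("different", "constant")]
ORDERINGS = list(product(SESSION_ORDERS, BLOCK_ORDERS))  # 2 x 2 = 4 orderings

def assign_orderings(n_listeners):
    """Rotate through the orderings so each is used equally often
    (assumed scheme; requires n_listeners to be a multiple of 4)."""
    return [ORDERINGS[i % len(ORDERINGS)] for i in range(n_listeners)]

assignments = assign_orderings(16)
```

With 16 listeners and four orderings, each combination of session order and block order is assigned to four listeners.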

C. Results

The proportions of correct responses were transformed into rationalized arcsine units (RAU) (Studebaker, 1985) to make them more suitable for parametric statistical analyses. The results, averaged across listeners, are shown in Fig. 2. The dashed line represents chance level and the black and grey circles represent the results from the constant- and different-token conditions, respectively.
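For reference, the rationalized arcsine transform can be sketched as below. The count-based form follows the formula commonly attributed to Studebaker (1985); the proportion-based form is its large-sample limit. Which variant the authors applied is not stated, so both are shown, and the function names are hypothetical.

```python
import math

def rau_from_counts(n_correct, n_trials):
    """Rationalized arcsine units from a count of correct responses."""
    theta = (math.asin(math.sqrt(n_correct / (n_trials + 1)))
             + math.asin(math.sqrt((n_correct + 1) / (n_trials + 1))))
    return (146.0 / math.pi) * theta - 23.0

def rau_from_proportion(p):
    """Large-sample limit of the transform, taking a proportion in [0, 1]."""
    return (146.0 / math.pi) * 2.0 * math.asin(math.sqrt(p)) - 23.0

# Chance level for a three-interval task.
chance_rau = rau_from_proportion(1.0 / 3.0)
```

The proportion-based form maps 50% correct to exactly 50 RAU and maps the one-in-three chance level to approximately 34.21 RAU, the value quoted for chance in the median-plane analysis.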

The results in the horizontal plane are shown in the left panel of Fig. 2. A three-way repeated-measures analysis of variance (ANOVA) was performed with the RAU-transformed percent-correct values as the dependent variable and the condition (constant or different tokens), absolute angle (5°, 10°, and 30°), and hemisphere (negative/left or positive/right) as within-subjects factors. There were significant main effects of absolute angle [F(2,60) = 271.0, p < 0.001] and hemisphere [F(1,30) = 4.92, p = 0.03], but no effect of condition [F(1,30) = 1.20, p = 0.28]. The two-way interaction between absolute angle and hemisphere was significant [F(2,60) = 6.41, p = 0.003]. No other interactions were significant (p > 0.26 in all cases). These outcomes reflect the improvement in performance with increasing absolute angle and the slight asymmetry between the results from the left and right hemispheres, but no significant difference in performance between the constant- and different-token conditions.

The results from the median plane are shown in the right panel of Fig. 2. A two-way repeated-measures ANOVA was performed with the RAU-transformed percent-correct values as the dependent variable and the condition (constant or different tokens) and angle (−30°, −10°, +10°, +30°, +50°, and +70°) as within-subjects factors. Both main effects were highly significant [Condition: F(1,15) = 77.7, p < 0.001; Angle: F(1,15) = 118, p < 0.001], as was their interaction [F(1,15) = 59.8, p < 0.001]. Listeners performed significantly better when the stimuli did not vary from token to token. One-sample t-tests revealed that only performance for the stimuli simulated at +30° was not significantly above chance (33% or 34.21 RAU) in the different-token condition. For all other angles in this condition, mean performance was slightly but significantly above chance (p < 0.008 in all cases), even when accounting for multiple (6) comparisons using a Bonferroni correction (α = 0.05/6 = 0.0083). Even though performance was generally quite poor in the different-token conditions, with mean scores between 37 and 56 RAU, there was some evidence that discrimination was still possible in the median plane.
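The comparison against chance can be sketched as a one-sample t statistic combined with a Bonferroni-adjusted criterion. The scores below are hypothetical values chosen only to make the example run; they are not data from the study, and the function names are likewise illustrative.

```python
import math
import statistics

def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-test significance criterion under a Bonferroni correction."""
    return family_alpha / n_comparisons

def one_sample_t(scores, reference):
    """t statistic for the mean of `scores` against a fixed reference value."""
    n = len(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    return (statistics.mean(scores) - reference) / standard_error

CHANCE_RAU = 34.21                          # chance level quoted in the text
alpha_per_test = bonferroni_alpha(0.05, 6)  # six angles tested

# Hypothetical RAU scores for five listeners at one angle (not real data).
hypothetical_scores = [40.0, 38.0, 42.0, 36.0, 44.0]
t_value = one_sample_t(hypothetical_scores, CHANCE_RAU)
```

In an actual analysis, the p value for `t_value` with n − 1 degrees of freedom would be compared with `alpha_per_test` rather than with 0.05.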

D. Discussion

In the horizontal plane, performance improved as the difference in simulated position increased between the reference (0° azimuth) and the target. Regardless of whether the tokens were constant or different within each trial, a separation of 5° was sufficient to enable their discrimination. This level of performance is expected, given that the minimum audible angle (MAA) for broadband sounds is typically around 2.5° (Perrott and Pacheco, 1989), and that the primary cues for localization in the horizontal plane are ITD and ILD, which are not affected by whether the tokens are different or the same.

In the median plane, overall performance was poorer and the difference between the constant- and different-token conditions was greater. The poorer overall performance is expected, given that minimum audible angles in the median plane are generally higher, at around 4° to 9° (Perrott and Saberi, 1990). In addition, non-individualized HRTFs give a good approximation of the binaural cues (ILD and ITD) but are less accurate for the spectral cues produced by the pinnae, which vary substantially between individuals. Thus, because non-individualized HRTFs were used, differences in source elevation were potentially only perceived as a change in spectral coloration rather than a shift in the perceived location of the source. The HRTFs may also explain why performance was generally better at the 50° separation than at the 70° separation. A comparison of the differences in excitation patterns (Glasberg and Moore, 1990) between 0° and 50° and between 0° and 70° shows that the overall differences were greater for the smaller angle, with a mean absolute level difference of 2.85 dB for the smaller angle difference compared with an absolute level difference of about 2.31 dB for the larger angle difference (see Fig. 3).

FIG. 3. Differences in excitation patterns following filtering by the HRTF between sounds incident from 0° and 50° (black curve), and from 0° and 70° (grey curve). The larger absolute difference between 0° and 50° may explain why average listener performance was better when the B tokens were presented from 50° than when the tokens were presented from 70°.
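The summary statistic used in this comparison, a mean absolute level difference between two excitation patterns, can be sketched as follows. How the authors sampled the frequency axis and which frequency range they averaged over is not stated, so the sketch simply averages over whatever channels are supplied; the level values are hypothetical.

```python
def mean_abs_level_difference(pattern_a_db, pattern_b_db):
    """Mean of |A - B| in dB across matching frequency channels."""
    if len(pattern_a_db) != len(pattern_b_db):
        raise ValueError("patterns must have the same number of channels")
    if not pattern_a_db:
        raise ValueError("patterns must not be empty")
    diffs = [abs(a - b) for a, b in zip(pattern_a_db, pattern_b_db)]
    return sum(diffs) / len(diffs)

# Hypothetical excitation levels (dB) in four channels, not measured data.
reference_0deg = [60.0, 55.0, 50.0, 45.0]
elevated = [58.0, 56.0, 47.0, 45.0]
difference_db = mean_abs_level_difference(reference_0deg, elevated)
```

Applied to the excitation patterns at 0° and at each elevation, a statistic of this kind would yield single values comparable to the 2.85 dB and 2.31 dB reported above.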

The large detrimental effect of varying the tokens between intervals can be explained by the fact that the spectral differences between tokens interfered with the spectral differences imposed by the HRTFs, which were the only discrimination cue available for conditions in the median plane. Nevertheless, some discrimination from the reference remained possible at most tested elevations, leaving open the possibility that these cues could be used for auditory stream segregation, even in the presence of spectral variability of the tokens. This result is broadly consistent with the findings of Rakerd et al. (1999), who found that listeners were able to identify sounds with different spectral shapes when they all originated from the same location in space but were less able to perform the task when the location of the sounds in the median plane was randomly varied across presentations. However, using sound sources in real space (rather than simulated HRTFs), they also found that sound localization was possible even when listeners were not able to identify the sounds. The following experiment tested whether streaming was still possible with non-individualized HRTFs and with stimuli that varied in spectral shape between tokens.