J. Acoust. Soc. Am. 2009 Nov;126(5):2522–2535. doi: 10.1121/1.3238242

Role of binaural hearing in speech intelligibility and spatial release from masking using vocoded speech

Soha N. Garadat, Ruth Y. Litovsky, Gongqiang Yu, and Fan-Gang Zeng
PMCID: PMC2787072  PMID: 19894832

Abstract

A cochlear implant vocoder was used to evaluate the relative contributions of spectral and binaural temporal fine-structure cues to speech intelligibility. In Study I, stimuli were vocoded and then convolved with head-related transfer functions (HRTFs), removing the speech temporal fine structure but preserving binaural temporal fine-structure cues. In Study II, the order of processing was reversed to remove both speech and binaural temporal fine-structure cues. Speech reception thresholds (SRTs) were measured adaptively in quiet and with interfering speech, for unprocessed and vocoded speech (16, 8, and 4 frequency bands), under binaural or monaural (right-ear) conditions. Under binaural conditions, SRTs increased as the number of bands decreased. As the number of frequency bands decreased, greater benefit from spatial separation of target and interferer was observed, especially in the 8-band condition. The present results demonstrate a strong role for binaural cues with spectrally degraded speech, when the target and interfering speech are more likely to be confused. The nearly normal binaural benefits under the present simulation conditions and the lack of a processing-order effect further suggest that preservation of binaural cues is likely to improve performance in bilaterally implanted recipients.

INTRODUCTION

Cochlear implants (CIs) have been highly successful at providing hearing to profoundly deaf individuals. As a result of continual progress made in this advanced technology, auditory perception in recipients has improved significantly in the past few decades. Today, most CI users are able to perform well in quiet listening situations. However, their performance deteriorates considerably in the presence of background noise and competing speech (Skinner et al., 1994; Muller-Deile et al., 1995; Battmer et al., 1997; Stickney et al., 2004). Numerous studies that focus on performance in unilateral CI users have attempted to identify some of the factors that can account for this deterioration, including the role of speech coding strategies and number of frequency bands (e.g., Gantz et al., 1988; Waltzman et al., 1992; Dorman and Loizou, 1997; Friesen et al., 2001; Stickney et al., 2004). In an alternative approach, bilateral CIs have been provided to a growing number of recipients, with the hope that stimulation of both ears will lead to improved performance in difficult listening situations. Results to date suggest that many bilateral CI users perform better at understanding speech in adverse listening conditions when using two CIs compared with a single CI (e.g., Schleich et al., 2004; Iwaki et al., 2004; Litovsky et al., 2004; 2006; 2009; Tyler et al., 2002). However, despite this improved performance, bilateral CI users are still considerably challenged in dynamic listening situations. In addition, there remains a gap in performance between bilateral CI users and normal hearing listeners (NHLs). The reasons for the gap remain to be understood.

When addressing the deficit in speech intelligibility that is experienced by CI users in the presence of noise or competing speech, the complexity of the everyday auditory scene should be considered alongside the possibility that performance is limited by signal processing in the prosthetic devices. In real-world listening, the signal and unwanted “competing” sounds may overlap spectrally and temporally, as well as spatially. Often, there is also acoustic variability in the auditory environment that may increase the similarity between a target sound and a competing source, rendering extraction of the target signal rather difficult. The difficulty associated with source segregation under such conditions is often attributed to informational masking (Neff, 1995; Brungart, 2001; Kidd et al., 2002). Although the effects of informational masking can be decreased by introducing dissimilarity between the target and interferer (Durlach et al., 2003), this approach may not be feasible in many listening situations because of the unpredictability of auditory environments.

Overcoming informational masking can be achieved if listeners have access to a variety of other auditory cues. For example, NHLs exploit spectral (Assmann and Summerfield, 1990, 1994; Bird and Darwin, 1998; Vliegen and Oxenham, 1999) as well as temporal (Tyler et al., 1982; Buss and Florentine, 1985; Bacon et al., 1998; Summers and Molis, 2004) cues to segregate overlapping and competing auditory streams. It is also well known that NHLs can take advantage of spatial cues to segregate speech from competing sounds. This is manifested as an improvement of as much as 12 dB in speech reception thresholds (SRTs) when target speech and competitors are spatially separated compared with situations in which they are co-located. This benefit is known as spatial release from masking (SRM), and is an effect that has been studied extensively in NHLs (Freyman et al., 1999; Arbogast et al., 2005; Hawley et al., 1999; 2004; Drullman and Bronkhorst, 2000; Litovsky, 2005).

Spatial cues appear to become more prominent under conditions in which informational masking is relatively large (Kidd et al., 1998; Arbogast et al., 2002). This suggests that spatial hearing plays a crucial role in helping listeners to overcome informational masking. Given the growing number of bilateral CI users, the extent to which spatial cues can be made available to these listeners is a timely question with regard to addressing the gap in performance noted above. The contribution of spatial cues can be explored in these individuals by controlling the inputs to the two ears and comparing performance under bilateral vs unilateral listening modes. A recent study by Loizou et al. (2009) has shown that, compared with NHLs, bilateral CI users are less capable of taking advantage of binaural cues for source segregation, in particular, under conditions of informational masking. This may be due to the fact that CIs have limited spectral resolution (Friesen et al., 2001) and ineffective encoding of F0 information (Stickney et al., 2007). The novelty of the study of Loizou et al. (2009) lies in the tighter stimulus control utilized by presenting binaural stimuli directly to the CI users’ processors, with spatially appropriate stimuli that were convolved with head related transfer functions (HRTFs).

Limitations in performance of participants in the study of Loizou et al. (2009) are of interest here, as they may have arisen from two factors that are highly difficult to control in CI users. One factor is the lack of obligatory coordination between specific pairs of electrodes across the two ears, which would have reduced the extent to which binaural cues could be preserved with fidelity upon reaching the brainstem. A second issue arises when participants whose auditory system has undergone periods of auditory deprivation are tested. Disruptions in the neural processing mechanisms are likely to be present and to contribute to variability in performance within the population of CI users, leading to difficulty in identifying and understanding mechanisms involved in the processing of binaural cues under complex listening conditions.

CI vocoders can offer a powerful tool for investigating effects of CI signal processing independently of other confounds inherent in cochlear implantation. In the current study, a CI vocoder was utilized to investigate whether limitations in performance on spatial auditory tasks that are observed in bilateral CI users are due to the signal processing itself. One of the main issues addressed in the present study is whether CI users are susceptible to informational masking that is borne out of crude signal processing in their prosthetic devices. This issue was investigated by using testing conditions that represent simple but realistic everyday listening situations, chosen so that informational masking in the non-CI conditions would be small. Spondaic target words were presented in the presence of sentences, a combination of target and interferer that deliberately creates relatively easy testing conditions. This approach enabled a systematic examination of a number of critical factors related to speech intelligibility in adverse listening conditions, akin to those that occur with CI processors when a limited number of frequency bands are available. Specifically, speech intelligibility and SRM were evaluated using spectrally degraded stimuli, under binaural and monaural listening conditions. Of particular interest in this study was the extent to which CI signal processing might impact the role of binaural hearing in providing benefits on measures of speech intelligibility and SRM.

Listening conditions in this study utilized “virtual space” techniques (e.g., Hawley et al., 2004; Loizou et al., 2009) such that all acoustic stimuli were convolved with HRTFs1 to introduce more realistic, perceptually spatialized and separated target and competing stimuli. By controlling the stage at which stimuli were convolved with HRTFs, effects of signal processing and CI vocoding can be examined independently of the potential loss of binaural cues. Given that one of the future goals in bilateral CIs is to design and provide systems that capture and mimic the way that acoustic information is transmitted in NHLs, the present study could shed light on factors that could potentially enhance vs impair outcomes related to binaural squelch, binaural summation, and the head-shadow effect.

STUDY I

In this study, conditions that are more idealized relative to true CI listening were examined by first processing the speech stimuli through the vocoders and subsequently convolving the output through the HRTFs. This approach is akin to a situation in which a NHL is presented with spectrally degraded stimuli through loudspeakers in a room, an approach that has previously been used to investigate effects of spectral degradation on speech perception but without considering effects of binaural hearing and∕or spatial cues (e.g., Shannon et al., 2002; Başkent and Shannon, 2007). The current study was designed to preserve as many cues as possible that would be naturally available to listeners for SRM. These include cues that are known to be available to bilateral CI users to some extent, such as head shadow and envelope interaural time differences (van Hoesel, 2004). In addition, we could preserve cues that contribute to spatialized percepts through temporal fine structure, an important binaural cue that is lost in CI processing. While the original speech fine structure in any band has been replaced with a tone, with the idealized order of processing applied here, the new fine structure is filtered through the HRTFs and thus contains the acoustic cues that are used for spatialization.

By preserving the fine-structure cues, it was assumed that there should be sufficient spatial information to acquire the classic release from masking for spectrally degraded stimuli; hence, informational masking that is created by signal processing can be evaluated with limited confounds.

Material and methods

Listeners

Nine NHLs (three male, six female; age range 19–25 years) participated. All subjects were native speakers of English and had pure-tone thresholds better than 15 dB HL at octave frequencies from 250 to 8000 Hz. Participants signed a consent form approved by the University of Wisconsin-Madison Institutional Review Board and were paid for their participation. Testing was conducted in five two-hour sessions.

Signal processing

Speech signals with a bandwidth between 300 and 10300 Hz2 were bandpass filtered into 4, 8, or 16 contiguous frequency bands (see Table 1) by sixth-order Butterworth filters using a MATLAB software simulation of CI signal processing strategies (e.g., Shannon et al., 1995). Briefly, the envelope was extracted from each band by full-wave rectification and low-pass filtering at 50 Hz with a second-order Butterworth filter. The extracted envelope was used to amplitude modulate a sinusoidal carrier at the band’s center frequency, which was then passed through the same bandpass filter as the analysis filter to remove spectral splatter. All bands were summed and then convolved with HRTFs (Gardner and Martin, 1994) to create perceptually spatialized and virtually separated targets and interferers. For each stimulus (target or interferer), the carrier tones in the right and left ears were in phase. The phase relationship between the carrier tones for target and interferer waveforms was arbitrary. Target and interfering stimuli were then summed and presented to the listeners through headphones (Sennheiser HD 580) under binaural and monaural (right-ear) conditions. In the vocoded speech conditions, target and interfering sentences were processed in the same manner.
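
For concreteness, the following is a minimal NumPy/SciPy sketch of the tone-vocoder stage described above. It is an illustration written for this description, not the authors' MATLAB implementation; the sampling rate, band list, and filter bookkeeping are assumptions, and the band edges would be taken from Table 1.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def tone_vocode(x, fs, bands):
    """Tone vocoder: band-pass analysis, 50-Hz envelope extraction, and
    re-synthesis of each band as an amplitude-modulated tone."""
    t = np.arange(len(x)) / fs
    # second-order Butterworth low-pass at 50 Hz for envelope smoothing
    sos_env = butter(2, 50.0 / (fs / 2), btype="low", output="sos")
    out = np.zeros(len(x))
    for lo, fc, hi in bands:                       # (Lf, Cf, Hf) rows of Table 1
        # SciPy doubles the order of band-pass designs, so N=3 yields the
        # sixth-order Butterworth analysis filter described above
        sos_bp = butter(3, [lo / (fs / 2), hi / (fs / 2)],
                        btype="band", output="sos")
        band = sosfilt(sos_bp, x)
        # envelope: full-wave rectification followed by 50-Hz low-pass filtering
        env = sosfilt(sos_env, np.abs(band))
        # amplitude-modulate a tone at the band's center frequency, then
        # re-filter with the analysis filter to remove spectral splatter
        out += sosfilt(sos_bp, env * np.sin(2 * np.pi * fc * t))
    return out

# e.g., the 4-band condition of Table 1 (values in Hz):
# bands_4 = [(300, 574, 848), (848, 1445, 2042), (2042, 3341, 4640), (4640, 7470, 10300)]
# vocoded = tone_vocode(speech, fs=44100, bands=bands_4)
```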

Table 1.

List of cutoff frequencies (in Hz). Lf, Cf, and Hf denote the low, center, and high frequency of each band.

Band   16-band                   8-band                    4-band
       Lf      Cf      Hf        Lf     Cf     Hf          Lf     Cf     Hf
  1    300     350     400       300    411    521         300    574    848
  2    400     460.5   521       521    686    848         848    1445   2042
  3    521     595     669       848    1089   1330        2042   3341   4640
  4    669     758.5   848       1330   1686   2042        4640   7470   10300
  5    848     957     1066      2042   2566   3091
  6    1066    1198    1330      3091   3866   4640
  7    1330    1490.5  1651      4640   5784   6927
  8    1651    1845.5  2042      6927   8613   10300
  9    2042    2279    2516
 10    2516    2803.5  3091
 11    3091    3441    3791
 12    3791    4215.5  4640
 13    4640    5156.5  5673
 14    5673    6300    6927
 15    6927    7688.5  8450
 16    8450    9375    10300
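
The band edges in Table 1 lie close to equal-width steps along a Greenwood-type cochlear frequency-position map spanning 300 to 10300 Hz (the spacing is only approximately logarithmic). The sketch below assumes the commonly used Greenwood constants and reproduces the tabled edges only to within a few percent; it is an illustration of how such band edges can be generated, not the authors' band-allocation code.

```python
import numpy as np

def greenwood_edges(f_lo=300.0, f_hi=10300.0, n_bands=16,
                    A=165.4, a=2.1, k=0.88):
    """Band edges spaced equally along a Greenwood cochlear map."""
    # invert f = A*(10**(a*x) - k) to find cochlear position x at each corner
    x_lo = np.log10(f_lo / A + k) / a
    x_hi = np.log10(f_hi / A + k) / a
    x = np.linspace(x_lo, x_hi, n_bands + 1)   # equal steps in cochlear position
    return A * (10.0 ** (a * x) - k)           # map back to frequency (Hz)

print(np.round(greenwood_edges(n_bands=16)))
# ~[300, 397, 515, 659, ..., 10300]; Table 1 lists 300, 400, 521, 669, ...
```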

Stimulus materials and virtual spatial configuration

Target stimuli consisted of a closed set of 25 spondees recorded in our laboratory with a male talker and presented in quiet as well as in the presence of competing speech. The interferer stimuli were sentences from the Harvard IEEE corpus (Rothauser et al., 1969) recorded with a different male talker than the target. Thirty sentences were strung together, and segments were randomly chosen and played for 6 s per trial. The target words began approximately 1.5 s after the onset of the competing sentence. On each trial, the 25 spondees were visually presented to the subjects on a computer monitor. Subjects were instructed to respond by using a mouse button to select the appropriate target word. Feedback was provided following each response by highlighting the correct word on the computer screen in front of the listener.

Given that all stimuli were convolved through HRTFs to enable virtual spatial separation of target and interfering speech, data were collected for each subject using the following location combinations: (1) quiet: target at 0° azimuth and no interferer, (2) front: target and interferer both at 0° azimuth, (3) right: target at 0° azimuth and interferer at 90° azimuth, and (4) left: target at 0° azimuth and interferer at −90° azimuth.
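
A minimal sketch of the "virtual space" step for these configurations is shown below: each (vocoded) signal is convolved with a left- and right-ear head-related impulse response for the desired azimuth, and the resulting binaural signals are mixed. The load_hrir() helper and file layout are hypothetical placeholders; the HRIRs themselves would come from a measured catalog such as that of Gardner and Martin (1994).

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(x, hrir_left, hrir_right):
    """Convolve a mono signal with left/right HRIRs; returns an (N, 2) array."""
    left = fftconvolve(x, hrir_left)[: len(x)]
    right = fftconvolve(x, hrir_right)[: len(x)]
    return np.column_stack([left, right])

# Hypothetical usage for configuration (3) above: target at 0 deg, interferer
# at +90 deg, with both signals padded to a common length beforehand.
# load_hrir() is a placeholder for reading an HRIR pair from the catalog.
# hrir_t = load_hrir(azimuth=0)      # (left, right) impulse responses
# hrir_i = load_hrir(azimuth=90)
# mix = spatialize(target, *hrir_t) + spatialize(interferer, *hrir_i)
```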

Stimulus levels and threshold estimation

All stimuli were calibrated using an artificial ear coupler (AEC101 IEC 318, Larson Davis). Calibration was conducted after stimuli were convolved through the HRTFs. Stimulus levels were set based on calibration for token sentences from the speech corpus presented from the simulated front condition. The level of the interferer was fixed at 60 dB sound pressure level (SPL); thus, for the front condition, interferer levels were set to 60 dB SPL in each ear. For non-front conditions, interferer levels were 60 dB SPL at the ear ipsilateral to the interferer, and change in signal-to-noise ratio (SNR) represents the change in target level relative to interferer level at that ipsilateral ear. The level at the contralateral ear varied naturally with the HRTF, creating a head-shadow effect. The level of the target was varied adaptively using an algorithm that targets the 79.4% correct point on the psychometric function (Levitt, 1971). The target level was initially 65 dB SPL and was decremented by 8 dB following each correct response. After the first incorrect response, a modified adaptive 3-down/1-up algorithm was used in which the step size was halved after each reversal, with the minimum step size set to 2 dB. If the same step size was used twice in a row in the same direction, the next step size was doubled in value. Testing was terminated following eight reversals.
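
A simplified sketch of this adaptive track is given below. The run_trial() callback is a placeholder for presenting a spondee at the requested level and scoring the response; the exact bookkeeping of reversals and step doubling is an assumption and may differ in detail from the authors' implementation.

```python
def adaptive_track(run_trial, start_db=65.0, first_step=8.0,
                   min_step=2.0, max_reversals=8):
    """Sketch of the track: 1-down with 8-dB steps until the first error,
    then 3-down/1-up with step halving at reversals (2-dB floor) and step
    doubling after two consecutive same-size, same-direction steps."""
    level, step = start_db, first_step
    initial_phase = True      # before the first incorrect response
    correct_run = 0           # consecutive correct responses (3-down rule)
    last_dir = 0              # -1 = down, +1 = up
    same_dir_steps = 0
    reversal_levels = []
    while len(reversal_levels) < max_reversals:
        correct = run_trial(level)            # present a spondee, score True/False
        if initial_phase:
            if correct:
                direction = -1
            else:
                initial_phase, direction = False, +1
        elif correct:
            correct_run += 1
            if correct_run < 3:
                continue                      # need three correct before stepping down
            correct_run, direction = 0, -1
        else:
            correct_run, direction = 0, +1
        if last_dir != 0 and direction != last_dir:
            reversal_levels.append(level)     # reversal: halve the step size
            step = max(step / 2.0, min_step)
            same_dir_steps = 1
        else:
            same_dir_steps += 1
            if same_dir_steps > 2:            # same step used twice in a row
                step, same_dir_steps = step * 2.0, 1
        level += direction * step
        last_dir = direction
    return reversal_levels
```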

SRTs were estimated from the adaptive tracks by using a constrained maximum-likelihood method of parameter estimation (MLE), which has been described by Wichmann and Hill (2001a, 2001b). Based on this method, data from each experimental run for each participant were fitted to a logistic function and thresholds were calculated by taking the level of the target at a specific probability level. This approach has been shown to yield comparable results to the well-known approach in which SRT is defined as the average of levels at which reversals occur. However, the MLE approach has the advantage, with this stimulus corpus and adaptive tracking method, of producing smaller group variance (Litovsky, 2005).
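
A minimal sketch of such a maximum-likelihood fit is shown below: a logistic psychometric function with a guess rate of 1/25 (reflecting the closed set of 25 spondees) is fitted to the trial-by-trial data from one adaptive run, and the SRT is read off at the tracked probability. The simplifications (no lapse-rate term) and the optimizer settings are assumptions, not the constrained procedure of Wichmann and Hill (2001a, 2001b) in full.

```python
import numpy as np
from scipy.optimize import minimize

def fit_srt(levels, correct, p_target=0.794, guess=1.0 / 25.0):
    """Fit a logistic psychometric function by maximum likelihood and return
    the level corresponding to p_target proportion correct."""
    levels, correct = np.asarray(levels, float), np.asarray(correct, float)

    def nll(params):
        alpha, beta = params              # midpoint (dB) and slope parameter
        p = guess + (1 - guess) / (1 + np.exp(-(levels - alpha) / beta))
        p = np.clip(p, 1e-6, 1 - 1e-6)
        return -np.sum(correct * np.log(p) + (1 - correct) * np.log(1 - p))

    res = minimize(nll, x0=[np.mean(levels), 2.0],
                   bounds=[(levels.min() - 20, levels.max() + 20), (0.1, 20.0)])
    alpha, beta = res.x
    # invert the logistic to find the level that gives p_target correct
    q = (p_target - guess) / (1 - guess)
    return alpha + beta * np.log(q / (1 - q))
```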

Procedure and training

Data collection was conducted in blocks with the number of frequency bands (unprocessed, 16, 8, and 4) fixed. To ensure familiarity with the task, participants completed the unprocessed conditions first. Subsequently, all other blocks (16, 8, and 4) were presented in a random order generated with a different seed for each subject. Within a block, all other conditions were randomized. Prior to each testing block with vocoded stimuli, subjects received additional listening exposure to familiarize them with the quality of the speech they were about to hear in the upcoming blocked condition. During these vocoded exposure periods, four SRTs (two in quiet and two with the front interferer) were collected; these SRTs were excluded from the main analyses. After the first full set of data was completed, all conditions were re-randomized and a second set of data was collected, on the assumption that with more exposure to vocoded speech, listeners’ performance would be more stable. This second set of data was used in the analyses reported in this paper. Statistical analyses comparing the two sets of data revealed that learning effects occurred only in the 4-band conditions.

Results

SRTs

SRTs were obtained using the MLE procedure described above and were normalized relative to interferer level. These data are displayed in Fig. 1 as a function of number of frequency bands under binaural and monaural conditions. Two-way repeated-measures analyses of variance (ANOVAs) on SRTs were conducted for listening mode (binaural and monaural) and number of frequency bands (unprocessed, 16, 8, and 4); these analyses were conducted separately for each interferer condition (quiet, front, right, and left). A significant main effect of listening mode was found such that binaural SRTs were lower than monaural SRTs in all conditions: quiet [F(1,8)=9.874, p<0.05], front [F(1,8)=14.752, p<0.01], right [F(1,8)=77.763, p<0.0001], and left [F(1,8)=42.205, p<0.0001].

Figure 1. Average SRTs (+1 SD) in dB are shown for all spatial conditions tested relative to the interferer level (60 dB). The left panel represents binaural data and the right panel represents monaural (right-ear) data. Within each panel, SRTs are plotted for the different interferer conditions as a function of frequency band condition.

Significant main effects of number of bands were also observed for all conditions; quiet [F(3,8)=53.243, p<0.0001], front [F(3,8)=571.718, p<0.0001], right [F(3,8)=298.448, p<0.0001], and left [F(3,8)=2364.374, p<0.0001] SRTs. The lack of interactions with listening mode suggests that the effect of number of bands applies to binaural and monaural listening modes. Post-hoc Scheffe’s tests revealed that, in quiet, SRTs for the unprocessed condition were comparable to those in the 16-band condition but lower (better performance) than those in the 8- and 4-band conditions (p<0.001). However, in the presence of a speech interferer, the improvement in SRTs continued with further increases in number of frequency bands (p<0.0001). Specifically, SRTs for the unprocessed conditions were lower than those in the 16-, 8-, and 4-band conditions. In addition, SRTs for the 16-band conditions were lower than those in the 8- and 4-band conditions, with lower SRTs in the 8-band than those in the 4-band conditions.

Masking

In this study, masking was defined as the absolute change in SRTs when interferer stimuli were present compared with the quiet condition. Masking values for the front, right, and left, respectively, were computed as (SRTfront−SRTquiet), (SRTright−SRTquiet), and (SRTleft−SRTquiet). These masking values, shown in Fig. 2, were subjected to two-way repeated measures ANOVAs for listening mode (binaural, monaural) and number of frequency bands (unprocessed, 16, 8, and 4) as described above for SRTs.

Figure 2. Masking values (±1 SD) are plotted for the monaural (filled symbols) and binaural (unfilled symbols) conditions as a function of frequency band condition, shown for the front (panel A), right (panel B), and left (panel C) interferer conditions.

A main effect of listening mode was not found for front masking, indicating comparable amounts of masking for the binaural and monaural conditions. However, a main effect of listening mode was obtained for right [F(1,8)=88.280, p<0.0001] and left [F(1,8)=13.346, p<0.01] masking, such that the amount of masking was greater in the monaural than in the binaural conditions. These results suggest that binaural listening provides mechanisms for reduction in masking that are not available in the single-ear listening mode. In addition, a main effect of number of frequency bands was obtained for front [F(3,8)=20.502, p<0.0001], right [F(3,8)=14.511, p<0.0001], and left [F(3,8)=6.944, p<0.005] masking. Scheffe’s post-hoc analyses revealed that the amount of masking was significantly smaller in the unprocessed condition than in the 16-, 8-, and 4-band conditions (p<0.01), suggesting that spectrally degraded speech is more susceptible to masking than natural speech. Finally, differences in masking were not statistically significant across the three spectrally degraded conditions; this finding held in all three spatial masker configurations: front, right, and left.

Spatial release from masking

Figure 3 summarizes the findings for SRM, which was computed for two spatial configurations: right (Maskingfront−Maskingright) and left (Maskingfront−Maskingleft). These data were subjected to two-way repeated measures ANOVAs for listening mode (binaural and monaural) and number of frequency bands (unprocessed, 16, 8, and 4); separate analyses were conducted for the right and left SRM values. A main effect of listening mode indicated that SRM was larger in the binaural than in the monaural (right-ear) conditions for both right [F(1,8)=51.317, p<0.0001] and left [F(1,8)=24.700, p<0.005] interferer configurations.

Figure 3. Average amounts of SRM (+1 SD) are shown for the binaural (A) and monaural (B) listening modes. Within each panel, SRM is compared for the different frequency bands as a function of interferer location.

A main effect of number of frequency bands was not found for the right spatial configuration but was obtained for the left configuration [F(1,8)=3.424, p<0.05]. Scheffe’s post-hoc analysis revealed that, in comparison with the unprocessed condition, the amount of SRM was greater for spectrally degraded conditions: 16-band (p<0.05), 8-band (p<0.005), and 4-band (p<0.001). Differences in SRM were not statistically significant across the different spectrally degraded conditions.

A significant interaction of listening mode × number of frequency bands [F(3,24)=5.116, p<0.01] was found for the right spatial configuration. Scheffe’s post-hoc analysis showed that, under monaural listening, SRM was statistically comparable for the different spectral conditions. In the binaural conditions, SRM for the unprocessed condition was smaller than that in the 8 (p<0.0001) and 4 (p<0.005) bands, and comparable to the 16-band condition. In addition, SRM was greater in the 8-band condition compared with 16-band (p<0.001) and 4-band (p<0.05) conditions. SRM for the 16- and 4-band conditions was comparable.

Bilateral effects

Further analyses were conducted in order to facilitate comparisons with studies in bilateral CI users. The variables of interest were head shadow, binaural squelch, and binaural summation (e.g., Muller et al., 2002; Tyler et al., 2003; Schleich et al., 2004; Litovsky et al., 2006). Head shadow in the monaural (right-ear) condition was defined as the advantage (reduction in SRT) obtained when the interferer was contralateral versus ipsilateral to the functional ear. It was thus computed as [SRT(monaural)right−SRT(monaural)left]. Binaural squelch describes the advantage obtained as a result of spatial separation between target stimuli and interfering stimuli. These values were obtained for each subject as [(monaural)left−(binaural)left]. Binaural summation, an advantage that can result from listening to identical stimuli with two ears, was calculated for each subject in two ways; first, by comparing SRTs in the conditions with no interferer [(monaural)quiet−(binaural)quiet], and second, by comparing SRTs in the conditions with interferer in the front [(monaural)front−(binaural)front].
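
The derived quantities defined in this and the preceding sections reduce to simple differences between SRTs. The sketch below illustrates the computations for a single listener; the SRT values are placeholders, not data from the study.

```python
# srt[mode][cond] holds one listener's SRTs in dB (placeholder values).
srt = {
    "binaural": {"quiet": 20.0, "front": 55.0, "right": 48.0, "left": 47.0},
    "monaural": {"quiet": 22.0, "front": 57.0, "right": 56.0, "left": 50.0},
}

def masking(mode):
    """Masking: SRT with an interferer minus SRT in quiet."""
    return {c: srt[mode][c] - srt[mode]["quiet"] for c in ("front", "right", "left")}

def srm(mode):
    """SRM: masking in the co-located (front) case minus masking when separated."""
    m = masking(mode)
    return {c: m["front"] - m[c] for c in ("right", "left")}

head_shadow     = srt["monaural"]["right"] - srt["monaural"]["left"]
squelch         = srt["monaural"]["left"]  - srt["binaural"]["left"]
summation_quiet = srt["monaural"]["quiet"] - srt["binaural"]["quiet"]
summation_front = srt["monaural"]["front"] - srt["binaural"]["front"]
```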

For each of the four effects listed above, a one-way repeated measures ANOVA was conducted in which the variable of interest was number of frequency bands, including the unprocessed conditions. There were no statistically significant findings for any of the analyses, suggesting that the effects were not dependent on spectral resolution. Data for the vocoded speech were pooled across frequency band conditions and plotted as group means (+1 SD) in Fig. 4. Average values were 5.3 dB for head shadow, 5.9 dB for squelch, and 2.5 and 1.6 dB for binaural summation in quiet and in the presence of the front interferer, respectively. In Fig. 4, the bilateral effects are also plotted for the unprocessed conditions for comparison purposes. Given the lack of a significant main effect of number of spectral bands, the unprocessed and processed conditions were grouped for each condition and were subjected to one-sample t-tests (e.g., Schleich et al., 2004). Results revealed that head shadow, squelch, and summation in quiet and in the presence of front interferer were each significantly different than zero (p<0.0001, p<0.0001, p<0.01, p<0.01, respectively).

Figure 4. Group means (+1 SD) are shown for head shadow, binaural squelch, binaural summation as estimated from the quiet condition, and binaural summation as estimated from the condition with the interferer in front. Data are plotted for the unprocessed conditions (dark bars) and processed conditions (light bars).

STUDY II

Given the increased robustness of SRM found in the first study when using vocoded speech, the next question addressed here was whether this effect can also be observed in a scenario that more realistically simulates true bilateral CI listening. Therefore, speech stimuli were first convolved through the HRTFs, as would occur in the real world for a person using CIs in the free field; the resulting stimuli were subsequently processed through the vocoder. This study, with the reversed order of signal processing, enabled us to examine whether the directionally dependent cues that are available in the HRTFs are immune to, or distorted by, the CI signal processing in ways that affect benefits from binaural hearing for the spatially separated conditions. Testing was conducted with a second group of listeners, and data from the two studies are henceforth compared for conditions that are thought to involve the use of binaural directional cues for source segregation.

Material and methods

Listeners

Nine NHLs (two male, five female; age range 19–25 years) participated. All subjects were native speakers of English and had pure-tone thresholds better than 15 dB HL at octave frequencies from 250 to 8000 Hz. Participants signed a consent form approved by the University of Wisconsin-Madison Health Sciences Institutional Review Board and were paid for their participation.

Signal processing

As in Study I, all stimuli were bandpass filtered into 4, 8, or 16 contiguous frequency bands by sixth-order Butterworth filters with equal bandwidths on a logarithmic scale from 300 to 10300 Hz. Target and interfering waveforms were convolved through HRTFs to create directionally dependent stimuli. As in Study I, the carrier tones for the target and for the interferer were each in phase across the two ears but were not systematically in phase between target and interferer. Stimuli were then digitally mixed and subsequently passed through the CI simulation filters described in detail in Study I. A Tucker-Davis Technologies (TDT) RP2 array processor was used to attenuate the stimuli, which were then presented to listeners via TDT System III hardware (HB7) and headphones (Sennheiser HD 580).
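
Reusing the tone_vocode() and spatialize() sketches given for Study I, the two processing orders can be contrasted as follows. The helper names, argument layout, and equal-length assumption are illustrative only, not the authors' code.

```python
import numpy as np

# Study I (idealized):  vocode each source, spatialize through HRTFs, then mix,
#                       so the tone carriers themselves pass through the HRTFs.
# Study II (CI-like):   spatialize and mix first, then vocode each ear,
#                       discarding binaural temporal fine-structure cues.
# Signals are assumed to have been padded/trimmed to a common length.

def study1_pipeline(target, interferer, fs, bands, hrir_t, hrir_i):
    t = spatialize(tone_vocode(target, fs, bands), *hrir_t)
    i = spatialize(tone_vocode(interferer, fs, bands), *hrir_i)
    return t + i                                     # binaural (N, 2) mixture

def study2_pipeline(target, interferer, fs, bands, hrir_t, hrir_i):
    mix = spatialize(target, *hrir_t) + spatialize(interferer, *hrir_i)
    left = tone_vocode(mix[:, 0], fs, bands)         # vocode each ear separately
    right = tone_vocode(mix[:, 1], fs, bands)
    return np.column_stack([left, right])
```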

Stimuli and procedure

Speech stimuli and the testing apparatus were identical to those described in detail in Study I. Participants were tested on a total of 24 conditions consisting of all combinations of three interferer locations (front, right and left), two listening modes (binaural and monaural∕right-ear), and four spectral conditions (unprocessed, 16-, 8- and 4-band conditions). A similar training procedure to that described in the first study was used in this study in order to account for learning effects that might occur when listening to vocoded speech. Given that learning effects only occurred for the 4-band conditions in Study I, only that condition was repeated and included in the data analyses conducted in Study II. Data collection was completed in three two-hour sessions per participant.

Results

SRTs

SRTs obtained in the two studies are shown in Fig. 5, where results based on the two different orders of processing can be compared. Data were subjected to repeated measures ANOVAs with number of frequency bands (unprocessed, 16, 8, and 4) and interferer conditions (front, right, and left) as the within subject variables and processing order (Study I and Study II) as the between-subject variable. This analysis was conducted separately for binaural and monaural SRTs, in order to directly compare the processing order effects within each listening mode.

Figure 5. Average SRTs (+1 SD) in dB are shown for all spatial conditions tested relative to the interferer level and displayed for Study I (left column) and Study II (right column). Top panels represent binaural data and bottom panels represent monaural data.

There was no significant main effect of processing order for either binaural or monaural SRTs. As was seen when data from Study I were analyzed, a significant main effect of number of frequency bands was found for both binaural [F(3,16)=341.404, p<0.0001] and monaural [F(3,16)=346.285, p<0.0001] conditions. Post-hoc tests were conducted for between-subject effects; all t-tests reported in these experiments were corrected for multiple comparisons using the Holm–Bonferroni procedure. Results from the post-hoc analyses (significance levels, p<0.0001) showed that SRTs for the unprocessed conditions were significantly lower than those in the 16-, 8-, and 4-band conditions. In addition, SRTs with 16 bands were lower than those in the 8- and 4-band conditions, and SRTs with 8 bands were lower than with 4 bands.
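
For reference, a minimal sketch of the Holm–Bonferroni step-down rule applied to a family of post-hoc p-values is shown below; it is a generic illustration, not the authors' statistical software.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return, for each test, whether it survives Holm-Bonferroni correction."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # ascending p
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                       # once one test fails, stop rejecting
    return reject

print(holm_bonferroni([0.001, 0.02, 0.07]))   # -> [True, True, False]
```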

There was a significant main effect of interferer condition for both binaural [F(2,16)=96.270, p<0.0001] and monaural [F(2,16)=86.737, p<0.0001] SRTs. Post-hoc analyses (significance levels, p<0.0001) showed that binaural SRTs were higher when the interferer was placed in the front than when it was in the right or left configurations; the latter two conditions resulted in comparable SRTs. On the other hand, post-hoc analyses for the monaural right-ear conditions showed that when the interferer was located in the front, SRTs were higher than when the interferer was on the left but lower than when the interferer was on the right; monaural SRTs for the right configuration were higher than SRTs for the left configuration. Note that the lower monaural SRTs when the interferer was in the front relative to the right occurred despite the calibration procedure that equated the interferer SPLs in the two conditions. The difference in SRTs could have been due to differences in the interferer’s spectral profile resulting from frequency-dependent head-shadow effects.

A significant interaction was observed for number of frequency bands and processing order in the binaural conditions [F(3,48)=4.504, p<0.01] but not in the monaural right-ear conditions. Post-hoc tests revealed that for the 8-band conditions, binaural SRTs were lower in Study I compared with Study II. This suggests that when vocoding is conducted on stimuli that have already been filtered with HRTFs, there are likely to be adverse effects on performance.

SRM

The approach for deriving SRM values was similar to that used in Study I, and values from the two studies are compared in Fig. 6. Repeated measures ANOVAs, with number of frequency bands and interferer condition as the within-subject variables and processing order as a between-subject variable, were conducted separately for binaural and monaural conditions. There was no significant effect of processing order on SRM. A main effect of number of frequency bands was found for the binaural [F(3,16)=7.493, p<0.0001] but not for the monaural conditions. Post-hoc t-tests for this effect revealed that the amount of SRM for the unprocessed conditions was smaller than for the 16- (p<0.01), 8- (p<0.0001), and 4-band (p<0.005) conditions. Additionally, greater SRM was found for the 8-band than the 16-band condition (p<0.0001). SRM for the 4-band condition was comparable to SRM in the 16- and 8-band conditions.

Figure 6. Average amounts of SRM (+1 SD) are shown for the binaural (left column) and monaural (right column) listening modes. Within each panel, SRM is compared for the two studies across the different frequency bands. Each panel represents a different interferer condition.

A significant main effect of interferer condition was found for the monaural mode [F(1,16)=150.412, p<0.0001], where SRM for the left configuration was larger than that for the right configuration due to the presence of head shadow; there was no main effect of interferer condition for the binaural mode.

Bilateral effects

As in Study I, values for head shadow, binaural squelch, and binaural summation were calculated. A one-way repeated measures ANOVA was conducted on data from Study II for each effect, with the variable of interest being number of frequency bands (unprocessed, 16, 8, and 4). Similar to findings in Study I, none of the effects showed a dependence on number of frequency bands. However, to evaluate the effect of processing order on head shadow, summation, and squelch across the two studies, only data within the processed conditions were examined. Thus, data from each study within the processed conditions were pooled to yield an overall measure for each of the three effects. These pooled data are compared in Fig. 7 (panel B). A one-way ANOVA with processing order as the independent factor did not reveal statistically significant differences between the two studies. Differences in the squelch effect approached significance (p=0.06). The effect sizes for the unprocessed conditions (common across the two studies) are shown in Fig. 7 (panel A). The differences between the two subject groups were small and not statistically significant.

Figure 7. Group means (+1 SD) are shown for head shadow, binaural summation in front, and binaural squelch. Results from Study I and Study II are compared for the unprocessed conditions (panel A) and the processed conditions (panel B).

DISCUSSION

Effect of CI vocoding on speech intelligibility

The current work shows that unprocessed speech yielded the lowest average thresholds, in particular in the conditions with interfering speech. This result is consistent with previous studies demonstrating that NHLs are less challenged by speech recognition in noise than subjects with hearing impairments (e.g., Chung and Mack, 1979; Dubno et al., 1989; Pekkarinen et al., 1990) or CI users (Nelson and Jin, 2004; Stickney et al., 2004; Fu and Nogaki, 2005). It also confirms the well-known robustness of natural speech relative to spectrally degraded speech, which stems from listeners having access to spectral and temporal cues that have been shown to be important for speech recognition in noise (Assmann and Summerfield, 1990; Leek and Summers, 1993; Eisenberg et al., 1995; Vliegen and Oxenham, 1999; Summers and Molis, 2004).

In the vocoder conditions, speech recognition improved as the number of frequency bands was increased, which is consistent with previous reports using vocoded speech (Dorman and Loizou, 1997; Loizou et al., 1999; Dorman et al., 2000; Friesen et al., 2001; Qin and Oxenham, 2003; Stickney et al., 2004). In contrast with previous work (e.g., Dorman et al., 1998), the present study demonstrated that 16 frequency bands are not sufficient to reach performance comparable to that with natural speech and that a larger number of bands is needed to support the dynamics of listening in complex auditory environments. In general, the elevated SRTs observed even with a relatively large number of frequency bands underscore the importance of attempting to recapture and preserve information that is currently missing in today’s clinical processors in order to enhance performance. Current CI processors encode temporal envelope cues and discard fine-structure information, the latter perhaps being particularly important for listening to speech in the presence of interfering sounds (Rosen, 1992; Nelson et al., 2003; Nie et al., 2005; Fu and Nogaki, 2005).

Effect of CI vocoding on masking

Results showed that the amount of masking increased considerably when spectrally degraded signals were used relative to the unprocessed conditions. However, the amount of masking was comparable for the three frequency band conditions (16, 8, and 4), suggesting that once speech is spectrally degraded, susceptibility to masking is relatively high. In line with previous studies, the current results underscore the limitation of CI vocoders in reproducing fine-structure information, which is known to be important for speech recognition, particularly in the presence of temporally overlapping speech sounds (e.g., Smith et al., 2002; Stickney et al., 2005, 2007; Rubinstein and Hong, 2003; Wilson et al., 2003, 2005). These results further suggest that when fine-structure information is reduced by vocoding, increasing the number of bands might not be the most constructive solution to the problem.

A further noteworthy finding is the relationship between masking and listening mode across the different spatial configurations. Overall, the current results are consistent with reports in NHLs showing that binaural hearing provides advantages over monaural hearing, in particular in adverse listening environments (e.g., MacKeith and Coles, 1971; Bronkhorst and Plomp, 1988; Arsenault and Punch, 1999; Hawley et al., 2004). Here we observed less masking under binaural than monaural conditions (see Fig. 2), with a dependence on the interferer location. When the interferer was placed in front, i.e., in the absence of spatial separation between the target and interferer, binaural advantages were not observed.

Effect of CI vocoding on spatial release from masking

SRM increased as the number of frequency bands was reduced from unprocessed to 16- and then 8-bands. These results may be related to the finding in the normal-hearing literature that saliency of spatial cues increases as the listening environment becomes more difficult and complex. One such example is when the interferers carry linguistic content and context rather than consisting of noise or when the number of interfering talkers increases from 1 to 3 (e.g. Hawley et al., 2004). Similarly, advantages of spatial cues become particularly robust when the target speech and interferers are comprised of identical or highly similar talkers, that is, when informational masking is present (e.g., Freyman et al., 1999).

In the current studies, target-interferer similarity would also have been heightened when the number of frequency bands was reduced, leading to an increase in informational masking. Thus, the increase in SRM in the vocoded conditions compared with the unprocessed conditions can be reasonably interpreted within the context of informational masking. What cannot be readily explained within that context is the fact that the increase in SRM was non-monotonic, declining for the 4-band conditions compared with the 8-band conditions. One interpretation is that the reduced SRM in the 4-band condition might have resulted from non-linearity in informational masking under conditions of degraded speech. As can be seen in Fig. 1, targets had to be presented at positive SNR values in order for listeners to achieve optimal performance, which reflects the difficulty of the task in that condition. It appears that when target recognition requires SNRs above 0 dB, listeners’ susceptibility to informational masking is reduced, thereby minimizing SRM (Arbogast et al., 2005; Freyman et al., 2008).

The size of the bilateral effects measured with these simulations should be extrapolated to actual CI users only with caution. The preservation and enhancement of SRM in the 8- and 16-band conditions was most likely due to the availability of coordinated inputs across the two ears, such as identical frequency inputs at specific stimulation bands. As discussed below, simply smearing the spectral cues by processing HRTFs through CI simulation filters, as done in Study II, did not appear to have an effect on SRM. Thus, other aspects of signal processing in CIs need to be considered in order to understand how the gap between NHLs and CI users might be bridged.

Spatial effects related to the bilateral CI literature

Results from this study were analyzed in comparable ways to analyses that are typically conducted when bilateral CI users are tested. Three effects that are thought to reflect advantages stemming from having bilateral hearing were found to be significant: head shadow, binaural squelch, and binaural summation. The effects occurred regardless of the number of frequency bands in the signal, suggesting that benefits arising from bilateral hearing are not intimately dependent on frequency resolution.

In terms of benefits that are known to occur when two ears are activated, the head shadow is one of the largest. This effect occurs when the head of a listener acts as an acoustic barrier (“shadow”) such that masker levels are attenuated at the ear contralateral to that of the masker location, improving SNRs at the contralateral ear. In a cocktail party environment, the benefit would arise when the target is near the “better” ear and the masker is contralateral to that ear (compared with ipsilateral to that ear). In the present study, when averaged across conditions, the head-shadow effect was approximately 5 dB, which is similar in magnitude to what has been found in persons with bilateral CIs (Gantz et al., 2002; Muller et al., 2002; Tyler et al., 2002; Van Hoesel et al., 2002; Schleich et al., 2004; Litovsky et al., 2006), and somewhat smaller than the 9–11 dB reported in NHLs (Arsenault and Punch, 1999; Bronkhorst and Plomp, 1988). The extent to which these differences are related to the choice of speech stimuli used in this study cannot be determined based on the current results; this topic should certainly be addressed in future studies. However, the similarity of the effect size between bilateral CI listeners and NHLs tested using binaural vocoded speech suggests that the vocoding reduced the effect size in a manner that effectively mimicked perceptual effects that arise when head-shadow cues are available to CI users.

Another effect measured in this study is the squelch effect, quantified for each subject as [(monaural)left-(binaural)left]. This effect, thought to be helpful for source segregation when sounds are spatially separated, requires that the auditory system make use of the differences in signals arriving at the two ears. In the present studies participants showed a squelch effect averaging approximately 6 dB in Study I and 3.6 dB in Study II, which is highly similar to the range of effect sizes (3–7 dB) reported in NHLs with undegraded stimuli (Levitt and Rabiner, 1967; Arsenault and Punch, 1999; Bronkhorst and Plomp, 1988; Hawley et al., 2004). Our finding, that squelch occurs with either processing order, suggests that the smaller number of spectral bands available to CI users is not likely to be the limiting factor for eliciting the squelch effect.

If binaural temporal fine-structure cues are important for squelch, then one might have expected that the removal of those cues, as was done in Study II, would reduce or obliterate the effect. Statistical comparison of squelch between the two studies approached, but did not reach, significance, suggesting that the effect of processing order was weak or absent, or that variability in the data obscured the effect. However, the trend for squelch to be smaller in Study II, where the binaural cues in the stimuli were smeared by the vocoder, may help to explain the very small (Muller et al., 2002; Schön et al., 2002; Schleich et al., 2004; Litovsky et al., 2006) or absent (Gantz et al., 2002; Van Hoesel and Tyler, 2003; Van Hoesel et al., 2002) squelch seen in bilateral CI users. Clearly, the lack of significant statistical effects here tempers this conclusion and suggests that further work is needed in this area. Another, perhaps more likely, explanation for small squelch effects in bilateral CI users is the lack of coordinated inputs across the two ears and minimal or absent interaural timing cues. It has been reported that interaural level cues are the predominant cues for CI users (van Hoesel, 2004); this suggests that future advances in speech processors should include mechanisms for restoring interaural timing cues.

In the third effect, known as binaural summation or redundancy, the signals reaching the two ears are very similar or identical, as the auditory stimulus is presented from 0° (front). The effect sizes in our studies were similar to those obtained in studies with hearing impaired listeners (Bronkhorst and Plomp, 1989) and were unaffected by order of processing. The summation effect is also an effect that is found in some but not all bilateral CI users (Schleich et al., 2004; Litovsky et al., 2006). In the study of Litovsky et al. (2006) 15∕34 subjects (44%) demonstrated this effect when compared with either of the two ears alone, while 17∕34 (50%) subjects had no effect, and 2∕34 subjects (6%) showed a decrement in the bilateral condition rather than improved performance. Thus, like squelch, the summation effect might be a good example of a benefit that comes from having inputs present at both ears and depends on highly symmetrical (or identical) hearing integrity, and possibly also coordinated timing of inputs between the ears. These are factors that are known to be problematic in bilateral CI users, but that were clearly surmounted by the stimulation approaches used here.

Effect of simulation order on utility of spatial information

By varying the order of stimulus processing in the two studies, we were able to examine effects of two aspects of CI simulation: removal of speech fine-structure cues (Study I) and, subsequently, removal of binaural temporal fine-structure cues as well (Study II). Because the stimulation to the two ears was coordinated, as occurs in NHLs, interaural timing and level differences in the envelope remained unperturbed, which renders these good candidates for cues used by the listeners. A main effect of processing order was not found for SRTs. However, a significant interaction of number of frequency bands by processing order revealed that for the 8-band conditions, binaural SRTs obtained in Study I were lower than those obtained in Study II (in which HRTFs were processed through the CI vocoder). In the 16-band condition, i.e., the condition with substantially richer spectral information, there was no effect of processing order. Together with the lack of interaction effects in the monaural conditions, these results suggest that under binaural conditions, spectral information that is available in the HRTFs might be useful for speech recognition in adverse listening environments. The underlying mechanisms responsible for this are likely to be ones that produce redundancy or summation of information that is required for speech perception.

The finding that higher SNRs were required for listeners to achieve optimal performance is consistent with what is known about CI users and the challenges that they face in noisy situations. This finding has implications for true CI users, who have been shown to utilize up to eight independent channels of spectral information (Friesen et al., 2001). Additionally, it is important to note that differences in SRTs across the two studies were not observed for the 4-band conditions. This is most likely due to the severe degradation of spectral information in the 4-band conditions regardless of the presence of HRTF cues. On the other hand, the possibility that inter-subject variability obscured such differences cannot be ruled out. The lack of statistically significant effects in the monaural conditions suggests that the directionally dependent cues in the HRTFs, which were available in Study I but were most likely eliminated in Study II by processing HRTFs through the vocoder, may not have served an important purpose for the effects studied here.

Of further interest is whether the amount of SRM is dependent on the preservation of directional cues that are available in the HRTFs. SRM was not significantly different across the two processing approaches. In Fig. 6, however, one can see a trend for smaller SRM in the right-masker configuration in Study II than in Study I. It is noteworthy that the data have an inherent level of inter-subject variability. Individual differences are a hallmark of some noted perceptual phenomena such as informational masking (e.g., Oh and Lutfi, 1998; Durlach et al., 2003), which, as discussed above, seems to have arisen with the vocoded stimuli used in the current experiments. The choice of a speech interferer in these experiments was based on the desire to utilize a stimulus paradigm that more realistically represents the listening situations encountered by CI users in everyday life. The extent to which inter-subject variability might have been large enough to obscure effects of signal processing order or of the other manipulations conducted here might be investigated in future studies, perhaps with fewer conditions but a much larger number of participants. Alternatively, one might tackle this issue by using stimuli that are constructed so as to maximize energetic, rather than informational, masking.

Regardless of the variability, the spatial effects observed here were either comparable to or greater than those reported in bilateral CI users. These results suggest that directional cues that exist at the output of the vocoder, even after the HRTFs have been processed, are sufficient for the occurrence of spatially dependent bilateral benefits. The present study demonstrated that, by preserving binaurally coordinated stimulation in the envelopes of the signals alone, benefits from bilateral CIs could be substantial, regardless of the amount of spectral degradation in the speech signal. This suggests that, while preservation of fine-structure in the signal may offer other benefits, envelope-based binaural differences are likely to offer a substantial portion of the advantage for listening in complex environments. The extent to which fine-structure vs envelope cues might each contribute to improved performance is obviously an important topic for further investigation.

SUMMARY AND CONCLUSION

The current study examined the effect of spectral resolution on speech intelligibility and SRM in binaural and monaural listening conditions in NHLs. The order of signal processing of the vocoded speech and the directionally dependent HRTFs had little effect on the results. The findings are consistent with the notion that increased spectral information is important for improved speech intelligibility. However, the benefit of spatial cues was most pronounced under conditions of spectral degradation of speech, when the target and interfering speech are more likely to be confused and thus when informational masking is likely to be larger. Benefits from binaural hearing that are rarely observed in true bilateral CI users were seen here. This suggests that, for the effects studied in these experiments, preservation of binaural coordination between the two ears may be important for realizing the benefits of bilateral implantation.

ACKNOWLEDGMENTS

The authors would like to thank Dr. Richard Freyman and two anonymous reviewers for their helpful comments on earlier drafts of this manuscript. The authors are grateful to Shelly Godar and Tanya Jensen for assisting with subject recruitment and to Lindsey Rentmeester and Nick Liimatta for assisting with data collection. They would also like to thank Christopher Long for his feedback on an earlier version of the manuscript. Portions of this work were presented at the 2006 Meeting of the Association for Research in Otolaryngology. Work supported by NIH-NIDCD Grant No. R01DC030083 to R.Y.L.

Footnotes

1. HRTFs from NHLs such as those used here are typically measured in the ear canal (Blauert, 1997), and contain high-frequency cues that are not preserved by the CI processors.

2. In typical cochlear implant systems, the highest frequency is approximately 8000 Hz, which is lower than the 10300 Hz cutoff used here. On a logarithmic scale, however, this value is not much higher than the values used in current CIs. There is good reason to provide higher frequencies, because localization cues that result from directionally dependent filtering of sounds by the head and ears are greater in the 8–10 kHz range than at lower frequencies. Some of the MAPs provided by manufacturers do offer this range as an option (e.g., Table 9 in the Cochlear system). Finally, the higher frequency cutoff used here was selected for consistency and comparability with results from other studies being conducted by our group in which sound localization ability is investigated using the same stimuli.

References

  1. Arbogast, T., Mason, C., and Kidd, G., Jr. (2002). “The effect of spatial separation on informational and energetic masking of speech,” J. Acoust. Soc. Am. 112, 2086–2098. 10.1121/1.1510141 [DOI] [PubMed] [Google Scholar]
  2. Arbogast, T., Mason, C., and Kidd, G., Jr. (2005). “The effect of spatial separation on informational masking of speech in normal hearing and hearing impaired listeners,” J. Acoust. Soc. Am. 117, 2169–2180. 10.1121/1.1861598 [DOI] [PubMed] [Google Scholar]
  3. Arsenault, M., and Punch, J. (1999). “Nonsense-syllable recognition in noise using monaural and binaural listening strategies,” J. Acoust. Soc. Am. 105, 1821–1830. 10.1121/1.426720 [DOI] [PubMed] [Google Scholar]
  4. Assmann, P., and Summerfield, Q. (1990). “Modeling the perception of concurrent vowels: Vowels with different fundamental frequencies,” J. Acoust. Soc. Am. 88, 680–697. 10.1121/1.399772 [DOI] [PubMed] [Google Scholar]
  5. Assmann, P., and Summerfield, Q. (1994). “The contribution of waveform interactions to the perception of concurrent vowels,” J. Acoust. Soc. Am. 95, 471–484. 10.1121/1.408342 [DOI] [PubMed] [Google Scholar]
  6. Bacon, S., Opie, J., and Montoya, D. (1998). “The effect of hearing loss and noise masking on the masking release for speech in temporally complex backgrounds,” J. Speech Lang. Hear. Res. 41, 549–563. [DOI] [PubMed] [Google Scholar]
  7. Başkent, D., and Shannon, R. V. (2007). “Combined effects of frequency compression-expansion and shift on speech recognition,” Ear Hear. 28, 277–289. 10.1097/AUD.0b013e318050d398 [DOI] [PubMed] [Google Scholar]
  8. Battmer, R., Feldmeier, I., Kohlenberg, A., and Lenarz, T. (1997). “Performance of the new Clarion speech processor 1.2 in quiet and in noise,” Am. J. Otol. 18, S144–S146. [PubMed] [Google Scholar]
  9. Bird, J., and Darwin, C. J. (1998). “Effects of a difference in fundamental frequency in separating two sentences,” in Psychophysical and Physiological Advances in Hearing, edited by Palmer A. R., Rees A., Summerfield A. Q., and Meddis R. (Whurr, London). [Google Scholar]
  10. Blauert, J. (1997). Spatial Hearing—Revised Edition: The Psychophysics of Human Sound Localization (MIT, Cambridge, MA). [Google Scholar]
  11. Bronkhorst, A., and Plomp, R. (1988). “The effect of head-induced interaural time and level differences on speech intelligibility in noise,” J. Acoust. Soc. Am. 83, 1508–1516. 10.1121/1.395906 [DOI] [PubMed] [Google Scholar]
  12. Bronkhorst, A., and Plomp, R. (1989). “Binaural speech intelligibility in noise for hearing-impaired listeners,” J. Acoust. Soc. Am. 86, 1374–1383. 10.1121/1.398697 [DOI] [PubMed] [Google Scholar]
  13. Brungart, D. (2001). “Informational and energetic masking effects in the perception of two simultaneous talkers,” J. Acoust. Soc. Am. 109, 1101–1119. 10.1121/1.1345696 [DOI] [PubMed] [Google Scholar]
  14. Buus, S., and Florentine, M. (1985). “Gap detection in normal and impaired listeners: The effect of level and frequency,” in Time Resolution in Auditory Systems, edited by Michelsen A. (Springer-Verlag, London), pp. 159–179. [Google Scholar]
  15. Chung, D., and Mack, B. (1979). “The effect of masking by noise on word discrimination scores in listeners with normal hearing and with noise-induced hearing loss,” Scand. Audiol. 8, 139–143. [DOI] [PubMed] [Google Scholar]
  16. Dorman, M., Loizou, P., Fitzke, J., and Tu, Z. (1998). “The recognition of sentences in noise by normal hearing listeners using simulations of cochlear implant signal processors with 6–20 channels,” J. Acoust. Soc. Am. 104, 3583–3585. 10.1121/1.423940 [DOI] [PubMed] [Google Scholar]
  17. Dorman, M., and Loizou, P. (1997). “Speech intelligibility as a function of the number of channels of stimulation for normal hearing listeners and patients with cochlear implants,” Am. J. Otol. 18, S113–S114. [PubMed] [Google Scholar]
  18. Dorman, M., Loizou, P., Kemp, L., and Kirk, K. (2000). “Word recognition by children listening to speech processed into a small number of channels: Data from normal-hearing children and children with cochlear implants,” Ear Hear. 21, 590–596. 10.1097/00003446-200012000-00006 [DOI] [PubMed] [Google Scholar]
  19. Drullman, R., and Bronkhorst, A. (2000). “Multichannel speech intelligibility and talker recognition using monaural, binaural, and three-dimensional auditory presentation,” J. Acoust. Soc. Am. 107, 2224–2235. 10.1121/1.428503 [DOI] [PubMed] [Google Scholar]
  20. Dubno, J., Dirks, D., and Schaefer, A. (1989). “Stop-consonant recognition for normal hearing listeners and listeners with high frequency loss II: Articulation index predictions,” J. Acoust. Soc. Am. 85, 355–364. 10.1121/1.397687 [DOI] [PubMed] [Google Scholar]
  21. Durlach, N., Mason, C., Shinn-Cunningham, B., Arbogast, T., Colburn, H., and Kidd, G., Jr. (2003). “Informational masking: Counteracting the effects of stimulus uncertainty by decreasing target-masker similarity,” J. Acoust. Soc. Am. 114, 368–379. 10.1121/1.1577562 [DOI] [PubMed] [Google Scholar]
  22. Eisenberg, L., Dirks, D., and Bell, T. (1995). “Speech recognition in amplitude-modulated noise of listeners with normal and listeners with impaired hearing,” J. Speech Hear. Res. 38, 222–233. [DOI] [PubMed] [Google Scholar]
  23. Freyman, R., Balakrishnan, U., and Helfer, K. (2008). “Spatial release from masking with noise-vocoded speech,” J. Acoust. Soc. Am. 124, 1627–1637. 10.1121/1.2951964 [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Freyman, R. L., Helfer, K. S., McCall, D. D., and Clifton, R. K. (1999). “The role of perceived spatial separation in the unmasking of speech,” J. Acoust. Soc. Am. 106, 3578–3588. 10.1121/1.428211 [DOI] [PubMed] [Google Scholar]
  25. Friesen, L., Shannon, R., Baskent, D., and Wang, X. (2001). “Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants,” J. Acoust. Soc. Am. 110, 1150–1163. 10.1121/1.1381538 [DOI] [PubMed] [Google Scholar]
  26. Fu, Q., and Nogaki, G. (2005). “Noise susceptibility of cochlear implant users: The role of spectral resolution and smearing,” J. Assoc. Res. Otolaryngol. 6, 19–27. 10.1007/s10162-004-5024-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Gantz, B., Tyler, R., Knutson, J., Woodworth, G., Abbas, P., McCabe, B., Hinrichs, J., Tye-Murray, N., Lansing, C., Kuk, F., and Brown, C. (1988). “Evaluation of five different cochlear implant designs: Audiologic assessment and predictions of performance,” Laryngoscope 98, 1100–1106. 10.1288/00005537-198810000-00013 [DOI] [PubMed] [Google Scholar]
  28. Gantz, B., Tyler, R., Rubinstein, J., Wolaver, A., Lowder, M., Abbas, P., Brown, C., Hughes, M., and Preece, J. (2002). “Binaural cochlear implants placed during the same operation,” Otol. Neurotol. 23, 169–180. 10.1097/00129492-200203000-00012 [DOI] [PubMed] [Google Scholar]
  29. Gardner, W., and Martin, K. (1994). “HRTF measurements of a KEMAR dummy-head microphone,” The MIT Media Laboratory Machine Listening Group, http://sound.media.mit.edu/resources/KEMAR.html (Last viewed February 2009).
  30. Hawley, M., Litovsky, R., and Colburn, H. (1999). “Speech intelligibility and localization in a multi-source environment,” J. Acoust. Soc. Am. 105, 3436–3448. 10.1121/1.424670 [DOI] [PubMed] [Google Scholar]
  31. Hawley, M., Litovsky, R., and Culling, J. (2004). “The benefits of binaural hearing in a cocktail party: Effect of location and type of interferer,” J. Acoust. Soc. Am. 115, 833–843. 10.1121/1.1639908 [DOI] [PubMed] [Google Scholar]
  32. Iwaki, T., Matsushiro, N., Mah, S., Sato, T., Yasuoka, E., Yamamoto, K., and Kubo, T. (2004). “Comparison of speech perception between monaural and binaural hearing in cochlear implant patients,” Acta Oto-Laryngol. 124, 358–362. 10.1080/00016480310000548 [DOI] [PubMed] [Google Scholar]
  33. Kidd, G., Jr., Mason, C., Rohtla, T., and Deliwala, P. (1998). “Release from masking due to spatial separation of sources in the identification of nonspeech auditory patterns,” J. Acoust. Soc. Am. 104, 422–431. 10.1121/1.423246 [DOI] [PubMed] [Google Scholar]
  34. Kidd, G., Jr., Mason, C., and Arbogast, T. (2002). “Similarity, uncertainty, and masking in the identification of nonspeech auditory patterns,” J. Acoust. Soc. Am. 111, 1367–1376. 10.1121/1.1448342 [DOI] [PubMed] [Google Scholar]
  35. Leek, M., and Summers, V. (1993). “The effect of temporal waveform shape on spectral discrimination by normal-hearing and hearing-impaired listeners,” J. Acoust. Soc. Am. 94, 2074–2082. 10.1121/1.407480 [DOI] [PubMed] [Google Scholar]
  36. Levitt, H. (1971). “Transformed up-down methods in psychoacoustics,” J. Acoust. Soc. Am. 49, 467–477. 10.1121/1.1912375 [DOI] [PubMed] [Google Scholar]
  37. Levitt, H., and Rabiner, L. (1967). “Binaural release from masking for speech and gain in intelligibility,” J. Acoust. Soc. Am. 42, 601–608. 10.1121/1.1910629 [DOI] [PubMed] [Google Scholar]
  38. Litovsky, R. Y. (2005). “Speech intelligibility and spatial release from masking in young children,” J. Acoust. Soc. Am. 117, 3091–3099. 10.1121/1.1873913 [DOI] [PubMed] [Google Scholar]
  39. Litovsky, R. Y., Parkinson, A., and Arcaroli, J. (2009). “Spatial hearing and speech intelligibility in bilateral cochlear implant users,” Ear Hear. 30, 419–431. 10.1097/AUD.0b013e3181a165be [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Litovsky, R. Y., Parkinson, A., Arcaroli, J., and Sammath, C. (2006). “Clinical study of simultaneous bilateral cochlear implantation in adults: A multicenter study,” Ear Hear. 27, 714–731. 10.1097/01.aud.0000246816.50820.42 [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Litovsky, R., Parkinson, A., Arcaroli, J., Peters, R., Lake, J., Johnstone, P., and Yu, G. (2004). “Bilateral cochlear implants in adults and children,” Arch. Otolaryngol. Head Neck Surg. 130, 648–655. [DOI] [PubMed] [Google Scholar]
  42. Loizou, P., Dorman, M., and Tu, Z. (1999). “On the number of channels needed to understand speech,” J. Acoust. Soc. Am. 106, 2097–2103. 10.1121/1.427954 [DOI] [PubMed] [Google Scholar]
  43. Loizou, P., Hu, Y., Litovsky, R., Yu, G., Peters, R., Lake, J., and Roland, P. (2009). “Speech recognition by bilateral cochlear implant users in a cocktail-party setting,” J. Acoust. Soc. Am. 125, 372–383. 10.1121/1.3036175 [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. MacKeith, N., and Coles, R. (1971). “Binaural advantages in hearing of speech,” J. Laryngol. Otol. 85, 213–232. 10.1017/S0022215100073369 [DOI] [PubMed] [Google Scholar]
  45. Muller, J., Schon, F., and Helms, J. (2002). “Speech understanding in quiet and noise in bilateral users of the MED-EL COMBI 40∕40+ cochlear implant system,” Ear Hear. 23, 198–206. 10.1097/00003446-200206000-00004 [DOI] [PubMed] [Google Scholar]
  46. Muller-Deile, J., Schmidt, B., and Rudert, H. (1995). “Effects of noise on speech discrimination in cochlear implant patients,” Ann. Otol. Rhinol. Laryngol. Suppl. 166, 303–306. [PubMed] [Google Scholar]
  47. Neff, D. (1995). “Signal properties that reduce masking by simultaneous, random-frequency maskers,” J. Acoust. Soc. Am. 98, 1909–1920. 10.1121/1.414458 [DOI] [PubMed] [Google Scholar]
  48. Nelson, P., and Jin, S. (2004). “Factors affecting speech understanding in gated interference: Cochlear implant users and normal hearing listeners,” J. Acoust. Soc. Am. 115, 2286–2294. 10.1121/1.1703538 [DOI] [PubMed] [Google Scholar]
  49. Nelson, P., Jin, S., Carney, A., and Nelson, D. (2003). “Understanding speech in modulated interference: Cochlear implant users and normal hearing listeners,” J. Acoust. Soc. Am. 113, 961–968. 10.1121/1.1531983 [DOI] [PubMed] [Google Scholar]
  50. Nie, K., Stickney, G., and Zeng, F. G. (2005). “Encoding frequency modulation to improve cochlear implant performance in noise,” IEEE Trans. Biomed. Eng. 52, 64–73. 10.1109/TBME.2004.839799 [DOI] [PubMed] [Google Scholar]
  51. Oh, E., and Lutfi, R. (1998). “Nonmonotonicity of informational masking,” J. Acoust. Soc. Am. 104, 3489–3499. 10.1121/1.423932 [DOI] [PubMed] [Google Scholar]
  52. Pekkarinen, E., Salmivalli, A., and Suonpaa, J. (1990). “Effect of noise on word discrimination by subjects with impaired hearing, compared with those with normal hearing,” Scand. Audiol. 19, 31–36. [DOI] [PubMed] [Google Scholar]
  53. Qin, M., and Oxenham, A. (2003). “Effects of simulated cochlear-implant processing on speech reception in fluctuating maskers,” J. Acoust. Soc. Am. 114, 446–454. 10.1121/1.1579009 [DOI] [PubMed] [Google Scholar]
  54. Rosen, S. (1992). “Temporal information in speech: Acoustic, auditory and linguistic aspects,” Philos. Trans. R. Soc. London, Ser. B 336, 367–373. 10.1098/rstb.1992.0070 [DOI] [PubMed] [Google Scholar]
  55. Rothauser, E., Chapman, W., Guttman, N., Nordby, K., Silbiger, H., Urbanek, G., and Weinstock, M. (1969). “IEEE Recommended practice for speech quality measurements,” IEEE Trans. Audio Electroacoust. 17, 227–246. [Google Scholar]
  56. Rubinstein, J., and Hong, R. (2003). “Signal coding in cochlear implants: Exploiting stochastic effects of electrical stimulation,” Ann. Otol. Rhinol. Laryngol. Suppl. 191, 14–19. [DOI] [PubMed] [Google Scholar]
  57. Schleich, P., Nopp, P., and D’Haese, P. (2004). “Head shadow, squelch, and summation effects in bilateral users of the Med-El COMBI 40∕40+ Cochlear implant,” Ear Hear. 25, 197–204. 10.1097/01.AUD.0000130792.43315.97 [DOI] [PubMed] [Google Scholar]
  58. Schon, F., Muller, J., and Helms, J. (2002). “Speech reception thresholds obtained in a symmetrical four-loudspeaker arrangement from bilateral users of MED-EL cochlear implant,” Otol. Neurotol. 23, 710–714. 10.1097/00129492-200209000-00018 [DOI] [PubMed] [Google Scholar]
  59. Shannon, R. V., Galvin, J. J.III, and Baskent, D. (2002). “Holes in hearing,” J. Assoc. Res. Otolaryngol. 3, 185–199. 10.1007/s101620020021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science 270, 303–304. 10.1126/science.270.5234.303 [DOI] [PubMed] [Google Scholar]
  61. Skinner, M., Clark, G., Whitford, L., Seligman, P., Staller, S., Shipp, D., Shallop, J., Everingham, C., Menapace, C., and Arndt, P. (1994). “Evaluation of a new spectral peak coding strategy for the Nucleus 22 channel cochlear implant system,” Am. J. Otol. 15, 15–27. [PubMed] [Google Scholar]
  62. Smith, Z., Delgutte, B., and Oxenham, A. (2002). “Chimaeric sounds reveal dichotomies in auditory perception,” Nature (London) 416, 87–90. 10.1038/416087a [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Stickney, G. S., Zeng, F. G., Litovsky, R., and Assmann, P. (2004). “Cochlear implant recognition with speech maskers,” J. Acoust. Soc. Am. 116, 1081–1091. 10.1121/1.1772399 [DOI] [PubMed] [Google Scholar]
  64. Stickney, G., Assmann, P., Chang, J., and Zeng, F. G. (2007). “Effects of cochlear implant processing and fundamental frequency on the intelligibility of competing sentences,” J. Acoust. Soc. Am. 122, 1069–1078. 10.1121/1.2750159 [DOI] [PubMed] [Google Scholar]
  65. Stickney, G., Nie, K., and Zeng, F. G. (2005). “Contribution of frequency modulation to speech recognition in noise,” J. Acoust. Soc. Am. 118, 2412–2420. 10.1121/1.2031967 [DOI] [PubMed] [Google Scholar]
  66. Summers, V., and Molis, M. (2004). “Speech recognition in fluctuating and continuous maskers: Effects of hearing loss and presentation level,” J. Speech Lang. Hear. Res. 47, 245–256. 10.1044/1092-4388(2004/020) [DOI] [PubMed] [Google Scholar]
  67. Tyler, R., Dunn, C., Witt, S., and Preece, J. (2003). “Update on bilateral cochlear implantation,” Curr. Opin. Otolaryngol. Head Neck Surg. 11, 388–393. 10.1097/00020840-200310000-00014 [DOI] [PubMed] [Google Scholar]
  68. Tyler, R., Summerfield, Q., Wood, E., and Fernandes, M. (1982). “Psychoacoustic and phonetic temporal processing in normal and hearing-impaired listeners,” J. Acoust. Soc. Am. 72, 740–752. 10.1121/1.388254 [DOI] [PubMed] [Google Scholar]
  69. Tyler, R., Gantz, B., Rubinstein, J., Wilson, B., Parkinson, A., Wolaver, A., Preece, J., Witt, S., and Lowder, M. (2002). “Three-month results with bilateral cochlear implants,” Ear Hear. 23, 80S–89S. 10.1097/00003446-200202001-00010 [DOI] [PubMed] [Google Scholar]
  70. Van Hoesel, R. (2004). “Exploring the benefits of bilateral cochlear implants,” Audiol. Neurootol. 9, 234–246. [DOI] [PubMed] [Google Scholar]
  71. Van Hoesel, R., and Tyler, R. (2003). “Speech perception, localization, and lateralization with bilateral cochlear implants,” J. Acoust. Soc. Am. 113, 1617–1630. 10.1121/1.1539520 [DOI] [PubMed] [Google Scholar]
  72. Van Hoesel, R., Ramsden, R., and O’Driscoll, M. (2002). “Sound-direction identification, interaural time delay discrimination and speech intelligibility advantages in noise for a bilateral cochlear implant user,” Ear Hear. 23, 137–149. 10.1097/00003446-200204000-00006 [DOI] [PubMed] [Google Scholar]
  73. Vliegen, J., and Oxenham, A. (1999). “Sequential stream segregation in the absence of spectral cues,” J. Acoust. Soc. Am. 105, 339–346. 10.1121/1.424503 [DOI] [PubMed] [Google Scholar]
  74. Waltzman, S. B., Cohen, N. L., and Fisher, S. (1992). “An experimental comparison of cochlear implant systems,” Semin. Hear. 13, 195–207. 10.1055/s-0028-1085156 [DOI] [Google Scholar]
  75. Wichmann, F. A., and Hill, J. (2001a). “The psychometric function: I. Fitting, sampling, and goodness of fit,” Percept. Psychophys. 63, 1290–1313. [DOI] [PubMed] [Google Scholar]
  76. Wichmann, F. A., and Hill, J. (2001b). “The psychometric function: II. Bootstrap-based confidence intervals and sampling,” Percept. Psychophys. 63, 1314–1329. [DOI] [PubMed] [Google Scholar]
  77. Wilson, B., Lawson, D., and Muller, J. (2003). “Cochlear implant: Some likely next steps,” Annu. Rev. Biomed. Eng. 5, 207–249. 10.1146/annurev.bioeng.5.040202.121645 [DOI] [PubMed] [Google Scholar]
  78. Wilson, B., Schatzer, R., Lopez-Poveda, E., Sun, X., Lawson, D., and Wolford, R. (2005). “Two new directions in speech processor design for cochlear implants,” Ear Hear. 26, 73S–81S. 10.1097/00003446-200508001-00009 [DOI] [PubMed] [Google Scholar]
