Benefits of triple acoustic beamforming during speech-on-speech masking and sound localization for bilateral cochlear-implant users

David Yun; Todd R Jennings; Gerald Kidd, Jr; Matthew J Goupell

doi:10.1121/10.0003933

J Acoust Soc Am. 2021 May 5;149(5):3052–3072. doi: 10.1121/10.0003933

PMCID: PMC8102069 PMID: 34241104

Abstract

Bilateral cochlear-implant (CI) users struggle to understand speech in noisy environments despite receiving some spatial-hearing benefits. One potential solution is to provide acoustic beamforming. A headphone-based experiment was conducted to compare speech understanding under natural CI listening conditions and for two non-adaptive beamformers, one single beam and one binaural, called “triple beam,” which provides an improved signal-to-noise ratio (beamforming benefit) and usable spatial cues by reintroducing interaural level differences. Speech reception thresholds (SRTs) for speech-on-speech masking were measured with target speech presented in front and two maskers in co-located or narrow/wide separations. Numerosity judgments and sound-localization performance also were measured. Natural spatial cues, single-beam, and triple-beam conditions were compared. For CI listeners, there was a negligible change in SRTs when comparing co-located to separated maskers for natural listening conditions. In contrast, there were 4.9- and 16.9-dB improvements in SRTs for single beam and 3.5- and 12.3-dB improvements for triple beam (narrow and wide separations, respectively). Similar results were found for normal-hearing listeners presented with vocoded stimuli. Single beam improved speech-on-speech masking performance but yielded poor sound localization. Triple beam improved both speech-on-speech masking performance, albeit less than single beam, and sound localization. Thus, triple beam was the most versatile across multiple spatial-hearing domains.

I. INTRODUCTION

One of the main challenges facing cochlear-implant (CI) listeners is the difficulty they experience when attempting to understand target speech in environments with multiple competing talkers or other types of background noise/interference (e.g., Loizou et al., 2009; Litovsky et al., 2012; Goupell et al., 2016; Litovsky et al., 2017). In contrast, normal-hearing (NH) listeners effectively exploit spatial cues [interaural time differences (ITDs) and interaural level differences (ILDs)] that foster the perceived separation of sound sources, which reduces the unwanted masking caused by non-target sound sources (Bronkhorst and Plomp, 1988). Consequently, NH listeners typically demonstrate a larger advantage than CI listeners for understanding speech when masking sound sources originate from different locations than that of the target talker. In other words, NH listeners demonstrate a greater spatial release from masking (SRM; the difference in threshold for co-located and spatially separated target and masker sources, discussed further below) than CI listeners (Schleich et al., 2004; Buss et al., 2008; van Hoesel et al., 2008; Loizou et al., 2009; Bernstein et al., 2016; Goupell et al., 2016).

Fitting CI users with bilateral devices has become increasingly common because it may restore some spatial-perception abilities. Despite the gains found with this approach, there remain fundamental issues that prevent CI users from achieving larger binaural benefits. Sound processing by CIs usually results in the loss of acoustic temporal-fine-structure information, in particular low-frequency ITDs, which provide critical cues for precise sound localization for NH listeners (Wightman and Kistler, 1992) and for most hearing-impaired listeners (but non-CI; e.g., Dubno et al., 2002). Furthermore, CI listeners have demonstrated limited ability to exploit ITDs even when acoustic temporal-fine-structure information is encoded by low-rate pulse trains in current devices (Zirn et al., 2016; Ausili et al., 2020). Therefore, bilateral CI listeners primarily use ILDs for sound localization (e.g., Seeber and Fastl, 2008; Aronoff et al., 2010). Other factors can diminish the salience of binaural cues in bilateral CI listeners, such as interaural place-of-stimulation mismatches (e.g., Hu and Dietz, 2015) and channel interactions from multi-electrode stimulation (Kan and Litovsky, 2015; Egger et al., 2016). Although there have been several previous studies and processor development efforts aimed at mitigating these problems and improving spatial hearing in bilateral CI listeners, a substantial deficit remains. To bring the sound-localization abilities and speech-understanding performance in multiple-source “cocktail party” listening situations [see reviews in Middlebrooks and Simon (2017)] of bilateral CI listeners to levels more comparable to NH listeners (e.g., Schleich et al., 2004; Bernstein et al., 2016; Goupell et al., 2016), advances in the effectiveness of bilateral CI approaches are necessary.

One approach to enhance speech understanding under masked conditions is to use front-end (i.e., before the input to the CI electrode arrays) signal-processing algorithms that implement noise reduction or other types of signal enhancement, rather than concentrating on improving the fidelity of binaural cues provided by the CIs. Such approaches aim to remove the masking sounds prior to reception by the listener. Acoustic beamforming is an example of such an approach that has found some success in enhancing speech understanding in noise for both hearing-aid (Valente et al., 1995; Valente et al., 2000) and CI users (Chung et al., 2006; Chung and Zeng, 2009; Baumgartel et al., 2015a; Baumgartel et al., 2015b; Dieudonné and Francart, 2018; Williges et al., 2018). Acoustic beamformers selectively amplify signals originating from a specified direction (typically from the azimuth the listener is facing) and attenuate signals originating from other azimuths. Thus, if the target source is in front of the listener at the focus of the beamformer and the maskers originate from non-frontal azimuths, then the signal-to-noise ratio (SNR) [or in the case of speech-on-speech masking, target-to-masker ratio (TMR)] is improved by an amount that depends on a variety of factors including the (typically) frequency-dependent directional response. A simple type of beamformer is a directional microphone with a fixed polar pattern of amplification. This type of directional microphone can consist of two or more microphones oriented front-to-back. The signals received then undergo a filter-and-sum algorithm that provides a directional response. The greater the number of microphones used, the greater the response directionality that can be achieved (e.g., Stadler and Rabinowitz, 1993; Desloge et al., 1997; Greenberg et al., 2003). Some beamforming methods have utilized adaptive patterns of amplification. Adaptive beamformers change their polar patterns based on the incoming signal to steer “nulls” (i.e., attenuation) in response to the locations of the interfering sound sources. Under appropriate conditions, adaptive beamformers have been found to provide better speech understanding in noise than omnidirectional amplification both for hearing-aid (Ricketts and Henry, 2002; Valente et al., 2006) and CI users (Spriet et al., 2007; Dorman et al., 2018), although the benefits provided by adaptive methods have limitations too, such as computational complexity (cf. Greenberg et al., 2003).

One limitation of many fixed and adaptive beamformers is that they usually assume the target source location is in front. In reality, there are many instances when the target source location is not in a frontal direction, and the target of interest can also change location rapidly, particularly in a multi-talker scenario. With these non-steering beamformers, the listener's only option to change the beamformer direction is to turn the head toward the target source, a strategy that may be socially awkward or too slow to follow the switching of speakers in a conversation (e.g., Roverud et al., 2018; see also Hendrikse et al., 2019). Steering beamformers, which allow the beamformer directionality to be steered away from the frontal azimuth and toward the target source location, have been explored with electroencephalogram attention decoding (e.g., Fuglsang et al., 2017; Aroudi and Doclo, 2020) and a visually guided hearing aid (VGHA), where an eye-tracker steers a beamformer based on the eye-gaze direction of the user (Kidd et al., 2013; Best et al., 2017b; Roverud et al., 2018). The VGHA thus allows a hearing-impaired (or, potentially, a CI) listener to select and focus on a talker of their choosing in a sound field comprising multiple talkers. The benefits of the beamformer component of the VGHA have been described for static (Kidd et al., 2015) and dynamic situations (Best et al., 2017b; Roverud et al., 2018; Hládek et al., 2019).

Although the beamformer used by a VGHA provides a high degree of spatial selectivity and typically yields large improvements in speech reception thresholds (SRTs) for both speech and noise maskers (e.g., Kidd et al., 2015), it also delivers only a single-beam output so that the stimulus is presented monotically or diotically to the listener. This means that the TMR advantage provided by the beamformer comes at the cost of greatly reduced sound-localization performance and diminished spatial awareness. For many hearing-impaired listeners, a hybrid natural-beamformer strategy that uses a combination of beamforming in the higher frequencies and natural head-induced head-related transfer functions (HRTFs) in the lower frequencies produced both large improvements in SRTs and preserved good localization performance of broadband and low-frequency sounds (Kidd et al., 2015; Best et al., 2017b). The benefit of this hybrid beamformer strategy, however, depends on the listener being able to exploit low-frequency ITDs, an ability that is severely compromised in bilateral CI users (Litovsky et al., 2012; Churchill et al., 2014).

An alternative to this hybrid beamformer strategy that does not depend on low-frequency ITDs has been proposed recently (Jennings and Kidd, 2018; Kidd et al., 2020). This new method, called “triple beam,” uses three spatially tuned beams of amplification designated left, center, and right. The center beam is presented diotically and is equivalent to what is used in the single beamformer (hereafter called “beam”),¹ which has been used in the earlier VGHA studies (e.g., Favrot et al., 2013; Kidd et al., 2013; Kidd et al., 2015). The center beam—absent any eye-gaze steering—is focused at 0° azimuth relative to the axis of the head (i.e., directly in front of the listener). The left and right beams are routed only to the left and right ears, respectively, and may be set at any angle from 0° to –90° (left beam) or from 0° to +90° (right). Thus, the input to the left ear is the superposition of the waveforms from the center beam and the left beam, while the input to the right ear is the superposition of the waveforms from the center beam and the right beam. The design goal of the triple-beam algorithm is to take advantage of the benefits of beamforming for speech understanding while also providing improved localization ability. Note that the triple-beam approach is somewhat counterintuitive in that the side beams may actually increase the input levels of the non-target/masker sources (i.e., decrease TMR), depending on the masker locations and orientation of the side beams.
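The routing described above can be sketched in a few lines. This is only an illustration of the summation step, assuming the three beam outputs are already available as equal-length sample sequences; the function names `triple_beam_mix` and `single_beam_mix` are hypothetical and are not taken from the cited studies.

```python
from typing import List, Sequence, Tuple


def triple_beam_mix(
    left_beam: Sequence[float],
    center_beam: Sequence[float],
    right_beam: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return (left_ear, right_ear) signals for a triple-beam presentation.

    Each ear receives the center beam plus the side beam on its own side,
    summed sample by sample with equal weight.
    """
    if not (len(left_beam) == len(center_beam) == len(right_beam)):
        raise ValueError("all beam outputs must have the same number of samples")
    left_ear = [c + side for c, side in zip(center_beam, left_beam)]
    right_ear = [c + side for c, side in zip(center_beam, right_beam)]
    return left_ear, right_ear


def single_beam_mix(center_beam: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (left_ear, right_ear) for the diotic single-beam presentation."""
    return list(center_beam), list(center_beam)
```

Because the center beam is common to both ears, any level difference between the two outputs comes entirely from the side beams, which is how the scheme can convey ILDs without conveying ITDs.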

Other novel binaural beamformers with binaural cue preservation and/or amplification have been suggested recently (e.g., Aroudi and Doclo, 2020). Using an approach somewhat similar to that used in the triple beam, Dieudonné and Francart (2018) reported using two beams directed to the sides (no diotic center beam as in triple beam) and introduced ILDs to frequencies <1500 Hz. Using this two-beam approach, they found that SRTs improved for NH listeners presented with bimodal CI simulations by 15.7 dB with noise presented to the CI ear and 7.6 dB with noise presented to the HA ear. Localization errors likewise were improved. Williges et al. (2018) used a steering beamformer and artificially applied ILDs based on estimated source direction in bilateral CI users and showed improvements to spatial-hearing abilities. Given these promising new approaches, we speculated that the triple-beam algorithm could provide a benefit for bilateral CI listeners by exploiting exaggerated ILDs, the primary sound-localization cue for this population, thus providing a greater sense of spatial hearing than would be possible otherwise.

The first goal of the current study was to determine the beamforming benefits for bilateral CI listeners in a spatial speech-on-speech unmasking task using a single beam and the new triple beam. The expectation was that both beamforming approaches would improve speech understanding in “cocktail party” listening conditions in bilateral CI listeners. We were also interested in whether exaggerated ILDs might maintain or improve speech understanding using triple beam by enhancing perceptual segregation of sound sources in conditions where TMR did not improve or even could decrease compared to the single beam (cf. Kidd et al., 2020). Thus, a highly informational masking situation was chosen. Some previous studies have indicated that exaggerated spatial cues could improve SRTs in bilateral CI listeners (Brown, 2014; Bernstein et al., 2016). It is unclear, however, how beneficial spatial information is for speech understanding with interferer sounds for bilateral CI listeners; a majority of the benefit is well described by improvements in TMR (Loizou et al., 2009; Dieudonné and Francart, 2019). A second goal was to determine whether the triple-beam approach specifically would improve sound-source localization. We hypothesized that the triple-beam approach could maintain speech understanding by compensating for decreased TMR with exaggerated spatial cues while also enhancing the ability of these listeners to localize sound sources. Preserving both abilities could provide an advancement in amplification/remediation strategies for bilateral CI users. We also obtained judgments from bilateral CI listeners about the number of sound sources that were present in the sound field in addition to where they were located. The rationale for making those measurements was that a complete understanding of the benefit provided by various algorithms for communication in CI listeners requires consideration of multiple functional aspects of hearing. NH listeners also were tested under channel-vocoded (i.e., a CI simulation; Shannon et al., 1995) stimulus conditions for comparison to the patterns of performance observed for the bilateral CI listeners and under natural listening conditions (which have implications for benefits realized by hearing-aid users).

II. EXPERIMENT I: SPEECH-ON-SPEECH MASKING IN CI LISTENERS

The purpose of this experiment was to compare SRTs in speech-on-speech masking situations using three listening algorithms: natural presentation, a single-beam algorithm (beam), or a hybrid beamforming algorithm with three beams (triple beam).

A. Methods

1. Listeners and equipment

Ten bilateral CI listeners (four females, 25–90 yr, average = 57.3 yr) were tested in this experiment. Listener information is presented in Table I; listeners S1–S10 participated in experiment I. All listeners had at least 1 yr of experience using bilateral CIs. Most of the bilateral CI listeners used their everyday programs on their personal sound processors, which were located behind the ears, and microphones were above the pinnae. S5's typical Cochlear sound processors were CP950s (where the microphone is not located above the pinna, but near the magnet that is behind the pinna). She was tested here, instead, with a pair of standard behind-the-ear CP910s dedicated for research purposes that also were adjusted to her typical settings. The listeners were instructed to select the program with a nearly omnidirectional microphone setting and without SCAN (an automatic scene classifying algorithm) if possible. Two of the ten CI listeners, S1 and S4, were tested with SCAN on because they had no alternative on their processors.

TABLE I.

CI listener demographic information and hearing history.

Code	Gender	Age (yr)	Implant use, left (yr)	Implant use, right (yr)	Sound processor, left	Sound processor, right
S1	M	72	15	3	CP920	CP920
S2	F	60	9	9	CP910	CP910
S3	M	31	9	11	Freedom	Freedom
S4	M	60	8	3	CP920	CP920
S5	F	64	1	7	CP910	CP910
S6	F	65	5	7	CP910	CP910
S7	M	78	4	7	CP910	CP910
S8	F	28	26	8	CP920	CP920
S9	M	90	7	15	CP920	CP920
S10	M	25	7	7	CP920	CP920
S11	F	66	9	8	CP910	CP910
S12	F	74	11	14	CP910	CP920
S13	F	77	13	2	Naída CI Q90	Naída CI Q90
S14	F	45	10	8	CP920	CP920

A personal computer ran the experiment in MATLAB (MathWorks, Natick, MA). Stimuli were delivered by a sound card (UA-25 EX; Edirol/Roland Corp., Los Angeles, CA) and amplifier (D-75A; Crown Audio, Elkhart, IN) to circumaural headphones (HD650; Sennheiser, Wedemark, Germany). Testing was performed in a double-walled sound-attenuating booth (IAC, North Aurora, IL).

Testing was performed at the University of Maryland (College Park, MD). The Institutional Review Board at the University of Maryland approved this research protocol. Informed consent was obtained from listeners before testing.

2. Stimuli

Stimuli were five-word sentences consisting of a name, verb, number, adjective, and object from a laboratory-designed matrix-style corpus (Kidd et al., 2008). There were eight monosyllabic words per category, yielding a total of 40 words. The listener was presented with one target female talker and two masker female talkers. On each trial, three talkers were randomly chosen from a set of eight young-adult female talkers. All talkers spoke with neutral inflection. The duration of the individual words in this corpus was not the same; therefore, while words in each sentence stream occurred at approximately the same time, they were not precisely time aligned. The talkers and words chosen for both the target and maskers were mutually exclusive for each trial. Stimuli were processed in MATLAB to simulate talkers at different spatial locations, with or without beamforming. They were presented through a pair of circumaural headphones that were placed over the sound processors (Grantham et al., 2008; Goupell et al., 2018).

3. Beamformer characteristics

Measured HRTFs for the beamformers were used to virtually place the target at 0° (in front) and the masking talkers at either 0° (co-located) or symmetrically spaced around the target at locations of ±30° or ±90°. The HRTFs for the algorithms and spacings were created by recording impulse responses in a mildly reverberant sound field laboratory using 16 omnidirectional microphones mounted in four front-to-back rows of four microphones distributed laterally across the top of the head on a flexible headband/circuit board placed on the KEMAR manikin head [Fig. 6 in Kidd (2017); Fig. 2 in Roverud et al. (2018)]. This setup has been used exclusively for investigational purposes (cf. Kidd, 2017); such a beamformer would not be realizable in typical commercial devices and is distinct from other studies that have evaluated approaches that could be implemented in current devices (Adiloglu et al., 2015; Baumgartel et al., 2015b; Dieudonné and Francart, 2018). The signals from the 16 microphones were filtered/delayed-and-summed to create a single-beam beamformer aimed in any specified direction within a range of angles about the head. Beam consisted of a single beam that always pointed toward the front of the listener. Roverud et al. (2018) and Kidd et al. (2020) describe the recording and signal processing for beam and KEMAR in more detail. The triple beam was a combination of three beamformers, which was a simple summation with equal weight on each channel (i.e., the left channel was presented the input from the summed left and center beams) (Kidd et al., 2020). The triple beam had the ability to provide non-zero ILDs for sound sources, depending on the location, but no ITDs were provided.

Figures 1(A) and 1(B) provide an illustration of the broadband spatial response patterns of the triple-beam algorithm. The input stimulus was a broadband noise that initially had the same long-term average spectrum as a female speaker in the corpus that was used in the procedure. Then it was bandpass filtered between 200 and 8000 Hz using a fourth-order forward-backward Butterworth filter, the purpose being to approximately match the frequency range of the CI listeners participating in this experiment and the NH listeners presented vocoded speech participating in experiment II. Slight asymmetries in attenuation were due to irregularities in the room and placement of the microphone array during the recording of the impulse responses used to compute the filters. One beam was aimed toward the front of the listener at 0° (i.e., toward the target source) and created a single-beam signal that was routed to both ears (equivalent to the beam condition). The two other beams (called the left- and right-side beams) were oriented symmetrically at fixed azimuths of ±40°, and the output of each beam, after combination with the output of the center beam, was routed monaurally to the proximal ear. As can be seen in Fig. 1(C), this configuration of beams produces exaggerated ILDs relative to natural listening conditions from 0 to ±60° (Kidd et al., 2020).

FIG. 1. — (Color online) Illustrations of the patterns of attenuation [(A) and (B)] and ILDs (C) for the triple beam (TRIBEAM) algorithm based on measured impulse responses of the microphone array. In (A), each line represents a beam oriented in a different direction: solid (left ear only) is aimed at –40°, dashed (both ears) is aimed at 0° (directly in front of the listener), and dashed-dotted (right ear only) is aimed at +40°. The y axis represents attenuation relative to the signal with no beam processing (0-dB attenuation). (B) is similar except the attenuation is that for the summed center and side beams; in other words, it is the attenuation experienced by each ear when presented the triple-beam algorithm. (C) demonstrates the ILD (right – left attenuation in dB) from the KEMAR algorithm (solid) and the triple-beam algorithm (dashed). The beam algorithm is omitted because the diotic presentation produces 0-dB ILD for all source azimuths. (D) shows the TMR (solid lines) as well as the corresponding iSNR (dashed lines; see text for a description) as a function of masker angle for symmetrically placed maskers presented at the same level as a target placed at 0°. Symbols on the lines indicate TMR/iSNR levels corresponding to masker angles tested in experiments I and II.

As a result of the ILDs and directional microphone characteristics, improvements in the TMR can be achieved, depending on the spatial positions of target and maskers. We therefore calculated the intelligibility-weighted signal-to-noise ratio (iSNR) to estimate such improvements (Greenberg et al., 1993; Baumgartel et al., 2015a). The iSNR was calculated by filtering the signal (Gaussian noise convolved with the long-term average spectrum of the female BUG corpus used in this study) into six bands using fourth-order Butterworth filters and weighting and summing the SNRs for the resulting bands according to the octave-band speech intelligibility index standard (ANSI, 2017). The reported iSNR was the average of both ears. Because iSNR calculations were mostly independent of TMR, we provide the iSNR with target and maskers presented at the same level in Fig. 1(D). In general, the beam and triple-beam configurations show improvements in iSNR as the maskers are moved away from the target located at the midline to more lateral azimuths, consistent with the goal of beamforming and the opposite of what occurs for the natural listening (KEMAR) condition for these spatial configurations.
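The weighting-and-summing step of the iSNR can be illustrated as follows. This is a minimal sketch that assumes the per-band SNRs (in dB) have already been measured at each ear; the band-importance weights are passed in as a parameter and should be taken from the octave-band table of the standard. The function names are hypothetical, and the band-filtering stage is not shown.

```python
from typing import Sequence


def intelligibility_weighted_snr(
    band_snrs_db: Sequence[float],
    band_weights: Sequence[float],
) -> float:
    """Weighted average of per-band SNRs (dB) using band-importance weights.

    The weights are normalized internally, so they need not sum to 1.
    """
    if len(band_snrs_db) != len(band_weights):
        raise ValueError("need exactly one weight per band")
    total_weight = sum(band_weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * snr for w, snr in zip(band_weights, band_snrs_db))
    return weighted / total_weight


def two_ear_isnr(
    left_band_snrs_db: Sequence[float],
    right_band_snrs_db: Sequence[float],
    band_weights: Sequence[float],
) -> float:
    """Average of the left-ear and right-ear iSNR values, as reported in the text."""
    left = intelligibility_weighted_snr(left_band_snrs_db, band_weights)
    right = intelligibility_weighted_snr(right_band_snrs_db, band_weights)
    return 0.5 * (left + right)
```

With equal weights the result reduces to the plain mean of the band SNRs; unequal weights let bands that matter more for intelligibility dominate the estimate.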

4. Procedure

Listeners were asked to identify the words spoken by a target talker in the presence of two competing masker talkers. The target talker was identified by the first word in the sentence (the name “Sue”; see below). Listeners were tested using three listening algorithms: natural presentation (KEMAR), a single-beamforming algorithm (beam), or a hybrid beamforming algorithm with three beams (triple beam). Impulse responses for KEMAR were recorded in the same room used to record the impulse responses for beam and triple beam, using loudspeakers placed 1.5 m away and the in-ear microphones of a KEMAR manikin (G.R.A.S., Holte, Denmark), not the behind-the-ear (BTE) microphones common to CI sound processors [see detailed description in Kidd (2017)]. Use of the in-ear microphones provided easier comparison to previous studies and results in larger ILDs compared to BTE microphones (Mayo and Goupell, 2020); the resultant ILDs are presented in Fig. 1(C).

To set the stimulus levels, the target was placed at 0° azimuth and initially presented at a sound pressure level (SPL) of 65 dB. Following this initial presentation level, sample words were presented to the CI listener in order to achieve an interaurally loudness-balanced presentation. Monaurally presented speech stimulus tokens were presented to the listener, and an experimenter adjusted a single ear to a comfortable level based on listener report. After each ear was set at a monaural comfortable level, interaurally loudness-balanced values were determined by playing sample speech stimulus tokens sequentially across the ears. Again, an experimenter adjusted the sound levels while maintaining a comfortable overall loudness. Finally, the stimuli were presented dichotically to both ears, and the experimenter made any final adjustments. Only the loudness was balanced, not the spatial location (Fitzgerald et al., 2015; Baumgärtel et al., 2017). During testing, this pretest-determined loudness-balancing value for individual CI listeners was applied to stimuli in addition to any attenuation provided by spatial and microphone processing. Four listeners needed no loudness adjustment, while the other six listeners needed an interaural adjustment of 4 dB or less.

To minimize the effect of automatic gain control of the CIs, all stimuli were presented at or below the loudness-balanced levels. Thus, when the TMR was positive, the target level was held constant near 65 dB SPL while the masker levels were reduced, and when the TMR was negative, the maskers were held constant near 65 dB SPL while the target level was reduced (Goupell et al., 2016). This procedure allowed for the appropriate TMR while minimizing the potential confound introduced by stimuli that became uncomfortably loud. The scaling for target and maskers was applied before spatial processing. TMR in this study refers to the ratio of sound level of the target to the individual masker levels, not to the overall level of the two maskers combined.
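The level rule in this paragraph amounts to a small piece of arithmetic, sketched below. The function name is hypothetical, the 65-dB SPL reference is the nominal value from the text, and the per-listener loudness-balancing offsets are not modeled.

```python
from typing import Tuple


def target_and_masker_levels(
    tmr_db: float, reference_db_spl: float = 65.0
) -> Tuple[float, float]:
    """Return (target_level, per_masker_level) in dB SPL for a requested TMR.

    Neither level exceeds the reference: a positive TMR is produced by lowering
    each masker, and a negative TMR is produced by lowering the target.
    """
    if tmr_db >= 0:
        return reference_db_spl, reference_db_spl - tmr_db
    return reference_db_spl + tmr_db, reference_db_spl


print(target_and_masker_levels(6.0))   # (65.0, 59.0)
print(target_and_masker_levels(-6.0))  # (59.0, 65.0)
```

The masker level returned here applies to each masker individually, matching the definition of TMR used in this study.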

The listeners were tested using a one-up one-down adaptive procedure that estimated the 50% correct point on the psychometric function. The level corresponding to the 50% correct point was taken as an estimate of the SRT specified as the TMR in dB. On each trial, the listener was simultaneously presented with one target sentence and two masker sentences. The first target word was always “Sue.” The listener selected the four other target words in sequence from an array of all 32 words available from the speech matrix (i.e., excluding the name category, which was not scored). The responses were registered by using a mouse to select the printed words from a graphical user interface displaying the word matrix on a monitor. Correctly indicating at least three of the four target words was deemed a correct response. Listeners were told that they could expect the target sentence from the front.

The experimental runs were blocked by microphone condition (KEMAR, beam, and triple beam) and spatial condition (maskers co-located at 0°, spatially separated at ±30°, or spatially separated at ±90°). CI listeners were tested on nine (3 × 3) combinations in total. Each condition combination was tested in one block consisting of four adaptive tracks. For each trial within the block, a track was randomly chosen from any of the unfinished tracks until all four tracks were completed. For each adaptive track, the TMR started at 12 dB. The initial adaptive step size was 6 dB. The adaptive step size was reduced to 3 dB after three reversals. Each track ended after at least 20 trials and a minimum of nine reversals had occurred. The TMRs of the last six reversals were averaged to obtain the SRT for an individual track. The final SRT for a condition combination for each listener was the average of the four individual track SRTs. The order of condition combinations was randomized for each listener. The experiment took a total of approximately 4 h for each individual CI listener. Listeners were given breaks and were tested over more than one day if needed.
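The adaptive rules above can be sketched as a single-track simulation. This is one plausible reading of the procedure, not the laboratory's code: it assumes a reversal is logged at the TMR of the trial on which the response direction changed, and it adds a `max_trials` guard that is not part of the described method. The function names and the `respond` callback are hypothetical.

```python
from typing import Callable, List, Sequence


def is_trial_correct(
    response: Sequence[str], target: Sequence[str], criterion: int = 3
) -> bool:
    """A trial counts as correct when at least `criterion` scored words match."""
    return sum(r == t for r, t in zip(response, target)) >= criterion


def run_adaptive_track(
    respond: Callable[[float], bool],
    start_tmr_db: float = 12.0,
    initial_step_db: float = 6.0,
    final_step_db: float = 3.0,
    reversals_before_final_step: int = 3,
    min_trials: int = 20,
    min_reversals: int = 9,
    n_reversals_averaged: int = 6,
    max_trials: int = 1000,  # safety guard added for this sketch only
) -> float:
    """One-up one-down track; returns the mean TMR of the last reversals (dB)."""
    tmr = start_tmr_db
    step = initial_step_db
    reversals: List[float] = []
    last_direction = 0  # -1: TMR was lowered, +1: TMR was raised, 0: first trial
    n_trials = 0
    while n_trials < min_trials or len(reversals) < min_reversals:
        if n_trials >= max_trials:
            raise RuntimeError("track did not converge within max_trials")
        correct = respond(tmr)
        n_trials += 1
        direction = -1 if correct else 1  # harder after correct, easier after error
        if last_direction != 0 and direction != last_direction:
            reversals.append(tmr)
            if len(reversals) >= reversals_before_final_step:
                step = final_step_db
        last_direction = direction
        tmr += direction * step
    return sum(reversals[-n_reversals_averaged:]) / n_reversals_averaged
```

For example, an idealized listener who is correct whenever the TMR is at or above 0 dB (`run_adaptive_track(lambda tmr: tmr >= 0.0)`) ends up alternating between 0 and -3 dB, so the track estimate falls midway between those two values.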

B. Results and discussion

The SRTs for individual listeners are shown in Fig. 2. Generally, large differences were observed across listeners. For example, listeners S8 and S10 had relatively low SRTs (e.g., one threshold each less than –10-dB TMR), while listeners S1 and S9 had relatively high SRTs (e.g., multiple thresholds greater than 10-dB TMR). The patterns of SRTs, however, were relatively consistent across listeners, supporting a group-level analysis.

Figure 3 shows the group average results for the CI listeners. A two-way repeated-measures (RM) analysis of variance (ANOVA) was performed on the SRTs with factors of masker angle (three levels: 0°, ±30°, ±90°) and microphone condition (three levels: KEMAR, beam, and triple beam). There was a significant main effect of masker angle [F(2,18) = 155.2, p < 0.0001, $\eta_{p}^{2}$ = 0.95] and of microphone condition [F(2,18) = 239.1, p < 0.0001, $\eta_{p}^{2}$ = 0.96]. The masker angle × microphone interaction also was significant [F(4,36) = 257.0, p < 0.0001, $\eta_{p}^{2}$ = 0.89].

FIG. 3. — Group average SRTs (A) for CI listeners as a function of masker separation in azimuth. Error bars represent ±1 standard deviation. (B) shows iSNRs vs SRTs. Each point corresponds to a certain microphone condition and masker angle. The dashed line shows a linear regression of the data.

To analyze this interaction, 36 two-tailed paired t-tests assuming equal variance were performed post hoc, and the p values are reported with Bonferroni correction. For the co-located condition (masker angle = 0°), there were no significant differences in SRTs between microphone conditions [KEMAR vs beam, KEMAR vs triple beam, and beam vs triple beam (p > 0.05 for all three comparisons)]. At masker angles of ±30°, SRTs were significantly lower for beam compared to KEMAR (p < 0.01). The SRT for triple beam was not significantly different from the SRT for either KEMAR or beam (p > 0.05 for both comparisons). At masker angles of ±90°, SRTs for KEMAR were significantly higher than SRTs for beam and triple beam (p < 0.0001 for both), and SRTs for triple beam were significantly higher than SRTs for beam (p < 0.005).
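As a brief illustration of the correction used here, the sketch below applies a Bonferroni adjustment to a list of raw p values. The count of 36 matches the number of distinct pairs among nine condition combinations; treating the comparisons as all such pairs is an assumption made for this example, and the function name is hypothetical.

```python
from math import comb
from typing import List, Optional, Sequence


def bonferroni_adjust(
    p_values: Sequence[float], n_comparisons: Optional[int] = None
) -> List[float]:
    """Multiply each raw p value by the number of comparisons, capped at 1."""
    m = len(p_values) if n_comparisons is None else n_comparisons
    if m < len(p_values):
        raise ValueError("n_comparisons cannot be smaller than the number of p values")
    return [min(1.0, p * m) for p in p_values]


N_CONDITIONS = 9                  # 3 masker angles x 3 microphone conditions
N_PAIRS = comb(N_CONDITIONS, 2)   # 36 distinct pairs of conditions
```

Under this correction, a raw p value must fall below 0.05/36 (about 0.0014) for the adjusted value to stay under 0.05.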

For the KEMAR condition, there were no significant differences in SRTs between different angles (p > 0.05 for all three comparisons). For the beam condition, the SRT for the masker angle = 0° was significantly higher than the SRT at ±30° (p < 0.01) and ±90° (p < 0.0001), and the SRT for the masker angle = ±30° was significantly higher than the SRT at ±90° (p < 0.0001). For the triple-beam condition, the SRT for the masker angle = 0° was significantly higher than the SRT at ±30° (p < 0.05) and ±90° (p < 0.0005), and the SRT for the masker angle = ±30° was significantly higher than the SRT at ±90° (p < 0.0005). Therefore, the two-way interaction arose because the decrease in SRTs with increasing masker angle grew progressively larger across the three microphone conditions: KEMAR (no change), triple beam (improvement), and beam (the most improvement).

The average SRTs obtained in the nine condition combinations were compared to the iSNR [Fig. 3(B)]. A linear regression was highly significant (p < 0.0001) and explained 90% of the variance in the data points. Therefore, performance on this task was well described by the changes in iSNR, including the better SRTs for beam compared to triple beam, consistent with other studies in bilateral CI listeners (Baumgartel et al., 2015a). This also suggests minimal additional binaural benefit from any perceived spatial separation with the triple-beam algorithm, contrary to our hypothesis.

In summary, the bilateral CI listeners benefited greatly from beamforming compared to natural spatial cues. Specifically, there was minimal improvement in SRTs when the target and maskers were spatially separated in the KEMAR conditions, while both beam and triple beam produced large improvements in SRTs. Furthermore, the improvement in SRTs for the beam condition was significantly greater than that for the triple-beam condition, suggesting the improvements in SRTs were driven primarily by improving the TMR rather than by binaural unmasking from improved perceived spatial separation of the three sources. This result is consistent with the highly significant correlation between SRTs and iSNR.

III. EXPERIMENT II: SPEECH-ON-SPEECH MASKING IN NH LISTENERS

A second experiment was performed with NH listeners using stimuli that were processed through a vocoder under conditions that paralleled those tested with the CI listeners. Vocoding degrades the spectrotemporal representation of the stimulus relative to natural presentation in a manner that is similar to the sound processing performed by actual CIs (Loizou, 2006). An advantage of this type of CI simulation with NH listeners is that it allows certain binaural cues to be manipulated in ways that can control some of the stimulus properties that underlie the spatial-hearing abilities of CI listeners. For example, diminishing the interaural correlation of the vocoded signals may blur perceptual sound-source information and thus affect spatial-hearing abilities such as localization and source segregation (Jones et al., 2014; Swaminathan et al., 2016; Goupell and Stakhovskaya, 2018; Baltzell et al., 2020). In this experiment, two types of carriers were used, interaurally uncorrelated noise carriers that were intended to convey relatively poor spatial-location information (i.e., perceptually diffuse sound sources) and correlated noise carriers that were intended to convey better spatial-location information (i.e., perceptually compact sound sources) (Jones et al., 2014). We hypothesized that the NH listeners presented with the vocoded stimuli would produce poorer SRTs than with the non-vocoded stimuli because of diminished spatial-hearing abilities. We also hypothesized that uncorrelated noise carriers would diminish the usefulness of perceived spatial differences for performing the speech-on-speech masking task. The relevant question then was how different beamforming strategies would benefit listeners under these different interaural vocoding schemes and how that compared to the performance by these same listeners under matched natural (non-vocoded) listening.