The Journal of the Acoustical Society of America, 133(3), EL202–EL207 (13 February 2013). doi: 10.1121/1.4791710

Design and preliminary testing of a visually guided hearing aid

Gerald Kidd, Jr., Sylvain Favrot, Joseph G. Desloge, Timothy M. Streeter, and Christine R. Mason

Abstract

An approach to hearing aid design is described, and preliminary acoustical and perceptual measurements are reported, in which an acoustic beam-forming microphone array is coupled to an eye-glasses-mounted eye tracker. This visually guided hearing aid (VGHA)—currently a laboratory-based prototype—senses direction of gaze using the eye tracker, and an interface converts those values into control signals that steer the acoustic beam accordingly. Preliminary speech intelligibility measurements with noise and speech maskers revealed near-normal or better-than-normal spatial release from masking with the VGHA. Although the VGHA is not yet a wearable prosthesis, these findings support the principle underlying the device.

Introduction

Perhaps the most serious limitation to effective communication due to sensorineural hearing loss is the reduced ability to perceptually segregate and selectively attend to one specific talker among several competing talkers. The underlying reasons for this source selection problem are not fully understood but appear to be related to changes in the way that sounds are processed in the auditory system that extend beyond the reduced sensitivity and abnormal loudness growth that are characteristic of most sensorineural pathologies. For the majority of hearing losses that are not medically remediable, a hearing aid is the only viable treatment. However, even with the most sophisticated modern hearing aids, the fundamental problem remains: How to selectively amplify the sounds the listener wishes to hear while excluding unwanted, interfering sounds.

Current hearing aids provide two strategies for enhancing source selection that go beyond amplification and amplitude compression: Noise reduction and directionality. It is beyond the scope of this article to review the performance of these two approaches, but two general comments will be made here. First, noise reduction depends on isolating and identifying the “noise” (from an unwanted “masker” source) when it is mixed with the “signal” (from the “target” source). Acoustically, however, such a distinction often is moot. Consider conversation in a multitalker “cocktail party” environment. The talker to which the listener wishes to attend may change from moment to moment, sometimes unpredictably. Thus the source that is the target depends on the perspective of the listener and not on any dependable acoustic characteristics. Noise reduction algorithms must make some assumptions about the properties of noise in order to attenuate it, but they cannot account for the changeable, internally determined designation of the target source by the listener. Second, directionality potentially does offer significant benefit in solving the source selection problem, especially for well-separated sources. However, the existing ways of implementing directionality have some limitations. Directional microphones typically are placed in a fixed location on the head—regardless of whether it is a single directional ear-level aid, a pair of such aids, or an array of microphones—meaning that the direction that is selectively emphasized is fixed relative to head position. Orienting the direction of amplification therefore depends on turning the head toward the source. Head turns are relatively slow, limited in extent, and, if required frequently to follow changes in source location (e.g., turn-taking in conversation), may be taxing both physically and socially. Shinn-Cunningham and Best (2008) summarize these limitations: “…there is a key difference between how NH [normal hearing] listeners use selective attention and how a directional hearing aid works. Selective attention is steerable, focusing and refocusing on whatever sound source is of interest at a given moment…. In contrast, a directional aid focuses attention in the direction a listener is facing, with no consideration of the current goal or desired focus of attention…even if the hearing-aid user is almost instantaneously able to determine the direction to face, the physical act of turning the head is slower than the time required by a NH listener to switch the spatial focus of attention….Thus,…current hearing-aid technology…do(es) not restore the functional ability to fluidly focus attention on whatever source is immediately important, an ability that is critical if a listener is to participate in everyday social interactions” (p. 293).

The current article describes a new approach to hearing aid design that attempts to directly couple focused amplification to the selection of a sound source by the listener.1 This approach—referred to as the visually guided hearing aid (VGHA)—uses eye gaze as the means of steering directional amplification. Conceptually, the prototype VGHA is fairly simple: It consists of a head-worn beam-forming microphone array coupled to an eye-glasses mounted eye tracker by a laboratory-designed interface, combined with a few other standard components such as earphones. Although it is currently a laboratory prototype consisting of a number of commercially available components and custom-built instrumentation used for research, the ultimate goal is for this design to be implemented as a portable, wearable device. All of these components function in concert but remain available to be modified to accommodate the requirements of the specific research question being addressed. Our initial work on developing the prototype VGHA and some preliminary tests of its performance are described in the following text.
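To make the coupling among these components concrete, the following Python sketch outlines the control loop implied by the description above: the eye tracker reports a gaze angle, an interface converts it into a steering command for the beam-forming array, and the array output is delivered to the earphones. The class names, method names, update rate, and stub behaviors are hypothetical stand-ins for the commercial and custom components of the prototype, not its actual software.

import time

class EyeTracker:            # hypothetical stand-in for the eye tracker
    def read_gaze_angle(self):
        return 0.0           # degrees; a real tracker returns the measured gaze

class Beamformer:            # hypothetical stand-in for the array plus interface
    def set_look_direction(self, angle_deg):
        self.ald_deg = angle_deg          # steer the acoustic look direction
    def read_output_block(self):
        return b""           # a real device returns spatially filtered audio

class Earphones:             # hypothetical stand-in for the headphone output
    def play(self, block):
        pass                 # diotic presentation of the beamformer output

def vgha_control_loop(tracker, beamformer, earphones, update_hz=30, n_updates=10):
    """Conceptual VGHA loop: gaze angle -> beam steering -> diotic playback."""
    for _ in range(n_updates):
        gaze_deg = tracker.read_gaze_angle()       # sense direction of gaze
        beamformer.set_look_direction(gaze_deg)    # aim the ALD at the gaze angle
        earphones.play(beamformer.read_output_block())
        time.sleep(1.0 / update_hz)

vgha_control_loop(EyeTracker(), Beamformer(), Earphones())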

Methods

The laboratory prototype of the VGHA and the test facility are illustrated schematically in Fig. 1. The various components, which work together in real time as a single apparatus, are located in two adjacent sound-attenuating booths with connections to the computers and additional instrumentation extending outside of the booths. This arrangement was chosen to provide the experimental control and flexibility necessary for conducting the acoustical and perceptual measurements reported in the following text. The human listener is seated in the smaller double-walled “listening booth” wearing the eye tracker and headphones. The beam-forming microphone array is placed on a KEMAR mannequin in the larger single-walled “sound field booth.” The loudspeakers are arranged along a semicircle at a distance of 5 ft from KEMAR. The listening booth also has a monitor, keyboard, and mouse. The signals from KEMAR's ears were used for measurements and some experimental control conditions when the array was removed (referred to as the “KEMAR” condition). Other electroacoustic equipment included a MOTU eight-channel amplifier (“sound device” in Fig. 1) for the microphone array and a standard omnidirectional B&K microphone. During the experiments, the signals from the microphone array, KEMAR, or the B&K microphone were routed directly to the listener's earphones (only the mic array connections are shown in Fig. 1). The microphone array was designed by Sensimetrics Corporation (Malden, MA). It consists of four pairs of cardioid microphones spaced 7.1 cm apart on a headband. Each microphone pair is oriented along the front-to-back axis, aimed toward the front, with the two microphones in a pair separated by 3 cm. The eye tracker was the “Mobile Eye XG” (ASL, Bedford, MA), which employs a camera pointing inward to track eye position and an outward-pointing scene camera. The gaze angle sensed by the eye tracker steered the acoustic look direction (ALD) of the array. A calibration routine related gaze angle to the x-y coordinates of the video display. A custom interface converted gaze angle into the control signals that aimed the ALD of the microphone array.
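The array-processing algorithm implemented in the Sensimetrics array is not specified here, so the following sketch illustrates only the general principle of gaze-steered directionality using a generic delay-and-sum beamformer. Treating each front/back cardioid pair as a single channel, along with the sample rate and the steering math, is an assumption for illustration rather than the prototype's actual processing; the 7.1 cm spacing is taken from the description above.

import numpy as np

SPEED_OF_SOUND = 343.0      # m/s
MIC_SPACING = 0.071         # 7.1 cm between microphone pairs (from the text)
NUM_MICS = 4                # one channel per front/back cardioid pair (assumed)
FS = 44100                  # sample rate in Hz (assumed)

# Microphone x positions along the headband, centered on the head.
mic_x = (np.arange(NUM_MICS) - (NUM_MICS - 1) / 2.0) * MIC_SPACING

def steering_delays(gaze_angle_deg):
    """Per-channel delays (s) that align a plane wave from the gaze direction."""
    theta = np.radians(gaze_angle_deg)
    return mic_x * np.sin(theta) / SPEED_OF_SOUND

def steer_beam(mic_signals, gaze_angle_deg):
    """Delay-and-sum the multichannel input toward the current gaze angle.

    mic_signals: array of shape (NUM_MICS, n_samples).
    Returns a single channel that would be presented diotically.
    """
    delays = steering_delays(gaze_angle_deg)
    n = mic_signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / FS)
    out = np.zeros(n)
    for ch in range(NUM_MICS):
        spectrum = np.fft.rfft(mic_signals[ch])
        spectrum *= np.exp(-2j * np.pi * freqs * delays[ch])  # fractional delay
        out += np.fft.irfft(spectrum, n)
    return out / NUM_MICS

# Example: steer toward +20 degrees on synthetic white-noise inputs.
rng = np.random.default_rng(0)
beam_output = steer_beam(rng.standard_normal((NUM_MICS, 4096)), 20.0)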

Figure 1. (Color online) A schematic illustration of the test facility and instrumentation configuration (conceptual, not to scale; see text) for the prototype VGHA.

Results

The spatial response of the microphone array was measured for a stationary noise source at 0° as the gaze angle was swept over a range of ±45°. Perceptually, the image of the noise is centered in the head due to the diotic input to the earphones from the array, with loudness increasing to a peak at 0° and then decreasing as the ALD moves from +45° through 0° to −45°. An octave-band analysis of the array output is plotted in Fig. 2A. As expected, spatial selectivity increases with increasing frequency. Figure 2B shows the temporal response of the system, measured while the subject visually followed a moving dot on the display. The figure shows a brief time slice of the movement of the dot on the screen (target angle), the subsequent movement of the eyes (gaze angle), and the orientation of the beam (ALD). This subject's eye movements followed the target fairly quickly, with lags of about 200–300 ms due primarily to subject response time. The lag between detection of the eye-gaze angle and orienting the ALD was less than about 30 ms.
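One way to quantify the lag between the gaze and ALD traces in Fig. 2B is to cross-correlate the two recorded signals and take the lag at the correlation peak. The sketch below illustrates this with synthetic traces; the sampling rate, the traces, and the analysis itself are assumptions, not the procedure used to obtain the reported values.

import numpy as np

def estimate_lag_ms(gaze_trace, ald_trace, fs_hz):
    """Return the delay (ms) by which ald_trace lags gaze_trace."""
    gaze = gaze_trace - np.mean(gaze_trace)
    ald = ald_trace - np.mean(ald_trace)
    xcorr = np.correlate(ald, gaze, mode="full")
    lag_samples = np.argmax(xcorr) - (len(gaze) - 1)
    return 1000.0 * lag_samples / fs_hz

# Synthetic example at a hypothetical 120-Hz logging rate.
fs = 120.0
t = np.arange(0, 5, 1.0 / fs)
gaze = 30.0 * np.sign(np.sin(2 * np.pi * 0.5 * t))   # saccade-like angle steps
ald = np.roll(gaze, 3)                               # beam follows 3 samples later
print(estimate_lag_ms(gaze, ald, fs))                # -> 25 ms for these traces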

Figure 2. (A) Octave-band analysis of the output of the beam-forming microphone array (referenced to the maximum level) for a broad-band noise source at 0° as the ALD is swept from −45° to +45°. (B) The measured temporal response of the VGHA. The temporally leading trace indicated the position of the visual marker (Target); the next trace (Gaze) was the subject's eye position while tracking the moving target; and the final trace (ALD) was the response of the array to changes in eye position.

Two masked speech identification experiments were conducted to provide an initial assessment of the performance of the VGHA under multisource conditions. The first experiment (Experiment 1) examined the benefit of the microphone array, while the second experiment (Experiment 2) assessed the dynamic performance of the eye-gaze control. A total of six normal-hearing (NH) subjects and two subjects with unilateral deafness (HL; Experiment 1) participated. The data represent the means of two estimates per subject per condition, with standard deviations computed across subject means. In Experiment 1, the task was to identify the words spoken by a target talker in the presence of two simultaneous maskers that were either different talkers uttering similar sentences or independent speech-modulated noises (e.g., Marrone et al., 2008). The target and speech maskers were drawn from a closed-set laboratory-designed corpus (“BU corpus”; Kidd et al., 2008a) that consists of five-word strings having the form <name> <verb> <number> <adjective> <object>. On every trial, the words were randomly selected from eight exemplars in each category; for example, “Sue found three red shoes.” The target talker was identified by the name “Sue,” with scoring based on correctly identifying the next four words in the sentence. The masker talkers uttered sentences comprising mutually exclusive selections from each category. For the two noise maskers, the broadband envelopes of masker-word selections were used to amplitude-modulate independent noise carriers. The target was presented from 0°. The maskers were either colocated with the target or spatially separated from it at ±90°. The maskers were presented at 55 dB SPL while the target was adaptively varied in level to estimate the 50%-correct point (the “threshold” target-to-masker ratio, T/M, specified at the source). Subjects listened either through the VGHA or through KEMAR's ears (KEMAR). Because the target location was constant, the ALD was fixed at 0°; listening through the VGHA in this configuration is therefore referred to as the “MicArray” condition.
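As a concrete illustration of the speech-modulated noise maskers described above (the broadband envelope of a masker sentence imposed on an independent noise carrier), the sketch below extracts a Hilbert envelope, smooths it with a low-pass filter, and multiplies it into Gaussian noise matched in RMS to the original speech. The envelope-extraction method, the 50-Hz smoothing cutoff, and the RMS matching are assumptions; the paper does not give these details.

import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def speech_modulated_noise(speech, fs, cutoff_hz=50.0, seed=None):
    """Return noise carrying the broadband envelope and RMS of `speech`."""
    envelope = np.abs(hilbert(speech))                      # broadband envelope
    b, a = butter(4, cutoff_hz / (fs / 2.0), btype="low")   # smooth the envelope
    envelope = np.clip(filtfilt(b, a, envelope), 0.0, None)
    rng = np.random.default_rng(seed)
    carrier = rng.standard_normal(len(speech))              # independent noise carrier
    masker = envelope * carrier
    # Match the RMS of the original speech so level calibration carries over.
    return masker * np.sqrt(np.mean(speech**2) / np.mean(masker**2))

# Tiny demo with a synthetic "speech" token (an amplitude-modulated tone).
fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
fake_speech = np.sin(2 * np.pi * 200 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))
masker = speech_modulated_noise(fake_speech, fs, seed=0)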

Figures 3A and 3B show group mean T/Ms for the NH and HL subjects, respectively. For the colocated speech maskers, T/Ms fell within a narrow range from 4 to 6 dB regardless of subject group or condition. The T/Ms for the colocated noise maskers were 10 to 15 dB lower than for speech. The main factor in this difference is the type of masking that is produced, i.e., informational for the highly similar speech maskers and energetic for noise (cf. Kidd et al., 2008b). In the spatially separated conditions, there were systematic differences in T/Ms. For the NH listeners, T/Ms for the speech maskers were −8.1 and −9.7 dB for the MicArray and KEMAR conditions, respectively, yielding spatial release from masking (SRM) values of 13.2 and 15.9 dB. Thresholds for the noise maskers were lower: −15.1 dB for MicArray and −10.4 dB for KEMAR. The corresponding SRMs were 8 and 1.6 dB, with MicArray thus yielding a larger advantage of source separation for noise. For the HL listeners, there was a marked difference in T/Ms between the MicArray and KEMAR conditions. For speech, the T/M in the separated condition was 7 dB for KEMAR and −8.6 dB for MicArray—a difference of more than 15 dB. For noise, threshold T/Ms were −4.4 dB for KEMAR and −14.6 dB for MicArray—a difference of 10.2 dB. The SRMs for KEMAR were both slightly negative, indicating no benefit of spatial separation of sources (cf. Marrone et al., 2008), while MicArray produced SRMs of 12.6 dB for speech maskers and 9.8 dB for noise. This large advantage of the MicArray condition for the HL listeners is not surprising given the underlying mechanisms: Monaural listening for the HL listeners should be about the same through KEMAR in the colocated and separated conditions for either type of masker, whereas lower thresholds due to improved spatial selectivity should be obtained with the MicArray for either type of masker in the spatially separated conditions.
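For reference, SRM is simply the colocated threshold T/M minus the spatially separated threshold T/M. The short example below applies that arithmetic to the NH noise-masker results; the colocated values used here are back-calculated from the reported separated thresholds and SRMs, so they are illustrative rather than directly reported numbers.

def spatial_release(colocated_tm_db, separated_tm_db):
    """Spatial release from masking (dB): colocated minus separated threshold."""
    return colocated_tm_db - separated_tm_db

print(round(spatial_release(-7.1, -15.1), 1))   # MicArray, noise maskers -> 8.0 dB
print(round(spatial_release(-8.8, -10.4), 1))   # KEMAR, noise maskers -> 1.6 dB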

Figure 3. (A) Group mean T/Ms at threshold (and standard deviations) for the NH subjects in Experiment 1. (B) The same for two unilateral HL listeners. (C) Group means (and standard deviations) in percentage correct for Experiment 2. Refer to legends and text for all three panels.

In Experiment 2, the listeners adjusted the ALD of the microphone array via the eye tracker/interface in response to changes in the location of the target source [Fig. 2B]. The speech task was adapted from that reported by Best et al. (2007), in which listeners used visual markers to select among concurrent target digits. As in that study, the stimuli were five mutually exclusive spoken digits (numbers 1–10 excluding 7; BU corpus). Here they were presented simultaneously, one from each of five spatially separated loudspeakers, in two-word strings (10 words in total). The digits were presented equal in level at 55 dB SPL. The only means of designating which digit was the target was a visual marker indicating its location. In the VGHA condition, when the subject moved their eyes to focus on a location on the monitor (head position oriented toward 0°), the ALD of the microphone array followed that motion, aiming the beam toward the corresponding azimuth in the sound field. In the KEMAR condition, the subject used binaural information to orient the focus of attention in azimuth. The five loudspeakers spanned ±30°, spaced 15° apart and centered at 0°, and were represented on the display by icons at the corresponding gaze angles. Each digit was spoken by a different randomly chosen talker. The visual cue for target location was presented 1 s before and during each digit. Thus selection of the target digit required positioning the ALD prior to the stimulus for both the first and second digits. Random guessing should yield 1/9 correct performance, while randomly choosing among the five stimuli presented on a given trial would yield 1/5 correct. A control condition was also tested using an omnidirectional microphone suspended at a position corresponding to the center of KEMAR's head (“Omni”). Group mean results for six NH listeners are shown in Fig. 3C. This is a difficult task, with five concurrent sources within a 60° range, and performance in the VGHA and KEMAR conditions was only around 40% correct. However, the similarity of performance between those two conditions—with both substantially better than Omni (20% correct)—suggests that the eye tracker control was successful in guiding the ALD within the 1 s intervals available.
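The relation between gaze direction and the Experiment 2 loudspeaker layout can be illustrated by snapping a measured gaze angle to the nearest of the five loudspeaker azimuths. Whether the interface steered the beam continuously or discretized it in this way is not stated in the text, so the snapping, like the function name, is an assumption for illustration only.

# Five loudspeakers spanning +/-30 degrees in 15-degree steps (from the text).
SPEAKER_AZIMUTHS_DEG = [-30, -15, 0, 15, 30]

def nearest_speaker_azimuth(gaze_angle_deg):
    """Return the loudspeaker azimuth closest to the current gaze angle."""
    return min(SPEAKER_AZIMUTHS_DEG, key=lambda az: abs(az - gaze_angle_deg))

print(nearest_speaker_azimuth(11.0))   # a gaze of 11 degrees aims the beam at +15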

Discussion

The following observations about the design and performance of the VGHA appear to be warranted: First, the beam-forming microphone array provided a high degree of spatial selectivity, particularly at high frequencies. This was apparent from the acoustic measurements obtained in a mildly reverberant sound field and from the perceptual measurements of masked speech identification. Although the performance of beam-formers depends on a variety of factors (e.g., Greenberg et al., 2003), the present configuration appeared to be sufficiently effective. The perceptual measurements revealed SRMs that were essentially the same as those found for normal binaural cues and much better than for monaural listening. This occurred for both energetic and informational masking. Although the subjects tested here had unilateral deafness, recent work has suggested that a similar advantage of highly directional amplification may occur for bilateral sensorineural losses (Bever et al., 2012). Second, the dynamic aspect of VGHA control appeared to be satisfactory for the limited conditions tested here. This conclusion was supported by the results of the final experiment, in which performance with the VGHA matched that found for normal selective attention using binaural cues. The temporal response of the VGHA deserves closer scrutiny because rapid control of source selection may be particularly important when the “target” source transitions quickly under highly uncertain conditions. It is possible that the close coordination of auditory and visual input (e.g., speech reading) may yield a special benefit in complex and dynamic conditions that the current experiments did not address. This general topic—the congruence of auditory and visual perception when the eyes move—has been studied for sound source localization (e.g., Vliegen et al., 2004; Razavi et al., 2007). This is an important perceptual problem that will require careful study in the future using the VGHA because the auditory image from the device resides at the interaural midline regardless of where the eyes are directed. Thus the translation between the auditory and visual “maps” of the external environment that occurs naturally through binaural cues likely must be recalibrated when listening diotically/monotically (e.g., as in unilateral deafness) or through a beam-forming microphone array as with the VGHA.

Acknowledgments

This work was supported by NIH/NIDCD Grant Nos. DC04545, DC00100, and DC04663. The authors are grateful for the contributions of Jayaganesh Swaminathan, Virginia Best, Lorraine Delhorne, Thomas von Wiegand, and Patrick Zurek to this project.

Footnotes

1. Similar approaches have been suggested by Hart et al. (2009) and Marzetta (2010).

References and links

1. Best, V., Ozmeral, E. J., and Shinn-Cunningham, B. G. (2007). “Visually-guided attention enhances target identification in a complex auditory scene,” J. Assoc. Res. Otolaryngol. 8, 294–304.
2. Bever, J., Diedesch, A. C., Lewis, M. S., and Gallun, F. J. (2012). “Assessing binaural hearing aid perception and performance in complex listening environments,” presented to the American Auditory Society, Scottsdale, AZ.
3. Greenberg, J. E., Desloge, J. G., and Zurek, P. M. (2003). “Evaluations of array processing algorithms for a headband hearing aid,” J. Acoust. Soc. Am. 113, 1646–1657. doi: 10.1121/1.1536624
4. Hart, J., Onceanu, D., Sohn, C., Wightman, D., and Vertegaal, R. (2009). “The attentive hearing aid: Eye selection of auditory sources for hearing impaired users,” in INTERACT 2009, Part I, LNCS 5726, edited by T. Gross, J. Gulliksen, P. Kotze, L. Oestreicher, P. Palanque, R. Oliveira Prates, and M. Winckler (Springer, Berlin), pp. 19–35.
5. Kidd, G., Jr., Best, V., and Mason, C. R. (2008a). “Listening to every other word: Examining the strength of linkage variables in forming streams of speech,” J. Acoust. Soc. Am. 124, 3793–3802. doi: 10.1121/1.2998980
6. Kidd, G., Jr., Mason, C. R., Richards, V. M., Gallun, F. J., and Durlach, N. I. (2008b). “Informational masking,” in Auditory Perception of Sound Sources, edited by W. A. Yost, A. N. Popper, and R. R. Fay (Springer Science+Business Media, New York), pp. 143–190.
7. Marrone, N. L., Mason, C. R., and Kidd, G., Jr. (2008). “Evaluating the benefit of hearing aids in solving the cocktail party problem,” Trends Amplif. 12, 300–315. doi: 10.1177/1084713808325880
8. Marzetta, T. L. (2010). “Self-steering directional hearing aid and method of operation thereof,” U.S. patent US2010/0074460.
9. Razavi, B., O'Neill, W. E., and Paige, G. D. (2007). “Auditory spatial perception dynamically realigns with changing eye position,” J. Neurosci. 27, 10249–10258. doi: 10.1523/JNEUROSCI.0938-07.2007
10. Shinn-Cunningham, B. G., and Best, V. (2008). “Selective attention in normal and impaired hearing,” Trends Amplif. 12, 283–299. doi: 10.1177/1084713808325306
11. Vliegen, J., Van Grootel, T. J., and Van Opstal, A. J. (2004). “Dynamic sound localization during rapid eye-head gaze shifts,” J. Neurosci. 24, 9291–9302. doi: 10.1523/JNEUROSCI.2671-04.2004

