A Visual or Tactile Signal Makes Auditory Speech Detection More Efficient by Reducing Uncertainty

Bosco S Tjan; Ewen Chao; Lynne E Bernstein

doi:10.1111/ejn.12471

Author manuscript; available in PMC: 2015 Apr 1.

Published in final edited form as: Eur J Neurosci. 2014 Jan 9;39(8):1323–1331. doi: 10.1111/ejn.12471


PMCID: PMC3997613 NIHMSID: NIHMS550437 PMID: 24400652

Abstract

Acoustic speech is easier to detect in noise when the talker can be seen. This finding could be explained by integration of multisensory inputs or refinement of auditory processing from visual guidance. In two experiments, we studied two-interval forced choice detection of an auditory “ba” in acoustic noise, paired with various visual and tactile stimuli that were identically presented in both observation intervals. Detection thresholds were reduced under the multisensory conditions versus the auditory-only condition, even though the visual and/or tactile stimuli alone could not inform the correct response. Results were analyzed relative to an ideal observer for which intrinsic (internal) noise and efficiency were independent contributors to detection sensitivity. Across experiments, intrinsic noise was unaffected by the multisensory stimuli, arguing against the merging (integrating) of multisensory inputs into a unitary speech signal; but sampling efficiency was increased to varying degrees, supporting refinement of knowledge about the auditory stimulus. The steepness of the psychometric functions decreased with increasing sampling efficiency, suggesting that the “task-irrelevant” visual and tactile stimuli reduced uncertainty about the acoustic signal. Visible speech was not superior for enhancing auditory speech detection. Our results reject multisensory neuronal integration and speech-specific neural processing as explanations for enhanced auditory speech detection under noisy conditions. Instead, our results support a more rudimentary form of multisensory interaction – the otherwise task-irrelevant sensory systems inform the auditory system about when to listen.

Keywords: speech detection, multisensory enhancement, ideal-observer analysis

Introduction

Stimulation to one sensory system can enhance perception of stimuli presented to a different sensory system. For example, auditory stimuli can enhance the perceived intensity of light (B. E. Stein, London, Wilkinson, & Price, 1996), and conversely, light can enhance the perceived intensity of acoustic white noise (Odgaard, Arieh, & Marks, 2004). Vibrotactile pulses aid in the detection of tones and increase their perceived loudness (Gillmeister & Eimer, 2007; Ro, Hsu, Yasar, Elmore, & Beauchamp, 2009; Schurmann, Caetano, Jousmaki, & Hari, 2004). Under noisy acoustic conditions, seeing a talker can lower the auditory speech detection threshold (Bernstein, Auer, & Takayanagi, 2004; Eskelund, Tuomainen, & Andersen, 2011; Grant, 2001; Grant & Seitz, 2000; Kim & Davis, 2004; Schwartz, Berthommier, & Savariaux, 2004). The speech detection enhancement could be attributable to audiovisual integration that leads to an amodal “integrated neural signal [that] is different (e.g., bigger, smaller, having a different temporal evolution)” (B. E. Stein et al., 2010). Alternatively, it could be due to visual guidance for listening to the speech in noise (Nahum, Nelken, & Ahissar, 2008).

We investigated speech detection enhancement with respect to an ideal observer, which is a theoretically optimal detector (Green & Swets, 1966; Pelli & Farell, 1999). We used the ideal observer as a standard yardstick to quantify system-level changes with and without multisensory inputs. The ideal observer has full knowledge of the acoustic stimulus to be detected. Its performance is limited only by noise in the stimulus and uncertainty inherent in the task (e.g., uncertainty due to multiple stimuli for the same response).

Ideal-observer model

An ideal-observer model can be used to quantify multisensory facilitation using two orthogonal factors: (1) a non-acoustic stimulus could reduce the internal noise of the perceiver (e.g., a visual speech stimulus might recruit an auditory speech-specific process that may be less noisy than a generic sound-detection process); and (2) it could facilitate the extraction of the acoustic signal from the noisy input by appropriately focusing perceptual resources on the relevant information in the signal (e.g., by providing a temporal marker or by correlating with the structure of the auditory stimulus). Multisensory processing could, at least in principle, worsen one factor while improving the other (e.g., integrating a noisy but informative non-acoustic signal with the task-relevant acoustic signal could add noise but also increase efficiency).

An additive noise ideal observer model (Green & Swets, 1966; Pelli & Farell, 1999) explicitly represents and dissociates these two factors. Discriminability between a noisy signal and noise alone, as measured in d’, is expressed as:

$$d'^{2} = \frac{\eta E}{N + N_{eq}}, \tag{1}$$

where E is signal energy, N is the spectral density of the external noise in the stimulus, N_eq is the additive noise in the perceptual system, expressed as an equivalent noise source at the input, and η is the sampling efficiency of the perceptual system. For an ideal observer, η = 1.0 and N_eq = 0. For humans, N_eq > 0, and η < 1.
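To make the roles of the two parameters concrete, Equation 1 can be evaluated directly. The sketch below is illustrative only: the function name `d_prime` and all numerical values are assumptions chosen for the example (arbitrary linear units), not values from the study.

```python
import math

def d_prime(E, N, N_eq, eta):
    """Equation 1: d'^2 = eta * E / (N + N_eq), all quantities in linear units."""
    return math.sqrt(eta * E / (N + N_eq))

# Made-up values for illustration (not data from the experiments).
E, N = 100.0, 1.0

# Ideal observer: no intrinsic noise, perfect sampling efficiency.
ideal = d_prime(E, N, N_eq=0.0, eta=1.0)

# Hypothetical human observer: some intrinsic noise, low efficiency.
human = d_prime(E, N, N_eq=0.25, eta=0.03)

print(f"ideal d' = {ideal:.2f}, hypothetical human d' = {human:.2f}")
```

Lowering N_eq matters most when the external noise N is small, whereas raising η scales d’² by the same factor at every noise level; this is why the two parameters can be separated by varying N.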

Intuitively, the internal or “intrinsic” noise, expressed as the equivalent input noise, is the perceptual system’s precision for signal transduction and sensory measurement. For a given sensory-perceptual system, different neural pathways might operate with different amounts of intrinsic noise. The measurable intrinsic noise of a human observer depends on which of the subsystems are recruited, and how/whether the signals are combined. For example, if an auditory speech-specific subsystem had lower intrinsic noise than a general-purpose auditory system, and if a visual stimulus led to an increased utilization of the hypothetical lower-noise speech subsystem, intrinsic noise reduction should be observed. Alternatively, if the visual signal is combined with the auditory signal to form an amodal speech signal (multisensory integration), the noise in the visual system should contribute to the observed intrinsic noise.

Sampling efficiency (sometimes called statistical efficiency or calculation efficiency) is the fraction of the noise-limited stimulus information that a perceptual system utilizes to perform a task. For example, a system that uses the visual stimulus onset time to attend synchronously to the auditory input will exhibit a higher sampling efficiency for detecting the auditory stimulus than one that ignores that information. In general, the more a system uses the spatiotemporal properties specific to the stimulus, the higher should be its sampling efficiency. We expect efficiency to increase if the perceiver can use knowledge about visual and/or tactile stimuli to pick out the auditory signal from its noise background.

Under the assumption of an additive noise ideal observer model, changes in intrinsic noise (N_eq) and sampling efficiency (η) are theoretically independent. As Equation 1 suggests, these parameters can be empirically determined for both unisensory and multisensory conditions by adding external noise (N) to the signal. To minimize effects of nonlinearity in the perceptual system associated with performance level, measurements are typically made at a constant d’. The ideal observer model of Equation 1 can then be rewritten to make explicit that at a constant d’, the signal energy (E) required to achieve the specific d’ is linearly proportional to the total spectral density of the internal and external noise (N + N_eq), and the constant of proportionality is inversely proportional to sampling efficiency:

$$E = \frac{d'^{2}}{\eta}\,(N + N_{eq}). \tag{2}$$

Hence, an experiment that measures the threshold signal energy as a function of external noise at a constant d’ provides a straightforward means to estimate intrinsic noise (N_eq) and sampling efficiency (η).¹
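The estimation step can be sketched as a straight-line fit: Equation 2 says threshold energy is linear in N, with slope d’²/η and intercept (d’²/η)·N_eq. The code below is a minimal sketch under stated assumptions. It uses ordinary least squares in linear units on synthetic, noiseless data; the function name `fit_ideal_observer` and the "true" parameter values are hypothetical, and the fitting procedure actually used for the reported estimates may differ (for example, by fitting in log units).

```python
import numpy as np

D_PRIME = 1.16  # criterion d' targeted by the adaptive procedure in this study

def fit_ideal_observer(N, E, d_prime=D_PRIME):
    """Fit E = slope * N + intercept and recover (eta, N_eq) via Equation 2."""
    slope, intercept = np.polyfit(N, E, 1)
    eta = d_prime ** 2 / slope   # slope = d'^2 / eta
    N_eq = intercept / slope     # intercept = slope * N_eq
    return eta, N_eq

# Synthetic thresholds generated from assumed parameters (not real data).
true_eta, true_N_eq = 0.03, 0.5
N = np.array([0.0, 1.0, 10.0, 100.0])            # external noise, linear units
E = (D_PRIME ** 2 / true_eta) * (N + true_N_eq)  # Equation 2

eta_hat, N_eq_hat = fit_ideal_observer(N, E)
print(f"eta = {eta_hat:.4f}, N_eq = {N_eq_hat:.4f}")
```

With real thresholds the high-noise points dominate a linear-unit fit, so a weighted or log-domain fit may be preferable; that choice is left open here.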

The additive noise ideal observer model can account for a broad range of tasks, from simple signal detection to object identification (Green & Swets, 1966; Legge, Kersten, & Burgess, 1987; Pelli & Farell, 1999; Tjan, Braje, Legge, & Kersten, 1995). Furthermore, numerous studies have used similar observer models to study the effects of attention and perceptual learning on human performance, and in doing so, they demonstrated that efficiency and intrinsic noise are empirically dissociable (Gold, Bennett, & Sekuler, 1999; Lu & Dosher, 1998; Sun, Chung, & Tjan, 2010; see Lu & Dosher, 2008, for an extensive review and elaborated theoretical analysis).

The current study

We used the ideal-observer model of Equations 1 and 2 to investigate the data from two experiments. In Experiment 1, speech detection thresholds were measured at four external noise levels, including a no-noise condition, while holding d’ constant. Stimulus conditions were audio-only (AO), audio-tactile (AT), audiovisual with a stationary rectangle (AVR), and audiovisual speech (AVS). The tactile stimulus extended generalizability to an additional sensory system. Having demonstrated that intrinsic auditory noise does not change across different multisensory stimuli, in Experiment 2, a more sensitive paradigm was used to examine further whether visible speech stimuli confer significant additional advantage for detection. The four conditions from Experiment 1 and the combination of the visual rectangle with the tactile stimulus (AVRT) and the visual speech with the tactile stimulus (AVST) were presented. Multisensory integration and speech-specific processing are ruled out as explanations for the auditory speech detection enhancement with audiovisual speech. The results point to the ability to use knowledge about visual and/or tactile stimuli to pick out the auditory signal from its noise background.

Experiment 1: Efficiency and Intrinsic Noise

In Experiment 1, speech detection thresholds were measured at four external noise levels, including a no-noise condition, while holding d’ constant. We wanted to determine if the ideal-observer model could account for the data, and if so, how efficiency and intrinsic noise might vary across multisensory conditions. Stimulus conditions were audio-only (AO), audio-tactile (AT), audiovisual with a stationary rectangle (AVR), and audiovisual speech (AVS).

Materials and Methods

Participants

We tested four participants (ages 19-37 years, mean 25; 1 male) with American English as their first language, normal or corrected-to-normal vision, normal pure tone thresholds for ten standard frequencies from 250Hz to 8000Hz (ANSI, S3.6-2004), and normal composite scores on the Hearing in Noise Test (HINT) (Nilsson, Soli, & Sullivan, 1994). The participants had average or better lipreading ability (Auer & Bernstein, 2007). They gave informed consent and were paid $12/hr for their participation. Testing took place over 4-6 sessions (mean 5.5), distributed over 8-71 days (mean 33). Human subject testing was approved by the Institutional Review Board of the St. Vincent’s Hospital, Los Angeles, California, which oversees human subjects research at House Ear Institute, Los Angeles, California where the data were collected. The experiments were undertaken with the understanding and written consent of each subject, and the study conforms to the Code of Ethics of the World Medical Association (Declaration of Helsinki), printed in the British Medical Journal (18 July 1964).

Stimuli

Auditory

The speech stimulus was a video-recorded “ba” spoken by a female (Bernstein et al., 2004). The 543-ms acoustic syllable was adaptively adjusted in sound level during testing (see below). White noise was presented at 0, 40, 50, and 60 dB SPL. A large (90-sec) file of computer-generated acoustic white noise was sampled randomly for each trial, extending across both intervals and between them, at a constant level throughout a run. The acoustic stimulus and the white noise were mixed using a calibrated audio system, including a custom attenuator and were presented through calibrated ER-3A insert earphones (Etymotic Research Inc., external noise exclusion 30 dB SPL).

Visual

The visual stimuli included the corresponding video of the talker as she pronounced the “ba” syllable (in AVS and AVST conditions) and a static rectangular image (AVR, AVRT) (Figure 1). The visual speech stimulus movement onset coincided with the acoustic syllable onset in the signal-present interval (Figure 2). The visible syllable was longer than the acoustic signal, as is often true with isolated audiovisual speech syllables. To equate for the contrast energy in the visual stimuli, the non-speech visual stimulus was a static rectangle filled with pixels randomly selected from the rectangular region of the visual speech stimulus including the face (Figure 1). The viewing distance was 1m. The face and the rectangle stimuli subtended 6.0 degrees of visual angle horizontally and 8.2 degrees vertically. A fixation cross during AO trials was presented continuously against a grey background and subtended 0.72 degrees of visual angle.

Figure 2. Stimulus timing diagram. Each trial comprised two temporal intervals, with the target acoustic “ba” presented in only one of the intervals (E). All other stimuli for a particular condition were repeated in both intervals. The horizontal extent of the stimuli in the figure corresponds to their temporal interval. The tactile stimulus, the visual rectangle, and the acoustic syllable had the same duration (534 ms). In the AVS condition, the talker’s face appeared at the beginning of the interval, but the mouth did not move until the acoustic signal onset (D). Dots on either end of the timeline (E) indicate the frames of temporal jitter; the total jitter around a particular interval was always 167 ms (5 frames). F indicates the noise duration. Up to six stimulus conditions were tested in this study: audio-only (AO) (E), audio-tactile (AT) (A and E), audiovisual with a stationary rectangle (AVR) (C and E), AVR with tactile (AVRT) (A, C, and E), audio with visual speech (AVS) (D and E), and AVS with tactile (AVST) (A, D, and E). In the AO (E) and AT (A and E) conditions, a fixation cross was displayed for the entire interval (B).

Tactile

A Bruel & Kjaer 4810 minishaker mounted on a wooden stand that incorporated an armrest delivered a vibration stimulus to the right index fingertip. The stimulus was a 200-Hz haversine pulse train (i.e., pulse duration of 2.5 ms) of total duration 534 ms, with the same onset and offset as the acoustic “ba,” presented via a 0.25-in diameter circular probe. A custom stimulus delivery system incorporated compensation for finger loading. The minishaker was encased in a foam-lined box to attenuate acoustical emissions, and participants wore earmuffs (Bilsom Comfort model #2315, NRR 25 dB) throughout testing to guard against detecting acoustic radiation from the vibrating device, although no evidence suggested that vibration was detectable in the presence of the acoustic masking noise. The tactile intensity level (7.2 micron peak displacement) was set to the average level at which the stimulus was judged to be equal in intensity to the visual rectangle, following an informal cross-modal intensity matching experiment.

Timing

Synchronized onsets between auditory and visual stimuli, and between auditory and tactile stimuli, were permanently established using a pre-recorded stimulus DVD. Figure 2 illustrates the timing within a trial, during which the auditory “ba” stimulus was randomly presented in only one of two observation intervals. The visual speech stimulus began with freeze frames, but its motion onset coincided with the acoustic onset. The onsets of the visual rectangle and the tactile stimulus also coincided with the acoustic onset. A total jitter of 167 ms was randomly inserted at the onset and offset of the two observation intervals such that all trials were the same duration. In the AO condition, a fixation cross was presented for the entire 2135 ms of each observation interval, which was the total duration of the video speech, including freeze frames. Uniform gray frames of 167 ms duration separated observation intervals in addition to the jitter.

Procedure

A two-interval forced-choice paradigm with adaptive three-down one-up staircase algorithm (Levitt, 1971) was used to obtain 79.4% (d’=1.16 for a 2IAFC design) detection thresholds. Within each testing block, stimulus condition and noise level were fixed, and the “ba” stimulus amplitude was varied. The adaptive step sizes were as follows: At the beginning of the block, 3-dB steps were used until the first reversal following an error; then 2-dB steps until the third reversal; 1-dB until the fifth reversal; 0.5 dB until the eighth reversal; and 0.1 dB for the final four reversals. Thresholds were the arithmetic mean in dB units of all 12 reversal points. In the noise conditions, the initial SNR was −6 dB. In a no-added-noise (quiet) condition, the initial speech level was 10 dB SPL. Two subsequent blocks in each type of condition were initiated with SNRs of 6 dB above the threshold from the previous corresponding stimulus block.
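The correspondence between the staircase rule, the 79.4% target, and d’ = 1.16 can be checked numerically. The sketch below assumes an unbiased observer, for whom d’ = √2 · z(proportion correct) in a two-interval task, and it encodes the step-size schedule as a function of the number of reversals completed so far, which is one reading of the description above. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def staircase_target(n_down=3):
    """Proportion correct at which an n-down one-up staircase converges
    (Levitt, 1971): p solves p ** n_down = 0.5."""
    return 0.5 ** (1.0 / n_down)

def d_prime_2ifc(p_correct):
    """d' for an unbiased observer in two-interval forced choice."""
    return math.sqrt(2.0) * NormalDist().inv_cdf(p_correct)

def step_size_db(reversals_so_far):
    """Step size in dB given the number of reversals already completed
    (an interpretation of the schedule described in the text)."""
    if reversals_so_far < 1:
        return 3.0
    if reversals_so_far < 3:
        return 2.0
    if reversals_so_far < 5:
        return 1.0
    if reversals_so_far < 8:
        return 0.5
    return 0.1  # final four reversals (9th through 12th)

p = staircase_target(3)
d = d_prime_2ifc(p)
print(f"target proportion correct = {p:.3f}, d' = {d:.2f}")
```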

Participants received 15 practice trials per condition and then executed a variable number of testing blocks per session. The conditions were pseudo-randomly ordered and each condition was presented at every noise level once before any were repeated, resulting in 48 blocks (3 repetitions × 4 conditions × 4 noise levels). Because the paradigm used adaptive testing, the number of test trials per participant varied somewhat, averaging 65 trials per block.

Participants were told to attempt to detect the auditory stimuli and keep their gaze on the video monitor. They were not explicitly told to attend to the tactile stimuli. It was obvious to the participants that the visual and tactile stimuli were presented in both the signal-present and signal-absent intervals. Participants were instructed to respond as quickly and as accurately as possible when they detected the “ba” auditory stimulus. Responses were made using a two-button box with each button assigned to one of the stimulus intervals. Participants were free to respond during the first interval if they detected the stimulus there. Response times were recorded but not analyzed. LEDs affixed to the sides of the monitor and on the button box indicated the correct response after each response. Testing took place in a double-walled sound booth.

Results and discussion

Each participant contributed 16 thresholds (4 noise levels × 4 stimulus conditions) averaged over 3 blocks (about 200 trials per threshold). In a repeated measures ANOVA, stimulus type [F (3, 9) = 36.24, p < .0001] and noise level [F (3, 9) = 2015, p < 10⁻¹²] had strong effects on signal thresholds without any significant interaction. Post-hoc pairwise contrasts, corrected for multiple comparisons, revealed the order of signal threshold magnitudes to be AO (27.9 dB SPL) > (AT ≈ AVR) > AVS (25.6 dB SPL). That is, all multisensory conditions improved speech signal detection, with visual speech providing the largest gain.

The ideal-observer model (Equation 2) provided a good fit to the data of each participant, accounting for 99% of the variance (Figure 3a). Intrinsic noise (N_eq), efficiency (η), and the standard errors of the estimates were obtained by fitting Equation 2 to the data (Figures 3b-c). Intrinsic noise did not vary across stimulus condition [F (3, 9) = 1.003, p = .435]. The average level of intrinsic noise was equivalent to an input noise of 15.8 dB SPL, which is very low compared to the external noise.

Figure 3. Ideal-observer analysis of speech detection in noise (Experiment 1). (a) Energy (E) of the speech signal is plotted against the power spectral density (N) of the external noise in log units for each participant. The ideal-observer model of Equation 2 provides an excellent fit to the individual data (R² > 0.99). Equivalent input noise (intrinsic noise) (b) and sampling efficiency (c) were estimated from the fits. Error bars represent ± one standard error of the estimates. Multisensory conditions had no effect on intrinsic noise but significantly improved efficiency. Efficiencies were AO < (AT, AVR) < AVS, with mean intrinsic noise level estimated at 15.8 dB SPL.

In contrast, efficiency was reliably affected by the stimulus condition [F (3, 9) = 23.42, p < .001]. Post-hoc pairwise comparisons showed that efficiency was AO (2.2%) < (AT ≈ AVR) < AVS (3.6%). Efficiency averaged across conditions was 2.8%. Thus, no evidence was obtained for multisensory integration (i.e., either reduced or increased internal noise), but there was reliable evidence for increased auditory efficiency from visual and tactile stimuli.
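As a rough consistency check, Equation 2 implies that when intrinsic noise is unchanged, the threshold energy at a fixed d’ scales as 1/η, so a change in efficiency should translate into a threshold shift of 10·log10 of the efficiency ratio. The sketch below applies this to the reported AO and AVS means. Exact agreement is not expected, because the reported values are averages over participants and noise levels; the variable names are illustrative.

```python
import math

# Mean efficiencies and thresholds reported for Experiment 1.
eta_ao, eta_avs = 0.022, 0.036
threshold_ao_db, threshold_avs_db = 27.9, 25.6

# Equation 2 with N_eq unchanged: E_AO / E_AVS = eta_AVS / eta_AO.
predicted_gain_db = 10 * math.log10(eta_avs / eta_ao)
observed_gain_db = threshold_ao_db - threshold_avs_db

print(f"predicted {predicted_gain_db:.2f} dB, observed {observed_gain_db:.2f} dB")
```

The predicted shift of roughly 2.1 dB is close to the observed 2.3 dB difference in mean thresholds, in line with the conclusion that the multisensory benefit is carried by efficiency rather than by intrinsic noise.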

Experiment 2: Efficiency and Linearity of Speech Detection

Experiment 1 showed that the equivalent input noise was very low relative to external noise (equivalent to an external noise at 15.8 dB SPL). Equation 1 implies that d’ ≈ √(ηE/N) whenever external noise is sufficiently high relative to intrinsic noise (N >> N_eq). That is, d’ measured at high external noise is unaffected by intrinsic noise and can therefore be used as a surrogate for efficiency. This fact was used to obtain a more precise assessment of multisensory facilitation, particularly, the relative effect of visual speech. It was also used to characterize any nonlinearity between d’ and SNR, which provides additional insight about the basis for multisensory enhancement.
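The size of the error introduced by dropping N_eq can be estimated from Equation 1: the ratio of the approximate to the exact d’ is √(1 + N_eq/N). The sketch below evaluates this ratio for the external noise levels of Experiment 1. It assumes that the 15.8 dB SPL equivalent noise and the external noise levels are expressed over the same bandwidth, so that their dB difference gives the ratio of spectral densities; the function name is hypothetical.

```python
import math

N_EQ_DB = 15.8  # mean equivalent input noise estimated in Experiment 1

def approx_to_exact_ratio(noise_db, n_eq_db=N_EQ_DB):
    """Ratio sqrt(eta*E/N) / sqrt(eta*E/(N + N_eq)) = sqrt(1 + N_eq/N)."""
    n_eq_over_n = 10 ** ((n_eq_db - noise_db) / 10)
    return math.sqrt(1 + n_eq_over_n)

for level_db in (40, 50, 60):  # external noise levels used in Experiment 1
    ratio = approx_to_exact_ratio(level_db)
    print(f"{level_db} dB SPL: approximation overestimates d' by {100 * (ratio - 1):.3f}%")
```

Under these assumptions the overestimate is below 1% at all three levels, which supports treating d’ at high external noise as a surrogate for efficiency.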

Experiment 2 was carried out in two phases. In the preliminary phase, SNR thresholds were obtained adaptively at d’=1.16 with the signal fixed at 55 dB SPL and external noise varied. The relatively high signal intensity was used to ensure that performance would not be limited by the weak intrinsic noise. In the main phase, a common range of SNRs, chosen based on the results of the preliminary phase and applicable to all participants, was used to measure d’ values by the method of constant stimuli (i.e., with both noise and signal fixed within blocks). Two additional conditions were tested in Experiment 2, AVRT and AVST, for which tactile stimuli were presented synchronously with the AVR and AVS stimuli, respectively.