Time-forward speech intelligibility in time-reversed rooms

Laricia Longworth-Reed; Eugene Brandewie; Pavel Zahorik

2008 Dec 22;125(1):13–19. doi: 10.1121/1.3040024

PMCID: PMC2677280 PMID: 19173377

Abstract

The effects of time-reversed room acoustics on word recognition abilities were examined using virtual auditory space techniques, which allowed for temporal manipulation of the room acoustics independent of the speech source signals. Two acoustical conditions were tested: one in which room acoustics were simulated in a realistic time-forward fashion and one in which the room acoustics were reversed in time, causing reverberation and acoustic reflections to precede the direct-path energy. Significant decreases in speech intelligibility—from 89% on average to less than 25%—were observed between the time-forward and time-reversed rooms. This result is not predictable using standard methods for estimating speech intelligibility based on the modulation transfer function of the room. It may instead be due to increased degradation of onset information in the speech signals when room acoustics are time-reversed.

Introduction

Classic demonstrations have revealed that backwards playback of speech recordings in reverberant environments results in greatly increased audibility of the acoustic reflections and reverberation (Houtsma et al., 1987). Unfortunately, this time-reversal manipulation also causes the speech source signal to become unintelligible, since it too is reversed in time. Objective measures of speech intelligibility would therefore not be valid measures of the time-reversal effects of room acoustics in such demonstrations. Here, an improved method using virtual auditory space (VAS) techniques to manipulate temporal aspects of (simulated) room acoustics independent of the speech source signals is described and tested. Speech intelligibility was measured for normal time-forward speech signals in two acoustical conditions: one in which room acoustics were simulated in a realistic time-forward fashion; and one in which the room acoustics were reversed in time, causing reverberation and acoustic reflections to precede the direct-path energy. This manipulation is interesting because it represents an extreme listening situation with which listeners have likely had very little experience. It also does not affect the modulation transfer function (MTF) of the room, which describes the way the room modifies the temporal energy distribution of a sound signal and has been shown to be highly predictive of speech intelligibility in rooms (Houtgast and Steeneken, 1973, 1985).

Of additional interest is the extent to which listeners might adapt to realistic “plausible” listening environments (Clifton et al., 1994; Hartmann, 1997) and not to “implausible” time-reversed environments, as well as the impact of realistic binaural input on these effects. Results by Watkins (2005b) indicate that prior listening exposure in reverberant environments returns performance in a categorical speech perception task to levels consistent with those observed in much less reverberant listening situations, presumably through a process of perceptual adaptation to the listening environment. The adaptation is decreased when the acoustics of the prior listening context are made implausible through time-reversal (Watkins, 2005b, 2005a), and does not appear to depend on binaural input (Watkins, 2005b), although binaural listening is in general known to facilitate speech intelligibility in reverberant environments (Nabelek and Robinson, 1982).

Methods

Listeners

Ten listeners (four female) ages 19–51 years participated in the experiment. All had normal hearing, as verified by standard (ANSI-S3.9, 1989) audiometric screening.

Stimuli

VAS techniques were used to simulate a reverberant room with dimensions of 5.7×4.3×2.6 m and broadband reverberation time (T₆₀) of approximately 1.5 s. To implement the simulation, a simple model of a binaural room impulse response (BRIR) was constructed using an image-model (Allen and Berkley, 1979) to simulate early reflections and a statistical model of the late reverberant energy. The direct-path and 500 early reflections were all spatially rendered using head-related transfer functions measured from 613 spatial locations surrounding a single representative listener in an anechoic chamber, using methods fundamentally similar to those described in detail by Wightman and Kistler (1989). This listener did not participate in any subsequent behavioral testing. Each reflection was attenuated based on path-length, an average surface absorption coefficient, and the reflection order. Diffuse late reverberation was simulated in the BRIR using independent Gaussian noise samples for each ear shaped by decay functions derived from the Sabine equation (Sabine, 1922) in each of six octave bands ranging from 125 to 4000 Hz. In general, this simulation technique is similar to those implemented in other studies (Heinz, 1993; Naylor, 1993) and has been shown to produce simulations that are perceptually similar to those derived from measurements in a real room (Zahorik, 2004).
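
The late-reverberation stage described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses a single broadband decay rather than six octave bands, omits the image-model early reflections and HRTF rendering entirely, and all function names are hypothetical. It also works through the Sabine equation (metric form, T60 = 0.161 V / A) for the stated room dimensions.

```python
import numpy as np

SABINE_CONSTANT = 0.161  # s/m, metric form of the Sabine equation


def sabine_absorption_area(volume_m3, t60_s):
    """Total absorption area A (m^2) implied by T60 = 0.161 * V / A."""
    return SABINE_CONSTANT * volume_m3 / t60_s


def decay_envelope(t60_s, t):
    """Amplitude envelope that falls by 60 dB (a factor of 1000) after T60."""
    return 10.0 ** (-3.0 * np.asarray(t) / t60_s)


def late_reverb_tail(t60_s, fs, duration_s, rng):
    """Two-channel tail: independent Gaussian noise per ear under one decay.

    Assumption: one broadband decay; the source shapes six octave bands.
    """
    t = np.arange(int(round(duration_s * fs))) / fs
    noise = rng.standard_normal((t.size, 2))
    return noise * decay_envelope(t60_s, t)[:, None]


# Room from the text: 5.7 x 4.3 x 2.6 m, T60 of about 1.5 s.
length, width, height = 5.7, 4.3, 2.6
volume = length * width * height
surface = 2 * (length * width + length * height + width * height)
absorption_area = sabine_absorption_area(volume, 1.5)
mean_absorption = absorption_area / surface  # average absorption coefficient

tail = late_reverb_tail(t60_s=1.5, fs=44100, duration_s=2.0,
                        rng=np.random.default_rng(0))
```

Under these assumptions the implied average absorption coefficient is small (a few percent), which is consistent with a fairly reverberant room of this size.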

Time-forward and time-reversed BRIRs were convolved (using MATLAB® software) with time-forward sentences (adult talker) from the Hearing in Noise Test (HINT) sentence corpus (Nilsson et al., 1994). The simulated spatial location of the speech source was 1.4 m in front of the listener at ear level, in the approximate center of the virtual room. A graphical description of this stimulus generation procedure is shown in Fig. 1. Multimedia examples of the time-forward and time-reversed stimuli shown in Fig. 1 are available (Mm1 and Mm2), along with analogous stimuli generated using a different sentence (Mm3 and Mm4).
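
The core manipulation, reversing only the room response while leaving the speech time-forward, can be sketched as below. The source used MATLAB; this is a Python sketch under the assumption that the BRIR is stored as a samples-by-2 array, and the function name and toy data are hypothetical.

```python
import numpy as np


def render_in_room(speech, brir, reverse_room=False):
    """Convolve time-forward speech with a two-ear BRIR (samples x 2).

    With reverse_room=True only the impulse response is flipped in time;
    the speech signal itself is left untouched.
    """
    h = brir[::-1] if reverse_room else brir
    return np.column_stack([np.convolve(speech, h[:, ear]) for ear in range(2)])


# Toy data (hypothetical): a single click as "speech" and a 3-sample decaying BRIR.
click = np.array([1.0])
toy_brir = np.array([[1.0, 1.0], [0.5, 0.4], [0.25, 0.2]])

forward = render_in_room(click, toy_brir)
reversed_room = render_in_room(click, toy_brir, reverse_room=True)
```

In the forward rendering the strongest (direct-path) sample comes first and the decay follows; in the reversed rendering the decay builds up and the direct-path sample arrives last.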

Fig. 1. Stimulus generation scheme for time-forward and time-reversed binaural room impulse responses (left ear shown). Results displayed are from Mm1 and Mm2.

Mm1. Example of a sentence (“The tub faucet is leaking”) convolved with a time-forward BRIR. This is a file of type “wav” (856 KB).

Mm2. Example of a sentence (“The tub faucet is leaking”) convolved with a time-reversed BRIR. This is a file of type “wav” (856 KB).

Mm3. Example of a sentence (“The truck drove up the road”) convolved with a time-forward BRIR. This is a file of type “wav” (793 KB).

Mm4. Example of a sentence (“The truck drove up the road”) convolved with a time-reversed BRIR. This is a file of type “wav” (793 KB).

In addition to binaural simulations that would be representative of those received in a real-room listening situation, diotic presentations—in which the simulated results for each ear were summed and delivered to both ears—were tested to determine the extent to which natural binaural information is important for speech recognition abilities in the simulated listening environments.
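
A diotic stimulus of the kind described here can be derived from a binaural one by summing the two ear signals and sending the sum to both ears. The sketch below assumes a samples-by-2 array layout and a hypothetical function name; any level normalization applied after summing is not described in the text and is left out.

```python
import numpy as np


def to_diotic(binaural):
    """Sum left and right channels (samples x 2) and send the sum to both ears."""
    mono = binaural.sum(axis=1)
    return np.column_stack([mono, mono])


binaural = np.array([[0.2, 0.1], [-0.3, 0.4], [0.0, 0.5]])  # toy two-ear signal
diotic = to_diotic(binaural)
```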

All stimuli were presented over equalized Beyerdynamic DT-990-Pro headphones using a Digital Audio Labs CardDeluxe for D/A conversion (24-bit, 44.1 kHz) at a moderate level (approximately 65 dB SPL) within a double-walled sound isolation chamber (Acoustic Systems).

Design and procedure

Listeners were tested in both binaural and diotic presentation conditions of the two room acoustic conditions: time-forward and time-reversed. The four resulting conditions were run in blocks, with order counterbalanced. Subjects were presented with each sentence stimulus one time and asked to repeat as many words in the sentence as possible. Forty sentences were presented in two 20-sentence blocks for each condition. Sentence recognition scores (proportion of correct words within a given sentence) were computed for each sentence based on the subject’s responses. Subjects also completed an initial baseline condition in which ten sentences were presented diotically in the absence of any room acoustic processing.

Results

MTF analyses

MTFs were computed for both time-forward and time-reversed BRIRs in octave bands with center frequencies ranging from 125 Hz to 8 kHz, following methods described by Steeneken and Houtgast (1980). Figure 2 displays MTF results for time-forward (filled symbols) and time-reversed (open symbols) impulse responses in the 1-octave band centered at 1 kHz. The solid curve represents a theoretical MTF (Houtgast and Steeneken, 1985) for a room with 1.5 s reverberation time (T₆₀), which has a low-pass magnitude characteristic. Close agreement may be observed between most measured and predicted values, indicating that time reversal of the impulse response does not affect the MTF. Similar results were observed in other octave bands (not shown).
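
The invariance of the MTF magnitude under time reversal can be checked numerically. The sketch below does not follow the Steeneken and Houtgast (1980) measurement procedure used in the study; it swaps in the impulse-response formulation usually attributed to Schroeder, in which the MTF is the normalized Fourier transform of the squared impulse response. The impulse response is an idealized broadband exponential decay (no octave-band filtering), and the sampling rate, modulation frequencies, and function names are assumptions.

```python
import numpy as np


def mtf_magnitude(h, fs, mod_freqs):
    """|MTF| from an impulse response via its squared envelope."""
    energy = np.asarray(h, dtype=float) ** 2
    t = np.arange(energy.size) / fs
    kernel = np.exp(-2j * np.pi * np.outer(mod_freqs, t))
    return np.abs(kernel @ energy) / energy.sum()


def theoretical_mtf(t60_s, mod_freqs):
    """Low-pass MTF of an ideal exponential decay with reverberation time T60."""
    return 1.0 / np.sqrt(1.0 + (2 * np.pi * np.asarray(mod_freqs) * t60_s / 13.8) ** 2)


fs = 8000                      # assumed sampling rate for the toy example
t60 = 1.5                      # reverberation time from the text, in seconds
t = np.arange(int(3.0 * fs)) / fs
h_forward = 10.0 ** (-3.0 * t / t60)   # idealized decay: -60 dB at t = T60
h_reversed = h_forward[::-1]           # same room, reversed in time

mod_freqs = np.array([0.63, 1.0, 2.0, 4.0, 8.0, 12.5])  # Hz
m_forward = mtf_magnitude(h_forward, fs, mod_freqs)
m_reversed = mtf_magnitude(h_reversed, fs, mod_freqs)
```

For this idealized decay, the forward and reversed magnitudes agree with each other and closely track the theoretical low-pass curve.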

Fig. 2. MTF for the 1-octave band centered at 1 kHz for both time-forward (filled symbols) and time-reversed (open symbols) binaural room impulse responses. Data for only the left ear are shown. The solid curve represents a theoretical MTF (Houtgast and Steeneken, 1985) for a room with 1.5 s reverberation time (T₆₀).

The Speech Transmission Index (STI) integrates results from MTFs measured in octave bands from 125 Hz to 8 kHz (Houtgast and Steeneken, 1985) and indicates the amount of temporal modulation reduction imposed by the room (ranging from 0 “complete reduction” to 1 “no reduction”). As a result, high STI values have been shown to correspond to high levels of speech intelligibility (Houtgast and Steeneken, 1973, 1985). Since STI is derived from the MTF, it too should be unaffected by the time-reversal manipulation. STI values were computed for both the time-forward and time-reversed room impulse responses using one of the HINT sentence stimuli. As predicted, the values were very similar: 0.4549 (time-forward) and 0.4325 (time-reversed). Previous results indicate that these STI values would produce approximately 85% correct identification in a CVC-type task (Houtgast and Steeneken, 1973).
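
The way an STI value follows from MTF magnitudes can be illustrated with a heavily simplified sketch. It converts each modulation transfer value to an apparent signal-to-noise ratio, limits it to ±15 dB, maps it to the range 0 to 1, and averages. Octave-band weighting and masking corrections are omitted, so this is not the procedure used to obtain the values reported above and is not expected to reproduce them; the function name is hypothetical.

```python
import numpy as np

# Standard one-third-octave modulation frequencies used for the STI, in Hz.
STI_MOD_FREQS = np.array([0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
                          3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5])


def simplified_sti(m):
    """Unweighted single-band STI from modulation transfer values m in [0, 1]."""
    m = np.clip(np.asarray(m, dtype=float), 1e-6, 1.0 - 1e-6)
    snr_db = 10.0 * np.log10(m / (1.0 - m))        # apparent signal-to-noise ratio
    snr_db = np.clip(snr_db, -15.0, 15.0)          # limit to the +/-15 dB range
    return float(np.mean((snr_db + 15.0) / 30.0))  # map to 0..1 and average


# Theoretical MTF for an exponential decay with T60 = 1.5 s (value from the text).
t60 = 1.5
m_theory = 1.0 / np.sqrt(1.0 + (2 * np.pi * STI_MOD_FREQS * t60 / 13.8) ** 2)
sti_estimate = simplified_sti(m_theory)
```

Because the index depends only on the magnitudes m, any manipulation that leaves those magnitudes unchanged, such as time reversal of the impulse response, leaves the index unchanged as well.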

It is important to note that past study of the MTF/STI concept in relation to speech intelligibility and room acoustics has considered only the magnitude characteristics of MTFs (Houtgast and Steeneken, 1973, 1985), yet MTFs are technically complex functions with both magnitude and phase characteristics. Although it has been demonstrated that the time-reversal manipulation in this study does not affect the magnitude characteristic of the MTFs (or the STI), the phase characteristics of the MTFs were dramatically affected by this manipulation.
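
The magnitude and phase behavior can be made concrete with a small numerical check. Treating the complex MTF as the normalized Fourier transform of the squared impulse response (an assumed formulation, not the measurement method of the study), time reversal conjugates the MTF and adds a pure delay term: the magnitude is preserved while the phase changes sign. The idealized decay, sampling rate, and names below are assumptions.

```python
import numpy as np


def complex_mtf(h, fs, mod_freqs):
    """Complex MTF: normalized Fourier transform of the squared impulse response."""
    energy = np.asarray(h, dtype=float) ** 2
    t = np.arange(energy.size) / fs
    kernel = np.exp(-2j * np.pi * np.outer(mod_freqs, t))
    return (kernel @ energy) / energy.sum()


fs = 8000                                   # assumed sampling rate
t = np.arange(int(3.0 * fs)) / fs
h_forward = 10.0 ** (-3.0 * t / 1.5)        # idealized decay with T60 = 1.5 s
mod_freqs = np.array([1.0, 2.0, 4.0, 8.0])  # Hz

mtf_fwd = complex_mtf(h_forward, fs, mod_freqs)
mtf_rev = complex_mtf(h_forward[::-1], fs, mod_freqs)

# Reversal conjugates the MTF and adds a pure delay of (N - 1) samples.
delay = np.exp(-2j * np.pi * mod_freqs * (t.size - 1) / fs)
```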

Word recognition

Figure 3a displays the mean word recognition scores for each of the experimental conditions. All subjects performed predictably well in the baseline condition (no room simulation): correctly identifying all the words in the sentences presented. The addition of normal time-forward acoustics of a reverberant room reduced word recognition scores to approximately 89% on average. This result is roughly consistent with the 85% speech intelligibility prediction based on the STI for the simulated room. Sentence recognition scores were dramatically reduced for the time-reversed acoustics (approximately 25% correct on average) relative to the time-forward acoustics. A repeated-measures ANOVA on the arcsine-transformed (Kirk, 1982) word recognition scores confirmed that this difference was highly significant, F(1,9)=1055, p<0.0001, η²=0.992. These effects of time-reversing the room impulse response are not explainable on the basis of MTF-type analyses, because the time-forward and time-reversed conditions yielded nearly identical STI values. The effect is also readily apparent under informal listening situations, such as those demonstrated in Mm1 versus Mm2 and Mm3 versus Mm4.

Slight, but statistically significant, increases in performance were also observed (Fig. 3a) for binaural listening conditions relative to diotic listening, F(1,9)=5.35, p<0.05, η²=0.373. This result is consistent with previous literature (Nabelek and Robinson, 1982).
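
The reported effect sizes can be cross-checked against the F ratios. The sketch below assumes that η² here is partial eta squared, computed as F·df_effect / (F·df_effect + df_error), and that the arcsine transform takes the common form 2·arcsin(√p); the text specifies neither, so both are assumptions, and the function names are hypothetical.

```python
import math


def arcsine_transform(p):
    """Variance-stabilizing transform for a proportion p in [0, 1]."""
    return 2.0 * math.asin(math.sqrt(p))


def partial_eta_squared(f_value, df_effect, df_error):
    """Effect size recovered from an F ratio and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)


room_effect = partial_eta_squared(1055, 1, 9)      # time-forward vs time-reversed
binaural_effect = partial_eta_squared(5.35, 1, 9)  # binaural vs diotic
```

Rounded to three decimals, these give 0.992 and 0.373, matching the reported values, which supports the partial eta squared reading.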

To test for potential room adaptation effects, we compared average performance for the first ten sentences presented in each block of trials versus the last ten sentences on a listener-by-listener basis. If room adaptation takes place over the course of a trial block in which repeated listening from the same acoustic environment takes place, then word recognition performance should also improve during the trial block. This result was observed for the time-forward binaural condition only, where word recognition scores improved by nearly six percentage points (see Fig. 3b). This improvement was highly significant, F(1,9)=31.23, p<0.001 (Bonferroni correction), η²=0.776, and is consistent with the notion of an environmental adaptation effect in which familiarity with a natural acoustic environment can aid in speech intelligibility performance (Watkins, 2005b, 2005a). All other conditions show either no change, or else a small decrease, F(1,9)=5.67, p<0.05 (Bonferroni correction), η²=0.386, in performance between first and last ten sentences. These latter results are inconsistent with simple practice effects causing the improvement in performance for the time-forward binaural condition.
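
The first-ten versus last-ten comparison can be sketched as a per-listener difference of means. The array layout (listeners × blocks × 20 sentences in presentation order) and the function name are hypothetical, and the scores below are random placeholders, not data from the study.

```python
import numpy as np


def first_vs_last_ten(scores):
    """scores: (listeners, blocks, 20) per-sentence proportion correct,
    ordered by presentation position within each 20-sentence block."""
    first = scores[:, :, :10].mean(axis=(1, 2))
    last = scores[:, :, 10:].mean(axis=(1, 2))
    return first, last, last - first


# Synthetic placeholder scores, NOT data from the study: 10 listeners, 2 blocks.
rng = np.random.default_rng(1)
scores = rng.uniform(0.6, 1.0, size=(10, 2, 20))
first, last, change = first_vs_last_ten(scores)
```

A positive entry in `change` indicates that a listener scored higher on the last ten sentences of the blocks than on the first ten.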

Onset analysis

One potential explanation for the dramatic decrease in intelligibility between time-forward and time-reversed room acoustics is that the latter disrupts onset information in the speech stimulus much more than the former. To examine this possibility, we quantified the onset information present in the various sentence stimuli (full 8 kHz bandwidth) used in this experiment by first computing the instantaneous slope (1st derivative) of the time waveform’s amplitude envelope (extracted using a second-order Butterworth low-pass filter with 10 Hz cutoff) and then taking the mean of all positive slope values in each sentence. Larger values of this measure generally indicate more rapid amplitude onsets in the time waveform. Figure 4 displays the distributions of this onset measure for each of 170 sentences with either no processing, time-forward room processing (left ear only), or time-reversed room processing (left ear only). Mean onset measures for the three conditions (matched-observations) were significantly different, F(2,338)=1676, p<0.0001, η²=0.908, and strongly correlated with mean word recognition performance in the three conditions (r=0.88).
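
The onset measure described above can be sketched as follows, using SciPy's Butterworth design. The text specifies only the filter order and cutoff, so extracting the envelope by full-wave rectification and applying the filter causally are assumptions, as are the sampling rate, the toy signal, and the function name.

```python
import numpy as np
from scipy.signal import butter, lfilter


def onset_measure(x, fs, cutoff_hz=10.0):
    """Mean positive slope of the low-pass-filtered amplitude envelope."""
    b, a = butter(2, cutoff_hz, btype="low", fs=fs)   # second-order Butterworth
    envelope = lfilter(b, a, np.abs(x))               # rectify, then smooth
    slope = np.diff(envelope) * fs                    # first derivative per second
    positive = slope[slope > 0]
    return float(positive.mean()) if positive.size else 0.0


# Toy signal (hypothetical): a 500 Hz tone with an abrupt onset and a slow decay.
fs = 16000
t = np.arange(fs) / fs
abrupt_onset = np.sin(2 * np.pi * 500 * t) * np.exp(-t / 0.2)
gradual_onset = abrupt_onset[::-1]   # same energy, but a slow build-up instead

onset_abrupt = onset_measure(abrupt_onset, fs)
onset_gradual = onset_measure(gradual_onset, fs)
```

In this toy case the abrupt-onset signal yields the larger measure, mirroring the direction of the difference between the time-forward and time-reversed room conditions.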

Fig. 4. Distributions of onset measures in the speech stimuli for three different room acoustic manipulations: baseline, no room (white); time-forward room (grey); time-reversed room (black). Means (M) and standard deviations (SD) are shown for each distribution.

Discussion and conclusions

These results demonstrate that speech intelligibility is dramatically degraded when reverberation and echoes precede direct-path energy in simulated room environments. Although the precise cause of the intelligibility degradation is not known, it cannot be explained by MTF-type analyses of the rooms and does not appear to depend critically on binaural input. It may instead be related to the resulting degradation of onset information in the speech signals, which has been argued to be particularly important for speech intelligibility, from both computational (Régnier and Allen, 2008) and neurophysiological perspectives (Heil, 2003). Further study will be needed to more fully test this onset hypothesis, using stimulus conditions more conducive to evaluating speech onset information, such as consonant-vowel pairs.

Additional results from this study are suggestive of adaptive processes that mediate the contributions of room acoustics under plausible time-forward conditions (Clifton et al., 1994; Hartmann, 1997), but not in implausible time-reversed conditions. Although consistent with results of Watkins (2005b, 2005a), further research will be required to more fully understand the conditions under which the adaptation occurs, its potential mechanisms, and its relationship to other adaptation phenomena in acoustically reflective environments that relate to apparent sound direction (Freyman et al., 1991).

Acknowledgments

Work supported by NIH-NIDCD (R01DC008168).

References and links

Allen, J. B., and Berkley, D. A. (1979). “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Am. 65, 943–950. doi:10.1121/1.382599
ANSI-S3.9 (1989). American National Standard Specification for Audiometers (American National Standards Institute, New York).
Clifton, R. K., Freyman, R. L., Litovsky, R. Y., and McCall, D. (1994). “Listeners’ expectations about echoes can raise or lower echo threshold,” J. Acoust. Soc. Am. 95, 1525–1533. doi:10.1121/1.408540
Freyman, R. L., Clifton, R. K., and Litovsky, R. Y. (1991). “Dynamic processes in the precedence effect,” J. Acoust. Soc. Am. 90, 874–884. doi:10.1121/1.401955
Hartmann, W. M. (1997). “Listening in a room and the precedence effect,” in Binaural and Spatial Hearing in Real and Virtual Environments, edited by R. H. Gilkey and T. R. Anderson (Erlbaum, Mahwah, NJ), pp. 191–210.
Heil, P. (2003). “Coding of temporal onset envelope in the auditory system,” Speech Commun. 41, 123–134. doi:10.1016/S0167-6393(02)00099-7
Heinz, R. (1993). “Binaural room simulation based on an image source model with addition of statistical methods to include the diffuse sound scattering of walls and to predict the reverberant tail,” Appl. Acoust. 28, 145–159.
Houtgast, T., and Steeneken, H. J. M. (1973). “The modulation transfer function in room acoustics as a predictor of speech intelligibility,” Acustica 28, 66–73.
Houtgast, T., and Steeneken, H. J. M. (1985). “A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria,” J. Acoust. Soc. Am. 77, 1069–1077. doi:10.1121/1.392224
Houtsma, A. J. M., Rossing, T. D., and Wagenaars, W. M. (1987). “Effects of echoes,” in Auditory Demonstrations (Institute for Perception Research, Eindhoven, The Netherlands).
Kirk, R. E. (1982). Experimental Design: Procedures for the Behavioral Sciences (Brooks/Cole, Monterey, CA).
Nabelek, A. K., and Robinson, P. K. (1982). “Monaural and binaural speech perception in reverberation for listeners of various ages,” J. Acoust. Soc. Am. 71, 1242–1248. doi:10.1121/1.387773
Naylor, G. M. (1993). “ODEON—Another hybrid room acoustic model,” Appl. Acoust. 38, 131–143. doi:10.1016/0003-682X(93)90047-A
Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). “Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise,” J. Acoust. Soc. Am. 95, 1085–1099. doi:10.1121/1.408469
Régnier, M. S., and Allen, J. B. (2008). “A method to identify noise-robust perceptual features: application for consonant /t/,” J. Acoust. Soc. Am. 123, 2801–2814. doi:10.1121/1.2897915
Sabine, W. C. (1922). “Reverberation,” in Collected Papers on Acoustics (Harvard University Press, Cambridge, MA).
Steeneken, H. J., and Houtgast, T. (1980). “A physical method for measuring speech-transmission quality,” J. Acoust. Soc. Am. 67, 318–326. doi:10.1121/1.384464
Watkins, A. J. (2005a). “Perceptual compensation for effects of echo and of reverberation on speech identification,” Acust. Acta Acust. 91, 892–901.
Watkins, A. J. (2005b). “Perceptual compensation for effects of reverberation in speech identification,” J. Acoust. Soc. Am. 118, 249–262. doi:10.1121/1.1923369
Wightman, F. L., and Kistler, D. J. (1989). “Headphone simulation of free-field listening: I. Stimulus synthesis,” J. Acoust. Soc. Am. 85, 858–867. doi:10.1121/1.397557
Zahorik, P. (2004). “Perceptual scaling of room reverberation,” J. Acoust. Soc. Am. 115, 2598. doi:10.1121/1.1369784
