Spectral motion contrast as a speech context effect

Ningyuan Wang; Andrew J Oxenham

doi:10.1121/1.4892771

2014 Sep;136(3):1237–1245.

PMCID: PMC4165225 PMID: 25190397

Abstract

Spectral contrast effects may help “normalize” the incoming sound and produce perceptual constancy in the face of the variable acoustics produced by different rooms, talkers, and backgrounds. Recent studies have concentrated on the after-effects produced by the long-term average power spectrum. The present study examined contrast effects based on spectral motion, analogous to visual-motion after-effects. In experiment 1, the existence of spectral-motion after-effects with word-length inducers was established by demonstrating that the identification of the direction of a target spectral glide was influenced by the spectral motion of a preceding inducer glide. In experiment 2, the target glide was replaced with a synthetic sine-wave speech sound, including a formant transition. The speech category boundary was shifted by the presence and direction of the inducer glide. Finally, in experiment 3, stimuli based on synthetic sine-wave speech sounds were used as both context and target stimuli to show that the spectral-motion after-effects could occur even with inducers with relatively short speech-like durations and small frequency excursions. The results suggest that spectral motion may play a complementary role to the long-term average power spectrum in inducing speech context effects.

I. INTRODUCTION

Perceptual systems encode stimuli in a way that is highly dependent on contextual information. Speech is no exception to this general rule, and our perception of individual speech sounds can depend strongly on the context in which they are presented. In a pioneering study, Ladefoged and Broadbent (1957) tested 60 subjects in a word identification task. They observed that altering the first two formants within a context sentence (“Please say what this word is”) dramatically changed subjects' identification of the following test words. For example, a test word was perceived as “bit” by 53 subjects out of 60 when the unfiltered sentence was presented as the context, whereas the same word was perceived as “bet” by 54 of the subjects after the first formant (F1) of the preceding sentence was lowered somewhat. In a later example, Mann (1980) found that ambiguous syllables along a /ga/-/da/ continuum were generally perceived as /ga/ when preceded by the syllable /al/ and were perceived as /da/ when preceded by the syllable /ar/.

Since these early studies, it has been debated whether such context effects are specific to speech, or whether they reflect more general auditory processes. Soon after Mann's study, Fowler (1981) suggested that this “compensation for coarticulation” must reflect speech processes, since subjects' strategy for perceiving vowels was tightly coupled to their strategy for producing them. However, other researchers have since argued that such context effects may reflect more general auditory processes (Diehl et al., 2004). For instance, Lotto and Kluender (1998) observed a smaller but significant effect even when using sine-wave tones or glides corresponding to F3 of /al/ and /ar/ as the precursor, demonstrating that it was not necessary for the precursor to be perceived as speech for context effects to occur. In addition, Lotto et al. (1997) found similar context effects in a behavioral study of Japanese quail, suggesting that knowledge of speech was also not necessary. Both these and other studies (e.g., Holt, 2006) have suggested that the average power spectrum of the preceding sound plays a dominant role in determining context effects, and that the effects are contrastive. Summerfield et al. (1984) found that listeners were able to identify a flat-spectrum harmonic tone complex as a vowel if it followed a sound with a similar spectrum but with the components at frequencies corresponding to the first three formants of the vowel omitted. Wang et al. (2012) also observed similar effects with cochlear-implant users. Such contrastive effects are common in other sensory modalities (Gibson, 1933), and may reflect the tendency of perceptual systems to normalize or “whiten” the incoming stimuli to improve coding efficiency (e.g., Barlow, 1961; Dean et al., 2008).

Aside from average power spectrum, other stimulus properties may also induce after-effects that may be relevant to speech perception. For instance, both speech and non-speech contexts affect the perception of the fundamental-frequency (F0) contour of lexical tones in a contrastive way: following a context with a higher mean F0, the target syllable is more likely to be identified as a lexical tone starting from a lower F0 and vice versa (Huang and Holt, 2012).

In addition to spectral contrast effects, temporal contrast effects also occur in speech perception (e.g., Diehl and Walsh, 1989; Wade and Holt, 2005). For instance, Wade and Holt (2005) measured the influence of the presentation rate of a preceding sequence of pure tones on the perception of stimuli generated from a continuum between /ba/ and /wa/, as defined by the duration of formant transitions. They observed that a rapid presentation rate of the preceding pure tones resulted in more /wa/ responses, corresponding to the perception of a longer formant transition, while a slower presentation rate resulted in more /ba/ responses, corresponding to the perception of a shorter formant transition. Thus, contrastive after-effects have been shown in speech in both spectral and temporal domains.

Dynamic spectral changes may also play a role in inducing context effects. In a demonstration with some similarities to the visual-motion after-effect (Gibson, 1933), often referred to as the “waterfall effect,” Shu et al. (1993) found that preceding glides in the center frequency of narrowband noise induced the perception of spectral motion in the opposite direction, such that a downward sweep, repeated over 2–3 min, caused listeners to hear a stationary noise band as increasing in frequency, and vice versa. Beyond that initial report on the spectral-motion after-effect, little is known concerning the underlying mechanisms, or its relevance to everyday auditory perception. One earlier study (Holt et al., 2000) reported that preceding contexts that included formant transitions had a larger effect on synthesized vowel identification than conditions with only a steady-state spectral context, suggesting that spectral motion may also play a role in speech context effects.

The present study investigates spectral-motion after-effects and their influence on the perception of non-speech and synthesized-speech sounds. The first experiment confirms the presence of spectral-motion after-effects with stimulus durations closer to those of speech sounds. The second experiment reports after-effects of spectral motion on perceptual judgments of speech sounds. Finally, the third experiment examines possible trade-offs between average spectrum and spectral motion, using precursors designed to resemble speech sounds more closely.

II. EXPERIMENT 1: AUDITORY SPECTRAL-MOTION AFTER-EFFECTS WITH WORD-LENGTH INDUCERS

A. Methods

1. Subjects

Eight (2 males, 6 females) native speakers of American English participated in this experiment and were compensated for their time. Their ages ranged from 18 to 28 years (mean age 23.6 years). They had normal hearing, as defined by audiometric thresholds below 20 dB hearing level (HL) at octave frequencies between 0.25 and 8 kHz.

2. Stimuli

Each trial consisted of a single 500-ms precursor tone, followed by a single 50-ms target tone. The precursor and target were separated by a 50-ms silent gap. All the stimuli were gated on and off with 20-ms raised-cosine ramps. As illustrated in Fig. 1, the precursor was centered in the high (2200 Hz), middle (2000 Hz), or low (1800 Hz) frequency region, and was a rising or falling linear frequency glide, or remained at the same frequency. The combination of three frequency regions and three temporal patterns resulted in a total of nine precursor conditions. The nominal beginning and end frequencies of the precursors are listed in Table I. The nominal beginning frequency of the target stimulus was selected from the range between 1920 Hz and 2080 Hz in steps of 20 Hz (nine values), and the nominal end frequency was always 2000 Hz. The overall frequency content of both precursor and target was roved together by ±10% across trials, so that the frequency relationship between the precursor and the target remained constant. The rove was designed to discourage listeners from using potential cues based on absolute frequency.

FIG. 1. — Schematic diagram of the stimuli used in experiment 1. The precursor, or inducer, was a rising, falling, or steady 500-ms glide that was centered at one of three frequencies. The test stimulus, or target, was a 50-ms tone, selected from one of the rising, falling, or steady lines shown at the right of the figure.

TABLE I. Onset and offset frequencies of each precursor condition.

Conditions	No precursor	High-rising	High-flat	High-falling	Middle-rising	Middle-flat	Middle-falling	Low-rising	Low-flat	Low-falling
Onset (Hz)	N/A	2150	2200	2250	1950	2000	2050	1750	1800	1850
Offset (Hz)	N/A	2250	2200	2150	2050	2000	1950	1850	1800	1750

The stimuli were generated digitally and played out diotically from a LynxStudio L22 24-bit soundcard at a sampling rate of 22.5 kHz via Sennheiser HD650 headphones to subjects seated in a double-walled sound-attenuating chamber. The equivalent diffuse-field presentation level for all the sounds was 65 dB sound pressure level (SPL).
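The stimulus description above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it uses the durations, ramp length, and sampling rate stated in the text, but the phase-accumulation synthesis, the uniform multiplicative rove, and all function names are assumptions made for the example. Level calibration and playback are omitted.

```python
import math
import random

FS = 22500  # sampling rate in Hz, as stated in the text


def raised_cosine_gate(n_samples, ramp_samples):
    """Amplitude envelope with raised-cosine onset and offset ramps."""
    env = [1.0] * n_samples
    for i in range(ramp_samples):
        w = 0.5 * (1.0 - math.cos(math.pi * i / ramp_samples))
        env[i] = w
        env[n_samples - 1 - i] = w
    return env


def linear_glide(f_start, f_end, dur_s, ramp_s=0.020, fs=FS):
    """Sine tone whose frequency moves linearly from f_start to f_end."""
    n = int(round(dur_s * fs))
    env = raised_cosine_gate(n, int(round(ramp_s * fs)))
    samples = []
    phase = 0.0
    for i in range(n):
        samples.append(env[i] * math.sin(phase))
        # Advance the phase by the instantaneous frequency at this sample.
        f_inst = f_start + (f_end - f_start) * i / n
        phase += 2.0 * math.pi * f_inst / fs
    return samples


def make_trial(prec_on, prec_off, target_on, target_off=2000.0, rng=random):
    """500-ms precursor, 50-ms silent gap, 50-ms target, sharing one rove."""
    # Assumption: the rove is a uniform multiplicative factor in [0.9, 1.1].
    rove = 1.0 + rng.uniform(-0.10, 0.10)
    precursor = linear_glide(prec_on * rove, prec_off * rove, 0.500)
    gap = [0.0] * int(round(0.050 * FS))
    target = linear_glide(target_on * rove, target_off * rove, 0.050)
    return precursor + gap + target, rove


# Middle-rising precursor (1950 to 2050 Hz) with a rising target (1980 to 2000 Hz).
trial, rove = make_trial(1950.0, 2050.0, 1980.0)
```

Applying one rove factor to both precursor and target keeps their frequency ratio fixed, which is one way to satisfy the requirement that the relationship between the two remain constant across trials.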

3. Procedure

Subjects were asked to judge whether the target tone was “rising” or “falling” and to respond via virtual buttons on the computer display. Prior to the actual experiment, all subjects underwent a training session, during which they were presented with just the target and no precursor. Eight target conditions were tested, including all the target conditions tested in the actual experiment, with the exception of the “flat” target. Each of the conditions was presented 10 times within a block of trials. Feedback was provided during training. In order to progress to the actual experiment, subjects had to achieve at least 80% correct responses on average within 3 blocks in discriminating rising from falling glides. Two of the initial 10 subjects failed to reach this criterion, so only the remaining 8 were tested further. In the actual experiment, all 9 target conditions were tested 10 times each within each block in random order, for a total block length of 90 trials with a single precursor condition. The 10 precursor conditions (9 precursors and 1 no-precursor reference condition) were presented in separate blocks and were repeated 5 times, each in random order, for a total of 50 blocks. Thus, each of the 90 conditions (9 target by 10 precursor conditions) was repeated 50 times, and the proportion of “rising” and “falling” responses was calculated for each subject and condition from these 50 responses. No feedback was provided in the test sessions. All subjects provided informed written consent prior to participating, and the experimental protocols were approved by the Institutional Review Board of the University of Minnesota.
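The trial counts in the procedure follow from simple bookkeeping, which the sketch below makes explicit. It is an illustration only: the condition labels, the `build_session` helper, and the choice to shuffle the ten precursor blocks within each of the five repetitions are assumptions, since the text states only that blocks and trials were presented in random order.

```python
import random

TARGET_ONSETS_HZ = list(range(1920, 2081, 20))  # 9 target onset frequencies
PRECURSORS = ["none"] + [
    f"{region}-{motion}"
    for region in ("low", "middle", "high")
    for motion in ("rising", "flat", "falling")
]  # 9 precursors plus the no-precursor reference
REPS_PER_BLOCK = 10   # presentations of each target within a block
BLOCK_REPEATS = 5     # times each precursor block is run


def build_session(rng):
    """Return a list of (precursor, shuffled target onsets) blocks."""
    blocks = []
    for _ in range(BLOCK_REPEATS):
        order = PRECURSORS[:]
        rng.shuffle(order)        # precursor blocks in random order
        for precursor in order:
            trials = TARGET_ONSETS_HZ * REPS_PER_BLOCK
            rng.shuffle(trials)   # targets in random order within the block
            blocks.append((precursor, trials))
    return blocks


session = build_session(random.Random(0))
```

With these settings the session contains 10 × 5 = 50 blocks of 9 × 10 = 90 trials, so each of the 90 target-by-precursor combinations is presented 10 × 5 = 50 times.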

B. Results

The mean results are shown in Fig. 2. The left, middle, and right panels show the results using the precursor in the low, middle, and high spectral region, respectively. For comparison, the results from the condition with no precursor are shown as circles in all three panels. Considering first the condition with the precursor in the middle spectral region (Fig. 2, middle panel), it seems that on average the rising precursor led to more “falling” responses, and the falling precursor led to more “rising” responses, relative to the “flat” precursor condition. In other words, the results from the precursor in the middle region are consistent with predictions based on a contrastive spectral-motion after-effect. Similar differences between the falling and rising precursor can be observed in the lower and higher spectral regions (Fig. 2, left and right panels, respectively), although the relationship between those responses and the responses to the flat or no precursor is not so clear-cut.

To quantify the effects of the precursor, we used probit analysis to fit each of the curves shown in Fig. 2 for each subject individually. Then we calculated the point at which each curve crossed the 50% point (i.e., the point at which a “falling” response was as likely as a “rising” response), which is termed the “category response boundary.” The mean category response boundaries, averaged across subjects, are shown in Fig. 3. A boundary value of 2000 Hz implies that a flat target was perceived veridically; higher boundary values imply that flat targets were more likely to be reported as rising, whereas lower boundary values imply that flat targets were more likely to be reported as falling. The category response boundaries were subjected to a two-way within-subjects analysis of variance (ANOVA), with precursor glide direction (rising, falling, or flat) and spectral region (low, middle, or high) as the two factors. Significant main effects were observed for both glide direction [F(2,14) = 5.6; p = 0.016] and spectral region [F(2,14) = 12.6; p = 0.001], and for their interaction [F(4,28) = 4.05; p = 0.01]. The main effect of glide direction reflects the trend visible in Fig. 3 that the rising precursor tended to lead to lower boundary values than the falling precursor. Post hoc contrast analysis showed that the response boundary in the rising condition was significantly different from that in the falling condition (p = 0.049). However, no significant difference was observed between the response boundary in the flat condition and that in either the rising or falling condition. The main effect of spectral region reflects the trend for decreasing boundary value from low to high precursor spectral region. The interaction presumably reflects the impression that the effect of spectral motion seems greater in the middle spectral region than in the low or high region.
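As an illustration of how a category response boundary can be obtained, the sketch below fits a two-parameter probit function to response counts by Fisher scoring and returns the 50% crossing point. It is a sketch under stated assumptions, not the analysis code used in the study: the parameterization, the fitting routine, the function names, and the counts (generated from a known probit curve with a 2010-Hz boundary purely to exercise the code) are all hypothetical and are not data from the experiment.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def fit_probit_boundary(onsets_hz, n_rising, n_trials,
                        center=2000.0, scale=40.0, n_iter=50):
    """Fit P(rising) = Phi(a + b*u), with u = (onset - center) / scale.

    Uses Fisher scoring on binomial counts and returns the onset
    frequency (Hz) at which P(rising) = 0.5.
    """
    us = [(x - center) / scale for x in onsets_hz]
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = i00 = i01 = i11 = 0.0
        for u, k in zip(us, n_rising):
            eta = a + b * u
            p = min(max(norm_cdf(eta), 1e-9), 1.0 - 1e-9)
            pdf = norm_pdf(eta)
            w = n_trials * pdf * pdf / (p * (1.0 - p))      # Fisher weight
            r = (k - n_trials * p) * pdf / (p * (1.0 - p))  # score term
            g0 += r
            g1 += r * u
            i00 += w
            i01 += w * u
            i11 += w * u * u
        # Solve the 2x2 system (information matrix) * step = score.
        det = i00 * i11 - i01 * i01
        a += (i11 * g0 - i01 * g1) / det
        b += (i00 * g1 - i01 * g0) / det
    return center - scale * a / b


# Hypothetical counts out of 50 trials, generated from a known probit curve.
onsets = list(range(1920, 2081, 20))
synthetic_counts = [round(50 * norm_cdf((2010.0 - x) / 30.0)) for x in onsets]
boundary = fit_probit_boundary(onsets, synthetic_counts, 50)
```

Because lower onset frequencies correspond to rising targets, the fitted slope `b` is negative. A boundary above 2000 Hz, as in this synthetic case, corresponds to flat targets being reported as rising more often than falling.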

FIG. 3. — Mean category response boundary frequencies for each condition. The different bar shadings represent the different precursor motion conditions, as shown in the legend. The results from the three spectral regions are shown in separate groups, as listed along the horizontal axis. Error bars represent 1 s.e. of the mean.

C. Discussion

The results from this experiment, showing a rising precursor leading to more “falling” responses, and vice versa, are consistent with the original report of a contrastive spectral-motion after-effect (Shu et al., 1993), and extend the original finding by showing that a relatively short, word-length precursor of 500 ms is sufficient to produce a measurable effect. Relatively short spectral motion on this time scale could come from pitch glides in speech, particularly in tone languages, where it has already been shown that F0 contrast effects can be measured (Huang and Holt, 2009).

The effect of spectral region produced an interesting trend, which might be described as “continuity”: if the precursor was in the high spectral region, then the target was more likely to be reported as “falling,” i.e., moving from the region of the precursor to the center, whereas if the precursor was in the low spectral region, the target was more likely to be reported as “rising.” This is the opposite of what would be expected based on spectral contrast, where a high precursor would be expected to lower the perceived beginning of the target. One potential reason why our results are not consistent with expectations based on spectral contrast is that the target consisted of just a short glide, whereas earlier studies have used speech-like sounds that began with a short glide, simulating a formant transition, and ended with a longer steady-state portion. The lack of a steady-state portion at the end of the glide may have reduced the extent to which spectral contrast differentially affected the beginning and end of the target sound.

We have assumed that the differences produced by the rising and falling precursors, particularly in the middle spectral region, are due to their spectral-motion properties. It is clear that the average spectrum of the precursor in the middle region cannot explain the effects, as the average frequencies of the rising, falling, and flat precursors are the same. Nevertheless, it is possible that the results reflect primarily the end frequency of the precursor, rather than spectral motion per se. This interpretation is rendered less likely by the fact that the end frequency is not a good predictor of all the results. Progressing from the low spectral region to the high, there is a 100-Hz difference between the end frequencies of the falling and rising precursors within each spectral region, and between the rising precursor of one spectral region and the falling precursor of the next (going from left to right in Fig. 3, ignoring the flat precursor conditions). Therefore, if the end frequency of each precursor predicted the results, the category response boundary should monotonically (and perhaps linearly) decrease with increasing end frequency. Although this pattern holds within each of the three spectral regions, it does not hold across spectral regions; for instance, going from low-rising to middle-falling leads to an increase in category response boundary, rather than the decrease predicted by the end frequency of the precursor. However, the results are somewhat variable, leaving potential room for doubt. In the next experiment we used sine-wave speech targets where the perceived glide direction of a synthetic formant changed the identity of the speech sound. Based on earlier studies, we expected long-term spectral contrast effects to predict the opposite pattern of results from spectral-motion after-effects, thereby making it easier to distinguish between the two.
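The end-frequency argument can be checked against Table I with a few lines of code. The sketch below orders the six gliding precursors by their offset frequency, confirms the 100-Hz spacing, and defines a helper that tests whether a set of boundaries decreases strictly along that ordering. The helper name and its input format are hypothetical, and no boundary values from the study are included, because the text does not report them numerically.

```python
# Offset (end) frequencies in Hz, copied from Table I.
OFFSET_HZ = {
    "low-falling": 1750, "low-flat": 1800, "low-rising": 1850,
    "middle-falling": 1950, "middle-flat": 2000, "middle-rising": 2050,
    "high-falling": 2150, "high-flat": 2200, "high-rising": 2250,
}

# Gliding precursors ordered by end frequency (flat conditions ignored,
# as in the text).
gliding = sorted((hz, name) for name, hz in OFFSET_HZ.items()
                 if "flat" not in name)
steps = [later[0] - earlier[0] for earlier, later in zip(gliding, gliding[1:])]


def follows_end_frequency(boundaries):
    """True if boundaries strictly decrease as precursor end frequency rises.

    `boundaries` maps a condition name to its category response
    boundary in Hz (hypothetical input format).
    """
    ordered = [boundaries[name] for _, name in gliding]
    return all(later < earlier for earlier, later in zip(ordered, ordered[1:]))
```

Under the end-frequency account, `follows_end_frequency` would return `True` for the measured boundaries. The increase from low-rising to middle-falling described above is the kind of violation that would make it return `False`.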