Abstract
Cochlear-implant (CI) users often have difficulties perceiving speech in noisy environments. Although this problem likely involves auditory scene analysis, few studies have examined sequential segregation in CI listening situations. The present study aims to assess the possible role of fundamental frequency (F0) cues for the segregation of vowel sequences, using a noise-excited envelope vocoder that simulates certain aspects of CI stimulation. Obligatory streaming was evaluated using an order-naming task in two experiments involving normal-hearing subjects. In the first experiment, it was found that streaming did not occur based on F0 cues when natural-duration vowels were processed to reduce spectral cues using the vocoder. In the second experiment, shorter duration vowels were used to enhance streaming. Under these conditions, F0-related streaming appeared even when vowels were processed to reduce spectral cues. However, the observed segregation could not be convincingly attributed to temporal periodicity cues. A subsequent analysis of the stimuli revealed that an F0-related spectral cue could have elicited the observed segregation. Thus, streaming under conditions of severely reduced spectral cues, such as those associated with CIs, may potentially occur as a result of this particular cue.
INTRODUCTION
The mechanisms involved in auditory stream segregation have been thoroughly investigated in normal-hearing (NH) listeners (e.g., Bregman and Campbell, 1971; van Noorden, 1975; Bregman, 1990). These studies led to the peripheral channeling theory (Hartmann and Johnson, 1991), which states that two stimuli need to excite different peripheral neural populations to produce auditory streaming. This theory and its implementations (Beauvois and Meddis, 1996; McCabe and Denham, 1997) assume that the main cues for streaming are spectral, suggesting that frequency selectivity is critical. Moore and Gockel (2002), in a review of studies involving sequential stream segregation, further concluded that any sufficiently salient perceptual difference may lead to stream segregation, regardless of whether or not it involves peripheral channeling (see also Elhilali and Shamma, 2007). Frequency selectivity can also affect the perceptual salience of cues, and difference limen (DL) measurements can be used to evaluate the salience of stimuli along a given perceptual dimension. Rose and Moore (2005) tested this hypothesis and found that the fission boundary (cf. van Noorden, 1975) was indeed proportional to the frequency DL for pure tones between 250 and 2000 Hz. However, it can be difficult to clearly define the salience of complex sounds composed of many interacting features. Moreover, this difficulty can be compounded when the signal is degraded by the auditory system, as in hearing-impaired (HI) listeners or cochlear-implant (CI) users. The current study aims to clarify the role of fundamental frequency in the perceptual segregation of vowel sequences having spectral cues reduced through the use of an acoustic vocoder model of a CI (cf. Dudley, 1939; Shannon et al., 1995).
Experiments involving NH listeners have shed light on the mechanisms underlying pitch-based streaming and on the influence of reduced frequency resolution. Streaming based on fundamental frequency (F0) is reduced when the resolvability of harmonic components of complex tones is reduced, but it is still possible to some extent even when harmonics are totally unresolved (Vliegen and Oxenham, 1999; Vliegen et al., 1999; Grimault et al., 2000). Gaudrain et al. (2007) found that F0-based streaming of vowel sequences was reduced when frequency resolution was degraded by simulated broadened auditory filters (Baer and Moore, 1993). Roberts et al. (2002) showed that differences solely in temporal cues (obtained by manipulating the phase relationship between components) can elicit streaming. Finally, Grimault et al. (2002) observed streaming based on the modulation rate of sinusoidally amplitude-modulated noises, i.e., without any spectral cues to pitch. Although the pitch elicited by the modulated noise was relatively weak, these authors observed streaming similar to that obtained with unresolved complex tones. Thus, streaming is reduced when spectral cues are reduced, but it is apparently possible to some extent when spectral cues are removed and only temporal cues remain.
These results have substantial implications for individuals with sensorineural hearing impairment and those fitted with a CI. It is well known that these individuals have reduced access to spectral cues (cf. Moore, 1998). Fundamental frequency DLs are approximately 2.5 times greater than normal in HI listeners (Moore and Peters, 1992) and 7.7 times greater in CI users (Rogers et al., 2006). These results suggest that pitch differences are far less salient for these listeners and that pitch-based streaming might be impaired. Indeed, a few studies argue that reduced frequency selectivity is responsible for the relatively poor performance of CI users in the perception of concurrent voices (Qin and Oxenham, 2005; Stickney et al., 2004, 2007). However, psychoacoustic measures have indicated that temporal resolution is generally intact in the HI ear (for review, see Moore, 1998; Healy and Bacon, 2002). CI users are also sensitive to temporal rate pitch, up to a limit of approximately 300 Hz (Shannon, 1983; Tong and Clark, 1985; Townshend et al., 1987). Although their DLs for rate discrimination are larger than in NH listeners (Zeng, 2002), CI listeners can use this cue to discriminate vowel F0’s (Geurts and Wouters, 2001). Although these results indicate that the cues for streaming may be available, their use by these individuals is not well understood.
Auditory streaming has been examined to a limited extent in HI listeners, with mixed results. Grose and Hall (1996) found that listeners with cochlear hearing loss generally required a greater frequency separation for segregation of pure tones. However, Rose and Moore (1997) reported no systematic difference between the ears of unilaterally impaired listeners in this task. The correlation between auditory filter width and pure-tone streaming was also found to be nonsignificant (Mackersie et al., 2001). Grimault et al. (2001) found that streaming was hindered for HI listeners relative to NH listeners, but only in conditions where components of complex tones were resolved for NH and unresolved for HI listeners. Finally, Stainsby et al. (2004) examined streaming based on phase relationship differences and found results for elderly HI listeners that were similar to those observed in NH listeners.
A few studies have also attempted to examine streaming in CI users (Hong and Turner, 2006; Chatterjee et al., 2006; Cooper and Roberts, 2007). For these users, different kinds of temporal cues can be related to pitch. Moore and Carlyon (2005) argued that the temporal fine structure of resolved harmonics provides the most accurate pitch mechanism. However, when harmonics are unresolved, they interact in auditory filters and can encode pitch by amplitude modulation (AM) rate (i.e., by the temporal envelope periodicity). Because of the way the spectrum is partitioned in the CI processor, harmonics of lower pitched human voices (F0∼100 Hz) almost always interact in the first channel of the CI processor. Thus, the availability of individual resolved harmonics is extremely limited. In contrast, the temporal envelope is roughly preserved in each band, so pitch may be coded by temporal periodicity cues. In this paper, the term “temporal-pitch cues” will refer to temporal periodicity (in the range 100–400 Hz), in contrast to “spectral-pitch cues,” which will refer to the pitch evoked by resolved harmonics (i.e., derived from the tonotopic analysis of the cochlea and, if relevant, from some analysis of temporal fine structure). Because amplitude-modulated broadband noises can produce some impression of pitch (Burns and Viemeister, 1976, 1981) and can induce streaming (Grimault et al., 2002), it might be possible for these temporal cues to induce streaming in CI users.
Hong and Turner (2006) used the rhythm task described in Roberts et al. (2002) to obtain an objective measure of obligatory streaming in NH listeners and CI users. They found that half of the 16–22 electrode CI users performed as poorly as the NH listeners (suggesting streaming), whereas the other half performed better than normal (suggesting less stream segregation). The authors showed that this variability correlated moderately but significantly with the ability to perceive speech in noise. Chatterjee et al. (2006) used pulse trains in ABA patterns and a subjective evaluation of whether subjects fitted with the 22-channel Nucleus CI heard one or two streams. These authors observed response patterns that could be explained by streaming for both differences in spatial location (presentation electrode) and AM rate (in a single subject). However, they did not observe the characteristic buildup of streaming over time (Bregman, 1978) for simple pulsatile stimuli that differed in location. This observation raises the possibility that the task involved discrimination rather than streaming. In contrast, they did observe some buildup for the signals that differed in AM rate, which suggests that AM-rate-based streaming was indeed observed. Cooper and Roberts (2007) also employed pulsatile stimuli that differed in electrode location. They obtained subjective reports of two streams, but a second experiment revealed that the results may have been attributable to pitch (or brightness) discrimination. Other studies have targeted temporal cues more specifically, but have examined simultaneous segregation by CI users. Using a CI simulation based on filtered harmonic complexes, Deeks and Carlyon (2004) found only modest improvements in concurrent sentence recognition when the target and masker were presented at different pulse rates. Also, Carlyon et al. (2007) found that a difference in rate pitch did not enhance simultaneous segregation of pulse trains in CI users. Altogether, these studies provide only modest evidence that segregation or streaming can occur in CI recipients on the basis of either place pitch (i.e., electrode number) or temporal pitch.
These previous results together suggest (1) that F0-based streaming is affected by frequency selectivity, but (2) that streaming can be also induced by temporal-pitch cues. It is also clear that (3) frequency selectivity is reduced in HI and CI listeners, but that (4) temporal-pitch cues are preserved to some extent in these listeners. The question then becomes to what extent these cues can be utilized to elicit streaming.
Although streaming is often assumed to be a primitive mechanism, some correlation between streaming and higher level processing, such as concurrent speech segregation, has been reported (Mackersie et al., 2001). However, the relation between streaming with pure or complex tones and speech segregation remains difficult to assess. In speech, pitch cues signaling that different talkers are present are mixed with other cues that may not be relevant for concurrent talker segregation. Listeners may then not benefit from these cues in ecological situations. Only a few studies have reported streaming with speech materials (Dorman et al., 1975; Nooteboom et al., 1978; Tsuzaki et al., 2007; Gaudrain et al., 2007), and only the last one examined the effect of impaired frequency selectivity.
The current study follows that of Gaudrain et al. (2007). Whereas streaming under conditions of broad tuning in accord with sensorineural hearing impairment was examined in that study, streaming under conditions similar to CI stimulation was assessed in the current study. Specifically, the role of reduced spectral-pitch cues in streaming of speech stimuli was assessed in a first experiment, and the possible role of temporal-pitch cues was investigated in a second experiment.
In this study, noise-band vocoder models of CIs were employed to control the spectral and temporal cues available to listeners. The use of NH listeners exposed to cues reduced in a controlled manner allowed the elimination of complications and confounds associated with the clinical population. In addition, an objective paradigm, the order task, was used to assess streaming. In this task, the listener is presented with a repeating sequence of vowels having alternating F0 and is asked to report the order of appearance of the constituent vowels (Dorman et al., 1975). If the sequence splits into streams corresponding to the two F0’s, the loss of temporal coherence across streams hinders the ability to identify the order of items within the sequence. Although the order of items within individual streams is available to the listener, the order of items across streams is not. Thus, this task requires that the subject resist segregation to perform well. As a result, the task is used to assess “obligatory” streaming, that which cannot be suppressed by the listener (see Gaudrain et al., 2007).
This type of streaming does not produce a substantial cognitive load, and it is less dependent on attention and subject strategy than the subjective evaluation of one versus two streams. In addition, this approach is appropriate for examining segregation in the presence of reduced cues: because performance tends to improve (less streaming) with degraded stimuli, performance cannot be attributed to the degradation of the individual items. The reduction in spectral resolution associated with the vocoder is expected to reduce the amount of streaming. Consequently, performance in the order task should improve in the CI simulation relative to the intact stimuli. However, if the temporal-pitch cues encoded by the CI simulation are sufficient to elicit obligatory streaming, overall scores should remain low and an effect of F0 separation should be observed.
EXPERIMENT 1
Materials and method
Subjects
Six subjects aged 22–30 years (mean 26.2) participated. All were native speakers of French and had pure-tone audiometric thresholds below 20 dB hearing level (HL) at octave frequencies between 250 and 4000 Hz (American National Standards Institute, 2004). All were paid an hourly wage for participation. These subjects participated in one of the experiments of Gaudrain et al. (2007) and were therefore familiar with the paradigm.
Stimuli
Individual vowels were first recorded and processed, then arranged into sequences. The six French vowels ∕a e i ɔ y u∕ were recorded (24 bits, 48 kHz) using a Røde NT1 microphone, a Behringer Ultragain preamplifier, a Digigram VxPocket 440 soundcard, and a PC. The speaker was instructed to pronounce all six vowels at the same pitch and to reduce prosodic variations. The F0 and duration of each vowel were then manipulated using STRAIGHT (Kawahara et al., 1999). Duration was set to 167 ms to produce a speech rate of 6.0 vowel∕s. This value is close to that measured by Patel et al. (2006) for syllable rates in British English (5.8 syllables∕s) and in French (6.1 syllables∕s). Additional versions of each vowel were then prepared in which the average F0’s were 100, 110, 132, 162, and 240 Hz. Fundamental frequency variations related to intonation were constrained to be within 0.7 semitones (4%) of the average. This value was chosen to allow F0 variations within each vowel, but to avoid overlap across the F0 conditions. Formant positions were held constant across F0 conditions.
Each vowel was subjected to two conditions of reduced spectral resolution. In the Q20 condition the vowels were processed with a 20-band noise vocoder, and in the Q12 condition with a 12-band noise vocoder. The Q12 condition was intended to be closer to actual CI characteristics, while the Q20 condition was intended to be an intermediate condition with more spectral detail. Q∞ refers to the intact vowels. The implementation of the noise-band vocoder followed Dorman et al. (1997). The stimulus was first divided into frequency bands using eighth order Butterworth filters. The cutoff frequencies of these bands were the approximately logarithmic values used by Dorman et al. (1998) and are listed in Table 1. The envelope of each band was extracted using half-wave rectification and eighth order Butterworth lowpass filtering with a cutoff frequency of 400 Hz. This lowpass value ensured that temporal-pitch cues associated with voicing were preserved. The resulting envelopes were used to modulate white noises via sample-by-sample multiplication, and the modulated noises were then filtered to restrict them to the spectral band of origin. The 12 or 20 bands comprising a condition were then summed to form the vocoded vowel. A 10 ms cosine rise∕fall gate was finally applied to each vowel in each condition.
Table 1. Cutoff frequencies (Hz) of the vocoder filter bands (Dorman et al., 1998).
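For readers who wish to reproduce the processing, the following Python sketch (using NumPy and SciPy, which are assumptions; the original stimuli were generated in MATLAB) implements the chain described above: analysis filtering, envelope extraction by half-wave rectification and lowpass smoothing, noise modulation, and resynthesis filtering. The band edges shown are illustrative log-spaced values, not the exact Table 1 cutoffs.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def noise_vocoder(x, fs, edges, fc_env=400.0):
    """Noise-band vocoder sketch following the processing described above.

    x      : input vowel (1-D float array)
    fs     : sampling rate (Hz)
    edges  : band-edge frequencies (Hz); len(edges) - 1 bands
    fc_env : envelope smoothing cutoff (400 Hz here; 50 Hz in Expt. 2)
    """
    # Eighth order Butterworth lowpass for envelope smoothing
    sos_env = butter(8, fc_env, btype='low', fs=fs, output='sos')
    rng = np.random.default_rng(0)
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # butter(4, ..., 'band') yields an eighth order bandpass
        sos_band = butter(4, [lo, hi], btype='band', fs=fs, output='sos')
        band = sosfilt(sos_band, x)
        # Envelope: half-wave rectification + lowpass smoothing
        env = sosfilt(sos_env, np.maximum(band, 0.0))
        # Modulate white noise sample by sample, then refilter the
        # product into the band of origin (resynthesis filtering)
        carrier = rng.standard_normal(len(x))
        out += sosfilt(sos_band, env * carrier)
    return out

# Illustrative 12-band log-spaced edges (assumed, not the Table 1 values)
edges_q12 = np.geomspace(300.0, 5500.0, 13)
```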
The vowels were then concatenated to form sequences. Figure 1 describes the arrangement of vowels into sequences and the construction of the various conditions. Each sequence contained one presentation of each vowel. Sequences containing all possible cyclic arrangements of the six repeating vowels ([n−1]!=120) were first generated, and the 60 arrangements having the smallest differences in formant structure were then selected for inclusion [Fig. 1A]. The selection of arrangements having the smallest perceptual formant differences (see footnote 1) was performed to reduce the influence of streaming based on differences in formant structure between successive vowels in a sequence (Gaudrain et al., 2007). These 60 arrangements were then divided into five groups of 12 arrangements each, such that the average perceptual distance of each group was approximately equal [Fig. 1B].
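The footnoted distance metric (see footnote 1) can be illustrated with a short sketch. In the Python code below, the Bark conversion follows Zwicker's standard formula; the F’2 weighting of formants 2–4 and the weight on the F’2 dimension are simplified placeholders rather than the exact values of de Boer (2000).

```python
import numpy as np

def hz_to_bark(f):
    # Zwicker-style Hz-to-Bark conversion
    return 13.0 * np.arctan(7.6e-4 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def effective_f2(f2, f3, f4, w3=0.3, w4=0.1):
    # Placeholder weighting of formants 2-4; de Boer's piecewise
    # weights are not reproduced here (these values are assumptions)
    return (f2 + w3 * f3 + w4 * f4) / (1.0 + w3 + w4)

def vowel_distance(v1, v2, lam=0.3):
    # v = (F1, F2, F3, F4) in Hz; lam weights the F'2 dimension
    # (an assumed value in the spirit of vowel-system modeling)
    d1 = hz_to_bark(v1[0]) - hz_to_bark(v2[0])
    d2 = hz_to_bark(effective_f2(*v1[1:])) - hz_to_bark(effective_f2(*v2[1:]))
    return np.sqrt(d1 ** 2 + (lam * d2) ** 2)

def sequence_distance(seq):
    # Sum of distances between consecutive vowels, wrapping around
    # on the assumption that the repeating sequences are cyclic
    return sum(vowel_distance(seq[i], seq[(i + 1) % len(seq)])
               for i in range(len(seq)))
```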
The F0 of the vowels in a sequence alternated between two values, F0(1) and F0(2). In condition LowRef, F0(1) was 100 Hz and F0(2) was one of the five F0 values (100, 110, 132, 162, or 240 Hz). In condition HiRef, F0(1) was 240 Hz and F0(2) was one of the same five values. Thus, there were five F0 separations in each condition. Each group of 12 arrangements was then assigned to one of the five F0 separations. The appearance of the same 60 arrangements in both the LowRef and HiRef conditions yielded 120 sequences [Fig. 1C]. These 120 sequences appeared in both a Slow and a Fast condition, yielding 240 sequences [Fig. 1D]. Finally, each of these 240 sequences appeared in each of the three Q conditions, yielding a total of 720 sequences [Fig. 1E].
In the Slow condition, the presentation rate was 1.2 vowel∕s, and in the Fast condition, it was 6 vowel∕s. Slow sequences were used to check vowel identification performance, and Fast sequences were used to examine streaming. To create the Slow sequences, silence was added between the vowels so that vowel duration remained constant across rate conditions. The Slow sequences were repeated four times and the Fast sequences were repeated 20 times, for overall stimulus durations of 20 s.
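The timing arithmetic can be verified directly: at 1.2 vowel∕s the onset-to-onset interval is 833 ms, so each 167 ms vowel was followed by 667 ms of silence and one six-vowel cycle lasted 5 s (four repetitions giving 20 s); at 6 vowel∕s the vowels abutted, one cycle lasted 1 s, and 20 repetitions again gave 20 s. A minimal sketch of the sequence assembly follows (the helper function and its arguments are hypothetical; vowels are assumed to be stored as NumPy arrays at a common sampling rate).

```python
import numpy as np

def build_sequence(vowels, fs, rate, n_repeats):
    """Concatenate vowels at the given presentation rate (vowels/s).

    Silence is appended after each vowel so that vowel duration stays
    constant across rate conditions; at 6 vowel/s the 167 ms vowels
    abut and essentially no silence is added.
    """
    period = int(round(fs / rate))        # onset-to-onset interval, samples
    items = []
    for v in vowels:
        gap = max(period - len(v), 0)     # trailing silence, samples
        items.append(np.concatenate([v, np.zeros(gap)]))
    return np.tile(np.concatenate(items), n_repeats)

# Slow: rate=1.2, n_repeats=4; Fast: rate=6, n_repeats=20 (20 s each)
```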
Stimuli were generated at 16 bits and 44.1 kHz using MATLAB. They were presented diotically at 85 dB sound pressure level through the Digigram VxPocket 440 soundcard and Sennheiser HD250 Linear II headphones, as measured in an artificial ear (Larson Davis AEC101 and 824; American National Standards Institute, 1995).
Procedure
a. Training and selection. Two training tasks preceded testing. The first involved simple identification of single vowels. Subjects heard blocks containing each vowel at each F0 twice. They responded using a mouse and a computer screen, and visual feedback was provided after each response. This test was repeated, separately for each Q condition, until a score of 98% (59∕60) was obtained. On average, proficiency was reached after one block for the Q∞ vowels, 1.3 blocks for the Q20 vowels, and 2.8 blocks for the Q12 vowels.
The second training task involved vowel identification using the Slow sequences. In each block, 60 sequences were presented, representing all 30 conditions (5 F0’s×2 LowRef∕HiRef×3 Q conditions). The procedure was the same as the test procedure, except that visual feedback was provided. The subject was presented with a repeating sequence. After an initial 5 s period, during which streaming was allowed to stabilize, the subject was asked to report the order of appearance of the constituent vowels. Subjects were allowed to start with any vowel. The response was entered using a computer graphic interface and a mouse. The next sequence was presented after the subject confirmed the response or after a maximum of 20 s. Visual feedback was then provided. To proceed to the test, subjects were required to obtain a score, averaged over two consecutive blocks, greater than 95% in each Q condition. On average, 6.3 blocks were necessary to reach the proficiency criterion. Although this requirement was intended as a selection criterion, no subject was eliminated at this step.
b. Streaming test. The procedure was the same as that in the second training task, except that no feedback was provided. The 720 sequences were distributed among six presentation blocks, such that each condition was represented equally in each block. The average duration of one block was approximately 28 min. Experiment 1 required subjects to participate in four 2 h sessions, during which frequent breaks were provided. The experimental procedure was formally approved by a local ethics committee (CCPPRB Léon Bérard).
Results
For each condition, the score is the percentage of responses in which the six vowels comprising a sequence were reported in the correct order. Mean scores across subjects are plotted as a function of F0(2) in Fig. 2. Chance performance is 0.8%: because responses could begin with any vowel, there are [n−1]!=120 distinct orders, and 1∕120≈0.8%. As in Gaudrain et al. (2007), high scores can be interpreted as a tendency toward integration across F0(1) and F0(2) items and a resistance to obligatory streaming. Separate analyses were conducted on the LowRef and HiRef conditions because the F0 differences were not the same in the two conditions. The results in the Slow condition (1.2 vowel∕s) showed that identification was near perfect in all conditions except one (HiRef, Q12, F0(2)=162 Hz). An analysis of errors in this condition showed confusions between ∕y∕ and ∕e∕ in eight of the nine incorrect responses. These two vowels therefore seem difficult to discriminate at this particular combination of F0’s and vocoder channel divisions. All subsequent analyses were carried out on the data collected in the Fast conditions.
A two-way analysis of variance (ANOVA) on the Fast∕LowRef data using Q condition and F0(2) as repeated parameters indicated that the effects of Q condition [F(2,10)=4.14, p<0.05] and F0(2) [F(4,20)=12.21, p<0.001] were significant, and interacted significantly [F(8,40)=3.93, p<0.01]. Separate one-way ANOVAs on each Q condition using F0(2) as a repeated factor showed a significant effect of F0(2) in the Q∞ condition [F(4,20)=9.28, p<0.001], but not in the Q12 [F(4,20)=0.65, p=0.63] or Q20 conditions [F(4,20)=0.21, p=0.93].
A two-way ANOVA on the Fast∕HiRef data using Q condition and F0(2) as repeated parameters indicated that Q condition [F(2,10)=9.02, p<0.01] and F0(2) [F(4,20)=14.30, p<0.001] were significant, and interacted significantly [F(8,40)=6.74, p<0.001]. Separate one-way ANOVAs on each Q condition using F0(2) as a repeated factor showed significant effects of F0(2) in the Q∞ condition [F(4,20)=10.15, p<0.001] and in the Q12 condition [F(4,20)=14.96, p<0.001], but not in the Q20 condition [F(4,20)=1.50, p=0.24]. A post hoc analysis using pairwise t-tests showed that the effect of F0(2) in the Q12 condition was due solely to the point F0(2)=162 Hz. As previously stated, the confusions in this particular condition suggest difficulty with this particular set of parameters. When the HiRef condition was analyzed with this condition excluded, the pattern of significance was identical to that observed in the LowRef conditions: a significant effect of F0(2) in the Q∞ condition [F(3,15)=12.24, p<0.001], but not in the Q12 [F(3,15)=0.74, p=0.54] or Q20 conditions [F(3,15)=1.13, p=0.37].
Discussion
The results in the natural speech condition (Q∞) are consistent with those observed by Gaudrain et al. (2007) in their first experiment. The greater the F0 difference, the lower the scores, signifying greater streaming. Streaming based on F0 difference is considered to be obligatory here because the task employed required that streaming be suppressed in order to perform accurately. Although the pattern of results in the current experiment is similar to that obtained by Gaudrain et al. (2007), the baseline level of performance differs. Scores in the Q∞ condition at matched F0 were over 80% here and approximately 50% in experiment 1 of Gaudrain et al. (2007). One possible reason is that participants in the current experiment were well trained. In addition, there were subtle differences in the stimuli used in the two studies. Gaudrain et al. (2007) attributed low scores in the matched F0 condition to formant-based streaming. Such a phenomenon has been reported by Dorman et al. (1975) with synthesized vowels. Formant-based streaming might be reduced with the recorded vowels used in the current experiment, where small F0 fluctuations were preserved. Fundamental frequency fluctuations might serve to strengthen the grouping of components comprising individual vowels and limit the grouping of formants across successive vowels, as suggested by Gestalt theory (Bregman, 1990).
In the conditions having spectral degradation (Q20 and Q12), the scores were high and did not depend on F0. Thus, when spectral cues to pitch were reduced in accord with a CI model, F0-based streaming was reduced or eliminated. Further, these results indicate that the temporal cues to pitch that remained in the vocoded stimuli were not strong enough, in this case, to elicit obligatory streaming. Notably, these results cannot be explained by a loss of intelligibility, since degrading the stimuli yielded an increase in performance. In addition, vowel identification was confirmed in the Slow condition.
The main finding of this experiment was that no F0-based streaming appeared when spectral-pitch cues were degraded using a model of a 12- or a 20-channel CI. This result suggests that obligatory streaming is reduced when spectral cues to pitch are reduced in this manner and may not be possible for vowel stimuli based on the remaining temporal-pitch cues. This observation is in apparent contrast with studies that observed some streaming in CI recipients (Chatterjee et al., 2006; Hong and Turner, 2006; Cooper and Roberts, 2007). However, these previous observations were generally based on conditions in which some place pitch existed. The current result is also in apparent contrast with the observation of streaming based on temporal pitch in NH listeners (Grimault et al., 2002). One explanation for this discrepancy is that temporal-pitch cues were not sufficiently salient in the noise-band vocoder. This point is addressed in the General Discussion. It is also potentially important that obligatory streaming is strongly influenced by presentation rate (van Noorden, 1975), and that Hong and Turner (2006), Chatterjee et al. (2006), and Grimault et al. (2002) all used sequences with higher presentation rates (10 stimuli∕s) to observe streaming. It then seems plausible that, for an F0 difference of about one octave, the natural presentation rate used in experiment 1 was not sufficiently high to elicit obligatory streaming under degraded conditions, but that streaming may still be possible. The next experiment assesses this hypothesis.
EXPERIMENT 2
As shown by van Noorden (1975), the temporal coherence boundary, the threshold corresponding to obligatory streaming, depends on presentation rate. As shown in experiment 1 of Gaudrain et al. (2007), higher presentation rates in the current paradigm do indeed lead to stronger measures of streaming. Thus, increasing the presentation rate should strengthen the streaming effect and reveal whether segregation is possible under the current conditions of severely reduced spectral cues but preserved temporal cues to pitch. In addition, two envelope cutoff values were employed to more closely examine the role of temporal cues.
Materials and method
Subjects
Nine subjects aged 18–27 years (mean 21.9) participated. All were native speakers of French and had NH as defined in experiment 1. None of these subjects participated in previous similar experiments, and all were paid an hourly wage for participation.
Stimuli
The same six recorded vowels employed in experiment 1 were used. The durations of the vowels were reduced to 133 ms using STRAIGHT. Again, 10 ms ramps were employed. The average F0’s of each vowel were set to 100, 155, and 240 Hz using the same method used in experiment 1. The intact vowels were used for a Q∞ condition. The same noise-band vocoder used in experiment 1 was again used to process vowels for Q20 and Q12 conditions. However, unlike experiment 1, two cutoff frequencies (fc) were used for envelope extraction. A value of fc=400 Hz was employed to preserve temporal-pitch cues, and a value of fc=50 Hz was employed to eliminate temporal-pitch cues. As in experiment 1, envelope extraction involved half-wave rectification and eighth order Butterworth lowpass filtering.
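The effect of the two fc values on the voicing modulation can be checked from the magnitude response of the smoothing filter. In the sketch below (assuming the same eighth order Butterworth design), a 100 Hz envelope modulation, the lowest F0 used, passes essentially unattenuated through the 400 Hz filter but is attenuated by roughly 48 dB by the 50 Hz filter (eight poles over the octave from 50 to 100 Hz).

```python
import numpy as np
from scipy.signal import butter, sosfreqz

fs = 44100.0
for fc in (400.0, 50.0):
    sos = butter(8, fc, btype='low', fs=fs, output='sos')
    # Evaluate the response at 100 Hz, the lowest F0 in the experiment
    _, h = sosfreqz(sos, worN=[100.0], fs=fs)
    gain_db = 20.0 * np.log10(np.abs(h[0]))
    print(f"fc = {fc:5.0f} Hz: gain at 100 Hz = {gain_db:6.1f} dB")
# Expected: ~0 dB for fc = 400 Hz, about -48 dB for fc = 50 Hz
```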
The processed vowels were then concatenated to form sequences in the five processing conditions (Q∞, Q20fc=400 Hz, Q20fc=50 Hz, Q12fc=400 Hz, and Q12fc=50 Hz). The 36 arrangements having the lowest perceptual distance were selected and divided into three groups having approximately equal mean perceptual distance values. As in experiment 1, these three groups were used for the three F0 separation conditions. Thus, as in experiment 1, the particular arrangements of vowels were distributed across the F0 separation conditions, but repeated across the other conditions to ensure that effects associated with the particular order of items were constant.
In this experiment, only the LowRef condition was used, so that F0(1) was always 100 Hz; this avoided the low identification scores observed for F0(2)=162 Hz in the HiRef condition of experiment 1. As in the first experiment, Slow (1.2 vowel∕s) and Fast (7.5 vowel∕s) sequences were employed. Slow sequences were repeated 4 times and Fast sequences were repeated 25 times, so that the overall duration in both conditions was 20 s. Stimuli were generated with MATLAB as 16 bit, 44.1 kHz sound files and were presented as in experiment 1.
Procedure
a. Training and selection. Training again began with simple identification of single vowels. Five blocks of 72 vowels were presented. Each block contained four repetitions of each vowel at each F0(2) in a single degradation condition, in random order (4 repetitions×6 vowels×3 F0’s=72 items). The blocks were presented from the least degraded (Q∞) to the most degraded (Q12fc=50 Hz). Visual feedback was provided. Each block was repeated until a score of 96% (69∕72) was obtained or a maximum of three repetitions was reached. On average the blocks were repeated 2.0 times for each condition (range 1.7–2.3).
As in experiment 1, training involving the Slow sequences followed. This training consisted of up to seven blocks. Each block was composed of 36 sequences, with all conditions represented at least twice in random order. Subjects were required to score greater than 95% correct over three successive blocks in each condition to advance to the next stage. One subject was unable to reach the criterion and was dismissed. For seven of the remaining participants, five blocks were sufficient to reach the criterion. The last subject reached the criterion after seven blocks.
b. Streaming test. The test consisted of 5 blocks of 72 sequences each. All conditions (3 F0(2)’s, 5 Q’s, and 2 Slow∕Fast) were represented as equally as possible in each block. For each F0(2), streaming was measured over 12 different arrangements of vowels. Other aspects of the experiment, including the initial 5 s response lockout and the manner of response, were identical to those of experiment 1. Experiment 2 required subjects to participate in three to four 2 h sessions, during which frequent breaks were provided. The experimental procedure was formally approved by a local ethics committee (CPP Sud Est II).
Results
Results averaged across subjects are plotted in Fig. 3. As can be seen, scores were uniformly high in the Slow conditions (mean: 98.6% correct), reflecting accurate identification of the constituent items. The subsequent analyses were conducted on the Fast conditions. A two-way ANOVA, using processing condition (Q∞, Q20fc=400 Hz, Q20fc=50 Hz, Q12fc=400 Hz, and Q12fc=50 Hz) and F0 as repeated parameters, showed a significant effect of processing condition [F(4,28)=23.57, p<0.001] as well as of F0 [F(2,14)=8.25, p<0.01]. These factors did not interact [F(8,56)=1.31, p=0.26], indicating that the effect of F0 was not significantly different across processing conditions. To test for the effect of fc, separate two-way ANOVAs (using F0 and fc as repeated parameters) were performed in each of the degraded conditions. The analyses revealed no effect of fc in the Q12 [F(1,7)=0.9, p=0.37] or Q20 conditions [F(1,7)=3.86, p=0.09]. The effect of F0 was significant in the Q12 condition [F(2,14)=7.11, p<0.01], but not in the Q20 condition [F(2,14)=0.47, p=0.63]. The interaction was not significant in the Q12 condition [F(2,14)=0.78, p=0.48], but was significant in the Q20 condition [F(2,14)=6.37, p<0.01], as suggested by the pattern of results in Fig. 3. An analysis of this interaction using pairwise t-tests revealed no significant difference between fc values, even at F0(2)=100 Hz [p=0.30].
Discussion
The scores in the matched F0 conditions were lower in the current experiment than in experiment 1, indicating that the subjects performed more poorly in this second experiment. This result can likely be attributed to at least two sources. The first is the use of naive listeners in the current experiment and trained listeners in experiment 1. The second is that increasing the presentation rate enhances segregation based on formant structure. Gaudrain et al. (2007) argued that vowels having matched F0 can elicit streaming based on formant structure, as found by Dorman et al. (1975). The higher scores in the Q20 and Q12 conditions support this hypothesis, suggesting, as in Gaudrain et al. (2007), that formant-based streaming is hindered by loss of frequency resolution. Although scores were reduced in the F0(2)=100 Hz conditions, a significant main effect of F0(2) was nevertheless observed across all conditions. Although this effect was not observed at Q20, it was observed in the isolated Q12 condition, indicating that some F0-based streaming occurred in this degraded condition.
The presence (fc=400 Hz) or absence (fc=50 Hz) of temporal-pitch cues did not influence performance in either degraded condition. The use of an eighth order smoothing filter during envelope extraction ensured that modulation frequencies above these temporal cutoffs were not present at meaningful levels (Healy and Steinbach, 2007). This result suggests that streaming was not based on temporal-pitch cues. Thus, other explanations for the F0-based streaming must be sought.
One obvious F0-related cue that can be considered is the modulation product. Because of basilar membrane nonlinearity, the periodic modulation in the channels of the vocoder could evoke a component at this modulation frequency and at harmonic multiples. If the envelope modulation periodicity represents the F0, the modulation product can recreate the original first harmonic. However, when fc=50 Hz, pitch cues associated with F0 are removed from the envelope. If amplitude modulation had elicited streaming, a difference should have been observed between the fc conditions. Thus, although modulation products may be evoked with the present stimuli, they are likely not responsible for the observed pitch-based streaming.
It must then be considered that some remaining spectral cues could be related to the F0. One such cue involves the first harmonic. The first harmonic can fall either in the first band of the vocoder or below it. It could then possibly provide an F0-related cue because it can influence the level of the first channel. To test this hypothesis, the level in the first channel was measured in the different conditions. A three-way ANOVA on this measure, using F0, Q, and fc as repeated parameters across the six vowels, showed no significant effect of F0 [F(2,10)=0.03, p=0.97] or fc [F(1,5)=0.84, p=0.40], although the effect of Q was significant [F(1,5)=449.3, p<0.001]. Thus, the first channel does not appear to contain a consistent F0-related cue. However, the distribution of harmonics across all channels could have an effect. A Fisher discriminant analysis (FDA) was used to find the linear combination of channel levels that best represents the F0 in the Q12 condition (using Python MDP; Berkes and Zito, 2007). As shown in Fig. 4, for both fc=400 and 50 Hz, a linear combination of channel levels can be found that represents the F0, and this combination can be used as an F0-related metric. In Fig. 4, scores are plotted against this metric to show its relation to segregation. In conclusion, it is possible that some spectral cues related to the F0, but not capable of generating a strong pitch sensation, could persist even in the absence of harmonics.
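A comparable analysis can be sketched with standard tools. The code below uses scikit-learn's LinearDiscriminantAnalysis in place of the MDP toolkit cited above, and the channel-level matrix is a random stand-in for the levels that would be measured from the 18 processed stimuli; only the structure of the analysis is intended to match.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: per-channel levels of the 18 Q12 stimuli
# (6 vowels x 3 F0's) in the 12 vocoder channels; real values would
# be measured from the processed vowels.
rng = np.random.default_rng(0)
levels = rng.normal(size=(18, 12))
f0_class = np.repeat([100, 155, 240], 6)      # F0 label of each stimulus

lda = LinearDiscriminantAnalysis(n_components=1)
metric = lda.fit_transform(levels, f0_class)  # 1-D F0-related metric

# lda.scalings_[:, 0] holds the channel weights of the discriminant;
# projecting any stimulus onto it yields the F0-related spectral metric.
print(lda.scalings_[:, 0])
```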
Although the evidence for remaining F0-related spectral cues in the Q12 condition is relatively clear, it remains unclear whether these cues would be present in a real CI and for sounds other than those employed here. It is therefore important to determine the origin of these cues. The vocoder discards the harmonic structure of the vowels. Hence, if a spectral cue is preserved by the vocoder, it must be encoded by the spectral envelope, which is partially preserved. To examine this, the spectral envelopes of the unprocessed vowels (Q∞) were extracted using STRAIGHT for the three different F0’s. These envelopes showed slight but consistent differences in the formant regions that are probably attributable to the definition of the formants in relation to the harmonic density: higher F0’s, which sample the spectral envelope more sparsely, tended to produce broader formants. This small broadening effect seems to have been emphasized by the spectral quantization in the 12-channel vocoder.
It would have been interesting to determine whether this cue existed in the Q20 condition. However, it was not possible to analyze the Q20 condition because FDA requires the number of items (6 vowels×3 F0’s=18) to be greater than the number of parameters (20 vocoder bands). It is possible that the greater number of channels reduced the quantization effect, and did not emphasize this F0-related cue as much as in the 12-channel vocoder.
This analysis of the stimuli suggests that the F0-related spectral cues are due to the modification of formant definition associated with changes in F0. Although this phenomenon may occur when perceiving natural vowels uttered at different F0’s, the frequency quantization of the 12-channel noise-band vocoder simply emphasized the effect, as well as making it relatively stochastic across vowels. It is worth noting that, in the current experiment, the formant positions were held constant while changing F0. In real speech, the formant position and width are related to speaker size, which also drives the nominal F0 of the speaker. This F0-related cue may then appear along with some other cues associated with vocal tract length (VTL), which can work to further support streaming. Although streaming in NH listeners seems to be less influenced by changes in VTL than in F0 (Tsuzaki et al., 2007), this kind of cue could be preserved and used in CI listening.
GENERAL DISCUSSION
The main result of this study is that obligatory streaming was reduced, but still present, when spectral cues were reduced in accord with a CI model, and that this streaming was not attributable to temporal-pitch cues. This is consistent with Gaudrain et al. (2007), who observed that impoverishing the spectral representation of vowels via simulated broadened auditory tuning hindered pitch-based streaming. In experiment 1, reducing the spectral cues while preserving most of the temporal cues sufficiently reduced the salience of the difference between the two potential streams to eliminate obligatory streaming. In the second experiment, promoting streaming through a higher presentation rate did not lead to the observation of streaming based on temporal cues, but instead led to the observation of streaming based on a stochastic spectral cue.
The current results are also consistent with previous observations of concurrent speech perception in NH listeners hearing CI simulations. NH listeners are able to take advantage of a pitch difference between a target and a masker to enhance speech recognition in noise (Brokx and Nooteboom, 1982). However, NH listeners are less able to take advantage of pitch differences between speech target and masker when the stimuli are processed with a noise-band vocoder (Stickney et al., 2004). Similarly, Deeks and Carlyon (2004) reported that pulse rate differences did not enhance the reception of concurrent speech in a CI simulation based on unresolved complexes. The absence of pitch-based streaming observed in experiment 1 suggests that impairment of stream segregation could be partially responsible for low speech perception performance in noise under CI simulation.
However, this result differs from those previously obtained in CI recipients using simple stimuli to test place-pitch or rate-pitch-based streaming (Hong and Turner, 2006; Chatterjee et al., 2006). One explanation for this difference is that the complexity of spectral and temporal cues associated with even simple speech items such as vowels diminishes the availability of F0 cues following CI processing. Because of the characteristics of speech, F0 could not be reliably converted into place pitch or rate pitch, as in the previous experiments. Another possible reason is that the duration of the stimuli employed here was different from that used in previous studies. Hong and Turner (2006) and Chatterjee et al. (2006) used 60 and 50 ms stimuli, respectively, while the briefest vowels in the current study were 133 ms. A slower presentation rate can prevent obligatory streaming. However, to evaluate whether obligatory streaming is involved in concurrent speech segregation in ecological situations, it is important to examine natural speech rates. Because streaming was not observed at speech rates matching those of natural speech in experiment 1 of the current study, it can be concluded that CI users may have difficulty taking advantage of streaming cues in cocktail party situations.
The current CI model employed numbers of channels similar to those found in modern CIs and an overall bandwidth somewhat smaller than that often employed clinically, but more appropriate for vowel reception. Because of this reduced overall bandwidth, and because the number of auditory channels available to a CI user can be somewhat lower than the number of physical electrodes (with the exception of successful “current steering” programming), the spectral detail provided by the current CI model may even exceed that available in modern CIs. However, the comparison with CI user performance remains difficult because the noise-band vocoder simulation does not exactly mimic the perception of sounds by CI users. There is a long list of patient variables associated with CI users that are absent from consideration when using simulations. Also, CI recipients have generally experienced their CI for months prior to experimentation, whereas the NH participants in the current experiments were trained on noise-band vocoders for only a few hours. Finally, a main difference in the stimulation itself is that the output of the vocoder is acoustic and therefore subjected to peripheral processing by the ear, while the CI involves direct electric stimulation.
Another effect that potentially reduces the strength of temporal pitch in the vocoder was suggested by Hanna (1992). In the noise-band vocoder, the analysis filter bank and the resynthesis filter bank are typically the same. Thus, the noise carrier, after modulation with the temporal envelope, is filtered again to suppress the sidebands, i.e., the modulation products that fall outside the band. Modulation depth is reduced by this resynthesis filtering for the narrowest bands, and temporal-pitch cues are thereby weakened. To fully preserve the voicing modulation, the bandwidth must be greater than twice the F0. By this metric, temporal-pitch cues were intact only in vocoder channels above No. 6 in the Q12 condition and above No. 14 in the Q20 condition. In addition, as described by Hanna (1992), the peripheral filters of the normal ear play a role similar to that of the resynthesis filters of the vocoder, further reducing the modulation depth. Thus, although subjects were probably able to perceive temporal-pitch cues, their depth was reduced in the lower bands of the vocoder. In CI processors, neither resynthesis nor peripheral filtering occurs, and temporal-pitch cues are not degraded in this way. Indeed, Laneau et al. (2006) observed that CI users had better F0 discrimination abilities than NH listeners hearing noise-band vocoders.
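This bandwidth criterion can be applied channel by channel, as in the following sketch (the log-spaced band edges are illustrative assumptions standing in for the Table 1 cutoffs):

```python
import numpy as np

def channels_preserving_modulation(edges, f0):
    """Return 1-based indices of channels whose bandwidth exceeds
    2 * F0, i.e., channels wide enough to retain both modulation
    sidebands after resynthesis filtering (Hanna, 1992)."""
    widths = np.diff(edges)
    return [i + 1 for i, bw in enumerate(widths) if bw > 2.0 * f0]

# Illustrative log-spaced edges (assumed, not the Table 1 values)
edges_q12 = np.geomspace(300.0, 5500.0, 13)
print(channels_preserving_modulation(edges_q12, f0=100.0))
# With these assumed edges the criterion is met only from the middle
# channels upward; with the actual Table 1 cutoffs, it held above
# channel No. 6 in Q12 and above No. 14 in Q20.
```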
Although it was found that temporal-pitch cues were not sufficient to produce obligatory streaming of vowel sequences, it cannot be entirely ruled out that pitch-based obligatory streaming can occur in CI users. In particular, it was found that an F0-related spectral cue may have induced obligatory streaming. This spectral cue is potentially related to formant definition and was therefore not present in the simpler stimuli used previously. The current findings thus suggest that these cues could potentially be available to CI listeners. Many attempts have been made in the past few years to provide a better encoding of pitch for CI recipients. Increasing the number of bands in the low frequencies better captures the first harmonic and can improve the perception of pitch (Carroll and Zeng, 2007). Unfortunately, with a fixed number of channels, increasing the number of bands in the low-frequency region leads to a decrease in the number of bands in the higher regions, which appears to be detrimental for speech intelligibility. Hence, there seems to be a trade-off between pitch perception and speech intelligibility. Moreover, despite the increasing number of channels in CIs (up to 22), it seems that most CI recipients do not show a speech reception benefit from more than seven bands (Friesen et al., 2001). These results suggest that pitch cues should be enhanced within existing bands to avoid degrading the spectral cues required for speech perception. Instead of increasing spectral-pitch cues, Green et al. (2005) enhanced temporal-pitch cues by adding to standard processing 100% AM at the F0 frequency in all channels, as illustrated in the sketch below. This manipulation improved the perception of prosody. However, again, the modified processing had a detrimental effect on vowel recognition. These two strategies for enhancing pitch perception in CI users do not account for the F0-related spectral cues found in the current study. Further investigation is therefore required to evaluate the ability of CI users to take advantage of these cues for the segregation of speech in ecological situations.
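As a rough illustration of the kind of manipulation Green et al. (2005) describe, the sketch below imposes 100% sinusoidal AM at the F0 rate on a channel envelope; this is a simplified sketch and not the actual processing strategy of Green et al.

```python
import numpy as np

def enhance_f0_modulation(env, fs, f0):
    """Impose 100% AM at the F0 rate on a channel envelope.

    The raised-cosine modulator swings between 0 and 1, so the
    returned envelope is fully (100%) modulated at F0.
    """
    t = np.arange(len(env)) / fs
    modulator = 0.5 * (1.0 - np.cos(2.0 * np.pi * f0 * t))
    return env * modulator
```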
CONCLUSIONS
(1) Temporal-pitch cues available from noise-band vocoder simulations of a CI are not sufficient to induce obligatory streaming of speech materials at realistic speech rates.
(2) In contrast, the quantization of the spectrum in the vocoder enhances an F0-related spectral cue that is capable of inducing streaming. This cue might be available to CI users.
(3) The use of temporal periodicity cues to induce obligatory streaming in CI users is unlikely, unless these cues are stronger in actual CIs than in CI simulations.
ACKNOWLEDGMENTS
The authors wish to thank Andrew J. Oxenham and Christophe Micheyl for their helpful comments on this study. The authors would also like to thank two anonymous reviewers for helpful comments on a previous version of this manuscript. This work was supported in part by NIH Grant No. DC08594, by a doctoral grant from the Région Rhône-Alpes (France), and by Grant No. JC6007 from the Ministère de l’Enseignement Supérieur et de la Recherche (France). Manuscript preparation was supported in part by UK Medical Research Council Grant No. G9900369.
Portions of this work were presented in the poster “Segregation of vowel sequences having spectral cues reduced using a noise-band vocoder” at the 151st ASA meeting, Providence, RI, June 2006.
Footnotes
Perceptual distance between vowels was calculated as the weighted Euclidean distance (in barks) in the F1–F’2 space, where F1 is the frequency of the first formant and F’2 is the frequency of the effective second formant, defined as the weighted sum of the second to fourth formants (de Boer, 2000). Perceptual distance for a given sequence was then the sum of the perceptual distances between consecutive vowels.
References
- American National Standards Institute (1995). ANSI S3.7-R2003: Methods for Coupler Calibration of Earphones (American National Standards Institute, New York).
- American National Standards Institute (2004). ANSI S3.21-2004: Methods for Manual Pure-Tone Threshold Audiometry (American National Standards Institute, New York).
- Baer, T., and Moore, B. C. J. (1993). “Effects of spectral smearing on the intelligibility of sentences in noise,” J. Acoust. Soc. Am. 94, 1229–1241.
- Beauvois, M. W., and Meddis, R. (1996). “Computer simulation of auditory stream segregation in alternating-tone sequences,” J. Acoust. Soc. Am. 99, 2270–2280.
- Berkes, P., and Zito, T. (2007). “Modular toolkit for data processing, version 2.1,” http://mdp-toolkit.sourceforge.net
- Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound (MIT, Cambridge, MA).
- Bregman, A. S. (1978). “Auditory streaming is cumulative,” J. Exp. Psychol. Hum. Percept. Perform. 4, 380–387.
- Bregman, A. S., and Campbell, J. (1971). “Primary auditory stream segregation and perception of order in rapid sequences of tones,” J. Exp. Psychol. 89, 244–249.
- de Boer, B. (2000). “Self-organization in vowel systems,” J. Phonetics 28, 441–465.
- Brokx, J. P. L., and Nooteboom, S. G. (1982). “Intonation and the perceptual separation of simultaneous voices,” J. Phonetics 10, 23–36.
- Burns, E. M., and Viemeister, N. F. (1981). “Played-again SAM: Further observations on the pitch of amplitude-modulated noise,” J. Acoust. Soc. Am. 70, 1655–1660.
- Burns, E. M., and Viemeister, N. F. (1976). “Nonspectral pitch,” J. Acoust. Soc. Am. 60, 863–869.
- Carlyon, R. P., Long, C. J., Deeks, J. M., and McKay, C. M. (2007). “Concurrent sound segregation in electric and acoustic hearing,” J. Assoc. Res. Otolaryngol. 8, 119–133.
- Carroll, J., and Zeng, F. (2007). “Fundamental frequency discrimination and speech perception in noise in cochlear implant simulations,” Hear. Res. 231, 42–53.
- Chatterjee, M., Sarampalis, A., and Oba, S. I. (2006). “Auditory stream segregation with cochlear implants: A preliminary report,” Hear. Res. 222, 100–107.
- Cooper, H. R., and Roberts, B. (2007). “Auditory stream segregation of tone sequences in cochlear implant listeners,” Hear. Res. 225, 11–24.
- Deeks, J. M., and Carlyon, R. P. (2004). “Simulations of cochlear implant hearing using filtered harmonic complexes: Implications for concurrent sound segregation,” J. Acoust. Soc. Am. 115, 1736–1746.
- Dorman, M. F., Cutting, J. E., and Raphael, L. J. (1975). “Perception of temporal order in vowel sequences with and without formant transitions,” J. Exp. Psychol. Hum. Percept. Perform. 104, 147–153.
- Dorman, M. F., Loizou, P. C., and Rainey, D. (1997). “Speech intelligibility as a function of the number of channels of stimulation for signal processors using sine-wave and noise-band outputs,” J. Acoust. Soc. Am. 102, 2403–2410.
- Dorman, M. F., Loizou, P. C., Fitzke, J., and Tu, Z. (1998). “The recognition of sentences in noise by normal-hearing listeners using simulations of cochlear-implant signal processors with 6-20 channels,” J. Acoust. Soc. Am. 104, 3583–3585.
- Dudley, H. (1939). “The automatic synthesis of speech,” Proc. Natl. Acad. Sci. U.S.A. 25, 377–383.
- Elhilali, M., and Shamma, S. (2007). In Hearing: From Sensory Processing to Perception, edited by B. Kollmeier, G. Klump, V. Hohmann, U. Langemann, M. Mauermann, S. Uppenkamp, and J. Verhey (Springer-Verlag, Berlin).
- Friesen, L. M., Shannon, R. V., Baskent, D., and Wang, X. (2001). “Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants,” J. Acoust. Soc. Am. 110, 1150–1163.
- Gaudrain, E., Grimault, N., Healy, E. W., and Béra, J.-C. (2007). “Effect of spectral smearing on the perceptual segregation of vowel sequences,” Hear. Res. 231, 32–41.
- Geurts, L., and Wouters, J. (2001). “Coding of the fundamental frequency in continuous interleaved sampling processors for cochlear implants,” J. Acoust. Soc. Am. 109, 713–726.
- Green, T., Faulkner, A., Rosen, S., and Macherey, O. (2005). “Enhancement of temporal periodicity cues in cochlear implants: Effects on prosodic perception and vowel identification,” J. Acoust. Soc. Am. 118, 375–385.
- Grimault, N., Bacon, S. P., and Micheyl, C. (2002). “Auditory stream segregation on the basis of amplitude modulation rate,” J. Acoust. Soc. Am. 111, 1340–1348.
- Grimault, N., Micheyl, C., Carlyon, R. P., Arthaud, P., and Collet, L. (2001). “Perceptual auditory stream segregation of sequences of complex sounds in subjects with normal and impaired hearing,” Br. J. Audiol. 35, 173–182.
- Grimault, N., Micheyl, C., Carlyon, R. P., Arthaud, P., and Collet, L. (2000). “Influence of peripheral resolvability on the perceptual segregation of harmonic tones differing in fundamental frequency,” J. Acoust. Soc. Am. 108, 263–271.
- Grose, J. H., and Hall, J. W. (1996). “Perceptual organization of sequential stimuli in listeners with cochlear hearing loss,” J. Speech Hear. Res. 39, 1149–1158.
- Hanna, T. E. (1992). “Discrimination and identification of modulation rate using a noise carrier,” J. Acoust. Soc. Am. 91, 2122–2128.
- Hartmann, W. M., and Johnson, D. (1991). “Stream segregation and peripheral channeling,” Music Percept. 9, 115–184.
- Healy, E. W., and Bacon, S. P. (2002). “Across-frequency comparison of temporal speech information by listeners with normal and impaired hearing,” J. Speech Lang. Hear. Res. 45, 1262–1275.
- Healy, E. W., and Steinbach, H. M. (2007). “The effect of smoothing filter slope and spectral frequency on temporal speech information,” J. Acoust. Soc. Am. 121, 1177–1181.
- Hong, R. S., and Turner, C. W. (2006). “Pure-tone auditory stream segregation and speech perception in noise in cochlear implant recipients,” J. Acoust. Soc. Am. 120, 360–374.
- Kawahara, H., Masuda-Katsuse, I., and de Cheveigné, A. (1999). “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Commun. 27, 187–207.
- Laneau, J., Moonen, M., and Wouters, J. (2006). “Factors affecting the use of noise-band vocoders as acoustic models for pitch perception in cochlear implants,” J. Acoust. Soc. Am. 119, 491–506.
- Mackersie, C., Prida, T., and Stiles, D. (2001). “The role of sequential stream segregation and frequency selectivity in the perception of simultaneous sentences by listeners with sensorineural hearing loss,” J. Speech Lang. Hear. Res. 44, 19–28.
- McCabe, S. L., and Denham, M. J. (1997). “A model of auditory streaming,” J. Acoust. Soc. Am. 101, 1611.
- Moore, B. C. J. (1998). Cochlear Hearing Loss (Whurr, London).
- Moore, B. C. J., and Carlyon, R. P. (2005). In Pitch: Neural Coding and Perception, edited by C. J. Plack, A. J. Oxenham, R. R. Fay, and A. N. Popper (Springer, New York).
- Moore, B. C. J., and Gockel, H. (2002). “Factors influencing sequential stream segregation,” Acta Acust. Acust. 88, 320–332.
- Moore, B. C. J., and Peters, R. W. (1992). “Pitch discrimination and phase sensitivity in young and elderly subjects and its relationship to frequency selectivity,” J. Acoust. Soc. Am. 91, 2881–2893.
- Nooteboom, S. G., Brokx, J. P. L., and de Rooij, J. J. (1978). In Studies in the Perception of Language, edited by W. J. M. Levelt and G. B. Flores d’Arcais (Wiley, New York), pp. 75–107.
- Patel, A. D., Iversen, J. R., and Rosenberg, J. C. (2006). “Comparing the rhythm and melody of speech and music: The case of British English and French,” J. Acoust. Soc. Am. 119, 3034–3047.
- Qin, M. K., and Oxenham, A. J. (2005). “Effects of envelope-vocoder processing on F0 discrimination and concurrent-vowel identification,” Ear Hear. 26, 451–460.
- Roberts, B., Glasberg, B. R., and Moore, B. C. J. (2002). “Primitive stream segregation of tone sequences without differences in fundamental frequency or passband,” J. Acoust. Soc. Am. 112, 2074–2085.
- Rogers, C. F., Healy, E. W., and Montgomery, A. A. (2006). “Sensitivity to isolated and concurrent intensity and fundamental frequency increments by cochlear implant users under natural listening conditions,” J. Acoust. Soc. Am. 119, 2276–2287.
- Rose, M. M., and Moore, B. C. J. (2005). “The relationship between stream segregation and frequency discrimination in normally hearing and hearing-impaired subjects,” Hear. Res. 204, 16–28.
- Rose, M. M., and Moore, B. C. J. (1997). “Perceptual grouping of tone sequences by normally hearing and hearing-impaired listeners,” J. Acoust. Soc. Am. 102, 1768–1778.
- Shannon, R. V. (1983). “Multichannel electrical stimulation of the auditory nerve in man. I: Basic psychophysics,” Hear. Res. 11, 157–189.
- Shannon, R. V., Zeng, F., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science 270, 303–304.
- Stainsby, T. H., Moore, B. C. J., and Glasberg, B. R. (2004). “Auditory streaming based on temporal structure in hearing-impaired listeners,” Hear. Res. 192, 119–130.
- Stickney, G. S., Assmann, P. F., Chang, J., and Zeng, F. (2007). “Effects of cochlear implant processing and fundamental frequency on the intelligibility of competing sentences,” J. Acoust. Soc. Am. 122, 1069–1078.
- Stickney, G. S., Zeng, F., Litovsky, R., and Assmann, P. (2004). “Cochlear implant speech recognition with speech maskers,” J. Acoust. Soc. Am. 116, 1081–1091.
- Tong, Y. C., and Clark, G. M. (1985). “Absolute identification of electric pulse rates and electrode positions by cochlear implant patients,” J. Acoust. Soc. Am. 77, 1881–1888.
- Townshend, B., Cotter, N., Compernolle, D. V., and White, R. L. (1987). “Pitch perception by cochlear implant subjects,” J. Acoust. Soc. Am. 82, 106–115.
- Tsuzaki, M., Takeshima, C., Irino, T., and Patterson, R. D. (2007). In Hearing: From Sensory Processing to Perception, edited by B. Kollmeier, G. Klump, V. Hohmann, U. Langemann, M. Mauermann, S. Uppenkamp, and J. Verhey (Springer-Verlag, Berlin).
- van Noorden, L. P. A. S. (1975). “Temporal coherence in the perception of tone sequences,” Ph.D. thesis, Eindhoven University of Technology, The Netherlands.
- Vliegen, J., Moore, B. C. J., and Oxenham, A. J. (1999). “The role of spectral and periodicity cues in auditory stream segregation, measured using a temporal discrimination task,” J. Acoust. Soc. Am. 106, 938–945.
- Vliegen, J., and Oxenham, A. J. (1999). “Sequential stream segregation in the absence of spectral cues,” J. Acoust. Soc. Am. 105, 339–346.
- Zeng, F. G. (2002). “Temporal pitch in electric hearing,” Hear. Res. 174, 101–106.