The Journal of the Acoustical Society of America
2015 Dec 9; 138(6):EL504–EL508. doi: 10.1121/1.4936981

Auditory streaming of tones of uncertain frequency, level, and duration

An-Chieh Chang 1, Robert A Lutfi 1, Jungmee Lee 1
PMCID: PMC4676779  PMID: 26723358

Abstract

Stimulus uncertainty is known to critically affect auditory masking, but its influence on auditory streaming has been largely ignored. Standard ABA-ABA tone sequences were made increasingly uncertain by increasing the standard deviation σ of the normal distributions from which the frequency, level, or duration of the tones was randomly drawn. Consistent with predictions based on a model of masking by Lutfi, Gilbertson, Chang, and Stamas [J. Acoust. Soc. Am. 134, 2160–2170 (2013)], the frequency difference at which the A and B tones formed separate streams increased as a linear function of σ in tone frequency but was much less affected by σ in tone level or duration.

1. Introduction

As the frequency separation of the A and B tones in an ABA-ABA tone sequence increases, the tones are heard to split into separate auditory streams. The phenomenon of auditory streaming first drew the attention of researchers with the publication of the seminal work of van Noorden (1975) and has since become a pervasive influence in research on hearing, being implicated in the release from auditory masking (Kidd et al., 1994), informing work on computational auditory scene analysis (CASA; Wang and Brown, 2006), and being linked to our ability to “hear out” individual sound sources in natural, multisource acoustic environments (Bregman, 1990).

Given the widespread significance attached to the phenomenon, it is noteworthy that what is known about auditory streaming comes mostly from studies using a comparatively restricted set of stimuli. For the vast majority of streaming studies, the stimuli are sequences of tones of fixed frequency, amplitude, and duration, or tones forming simple repetitive patterns or familiar melodies (Moore and Gockel, 2012). In the broader context of studies wherein streaming is judged to play a role, the sounds are considerably more complex: random tonal patterns (Kidd et al., 2008), speech (Wang, 2006), and ambient environmental noise (Bregman, 1990), to cite a few examples. These latter sounds differ in significant ways from those of streaming studies, but perhaps the most important difference is that they vary unpredictably from one moment to the next. This is important because stimulus uncertainty has been identified as one of the most significant factors affecting performance in studies using these sounds (Kidd et al., 2008; Lutfi et al., 2013; Ellis, 1996). The effect of stimulus uncertainty on streaming thus has implications for how streaming is to be considered in relation to this previous work.

A review of the literature reveals that the vast majority of studies purporting to measure the effect of stimulus uncertainty on streaming (otherwise referred to as stimulus irregularity) are, in fact, masking studies (e.g., Kidd et al., 1994; Dau et al., 2009; Denham et al., 2010; Elhilali et al., 2010; Andreou et al., 2011). That is, the perception of separate streams in these studies is inferred from the listener's ability to detect some change in one stimulus (target) given interference provided by another (masker). The distinction is meaningful because streaming is neither a necessary nor sufficient condition for detection (cf. Lutfi and Liu, 2011). Only two studies, to the authors' knowledge, have directly investigated the effect of stimulus uncertainty on the perception of separate streams. Bendixen et al. (2010) presented listeners with ABA-ABA tone sequences for which the frequency and level of A and/or B tones formed a regular repeating pattern or varied at random. They found that randomizing either or both tone sequences decreased the proportion of time listeners reported the A and B sequences to be perceived as separate streams (also see Bendixen et al., 2013). Szalardy et al. (2014) constructed a pair of interleaved tone sequences using the notes of two different familiar melodies for each sequence. They similarly found that the proportion of time listeners reported a percept of streaming was less when the tones and/or their timing were scrambled than when the tones faithfully followed the melodies.

The results of these two studies are clearly in line with expectations based on the link made in the literature between streaming and masking. Nonetheless, the results only weakly support this link, as neither study was designed, nor intended, to permit a direct comparison of the effects of stimulus uncertainty on streaming with the effects of stimulus uncertainty as typically measured in masking studies. The goal of the present study, then, was to provide such a comparison. The approach proceeded in two stages. First, uncertainty was introduced in different tone parameters in a manner consistent with past studies of masking. Second, a model that has proven successful in accounting for the effects of stimulus uncertainty in past studies of masking was used to make parallel predictions for the effects on auditory streaming. These predictions were based on the premise, as implied in the literature, of a strong link between streaming and masking.

2. Method

2.1. Theoretical model

The model applied in this case was the component-relative-entropy (CoRE) model, which has been widely applied to studies of informational masking [see Lutfi (1993) and Lutfi et al. (2013) for reviews]. The model predicts that the masking of stimulus A by stimulus B, and so, by our premise, streaming, is related to the degree to which the average statistical properties of A and B differ (their information divergence, or relative entropy). Consider the predictions for a standard ABA-ABA tone streaming experiment. All tones have the same level and duration and, within each sequence, the same frequency. The frequency separation between the A and B tones (Δf) is gradually increased until the listener reports a percept of streaming (the so-called fission threshold). Here the fission threshold is reached at a comparatively small Δf because the average statistical separation between the A and B tones is determined exclusively by Δf; the level and duration of the tones are the same, and there is no random variation in frequency that would serve to reduce the statistical separation provided by Δf. The same result is to be expected when random variation is introduced in the level or duration of the tones: The statistical separation between the tones continues to be determined exclusively by Δf, and so the fission threshold is expected to be largely unaffected. But now consider the effect of introducing random variation in the frequencies of the tones. Here the statistical separation of the tones is determined not only by Δf but also by the magnitude of the random variation in frequency. As the magnitude of this variation is increased, Δf must also increase to maintain the statistical separation in frequency corresponding to the same fission threshold. The strong prediction, then, is that only random variation in frequency will affect fission thresholds.
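The trade-off just described can be illustrated with a simple numerical sketch. The code below is not the published CoRE model; it assumes, for illustration only, that streaming occurs when Δf, normalized by the total frequency variability (external σ plus a hypothetical internal-noise term), reaches a fixed criterion. Both `sigma_int` and `criterion` are made-up free parameters, not values from the study.

```python
import math

def fission_threshold(sigma_ext, sigma_int=50.0, criterion=6.0):
    """Predicted fission threshold (in cents) under the illustrative
    assumption that streaming occurs when Delta-f divided by the total
    frequency variability reaches a fixed criterion.

    sigma_ext : experimenter-imposed frequency sigma (cents)
    sigma_int : hypothetical internal noise (cents), not a published value
    criterion : hypothetical decision criterion, not a published value
    """
    total_sigma = math.sqrt(sigma_ext ** 2 + sigma_int ** 2)
    return criterion * total_sigma

# For large sigma_ext the predicted threshold grows linearly with sigma,
# whereas sigma in level or duration, which leaves the frequency
# statistics unchanged, would leave the threshold at its sigma_ext = 0 value.
for s in (0, 200, 400, 600, 800):
    print(f"sigma = {s:3d} cents -> predicted threshold = {fission_threshold(s):7.1f} cents")
```

On this sketch the σ = 0 threshold is set entirely by the internal-noise term, and the function approaches a straight line of slope `criterion` as σ grows, mirroring the linear dependence on frequency σ predicted in the text.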

2.2. Stimuli

For all experimental conditions, the stimuli were ABA-ABA tone sequences identical to those of past streaming studies with one exception: The sequences were made uncertain by drawing the frequency, level, or duration of the tones independently and at random from normal distributions of these parameters, similar to what has been done in past masking studies (e.g., Lutfi, 1992, 1993; Lutfi et al., 2013). The standard deviation σ of the distributions was fixed within a block of trials but varied across blocks over a maximum realistic range consistent with past masking studies; larger values of σ would have resulted in many of the tones exceeding permissible levels or falling outside the audible frequency range. For frequency, σ was 0, 200, 400, 600, or 800 cents; for level, 0, 2, 4, 6, or 8 dB; and for duration, 0, 10, 20, 30, or 40 ms. Frequency, level, or duration was sampled independently for each tone with the exception that the A tones within an ABA triplet were required to have the same value. The mean level and duration of the tones were fixed, respectively, at 65 dB sound pressure level (SPL) and 100 ms. The mean frequency of the A tones was fixed at 1000 Hz, while the mean frequency of the B tones began at 1000 Hz and then was systematically incremented by 50 cents with each successive ABA triplet. The increase continued until the listener reported a percept of streaming (see Sec. 2.3). All sounds were presented diotically over Beyerdynamic DT990 headphones to listeners seated in a double-walled, Industrial Acoustics Company (IAC) sound-attenuated chamber. Tones were gated on and off with 5-ms, cosine-squared ramps and played at a 44 100-Hz sampling rate with 16-bit resolution using a MOTU 896 audio interface.
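The sampling scheme above might be sketched as follows. This is not the authors' code: tone synthesis (level, duration, ramps, playback) is omitted, and the function names are our own. The sketch generates only the nominal frequencies of an ABA-ABA sequence with Gaussian jitter expressed in cents, sharing one draw between the two A tones of each triplet as in the experiment.

```python
import random

def cents_to_hz(base_hz, cents):
    """Convert an offset in cents to a frequency in Hz relative to base_hz."""
    return base_hz * 2 ** (cents / 1200)

def make_sequence(n_triplets, sigma_cents, step_cents=50, base_hz=1000.0):
    """Return the tone frequencies (Hz) of an ABA-ABA sequence.

    Each tone's frequency is jittered by a draw from N(0, sigma_cents),
    except that the two A tones within a triplet share a single draw.
    The mean B frequency rises by step_cents with each successive triplet,
    as in the adaptive procedure of Sec. 2.3.
    """
    freqs = []
    for k in range(n_triplets):
        a = cents_to_hz(base_hz, random.gauss(0, sigma_cents))                 # shared A draw
        b = cents_to_hz(base_hz, k * step_cents + random.gauss(0, sigma_cents))
        freqs.extend([a, b, a])                                                # one ABA triplet
    return freqs
```

With `sigma_cents = 0` the sequence reduces to the deterministic ABA-ABA sequences of past streaming studies; the level and duration conditions would jitter those parameters in the same per-tone fashion.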

2.3. Procedure

The procedure for collecting fission thresholds was similar to that used in other streaming studies (cf. Moore and Gockel, 2012). Before data collection, listeners received a verbal description of the task accompanied by a visual representation of the stimuli. They were told that as they listened, the B tones would gradually increase in pitch and that they were to press the mouse button when they perceived the A and B tones to form separate perceptual streams. The sum of the 50-cent increments in the B tones up to that point was taken as a single estimate of the fission threshold. A new trial was initiated 2 s after the previous trial had ended. Five such trials constituted a block, and a block was obtained for each value of σ. Listeners completed four to six blocks per 1-h visit and were permitted 1-min breaks between blocks. Each listener also completed two practice blocks before data collection to become familiar with the task; the data from these practice trials were not analyzed. The fission threshold for each value of σ was calculated as the mean of the five estimates within each block after the modified Thompson tau method was used to remove likely outliers (Thompson, 1985). There were only eight instances in which an estimate was rejected as an outlier. Values of σ for each tone parameter were presented to listeners in both ascending and descending order.
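The outlier screening can be sketched as follows. This is our reading of the modified Thompson tau procedure, not code from the study; the tabulated critical values are the standard ones for α = 0.05, and the loop removes at most one point per pass, recomputing the mean and standard deviation after each rejection.

```python
import statistics

# Critical values of the modified Thompson tau statistic (alpha = 0.05),
# tabulated for the small sample sizes relevant here (5 estimates/block).
TAU = {3: 1.1511, 4: 1.4250, 5: 1.5712, 6: 1.6563, 7: 1.7110, 8: 1.7491}

def thompson_tau_filter(estimates):
    """Iteratively remove likely outliers by the modified Thompson tau test.

    The point with the largest absolute deviation from the sample mean is
    rejected when that deviation exceeds tau * s, where s is the sample
    standard deviation; the statistics are recomputed after each rejection.
    """
    data = list(estimates)
    while len(data) >= 3:
        mean = statistics.mean(data)
        s = statistics.stdev(data)
        if s == 0:
            break
        suspect = max(data, key=lambda v: abs(v - mean))  # largest deviation
        if abs(suspect - mean) > TAU[len(data)] * s:
            data.remove(suspect)
        else:
            break
    return data
```

Applied to a hypothetical block of five threshold estimates containing one wild value, e.g. `thompson_tau_filter([700, 710, 705, 715, 1500])`, the procedure rejects the wild estimate and the block mean is then taken over the remaining four.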

2.4. Subjects

A total of six male and eight female listeners (aged 19–29 yr) were recruited from the online Job Center of the University of Wisconsin–Madison and were paid at an hourly rate for their participation. Each listener had pure-tone thresholds equal to or less than 20 dB hearing level (HL) for octave frequencies from 250 to 8000 Hz. The data from two of these listeners were ultimately excluded from the group analyses because they rarely reported a percept of streaming, causing their fission thresholds to exceed the maximum Δf allowed to keep frequencies within the audible range. All procedures involving human subject recruitment and participation were conducted in compliance with University of Wisconsin–Madison Institutional Review Board guidelines.

3. Results and conclusions

Figure 1 shows the grand mean fission thresholds (Δf in cents relative to 1000 Hz) as a function of σ in frequency (filled squares), duration (filled circles), and level (filled triangles). Only the grand mean thresholds are shown because the pattern of results was quite similar across listeners and across presentation orders of σ. Error bars represent 1 standard error of the mean. Note that the axes for the σ associated with each parameter are aligned to represent the practical range over which each parameter has been allowed to vary in past masking studies; otherwise, the scaling is arbitrary. A two-way analysis of variance (ANOVA) (presentation order by σ), performed separately for each tone parameter, indicated a significant effect of σ (uncertainty effect) for all three tone parameters: frequency [F(1,4) = 54.53, p < 10⁻¹⁵], duration [F(1,4) = 11.08, p < 10⁻⁸], and level [F(1,4) = 8.79, p < 10⁻⁶]. No significant effect of presentation order was found for frequency or duration. A small but significant effect of presentation order was found for level [F(1,4) = 9.01, p < 0.003], but it was not large enough to fundamentally change the interpretation of the results, as reflected in the size of the error bars.

Fig. 1.


Grand mean fission thresholds (Δf in cents relative to 1000 Hz) are plotted as a function of σ in frequency (filled squares), duration (filled circles), and level (filled triangles). Error bars give the standard errors of the means. Solid lines are the predictions of the CoRE model based on the premise that streaming represents a release from masking.

The predictions of the CoRE model for thresholds are given by solid lines shown in Fig. 1 (Lutfi, 1993; Lutfi et al., 2013). These predictions are based on the premise that masking is a linear function of the statistical separation (relative entropy) of target and masker and that streaming is linked to a release from masking. In the strict application of the model assumed here, σ in level and duration are predicted to have no effect on fission thresholds inasmuch as they have no effect on the statistical separation of A and B tones. That a small effect is, in fact, observed is likely due to variation in the effective internal representation of frequency resulting from changes in the excitation patterns produced by tones of different level and duration. Notwithstanding, the general pattern of results is consistent with the predictions of the model. As expected, the variation in frequency has, by far, the greatest effect on fission thresholds and the function relating fission thresholds to σ is linear.

The present results suggest that the effects of stimulus uncertainty on auditory streaming are similar to those for auditory masking. This conclusion derives from the accuracy of predictions for streaming of a model of masking modified to assume a strong link between streaming and masking. The modified model tends to support the link often made in the literature. However, the model may also be used in the same way to identify important differences between masking and streaming. The two phenomena are, indeed, different. From a statistical standpoint, streaming may be thought to represent a judgment that two sequences have different statistical properties (derive from different sound sources). Masking, on the other hand, represents a failure to recognize a change in the statistical properties of one of the sequences, the other being quite irrelevant to the task. Simply put, streaming requires focusing on both tone sequences, masking on only one. The distinction is drawn out in the case where the variances of the two sequences are unequal. Here both the predicted and obtained amounts of masking have been shown to depend critically on whether the target or the masker has the greater variance (Lutfi and Doherty, 1994). For streaming, however, it should make little difference whether the A or B tones have greater variance because each should be treated equally in the decision as to whether they belong to different sources. This, indeed, appears to be the result in contrast to masking (cf. Lutfi and Doherty, 1994; Chang et al., 2014). In future studies it may prove useful to identify other differences between auditory streaming and masking using the modeling approach adopted here.

Acknowledgment

This research was supported by NIDCD Grant No. 5R01DC001262-21.

References and links

  • 1. Andreou, L., Kashino, M., and Chait, M. (2011). “The role of temporal regularity in auditory segregation,” Hear. Res. 280, 228–235. 10.1016/j.heares.2011.06.001
  • 2. Bendixen, A., Bohm, T. M., Szalardy, O., Mill, R., Denham, S. L., and Winkler, I. (2013). “Different roles of similarity and predictability in auditory stream segregation,” Learn. Percept. 5(Suppl. 2), 37–54. 10.1556/LP.5.2013.Suppl2.4
  • 3. Bendixen, A., Denham, S. L., Gyimesi, K., and Winkler, I. (2010). “Regular patterns stabilize auditory streams,” J. Acoust. Soc. Am. 128(6), 3658–3666. 10.1121/1.3500695
  • 4. Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound (MIT Press, Cambridge, MA), pp. 213–393.
  • 5. Chang, A.-C., Heo, I., Lee, J., Stoelinga, C. N. J., and Lutfi, R. A. (2014). “Factors affecting auditory streaming of random tone sequences,” J. Acoust. Soc. Am. 135, 2414. 10.1121/1.4878008
  • 6. Dau, T., Ewert, S., and Oxenham, A. J. (2009). “Auditory stream formation affects comodulation masking release retroactively,” J. Acoust. Soc. Am. 125, 2182–2188. 10.1121/1.3082121
  • 7. Denham, S. L., Gyimesi, K., Stefanics, G., and Winkler, I. (2010). “Stability of perceptual organization in auditory streaming,” in The Neurophysiological Bases of Auditory Perception, edited by Lopez-Poveda, E. A., Palmer, A. R., and Meddis, R. (Springer Science and Business Media, New York), pp. 477–531.
  • 8. Elhilali, M., Xiang, J., Shamma, S. A., and Simon, J. Z. (2010). “Auditory streaming at the cocktail party,” in The Neurophysiological Bases of Auditory Perception, edited by Lopez-Poveda, E. A., Palmer, A. R., and Meddis, R. (Springer Science and Business Media, New York), pp. 545–554.
  • 9. Ellis, D. P. W. (1996). “Prediction driven auditory scene analysis,” Ph.D. thesis, Massachusetts Institute of Technology.
  • 10. Kidd, G., Jr., Mason, C. R., Deliwala, P. S., Woods, W. S., and Colburn, H. S. (1994). “Reducing informational masking by sound segregation,” J. Acoust. Soc. Am. 95, 3475–3480. 10.1121/1.410023
  • 11. Kidd, G., Jr., Mason, C. R., Richards, V. M., Gallun, F. J., and Durlach, N. I. (2008). “Informational masking,” in Springer Handbook of Auditory Research: Auditory Perception of Sound Sources, edited by Yost, W. A., and Popper, A. N. (Springer-Verlag, New York), pp. 143–190.
  • 12. Lutfi, R. A. (1992). “Informational processing of complex sound. III. Interference,” J. Acoust. Soc. Am. 91, 3391–3401. 10.1121/1.402829
  • 13. Lutfi, R. A. (1993). “A model of auditory pattern analysis based on component-relative-entropy,” J. Acoust. Soc. Am. 94, 748–758. 10.1121/1.408204
  • 14. Lutfi, R. A., and Doherty, K. A. (1994). “Effect of component-relative entropy on the discrimination of simultaneous tone complexes,” J. Acoust. Soc. Am. 96, 3443–3450. 10.1121/1.410607
  • 15. Lutfi, R. A., Gilbertson, L., Chang, A.-C., and Stamas, J. (2013). “The information divergence hypothesis of informational masking,” J. Acoust. Soc. Am. 134, 2160–2170. 10.1121/1.4817875
  • 16. Lutfi, R. A., and Liu, C. J. (2011). “A method for evaluating the relation between sound source segregation and masking,” J. Acoust. Soc. Am. 129, EL34–EL38. 10.1121/1.3519871
  • 17. Moore, B. C. J., and Gockel, H. E. (2012). “Properties of auditory stream formation,” Philos. Trans. R. Soc. Biol. Sci. 367, 919–931. 10.1098/rstb.2011.0355
  • 18. Szalardy, O., Bendixen, A., Bohm, T. M., Davies, L., and Denham, S. L. (2014). “The effects of rhythm and melody on auditory stream segregation,” J. Acoust. Soc. Am. 135, 1392–1405. 10.1121/1.4865196
  • 19. Thompson, R. (1985). “A note on restricted maximum likelihood estimation with an alternative outlier model,” J. R. Statist. Soc. B 47(1), 53–55.
  • 20. van Noorden, L. P. A. S. (1975). “Temporal coherence in the perception of tone sequences,” Ph.D. dissertation, Technical University Eindhoven, Eindhoven.
  • 21. Wang, D. (2006). “Feature-based speech segregation,” in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications (Wiley and Sons, Hoboken, NJ), pp. 1–37.
  • 22. Wang, D., and Brown, G. J. (2006). “Fundamentals of computational auditory scene analysis,” in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications (Wiley and Sons, Hoboken, NJ), pp. 81–111.

