The Journal of the Acoustical Society of America. 2012 Jul 12;132(2):EL109–EL113. doi: 10.1121/1.4734575

A detection-theoretic framework for modeling informational masking

Robert A. Lutfi, An-Chieh Chang, Jacob Stamas, Lynn Gilbertson
PMCID: PMC3407140  PMID: 22894307

Abstract

There has been growing interest in recent years in masking that appears to have its origin at a central level of the auditory nervous system—so-called informational masking (IM). Masker uncertainty and target-masker similarity have been identified as the two major factors affecting IM; however, no theoretical framework currently exists that would give precise meaning to these terms necessary to evaluate their relative importance or model their effects. The present paper offers a first attempt at such a framework constructed within the doctrines of the theory of signal detection.

Introduction

In a 2006 Letter to the Editor of this journal, Durlach gave the following indictment of the state of research on auditory masking:

“Recently, both activity level and range of topics in this area has increased to such an extent that the area appears to be in total disarray. Not only is there no overarching conceptual structure available to organize the area and provide it with scientific elegance, but there are few definitions that evidence even a modest degree of scientific stability.”

Six years later the situation has not changed much. Indeed, the need for a unifying conceptual framework has only increased with the growing number of studies that have little else in common except a generally agreed upon operational definition of masking. The present paper does not claim a solution to this problem, only a small step in that direction. It suggests a theoretical approach that might be applied with some benefit in an area where there is currently much research activity—the topic of informational masking (IM). The goal is to provide a working framework wherein the relative importance of factors influencing IM can be evaluated and modeled.

Loosely defined, IM is masking that has its origin at some central level of the auditory nervous system beyond the cochlea (Durlach et al., 2003a; Kidd et al., 2008). For a more extensive review of the literature on IM the reader is referred to Kidd et al. (2008). Currently, there are two major factors that have been identified with IM. The first is masker uncertainty (Neff and Green, 1987; Watson et al., 1976; Lutfi, 1993). This factor has the longest history of study and is the only one so far to be considered within a theoretical framework that has allowed the effects to be modeled computationally (Lutfi, 1993; Oh and Lutfi, 1998). Operationally, masker uncertainty is created by varying one or more properties of the masker at random from trial to trial. It has typically been treated as an ordinal variable, labeled “high” and “low” or “minimal” and “maximal”; the one exception is the theoretical development alluded to above. The effects reported are large; elevations in detection thresholds for the target can amount to 40–50 dB for well-practiced listeners (Neff and Green, 1987). The second major factor is target-masker similarity. This factor is most often identified with the degree to which target and masker are separated in frequency, space, or time, with greater separations generally resulting in less IM (Brungart, 2001; Freyman et al., 1999; Kidd et al., 1994, 2008). In some cases, target-masker similarity is also identified with the degree to which properties of the target covary with those of the masker over time (Durlach et al., 2003b; Dau et al., 2009; Hall et al., 2008; Kidd et al., 1994). Interest in the latter case stems, in part, from the subjective observation that when two or more tonal sequences are comodulated in frequency and amplitude they tend to fuse perceptually into a single auditory stream (Bregman, 1990, pp. 642–646). Perceptual fusion has been identified as a possible factor underlying IM (Kidd et al., 1994; Dau et al., 2009).

Ideally, to evaluate the relative influence of masker uncertainty and target-masker similarity one would like a common standard by which to compare their effects; a scale that would relate the different stimulus manipulations associated with each factor. This may seem a tall order given the real qualitative differences that exist among these factors in the way they are measured and operationally defined. Yet there is a conceptual framework that is specifically designed for this purpose—it is the component of the theory of signal detection that pertains to the scaling of distances between stochastic stimuli (Macmillan and Creelman, 1990). Fundamentally, what the analysis provides is a unitless measure of distance that is independent of the physical dimension(s) along which the stimuli vary. For the general case of two stimuli given as normal random variables, S1~N(μ1,σ1) and S2~N(μ2,σ2), the measure is

$$d_a = \frac{\mu_1 - \mu_2}{\sqrt{\mathrm{AVG}(\sigma_{12}^{2} + \sigma_{21}^{2})}}, \tag{1}$$

where $\sigma_{12}^{2}$ is the variance in S1 not shared with S2 (and, correspondingly, $\sigma_{21}^{2}$ is the variance in S2 not shared with S1). In practice, d_a serves as the standard that allows the performance of human observers, discriminating S1 from S2, to be meaningfully compared across a wide variety of different stimulus configurations and psychophysical tasks. In Sec. 2 we show, by way of example, how d_a might be used in a similar manner to evaluate the relative importance of masker uncertainty and target-masker similarity across different stimuli and tasks.
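As a rough numerical illustration of Eq. (1), the sketch below computes d_a for two normal stimulus distributions. It assumes the two stimuli share no variance, so that the unshared variances reduce to the ordinary variances, and it takes AVG to be their arithmetic mean. The function name and the example values are hypothetical and are not taken from the paper.

```python
import math

def d_a(mu1, mu2, sigma1, sigma2):
    """Scaled distance between two normal stimuli.

    Assumption: S1 and S2 share no variance, so the unshared variances
    in Eq. (1) are simply sigma1**2 and sigma2**2.
    """
    avg_var = (sigma1 ** 2 + sigma2 ** 2) / 2.0
    return (mu1 - mu2) / math.sqrt(avg_var)

# Hypothetical example: two F0 distributions 40 Hz apart, each with a 20-Hz SD.
in_hz = d_a(200.0, 160.0, 20.0, 20.0)

# The same stimuli expressed in an arbitrary rescaled unit (all values x 10).
rescaled = d_a(2000.0, 1600.0, 200.0, 200.0)

print(in_hz, rescaled)
```

Both calls return the same value because the measure is a ratio of a mean difference to a standard deviation, which is what makes it independent of the physical dimension along which the stimuli vary.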

Detection-theoretic approach

For our example, we chose a task that is similar to a real-world listening task and that has been used to identify the factors influencing IM—the every-other-word identification task of Kidd et al. (2008). The stimulus on each trial is a sequence of randomly selected words that alternates between two different human speakers and that forms a grammatically correct sentence for each speaker. The words from one speaker are designated as targets (T), the words from the other speaker as maskers (M). The listener’s task is to identify as many of the target words as possible. Masking in this task is taken to reflect predominantly IM inasmuch as the temporal separation of the words minimizes the opportunity for interactions in the cochlea.

Now, we could consider a number of differences between speakers shown to impact IM, but let us start with the fundamental frequency (F0) of the speaker’s voice. To apply our approach, the F0s of target and masker words on each presentation are sampled independently and at random from two normal distributions differing in mean, μT ≠ μM. The standard deviations, σT and σM, are chosen to be large enough so that the sampled F0s are clearly discriminable; for example, reflecting normal variation in prosody. Panel 1 of Fig. 1 shows a representative case where σT = σM. The distributions of F0 are given by the continuous curves drawn to the left of the panel (black for target, gray for masker); the individual values of F0 for a representative trial are given by the small rectangles to the right. We will refer to this as the standard condition. It is similar to the fixed-voice condition of Kidd et al. (2008). The remaining panels show manipulations associated with each factor affecting IM, consistent with how they have been operationally defined in the literature. In panel 2 masker uncertainty is increased by increasing the variability of masker F0s, given by σM. This is analogous to having each masker word spoken by a randomly chosen speaker as in the random-voice condition of Kidd et al. (2008). It is also representative, more generally, of the many IM studies for which the focus has been on the frequency uncertainty associated with masker tones (e.g., Neff and Green, 1987; Watson et al., 1976; Lutfi, 1993; Kidd et al., 2008). In panel 3 target-masker similarity is increased by reducing the mean difference between the target and masker F0s, given by μT − μM. This would correspond to the case where both speakers are of the same gender as in the study by Brungart (2001). It is also representative of conditions for which the reduction in frequency separation causes perceptual fusion (Bregman, 1990, p. 18), sometimes implicated in IM.
Finally, in panel 4 the overall separation between target and masker F0s continues to vary at random from trial to trial, but target-masker similarity is increased by increasing the covariation of target and masker F0s, given by r. Although we know of no direct comparison to studies using speech, this condition is otherwise analogous to studies wherein target and masker tones share the same pitch contour (e.g., Durlach et al., 2003b). In each panel the manipulation is intended to make the task more difficult by increasing the likelihood that elements of the masker will be “confused” for those of the target. The question is: How should the specific values of σM, μT − μM, and r be selected so that their effects can be meaningfully compared?
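The stimulus manipulations described above can be sketched as a small sampling routine. This is only an illustration under stated assumptions: every parameter value (means and standard deviations in Hz) is hypothetical, the conditions are not equated in d_a, the default of five words per sentence follows the N = 5 used later in the text, and covariation is implemented by giving each masker F0 a correlation r with the target F0 in the same serial position.

```python
import math
import random

def sample_trial(mu_t, mu_m, sigma_t, sigma_m, r=0.0, n_words=5, rng=None):
    """Return (target_f0s, masker_f0s) for one trial.

    Each F0 is drawn from a normal distribution; r is the correlation
    between target and masker F0s in the same serial position.
    """
    if rng is None:
        rng = random.Random()
    targets, maskers = [], []
    for _ in range(n_words):
        z_t = rng.gauss(0.0, 1.0)
        z_m = rng.gauss(0.0, 1.0)
        targets.append(mu_t + sigma_t * z_t)
        # Mix the target deviate into the masker deviate to obtain correlation r.
        maskers.append(mu_m + sigma_m * (r * z_t + math.sqrt(1.0 - r * r) * z_m))
    return targets, maskers

# Hypothetical values (Hz), chosen only to mimic the four panels of Fig. 1.
CONDITIONS = {
    "standard": dict(mu_t=200.0, mu_m=120.0, sigma_t=10.0, sigma_m=10.0, r=0.0),
    "masker uncertainty": dict(mu_t=200.0, mu_m=120.0, sigma_t=10.0, sigma_m=30.0, r=0.0),
    "mean separation": dict(mu_t=200.0, mu_m=160.0, sigma_t=10.0, sigma_m=10.0, r=0.0),
    "covariation": dict(mu_t=200.0, mu_m=120.0, sigma_t=10.0, sigma_m=10.0, r=1.0),
}

rng = random.Random(0)
for name, params in CONDITIONS.items():
    targets, maskers = sample_trial(rng=rng, **params)
    print(name, [round(f) for f in targets], [round(f) for f in maskers])
```

With r = 1 and equal standard deviations, the masker contour is the target contour shifted by the mean difference, which is the "same pitch contour" case mentioned in the text.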

Figure 1. Values of F0 (black for target, gray for masker) for a single representative trial are given as small rectangles for the standard condition (Panel 1) and each of the manipulations associated with the major factors affecting IM (Panel 2: Masker uncertainty σM, Panel 3: Target-masker frequency separation μT − μM, and Panel 4: Target-masker covariance r).

To answer this question, we return to the notion of “confusion” between target and masker. The term is used loosely in the literature to capture the widely expressed view that IM results from the listener’s failure in some way to distinguish target from masker. Given this view, it makes sense to evaluate the confusion between target and masker in the same way it has been done in so many other studies for the confusion between two targets, relative to the scaled distance between the two signals, d_a. This, in fact, is the logic behind the current approach (see footnote 1). Returning to our example, we consider the case where σT = σM for the standard. We have then from Eq. 1

$$d_a = (\mu_T - \mu_M)\left\{\frac{\sigma_T^{2} + \sigma_M^{2}}{2\,[1 + (1 - r^{2})(N - 1)]}\right\}^{-1/2}, \tag{2}$$

where N = 5 is the number of words in each sentence and r² is the proportion of variance in the masker F0s that is common to the target F0s. In practice, the values of σM, μT − μM, and r in Eq. 2 are selected so that in going from the standard to any other condition the change in the distance between target and masker, as given by d_a, is the same in each case. Performance across conditions can then be meaningfully compared to identify real differences in the effectiveness of each factor and to test predictions of models. For example, in the manipulation involving the comodulation of target and masker in Fig. 1, r is changed from a value of 0 for the standard to a value of 1 for the comparison. From Eq. 2 this yields a factor of 1/√5 change in d_a, since the term 1 + (1 − r²)(N − 1) falls from 5 to 1. To produce the same change for the manipulation involving the separation of target and masker, μT − μM would be reduced by a factor of 1/√5. And, to produce the same change for the manipulation of masker uncertainty, σM would be increased by a factor of 3, which raises the average variance (σT² + σM²)/2 fivefold when σT = σM in the standard.
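The equating of conditions can be checked numerically. The sketch below reads Eq. 2 as d_a = (μT − μM)·sqrt(2[1 + (1 − r²)(N − 1)]/(σT² + σM²)), a reading that is consistent with the threefold increase in σM stated in the text. The standard-condition values are hypothetical; only the ratios of d_a matter. Under this reading, each of the three manipulations changes d_a by the same factor of 1/√5 (about 0.45).

```python
import math

def d_a(mu_t, mu_m, sigma_t, sigma_m, r=0.0, n=5):
    """Eq. 2 as read in the lead-in; n is the number of words per sentence."""
    looks = 1.0 + (1.0 - r ** 2) * (n - 1)          # 5 when r = 0, 1 when r = 1
    avg_var = (sigma_t ** 2 + sigma_m ** 2) / 2.0   # average of the two variances
    return (mu_t - mu_m) * math.sqrt(looks / avg_var)

# Hypothetical standard condition (Hz) with sigma_t == sigma_m and r = 0.
standard = dict(mu_t=200.0, mu_m=120.0, sigma_t=10.0, sigma_m=10.0, r=0.0)
sep = standard["mu_t"] - standard["mu_m"]

comparisons = {
    "covariation, r: 0 -> 1": dict(standard, r=1.0),
    "separation x 1/sqrt(5)": dict(standard, mu_m=standard["mu_t"] - sep / math.sqrt(5)),
    "masker SD x 3": dict(standard, sigma_m=3 * standard["sigma_m"]),
}

ratios = {name: d_a(**params) / d_a(**standard) for name, params in comparisons.items()}
for name, ratio in ratios.items():
    print(f"{name}: d_a changes by a factor of {ratio:.4f}")
print(f"1/sqrt(5) = {1 / math.sqrt(5):.4f}")
```

Because the three comparison conditions sit at the same scaled distance from the standard, any remaining difference in listener performance among them would reflect the factor itself rather than the size of the manipulation.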

Potential applications and summary

One of the first applications of the approach would be to test basic hypotheses regarding the nature of the interaction between target and masker that leads to IM. Some success so far has been achieved by modeling this interaction as a weighted sum of target and masker elements (Lutfi, 1993; Oh and Lutfi, 1998). However, an alternative notion is that, on some proportion of trials, the masker simply distracts attention away from the target, resulting in chance performance on those trials (Werner and Bargones, 1991). Neither idea has been widely tested, yet the two make different predictions for the specific relation of d_a to performance as d_a is varied across the conditions of Fig. 1 (cf. Lutfi et al., 2003). These predictions can also be tested against those of qualitative interpretations invoking perceptual fusion and streaming (cf. Dau et al., 2009). Note, for example, that the need for perceptual fusion may be obviated in the case of target-masker comodulation, where an increase in target-masker confusion is expected simply because the listener has fewer independent “looks” at the difference between target and masker within each trial. Finally, within the proposed framework it becomes possible to make meaningful comparisons of the relative effectiveness of differences between target and masker that have been shown to produce a release in IM (e.g., timbre, F0, spatial separation, rhythm). One intriguing prediction in this case, for which there is indirect support, is that IM will be independent of the physical differences between target and masker and will depend, instead, only on their statistical properties, as given by d_a (Lutfi, 1990, 1993). These and other possibilities are currently being tested in our lab, and it is hoped that such examples will stimulate applications in other labs as well toward the development of testable computational models of IM.
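To make concrete how two accounts could predict different relations between d_a and performance, the sketch below contrasts two toy psychometric functions. Everything here is an assumption for illustration and is not the authors' model: performance is expressed as proportion correct in a two-alternative forced-choice task rather than word identification, the first function stands in for any account in which the listener's effective d' is a fixed fraction of d_a, and the second implements the distraction idea as a mixture in which a proportion of trials are pure guesses. The parameter names `efficiency` and `p_distract` are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function

def pc_attenuated(d_a, efficiency):
    """Proportion correct when effective d' is a fixed fraction of d_a."""
    return PHI(efficiency * d_a / math.sqrt(2.0))

def pc_distraction(d_a, p_distract):
    """Proportion correct when a proportion of trials are at chance (0.5)."""
    return (1.0 - p_distract) * PHI(d_a / math.sqrt(2.0)) + p_distract * 0.5

for d in (0.0, 1.0, 2.0, 4.0, 8.0):
    print(f"d_a={d:4.1f}  attenuated={pc_attenuated(d, 0.5):.3f}  "
          f"distraction={pc_distraction(d, 0.2):.3f}")
```

In this toy comparison the attenuated function approaches perfect performance as d_a grows, whereas the distraction function levels off below it, so the two forms could in principle be told apart by measuring performance over a range of equated d_a values.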

Acknowledgment

This research was supported by NIDCD Grant No. R01 DC001262-20. We wish to thank the three anonymous reviewers for their comments on an earlier version of the manuscript.

Footnotes

1. A related though conceptually different approach was used by Lee and Richards (2011) specifically to evaluate the effects of target-masker similarity. These authors equate target-masker similarity with the listener’s ability to discriminate target from masker embedded in broadband masking noise, which is a response-based measure rather than a stimulus-based measure. The approach does not apply to the effects of masker uncertainty and so is not discussed further here.

References and links

  1. Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound (MIT Press, Cambridge, MA).
  2. Brungart, D. S. (2001). “Informational and energetic masking effects in the perception of two simultaneous talkers,” J. Acoust. Soc. Am. 109, 1101–1109.
  3. Dau, T., Ewert, S., and Oxenham, A. J. (2009). “Auditory stream formation affects comodulation masking release retroactively,” J. Acoust. Soc. Am. 125, 2182–2188.
  4. Durlach, N. I. (2006). “Auditory masking: Need for an improved conceptual structure,” J. Acoust. Soc. Am. 120, 1787–1790.
  5. Durlach, N. I., Mason, C. R., Kidd, G., Jr., Arbogast, T. L., Colburn, H. S., and Shinn-Cunningham, B. (2003a). “Note on informational masking,” J. Acoust. Soc. Am. 113, 2984–2987.
  6. Durlach, N. I., Mason, C. R., Shinn-Cunningham, B. G., Arbogast, T. L., Colburn, H. S., and Kidd, G., Jr. (2003b). “Informational masking: Counteracting the effects of stimulus uncertainty by decreasing target-masker similarity,” J. Acoust. Soc. Am. 114, 368–379.
  7. Freyman, R. L., Helfer, K. S., McCall, D. D., and Clifton, R. K. (1999). “The role of perceived spatial separation in the unmasking of speech,” J. Acoust. Soc. Am. 106, 3578–3588.
  8. Hall, J. W., III, Buss, E., and Grose, J. H. (2008). “Comodulation detection differences in children and adults,” J. Acoust. Soc. Am. 123, 2213–2219.
  9. Kidd, G., Jr., Best, V., and Mason, C. R. (2008). “Listening to every other word: Examining the strength of linkage variables in forming streams of speech,” J. Acoust. Soc. Am. 124, 3795–3802.
  10. Kidd, G., Jr., Mason, C. R., and Deliwala, P. S. (1994). “Reducing informational masking by sound segregation,” J. Acoust. Soc. Am. 95, 3475–3480.
  11. Kidd, G., Jr., Mason, C. R., Richards, V. M., Gallun, F. J., and Durlach, N. I. (2008). “Informational masking,” in Springer Handbook of Auditory Research: Auditory Perception of Sound Sources, edited by W. A. Yost and A. N. Popper (Springer-Verlag, New York), pp. 143–190.
  12. Lee, T. Y., and Richards, V. M. (2011). “Evaluation of similarity effects in informational masking,” J. Acoust. Soc. Am. 129, EL280–EL285.
  13. Lutfi, R. A. (1990). “Informational processing of complex sound. I. Cross-dimensional analysis,” J. Acoust. Soc. Am. 87, 2141–2148.
  14. Lutfi, R. A. (1993). “A model of auditory pattern analysis based on component-relative-entropy,” J. Acoust. Soc. Am. 94, 748–758.
  15. Lutfi, R. A., Kistler, D. J., Callahan, M. R., and Wightman, F. L. (2003). “Psychometric functions for informational masking,” J. Acoust. Soc. Am. 114(6), 3273–3282.
  16. Macmillan, N. A., and Creelman, C. D. (1990). Detection Theory: A User’s Guide (Cambridge University Press, Cambridge, England), pp. 65–71.
  17. Neff, D. L., and Green, D. M. (1987). “Masking produced by spectral uncertainty with multicomponent maskers,” Percept. Psychophys. 41, 409–415.
  18. Oh, E., and Lutfi, R. A. (1998). “Nonmonotonicity of informational masking,” J. Acoust. Soc. Am. 104, 3489–3499.
  19. Watson, C. S., Kelly, W. J., and Wroton, H. W. (1976). “Factors in the discrimination of tonal patterns II: Selective attention and learning under various levels of stimulus uncertainty,” J. Acoust. Soc. Am. 60, 1176–1186.
  20. Werner, L. A., and Bargones, J. Y. (1991). “Sources of auditory masking in infants: Distraction effects,” Percept. Psychophys. 50, 405–412.

Articles from The Journal of the Acoustical Society of America are provided here courtesy of Acoustical Society of America
