Cues for masked amplitude-modulation detection

Paul C Nelson; Laurel H Carney

doi:10.1121/1.2213573

Author manuscript; available in PMC: 2008 Oct 27.

Published in final edited form as: J Acoust Soc Am. 2006 Aug;120(2):978–990. doi: 10.1121/1.2213573

PMCID: PMC2572864 NIHMSID: NIHMS74678 PMID: 16938985

Abstract

The ability of psychoacoustic models to predict listeners’ performance depends on two key stages: preprocessing and the generation of a decision variable. The goal of the current study was to determine the perceptually relevant decision variables in masked amplitude-modulation detection tasks in which the modulation depth of the masker was systematically varied. Potential cues were made unreliable by roving the overall modulation depth from trial to trial or were reduced in salience by equalizing the envelope energy of the standard and target after the signal was added. Listeners’ performance was significantly degraded in both paradigms compared to the baseline (fixed-level modulation masker) condition, which was similar to those used in previous studies of masking in the envelope-frequency domain. Although this observation was broadly consistent with a simple long-term envelope power-spectrum model, there were several aspects of the data that were not. For example, the steep rate of change in threshold with masker depth and the fact that an optimal amount of envelope noise could enhance performance were not predicted by decision variables calculated directly from the stimulus envelope. A physiologically based processing model suggested a realistic nonlinear mechanism that could give rise to these second-order features of the data.

I. INTRODUCTION

Behaviorally relevant acoustic stimuli such as speech cannot be defined solely by their long-term audio-frequency composition. Temporal variations in a signal’s spectrum and interactions between individual spectral components result in amplitude-modulated (AM) sounds. Viemeister (1979) used concepts from linear systems analysis as a framework to determine the effective temporal modulation transfer function (MTF) of the auditory system by measuring the just-noticeable modulation depth of a sinusoidally amplitude-modulated (SAM) noise for a range of modulation frequencies (f_m). Viemeister’s approach has proven highly valuable as a first-order approximation of the system’s (low-pass) properties and as a starting point for many other studies. For example, the modulation filter-bank model structure (e.g., Dau et al., 1997a), which assumes that the envelope of the output of each audio-frequency channel passes through a bank of bandpass filters (broadly) tuned to f_m, is able to account for several perceptual findings that a low-pass pre-processor cannot explain (e.g., Dau et al., 1997a, b, 1999; Ewert and Dau, 2000).

To predict psychophysical thresholds, the output of any model must be concisely quantified with some decision variable (DV). And while the preprocessing model structures are fundamentally different for the two models mentioned earlier, both the Viemeister (1979) model and the most recent implementations of the Dau model [the envelope power-spectrum model, Ewert and Dau, 2000; Ewert et al., 2002; Ewert and Dau, 2004] assume an average rms DV at the output of their envelope-filtering process. This assumption has been shown to be reasonable for many types of AM-detection tasks, but it is not clear whether decision statistics that rely on local temporal envelope features (instead of average or long-term features) would be equally successful as quantifications of the model outputs.

The broad goal of the current set of experiments was to further elucidate which features of AM stimuli are perceptually salient and used by listeners in modulation detection tasks. To accomplish this, empirical data are presented that provide critical tests for various DVs. Paradigms from the audio-frequency tone-in-noise (TIN) detection literature that highlight shortcomings of long-term decision statistics in the spectral-frequency domain (roving-level and energy-equalized TIN detection) were translated into the modulation-frequency domain. Because the stimuli had envelope-frequency bandwidths smaller than the presumed modulation filter widths, the internal representation of the stimulus envelope was similar for the low-pass (Viemeister) model and the bandpass (Dau) model. An alternative model, developed to predict responses of inferior colliculus (IC) neurons to AM signals (Nelson and Carney, 2004), was tested alongside the previously proposed psychophysical (signal-processing) models. The working hypothesis was that a physiologically motivated model structure would shape the internal representation of the stimulus more like the real system than “effective” signal-processing models.

There are two reasons to consider a masked AM-detection task (instead of pure, or unmasked, AM detection) to test our hypotheses. First, several reasonable techniques can be used to adjust a given model’s unmasked detection abilities, which makes it difficult to dismiss one competing decision statistic over another. A more interesting reason is that real-world sounds have complex modulation spectra, so it is useful to consider envelope detection abilities and limitations for stimuli other than pure sinusoidal AM. Previous studies of masked-AM detection have focused on the effects of varying the frequencies of the signal and/or masker modulation (Houtgast, 1989; Bacon and Grantham, 1989; Strickland and Viemeister, 1996; Dau et al., 1997a; Ewert and Dau, 2000; Ewert et al., 2002). Here, masker level (or masker modulation depth) was the only systematically manipulated stimulus dimension. Predicted signal-detection thresholds based on a battery of potentially relevant DVs were compared to the masked thresholds measured psychophysically. Because several decision devices predicted statistically similar thresholds, a more detailed analysis of the relationships between DVs and listener responses on a trial-by-trial basis was also carried out.

A subset of the potential perceptually relevant decision devices investigated in the present study can be introduced in the context of previous work. Perhaps the most influential and straightforward DV assumed in previous AM-coding work is the long-term rms energy measured at the output of some envelope-filtering process. Such a statistic can explain the shape of the temporal modulation transfer function (with low-pass preprocessing: Viemeister, 1979; Strickland and Viemeister, 1996) and the envelope-frequency selectivity observed in experiments measuring sinusoidal AM-detection thresholds in the presence of a narrowband-noise masker modulation applied to the same carrier (with bandpass preprocessing: Ewert and Dau, 2000; Ewert et al., 2002). Moore and Sek (2000) measured detection thresholds for stimuli with three AM-frequency components for three different phase configurations, and found no dependence of thresholds on the components’ relative phases. This finding is also consistent with predictions of an average (rms) envelope statistic. Note that any local temporal structure present in the stimulus (or its internal representation) is discarded with an average (rms) metric.

Strickland and Viemeister (1996) concluded that the ratio of the maximum value to the minimum value of the envelope (max/min) was the best predictor of listeners’ thresholds in a tone-on-tone modulation masking experiment. In contrast to the rms statistic, which averages over the entire temporal waveform, max/min makes decisions based on only two points in the envelope representation. Crest factor (ratio of maximum envelope value to the envelope rms) represents a compromise in some sense: a single value of the waveform is normalized by an averaged value. Lorenzi et al. (1999) accounted for performance in a (supra-threshold) modulation component phase discrimination task by basing decisions on the crest factor of a low-pass filtered version of the envelope of their stimuli. DVs based on the higher-order moments of envelope amplitude distributions have also been tested in various envelope-processing tasks (e.g., skewness: Lorenzi et al., 1999; kurtosis: Strickland and Viemeister, 1996).
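The statistics named in the two preceding paragraphs can be made concrete with a short sketch. The function below is illustrative only: the name `envelope_statistics`, the use of the raw (unfiltered) envelope, and the normalization of the moments are assumptions made here, not the processing used in the cited studies, which applied these statistics after various envelope-filtering stages.

```python
import math

def envelope_statistics(env):
    """Candidate decision variables for a sampled, strictly positive envelope.

    Hypothetical helper: computes the statistics directly from the envelope
    samples, with no modulation filtering applied first.
    """
    n = len(env)
    mean = sum(env) / n
    ac = [x - mean for x in env]                     # ac-coupled envelope
    ac_rms = math.sqrt(sum(x * x for x in ac) / n)   # long-term (average) statistic
    env_rms = math.sqrt(sum(x * x for x in env) / n)
    return {
        "ac_rms": ac_rms,
        "max_min": max(env) / min(env),              # uses only two envelope points
        "crest_factor": max(env) / env_rms,          # one point normalized by an average
        "skewness": sum(x ** 3 for x in ac) / n / ac_rms ** 3,
        "kurtosis": sum(x ** 4 for x in ac) / n / ac_rms ** 4,
    }

# Example: one full cycle of a sinusoidal envelope with modulation depth m = 0.5
m = 0.5
sam = [1.0 + m * math.sin(2 * math.pi * k / 1000) for k in range(1000)]
stats = envelope_statistics(sam)
```

For a purely sinusoidal envelope the ac-coupled rms is m/√2 and max/min is (1 + m)/(1 − m), so the statistics can be checked against closed-form values before they are applied to noisy envelopes.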

Another aspect of a signal with a complex modulation spectrum is its venelope, or second-order envelope (Shofner et al., 1996; Ewert et al., 2002; Lorenzi et al., 2001a, b). Venelope cues could potentially be used in modulation masking experiments, especially in conditions with tonal maskers and noise signals (Ewert et al., 2002). This line of reasoning parallels results from audio-frequency tone and noise masking experiments in which envelope cues have been shown to have various effects on detection performance, depending on the masker-signal configuration (i.e., the asymmetry of masking; see Derleth and Dau, 2000). It is reasonable to hypothesize that venelope fluctuations may also provide a detection cue for conditions with sinusoidal signals and random maskers (as measured in the present study), especially when first-order envelope cues are made unreliable or completely removed.

As an alternative to signal-processing-based DVs, threshold predictions were also made based on a physiologically motivated model for neural responses to AM tones (Nelson and Carney, 2004). The average firing rate of model inferior colliculus cells was tested as a physiologically realistic DV, alongside several of the signal statistics described earlier. In the model cells, firing rate increases monotonically with signal modulation depth. Interactions between strong inhibitory and weaker excitatory inputs result in a “hard” threshold modulation depth that limits the model’s detection performance even in the absence of internal or external (stimulus-induced) noise sources. Model-cell rate MTFs are bandpass, with Q values (measured at the half-maximal-rate points) of approximately 1. This broad tuning is realized in the physiological model by assuming different time courses in the effective low-pass filtering properties of inhibition and excitation. The Q values are consistent with the signal-processing modulation filters derived recently by Dau and co-workers to predict several aspects of psychophysical envelope coding (Dau et al., 1997a; Ewert and Dau, 2000; Ewert et al., 2002). For the band-limited stimuli used in the present study, the filtering properties of the IC model cells have little effect on shaping the internal representation of the envelope. Again, the focus is on understanding the perceptually salient quantifications of the internally represented envelope (as opposed to testing the validity of a bandpass modulation filter versus a “smoothing” or low-pass modulation filter).

Independent of the chosen DV, simulations of psychophysical experiments must include some mechanism to limit model performance in the detection and discrimination of deterministic stimuli (without external noise). The most common way to do this is to add some amount of internal noise, either to the internal representation, or to the final value of the decision statistic in each interval. Ewert and Dau (2004) have provided some insight into the appropriate statistical description of the internal noise relevant to envelope-processing tasks. They measured AM depth-discrimination thresholds for a wide range of standard depths, and found the Weber fraction for sinusoidal carriers to be independent of standard depth, as long as the standard was well above threshold. This can be accounted for in a model by assuming a constant ratio between the DVs in the target and standard interval at threshold, or by including an internal noise whose variance is proportional to the value of the assumed decision statistic. For low standard depths (i.e., -28 and -23 dB in 20 log m, where m is linear modulation depth), the situation was different. In this range, a constant increase in modulation depth was required to reach discrimination threshold (independent of the standard depth). This can be thought of as arising from a second type of internal noise process—one with a fixed variance, which dominates threshold measurements at low modulation depths. We will address Ewert and Dau’s (2004) findings, but we will also consider model predictions with a fixed-variance noise only, as a “best-case scenario” for the various decision statistics (i.e., if a decision statistic predicts higher thresholds than the listeners’ performance with the fixed-variance noise alone, it would certainly not be able to account for thresholds if the constant-ratio noise, or Weber-fraction noise, were also included).
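As a minimal sketch of the two internal-noise types discussed above, the function below adds zero-mean Gaussian noise to a decision variable. It assumes that the fixed noise and the constant-ratio (Weber-fraction) noise are independent, so that their variances add, and that the constant-ratio component has a standard deviation proportional to the DV. The names `sigma_fixed` and `weber_k` are hypothetical parameters introduced for illustration.

```python
import math
import random

def noisy_decision_variable(dv, sigma_fixed, weber_k, rng):
    """Return dv corrupted by internal noise (illustrative assumption).

    sigma_fixed : standard deviation of the fixed-variance noise
    weber_k     : ratio of the constant-ratio noise's standard deviation to dv
    rng         : a random.Random instance (keeps the sketch reproducible)
    """
    sigma = math.sqrt(sigma_fixed ** 2 + (weber_k * dv) ** 2)
    return dv + rng.gauss(0.0, sigma)

# "Best-case" configuration described in the text: fixed-variance noise only.
rng = random.Random(1)
best_case = noisy_decision_variable(dv=1.0, sigma_fixed=0.05, weber_k=0.0, rng=rng)
```

Setting `weber_k` to zero gives the fixed-variance-only case. A nonzero value makes the noise grow with the DV, which is one way to obtain a depth-independent Weber fraction at high standard depths.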

Two specific paradigms that have been used in the audio-frequency domain to test the power spectrum model of masking were translated into the envelope-frequency domain in the current study: roving-level and equal-energy TIN detection. A within-trial rove in overall energy renders long-term rms cues unreliable, and models based on energy cues predict higher thresholds in a roving-level situation. The absolute amount of increase over fixed-level conditions depends on the rove range (Green, 1983). Kidd et al. (1989) found that roving the overall level by 32 dB in an audio-frequency TIN detection task did not have a significant effect on thresholds (for noise bandwidths greater than one-third of the psychophysically measured auditory-filter bandwidth). In another paradigm that challenges energy-based audio-frequency models of masking, Richards and Nekrich (1993) measured the detectability of tones in narrow bands of masking noise after the energy in the two observation intervals was equalized. Pure long-term energy models predict that such a task would be impossible (for subcritical bandwidths), but listeners performed the task reliably. Richards and Nekrich (1993) attributed their results to differences in the envelopes of the noise-alone and tone-plus-noise stimuli.

With this body of previous work in mind, we present here psychophysical masked-AM detection data and predicted thresholds based on a diverse set of decision statistics. Measured and simulated thresholds in roving-level and equal-energy conditions are compared to those from a baseline fixed-level masker condition, over a wide range of masker modulation depths.

II. PSYCHOPHYSICAL EXPERIMENT

A. Methods

1. Subjects and procedure

Four listeners with normal hearing participated in the experiment. Pure-tone thresholds for all of the subjects were less than 15 dB HL at octave frequencies between 500 Hz and 8 kHz. The authors served as two of the subjects (S2 and S3) and had experience in psychoacoustic measurements. The remaining two listeners had no previous experience. A training period, typically lasting three or more 1.5-h sessions, was provided in which masked and absolute modulation thresholds were estimated using procedures similar to those described in the following. Further training was provided for the roving-level and equal-envelope-energy (EEE) conditions (see the following). Data collection began when thresholds for a subject stabilized; there were typically no learning effects observed after four to five tracks on a given condition. The listeners became familiar with the different stimulus conditions, and were aware of the particular condition prior to the start of a track.

Masked SAM detection thresholds were obtained using an adaptive two-interval, two-alternative forced-choice (2I, 2AFC) procedure with a two-down, one-up stepping rule that estimated the modulation depth necessary for 70.7% correct detection (Levitt, 1971). This combination of parameters resulted in a threshold estimate that corresponded to a d’ of about 0.8. In the randomly chosen target interval, the signal modulation was imposed along with a masker modulation on the tone carrier. The standard interval contained only the masker modulation. The signal modulation depth m at the beginning of a track was set well above threshold, and was varied initially by 3-dB steps (in 20 log m), and in steps of 1.5 dB after the first two reversals. The tracking procedure was run until 16 reversals were obtained; threshold for a given track was taken as the mean modulation depth of the last ten reversals. For each stimulus condition, thresholds presented here are the mean of four such estimates. Only tracks in which the standard deviation of the last ten reversals was less than 3 dB were included in further analysis. Across-subject average data are presented as the mean and standard deviation of the 16 threshold estimates (4 listeners×4 tracks per condition).
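The tracking rule described above can be sketched as follows. This is not the authors' code: the function names, the starting level, and the convention of recording a reversal at the level where the track changes direction are assumptions. The listener is represented by a hypothetical callback `respond(level_db)` that returns `True` for a correct response.

```python
import math
from statistics import NormalDist

def run_track(respond, start_db=-5.0, big_step=3.0, small_step=1.5,
              n_reversals=16, n_average=10):
    """Two-down, one-up adaptive track on signal depth in dB (20 log m)."""
    level = start_db
    n_correct = 0
    direction = 0                      # -1 descending, +1 ascending, 0 before first move
    reversals = []
    while len(reversals) < n_reversals:
        if respond(level):
            n_correct += 1
            if n_correct < 2:
                continue               # two correct in a row are needed to step down
            new_direction = -1
        else:
            new_direction = +1         # a single error steps up
        n_correct = 0
        if direction != 0 and new_direction != direction:
            reversals.append(level)    # level at which the track turned around
        direction = new_direction
        step = big_step if len(reversals) < 2 else small_step
        level += direction * step
    last = reversals[-n_average:]
    return sum(last) / len(last), reversals

def dprime_2afc(p_correct):
    """d' for an unbiased observer in a two-interval forced-choice task."""
    return math.sqrt(2.0) * NormalDist().inv_cdf(p_correct)

# Deterministic toy listener: always correct at or above -20 dB, always wrong below.
threshold, reversals = run_track(lambda level_db: level_db >= -20.0)
```

A two-down, one-up rule converges on the level where the probability of two consecutive correct responses is 0.5, that is, p = √0.5 ≈ 0.707. Passing that value to `dprime_2afc` gives a d' slightly below 0.8. Note that the loop only terminates for listeners that eventually make errors, so a real simulation would use a stochastic `respond`.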

2. Apparatus and stimuli

Subjects listened diotically through calibrated Sennheiser HD 580 headphones while seated in a sound-treated booth. Stimuli were digitally generated at a sampling rate of 48.828 kHz and converted to analog signals via the TDT System III two-channel real-time processor (RP2.1) digital-to-analog converter and the TDT System III headphone buffer (HB7), with its gain set to -27 dB (to eliminate background noise). Signals were generated and presented with visual feedback using MATLAB. Noise waveforms were saved for both intervals on every trial (by recording random-number-generator seeds) so that the exact stimuli could be reconstructed for post hoc analysis (see Sec. III A 3).

The two intervals were each 600 ms in duration including 50-ms cos² ramps, and were presented with a 500-ms interstimulus interval. Both the sinusoidal signal (always in sine phase) and the narrow-band Gaussian-noise masker modulation were applied to the envelope of a 2800-Hz tone carrier for the entire duration of the stimulus. The signal frequency was 64 Hz; the masker was centered on the signal frequency and had a bandwidth of 32 Hz. These parameters were chosen to satisfy several specific constraints. First, the modulation frequencies were low enough to avoid the introduction of audio-frequency spectral resolution cues that arise when the sidebands generated by modulation are remote from the carrier frequency component. In addition, the bandwidth of the masker was wide enough to allow for the slower second-order (venelope) fluctuations to fall within a range that could potentially be detected in a 600-ms duration signal (the venelope energy was concentrated around 10 Hz). The AM signal and masker parameters were also influenced by modeling considerations, as described below.

Two statistically independent realizations of the masker were generated for the standard and target intervals. An additive approach, as opposed to the multiplicative one used in several related studies (Ewert and Dau, 2000; Ewert et al., 2002; Houtgast, 1989), was used to combine the signal and masker. This allowed for more careful control of the envelope-frequency domain magnitude spectrum (i.e., addition of time-domain waveforms results in the addition of their frequency-domain spectra, whereas multiplication of time waveforms is equivalent to a convolution of their frequency spectra). The equation for the stimuli in both intervals is

s(t) = c \, \{ \sin(2\pi f_c t) \, [1 + m \sin(2\pi f_m t) + M(t)] \},

where f_c is the carrier frequency, m is the signal modulation depth (zero in the standard interval), f_m is the signal modulation frequency, and M(t) is the masker waveform (zero when measuring absolute thresholds). Masker level was defined in terms of the rms of M(t). The compensation factor c was included so the overall power in both intervals was equivalent to that of a 65-dB SPL pure tone. Every stimulus was checked for overmodulation caused by the stochastic nature of the narrow-band maskers; no envelope with a modulation index greater than one was presented to the listeners.
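The stimulus equation can be illustrated with a standard-library sketch. Several details are assumptions made here rather than the authors' procedure: the masker is synthesized as a sum of sinusoids with Gaussian-distributed quadrature amplitudes at multiples of 1/duration inside the 48–80 Hz band, overmodulation is interpreted as the envelope dipping below zero, the 50-ms ramps are omitted, and c is normalized to the power of a unit-amplitude tone (calibration to 65 dB SPL is left to the playback chain). The function names are hypothetical.

```python
import math
import random

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def make_masker(rms_db, fs, dur, center=64.0, bw=32.0, rng=random):
    """Narrow-band Gaussian-noise modulator M(t) scaled to rms_db (20 log rms)."""
    n = int(round(fs * dur))
    k_lo = math.ceil((center - bw / 2) * dur)
    k_hi = math.floor((center + bw / 2) * dur)
    # One (frequency, cosine amplitude, sine amplitude) triple per component
    comps = [(k / dur, rng.gauss(0, 1), rng.gauss(0, 1)) for k in range(k_lo, k_hi + 1)]
    raw = [sum(a * math.cos(2 * math.pi * f * i / fs) +
               b * math.sin(2 * math.pi * f * i / fs) for f, a, b in comps)
           for i in range(n)]
    gain = 10 ** (rms_db / 20) / rms(raw)
    return [gain * v for v in raw]

def make_stimulus(m, masker, fs, fc=2800.0, fm=64.0):
    """s(t) = c sin(2 pi fc t) [1 + m sin(2 pi fm t) + M(t)], signal and masker added."""
    n = len(masker)
    env = [1.0 + m * math.sin(2 * math.pi * fm * i / fs) + masker[i] for i in range(n)]
    if min(env) < 0.0:
        raise ValueError("overmodulated envelope; draw a new masker realization")
    s = [math.sin(2 * math.pi * fc * i / fs) * env[i] for i in range(n)]
    c = math.sqrt(0.5 / (sum(v * v for v in s) / n))   # match power of a unit tone
    return [c * v for v in s], env

# Target interval: signal at m = 0.1 plus a -18 dB rms masker; m = 0 gives a standard.
fs = 48828.0
masker = make_masker(-18.0, fs, 0.6, rng=random.Random(0))
target, target_env = make_stimulus(0.1, masker, fs)
```

Because the signal and masker are summed inside the brackets, their envelope spectra add. This is the property the text gives as the reason for preferring the additive approach over the multiplicative one.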

3. Conditions

The acoustic stimuli used in this experiment were similar to those described in Ewert et al. (2002). Different parameter variations, as well as minor procedural modifications, distinguish the two studies. Ewert et al. (2002) focused on frequency effects (of both signal and masker). Here, we explicitly considered the effect of masker level (i.e., the masker rms modulation depth) and the consequences of systematically controlling the availability of envelope-detection cues. Thresholds for three conditions were measured: (1) SAM detection with a fixed-level modulation masker, (2) SAM detection with a random 10-dB within-trial rove in masker level, and (3) SAM detection with EEE in the standard and target intervals (after the signal was added). The roving-level condition effectively made envelope energy an unreliable cue; the EEE condition strongly attenuated first-order envelope energy differences as a cue for detection. Thresholds from the fixed-level condition provided a baseline for evaluating the consequences of these two manipulations. Note that the fixed-level condition was comparable to those of previous studies (e.g., Ewert et al., 2002).

B. Results and discussion

1. Fixed-level modulation masker

General trends in the results were similar across the four listeners, but individual sensitivity varied considerably in the masked-AM detection task. Both individual (upper panel) and mean thresholds (lower panel) are shown in Fig. 1 for the detection of a 64-Hz sinusoidal modulation in the presence of an additional masker modulation. The masker had a bandwidth of 32 Hz, and was always centered on the signal frequency. Signal thresholds are shown for a 10-dB range of masker modulation depths.

FIG. 1. Individual (top panel) and mean (bottom panel) masked-SAM detection sensitivity. Thresholds at these supra-threshold masker depths increased at a rate of about 1 dB (20 log m) per 1 dB (masker rms); the dashed lines in the two panels serve as a reference with a 1 dB/dB slope. Signal f_m = 64 Hz; masker bandwidth = 32 Hz, centered on signal frequency; SPL = 65 dB; carrier f_c = 2800 Hz; duration = 600 ms. Standard deviations of individual listener threshold estimates were between 2 and 4 dB (error bars omitted for clarity).

Thresholds increased monotonically as the masker level increased over this range of masker depths. Listener S4 was less sensitive than the other three subjects, while the thresholds of Subject S3 increased at a rate less than 1 dB/dB. Mean thresholds were 1-2 dB (20 log m) lower than the masker modulation depth (dB rms), and increased with a slope of 1 dB/dB. These results are consistent with those of Houtgast (1989), who measured detection thresholds for an 8-Hz sinusoidal signal modulation in the presence of a 2.8-Hz bandwidth masker modulation. In contrast with the present study, Houtgast (1989) combined the signal and masker multiplicatively and imposed them on a noise carrier.

Somewhat less intuitive are the patterns of thresholds measured for lower-level maskers. In efforts to map out the entire range of masker modulation depths that produced masking while still avoiding overmodulation, for the purpose of the roving-level experiment (to follow), it became clear that some of the listeners’ masked thresholds were lower than their pure AM-detection thresholds. This “facilitation” is illustrated in Fig. 2 in the form of nonmonotonic threshold versus masker level functions for two of the four listeners (S2 and S3). The thresholds for the three right-most points in each function are replotted from Fig. 1. Unmasked detection thresholds ranged from -25 to -30 dB (masker level=-99 dB rms; left-most point on each plot), and were consistent with previously reported pure-tone SAM detection thresholds for comparable f_c, f_m, and SPL (e.g., Kohlrausch et al., 2000). The external variability of the noise maskers began to influence thresholds between -40 and -30 dB rms. The presence of the region of facilitation was not related to absolute sensitivity to AM; the two subjects who exhibited the clearest facilitation had the lowest (S2) and highest (S3) thresholds in unmasked AM detection. In addition, the masker level that resulted in the most facilitation was the same for both listeners (-28 dB rms).

FIG. 2. Masked-detection thresholds for a wide range of masker modulation depths. Two of the listeners (S2 and S3) exhibited a nonmonotonic dependence of sensitivity on masker level; their thresholds were lower for a masker level of -28 dB than in the unmasked condition. The three right-most points in each panel are replotted from Fig. 1; these masker levels consistently caused “positive” masking without causing overmodulation.

Strickland and Viemeister (1996) and Bacon and Grantham (1989) reported facilitation in some of their tone-on-tone modulation masking conditions, when the frequency of the masker was well below that of the signal. They accounted for this type of negative masking by assuming that their listeners were able to attend to the valleys of the masker when its fluctuations were slow enough, resulting in a temporally localized larger effective modulation depth. The facilitation illustrated in Fig. 2 is fundamentally different: the masker and signal occupy the same frequency region, and inherent fluctuations in the narrow-band masker made the timing of its valleys unpredictable to our listeners. Also, the negative masking effects in previous studies increased as the masker modulation depth increased; the effect observed in the current study is only measurable at very low masker depths (near or even below detection thresholds). Potential mechanisms underlying on-frequency, low-level noise-masker facilitation will be evaluated in Sec. III.

2. Roving-level modulation masker

The effect of introducing a random 10-dB within-trial rove in masker level on listeners’ thresholds is shown in Fig. 3. Because the masker modulation depth was different in every interval, it was necessary to track on the level of the signal with respect to the level of the noise (i.e., the difference between the two in dB). Detection thresholds are plotted for a fixed-level (-18 dB rms) noise masker (filled bars) and for the roving level (uniformly distributed from -23 to -13 dB rms) noise masker (open bars). Individual and across-subject average thresholds are included in the figure.
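To illustrate why a within-trial rove makes envelope energy unreliable, the sketch below simulates an idealized observer that picks the interval with the larger ac-coupled envelope power. It is a deliberate simplification and not one of the models evaluated in the paper: it uses nominal powers (masker rms squared, plus m²/2 for the signal) and ignores the sample-to-sample variability of the noise masker and any internal noise. The function name and the parameter `snr_db` (signal level in 20 log m minus masker level in dB rms, the quantity tracked in this condition) are labels introduced here.

```python
import random

def energy_detector_pc(snr_db, rove_db, n_trials, rng, center_db=-18.0):
    """Proportion correct for an idealized envelope-energy detector."""
    correct = 0
    for _ in range(n_trials):
        # Independent masker levels in the standard and target intervals
        std_db = center_db + rng.uniform(-rove_db / 2, rove_db / 2)
        tgt_db = center_db + rng.uniform(-rove_db / 2, rove_db / 2)
        m = 10 ** ((tgt_db + snr_db) / 20)           # signal depth re: target masker
        p_std = (10 ** (std_db / 20)) ** 2           # masker-alone envelope power
        p_tgt = (10 ** (tgt_db / 20)) ** 2 + m * m / 2
        correct += p_tgt > p_std                     # choose the higher-energy interval
    return correct / n_trials

rng = random.Random(0)
pc_fixed = energy_detector_pc(-3.0, 0.0, 2000, rng)     # no rove
pc_roved = energy_detector_pc(-3.0, 10.0, 20000, rng)   # 10-dB rove, -23 to -13 dB rms
```

Without the rove, this noiseless observer is always correct, because adding the signal can only increase the nominal envelope power. With the 10-dB rove, the small energy increment from the signal is frequently swamped by the level difference between intervals, so an energy-based observer needs a higher signal-to-masker ratio to reach the same percent correct.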

In general, thresholds in the roving-level condition were 3-5 dB higher than those in the fixed-level case. The effect was significant (t-test, p<0.02) for three of the four individual listeners, and highly significant (p<0.0001) when the across-subject mean and variance were considered. The 10-dB rove in masker level increased the mean thresholds by 4 dB. Unfortunately, the small dynamic range of AM maskers precluded the use of larger rove ranges in the present study (i.e., the masker must be intense enough to cause masking, but not so strong as to result in overmodulation, especially in the signal interval). Despite the limitations, the significant effect of this relatively small rove range contrasts with results from audio-frequency TIN detection experiments, where even a 32-dB rove in masker level did not significantly affect listeners’ thresholds (except at the narrowest bandwidth tested, Kidd et al., 1989). The convincing results of Kidd et al. (1989) provide a critical test that challenges the power spectrum model of masking in the audio-frequency domain. Qualitatively, models that assume the long-term energy of the (ac-coupled) envelope as the perceptually relevant quantity (e.g., Viemeister, 1979; Ewert and Dau, 2002) are not seriously challenged by the current results obtained with the roving-level modulation masker. A more careful analysis of this general statement is provided in Sec. III.

3. Equalized-envelope-energy modulation masker

As an alternative approach to test energy-based models, (long-term) first-order envelope cues were removed by forcing the rms modulation depth of the standard and target intervals to be the same, regardless of the level of the signal (in 20 log m). The task was the same as in the fixed-level and roving-level conditions: listeners chose the interval containing the sinusoidal signal modulation. Pure long-term energy decision statistics did not provide any cues for detection in this paradigm (as long as the masker bandwidth was within the passband of the envelope-filtering process). Overmodulation was not an issue in the EEE condition: the average depth in both standard and target intervals was determined by the depth of the masker-alone modulation. Qualitatively, the signal-interval envelope fluctuations became more sinusoidal as m increased (but the overall rms modulation depth was the same in both the standard and target envelopes).
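The equalization step can be sketched as a single gain applied to the ac-coupled target envelope. The exact scaling procedure is not spelled out in this section, so the approach below (scaling signal plus masker to the nominal masker-alone rms) is an assumption, and the toy two-component "masker" is only a deterministic stand-in for the narrow-band noise. All names are hypothetical.

```python
import math

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def equalize_envelope_energy(signal_ac, masker_ac, target_rms):
    """Scale (signal + masker) so its rms modulation depth equals target_rms."""
    combined = [s + m for s, m in zip(signal_ac, masker_ac)]
    gain = target_rms / rms(combined)
    return [gain * v for v in combined]

# Toy example at a 1-kHz envelope sampling rate, 600-ms duration
fs, n = 1000.0, 600
signal = [0.2 * math.sin(2 * math.pi * 64 * i / fs) for i in range(n)]
masker = [0.1 * math.sin(2 * math.pi * 60 * i / fs) +
          0.1 * math.sin(2 * math.pi * 70 * i / fs + 1.0) for i in range(n)]
standard_rms = 10 ** (-13.0 / 20)        # -13 dB rms standard depth, as in Fig. 4(a)
target_ac = equalize_envelope_energy(signal, masker, standard_rms)
```

After this step a long-term rms statistic takes the same value in both intervals by construction. Any remaining cue must come from the shape of the envelope, consistent with the observation that the target envelope becomes more sinusoidal as m increases.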

Example waveforms for a -13-dB rms standard depth are illustrated in Fig. 4(a) along with individual listener and mean thresholds for a 10-dB range of masker-alone modulation depths [Fig. 4(b)]. Note that absolute thresholds are not plotted in Fig. 4; instead, increases in threshold over the corresponding fixed-level masker condition are shown. The key result illustrated in Fig. 4 is that the listeners were able to perform the task, although measured thresholds were about 10 dB worse on average than in the fixed-level condition (in which the overall rms modulation depth was allowed to naturally vary across intervals). Perhaps the most striking aspect of the individual thresholds is the high variability both within and across listeners (note the expanded scale of the y axis). Anecdotally, the task became considerably more difficult in the EEE condition, and listeners reported the use of a very different strategy compared to that employed in the fixed-level case. The following sections quantitatively explore potential cues that could explain thresholds in all three masker configurations (fixed-level, roving-level, and EEE).