The Journal of the Acoustical Society of America. 2020 Sep 17;148(3):1552–1566. doi: 10.1121/10.0001971

Spectro-temporal glimpsing of speech in noise: Regularity and coherence of masking patterns reduces uncertainty and increases intelligibility

Daniel Fogerty 1,a), Victoria A Sevich 2, Eric W Healy 2
PMCID: PMC7500957  PMID: 33003879

Abstract

Adverse listening conditions involve glimpses of spectro-temporal speech information. This study investigated whether the acoustic organization of the spectro-temporal masking pattern affects speech glimpsing in “checkerboard” noise. The regularity and coherence of the masking pattern were varied. Regularity was reduced by randomizing the spectral or temporal gating of the masking noise. Coherence involved the spectral alignment of frequency bands across time or the temporal alignment of gated onsets/offsets across frequency bands. Experiment 1 investigated the effect of spectral or temporal coherence. Experiment 2 investigated independent and combined factors of regularity and coherence. Performance was best in spectro-temporally modulated noise having larger glimpses. Generally, performance also improved as the regularity and coherence of masker fluctuations increased, with regularity having a stronger effect than coherence. An acoustic glimpsing model suggested that the effect of regularity (but not coherence) could be partially attributed to the availability of glimpses retained after energetic masking. Performance tended to be better with maskers that were spectrally coherent as compared to temporally coherent. Overall, performance was best when the spectro-temporal masking pattern imposed even spectral sampling and minimal temporal uncertainty, indicating that listeners use reliable masking patterns to aid in spectro-temporal speech glimpsing.

I. INTRODUCTION

Models of speech intelligibility can vary widely in how they predict perceptual performance. For example, the speech intelligibility index (SII) (ANSI, 1997) predicts intelligibility on the basis of long-term level differences between the target and masker, weighted by the perceptual importance of individual frequency bands. This has been extended to consider temporal fluctuations of the speech and noise by calculating the SII for short time windows (Rhebergen and Versfeld, 2005). In addition, glimpsing models have become widely used to define the energetic masking effects of noise within small spectro-temporal units (e.g., Cooke, 2006; Brungart et al., 2009; Tang et al., 2016). These glimpsing methods predict intelligibility based on the proportion of speech information “available” to the listener. However, a diverse set of studies suggests that speech recognition in noise is not only a function of the amount of speech information available, but also the organization of that information. Multiple mechanisms might underlie the perceptual processing of that organization, such as context and predictive effects (see review by Stilp, 2020), attentional mechanisms (e.g., Cusack et al., 2004; Shamma et al., 2011), and perceptual grouping cues (see review by Cooke and Ellis, 2001), among others. The current study provides a demonstration of how masker spectro-temporal organization, defined along acoustic dimensions of regularity and coherence, might influence speech recognition beyond that expected by the proportion of the signal masked, as defined by energetic masking.
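The idea of a proportion of “available” speech information can be illustrated with a minimal sketch: count the spectro-temporal units in which the local speech level exceeds the local noise level by some criterion. The 3-dB criterion, the function name, and the toy level values below are assumptions chosen for illustration only; they are not parameters or data from the studies cited above.

```python
def glimpse_proportion(speech_db, noise_db, criterion_db=3.0):
    """Fraction of time-frequency units whose local SNR exceeds criterion_db.

    speech_db and noise_db are equally sized 2-D lists (rows = frequency
    bands, columns = time frames) of levels in dB. The criterion value is
    an illustrative assumption.
    """
    total = 0
    glimpsed = 0
    for speech_row, noise_row in zip(speech_db, noise_db):
        for s, n in zip(speech_row, noise_row):
            total += 1
            if s - n > criterion_db:
                glimpsed += 1
    return glimpsed / total


# Toy example: 2 bands x 3 frames (hypothetical levels in dB).
speech = [[60, 55, 40], [62, 50, 45]]
noise = [[50, 70, 70], [70, 40, 70]]
gp = glimpse_proportion(speech, noise)  # 2 of 6 units exceed the criterion
```

A metric of this kind depends only on how many units survive masking, not on how those units are arranged in time and frequency, which is the distinction the present study is designed to probe.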

A. Spectro-temporal glimpsing

Understanding speech in adverse listening conditions often requires listeners to glimpse speech information distributed in time and frequency. For 70 years, investigators have sought to define the parameters that explain how listeners glimpse speech information in fluctuating backgrounds. The seminal study by Miller and Licklider (1950) demonstrated that speech recognition is largely determined by the proportion of speech preserved from the noise and the rate of noise fluctuation (i.e., gating rate). This study of temporal gating was later extended to spectro-temporal gating by Howard-Jones and Rosen (1993), who created a “checkerboard noise,” referring to its visual representation on the spectrogram. This noise was gated on and off within a frequency band, with adjacent bands gated out-of-phase. The creation of the checkerboard-modulated masker allowed for the study of the spectro-temporal masking parameters that influence glimpsing. Subsequently, speech recognition with checkerboard filtering of speech (rather than a checkerboard-noise masker) has also been investigated (Buss et al., 2004; Hall et al., 2008b; Ozmeral et al., 2012). However, these previous studies have been fairly limited in the stimulus parameters that have been used. For example, as acknowledged by Ozmeral et al. (2012), studies have primarily used periodic gating, which is not representative of the non-regular glimpsing that occurs in natural listening environments.

In order to investigate the perceptual effect of non-regular glimpsing, the acoustic parameters detailed below are used to define the acoustic organization of the masking pattern.

B. Defining regularity and coherence for checkerboard maskers

It is first important to define two sets of terms. These terms apply to both the spectral and the temporal gating of the spectro-temporal masker. The spectrogram examples in Fig. 1 may also help clarify.

FIG. 1.

Spectrograms displaying the parameters of regularity and coherence for the spectral and temporal dimensions.

First, regularity will refer to the manner in which the divisions occur on the frequency or time axis. Random or irregular spectral filtering results in frequency bands that are different in size across the spectrum at a given time point, whereas non-random or periodic filtering results in bands that are all the same size (in ERBNs). Random or irregular temporal gating results in time divisions that are different in size across the masker, whereas non-random or periodic gating results in time divisions that are equal in duration (isochronous). This manipulation of regularity is seen in the left half of Fig. 1, where periodic spectral or temporal gating results in individual masker “blocks” of equal size, whereas random gating results in unequal sized “blocks.” Thus, reduced regularity increases glimpse variability.

Second, coherence will refer to the existence of commonality across the spectral or temporal dimension. Spectral coherence will refer to the fact that the frequency bands are locked across the utterance, and changes from speech to noise occur within those bands for the duration of an utterance. This locked spectral coherence (constant spectral filtering) is highlighted in the third panel of Fig. 1 by the black box drawn on the spectrogram. Temporal coherence will refer to the fact that temporal onsets are aligned and that changes from speech to noise (or noise to speech) occur simultaneously across all frequencies. The black box in the rightmost panel of Fig. 1 depicts this common onset/offset of noise bands across frequency. Note that given the alternating nature of the checkerboard gating, it is the gating period (or band grouping) that is coherent.

As a final illustration of coherence, it can be observed that the spectral coherence example in Fig. 1 lacks temporal coherence. That is, there is no time point at which all frequency bands turn on/off synchronously. Likewise, the temporal coherence example lacks spectral coherence, as frequency-band cutoffs vary across time. A spectro-temporal masker that has both spectral and temporal coherence would resemble a true checkerboard.

C. Organization of the masking pattern

As defined above, masking patterns can vary based on regularity and coherence in the spectral and temporal domains. However, while there is some symmetry across these domains, this should not imply that there are symmetric perceptual processes involved across the two domains. This section details evidence that several different perceptual mechanisms are likely associated with these different parameters. Overall, the studies reviewed here suggest that masking patterns affect intelligibility beyond that accounted for by the proportion of speech information retained after accounting for energetic masking. However, while potential mechanisms are reviewed here as evidence that these acoustic parameters of spectro-temporal masking may be important to consider, detailed examination of the various perceptual mechanisms associated with different aspects of the masking pattern is beyond the scope of this preliminary investigation.

1. Regularity

Non-regular masking patterns were investigated as early as Miller and Licklider (1950), who acknowledged that random temporal gating is more representative of natural listening environments. They included a few conditions to investigate the potential effect of this decreased regularity. Their preliminary conclusion was that randomization of temporal gating intervals had the small effect of creating a shallower, inverted u-shaped intelligibility function across different average gating rates of the noise.

Later studies showed a significant effect of masker temporal regularity. For example, detectability of a target tone changes over different temporal delays dependent on the modulation rate of a preceding amplitude-modulated noise masker, such that tone detection is best at delays corresponding to the phase of prior masker dips (Hickok et al., 2015). In other words, preceding masker regularity facilitates detection of a future target. Similarly, enhancing temporal regularity can facilitate glimpsing of a target in a masker leading to increased speed and accuracy for detecting the expected signal (Barnes and Jones, 2000). For speech, decreased temporal regularity of amplitude-modulated noise maskers can interfere with speech glimpsing (Shen and Pearson, 2019). Furthermore, altering temporal rhythms of target and/or competing talkers also affects speech intelligibility (McAuley et al., 2020). Results such as these suggest that listeners are adept at tracking regularities in the acoustic environment (e.g., Jones et al., 2002; Arnal et al., 2014; ten Oever et al., 2014) and form the basis of a rhythmic theory of attention (Jones, 1976). Additional findings of neural synchrony (or entrainment) to auditory patterns support these behavioral observations (e.g., Ward, 2003; Neuling et al., 2012; Lakatos et al., 2013; Weisz and Obleser, 2014) and have led to oscillatory-based theories of perception (e.g., Engel et al., 2001; Ghitza, 2011; Giraud and Poeppel, 2012).

Overall, regularity in the stimulus environment (i.e., signal and masker) appears to facilitate perceptual processing. Thus, masker manipulations that interfere with the regularity in frequency and time of spectro-temporal speech glimpses may hinder speech recognition. However, because the above studies largely focused on temporal regularity, less is known about regularity in the spectral or spectro-temporal domains.

Recently, Fogerty et al. (2018) investigated speech recognition in spectro-temporally modulated noise having periodic or random gating across a range of spectral bandwidths and gating rates. Overall, speech recognition improved as the spectro-temporal “holes” in the masker became larger (i.e., lower gating rates with larger frequency bandwidths). They also demonstrated that random spectro-temporal gating decreased performance relative to periodic gating. However, as noted in that study, the random spectro-temporal gating employed confounded regularity and coherence across the spectral and temporal domains. Therefore, the current study is needed to detail how different dimensions of the masking pattern affect intelligibility.

2. Coherence

In addition to the work involving regularity, work also exists to suggest that coherence can improve speech perception in noise. Spectral coherence in the masking pattern constrains speech glimpses within reliable (i.e., highly probable) frequency bands over time. Thus, speech recognition is better when speech glimpses originate from a fixed frequency band (i.e., spectrally coherent), rather than variable frequency bands (Li and Loizou, 2007). One possible explanation for this effect is an attentional tuning to highly probable frequency bands, as shown for tone detection studies (Sorkin et al., 1968; Dai et al., 1991) and detailed in auditory cortex (Da Costa et al., 2013; Riecke et al., 2017). Providing consistent glimpses from a specific frequency band over time may effectively sharpen attentional bands to facilitate speech glimpsing. Of course, there are important methodological differences between speech glimpsing in checkerboard noise and these studies. However, they suggest that a perceptual mechanism might exist to capitalize on the reliability of spectral divisions over time.

With regard to temporal coherence, work in perceptual and cortical processing suggests that common onsets are important for auditory scene analysis and object formation (e.g., Elhilali et al., 2009; see also Shamma et al., 2011). In natural environments, a common onset of noise across frequency bands can improve perception by helping to bind the noise components across frequency bands (e.g., Gordon, 1997). This binding can assist in the formation of the noise into a single auditory object and its segregation from the speech. However, Buss et al. (2003) demonstrated that temporal coherence of masker amplitude modulations in narrow frequency bands can reduce open-set word recognition. Additionally, with checkerboard noise, this temporal coherence results in common onsets for speech and noise across alternate frequency bands. These common onsets could potentially result in the perceptual binding together of speech and noise bands, thereby increasing masking properties of the noise.

While these theories are not directly tested here, they offer potential mechanisms that might underlie any differences observed due to variations in the acoustic organization of checkerboard masking patterns. Performance differences resulting from manipulations of the masking pattern would suggest that attentional, grouping, or other perceptual mechanisms influence how speech glimpses are used by the listener beyond what can be explained by a simple description of energetic masking.

D. The current study

As highlighted in the foregoing discussion, the manipulation of masker regularity and coherence is related to a number of theories from the existing literature. There are significant differences between previous studies that have most often investigated one property independently, e.g., temporal regularity, and the current implementation in which both spectral and temporal domains are considered. However, based on the literature surrounding the reviewed theories, we make the following predictions regarding performance under the spectro-temporal masking conditions defined in the current study.

  • (1) Regarding spectral regularity, it is not clear if there is an advantage when each speech band has equal bandwidth. However, it is known that an even, sparse sampling of the frequency spectrum is advantageous (e.g., Warren et al., 1995; Warren et al., 1997; Warren and Bashford, 1999; Healy and Warren, 2003; Warren et al., 2005; Bashford et al., 2000; Humes and Kidd, 2016), which suggests some advantage for spectral regularity.

  • (2) Regarding temporal regularity, rhythmic and oscillatory theories of perception suggest a facilitation of speech recognition based on attentional entrainment to the regular timing of glimpses.

  • (3) Spectral coherence imposes a consistency of frequency band cutoffs over time (see the black box drawn in Fig. 1) that may help to sharpen attention to frequency bands in which speech glimpses occur with high probability.

  • (4) Temporal coherence provides a mechanism for the perceptual binding of speech glimpses due to common onsets. However, the asynchronous nature of checkerboard processing may offset this advantage by facilitating the grouping of bands containing speech and noise based on shared common onsets.

Therefore, the combination of these masker parameters is related to several different theories of auditory attention, grouping, and masking that have the potential to influence speech glimpsing when translated to the spectro-temporal domain. As a beginning investigation, the purpose of this study was to define the perceptual effects of acoustic manipulations of regularity and coherence such that they can be further developed for consideration in speech intelligibility models. Thus, the overarching hypothesis of this study is that the organization of the masking pattern (defined by regularity and coherence) influences speech recognition in noise. The current study represents a critical test as to whether masking patterns play a role in speech glimpsing beyond the proportion of speech information available for perception.

II. EXPERIMENT 1: SPECTRO-TEMPORAL COHERENCE

Experiment 1 examined the effect of spectral and temporal coherence for stimuli with reduced spectro-temporal regularity. In this experiment, stimuli employed random gating in both spectral and temporal domains, similar to the random spectro-temporal conditions from Fogerty et al. (2018). One condition contained only spectral coherence, one condition contained only temporal coherence (as in Fogerty et al., 2018), and one condition contained neither coherence. A fully coherent spectro-temporal masker condition was also included (i.e., the original checkerboard noise, “periodic”). Thus, Experiment 1 was designed to investigate the effects of masker spectro-temporal coherence across different gating rates.

Consistent with the earlier findings from Fogerty et al. (2018), our hypothesis was that performance would be better at lower gating rates that result in larger glimpses. We also predicted that the strong onset cues induced by temporal coherence of the masker and speech would interfere the most with speech recognition, due to the perceptual binding of noise and speech. Following this potential mechanism, the temporal coherence present during periodic, fully coherent (checkerboard) maskers may result in more effective masking than those conditions in which strong onset cues are absent (i.e., only spectral coherence).

A. Participants

Nine normal-hearing young adult listeners participated in the experiment (mean age = 19.4 years, range = 18–20 years, all female). They were recruited from courses at The Ohio State University and received course credit for participating. All listeners were native speakers of American English and had audiometric thresholds of 20 dB hearing level (HL) or better at octave frequencies between 250 and 8000 Hz (ANSI, 2004, 2010). No subjects had prior exposure to the sentence materials.

B. Stimuli and design

Speech stimuli consisted of sentences spoken by the standard male talker from the Hearing in Noise Test (HINT) (Nilsson et al., 1994). This test consists of 25 lists of ten sentences each, plus practice lists of ten sentences each. The experimental design consisted of a 3 (temporal gating rate) × 4 (coherence condition) factorial design. The temporal gating rates used to determine the on/off rate of the checkerboard noise were 2, 4, and 8 Hz. The rates corresponded to a full on/off cycle of the noise. These rates were selected from Fogerty et al. (2018) because they yielded the largest behavioral difference between periodic and random gating conditions. The noise bandwidth for all conditions was set at 3 ERBN. The four coherence conditions were (1) full coherence (or periodic, P), which implemented coherence in both spectral and temporal domains, (2) spectral coherence (SC), such that the same frequency bands were grouped together across time, (3) temporal coherence (TC), such that all frequency bands turned on or off at the same time, and (4) no coherence (NC). The on/off duty cycle was always 50%. Stimulus examples of these four coherence conditions for a 2-Hz gating rate are displayed in Fig. 2 (see footnote 1). Noise masker audio files for all experiments are provided via the Open Science Framework (see footnote 2).

FIG. 2.

Spectrograms of an example section of the 2-Hz/3-ERBN spectro-temporal modulated checkerboard noise for the four different coherence conditions.

Generation of the checkerboard noise followed the signal processing implemented by Fogerty et al. (2018). A continuous speech-shaped noise (SSN) was generated that matched the long-term average amplitude spectrum of the HINT sentences. This SSN was then filtered into 60 contiguous ½-ERBN bands ranging from 80 to 7562 Hz using a bank of linear-phase finite impulse-response bandpass filters.
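The band layout can be checked with a short sketch. It assumes the widely used ERB-number scale of Glasberg and Moore (1990), E = 21.4 log10(4.37 f/1000 + 1) with f in Hz; the text above does not state which formula was used, so this choice and the function names are assumptions. Under that assumption, the 80 to 7562 Hz range spans about 30 ERBN, which corresponds to the 60 half-ERBN bands described.

```python
import math


def hz_to_erb_number(f_hz):
    """ERB-number (Cams) for a frequency in Hz (assumed Glasberg-Moore scale)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_number_to_hz(erb_number):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (erb_number / 21.4) - 1.0) * 1000.0 / 4.37


LOW_HZ, HIGH_HZ = 80.0, 7562.0   # range stated in the text
BAND_WIDTH_ERB = 0.5             # half-ERBN analysis bands

low_e = hz_to_erb_number(LOW_HZ)
high_e = hz_to_erb_number(HIGH_HZ)
n_bands = round((high_e - low_e) / BAND_WIDTH_ERB)

# Band edges in Hz, equally spaced on the ERB-number scale.
band_edges_hz = [erb_number_to_hz(low_e + k * BAND_WIDTH_ERB)
                 for k in range(n_bands + 1)]
```

The sketch covers only the band edges; the filter design itself (filter order, windowing) is not specified here and is not reproduced.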

To create the fully coherent periodic checkerboard noise, six adjacent ½-ERBN noise bands were grouped together into a single spectral region of 3 ERBN. A second, adjacent spectral region was also created by grouping another six ½-ERBN bands. The set of ½-ERBN bands within a spectral region was temporally modulated in-phase, with the two different spectral regions temporally modulated 180° out-of-phase from each other. For example, at a given time, the state of the 12 ½-ERBN noise bands was [111111000000], such that six bands within a spectral region, with a cumulative bandwidth of 3 ERBN, were either on (“1”) or off (“0”). This pattern was repeated across five such pairs of spectral regions (12 bands within a pair × 5 pairs = 60 ½-ERBN bands). The remaining three coherence conditions reduced regularity by implementing random gating in frequency and time.
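The periodic pattern described above can be sketched as an on/off state for each of the 60 half-ERBN bands at a given time. This is a minimal illustration, not the authors' processing code: the function names are hypothetical, the choice of which region is “on” at time zero is an assumption, and the onset/offset ramps are ignored.

```python
N_BANDS = 60            # half-ERBN bands
BANDS_PER_REGION = 6    # six half-ERBN bands = one 3-ERBN spectral region
GATE_RATE_HZ = 2.0      # one full on/off cycle per 0.5 s


def periodic_state(band, t_sec, rate_hz=GATE_RATE_HZ):
    """Return 1 (noise on) or 0 (noise off) for one band at time t_sec."""
    region = band // BANDS_PER_REGION
    cycle_position = (t_sec * rate_hz) % 1.0
    first_half = cycle_position < 0.5       # 50% duty cycle
    even_region = region % 2 == 0
    # Adjacent regions are gated 180 degrees out of phase.
    return 1 if first_half == even_region else 0


def band_states(t_sec):
    """On/off states of all 60 bands at time t_sec."""
    return [periodic_state(b, t_sec) for b in range(N_BANDS)]


first_pair = "".join(str(s) for s in band_states(0.0)[:12])
print(first_pair)  # 111111000000
```

Half a gating cycle later (0.25 s at 2 Hz) the same 12 bands read 000000111111, which gives the checkerboard appearance on the spectrogram.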

Regarding randomization in the spectral domain, two adjacent spectral regions (each with a 3-ERBN bandwidth) were grouped together with half the region on and half the region off. Spectral randomization occurred by circularly shifting which of the 12 ½-ERBN bands were turned on or off. Thus, the band state prior to randomization, [111111000000], could be [111100000011] following randomization. Each grouped spectral region of 12 ½-ERBN noise bands within a temporal-gating period used a different spectral randomization. Whereas the same spectral randomizations were used across time for the SC condition, different spectral randomizations were used for each temporal-gating period for TC and NC conditions.
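The circular shift can be sketched as follows. The helper names, the use of a uniformly drawn shift (which can include a shift of zero), and the seed are illustrative assumptions; the text does not specify how the shift amounts were drawn.

```python
import random

# One pair of 3-ERBN regions before randomization: six bands on, six off.
BASE_PAIR = [1] * 6 + [0] * 6


def circular_shift(states, k):
    """Rotate the on/off states left by k bands, wrapping around."""
    k %= len(states)
    return states[k:] + states[:k]


def random_pair(rng):
    """One randomized 12-band group (shift drawn uniformly; an assumption)."""
    return circular_shift(BASE_PAIR, rng.randrange(len(BASE_PAIR)))


def random_spectral_pattern(rng, n_pairs=5):
    """States of all 60 bands, with an independent shift per 12-band group."""
    pattern = []
    for _ in range(n_pairs):
        pattern.extend(random_pair(rng))
    return pattern


example = circular_shift(BASE_PAIR, 2)
print("".join(str(s) for s in example))  # 111100000011
```

In this sketch, an SC-like masker would draw `random_spectral_pattern` once and reuse it for every gating period, whereas TC-like and NC-like maskers would draw a new pattern for each gating period. Because the shift is circular, half of the bands are always on, so the 50% proportion of masked spectrum is preserved.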

Regarding randomization in the temporal domain, the starting phase of the temporal gating within each full gating cycle was circularly shifted by a random phase. This same random phase was used for all bands in the TC condition; whereas, a different random phase was determined for each grouped set of 12 ½-ERBN noise bands in the SC and NC conditions.
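A minimal sketch of this temporal randomization is given below. Each full gating cycle receives its own random phase shift; a TC-like masker shares one list of phases across all band groups, whereas SC-like and NC-like maskers draw a separate list per 12-band group. The function names, the uniform distribution of phases, and the seed are assumptions for illustration, and the ramps are again ignored.

```python
import random


def gate_state(t_sec, rate_hz, cycle_phases):
    """On/off state (1 or 0) of a 50% duty-cycle gate at time t_sec.

    cycle_phases[i] is the circular phase shift (0 to 1) applied within
    the i-th full gating cycle.
    """
    cycle_position = t_sec * rate_hz
    cycle_index = int(cycle_position)
    within_cycle = cycle_position - cycle_index
    shifted = (within_cycle + cycle_phases[cycle_index]) % 1.0
    return 1 if shifted < 0.5 else 0


def draw_phases(rng, n_cycles):
    """One random phase per full gating cycle (uniform; an assumption)."""
    return [rng.random() for _ in range(n_cycles)]


rng = random.Random(1)
n_cycles, n_pairs = 4, 5

# TC-like: every 12-band group shares the same per-cycle phases.
tc_phases = [draw_phases(rng, n_cycles)] * n_pairs

# SC-like and NC-like: each 12-band group gets its own per-cycle phases.
sc_nc_phases = [draw_phases(rng, n_cycles) for _ in range(n_pairs)]
```

Within each 12-band group, the second 3-ERBN region would take the complementary state (1 minus the gate state), preserving the out-of-phase alternation within the pair.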

These stimulus processing procedures resulted in defining coherence for the wideband signal across time. However, some local coherence remained due to the on/off alternation of noise characteristic of checkerboard processing. For SC, temporal coherence was limited to be within a pair of 3 ERBN bands. For TC, spectral coherence was limited to be within a full temporal-gating period. For NC, spectro-temporal coherence was limited to be within a pair of 3 ERBN bands and a full temporal-gating period. Thus, these conditions limited coherence based on the full spectrogram, but did not remove all locally coherent noise modulations.

The noise was turned on and off using a 5-ms raised-cosine ramp, such that the “on” period of each noise interval was measured from the 6-dB down point of the modulated rise or fall. Non-gated SSN was also tested for comparison.
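The relation between the raised-cosine ramp and the 6-dB down point can be sketched as follows. The sketch assumes the ramp is applied to the amplitude envelope and uses a hypothetical 48-kHz sampling rate; neither detail is stated in the text. Under those assumptions, the gain at the ramp midpoint is 0.5, which is about 6 dB below the full “on” level.

```python
import math

FS_HZ = 48000      # hypothetical sampling rate (not given in the text)
RAMP_SEC = 0.005   # 5-ms ramp duration from the text


def raised_cosine_gain(frac):
    """Amplitude gain of an onset ramp at fractional position frac in [0, 1]."""
    return 0.5 * (1.0 - math.cos(math.pi * frac))


n_ramp = round(FS_HZ * RAMP_SEC)
onset_ramp = [raised_cosine_gain(i / (n_ramp - 1)) for i in range(n_ramp)]
offset_ramp = onset_ramp[::-1]   # time-reversed onset ramp

# Gain at the ramp midpoint, in dB relative to the full "on" level.
midpoint_db = 20.0 * math.log10(raised_cosine_gain(0.5))
print(round(midpoint_db, 2))  # -6.02
```

Measuring the “on” duration between the midpoints of the rise and fall therefore keeps the nominal duty cycle at 50% despite the finite ramp duration.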

C. Procedure

Participants were tested while seated with the experimenter in a sound-attenuating booth. Stimuli were presented diotically using Sennheiser HD 280 Pro headphones (Wedemark, Germany). The headphones were calibrated using a Larson Davis sound-level meter and headphone coupler (Depew, NY). The level of the non-gated SSN was set to 70 dBA. The “on” portions of the gated noise were equivalent to this level. As 50% of the noise was removed through spectro-temporal filtering, the overall level of the gated noise was −3 dB from the non-gated SSN (67 dBA). The speech stimuli were scaled to a long-term average level of 61 dBA, which resulted in an SNR of −9 dB relative to the level of the non-gated noise and −6 dB relative to the long-term average level of the gated noise.

Two comparison conditions were tested in non-gated SSN. In one condition, the 70 dBA noise was mixed with the 61 dBA speech, to produce −9 dB SNR. In the other condition, the noise was reduced in level by 3 dB to produce −6 dB SNR. Thus, the first non-gated SSN condition maintained energetic masking at the same SNR that existed during the “on” portions of noise. The second non-gated SSN condition mimicked the long-term average SNR in gated noise.
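The level relations described above follow from simple power arithmetic, sketched here. Removing half of the noise power lowers the long-term level by 10 log10(0.5), or about 3 dB. The variable names are illustrative.

```python
import math

NOISE_ON_DBA = 70.0   # level of the non-gated SSN and of the "on" portions
SPEECH_DBA = 61.0     # long-term average speech level
DUTY = 0.5            # proportion of noise retained after gating

# Long-term level of the gated noise: half the power is removed.
gated_noise_dba = NOISE_ON_DBA + 10.0 * math.log10(DUTY)

snr_re_on_portions = SPEECH_DBA - NOISE_ON_DBA      # -9 dB
snr_re_gated_level = SPEECH_DBA - gated_noise_dba   # about -6 dB

print(round(gated_noise_dba, 1), snr_re_on_portions, round(snr_re_gated_level, 1))
```

The two non-gated SSN comparison conditions correspond to these two SNR values.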

To allow for any perceptual tracking or adaptation to the masker regularity, the presentation of the noise began several seconds before presentation of the speech and was continuously on during a given condition. Each subject was presented with a random alignment between the speech and the spectro-temporal noise along with a different sentence list-to-condition correspondence. Participants were instructed to repeat each sentence as best they could and to guess if necessary. The experimenter controlled the stimulus presentation and recorded correct responses. Performance was based on percentage of all words correctly reported. Sentences contained from six to ten words.

A practice list was provided, first consisting of five HINT sentences in a quiet background, followed by ten HINT sentences in the same background noise as the first-heard condition at a rate of 4 Hz. This was followed by the 14 condition blocks (3 temporal gating rates × 4 coherence conditions + 2 SSN conditions = 14), with 15 sentences in each block. Conditions were presented in pseudorandom order for each subject, and no sentences were ever repeated to a given subject. Proportion correct for this and all subsequent experiments was transformed to rationalized arcsine units (RAU) to stabilize the error variance for statistical analysis (Studebaker, 1985).
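For readers unfamiliar with the transform, the rationalized arcsine unit of Studebaker (1985) can be sketched as below, together with the block count stated above. The function name and the example item counts are illustrative; the actual number of scored words per block varies with the sentences used and is not given here.

```python
import math


def rau(n_correct, n_items):
    """Rationalized arcsine units for n_correct out of n_items (Studebaker, 1985)."""
    theta = (math.asin(math.sqrt(n_correct / (n_items + 1)))
             + math.asin(math.sqrt((n_correct + 1) / (n_items + 1))))
    return (146.0 / math.pi) * theta - 23.0


# Number of condition blocks: 3 gating rates x 4 coherence conditions + 2 SSN.
n_blocks = 3 * 4 + 2

# Hypothetical scores out of 100 scored words.
mid_score = rau(50, 100)    # close to 50 near the middle of the range
floor_score = rau(0, 100)   # below 0: the scale extends past 0 and 100
ceil_score = rau(100, 100)  # above 100
```

Near the middle of the range, RAU values track percent correct closely, while scores near floor and ceiling are stretched, which is what stabilizes the error variance.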

D. Results

Results for Experiment 1 are displayed in Fig. 3. A three (temporal gating rate) by four (coherence) repeated-measures analysis of variance was completed on the RAU scores. Consistent with previous findings (Fogerty et al., 2018), a significant main effect of gating rate confirmed declining performance with increasing rate and correspondingly smaller glimpse sizes [F(2, 16) = 70.5, p < 0.001, ηp² = 0.90]. The significant main effect of condition [F(3, 24) = 12.4, p < 0.001, ηp² = 0.61] confirmed better performance with increasing coherence of the masker. That is, the best performance across all gating rates was obtained for the fully coherent periodic masker, whereas the NC masker resulted in the poorest performance collapsed across rates. Finally, the significant interaction [F(6, 48) = 2.7, p < 0.05, ηp² = 0.25] reflected the changing performance across coherence conditions with increasing gating rate. Specifically, whereas SC and TC maskers produced similar performance at the lowest 2-Hz gating rate, performance diverged at the higher rates. SC and fully coherent maskers approximated each other at the higher gating rates, plateauing to a performance level consistent with a steady-state masker matching the long-term average level of the gated noise (i.e., −6 dB SNR). On the other hand, TC and NC maskers produced similar performance at the higher rates. While these maskers still demonstrated significant masking release relative to the “on” portions of the masker (i.e., −9 dB SNR), performance at 8 Hz was poor relative to SC and P, and to the −6 dB SNR noise.
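The reported effect sizes can be cross-checked against the F ratios and degrees of freedom using the standard identity for partial eta squared, F × df_effect / (F × df_effect + df_error). The sketch below is only a consistency check on the values quoted above; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


rate_effect = partial_eta_squared(70.5, 2, 16)       # main effect of gating rate
condition_effect = partial_eta_squared(12.4, 3, 24)  # main effect of coherence
interaction = partial_eta_squared(2.7, 6, 48)        # rate x coherence

print(round(rate_effect, 2), round(condition_effect, 2), round(interaction, 2))
# 0.9 0.61 0.25
```

The recomputed values agree with the reported effect sizes to two decimal places.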

FIG. 3.

Group-mean word recognition scores (and standard errors) for the four spectro-temporal gating conditions across three gating rates. The dashed lines indicate performance in steady-state noise presented at the long-term average level of the checkerboard noise, −6 dB SNR (top), and at the level of the “on” portions of the gated noise, −9 dB SNR (bottom). P, periodic (or full coherence); SC, spectral coherence; TC, temporal coherence; NC, no coherence.

E. Discussion

Overall, for all conditions tested, the results of this experiment confirm the previously reported finding by Fogerty et al. (2018) that performance in spectro-temporally gated noise improves as the gating rate decreases below 4 Hz. In accord with the conclusions of the previous study, this effect appears to be due to larger preserved glimpses that may be less susceptible to masking—both simultaneous and non-simultaneous.

Whereas the current results of speech in filtered noise demonstrate better performance with larger glimpses, studies examining filtered speech in quiet have often found better recognition for smaller glimpses provided by narrow bands of speech that sufficiently sample the frequency spectrum (e.g., Warren et al., 1995; Warren et al., 1997; Warren and Bashford, 1999; Healy and Warren, 2003; Warren et al., 2005; Bashford et al., 2000; Humes and Kidd, 2016). This effect also holds true for checkerboard filtered speech using smaller frequency bands or higher temporal gating rates (Buss et al., 2004). Thus, listening to filtered speech in quiet appears to be fundamentally different from listening to continuous speech masked by filtered noise. Whereas both situations involve the ability to assemble partial aspects of speech information distributed in time and frequency, the latter also involves masking properties of the noise in which larger glimpses of speech are advantageous. Small, low-energy glimpses, while available with filtered speech, may be overcome by the presence of a temporally or spectrally adjacent intense noise.

This study demonstrates that speech recognition is best when the masker has the greatest coherence and regularity, i.e., the fully coherent spectro-temporally periodic masker. Spectro-temporal randomization decreased performance. Spectral coherence resulted in better performance than temporal coherence, but not to the level of the fully coherent periodic masker, especially at the lowest gating rate. Temporal coherence of the masker resulted in poor performance, particularly at the higher gating rates. Interpretation of this effect is complicated when considering the high performance observed with the periodic masker, which exerts both spectral and temporal coherence. The poor performance in TC was either not due to common onsets across frequency bands, which the periodic masker also has, or the hypothesized detrimental effect of common onsets was overcome in the periodic masker by additional masker regularity. The combination of the conditions tested here cannot distinguish between these explanations, although the former is more parsimonious. Finally, the condition that lacked any global spectro-temporal regularity and coherence (NC, which nonetheless retained some local coherence within a spectro-temporal period) resulted in the poorest overall performance.

Another factor that may be involved here is the possibility of modulation masking (e.g., Stone et al., 2012; Fogerty et al., 2016), in which the modulation rate of the masker interferes with the perceptual processing of important speech modulation around 4–8 Hz (e.g., Drullman et al., 1994a,b; Greenberg et al., 2003). Given that performance declines for 4- and 8-Hz gating for all coherence conditions, modulation masking would need to occur within frequency bands (i.e., does not require temporal coherence). That is, the temporal modulation is preserved within a ½-ERBN band, with minimal effect of the absence of comodulation across frequency bands. This would be consistent with interpretations that modulation masking occurs even for notionally “steady-state” maskers in which spectral or temporal coherence would not be expected (Stone et al., 2012). However, the possible role of modulation masking cannot be separated from the effect of smaller glimpse sizes in the current study and does not appear to fully explain differences across the coherence conditions tested here.

This experiment demonstrates that performance changes as coherence in the structure of the masker (or preserved speech glimpses) changes, with the best performance during spectro-temporally regular and coherent maskers (i.e., periodic). The organization of the masking pattern does appear to affect performance. However, all reduced-coherence conditions in this experiment (SC, TC, and NC) involved randomization in both the spectral and temporal domains. Experiment 2 was designed to investigate the independent contribution of these two domains in terms of the gating regularity and coherence.

III. EXPERIMENT 2: REGULARITY AND COHERENCE

Experiment 1 demonstrated a significant effect of masker gating coherence, with graded performance across SC, TC, and NC conditions. These conditions all used randomization in both domains to reduce spectro-temporal regularity, for comparison to the fully coherent periodic spectro-temporal masker. Therefore, from this experiment, it is difficult to isolate the effects of gating regularity in either the spectral or temporal domain. Furthermore, these independent effects of regularity may interact with the effects of coherence. Experiment 2 was designed to systematically separate these factors for independent investigation.

Experiment 1 provided the extremes of no randomization with complete spectro-temporal coherence in the periodic condition [Fig. 2(a)] to spectro-temporal randomization with no global coherence [Fig. 2(d)]. The detailed manipulations of Experiment 2 were designed to span the performance gap between these two conditions by independently varying the temporal and/or spectral domain used to define regularity and coherence. To best observe these effects, the gating rate with the greatest performance difference in Experiment 1 was selected for study (2 Hz).

Experiment 2a started with the periodic condition [Fig. 2(a)] as a baseline and reduced regularity (i.e., periodicity) or coherence (either spectral or temporal). Experiment 2b started with the spectro-temporally random conditions of Experiment 1 [Figs. 2(b)–2(c)] as a baseline and introduced regularity in either the spectral or temporal domain. The combination of these conditions provided a fairly comprehensive investigation of the effects of randomization and coherence in the spectral and temporal domains.

A. General procedures

Twenty new normal-hearing young adult listeners (mean age = 19.9 years, range = 19–21 years) from The Ohio State University participated in Experiments 2a and 2b during the same test session. All listener characteristics were identical to those in Experiment 1.

Calibration, apparatus, and testing procedures were identical to those of Experiment 1. Each listener heard 11 conditions (2 regularity types × 2 coherence types + 1 control in Exp. 2a, and 3 periodic regularity types × 2 coherence types in Exp. 2b). Fifteen sentences for each condition were selected from the HINT corpus. The speech stimuli were scaled to be presented at an SNR of −6 dB relative to the average level of the gated noise, and −9 dB relative to the level of the non-gated calibration noise, which was presented at 70 dBA. Fifteen practice sentences were again provided prior to experimental testing. Participants responded by repeating the sentences aloud and were scored online by the experimenter.
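The 3-dB offset between the two SNR references can be checked with a short calculation. This is only a sketch: it assumes that the gated noise is silent during its "off" portions and matches the calibration noise level during its "on" portions, so that the 50% duty cycle halves the average noise power. The symbols L_on, L_avg, SNR_on, and SNR_avg are introduced here for illustration and do not appear in the original text.

```latex
\begin{align*}
L_{\mathrm{avg}} &= L_{\mathrm{on}} + 10\log_{10}(0.5) \approx L_{\mathrm{on}} - 3\ \mathrm{dB},\\
\mathrm{SNR}_{\mathrm{avg}} &= \mathrm{SNR}_{\mathrm{on}} + 3\ \mathrm{dB} = -9\ \mathrm{dB} + 3\ \mathrm{dB} = -6\ \mathrm{dB}.
\end{align*}
```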

B. Experiment 2a

1. Stimuli and design

Noise maskers were gated according to the procedures described in Experiment 1, again using a 50% duty cycle and a 3-ERBN bandwidth. The temporal gating rate was 2 Hz. Experiment 2a consisted of five experimental conditions displayed in Fig. 4. The first condition [Fig. 4(a)] was a replication of the periodic spectro-temporal condition from Experiment 1. The next two conditions maintained coherence in both the spectral and temporal domains but preserved periodicity in only one dimension: spectral periodicity [Fig. 4(b)] or temporal periodicity [Fig. 4(c)]. The dimension that was not periodic was random. Following the stimulus processing procedures of Experiment 1, this randomization was introduced by selecting a random phase to circularly shift the gating interval for each grouping of ½-ERBN spectral bands or each period, for spectral or temporal randomization, respectively. (Recall that the 3-ERBN gating bandwidth was created by grouping twelve ½-ERBN bands together and assigning six consecutive bands to turn on or off together, prior to the circular randomization.) The final two conditions maintained spectro-temporal periodicity but preserved coherence in only a single dimension: spectral coherence [Fig. 4(d)] or temporal coherence [Fig. 4(e)]. The dimension that was not coherent was made incoherent by randomly selecting a different starting phase for the temporal gating of each spectral group [Fig. 4(d)] or a different starting phase for the spectral bands in each gating cycle [Fig. 4(e)].
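The gating logic described above can be illustrated with a small binary-mask sketch. It is not the stimulus-generation code used in the study: the function names, the frame-based time axis, the choice of 24 bands and 100 frames per cycle, and the use of whole-band and whole-frame circular shifts are all assumptions made for illustration. The first function builds a periodic checkerboard with six ½-ERBN bands per group and a 50% duty cycle; the other two apply a random starting phase per spectral group [in the spirit of Fig. 4(d)] or per gating cycle [in the spirit of Fig. 4(e)].

```python
import numpy as np

def periodic_checkerboard(n_bands=24, n_cycles=4, frames_per_cycle=100,
                          bands_per_group=6):
    """Boolean mask (bands x frames); True marks noise 'on' units."""
    band_group = np.arange(n_bands) // bands_per_group   # 3-ERBN group index
    half_cycle = frames_per_cycle // 2                   # 50% duty cycle
    time_half = np.arange(n_cycles * frames_per_cycle) // half_cycle
    # Alternate on/off across both group index and half-cycle index.
    return (band_group[:, None] + time_half[None, :]) % 2 == 0

def shift_temporal_phase_per_group(mask, bands_per_group=6, rng=None):
    """Hypothetical helper: random temporal starting phase per spectral group."""
    rng = np.random.default_rng(rng)
    out = mask.copy()
    for start in range(0, mask.shape[0], bands_per_group):
        shift = rng.integers(0, mask.shape[1])
        rows = slice(start, start + bands_per_group)
        out[rows] = np.roll(mask[rows], shift, axis=1)   # circular shift in time
    return out

def shift_spectral_phase_per_cycle(mask, frames_per_cycle=100, rng=None):
    """Hypothetical helper: random spectral starting phase per gating cycle."""
    rng = np.random.default_rng(rng)
    out = mask.copy()
    for start in range(0, mask.shape[1], frames_per_cycle):
        shift = rng.integers(0, mask.shape[0])
        cols = slice(start, start + frames_per_cycle)
        out[:, cols] = np.roll(mask[:, cols], shift, axis=0)  # shift in frequency
    return out

periodic = periodic_checkerboard()
sc_like = shift_temporal_phase_per_group(periodic, rng=0)
tc_like = shift_spectral_phase_per_cycle(periodic, rng=0)
```

Because every manipulation is a circular shift, the proportion of "on" units stays at 0.5 in all three masks; only the arrangement of the units changes. The default dimensions are chosen so that the band and frame counts are whole multiples of the gating periods, which keeps the wrap-around seamless.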

FIG. 4.


Spectrograms of an example section of the 2-Hz/3-ERBN spectro-temporal modulated checkerboard noise. (a) displays the periodic replication from Experiment 1 that had spectro-temporal (ST) coherence and periodicity; (b–e) display conditions that either removed coherence or periodicity by way of randomization in either the spectral or temporal domain.

Finally, whereas temporal manipulations resulted in a different pattern over time, spectral manipulations, even though randomly selected, resulted in the same spectral pattern over time. In order to minimize any potential effect that a particular spectral randomization resulted in more or less favorable glimpsing conditions for this particular talker, four different randomizations for conditions in Figs. 4(c) and 4(d) were created for testing. Individual participants were randomly assigned to one of these four versions to be used throughout the duration of their testing.

2. Results and discussion

Results for the four new experimental conditions are displayed in Fig. 5. The top dashed reference line displays the new data for replicating the periodic condition from Experiment 1. The bottom dashed line replots the Experiment 1 data for the no coherence condition. Again, these conditions represent extremes between which we explore. Planned, paired t-tests between each new experimental condition and the periodic condition confirmed that the effect of removing either spectral [Fig. 5(e), TC-STP] or temporal [Fig. 5(d), SC-STP] coherence was minimal (p > 0.05), relative to coherence in both dimensions (i.e., periodic). However, there was a significant reduction in performance due to disrupting regularity, either in the temporal [Fig. 5(b), STC-SP: t(19) = 3.4, p < 0.01, d = 0.75] or spectral [Fig. 5(c), STC-TP: t(19) = 2.2, p < 0.05, d = 0.49] domain, when the noise masker was both spectrally and temporally coherent. Therefore, the regularity of interruption in either dimension appears to be important, and spectro-temporal regularity is potentially more important than the effect of checkerboard coherence. That is, regardless of the dimension of coherence, STP resulted in the best performance, approximating that of the baseline periodic masker with coherence in both dimensions. These effects were explored in additional detail in Experiment 2b.
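As a consistency check on the reported effect sizes, the relation between a paired t statistic and a paired-samples Cohen's d can be worked through. This sketch assumes that d was computed as the mean difference divided by the standard deviation of the differences, which the text does not state, with n = 20 listeners.

```latex
d = \frac{t}{\sqrt{n}}: \qquad \frac{3.4}{\sqrt{20}} \approx 0.76, \qquad \frac{2.2}{\sqrt{20}} \approx 0.49.
```

Under that assumption, both values agree with the reported d = 0.75 and d = 0.49 to within the rounding of t.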

FIG. 5.


Group-mean word recognition scores (and standard errors) for Experiment 2a at a 2-Hz/3-ERBN spectro-temporal (ST) gating rate. The top dashed line represents data that replicate the periodic condition from Experiment 1 [Fig. 4(a)] for this new group of participants. The bottom dashed line is a reference, replotting the data from Exp. 1 in the no coherence (NC) condition. Conditions are ordered left to right in the order corresponding to (b–e) in Fig. 4. S, spectral; T, temporal; subscript C, coherence; subscript P, periodic.

C. Experiment 2b

1. Stimuli and design

Experiment 2b was designed to take the spectro-temporally random stimuli from Experiment 1 that had either spectral or temporal coherence and introduce periodic regularity into one dimension: either spectral or temporal. This resulted in four new noise masking conditions. Combined with a replication of Experiment 1 with the original spectrally coherent and temporally coherent random stimuli, this resulted in a total of six experimental conditions tested in Experiment 2b.

For reference, masker spectrograms for these six conditions are displayed in Fig. 6. Figures 6(a)–6(c) display spectrally coherent stimuli and Figs. 6(d)–6(f) display temporally coherent stimuli. Figures 6(a) and 6(d) are the replication stimuli from Experiment 1 with random gating in both spectral and temporal domains. Figures 6(c) and 6(e) restore temporal periodicity, while Figs. 6(b) and 6(f) restore spectral periodicity.

FIG. 6.


Spectrograms of an example section of the 2-Hz/3-ERBN spectro-temporal modulated checkerboard noise. (a, d) display replications from Experiment 1 that had spectral (S) or temporal (T) coherence with spectro-temporal randomization. (b, c) maintain spectral coherence and introduce periodicity in either the temporal or spectral domain. Likewise, (e, f) maintain temporal coherence and introduce periodicity in either the temporal or spectral domain.

Stimulus processing of the noise maskers followed the same steps as in the previous two experiments. Identical noise maskers from Experiment 1 were used for the spectro-temporally random stimuli [Figs. 6(a) and 6(d)]. For the new spectrally periodic stimuli [Figs. 6(b) and 6(f)], no random shift of the ½-ERBN spectral bands was applied. For the new temporally periodic stimuli [Figs. 6(c) and 6(e)], no temporal phase shift was applied to the temporal gating interval. As in Experiment 2a, four different randomizations of the spectrally coherent stimuli were created for random assignment across participants to minimize the potential influence of this factor on the experimental results.

2. Results and discussion

Results for the six new experimental conditions are displayed in Fig. 7, along with reference lines for performance in the periodic condition (for these same subjects in Experiment 2a) and no coherence conditions (from Experiment 1). Conditions are labeled with example spectrograms referencing the stimulus panels in Fig. 6.

FIG. 7.


Group-mean word recognition scores (and standard errors) for Experiment 2b at a 2-Hz/3-ERBN spectro-temporal gating rate. The top dashed line is the periodic condition data from Experiment 2a from this group of participants. The bottom dashed line is a reference, replotting the data from Experiment 1 in the NC condition. Conditions are ordered left to right in the order corresponding to Figs. 6(a)–6(f). S, spectral; T, temporal; subscript C, coherence; subscript P, periodic.

A 2 (coherence) × 3 (periodicity) repeated-measures analysis of variance was conducted on the RAU scores. For this analysis, periodicity was defined according to how it aligned with the dimension of coherence. That is, SC-SP and TC-TP involved periodicity aligned with the dimension to which coherence was applied, unlike SC-TP and TC-SP in which periodicity was applied to the alternate dimension from that of coherence. Results demonstrated a significant main effect of coherence, with better overall performance for spectrally coherent maskers [F(1,19) = 8.1, p = 0.01, ηp² = 0.30]. Performance also improved with greater periodicity in the dimension of the applied coherence [F(2,38) = 30.7, p < 0.001, ηp² = 0.62]. No significant interaction was observed (p = 0.40). The best performance was obtained in the SC-SP condition where listeners, on average, matched performance in the periodic spectro-temporal masker.
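The reported partial eta-squared values can be recovered from the F statistics and their degrees of freedom. The worked calculation below uses the standard identity relating the two; it is offered as a check on the reported numbers, not as a description of how the analysis software computed them.

```latex
\eta_p^2 = \frac{F \cdot df_{\mathrm{effect}}}{F \cdot df_{\mathrm{effect}} + df_{\mathrm{error}}}: \qquad
\frac{8.1 \cdot 1}{8.1 \cdot 1 + 19} \approx 0.30, \qquad
\frac{30.7 \cdot 2}{30.7 \cdot 2 + 38} \approx 0.62.
```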

Overall, these results suggest that performance improvements are greatest for increasing periodicity in the masker. Furthermore, this enhanced masker regularity is most useful when it is aligned with the dimension of the coherent modulations. That is, spectral regularity is most beneficial when present with spectrally coherent modulations. Likewise, temporal regularity is most beneficial with temporally coherent modulations. Finally, listeners benefited from spectral coherence to a greater degree than temporal coherence [Figs. 7(a)–7(c) greater than Figs. 7(d)–7(f)]. Of course, even though different randomizations of spectrally coherent conditions were provided across participants, individual participants could also have benefited from the same randomization across trials, which was not available in the temporal coherence conditions.

Similar to that found in Experiment 2a, regularity in the temporal or spectral domain is the main determinant of performance, with coherence contributing to a smaller degree. However, the current experiment adds an additional nuance in that regularity reduces masking particularly when it occurs within the dimension of coherent modulations.

The current effect of coherence is also largely consistent with the results from Experiment 1. SC and TC conditions demonstrated similar performance when there was no periodicity (compare SC versus TC at 2 Hz in Figs. 3 and 7). However, when periodicity was present, spectral coherence resulted in higher scores [Experiment 2b, where the SC conditions in Figs. 7(a)–7(c) were better than the TC conditions in Figs. 7(d)–7(f)]. These comparisons were all at a gating rate of 2 Hz. Increases in gating rate also resulted in spectral coherence becoming advantageous over temporal coherence (Experiment 1, see Fig. 3). Therefore, we might expect the advantage of SC over TC observed in Experiment 2 (at 2 Hz) to become even greater at higher rates.

IV. TIME-FREQUENCY GLIMPSE ANALYSIS

The above set of experiments demonstrates a significant effect of masker regularity. That is, performance was always best with the baseline periodic masker. As regularity decreased by introducing either spectral or temporal randomization, performance decreased. Furthermore, when periodicity was maintained in both spectral and temporal domains but coherence in one of the domains was removed [see Figs. 5(d) and 5(e)], performance was similar to that with complete spectro-temporal coherence and regularity. These results seem to conform with literature suggesting that listeners are sensitive to the regularity of the masker, or speech glimpses provided by the masker.

However, the manipulation of masker regularity and coherence also affected properties of speech glimpses. Whereas the glimpses were all the same size in a given periodic condition, they were variable in size in a given random condition. The spectrograms show this greater number of smaller and larger glimpses in the random conditions.

A. Analysis procedure

In order to investigate the contribution of speech glimpses to intelligibility, a glimpsing model was applied to the current stimuli. This model is designed as an acoustic analysis to define the portions of speech that might be retained after accounting for the energetic masking of the noise. It is not designed to model the biological processes of the ear or perceptual processing of the human listener.

Toward this end, an ideal binary mask (IBM) (Wang, 2005) was created at the tested SNR using a concatenation of all experimental sentences and a random selection of each masker matching the duration of the speech concatenation. Following the procedures of Li and Loizou (2008), the IBM was created using a local SNR criterion of 0 dB and an analysis window of 20 ms with 50% overlap and 160 FFT points (note that this is a much finer scale than created by the checkerboard noise processing). These parameters determine the acoustic glimpses in terms of noise threshold and spectral and temporal resolution.

Using this model, the glimpse proportion for each noise condition was calculated. Whereas acoustic masker modulation was created using a fixed 50% duty cycle for all conditions, the current glimpse proportion analysis takes into account the relative difference between natural speech fluctuations and masker level for each time-frequency unit, the size of which is defined by the IBM parameters. Intuitively, it might seem that the glimpse proportion should be 0.5 for all conditions as determined by the noise duty cycle. However, in practice, this does not occur, due in large part to the fluctuating nature of the speech and noise. Also, the time-frequency resolution of the IBM (determined by the window and FFT size) determines the detection of speech glimpses. Thus, speech glimpses that were either too brief in duration or narrow in frequency for this analysis were not included in this acoustic measure. Importantly, this was more likely to occur for conditions with greater randomness in which very small glimpses could have been created. Across all conditions, the calculated glimpse proportions ranged from 0.38 to 0.42, which were associated with speech recognition scores between 44% and 73%.
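The ideal binary mask and glimpse proportion described above can be sketched in a few lines. This is an illustrative approximation, not the analysis code used in the study. It assumes an 8-kHz sampling rate (so that a 20-ms window equals 160 samples, matching the 160 FFT points), a Hann window (the window shape is not specified in the text), a strict "greater than" comparison against the 0-dB local criterion, and a glimpse proportion defined as the fraction of all time-frequency units retained. All function and parameter names are hypothetical.

```python
import numpy as np

def stft_power(x, win_len=160, hop=80, n_fft=160):
    """Power spectrogram (frequency bins x frames), 50% overlap by default.

    Assumes len(x) >= win_len; shorter inputs are not handled in this sketch.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)).T ** 2

def ideal_binary_mask(speech, noise, lc_db=0.0, **stft_kwargs):
    """True where the local speech-to-noise ratio exceeds lc_db."""
    s = stft_power(speech, **stft_kwargs)
    n = stft_power(noise, **stft_kwargs)
    eps = np.finfo(float).tiny  # guards against division by zero and log(0)
    local_snr_db = 10.0 * np.log10((s + eps) / (n + eps))
    return local_snr_db > lc_db

def glimpse_proportion(mask):
    """Fraction of time-frequency units retained by the mask."""
    return float(mask.mean())

# Toy usage with synthetic signals standing in for speech and masker.
rng = np.random.default_rng(0)
speech = rng.standard_normal(1600)   # 0.2 s at the assumed 8-kHz rate
noise = rng.standard_normal(1600)
mask = ideal_binary_mask(speech, noise)
proportion = glimpse_proportion(mask)
```

Because each time-frequency unit is classified on its own local SNR, a mask of this kind carries no information about how the retained units are arranged, which is relevant to the limitation of the glimpsing model discussed later in this section.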

B. General analysis properties

In order to provide some intuition into this analysis, Fig. 8 displays spectrograms for an example 5-s sample of speech. Figure 8(a) displays the target speech in quiet and Fig. 8(b) displays the periodic 2-Hz checkerboard masker. Figure 8(c) displays a spectrogram of potential glimpses from the IBM, calculated using an SSN target to model the average speech spectrum. The inverse checkerboard pattern is notable, compared to the masking pattern in Fig. 8(b). Furthermore, retained energy during time-frequency units containing the masker is also notable. While the contribution of these highly sparse cues present during masking intervals is limited when alone (e.g., 12% speech recognition in the −9 dB SNR SSN condition), the literature is full of examples where sparse speech cues, when combined with other speech information, can yield large benefits to speech recognition (e.g., Warren et al., 1995; Warren and Bashford, 1999; Healy and Warren, 2003; Warren et al., 2005; Bashford et al., 2000). Finally, Fig. 8(d) displays the spectrogram for speech [in Fig. 8(a)] retained in the IBM following checkerboard masking [in Fig. 8(b)]. While the speech spectrogram is notably sparser, the lack of a checkerboard pattern analogous to Fig. 8(c) is notable. This is due to (1) the lack of speech energy in many potential time-frequency units and (2) the presence of preserved speech energy retained during noise intervals due to higher relative energy in the speech (also see the supplemental materials, footnote 3, for how detected glimpses may be reduced when the glimpse analysis window spans the edge of a noise unit). In this example, 40% of the target speech time-frequency units were preserved.

FIG. 8.


Spectrograms for (a) clean speech prior to noise masking, (b) periodic checkerboard noise masker, (c) potential time-frequency glimpses based on a steady-state noise matching the long-term average speech spectrum of the target speech, and (d) observed time-frequency speech glimpses for the target speech in (a) during the presence of the masking noise in (b).

C. Explaining participant performance

Of interest is how this acoustic description of speech glimpsing, which accounts for the energetic masking of the noise, can explain participant performance. Participant accuracy scores are plotted as a function of the calculated effective glimpse proportion in Fig. 9 for Experiments 1 and 2. The same data are plotted twice, either coded by regularity (left panel) or by coherence (right panel).

FIG. 9.


Participant accuracy as a function of glimpse proportion across all experiments. The left panel codes the conditions according to regularity and the right codes conditions according to coherence.

Several observations can be made from these results. First, as predicted, maskers with greater regularity are generally associated with greater effective glimpse proportions (although some exceptions are observable). Second, glimpse proportion does not account for the systematic variation in performance due to masker coherence (STC, SC > TC, NC). This is particularly evident at the lower glimpse proportions in the right panel. In summary, this analysis suggests that the strong effect of regularity may be partially related to the availability of a greater number of effective glimpses, but this glimpsing model does not adequately explain the effect of coherence or how the two may potentially interact.

These results are significant because they may explain in part why regularity enhances performance in this and other studies: regularity leads to more useful glimpses. Furthermore, these results also highlight a limitation of the glimpsing model: it does not explain variation in speech recognition due to masker coherence, perhaps because the glimpsing model analyzes each time-frequency unit independently. This latter observation suggests that the organization of masking patterns, defined in part by coherence, is another important parameter for consideration in models of glimpsing.

V. GENERAL DISCUSSION

Overall, results from this set of experiments demonstrate that sentence recognition in spectro-temporally gated noise improves with increasing regularity of masker fluctuations in both spectral and temporal domains. In Experiment 1 (see Fig. 3), it was found that, in the absence of periodicity, speech recognition is better with masker spectral coherence across temporal-gating rates and with temporal coherence at slower rates (i.e., 2 Hz) compared to the NC condition that limited coherence in both dimensions. Although the differences between spectral and temporal coherence were small at the lowest gating rate of 2 Hz, spectral coherence was found to be superior to temporal coherence at the higher rates. This benefit provided by spectral coherence is in accord with hypotheses related to attention to specific frequency bands, although this study did not directly test such a mechanism. Whereas spectrally coherent conditions approached performance for SSN calibrated to 50% power of the “on” portions of the noise (due to the 50% duty cycle), temporal coherence demonstrated no advantage over no-coherence conditions. However, all conditions demonstrated significant masking release relative to SSN at the “on” level of the noise. While the exact mechanism for these results remains undefined, they do demonstrate that masker patterns influence perceptual glimpsing beyond the traditional intelligibility predictions that are limited to considerations of the proportion of speech information available (e.g., Cooke, 2006; ANSI, 1997).

Experiment 2 confirmed the primary role of regularity, and a more limited contribution of checkerboard coherence at 2 Hz (although note that a larger difference in spectral and temporal coherence was observed at higher rates in Experiment 1). All periodic/coherent conditions produced performance that was superior to that of the no-periodicity/no-coherence condition. These results together suggest that the spectro-temporal pattern of speech glimpses, defined by masker regularity, may facilitate speech-in-noise recognition. Regularity induces a uniform size of glimpses across the spectrum or across time that appears beneficial for speech glimpsing. The benefit of regularity in the spectral or temporal domains may be understood, in part, through the consequent increase in the number of glimpses large enough to be retained in the glimpsing model accounting for energetic masking, or a decrease in variability. This stands in contrast to an explanation based on the enhancement of future glimpse predictability, as spectral regularity (unlike temporal regularity) provides no information about future events.

In addition, the results of Experiment 2b suggest that listeners are particularly able to benefit from enhanced masker regularity when it occurs within the dimension of coherent masker modulations. That is, performance is maximized when spectral periodicity is combined with spectral coherence, or temporal periodicity is combined with temporal coherence. The reason for this is not clear from the present set of experiments. Further work will be required to detail the perceptual mechanisms behind how regularity and coherence within the same dimension might enhance glimpsing.

A. Potential mechanisms underlying masking pattern effects

Across experiments, maskers with spectral coherence generally resulted in performance levels at or better than those with temporal coherence (see Figs. 3 and 7). As hypothesized, the better performance with spectral coherence may be related to the facilitation of attention to selective auditory filters over time that provide consistent glimpses. In contrast, the greater masking effects of temporal coherence observed in Experiments 1 and 2b may be due to the presence of common onsets across speech and noise bands that result in perceptual grouping of the speech and noise and therefore interfere with stream segregation (as hypothesized).

However, these two independent perceptual mechanisms related to auditory attention and perceptual grouping appear inadequate for several reasons. Among these, it is not clear how attention to selective auditory filters would operate in checkerboard maskers where the full spectrum is available over time. Data from the current experiments also point to limitations of the perceptual grouping hypothesis. First, participants performed particularly well in fully spectro-temporally periodic maskers, which also shared common onsets across speech and noise (in contrast to the hypothesis). Second, the glimpse analysis (see Fig. 8) suggests that such an explanation, while intuitive, might not always match the acoustic realization of glimpses when speech is continuous. This analysis demonstrated that the onset of speech glimpses did not always match the onsets of noise bands. Therefore, an explanation based on common onsets may be insufficient to explain the temporal coherence results.

Instead, a more general and parsimonious explanation would involve variation in a single mechanism: one involving the degree of spectro-temporal masker uncertainty (or reliability). Periodic maskers provide reliable masker intervals in frequency and time. Spectrally coherent maskers provide more limited, but reliable frequency bands over time. Reliability of masker fluctuations is most reduced for temporally coherent maskers, which provide variable, uncertain masker intervals cycle-to-cycle. It may be that one reason performance improves with temporally coherent maskers at slower gating rates is that this reduces masker uncertainty over the longer temporal-gating cycle. As such, uncertainty (or reliability) may provide the most parsimonious account for the results of spectral, temporal, and spectro-temporal coherence. While notable differences exist with the current study, it has often been observed that uncertainty in noise masking patterns reduces detection of target signals (e.g., Spiegel et al., 1981; Spiegel and Green, 1982; Langhans and Kohlrausch, 1992) and greater reliability improves the intelligibility of target speech (e.g., Felty et al., 2009; Zaar and Dau, 2015).

B. Future directions

Further studies will need to explore the generalizability of these results to other types of maskers. The checkerboard maskers used here provide a method for controlling how masking patterns are spectro-temporally organized. However, they are also necessarily artificial. In addition, the coherence manipulations described here are defined in terms of a checkerboard masker that involved alterations of the noise. Typical coherent maskers may involve coherence across all components of the masker, rather than alternating components, which may introduce important perceptual differences. However, even with these limitations, the results of this study suggest that the acoustic organization of masking patterns, particularly the regularity and uncertainty (reliability) of that pattern over time, is an important characteristic that determines the effectiveness of the masker beyond the simple energetic masking that defines the proportion of speech information retained.

The currently observed effects may be different for listeners with hearing loss. For example, the poor frequency resolution of listeners with sensorineural hearing loss may limit their ability to locate glimpses of clean speech in natural noises because all of the broad glimpses available to these listeners contain noise (e.g., Apoux and Healy, 2009; Healy et al., 2019). However, the results of the current study suggest that under the right conditions (i.e., spectrally regular and coherent modulations of sufficient glimpse sizes), hearing loss should impose minimal limitations on glimpsing beyond reductions in sensation level. This is because the 3-ERBN glimpse bandwidth used here should still exceed the broad auditory filters associated with some degrees of hearing loss. The limits to this ability would be subject to the possibility of reduced integration of speech information across glimpses (e.g., Healy and Bacon, 2002; Hall et al., 2008a,b; Başkent et al., 2010). Future work will be required to assess the interaction between masker regularity and hearing loss.

C. Summary

Overall, the results of this study confirm the previous finding from Fogerty et al. (2018) that performance in modulated maskers is better for lower gating rates that provide larger spectro-temporal glimpses of the target speech. Furthermore, this study uniquely demonstrates that speech recognition in spectro-temporally modulated maskers improves as the regularity of the masker increases. This effect was explained to a limited extent by a reduction in the availability of speech glimpses when the phase of spectral or temporal gating was randomized. Spectro-temporal coherence demonstrated a smaller effect but was most effective when regularity and coherence occurred in the same domain. These results support the overarching hypothesis of this study: the acoustic organization of spectro-temporal masker patterns influences speech intelligibility beyond what is accounted for by energetic masking. Thus, increasing the regularity and reliability of masker patterns, enhancing the coherence of speech glimpses, and expanding the size of spectro-temporal glimpses are required to maximize speech recognition for normal-hearing listeners in spectro-temporally modulated backgrounds.

VI. SUMMARY AND CONCLUSIONS

  • (1) The impact of spectro-temporally modulated noise maskers on speech recognition was examined. Two characteristics of the maskers were varied: regularity (periodicity versus randomness), referring to whether the divisions occur at equal intervals on the frequency or time axis, and coherence, referring to the existence of commonality across the spectral or temporal dimension (see Fig. 1).

  • (2) Glimpsing of speech in spectro-temporally fluctuating maskers is improved when periodicity and/or coherence are introduced to the spectro-temporal patterns. Whereas several studies have shown effects of temporal regularity, the current results add to the rather limited data showing an effect of spectral regularity/coherence.

  • (3) In the absence of periodicity, coherence can benefit speech glimpsing. Coherence in the spectral domain was generally more beneficial than coherence in the temporal domain (see Experiment 1, Fig. 3, and Experiment 2b, Fig. 7).

  • (4) The facilitative effect of masker periodicity is generally stronger than the effect of coherence. This was shown in Experiment 2a, where removing coherence in one dimension did not hinder performance when full spectro-temporal periodicity was present, whereas removing periodicity in one dimension did hinder performance (Fig. 5).

  • (5) Combining periodicity with coherence was particularly beneficial and resulted in performance that was consistently superior to a condition of no periodicity or coherence. The benefit of combined periodicity and coherence was largest when these characteristics of the modulated noise masker existed within the same domain, either spectral or temporal (Experiment 2b, Fig. 7).

  • (6) Maskers with temporal coherence but limited spectral coherence hindered intelligibility. While the possibility of enhanced perceptual grouping of speech and noise was explored, this effect may be more closely related to masker uncertainty, because the masking of frequency bands varied from cycle to cycle.

  • (7) The role of randomness can be explained in part by poorer glimpsing opportunities relative to the periodic conditions. However, this explanation cannot account for the effects of coherence or, particularly, the synergistic combination of periodicity and coherence within the same domain.

  • (8) Speech recognition for normal-hearing listeners in spectro-temporally modulated backgrounds is maximized by increasing the regularity of masker patterns, expanding the size of spectro-temporal glimpses, and enhancing the coherence (or reducing the uncertainty) of speech glimpses.
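
The distinction between regularity and coherence drawn in the summary above can be made concrete with a toy binary time-frequency pattern. The following Python sketch is illustrative only and is not the stimulus-generation procedure used in this study. The grid resolution, the function names (`gate_sequence`, `masker_pattern`, `glimpse_proportion`, `onsets_aligned`), the uniform distribution of random segment lengths, and the random per-band phase shift are all assumptions introduced for the example. It treats only the temporal dimension: regularity is modeled as equal-length on/off segments, and temporal coherence as gating transitions that are aligned across frequency bands.

```python
import numpy as np


def gate_sequence(n_frames, period, regular, rng):
    """Return a 0/1 gating sequence (1 = masker on) of length n_frames.

    regular=True  -> on/off segments all last period // 2 frames.
    regular=False -> segment lengths are drawn uniformly at random
                     (assumed distribution) with the same mean length.
    """
    half = period // 2
    seq = np.empty(n_frames, dtype=int)
    pos, state = 0, 1
    while pos < n_frames:
        length = half if regular else int(rng.integers(1, 2 * half))
        seq[pos:pos + length] = state  # slice is clipped at the array end
        pos += length
        state = 1 - state
    return seq


def masker_pattern(n_bands, n_frames, period, regular, coherent, seed=0):
    """Binary band-by-frame pattern: 1 = masked unit, 0 = potential glimpse.

    coherent=True  -> every band shares the same gating transitions.
    coherent=False -> each band gets its own sequence and a random shift.
    Alternate bands are inverted to give a checkerboard-like layout.
    """
    rng = np.random.default_rng(seed)
    shared = gate_sequence(n_frames, period, regular, rng)
    pattern = np.empty((n_bands, n_frames), dtype=int)
    for band in range(n_bands):
        if coherent:
            gate = shared
        else:
            own = gate_sequence(n_frames, period, regular, rng)
            gate = np.roll(own, int(rng.integers(0, period)))
        pattern[band] = gate if band % 2 == 0 else 1 - gate
    return pattern


def glimpse_proportion(pattern):
    """Fraction of time-frequency units not covered by the masker."""
    return float(1.0 - pattern.mean())


def onsets_aligned(pattern):
    """True if on/off transitions fall on the same frames in every band."""
    ref = np.flatnonzero(np.diff(pattern[0]))
    return all(
        np.array_equal(ref, np.flatnonzero(np.diff(row)))
        for row in pattern[1:]
    )


if __name__ == "__main__":
    for regular in (True, False):
        for coherent in (True, False):
            p = masker_pattern(8, 400, 20, regular, coherent, seed=1)
            print(regular, coherent, onsets_aligned(p),
                  round(glimpse_proportion(p), 3))
```

Because this toy pattern is binary, its glimpse proportion reflects only the layout of masked units. It does not capture the energetic-masking analysis reported in the paper, nor any perceptual effect of masker uncertainty. An analogous sketch for the spectral dimension would randomize band edges rather than segment durations.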

ACKNOWLEDGMENTS

Data collection assistance of Devan Lander is gratefully acknowledged. This work was supported in part by the National Institutes of Health, National Institute on Deafness and Other Communication Disorders, Grant Nos. R01-DC015465 (D.F.) and R01-DC015521 (E.W.H.).

Footnotes

1

Given the alternating nature of the checkerboard gating, all stimuli contained local temporal and spectral coherence within a time-frequency unit defined by the gating period and band grouping. The coherence conditions here specifically controlled for global coherence of the entire masker across gating periods and band groupings.

2

Audio files of the spectro-temporal noise maskers used in this study are available at https://osf.io/7f89d/.

3

See supplementary material at https://doi.org/10.1121/10.0001971 for additional detail regarding the ideal binary mask. This further analysis suggests that potential speech glimpses may not be detected at the edge of noise intervals.

References

  • 1. ANSI (1997). ANSI S3.5-1995, American National Standard Methods for the Calculation of the Speech Intelligibility Index (Acoustical Society of America, New York).
  • 2. ANSI (2004). ANSI S3.21 (R2009), American National Standard Methods for Manual Pure-Tone Threshold Audiometry (Acoustical Society of America, New York).
  • 3. ANSI (2010). ANSI S3.6, American National Standard Specification for Audiometers (Acoustical Society of America, New York).
  • 4. Apoux, F., and Healy, E. W. (2009). “On the number of auditory filter outputs needed to understand speech: Further evidence for auditory channel independence,” Hear. Res. 255, 99–108. doi: 10.1016/j.heares.2009.06.005
  • 5. Arnal, L. H., Doelling, K. B., and Poeppel, D. (2014). “Delta–beta coupled oscillations underlie temporal prediction accuracy,” Cereb. Cortex 25, 3077–3085. doi: 10.1093/cercor/bhu103
  • 6. Barnes, R., and Jones, M. R. (2000). “Expectancy, attention, and time,” Cogn. Psychol. 41, 254–311. doi: 10.1006/cogp.2000.0738
  • 7. Bashford, J. A., Jr., Warren, R. M., and Lenz, P. W. (2000). “Relative contributions of passband and filter skirts to the intelligibility of bandpass speech: Some effects of context and amplitude,” Acoust. Res. Lett. Online 1, 31–36. doi: 10.1121/1.1329836
  • 8. Başkent, D., Eiler, C. L., and Edwards, B. (2010). “Phonemic restoration by hearing-impaired listeners with mild to moderate sensorineural hearing loss,” Hear. Res. 260, 54–62. doi: 10.1016/j.heares.2009.11.007
  • 9. Brungart, D. S., Chang, P. S., Simpson, B. D., and Wang, D. (2009). “Multitalker speech perception with ideal time-frequency segregation: Effects of voice characteristics and number of talkers,” J. Acoust. Soc. Am. 125, 4006–4022. doi: 10.1121/1.3117686
  • 10. Buss, E., Hall, J. W. III, and Grose, J. H. (2003). “Effect of amplitude modulation coherence for masked speech signals filtered into narrow bands,” J. Acoust. Soc. Am. 113, 462–467. doi: 10.1121/1.1528927
  • 11. Buss, E., Hall, J. W. III, and Grose, J. H. (2004). “Spectral integration of synchronous and asynchronous cues to consonant identification,” J. Acoust. Soc. Am. 115, 2278–2285. doi: 10.1121/1.1691035
  • 12. Cooke, M. (2006). “A glimpsing model of speech perception in noise,” J. Acoust. Soc. Am. 119, 1562–1573. doi: 10.1121/1.2166600
  • 13. Cooke, M., and Ellis, D. P. (2001). “The auditory organization of speech and other sources in listeners and computational models,” Speech Commun. 35, 141–177. doi: 10.1016/S0167-6393(00)00078-9
  • 14. Cusack, R., Deeks, J., Aikman, G., and Carlyon, R. P. (2004). “Effects of location, frequency region, and time course of selective attention on auditory scene analysis,” J. Exp. Psychol. Hum. Percept. Perform. 30, 643–656. doi: 10.1037/0096-1523.30.4.643
  • 15. Da Costa, S., van der Zwaag, W., Miller, L. M., Clarke, S., and Saenz, M. (2013). “Tuning in to sound: Frequency-selective attentional filter in human primary auditory cortex,” J. Neurosci. 33, 1858–1863. doi: 10.1523/JNEUROSCI.4405-12.2013
  • 16. Dai, H., Scharf, B., and Buus, S. R. (1991). “Effective attenuation of signals in noise under focused attention,” J. Acoust. Soc. Am. 89, 2837–2842. doi: 10.1121/1.400721
  • 17. Drullman, R., Festen, J. M., and Plomp, R. (1994a). “Effect of temporal envelope smearing on speech reception,” J. Acoust. Soc. Am. 95, 1053–1064. doi: 10.1121/1.408467
  • 18. Drullman, R., Festen, J. M., and Plomp, R. (1994b). “Effect of reducing slow temporal modulations on speech reception,” J. Acoust. Soc. Am. 95, 2670–2680. doi: 10.1121/1.409836
  • 19. Elhilali, M., Ma, L., Micheyl, C., Oxenham, A. J., and Shamma, S. A. (2009). “Temporal coherence in the perceptual organization and cortical representation of auditory scenes,” Neuron 61, 317–329. doi: 10.1016/j.neuron.2008.12.005
  • 20. Engel, A. K., Fries, P., and Singer, W. (2001). “Dynamic predictions: Oscillations and synchrony in top-down processing,” Nat. Rev. Neurosci. 2, 704–716. doi: 10.1038/35094565
  • 21. Felty, R. A., Buchwald, A., and Pisoni, D. B. (2009). “Adaptation to frozen babble in spoken word recognition,” J. Acoust. Soc. Am. 125, EL93–EL97. doi: 10.1121/1.3073733
  • 22. Fogerty, D., Carter, B. L., and Healy, E. W. (2018). “Glimpsing speech in temporally and spectro-temporally modulated noise,” J. Acoust. Soc. Am. 143, 3047–3057. doi: 10.1121/1.5038266
  • 23. Fogerty, D., Xu, J., and Gibbs, B. E. (2016). “Modulation masking and glimpsing of natural and vocoded speech during single-talker modulated noise: Effect of the modulation spectrum,” J. Acoust. Soc. Am. 140, 1800–1816. doi: 10.1121/1.4962494
  • 24. Ghitza, O. (2011). “Linking speech perception and neurophysiology: Speech decoding guided by cascaded oscillators locked to the input rhythm,” Front. Psychol. 2, 130, 1–13. doi: 10.3389/fpsyg.2011.00130
  • 25. Giraud, A. L., and Poeppel, D. (2012). “Cortical oscillations and speech processing: Emerging computational principles and operations,” Nat. Neurosci. 15, 511–517. doi: 10.1038/nn.3063
  • 26. Gordon, P. C. (1997). “Coherence masking protection in brief noise complexes: Effects of temporal patterns,” J. Acoust. Soc. Am. 102, 2276–2283. doi: 10.1121/1.419600
  • 27. Greenberg, S., Carvey, H., Hitchcock, L., and Chang, S. (2003). “Temporal properties of spontaneous speech—A syllable-centric perspective,” J. Phon. 31, 465–485. doi: 10.1016/j.wocn.2003.09.005
  • 28. Hall, J. W. III, Buss, E., and Grose, J. H. (2008a). “Spectral integration of speech bands in normal-hearing and hearing-impaired listeners,” J. Acoust. Soc. Am. 124, 1105–1115. doi: 10.1121/1.2940582
  • 29. Hall, J. W. III, Buss, E., and Grose, J. H. (2008b). “The effect of hearing impairment on the identification of speech that is modulated synchronously or asynchronously across frequency,” J. Acoust. Soc. Am. 123, 955–962. doi: 10.1121/1.2821967
  • 30. Healy, E. W., and Bacon, S. P. (2002). “Across-frequency comparison of temporal speech information by listeners with normal and impaired hearing,” J. Speech Lang. Hear. Res. 45, 1262–1275. doi: 10.1044/1092-4388(2002/101)
  • 31. Healy, E. W., Vasko, J. L., and Wang, D. L. (2019). “The optimal threshold for removing noise from speech is similar across normal and impaired hearing – a time-frequency masking study,” J. Acoust. Soc. Am. 145, EL581–EL586. doi: 10.1121/1.5112828
  • 32. Healy, E. W., and Warren, R. M. (2003). “The role of contrasting temporal amplitude patterns in the perception of speech,” J. Acoust. Soc. Am. 113, 1676–1688. doi: 10.1121/1.1553464
  • 33. Hickok, G., Farahbod, H., and Saberi, K. (2015). “The rhythm of perception: Entrainment to acoustic rhythms induces subsequent perceptual oscillation,” Psychol. Sci. 26, 1006–1013. doi: 10.1177/0956797615576533
  • 34. Howard-Jones, P. A., and Rosen, S. (1993). “Uncomodulated glimpsing in ‘checkerboard’ noise,” J. Acoust. Soc. Am. 93, 2915–2922. doi: 10.1121/1.405811
  • 35. Humes, L. E., and Kidd, G. R. (2016). “Speech recognition for multiple bands: Implications for the speech intelligibility index,” J. Acoust. Soc. Am. 140, 2019–2026. doi: 10.1121/1.4962539
  • 36. Jones, M. R. (1976). “Time, our lost dimension: Toward a new theory of perception, attention, and memory,” Psychol. Rev. 83, 323–355. doi: 10.1037/0033-295X.83.5.323
  • 37. Jones, M. R., Moynihan, H., MacKenzie, N., and Puente, J. (2002). “Temporal aspects of stimulus-driven attending in dynamic arrays,” Psychol. Sci. 13, 313–319. doi: 10.1111/1467-9280.00458
  • 38. Lakatos, P., Musacchia, G., O'Connell, M. N., Falchier, A. Y., Javitt, D. C., and Schroeder, C. E. (2013). “The spectrotemporal filter mechanism of auditory selective attention,” Neuron 77, 750–761. doi: 10.1016/j.neuron.2012.11.034
  • 39. Langhans, A., and Kohlrausch, A. (1992). “Differences in auditory performance between monaural and diotic conditions. I: Masked thresholds in frozen noise,” J. Acoust. Soc. Am. 91, 3456–3470. doi: 10.1121/1.402834
  • 40. Li, N., and Loizou, P. C. (2007). “Factors influencing glimpsing of speech in noise,” J. Acoust. Soc. Am. 122, 1165–1172. doi: 10.1121/1.2749454
  • 41. Li, N., and Loizou, P. (2008). “Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction,” J. Acoust. Soc. Am. 123, 1673–1682. doi: 10.1121/1.2832617
  • 42. McAuley, J. D., Shen, Y., Dec, S., and Kidd, G. R. (2020). “Altering the rhythm of target and background talkers differentially affects speech understanding,” Atten. Percept. Psychophys. 82, 3222–3233. doi: 10.3758/s13414-020-02064-5
  • 43. Miller, G. A., and Licklider, J. C. R. (1950). “The intelligibility of interrupted speech,” J. Acoust. Soc. Am. 22, 167–173. doi: 10.1121/1.1906584
  • 44. Neuling, T., Rach, S., Wagner, S., Wolters, C. H., and Herrmann, C. S. (2012). “Good vibrations: Oscillatory phase shapes perception,” Neuroimage 63, 771–778. doi: 10.1016/j.neuroimage.2012.07.024
  • 45. Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). “Development of the hearing in noise test for the measurement of speech reception thresholds in quiet and in noise,” J. Acoust. Soc. Am. 95, 1085–1099. doi: 10.1121/1.408469
  • 46. Ozmeral, E. J., Buss, E., and Hall, J. W. III (2012). “Asynchronous glimpsing of speech: Spread of masking and task set-size,” J. Acoust. Soc. Am. 132, 1152–1164. doi: 10.1121/1.4730976
  • 47. Rhebergen, K. S., and Versfeld, N. J. (2005). “A speech intelligibility index-based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners,” J. Acoust. Soc. Am. 117, 2181–2192. doi: 10.1121/1.1861713
  • 48. Riecke, L., Peters, J. C., Valente, G., Kemper, V. G., Formisano, E., and Sorger, B. (2017). “Frequency-selective attention in auditory scenes recruits frequency representations throughout human superior temporal cortex,” Cereb. Cortex 27, 3002–3014. doi: 10.1093/cercor/bhw160
  • 49. Shamma, S. A., Elhilali, M., and Micheyl, C. (2011). “Temporal coherence and attention in auditory scene analysis,” Trends Neurosci. 34, 114–123. doi: 10.1016/j.tins.2010.11.002
  • 50. Shen, Y., and Pearson, D. V. (2019). “Efficiency in glimpsing vowel sequences in fluctuating makers, [sic]: Effects of temporal fine structure and temporal regularity,” J. Acoust. Soc. Am. 145, 2518–2529. doi: 10.1121/1.5098949
  • 51. Sorkin, R. D., Pastore, R. E., and Gilliom, J. D. (1968). “Signal probability and the listening band,” Percept. Psychophys. 4, 10–12. doi: 10.3758/BF03210439
  • 52. Spiegel, M. F., and Green, D. M. (1982). “Signal and masker uncertainty with noise maskers of varying duration, bandwidth, and center frequency,” J. Acoust. Soc. Am. 71, 1204–1210. doi: 10.1121/1.387769
  • 53. Spiegel, M. F., Picardi, M. C., and Green, D. M. (1981). “Signal and masker uncertainty in intensity discrimination,” J. Acoust. Soc. Am. 70, 1015–1019. doi: 10.1121/1.386951
  • 54. Stilp, C. (2020). “Acoustic context effects in speech perception,” WIREs Cogn. Sci. 11, e1517, 1–18. doi: 10.1002/wcs.1517
  • 55. Stone, M. A., Füllgrabe, C., and Moore, B. C. (2012). “Notionally steady background noise acts primarily as a modulation masker of speech,” J. Acoust. Soc. Am. 132, 317–326. doi: 10.1121/1.4725766
  • 56. Studebaker, G. A. (1985). “A ‘rationalized’ arcsine transform,” J. Speech Hear. Res. 28, 455–462. doi: 10.1044/jshr.2803.455
  • 57. Tang, Y., Cooke, M., Fazenda, B. M., and Cox, T. J. (2016). “A metric for predicting binaural speech intelligibility in stationary noise and competing speech maskers,” J. Acoust. Soc. Am. 140, 1858–1870. doi: 10.1121/1.4962484
  • 58. ten Oever, S., Schroeder, C. E., Poeppel, D., van Atteveldt, N., and Zion-Golumbic, E. (2014). “Rhythmicity and cross-modal temporal cues facilitate detection,” Neuropsychologia 63, 43–50. doi: 10.1016/j.neuropsychologia.2014.08.008
  • 59. Wang, D. L. (2005). “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines, edited by P. Divenyi (Kluwer, Norwell, MA), pp. 181–197.
  • 60. Ward, L. M. (2003). “Synchronous neural oscillations and cognitive processes,” Trends Cogn. Sci. 7, 553–559. doi: 10.1016/j.tics.2003.10.012
  • 61. Warren, R. M., and Bashford, J. A., Jr. (1999). “Intelligibility of 1/3-octave speech: Greater contribution of frequencies outside than inside the nominal passband,” J. Acoust. Soc. Am. 106, L47–L52. doi: 10.1121/1.427606
  • 62. Warren, R. M., Bashford, J. A., Jr., and Lenz, P. W. (2005). “Intelligibilities of 1-octave rectangular bands spanning the speech spectrum when heard separately and paired,” J. Acoust. Soc. Am. 118, 3261–3266. doi: 10.1121/1.2047228
  • 63. Warren, R. M., Hainsworth, K. R., Brubaker, B. S., Bashford, J. A., Jr., and Healy, E. W. (1997). “Spectral restoration of speech: Intelligibility is increased by inserting noise in spectral gaps,” Percept. Psychophys. 59, 275–283. doi: 10.3758/BF03211895
  • 64. Warren, R. M., Riener, K. R., Bashford, J. A., Jr., and Brubaker, B. S. (1995). “Spectral redundancy: Intelligibility of sentences heard through narrow spectral slits,” Percept. Psychophys. 57, 175–182. doi: 10.3758/BF03206503
  • 65. Weisz, N., and Obleser, J. (2014). “Synchronisation signatures in the listening brain: A perspective from non-invasive neuroelectrophysiology,” Hear. Res. 307, 16–28. doi: 10.1016/j.heares.2013.07.009
  • 66. Zaar, J., and Dau, T. (2015). “Sources of variability in consonant perception of normal-hearing listeners,” J. Acoust. Soc. Am. 138, 1253–1267. doi: 10.1121/1.4928142

Articles from The Journal of the Acoustical Society of America are provided here courtesy of Acoustical Society of America
