Author manuscript; available in PMC: 2023 Dec 1.
Published in final edited form as: Hear Res. 2022 May 24;426:108535. doi: 10.1016/j.heares.2022.108535

A dynamic binaural harmonic-cancellation model to predict speech intelligibility against a harmonic masker varying in intonation, temporal envelope, and location

Luna Prud’homme a, Mathieu Lavandier a,*, Virginia Best b
PMCID: PMC9684346  NIHMSID: NIHMS1834112  PMID: 35654633

Abstract

The aim of this study was to extend the harmonic-cancellation model proposed by Prud’homme et al. [J. Acoust. Soc. Am. 148, 3246–3254 (2020)] to predict speech intelligibility against a harmonic masker, so that it takes into account binaural hearing, amplitude modulations in the masker and variations in masker fundamental frequency (F0) over time. This was done by segmenting the masker signal into time frames and combining the previous long-term harmonic-cancellation model with the binaural model proposed by Vicente and Lavandier [Hear. Res. 390, 107937 (2020)]. The new model was tested on the data from two experiments involving harmonic complex maskers that varied in spatial location, temporal envelope and F0 contour. The interactions between the associated effects were accounted for in the model by varying the time frame duration and excluding the binaural unmasking computation when harmonic cancellation is active. Across both experiments, the correlation between data and model predictions was over 0.96, and the mean and largest absolute prediction errors were lower than 0.6 and 1.5 dB, respectively.

Keywords: Speech Intelligibility, Auditory Modeling, Harmonic Cancellation, Binaural Hearing

1. Introduction

Listeners are often faced with the challenge of trying to attend to a target talker in a noisy environment. In many social environments, the masking of speech can be separated into two broad categories: energetic masking and informational masking (Brungart et al., 2001). Energetic masking (EM) refers to masking that occurs when the target and masker signals overlap and compete at the periphery of the auditory system (Durlach et al., 2003), i.e., when target and masker signals overlap in the time and frequency domains. Informational masking (IM) refers to a reduction in speech intelligibility that cannot be explained by EM. This broad definition of IM includes a wide range of factors that prevent the listener from segregating the target speech from competing voices, or from focusing on the right talker. In the present modeling effort we are concerned with quantifying EM in complex listening situations. Several mechanisms, relying on different acoustic cues, have been proposed in the literature to explain how the auditory system could reduce masking in such situations. Spatial separation between the target and masking sources improves speech intelligibility (Plomp, 1976; Hawley et al., 2004), and two main mechanisms are thought to support this spatial release from masking (SRM): better-ear listening and binaural unmasking. If the target is located on one side of the listener and the masker on the other side, the ear on the same side as the target will receive a better signal-to-noise ratio (SNR) than the other ear. Better-ear listening is the ability of listeners to take advantage of the better SNR at one ear under conditions of spatial separation. Binaural unmasking relies on interaural time differences (ITDs): if the two sources are separated, listeners can take advantage of the ITD differences to improve detection and intelligibility.
According to the equalization-cancellation theory (Durlach, 1972), the auditory system effectively “cancels” part of the masker to improve the internal SNR. It is also well known that listeners are able to take advantage of amplitude modulations in the masker to improve masked speech intelligibility (Festen & Plomp, 1990; Bronkhorst & Plomp, 1992; Peters et al., 1998; Cooke, 2006; Beutelmann et al., 2010; Collin & Lavandier, 2013). In the temporal dips of the masker, the SNR is momentarily increased, providing the opportunity for listeners to glimpse information from the target. This mechanism is often called “dip listening”. On the other hand, it has been suggested that modulations in the masker can also reduce speech recognition by obstructing target speech modulations, a phenomenon called modulation masking (Fogerty et al., 2016).

A number of speech intelligibility models have been proposed in the literature (see Lavandier & Best (2020) for a recent review of binaural models). Several of these models can predict, with varying degrees of accuracy, speech intelligibility for spatially separated, amplitude-modulated noise maskers (Beutelmann et al., 2010; Chabot-Leclerc et al., 2016; Tang et al., 2016; Andersen et al., 2018; Vicente & Lavandier, 2020). However, there is currently no speech intelligibility model able to predict intelligibility in the presence of competing talkers, where effects linked to fundamental frequency (F0; Culling & Stone, 2017) may come into play.

The mechanisms underlying F0-based effects on speech intelligibility are not as well understood as SRM and dip listening. Several studies showed improvements in speech intelligibility when the target and masker differed in F0 for different kinds of harmonic maskers: speech maskers (Brokx & Nooteboom, 1982; Bird & Darwin, 1998; Darwin et al., 2003; Deroche & Culling, 2013), harmonic complexes (Deroche & Culling, 2011, 2013; Deroche et al., 2014a; Leclère et al., 2017), and synthetic vowels (de Cheveigné et al., 1995, 1997; Culling & Darwin, 1993; Summerfield & Culling, 1992). Deroche et al. (2014b) suggested that listeners could glimpse target energy in the spectral dips of a harmonic masker, which occur between the resolved partials, and that this mechanism of spectral glimpsing could improve speech intelligibility. Another potential unmasking mechanism, harmonic cancellation (de Cheveigné, 1993; for a recent review see de Cheveigné, 2021), relies on the idea that listeners are able to detect the harmonic structure of the masker and suppress it when its F0 is different from that of the target to improve the internal SNR. It has also been suggested that improvements in speech intelligibility in the presence of a harmonic masker compared to an aperiodic masker could be due to a decrease of modulation masking (Stone et al., 2011, 2012).

Steinmetzger et al. (2019) proposed a modified version of the monaural speech-based envelope power spectrum model (sEPSMcorr; Relaño-Iborra et al., 2016) to explain speech intelligibility in the presence of noise and intonated harmonic maskers. While this model accounted for effects related to modulation masking, it only partially explained the so-called “masker periodicity benefit”. Prud’homme et al. (2020) proposed a monaural harmonic-cancellation model that accurately predicted speech intelligibility in the presence of a variety of stationary, monotonous harmonic maskers.

Speech maskers present more complex characteristics than the maskers used to validate previous intelligibility models. Unlike noise, speech signals are harmonic and can be characterized by their F0. Unlike simple harmonic complexes, speech signals contain intonation, i.e. F0 variations over time, amplitude modulations and unvoiced parts. Most real-world situations involving speech maskers also involve binaural differences. All of these characteristics of speech maskers would need to be taken into account by a comprehensive speech intelligibility model. In addition, there is some evidence that these different characteristics and associated effects could interact to affect the amount of masking that is ultimately observed. For example, results from Leclère et al. (2017) suggest that F0-based unmasking effects associated with harmonic complexes could be impaired by intonation. They also found that SRM was smaller for harmonic complexes than for noise maskers. Two studies found that dip listening was larger for noise maskers than for harmonic complexes (Leclère et al., 2017; Steinmetzger & Rosen, 2015).

Here, as a step towards predicting speech-on-speech intelligibility, we aimed to predict intelligibility in the presence of harmonic non-speech maskers. We chose to combine the models of Vicente & Lavandier (2020) and Prud’homme et al. (2020) because they were designed to handle time-varying maskers and harmonic maskers, respectively, and because they share a common structure (inherited from the model proposed by Collin & Lavandier, 2013). Variations of combinations of these models were tested on the stimuli from Leclère et al. (2017): harmonic complexes, also called “buzzes”, varying in their F0 contour, spatial location or amplitude modulation. Those maskers are more complex than those tested by Prud’homme et al. (2020), which were stationary, monotonous harmonic complexes, while still being simpler than natural speech maskers. They represent a step between stationary harmonic complexes and speech, allowing an investigation of how the models handle intonation, amplitude modulation, and binaural differences, while still focusing on EM and avoiding for now the complexities of IM.

2. Behavioral data

Leclère et al. (2017) investigated the potential interaction between F0-based effects and SRM (experiment 1) or temporal dip listening (experiment 2). They measured speech reception thresholds (SRTs, the SNRs for 50% target intelligibility) for target sentences spoken by a male voice in French. The mean F0 of the target was always fixed at 117 Hz. They tested both intonated target sentences (no modification of the F0 contour) and monotonized target sentences (F0 fixed at 117 Hz), but for each of the two experiments, only the conditions with the naturally intonated target were considered here. The maskers were harmonic complexes with partials in random phase. They had the same average long-term excitation pattern as the target sentences. Those speech-shaped buzzes were either monotonized (fixed F0 over time) or intonated (using the F0 contour of two concatenated sentences from the target speech material, randomly selected on each trial and always different from the target sentence). Their mean F0 was either equal to the target mean F0 (117 Hz) or set to 139 Hz, leading to a difference in mean F0 (ΔF0) of 0 or 22 Hz (about 3 semitones).

In experiment 1, the target was always presented 30° to the right of the listener using anechoic head-related transfer functions (HRTFs; Gardner & Martin, 1995). The masker always had a stationary envelope and was tested in two spatial conditions: co-located or separated from the target (masker 30° to the left of the listener). Experiment 1 had eight conditions: 2 spatial conditions × 2 ΔF0 × 2 masker F0 contours.

In experiment 2, all stimuli were presented diotically but the masker had either a stationary envelope or the modulated broadband envelope of a single voice (extracted from two concatenated sentences from the target speech material, randomly selected on each trial, always different from the target sentence, and also used to extract the F0 contour of the intonated masker, when appropriate). Experiment 2 had eight conditions: 2 amplitude modulation conditions × 2 ΔF0 × 2 masker F0 contours.

The results of experiments 1 and 2 are presented in Figures 2 and 3, respectively. The main results of experiment 1 were: (1) monotonized buzzes produced lower SRTs than intonated buzzes, (2) there was a ΔF0 benefit for intonated buzzes but it was not significant for monotonized buzzes, (3) the SRTs were always lower in the separated condition. The main results of experiment 2 were: (1) monotonized buzzes produced lower SRTs than intonated buzzes, (2) there was a significant ΔF0 benefit for both intonated and monotonized buzzes, (3) amplitude modulated buzzes produced lower SRTs only in the intonated condition.

Figure 2:

Mean SRTs measured by Leclère et al. (2017) in their experiment 1 for stationary collocated and separated buzzes, with the corresponding model predictions using: (A) model 1: a binaural model without harmonic cancellation (Vicente & Lavandier, 2020), (B) model 2: a monaural model with harmonic cancellation (Prud’homme et al., 2020), for which only the collocated conditions were considered, (C) model 3: a binaural model with harmonic cancellation, (D) model 4: a binaural model with harmonic cancellation and binaural unmasking mutually exclusive. A time frame of 300 ms was used for models 3 and 4.

Figure 3:

Mean SRTs measured by Leclère et al. (2017) in their experiment 2 for diotic buzzes with a stationary or modulated envelope, with the corresponding model predictions using: (A) model 1: a binaural model without harmonic cancellation (Vicente & Lavandier, 2020), (B) model 2: a monaural long-term model with harmonic cancellation (Prud’homme et al., 2020), (C) model 4: a binaural model with harmonic cancellation and binaural unmasking mutually exclusive using a 300-ms time frame. The predictions of model 3 are similar to those of model 4 in these diotic conditions. (D) model 4 using time frame durations of 100 and 500 ms.

3. Models

The models tested in the present study are different variations/combinations of the models from Vicente & Lavandier (2020) and Prud’homme et al. (2020). Both models were originally based on the model proposed by Collin & Lavandier (2013), so they have the same structure: (1) target and masker signals are passed through a gammatone filterbank with two filters per equivalent rectangular bandwidth (ERB), distributed linearly on an ERB-rate scale (Moore & Glasberg, 1983), (2) the SNR is computed in each frequency band, (3) weightings are applied according to the speech intelligibility index (SII; ANSI S3.5, 1997), (4) the weighted SNRs are summed across frequency bands, (5) the resulting effective SNR is then inverted and offset1 so that the mean predicted SRT across conditions in the experiment is aligned with the mean measured SRT.
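Steps (2) to (4) of this shared structure can be sketched as follows. This is a simplified illustration, not the authors' code: the band energies and SII weights below are hypothetical, and step 1 (the gammatone filterbank) and step 5 (the inversion and offset) are omitted.

```python
import numpy as np

def effective_snr(target_bands, masker_bands, sii_weights):
    """Steps (2)-(4) of the shared model structure: per-band SNR in dB,
    SII weighting, and summation across frequency bands. The inputs are
    per-band energies, assumed to come from a gammatone filterbank (step 1)."""
    snr_db = 10 * np.log10(target_bands / masker_bands)  # step (2)
    weights = sii_weights / np.sum(sii_weights)          # step (3), normalized
    return np.sum(weights * snr_db)                      # step (4)

# Hypothetical 3-band example (the real models use two filters per ERB,
# giving many more bands across the speech range):
snr = effective_snr(np.array([1.0, 0.5, 0.25]),
                    np.array([0.5, 0.5, 0.5]),
                    np.array([1.0, 2.0, 1.0]))
```

In the actual models, the effective SNR produced by this summation is then mapped to a predicted SRT via the per-experiment offset described in step (5).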

Four model versions were tested here:

  • Model 1: the original dynamic binaural model without harmonic cancellation (Vicente & Lavandier, 2020).

  • Model 2: the original long-term monaural model with harmonic cancellation (Prud’homme et al., 2020).

  • Model 3: a dynamic, binaural model with harmonic cancellation.

  • Model 4: a dynamic, binaural model with harmonic cancellation in which binaural unmasking and harmonic cancellation are mutually exclusive. The binaural unmasking advantage is set to 0 dB for the frequency bands and time-frames in which harmonic cancellation is applied.

3.1. Model 1

The model from Vicente & Lavandier (2020) is able to accurately predict binaural speech intelligibility for noise maskers with amplitude modulations. Amplitude modulations in the masker are taken into account by segmenting the masker signal into short time frames using half-overlapping Hann windows. In each frequency band the binaural unmasking advantage is computed using an equation proposed by Culling et al. (2005) to estimate the binaural masking level difference (BMLD). In parallel, the better-ear SNR is obtained by selecting the best SNR across ears, band by band. The binaural unmasking advantage and the better-ear SNR are then integrated across frequencies using the SII weightings and averaged across time frames. The two values, computed independently, are added to obtain the effective SNR. A ceiling, which corresponds to the maximum better-ear SNR allowed in each frequency band and time frame, was introduced in the model to prevent the SNR from tending to infinity in the temporal gaps of the masker. Vicente & Lavandier (2020) optimized the model by using a time frame of 300 ms to compute the binaural unmasking advantage, a time frame of 24 ms to compute the better-ear SNR, and a ceiling of 20 dB.
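The BMLD equation from Culling et al. (2005) can be sketched as below. The internal-noise parameter values (time jitter σδ = 105 μs and amplitude jitter σε = 0.25) are the ones commonly reported for this family of equalization-cancellation models and should be treated as assumptions here, not as quoted from this paper.

```python
import numpy as np

def bmld_db(f_hz, phase_target, phase_masker, masker_coherence,
            sigma_delta=105e-6, sigma_eps=0.25):
    """EC-theory BMLD in dB for one frequency band (after Culling et al., 2005).
    f_hz: band centre frequency; phases in radians; coherence in [0, 1].
    sigma_delta (internal time jitter, s) and sigma_eps (amplitude jitter)
    are assumed default values for this model family."""
    omega = 2 * np.pi * f_hz
    k = (1 + sigma_eps**2) * np.exp((omega * sigma_delta)**2)
    advantage = 10 * np.log10((k - np.cos(phase_target - phase_masker))
                              / (k - masker_coherence))
    return max(advantage, 0.0)  # negative advantages are floored at 0 dB

# Antiphasic target against a diotic masker at 500 Hz (the classic N0Spi case):
adv = bmld_db(500.0, np.pi, 0.0, 1.0)  # roughly 10-11 dB
```

The advantage shrinks as frequency increases (the exp((ωσδ)²) term grows) and as the masker coherence at the two ears decreases, matching the qualitative behavior described in the text.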

3.2. Model 2

The model proposed by Prud’homme et al. (2020) is able to predict speech intelligibility against a monotonous, stationary, diotic buzz masker. The harmonic cancellation component is implemented by filtering the target and masker signals with a comb filter that removes energy at the masker F0 and its harmonics. In each frequency band, harmonic cancellation is applied only if it improves the SNR in that band. Four parameters were fixed by Prud’homme et al. (2020): a jitter in the estimation of the F0 (0.25F0), the width of the notches of the comb filter (0.6F0), an SNR ceiling (40 dB), and a frequency limit up to which harmonic cancellation is applied (5000 Hz). The jitter parameter corresponds to the width of a normal distribution from which the jitter value is drawn. To obtain the model predictions, the model is run several times for each condition using a different realization of the stimuli; on each of these “trials” the jitter takes a different random value, so that the model produces a different effective SNR. The effective SNRs are then averaged across trials to obtain the final prediction.
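As a minimal time-domain illustration of the comb-filtering idea, the sketch below uses a simple delay-and-subtract comb, which notches out F0 and all its harmonics. The published model is more elaborate (notches of width 0.6F0 around a jittered F0 estimate, applied only up to 5000 Hz and only where it improves the band SNR), so this is an assumption-laden toy, not the model's filter.

```python
import numpy as np

def cancel_harmonics(x, fs, f0):
    """Delay-and-subtract comb filter: y[n] = x[n] - x[n - fs/f0].
    Removes energy at f0 and every harmonic of f0 (toy version of the
    harmonic-cancellation filtering described in the text)."""
    period = int(round(fs / f0))  # delay of one F0 period, in samples
    y = np.copy(x)
    y[period:] -= x[:-period]
    return y

# A 100-Hz "buzz" built from its first five harmonics:
fs = 16000
t = np.arange(fs) / fs
masker = sum(np.sin(2 * np.pi * 100 * k * t) for k in range(1, 6))
residual = cancel_harmonics(masker, fs, 100.0)
# Once the filter has settled (after one period), the buzz is almost
# entirely cancelled, illustrating the internal SNR gain for a target
# whose F0 differs from the masker's.
```

A target at a different F0 would pass through such a filter largely unattenuated, which is the source of the internal SNR improvement the model exploits.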

3.3. Model 3

Model 3 is a hybrid of the two original models (1 and 2). The structure of the model is presented in Figure 1. The masker is segmented into time frames using half-overlapping Hann windows, as in model 1. The mean F0 in each time frame is computed using PRAAT PSOLA (Boersma & Weenink, 2018). If an F0 is found at least 50% of the time across the time frame duration, the harmonic cancellation component is used (as in model 2, the SNRs are computed after the signals have been comb-filtered, and harmonic cancellation is applied if it improves the SNR in each frequency band). In the case of buzzes, there was always an F0 so the harmonic cancellation component was always applied, but this option was added to the model so that it could be applied to stimuli with unvoiced parts in the future. The mean F0 across the time frame is used to design the comb filter. For simplicity, the same time frame duration is used for the computation of all mechanisms: harmonic cancellation, better-ear listening and binaural unmasking (unlike in model 1). The SNR is computed at both ears for both unfiltered and comb-filtered signals. The best SNR across ears is chosen (better-ear SNR, as in model 1) and then the best better-ear SNR between the comb-filtered and unfiltered signals is chosen in each band (as in model 2). The BMLDs are computed only on the unfiltered2 signals (thus assuming binaural unmasking and harmonic cancellation to be independent mechanisms), then integrated across frequency, and added to the broadband best better-ear SNR.

Figure 1:

Structure of the dynamic, binaural speech intelligibility model with harmonic cancellation (model 3). The condition “If there is an F0” is met when the masker is voiced for at least 50% of the time frame (i.e., an F0 is found by PRAAT PSOLA for at least 50% of the samples). The grey color of the lower “Binaural unmasking advantage” box indicates that this advantage is computed on the unfiltered (not the comb-filtered) signal.

The parameters set by Prud’homme et al. (2020) for the long-term harmonic cancellation mechanism of model 2 were used for model 3. In addition, compared to model 2, a new parameter, the time frame duration, was introduced in model 3. The rationale behind segmenting the masker into time frames is to account for the amplitude modulations in the masker (as in model 1) and to account for the F0 variations over time in the case of intonated maskers. In Figure 1, the bottom box (“If there is an F0”) is equivalent to model 2 operating in time frames on the two ears and taking the best of the two SNRs, and then adding the binaural unmasking component. The upper box (“If there is no F0”) is equivalent to model 1 using the same time frame for both binaural unmasking and better-ear listening with a ceiling at 40 dB instead of 20 dB.

The new model parameter (time frame duration) had the most influence on the predictions of experiment 2, which involved fluctuating maskers; thus, different time frame durations were tested for this experiment. A range of acceptable values for the time frame duration that gave good predictions for experiment 2 was defined. Some of these values were then tested for experiment 1 to find the final value that provided the best fit to the experimental data for both experiments.

3.4. Model 4

Model 4 is a modified version of model 3. It has the same structure, except that when the harmonic cancellation component is invoked (i.e., when there is an F0 and when applying harmonic cancellation improves the computed SNR in the considered frequency band and time frame), the binaural unmasking advantage (gray part in Fig. 1) is set to 0 dB. This model thus assumes that the mechanisms of harmonic cancellation and binaural unmasking are mutually exclusive.
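The per-band, per-frame combination rules of models 3 and 4 differ only in whether the BMLD is added when the comb-filtered SNR wins. A sketch of that decision logic follows; the function and variable names are illustrative, not taken from the authors' implementation.

```python
def band_snr(snr_unfiltered, snr_comb, bmld, mutually_exclusive=True):
    """Per-band, per-frame effective SNR (dB), sketched.
    With mutually_exclusive=True (model 4), the binaural unmasking
    advantage is zeroed whenever harmonic cancellation is selected;
    with False (model 3), the BMLD is always added."""
    bmld = max(bmld, 0.0)
    if snr_comb > snr_unfiltered:  # harmonic cancellation improves the SNR
        return snr_comb + (0.0 if mutually_exclusive else bmld)
    return snr_unfiltered + bmld   # no cancellation: both models add the BMLD
```

In diotic conditions the BMLD is 0 dB everywhere, so the two rules coincide, which is why models 3 and 4 gave identical predictions for experiment 2.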

3.5. Implementation and evaluation of the predictions

For all model versions, the model input for the target was an averaged target signal created by adding 120 target sentences, truncated to the duration of the shortest sentence. For the harmonic cancellation models (models 2, 3 and 4), the predictions were computed using 800 trials, as done by Prud’homme et al. (2020).

The performance of the models was evaluated using the mean absolute prediction error (the mean across conditions of the absolute difference between the measured and predicted SRTs), the largest absolute prediction error, and the Pearson correlation between data and predictions.
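These three evaluation metrics are straightforward to compute; the SRT values in the example below are illustrative, not the paper's.

```python
import numpy as np

def evaluate(measured, predicted):
    """Return (mean absolute error, largest absolute error, Pearson r)
    between measured and predicted SRTs across conditions."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = np.abs(measured - predicted)
    r = np.corrcoef(measured, predicted)[0, 1]
    return err.mean(), err.max(), r

# Four hypothetical conditions (SRTs in dB):
mean_err, max_err, r = evaluate([-10.0, -8.0, -6.0, -4.0],
                                [-9.5, -8.5, -6.0, -3.0])
```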

4. Results

4.1. Predictions with the original model 1

Figure 2(A) presents the predictions from model 1 for experiment 1. The model predicts a difference in SRT between the co-located and separated conditions (SRM) of about 10 dB, which is about 4 dB higher than the SRM observed in the data. As expected, this model does not predict any of the F0 effects: it does not predict the difference between monotonized and intonated buzzes, nor the small ΔF0 benefit observed for the intonated buzzes.

Figure 3(A) presents the predictions from model 1 for experiment 2. As for experiment 1, model 1 does not predict the difference between intonated and monotonized buzzes, the small ΔF0 effect, nor the interaction of F0 effects with dip listening. Most strikingly, this model predicts a dip listening advantage of about 4 dB for all masker types, whereas the data showed no advantage for the monotonized buzzes and an advantage of only 1 dB for intonated buzzes.

4.2. Predictions with the original model 2

Figure 2(B) presents the predictions from model 2 for experiment 1. This model is monaural and cannot predict SRM, thus only the co-located conditions are considered. The predictions were computed using the right ear signals only. Compared to model 1, this model predicts the difference between intonated and monotonized buzzes with good accuracy. The model can also predict an SRT difference associated with ΔF0, but it is larger than that observed in the data for monotonized maskers (1 dB compared to 0.8 dB in the data) and not predicted at all for intonated maskers (−0.1 dB compared to 0.8 dB in the data). Across the four co-located conditions, this model produced a mean error of 0.8 dB and the largest error was 1.3 dB.

Figure 3(B) presents the predictions from model 2 for experiment 2. Contrary to model 1, model 2 predicts an effect opposite to dip listening: the predicted SRTs are lower for stationary than for modulated maskers, and this difference is larger for monotonized buzzes. Because model 2 does not operate on short time frames, it is not surprising that it cannot predict any benefit associated with dip listening for modulated maskers. The fact that it predicts the opposite indicates that harmonic cancellation operates less effectively on modulated maskers.

4.3. Predictions with the new models 3 and 4

The predictions of models 3 and 4 are presented by considering experiment 2 first (Fig. 3, C and D), because it involved non-stationary buzzes, which are more critical for testing the influence of the time-frame duration, the only new parameter introduced in these dynamic models. Several time frame durations were tested, from 100 to 900 ms. As the target and maskers were diotic in this experiment, models 3 and 4 gave the same predictions (no binaural unmasking involved).

For clarity, Figure 3 presents the predictions for experiment 2 for a limited number of frame durations only: 300 ms (Fig. 3 C), and 100 and 500 ms (Fig. 3 D). Figure 4 (circle symbols) presents the mean and maximum prediction errors across conditions in the experiment as a function of frame duration. The predictions are worse for shorter time frames (below 300 ms): Figure 3 (D, light gray lines for 100 ms) indicates that the SRT difference between monotonized (left) and intonated (right) buzzes is then greatly underestimated, while the dip listening advantage is greatly overestimated (difference between full and dashed light gray lines). The errors are lowest for time frame durations between 300 and 800 ms (largest error < 1.5 dB, mean error < 0.7 dB; Fig. 4). Figure 3 shows that increasing the time frame duration reduces the predicted dip listening advantage (panel C for 300 ms), to the point that the SRT predicted for modulated monotonized buzzes is higher than that for stationary monotonized buzzes (panel D, dark grey lines for 500 ms, left for monotonized buzz). This trend is also found in the data, although the difference between those two conditions was not significant. The model predicts the difference between intonated (right) and monotonized (left) buzzes reasonably well for all time frames of 300 ms and longer, although the prediction is best at 300 ms (Fig. 3 C). The ΔF0 benefit is well predicted for the monotonized buzzes, but the model failed to predict this benefit for the intonated buzzes.

Figure 4:

Mean and largest prediction errors obtained with model 4, as a function of the time frame duration used for the predictions of experiment 1 (triangles) and experiment 2 (circles) from Leclère et al. (2017).

Figure 4 (triangle symbols) also presents the mean and largest prediction errors of model 4 for experiment 1, for some of the time frame durations. All the trends concerning the effect of the frame duration were the same for models 3 and 4. Contrary to experiment 2, increasing the frame duration increased the prediction errors: longer time frames over-estimate the predicted SRT difference between intonated and monotonized buzzes (not shown in Fig. 2). Overall, a time frame duration of 300 ms resulted in good predictions for both experiments (mean error < 0.6 dB and largest error < 1.5 dB). Figure 2 (C and D) presents the corresponding predictions from models 3 and 4 for experiment 1. The two models accurately predict the F0-based effects, as does the monaural model 2 with harmonic cancellation (panel B). The SRM predicted by model 3 (panel C, difference between full and dashed lines) is almost as large as that predicted by model 1 (panel A), around 10 dB, which is 4 dB larger than the SRM observed in the data. The SRM predicted by model 4 (panel D) is close to the SRM observed in the data.

4.4. Summary of the predictions

As expected, the two original models cannot predict the data from Leclère et al. (2017). Model 1 failed to predict the F0-based effects, while model 2 failed to predict spatial unmasking and dip listening.

As further highlighted when comparing the scatter plots of Figure 5, the model that provides the best fit to the data of both experiments is model 4 using a time frame duration of 300 ms (mean error = 0.60 dB, largest error = 1.50 dB, correlation = 0.99 for experiment 1 and mean error = 0.55 dB, largest error = 1.21 dB, correlation = 0.96 for experiment 2).

Figure 5:

Predicted vs. measured SRTs for experiments 1 and 2 (replotted from Fig. 2 and 3) using: (A) model 1 (no harmonic cancellation; Vicente & Lavandier, 2020), (B) model 2 (monaural long-term; Prud’homme et al., 2020), (C) model 3 (harmonic cancellation and binaural unmasking independent), (D) model 4 (harmonic cancellation and binaural unmasking mutually exclusive). A time frame of 300 ms was used for models 3 and 4. The dashed line of unity slope passing through the origin represents a 1:1 relationship between predicted and measured SRTs. For all models, predictions were fitted to the data using a different reference for the two experiments (the mean measured SRT across conditions in each experiment).

5. Discussion

5.1. Intonation

As shown by the predictions for both experiments, the implementation of harmonic cancellation in the model is necessary to predict the differences between intonated and monotonized buzzes. Model 1 without harmonic cancellation completely fails (Fig. 2 A and Fig. 3 A). These results support the idea that harmonic cancellation (or a related mechanism) plays a role in the unmasking of speech in the presence of harmonic maskers, and that this mechanism operates most effectively when the F0 does not vary over time. Specifically, it appears that a monotonous F0 is easier to “cancel” than a variable F0. Leclère et al. (2017) hypothesized that the sluggishness of the mechanism could explain the difference between intonated and monotonized buzzes. In the model, this is represented by the fact that the comb filter uses the mean F0 across the whole time frame. In the intonated case, the comb filter will thus be less efficient at cancelling the masker energy, because it is based on an approximated masker F0, compared to the monotonized case where the F0 is constant across the time frame. Varying the time frame duration in the model provided a test of this hypothesis: the shorter the time frame is, the better the approximation of the masker F0 and the more effective harmonic cancellation should be. Somewhat surprisingly, the long-term model (model 2, which has just one long time frame) gave a reasonable prediction of the effect of intonation (Fig. 2 B). Nevertheless, varying the time frame duration in the dynamic models (models 3 and 4) influenced the predictions. A time frame of 300 ms gave slightly better predictions of the effect of intonation than the long-term model, presumably because the F0 of the intonated masker is estimated more accurately. If the time frame is too short, however, the predictions of the model are worse: the predicted SRTs are too low for intonated maskers compared to monotonized maskers (see Fig. 3 D, light gray lines for the model with 100-ms time frame). It appears that the time frame needs to be short enough so that it follows to some extent the F0 of the intonated masker, but long enough to account for some sluggishness in the mechanism.

5.2. ΔF0 benefit

Model 1 without harmonic cancellation does not predict the small ΔF0 benefit observed in the data (Fig. 2 A and Fig. 3 A). Leclère et al. (2017) suggested that F0-based effects observed in their data could be due to spectral glimpsing and/or harmonic cancellation. Given that model 1 is not able to give accurate predictions of the data, despite computing the SNR in frequency bands and thus at least partly accounting for spectral glimpsing, the present results point towards a role for harmonic cancellation. This confirms the same observation made by Prud’homme et al. (2020) while considering monotonized buzzes and different data sets.

For the harmonic-cancellation models (models 2 to 4), the predicted ΔF0 benefit is small: around 1 dB for monotonized maskers and 0.3 dB for intonated maskers. In comparison, this benefit was between 0 and 1 dB for monotonized maskers and between 0.8 and 1.2 dB for intonated maskers in the data. Thus, although all of the benefits are small, it appears that the predicted benefits are slightly larger than observed for monotonized maskers, and slightly smaller than observed for intonated maskers. Regarding monotonized maskers, it is possible that the non-significant effect in experiment 1 of Leclère et al. (2017) was due to a floor effect as the SRTs were very low. Regarding the lack of a predicted F0 benefit for the intonated maskers, this may reflect the fact that the model uses only an average target spectrum and thus is less effective at capturing momentary differences in F0 that arise when both target and masker are intonated.

Overall, even though models 3 and 4 fail to predict the ΔF0 benefit for intonated buzzes, their prediction of this benefit is rather accurate for monotonized buzzes. This effect is quite small, in particular in comparison to the deleterious effect of masker intonation that is well captured by the models, so the resulting errors are also relatively small.

5.3. Spatial separation

The SRM observed in experiment 1 for buzz maskers was smaller than has been reported in previous studies for noise maskers. We currently have no clear explanation for this fact. It is possible that a buzz produces less masking than noise (due to its spectral dips) and as such there is less potential for SRM. The binaural model 1 without harmonic cancellation predicts an SRM of about 10 dB for both intonated and monotonized maskers, which is about the same as it would predict for noise maskers (see footnote 3). This suggests that the difference in SRM between noise and buzz maskers is not explained primarily by the spectral dips (which are at least partly taken into account in model 1). Another explanation is that a mechanism linked to harmonicity (possibly harmonic cancellation) already lowered the SRTs, leaving fewer opportunities for SRM.

The binaural model 3 with harmonic cancellation also predicts an SRM larger than in the data, whereas making binaural unmasking and harmonic cancellation mutually exclusive in model 4 gives very good predictions of SRM for the buzzes. While further investigation is clearly needed, this result raises the interesting possibility that the auditory system cannot perform harmonic cancellation and binaural unmasking at the same time within the same frequency channel. Although somewhat surprising, this possibility is consistent with previous observations that, from a computational perspective, the two processes are highly parallel (de Cheveigné, 2021; Lavandier et al., 2022): harmonic cancellation relies on the F0 difference between target and masker and is impaired by inharmonicity in the masker (de Cheveigné et al., 1995; Deroche et al., 2014b), while equalization-cancellation relies on the ITD difference between target and masker and is impaired by decorrelation of the masker at the ears (Durlach, 1972; Lavandier & Culling, 2010).
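The structural difference between combining the two benefits and making them mutually exclusive can be sketched in a few lines. The per-band gains below are invented for illustration (not model output), and the criterion that harmonic cancellation is "active" whenever it yields a positive gain is our assumption, standing in for the paper's rule of excluding the binaural unmasking computation when harmonic cancellation is active:

```python
# Hypothetical per-frequency-band benefits in dB (invented values, for illustration only).
hc_benefit = [6.0, 4.5, 2.0, 0.0]   # harmonic-cancellation gain per band
bu_benefit = [5.0, 3.0, 3.5, 1.0]   # binaural-unmasking (BMLD) gain per band

def combined_benefit(hc, bu, mutually_exclusive):
    # Mutually exclusive (model-4-like): binaural unmasking is dropped in any
    # band where harmonic cancellation is active (here: positive gain -- an
    # assumed criterion). Otherwise (model-3-like), the two benefits add.
    if mutually_exclusive:
        return [h if h > 0 else b for h, b in zip(hc, bu)]
    return [h + b for h, b in zip(hc, bu)]

additive = combined_benefit(hc_benefit, bu_benefit, False)
exclusive = combined_benefit(hc_benefit, bu_benefit, True)
print(additive)   # larger total benefit -> larger predicted SRM
print(exclusive)  # smaller total benefit -> SRM closer to the buzz data
```

The exclusive combination necessarily yields a total benefit no larger than the additive one, which is the direction needed to bring the predicted SRM down towards the measured values.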

5.4. Amplitude modulation

In experiment 2, the maskers were either stationary or modulated in amplitude. Previous studies comparing a stationary noise to a single-voice modulated noise found dip listening advantages between 4 and 12 dB (Collin & Lavandier, 2013; Festen & Plomp, 1990; Peters et al., 1998; Hawley et al., 2004; Beutelmann et al., 2010). With the buzzes presented here, Leclère et al. (2017) only found a dip listening advantage for intonated buzzes (not for monotonized buzzes) and it was less than 1.5 dB. They proposed two potential explanations for this interaction. First, spectral glimpsing and temporal glimpsing might not be able to take place simultaneously (in other words, they could be mutually exclusive): if intonation impairs spectral glimpsing by blurring the spectral dips of the intonated buzz, temporal dip listening could then take place for this masker. Second, the F0 contour of the intonated buzz could allow listeners to anticipate the envelope fluctuations and thus make better use of the information in the masker (temporal) dips. If this second explanation were true, none of the models presented here would be able to predict this difference.

Model 1 without harmonic cancellation predicts a 4-dB dip listening advantage for modulated buzzes, which is similar to the predictions it gives for a one-voice modulated noise masker (see results from Collin & Lavandier, 2013). If spectral glimpsing and temporal glimpsing were mutually exclusive, model 1 could not account for this interaction, as its structure does not allow the two mechanisms to be separated. In models 3 and 4, the longer time frame needed for harmonic cancellation to predict F0-based effects limits the model’s ability to take full advantage of the temporal dips of the masker, thus reducing the predicted dip listening advantage. Moreover, increasing the time frame duration reduces the predicted dip listening advantage more for the monotonized buzzes than for the intonated buzzes, so that these models partly account for the interaction between intonation and amplitude modulation observed in the data.
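The dependence of the predicted dip listening advantage on the time-frame duration can be illustrated with a toy computation. This is not the published model: white noise stands in for speech, the envelope is a simple 4-Hz sinusoid, and the per-frame SNR average is only a crude proxy for dip listening (real models also involve a filterbank, SNR ceilings, etc.):

```python
import numpy as np

fs = 16000
t = np.arange(fs) / fs                     # 1 s of signal
rng = np.random.default_rng(0)
target = rng.standard_normal(fs)           # white-noise stand-in for speech
masker = rng.standard_normal(fs)
envelope = 1.0 + 0.99 * np.sin(2 * np.pi * 4 * t)   # 4-Hz amplitude modulation
masker_mod = masker * envelope

def mean_frame_snr_db(tgt, msk, frame_s, fs):
    # Average the per-frame SNR in dB: frames that fall in deep masker dips
    # have very favorable SNRs and pull the average up.
    n = int(frame_s * fs)
    snrs = []
    for i in range(0, len(tgt) - n + 1, n):
        s = np.mean(tgt[i:i + n] ** 2)
        m = np.mean(msk[i:i + n] ** 2)
        snrs.append(10 * np.log10(s / m))
    return float(np.mean(snrs))

def dip_advantage_db(frame_s):
    # Benefit of the modulated masker over the stationary one.
    return (mean_frame_snr_db(target, masker_mod, frame_s, fs)
            - mean_frame_snr_db(target, masker, frame_s, fs))

# 24-ms frames resolve the envelope dips; 300-ms frames average over them,
# so the predicted dip listening advantage shrinks.
print(dip_advantage_db(0.024))
print(dip_advantage_db(0.300))
```

With short frames, some frames land in the envelope dips and their very favorable SNRs dominate the average; with 300-ms frames, which span more than a full modulation cycle, that advantage largely disappears, in line with the trade-off discussed above.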

5.5. Model limitations

Despite the overall success of our dynamic harmonic-cancellation model 4 in predicting the behavioral data of Leclère et al. (2017), we have identified several model limitations that should be kept in mind and perhaps addressed in future implementations.

First, the model failed to predict the small ΔF0 benefit for intonated buzzes. One possible explanation is that the model only considers the long-term spectrum of the target and not its detailed F0 profile. When the F0 to be canceled is constant (monotonized buzzes), the models predict a ΔF0 benefit because the comb-filtered target long-term spectrum is sensitive to whether the F0 used for the comb filter matches the target mean F0. When the F0 to be canceled varies over time (intonated buzzes), the models predict almost no ΔF0 benefit, perhaps because in each time frame the F0 to be canceled differs from the target mean F0 most of the time, whether or not the mean of the to-be-canceled F0 equals the mean target F0. Because our model does not consider the target F0 profile, we do not expect it to predict the influence of modifying this profile, such as the differences highlighted by Leclère et al. (2017) when using a monotonized rather than intonated target to measure the SRTs in their experiments 1 and 2. Note that in the study of Prud’homme et al. (2020), the input of model 2 was the concatenation of the target sentences rather than their average. The two calculations are equivalent and give very similar results, but the averaged target is much faster to process because it does not involve long signals. Compared to the averaged target, the concatenated target preserves the F0 profile of the sentences, but this profile is not considered by the model, which relies only on the target long-term spectrum, and the averaged and concatenated targets have the same long-term spectrum.

Another limitation of the new model is that the time frame duration simultaneously influences the predictions of the dip listening advantage associated with temporal variations in the masker envelope and of the effect of intonation associated with temporal variations in the masker F0. The optimal 300-ms duration probably results from a compromise between the modelling of these two effects. Specifically, better-ear listening and dip listening have been shown to be better predicted using shorter time frames, such as the 24-ms Hann window used by Beutelmann et al. (2010) and Vicente & Lavandier (2020). Vicente & Lavandier (2020) used a longer 300-ms window to compute the binaural unmasking advantage and account for binaural sluggishness, as done before by Hauth & Brand (2018). Here, a single (and thus compromise) duration is used to account for the effects of better-ear listening/dip listening, binaural unmasking and masker intonation. It is possible that better predictions would be obtained in future implementations of the model by assuming independent processes, each using a different time-frame duration.

6. Conclusion

The harmonic-cancellation intelligibility model proposed by Prud’homme et al. (2020) has been extended to account for binaural hearing and the effects of intonation and amplitude modulation. The model proposed here gives accurate predictions of the data from two experiments that measured SRTs for speech masked by harmonic complexes having different F0s, F0 contours, amplitude modulations and spatial separations from the target. A harmonic-cancellation mechanism is needed to predict the F0-based effects, and predictions were best when operating the model on 300-ms time frames. The most successful version of the model assumes that the mechanisms of harmonic cancellation and binaural unmasking are mutually exclusive, raising questions that deserve further investigation. This model version is being made publicly available — as model prudhomme2022 — within the Auditory Modeling Toolbox (AMT 1.1; Majdak et al., 2022; Lavandier et al., 2022), along with the example code, data and signals used to run the model for experiment 1 of Leclère et al. (2017) — as exp_prudhomme2022 (see section Experiments in the AMT Documentation). The proposed model has also been used to investigate the relevance of harmonic cancellation in the context of speech masked by competing speech in a companion paper (Prud’homme et al., 2022).

Highlights.

  • A binaural non-stationary harmonic-cancellation speech intelligibility model

  • Predictions of SRTs for speech masked by a harmonic complex

  • Accounts for spatial separation, intonation and amplitude modulation of the masker

  • Harmonic cancellation and binaural unmasking might be mutually exclusive mechanisms

Acknowledgments

This work was performed within the LabEx CeLyA (Grant No. ANR-10-LABX-0060) and funded by the “Fondation Pour l’Audition” (Speech2Ears grant). V.B. was supported, in part, by National Institutes of Health-National Institute on Deafness and Other Communication Disorders (NIH-NIDCD) Award No. DC015760.

Footnotes


1

The model can only predict relative differences between conditions, not absolute intelligibility. Therefore, a reference needs to be chosen, which is typically the average SRT across conditions.

2

Note that we would not expect a strong influence of the comb filtering on the BMLDs if they were also calculated on the comb-filtered signals, because these calculations depend only on the interaural coherence and ITD of the signals (Culling et al., 2005), which should be minimally affected by a comb filter that is identical at the two ears.

3

The predicted SRM for a noise masker having the same spectrum as the speech target and tested in the two spatial configurations considered here is 10.3 dB, when estimated using the model from Jelfs et al. (2011), which is equivalent to model 1 for stationary noise maskers and normal-hearing listeners (Vicente et al., 2020).

References

  1. Andersen AH, de Haan JM, Tan Z-H, & Jensen J (2018). Refinement and validation of the binaural short time objective intelligibility measure for spatially diverse conditions. Speech Communication, 102, 1–13. [Google Scholar]
  2. ANSI S3.5 (1997). Methods for Calculation of the Speech Intelligibility Index. American National Standards Institute, New York. [Google Scholar]
  3. Beutelmann R, Brand T, & Kollmeier B (2010). Revision, extension, and evaluation of a binaural speech intelligibility model. The Journal of the Acoustical Society of America, 127, 2479–2497. doi: 10.1121/1.3295575. [DOI] [PubMed] [Google Scholar]
  4. Bird J, & Darwin CJ (1998). Effects of a difference in fundamental frequency in separating two sentences. In Palmer A, Rees A, Summersfield Q, & Meddis R (Eds.), Psychophysical and Physiological Advances in Hearing (pp. 263–269). Wiley. [Google Scholar]
  5. Boersma P, & Weenink D (2018). Praat: Doing phonetics by computer [Computer program]. Version 6.0.42, retrieved 15 August 2018 from http://www.praat.org/.
  6. Brokx JP, & Nooteboom SG (1982). Intonation and the perceptual separation of simultaneous voices. Journal of Phonetics, 10, 23–36. [Google Scholar]
  7. Bronkhorst AW, & Plomp R (1992). Effect of multiple speechlike maskers on binaural speech recognition in normal and impaired hearing. The Journal of the Acoustical Society of America, 92, 3132–3139. doi: 10.1121/1.404209. [DOI] [PubMed] [Google Scholar]
  8. Brungart DS, Simpson BD, Ericson MA, & Scott KR (2001). Informational and energetic masking effects in the perception of multiple simultaneous talkers. The Journal of the Acoustical Society of America, 110, 2527–2538. doi: 10.1121/1.1408946. [DOI] [PubMed] [Google Scholar]
  9. Chabot-Leclerc A, MacDonald EN, & Dau T (2016). Predicting binaural speech intelligibility using the signal-to-noise ratio in the envelope power spectrum domain. The Journal of the Acoustical Society of America, 140, 192–205. [DOI] [PubMed] [Google Scholar]
  10. Collin B, & Lavandier M (2013). Binaural speech intelligibility in rooms with variations in spatial location of sources and modulation depth of noise interferers. The Journal of the Acoustical Society of America, 134, 1146–1159. doi: 10.1121/1.4812248. [DOI] [PubMed] [Google Scholar]
  11. Cooke M (2006). A glimpsing model of speech perception in noise. The Journal of the Acoustical Society of America, 119, 1562–1573. doi: 10.1121/1.2166600. [DOI] [PubMed] [Google Scholar]
  12. Culling JF, & Darwin CJ (1993). Perceptual separation of simultaneous vowels: Within and across-formant grouping by F0. The Journal of the Acoustical Society of America, 93, 3454–3467. doi: 10.1121/1.405675. [DOI] [PubMed] [Google Scholar]
  13. Culling JF, Hawley ML, & Litovsky RY (2005). Erratum: The role of head-induced interaural time and level differences in the speech reception threshold for multiple interfering sound sources [J. Acoust. Soc. Am. 116, 1057 (2004)]. The Journal of the Acoustical Society of America, 118, 552–552. doi: 10.1121/1.1925967. [DOI] [PubMed] [Google Scholar]
  14. Culling JF, & Stone MA (2017). Energetic Masking and Masking Release. In Middlebrooks JC, Simon JZ, Popper AN, & Fay RR (Eds.), The Auditory System at the Cocktail Party (pp. 41–73). Cham: Springer International Publishing; volume 60 of Springer Handbook of Auditory Research. doi: 10.1007/978-3-319-51662-2. [DOI] [Google Scholar]
  15. Darwin CJ, Brungart DS, & Simpson BD (2003). Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers. The Journal of the Acoustical Society of America, 114, 2913. doi: 10.1121/1.1616924. [DOI] [PubMed] [Google Scholar]
  16. de Cheveigné A (1993). Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing. The Journal of the Acoustical Society of America, 93, 3271–3290. doi: 10.1121/1.405712. [DOI] [Google Scholar]
  17. de Cheveigné A (2021). Harmonic Cancellation—A Fundamental of Auditory Scene Analysis. Trends in Hearing, 25, 2331–2165. doi: 10.1177/23312165211041422. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. de Cheveigné A, Kawahara H, Tsuzaki M, & Aikawa K (1997). Concurrent vowel identification. I. Effects of relative amplitude and F0 difference. The Journal of the Acoustical Society of America, 101, 2839–2847. doi: 10.1121/1.418517. [DOI] [Google Scholar]
  19. de Cheveigné A, McAdams S, Laroche J, & Rosenberg M (1995). Identification of concurrent harmonic and inharmonic vowels: A test of the theory of harmonic cancellation and enhancement. The Journal of the Acoustical Society of America, 97, 3736–3748. doi: 10.1121/1.412389. [DOI] [PubMed] [Google Scholar]
  20. Deroche MLD, & Culling JF (2011). Voice segregation by difference in fundamental frequency: Evidence for harmonic cancellation. The Journal of the Acoustical Society of America, 130, 2855–2865. doi: 10.1121/1.3643812. [DOI] [PubMed] [Google Scholar]
  21. Deroche MLD, & Culling JF (2013). Voice segregation by difference in fundamental frequency: Effect of masker type. The Journal of the Acoustical Society of America, 134, EL465–EL470. doi: 10.1121/1.4826152. [DOI] [PubMed] [Google Scholar]
  22. Deroche MLD, Culling JF, Chatterjee M, & Limb CJ (2014a). Roles of the target and masker fundamental frequencies in voice segregation. The Journal of the Acoustical Society of America, 136, 1225–1236. doi: 10.1121/1.4890649. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Deroche MLD, Culling JF, Chatterjee M, & Limb CJ (2014b). Speech recognition against harmonic and inharmonic complexes: Spectral dips and periodicity. The Journal of the Acoustical Society of America, 135, 2873–2884. doi: 10.1121/1.4870056. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Durlach NI (1972). Binaural signal detection: Equalization and cancellation theory. In Tobias J (Ed.), Foundations of Modern Auditory Theory (pp. 371–462). Academic, New York: volume II. [Google Scholar]
  25. Durlach NI, Mason CR, Kidd G, Arbogast TL, Colburn HS, & Shinn-Cunningham BG (2003). Note on informational masking (L). The Journal of the Acoustical Society of America, 113, 2984–2987. doi: 10.1121/1.1570435. [DOI] [PubMed] [Google Scholar]
  26. Festen JM, & Plomp R (1990). Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing. The Journal of the Acoustical Society of America, 88, 1725–1736. doi: 10.1121/1.400247. [DOI] [PubMed] [Google Scholar]
  27. Fogerty D, Xu J, & Gibbs BE (2016). Modulation masking and glimpsing of natural and vocoded speech during single-talker modulated noise: Effect of the modulation spectrum. The Journal of the Acoustical Society of America, 140, 1800–1816. doi: 10.1121/1.4962494. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Gardner WG, & Martin KD (1995). HRTF measurements of a KEMAR. The Journal of the Acoustical Society of America, 97, 3907–3908. doi: 10.1121/1.412407. [DOI] [Google Scholar]
  29. Hauth CF, & Brand T (2018). Modeling sluggishness in binaural unmasking of speech for maskers with time-varying interaural phase differences. Trends in Hearing, 22, 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Hawley ML, Litovsky RY, & Culling JF (2004). The benefit of binaural hearing in a cocktail party: Effect of location and type of interferer. The Journal of the Acoustical Society of America, 115, 833–843. doi: 10.1121/1.1639908. [DOI] [PubMed] [Google Scholar]
  31. Jelfs S, Culling JF, & Lavandier M (2011). Revision and validation of a binaural model for speech intelligibility in noise. Hearing Research, 275, 96–104. doi: 10.1016/j.heares.2010.12.005. [DOI] [PubMed] [Google Scholar]
  32. Lavandier M, & Best V (2020). Modeling Binaural Speech Understanding in Complex Situations. In Blauert J, & Braasch J (Eds.), The Technology of Binaural Understanding Modern Acoustics and Signal Processing (pp. 547–578). Cham: Springer International Publishing. doi: 10.1007/978-3-030-00386-9_19. [DOI] [Google Scholar]
  33. Lavandier M, & Culling JF (2010). Prediction of binaural speech intelligibility against noise in rooms. The Journal of the Acoustical Society of America, 127, 387–399. doi: 10.1121/1.3268612. [DOI] [PubMed] [Google Scholar]
  34. Lavandier M, Vicente T, & Prud’homme L (2022). A series of SNR-based speech intelligibility models in the Auditory Modeling Toolbox. Acta Acustica, Accepted for publication. [Google Scholar]
  35. Leclère T, Lavandier M, & Deroche ML (2017). The intelligibility of speech in a harmonic masker varying in fundamental frequency contour, broadband temporal envelope, and spatial location. Hearing Research, 350, 1–10. doi: 10.1016/j.heares.2017.03.012. [DOI] [PubMed] [Google Scholar]
  36. Majdak P, Hollomey C, & Baumgartner R (2022). AMT 1.x: A toolbox for reproducible research in auditory modeling. Acta Acustica, Accepted for publication. [Google Scholar]
  37. Moore BC, & Glasberg BR (1983). Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. The Journal of the Acoustical Society of America, 74, 750–753. doi: 10.1121/1.389861. [DOI] [PubMed] [Google Scholar]
  38. Peters RW, Moore BCJ, & Baer T (1998). Speech reception thresholds in noise with and without spectral and temporal dips for hearing-impaired and normally hearing people. The Journal of the Acoustical Society of America, 103, 577–587. doi: 10.1121/1.421128. [DOI] [PubMed] [Google Scholar]
  39. Plomp R (1976). Binaural and Monaural Speech Intelligibility of Connected Discourse in Reverberation as a Function of Azimuth of a Single Competing Sound Source (Speech or Noise). Acta Acustica united with Acustica, 34, 200–211. [Google Scholar]
  40. Prud’homme L, Lavandier M, & Best V (2020). A harmonic-cancellation-based model to predict speech intelligibility against a harmonic masker. The Journal of the Acoustical Society of America, 148, 3246–3254. doi: 10.1121/10.0002492. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Prud’homme L, Lavandier M, & Best V (2022). Investigating the role of harmonic cancellation in speech-on-speech masking. Hearing Research, Submitted to this special issue, under review. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Relaño-Iborra H, May T, Zaar J, Scheidiger C, & Dau T (2016). Predicting speech intelligibility based on a correlation metric in the envelope power spectrum domain. The Journal of the Acoustical Society of America, 140, 2670–2679. doi: 10.1121/1.4964505. [DOI] [PubMed] [Google Scholar]
  43. Steinmetzger K, & Rosen S (2015). The role of periodicity in perceiving speech in quiet and in background noise. The Journal of the Acoustical Society of America, 138, 3586–3599. doi: 10.1121/1.4936945. [DOI] [PubMed] [Google Scholar]
  44. Steinmetzger K, Zaar J, Relaño-Iborra H, Rosen S, & Dau T (2019). Predicting the effects of periodicity on the intelligibility of masked speech: An evaluation of different modelling approaches and their limitations. The Journal of the Acoustical Society of America, 146, 2562–2576. doi: 10.1121/1.5129050. [DOI] [PubMed] [Google Scholar]
  45. Stone MA, Füllgrabe C, Mackinnon RC, & Moore BCJ (2011). The importance for speech intelligibility of random fluctuations in “steady” background noise. The Journal of the Acoustical Society of America, 130, 2874–2881. doi: 10.1121/1.3641371. [DOI] [PubMed] [Google Scholar]
  46. Stone MA, Füllgrabe C, & Moore BCJ (2012). Notionally steady background noise acts primarily as a modulation masker of speech. The Journal of the Acoustical Society of America, 132, 317–326. doi: 10.1121/1.4725766. [DOI] [PubMed] [Google Scholar]
  47. Summerfield Q, & Culling JF (1992). Periodicity of maskers not targets determines ease of perceptual segregation using differences in fundamental frequency. The Journal of the Acoustical Society of America, 92, 2317–2317. doi: 10.1121/1.405031. [DOI] [Google Scholar]
  48. Tang Y, Cooke M, Fazenda BM, & Cox TJ (2016). A metric for predicting binaural speech intelligibility in stationary noise and competing speech maskers. The Journal of the Acoustical Society of America, 140, 1858–1870. [DOI] [PubMed] [Google Scholar]
  49. Vicente T, & Lavandier M (2020). Further validation of a binaural model predicting speech intelligibility against envelope-modulated noises. Hearing Research, 390, 107937. doi: 10.1016/j.heares.2020.107937. [DOI] [PubMed] [Google Scholar]
  50. Vicente T, Lavandier M, & Buchholz JM (2020). A binaural model implementing an internal noise to predict the effect of hearing impairment on speech intelligibility in non-stationary noises. The Journal of the Acoustical Society of America, 148, 3305–3317. doi: 10.1121/10.0002660. [DOI] [PubMed] [Google Scholar]
