The Journal of the Acoustical Society of America. 2018 Sep 13;144(3):1392–1405. doi: 10.1121/1.5053115

An ideal quantized mask to increase intelligibility and quality of speech in noise

Eric W. Healy1,a),b) and Jordan L. Vasko1,b)
PMCID: PMC6136922  PMID: 30424638

Abstract

Time-frequency (T-F) masks represent powerful tools to increase the intelligibility of speech in background noise. Translational relevance is provided by their accurate estimation based only on the signal-plus-noise mixture, using deep learning or other machine-learning techniques. In the current study, a technique is designed to capture the benefits of existing techniques. In the ideal quantized mask (IQM), speech and noise are partitioned into T-F units, and each unit receives one of N attenuations according to its signal-to-noise ratio. It was found that as few as four to eight attenuation steps (IQM4, IQM8) improved intelligibility over the ideal binary mask (IBM, having two attenuation steps), and equaled the intelligibility resulting from the ideal ratio mask (IRM, having a theoretically infinite number of steps). Sound-quality ratings and rankings of noisy speech processed by the IQM4 and IQM8 were also superior to that processed by the IBM and equaled or exceeded that processed by the IRM. It is concluded that the intelligibility and sound-quality advantages of infinite attenuation resolution can be captured by an IQM having only a very small number of steps. Further, the classification-based nature of the IQM might provide algorithmic advantages over the regression-based IRM during machine estimation.

I. INTRODUCTION

The perception of speech in background noise represents a challenge for a variety of listeners in a variety of settings. Normal-hearing (NH) listeners with proficiency of the language can tolerate considerable amounts of noise if conditions are otherwise ideal. But even these best listeners can struggle if the signal is acoustically deficient, as can be the case during transmission over cellular phones or other communication systems. The situation is compounded if the listener does not have complete proficiency with the language, as can be the case for non-native-language listeners, children, and other individuals. But the challenge is particularly striking for listeners with hearing loss. In fact, poor speech recognition when background noise is present is a primary auditory complaint of hearing-impaired (HI) individuals (see Moore, 2007; Dillon, 2012), and the speech-in-noise problem for these listeners represents one of our greatest challenges.

Fortunately, techniques exist to help alleviate this challenge. Time-frequency (T-F) masking represents a powerful tool for improving the intelligibility of speech in noise. In T-F masking, the speech-plus-noise mixture is divided in both time and frequency into small units, and each unit is scaled in level according to the relationship between the speech and the noise within the unit. Units with less favorable signal-to-noise ratios (SNRs) are attenuated, resulting in a signal containing T-F units largely dominated by the target speech signal.

There are two main classes of T-F masks, known as “hard” and “soft” masks, corresponding to the two main T-F masking schemes: binary masking and ratio masking. In the ideal binary mask (IBM; Hu and Wang, 2001; Wang, 2005), each T-F unit is assigned a value of 1 if it is dominated by the target speech or 0 if it is dominated by noise. The IBM is then multiplied with the speech-plus-noise mixture, causing units dominated by the target speech to remain intact and units dominated by the noise to be discarded. In the ideal ratio mask (IRM; Srinivasan et al., 2006; Narayanan and Wang, 2013; Hummersone et al., 2014; Wang et al., 2014), each T-F unit is again assigned an attenuation scaling according to the speech versus noise relationship. But instead of being binary, this scaling can take any value along a continuum from 0 to 1. Units having a more favorable SNR are attenuated less and those having a less favorable SNR are attenuated more. Accordingly, the IRM is similar to the classic Wiener filter (see Loizou, 2007). As with the IBM, the speech-plus-noise mixture is multiplied with this mask to obtain an array of T-F units, each scaled according to its speech versus noise dominance.

Both masks are capable of producing vast improvements in the intelligibility of noisy speech. Brungart et al. (2006), Li and Loizou (2008a,b), Kim et al. (2009), Kjems et al. (2009), and Sinex (2013) all found that the IBM could produce near-perfect sentence intelligibility for NH listeners in various noises (speech-shaped noise, speech-modulated noise, 2- to 20-talker babble, and various recorded environmental sounds). Anzalone et al. (2006) and Wang et al. (2009) tested both NH and HI subjects, and found that the IBM could produce substantial speech-reception threshold improvements for sentences in different noises (speech-shaped noise and cafeteria noise). With regard to the IRM, Madhu et al. (2013) and Koning et al. (2015) found that it can also produce near-perfect sentence intelligibility for NH listeners in different noises (multi-talker babble and single-talker interference).

The comparison between intelligibility produced by the IBM versus that produced by the IRM is made difficult by the fact that all of the studies just cited employed sentence materials and those employing percent-correct intelligibility often observed ceiling scores at or near 100%. But Madhu et al. (2013) and Koning et al. (2015) both observed that the IRM produced ceiling intelligibility for NH subjects over a wider range of parameter values than did the IBM. In contrast, Brons et al. (2012) observed that the IBM led to better intelligibility than did an IRM having an attenuation of 10 dB for all units with SNR below 0 dB. Thus, the relative intelligibilities produced by the IBM versus the IRM are not clear.

What is clearer is that soft masking typically provides better subjective sound quality of speech than does hard masking. Madhu et al. (2013) conducted pairwise comparisons of preferred sound quality for NH subjects, and found that the ideal Wiener filter was preferred over the IBM in 88%–100% of trials.

The term “ideal” refers to the fact that the masks are created using knowledge of the pre-mixed speech and noise signals—they are oracle masks. The term also refers to the fact that the IBM produces the optimal SNR gain of all binary T-F masks under certain conditions (Li and Wang, 2009). Obviously, knowledge of the pre-mixed signals is not present in real-world settings. But translational significance for T-F masks comes from efforts to estimate them directly from the speech-plus-noise mixture, and the IBM has for many years been considered a goal of computational auditory scene analysis (Wang, 2005). Recent advances in machine learning, and particularly deep learning, have allowed both the IBM and IRM to be estimated with accuracy sufficient to produce considerable intelligibility improvements. This work has involved both NH listeners (Kim et al., 2009; Healy et al., 2013; Healy et al., 2014; Healy et al., 2015; Chen et al., 2016; Healy et al., 2017; Monaghan et al., 2017; Bentsen et al., 2018) and HI listeners (Healy et al., 2013; Healy et al., 2014; Healy et al., 2015; Chen et al., 2016; Healy et al., 2017; Monaghan et al., 2017) and has included a variety of background noises (speech-shaped noise, multi-talker babble, recorded environmental sounds, and single-talker interference). The intelligibility improvements have often allowed HI subjects having access to speech processed by the estimated T-F mask to equal the performance of young NH subjects without processing.

In addition to their different perceptual ramifications, the two main T-F masking schemes possess different characteristics that may be relevant for their estimation by machine-learning algorithms. Estimation of the IBM involves classification of T-F units into two categories using a single decision boundary, whereas estimation of the IRM typically involves regression and approximation of the continuous function underlying attenuation versus SNR. These represent very different learning regimes, with classification into a small number of categories potentially representing a more basic form of machine learning, underlying more elementary tasks such as object recognition (e.g., character recognition) and word/phoneme recognition. Accordingly, it has been argued that computation of a binary mask may be considerably simpler than computation of a soft mask (Wang, 2008).

But it has also been argued (e.g., Wang et al., 2014) and observed (Madhu et al., 2013; Koning et al., 2015; Bentsen et al., 2018) that binary masks are less robust to estimation errors relative to soft masks, meaning that the errors that occur during machine estimation are likely larger in magnitude in binary than in soft masks. It is easy to see that every estimation error in a binary mask is of maximum magnitude (e.g., assigning 1 to a T-F unit that should have been 0, or vice versa), and that in a soft mask, estimation errors can take any value, with an upper bound corresponding to the magnitude of the binary-mask error.

In the current study, a mask is proposed that is different from the two main T-F masking schemes. In the ideal quantized mask (IQM), the speech-plus-noise mixture is divided into T-F units and each is assigned an attenuation based on SNR. However, this attenuation takes one of N values, where N is a small integer. The T-F masking conditions employed in the current study form a continuum in terms of attenuation steps, from two (IBM) to infinity (IRM). The three intermediate conditions involve IQMs having 3, 4, and 8 steps (IQM3, IQM4, and IQM8). The goals of the current study are to first clarify the relative intelligibilities produced by the IBM versus the IRM. Then, the intelligibility and sound quality of the IQM are established in NH subjects and compared to those resulting from the IBM and IRM. The goal is to capture the (potential) intelligibility and well-established sound-quality advantages of the IRM, and the classification nature and potential computational advantages of the IBM, in an IQM having only a very small number of attenuation steps.

II. EXPERIMENT 1. INTELLIGIBILITY RESULTING FROM VARIOUS T-F MASKS

In experiment 1, intelligibility was assessed in each of the five conditions of T-F masking. The speech materials selected were standard word lists because sentences tend to produce ceiling intelligibility values when subjected to both IBM and IRM processing. Experiment 1a involved broadband (unfiltered) word stimuli, and experiment 1b involved the same stimuli subjected to bandpass filtering in order to further avoid ceiling effects and better reveal differences across conditions. The background noise employed involved recordings from a busy cafeteria. It was selected for its ecological validity and its variety of sound sources and types, including the babble of multiple talkers, the transient impact sound of dishes, and other environmental sounds.

A. Method

1. Subjects

A total of 20 subjects participated: 10 in experiment 1a and 10 in experiment 1b. The subjects were students at The Ohio State University and received course credit for participating. All were native speakers of American English and had NH as defined by audiometric thresholds of 20 dB hearing level (HL) or better at octave frequencies from 250 to 8000 Hz on the day of test (ANSI, 2004, 2010). The exception was one subject with a threshold of 25 dB HL at 8000 Hz in one ear. Ages ranged from 19 to 29 yr (mean = 20.9 yr) and all were female. Care was taken to ensure that no subject had prior exposure to the speech materials employed.

2. Stimuli

The speech materials for both experiments 1a and 1b were from the Central Institute for the Deaf (CID) W-22 test (Hirsh et al., 1952), drawn from an Auditec CD (St. Louis, MO). The test includes 200 phonetically balanced words in the carrier phrase, “Say the word ___.” Five words were excluded (mew, two, dull, book, there), based on low frequency of occurrence or poor articulation/recording quality, to yield 195 words. The background cafeteria noise was also from an Auditec CD. It was approximately 10 min in duration and consisted of three overdubbed recordings made in a busy hospital-employee cafeteria. Noise segments having random start points and durations equal to each word in its carrier phrase were mixed with each speech utterance at an overall SNR of −10 dB. This relatively low SNR was selected to reduce ceiling intelligibility effects.
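
To make the mixing step concrete, the following minimal Python sketch (not the study's processing code) scales a randomly selected noise segment so that the overall SNR of the mixture equals a target value such as −10 dB; the function and variable names are assumptions introduced here for illustration.

import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None):
    # Mix `speech` with a random segment of `noise` at an overall SNR in dB.
    # A sketch of the mixing described in the text; names are illustrative.
    rng = np.random.default_rng() if rng is None else rng
    start = rng.integers(0, len(noise) - len(speech))        # random start point
    segment = noise[start:start + len(speech)].astype(float)

    # Scale the noise so that 10*log10(speech power / noise power) = snr_db.
    speech_pow = np.mean(np.asarray(speech, dtype=float) ** 2)
    noise_pow = np.mean(segment ** 2)
    segment *= np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))

    return speech + segment, segment   # mixture, and scaled noise kept for mask creation

# Example: mixture, scaled_noise = mix_at_snr(word_utterance, cafeteria_noise, -10)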

The files were down-sampled to 16 kHz for processing in matlab (MathWorks, Natick, MA). Preparation of the T-F masks began by dividing each speech-plus-noise mixture into a T-F representation. The cochleagram representation (Wang and Brown, 2006) was employed, which is essentially a spectrogram having attributes similar to the human cochlea. This involved first filtering into 64 gammatone bands having center frequencies ranging from 50 to 8000 Hz evenly spaced on the equivalent rectangular bandwidth scale (Glasberg and Moore, 1990). Each band was then divided into 20-ms time segments having 10 ms overlap using a Hanning window. This same T-F representation was used for all of the masks.
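
As a sketch of the T-F analysis parameters just described, the code below computes 64 center frequencies evenly spaced on the ERB-number scale of Glasberg and Moore (1990) and the 20-ms/10-ms framing values; the gammatone filtering itself is assumed to come from an existing auditory-modeling toolbox and is not reimplemented here.

import numpy as np

def erb_space(f_low=50.0, f_high=8000.0, n_channels=64):
    # Center frequencies evenly spaced on the ERB-number (Cam) scale.
    hz_to_erb = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    erb_to_hz = lambda e: (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    return erb_to_hz(np.linspace(hz_to_erb(f_low), hz_to_erb(f_high), n_channels))

fs = 16000                       # Hz, after down-sampling
center_freqs = erb_space()       # 64 gammatone center frequencies, 50-8000 Hz
frame_len = int(0.020 * fs)      # 20-ms time segments
frame_hop = int(0.010 * fs)      # 10-ms overlap between segments
window = np.hanning(frame_len)   # Hanning window applied to each segment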

a. Preparation of the IBM.

The IBM consists of a two-dimensional array of 1's and 0's, one value for each T-F unit. Its processing followed that employed by us previously (Healy et al., 2013; Healy et al., 2014). The SNR within each T-F unit was calculated based on the pre-mixed signals. If the SNR was greater than a fixed local criterion (LC) value, the unit was concluded to be target-speech dominated and it was assigned a value of 1. Conversely, if that SNR was less than or equal to LC, the unit was concluded to be noise dominated and it was assigned a value of 0. That is,

\mathrm{IBM}(t,f) = \begin{cases} 1, & \text{if } \mathrm{SNR}(t,f) > LC \\ 0, & \text{otherwise}, \end{cases} \qquad (1)

where SNR(t,f) denotes the SNR within the T-F unit centered at time t and frequency f. LC was set to −15 dB in order to be 5 dB below the overall SNR. This relationship between LC and SNR has been considered near optimal (e.g., Brungart et al., 2006; Kjems et al., 2009; Vasko et al., 2018), and has also been employed by us previously (Healy et al., 2013; Healy et al., 2014). To create the IBM-processed signals, the mask was applied to the speech-plus-noise mixture by multiplying each mixture T-F unit by the value of the IBM for that unit.
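
A minimal sketch of Eq. (1) in Python is given below, assuming that the pre-mixed speech and noise energies per T-F unit are available as arrays (names are illustrative, not the authors' code).

import numpy as np

def ideal_binary_mask(speech_energy, noise_energy, lc_db=-15.0):
    # Eq. (1): 1 where the local SNR exceeds the local criterion (LC), else 0.
    snr_db = 10.0 * np.log10(speech_energy / np.maximum(noise_energy, 1e-12))
    return (snr_db > lc_db).astype(float)

# The IBM-processed signal is obtained by multiplying each mixture T-F unit
# by the corresponding mask value before resynthesis.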

b. Preparation of the IRM.

The IRM also consists of a two-dimensional array of values, one for each T-F unit, but these values are continuous between 0 and 1. It was also based on the relative energies of speech versus noise within each T-F unit, as defined by

\mathrm{IRM}(t,f) = \sqrt{\frac{S(t,f)}{S(t,f)+N(t,f)}} = \sqrt{\frac{\mathrm{SNR}(t,f)}{\mathrm{SNR}(t,f)+1}}, \qquad (2)

where S(t,f) is the speech energy contained within the T-F unit centered at time t and frequency f, and N(t,f) is the noise energy contained within the same unit. Whereas Eq. (1) is a conditional statement, so the units of measure are irrelevant as long as SNR and LC are expressed in the same units (here, dB), the SNR in Eq. (2) is an untransformed (linear) ratio of energies. This square-root form of the IRM has been found to be optimal (e.g., Wang et al., 2014) and has been employed by us previously (Healy et al., 2015; Healy et al., 2017; Chen et al., 2016). The mask was applied to the speech-plus-noise mixture, again by weighting each mixture T-F unit by the value of the IRM for that unit.
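
The corresponding sketch of Eq. (2), under the same assumptions about the energy arrays as the IBM sketch above, is:

import numpy as np

def ideal_ratio_mask(speech_energy, noise_energy):
    # Eq. (2): square-root form of the IRM, bounded by 0 and 1.
    return np.sqrt(speech_energy / np.maximum(speech_energy + noise_energy, 1e-12))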

c. Preparation of the IQM.

IQMs were created having three, four, and eight attenuation steps (IQM3, IQM4, IQM8). The SNR boundaries defining each step of the IQM and the attenuation assigned to each step were based on the IBM and IRM functions. The SNR boundaries were centered such that an IQM2 (an IQM having only two steps) would equal the IBM (having an LC value 5 dB below the overall mixture SNR) once scaled for overall level. The center SNR boundaries of the IQM4 and IQM8 (between steps 2 and 3 in the IQM4 and between steps 4 and 5 in the IQM8) were also set to equal the single IBM division. The attenuation assigned to each step (the IQM value) was equal to the attenuation assigned by the IRM (the IRM value) at the lower SNR boundary for the step. The exception was that the lowest step was always assigned a value of 0, as in the IBM.

The process began with the selection of a series of points on the IRM function, according to Eqs. (3) and (4),

p = -\log_{2}\sqrt{\frac{10^{(LC/10)}}{10^{(LC/10)}+1}}, \qquad (3)
x_{n}(t,f) = \left(\frac{n-1}{N}\right)^{p}, \qquad (4)

where the exponent p was selected based on the LC for the IBM (−15 dB) to provide the desired relationship between the IQM center SNR boundary and the IBM boundary. x_n(t,f) represents the mask gain within the T-F unit centered at time t and frequency f, N represents the total number of steps in the IQM, and n = 1,…,N represents the ordinal position of each step. These points became the SNR boundaries and attenuation values for the IQM, as in Eq. (5),

\mathrm{IQM}_{N}(t,f) = \begin{cases} x_{1}(t,f), & \text{if } 0 \le \mathrm{IRM}(t,f) \le x_{2}(t,f) \\ x_{2}(t,f), & \text{if } x_{2}(t,f) < \mathrm{IRM}(t,f) \le x_{3}(t,f) \\ \quad\vdots \\ x_{N}(t,f), & \text{if } x_{N}(t,f) < \mathrm{IRM}(t,f) \le 1. \end{cases} \qquad (5)

As with the other two masks, the IQM was applied by multiplying the stepped mask with the speech-plus-noise mixture units. Figure 1 displays the SNR boundaries for each step of each IQM employed (top panel) and the attenuations produced by each step of each IQM employed (bottom panel). Every stimulus was scaled after processing to the same overall root-mean-square level, eliminating differences in overall level.
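
The sketch below gives one reading of Eqs. (3)–(5) in Python: the exponent p centers the middle step boundary on the IBM criterion (p ≈ 2.51 for LC = −15 dB), the step values x_1…x_N are computed, and each T-F unit of the IRM is replaced by the value of its step. It is an illustration of the quantization, not the authors' code.

import numpy as np

def ideal_quantized_mask(irm, n_steps=8, lc_db=-15.0):
    # Eq. (3): choose p so that (1/2)**p equals the IRM value at SNR = LC.
    irm_at_lc = np.sqrt(10 ** (lc_db / 10) / (10 ** (lc_db / 10) + 1))
    p = -np.log2(irm_at_lc)                  # ~2.51 for LC = -15 dB

    # Eq. (4): step values x_1..x_N; x_1 = 0, so the lowest step zeroes units.
    x = (np.arange(n_steps) / n_steps) ** p

    # Eq. (5): assign each unit the value of the step whose boundaries contain
    # its IRM value (the boundaries are x_2..x_N).
    step_index = np.sum(irm[..., None] > x[1:], axis=-1)
    return x[step_index]

# Example: iqm8 = ideal_quantized_mask(ideal_ratio_mask(S, N), n_steps=8)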

FIG. 1. The top panel displays the SNR boundaries for each step of each IQM employed, and the bottom panel displays the attenuations assigned to each step of each IQM employed.

Whereas the IBM takes values of either 0 or 1, the IRM takes on values bounded by 0 and 1. Although the IRM is capable in theory of zeroing T-F units, it is potentially notable that this will generally not occur because the likelihood of zero signal energy within a T-F unit is nil. But like the IBM, the current IQM was designed to zero all T-F units at the lowest step [x_1(t,f) always = 0]. This decision was made to reduce the perception of low-level noise arising from the T-F units having the least-favorable SNRs. It is also noteworthy that the current implementation of the IQM is based on existing T-F masks in order to facilitate direct comparison, and it is not yet known to what extent this particular implementation produces optimal human intelligibility and sound quality.

The broadband stimuli processed as just described were used for experiment 1a. For experiment 1b, the same stimuli were subjected to bandpass filtering from 750 to 3000 Hz. The final processed stimuli were passed through a 2000-order finite-duration impulse response filter, resulting in steep filter slopes that exceeded 1000 dB/octave.
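
A sketch of the band-limiting step using SciPy is shown below; the window choice and other design details beyond the order and band edges are assumptions.

import numpy as np
from scipy.signal import firwin, lfilter

fs = 16000
# 2000th-order (2001-tap) linear-phase FIR bandpass, 750-3000 Hz passband.
bp_taps = firwin(2001, [750, 3000], pass_zero=False, fs=fs)

def bandpass_750_3000(x):
    # Apply the steep bandpass filter used for the experiment 1b stimuli.
    return lfilter(bp_taps, 1.0, x)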

B. Procedure

The procedures for experiments 1a and 1b were identical. The experiment was divided into three blocks, each involving 13 words in each mask condition for a total of 39 words/condition. The order of mask conditions (IBM, IQM3, IQM4, IQM8, IRM) was randomized for each block and subject, as was the word list-to-condition correspondence. The stimuli were converted to analog form using Echo Digital Audio (Santa Barbara, CA) Gina 3G digital-to-analog converters and presented diotically over Sennheiser HD 280 headphones (Wedemark, Germany). The presentation level was set to 65 dBA at each earphone at the start of each session using a flat-plate coupler and sound level meter (Larson Davis AEC 101 and 824, Depew, NY). Subjects were tested individually in a double-walled audiometric booth seated with the experimenter. The subjects were instructed to repeat each word back as best they could after hearing each and were encouraged to guess if unsure. No word was repeated for any listener. The experimenter controlled the presentation of words and recorded responses. Testing began with a brief practice in which subjects heard words from the consonant-nucleus-consonant (CNC) corpus (Lehiste and Peterson, 1959). These were also standard recordings produced by a male talker and in a carrier phrase (“Ready, ___.”). Subjects heard five CNC words in each mask condition in order of decreasing number of attenuation steps (IRM, IQM8, IQM4, IQM3, IBM). Feedback was provided during practice but not during formal testing.

C. Results and discussion

1. Human subjects results

The top panel of Fig. 2 displays group mean word-recognition scores for each broadband T-F mask in experiment 1a. Apparent in this panel is that all masks produced high recognition scores (above 70% correct), but that scores for the IBM were lower than those for the IQMs and the IRM, which all exceeded 90% correct. The bottom panel of Fig. 2 displays scores for the group hearing the bandpass stimuli in experiment 1b. Apparent is that scores were reduced below the ceiling and larger differences between scores emerged across the different masks, both as desired. A first notable finding is that speech recognition produced by the IBM is not equal to that produced by the IRM, even though both produce similar ceiling scores for sentence intelligibility. Instead, mean recognition scores were better for the IRM by 36 percentage points when ceiling values were eliminated. A second primary finding is that scores were highest in the IQM8 condition and scores for the IQM4 approximated that for the IRM. Thus, it appears that the intelligibility benefit of the IRM can be captured with as few as four attenuation steps. Finally, it is noted that the addition of any number of attenuation steps above two produced increased speech recognition.

FIG. 2. Group mean W-22 word recognition (and standard errors) for NH subjects hearing speech in cafeteria noise, processed by five different T-F masks. The top panel displays scores for broadband signals and the bottom panel displays scores for a different group of subjects who heard the same signals filtered from 750 to 3000 Hz in order to avoid ceiling recognition values.

The scores were transformed into rationalized-arcsine units (RAUs; Studebaker, 1985) and subjected to a two-way mixed analysis of variance (ANOVA; 2 filtering groups × 5 mask conditions). The interaction between filtering and mask conditions was not significant [F(4,72) = 0.9, p = 0.45], suggesting that the pattern of performance across different mask conditions was generally consistent across experiments. As anticipated, the main effect of filtering was significant [F(1,18) = 359.6, p < 0.001], simply reflecting the desired reduction in scores associated with filtering. Most critically, the main effect of mask condition was significant [F(4,72) = 86.3, p < 0.001]. Performance across the five pooled mask conditions was examined using Holm-Sidak pairwise post hoc comparisons. Performance did not differ significantly among the IQM4, IQM8, and IRM (p ≥ 0.15), where scores were within 4 percentage points. All other comparisons were significant, indicating that the IBM and the IQM3 produced lower recognition scores (p < 0.001). The patterns of significant main effects and pairwise comparisons were identical when the RAU data from experiments 1a and 1b were subjected to separate one-way repeated-measures ANOVAs, even though the latter set of scores was free of ceiling effects and therefore differed more widely across mask conditions.
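
For reference, a standard textbook form of the rationalized arcsine transform (Studebaker, 1985) is sketched below; it is assumed here rather than copied from the article, and operates on the number of words correct out of the number presented.

import numpy as np

def rau(n_correct, n_items):
    # Rationalized arcsine units: linearizes proportion-correct scores so
    # that parametric statistics such as ANOVA are better justified.
    theta = (np.arcsin(np.sqrt(n_correct / (n_items + 1))) +
             np.arcsin(np.sqrt((n_correct + 1) / (n_items + 1))))
    return (146.0 / np.pi) * theta - 23.0

# Example: rau(35, 39) transforms a score of 35/39 words correct (~90%) to roughly 91 RAU.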

2. Acoustic intelligibility estimates

Predicted intelligibility based on the acoustic stimuli was also assessed using the standard metric, short-time objective intelligibility (STOI; Taal et al., 2011). This metric reflects the correlation between the temporal amplitude envelopes of the processed speech-plus-noise and those of clean unprocessed speech. The index therefore typically ranges from 0.0 to 1.0 (although negative correlations are possible) and reflects the extent to which the envelope of the processed noisy speech matches that of the original noise-free speech. It has been shown to be highly correlated with human speech intelligibility and is often used as an objective estimate of intelligibility.
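
As an illustration, STOI values of this kind can be computed with the open-source pystoi package (an implementation of Taal et al., 2011); the file names and loading step below are assumptions, not the authors' pipeline.

import soundfile as sf
from pystoi import stoi

clean, fs = sf.read('w22_word_clean.wav')        # clean unprocessed speech
processed, _ = sf.read('w22_word_masked.wav')    # mask-processed noisy speech

d = stoi(clean, processed, fs, extended=False)   # index typically between 0.0 and 1.0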

For each mask condition, the STOI value was calculated for each of the 195 W-22 words in its carrier separately, then averaged to obtain means and variability estimates. Accordingly, standard deviations were calculated rather than standard errors because each entry in the population estimate represents a single utterance, rather than a single human subject. Figure 3 displays these STOI values for the broadband stimuli employed in experiment 1a (top panel) and the filtered stimuli employed in experiment 1b (bottom panel). Apparent is that the trend observed across conditions in Fig. 2 can also be seen in Fig. 3, except that the STOI values are somewhat similar across conditions, suggesting that they may underpredict the human speech-recognition differences observed across the five mask conditions (see Taal et al., 2011, for functions mapping STOI to intelligibility). Most notable is the similarity across STOI scores observed for the experiment 1b stimuli, where ceiling effects were absent and large differences in human speech recognition were observed.

FIG. 3. STOI predictions based on the broadband acoustic stimuli employed in experiment 1a (top panel) and the filtered stimuli employed in experiment 1b (bottom panel). For each condition, STOI values were calculated for each W-22 utterance separately, and then averaged. Errors represent standard deviations.

III. EXPERIMENT 2a. SOUND-QUALITY RATINGS FOR VARIOUS T-F MASKS

In this experiment, the focus was on subjective sound quality. Subjects compared utterances processed by two different T-F masks and rated which sound quality was preferred and by how much. Everyday sentences were employed in order to provide a longer duration sample to judge and a more common communication unit. Further, sentences are highly intelligible when processed by both the IBM and IRM (and so presumably by the IQM as well), removing the influence of differential intelligibility and allowing subjects to focus on sound quality. Finally, the sentence was the same across the two masks compared in each trial in order to further focus the judgment on sound quality.

A. Method

1. Subjects

Ten subjects who had not participated in experiment 1 were recruited from courses at The Ohio State University and received course credit for participating. All had normal hearing on the day of test as defined in experiment 1, ages ranged from 19 to 21 yr (average = 19.9 yr), and all were female. Care was taken to ensure that none had been previously exposed to the sentence materials employed in this experiment.

2. Stimuli

The speech stimuli employed were CID everyday American speech sentences (Silverman and Hirsh, 1955; Davis and Silverman, 1978). These 100 sentences are contextually and grammatically plausible and range in length. They were spoken by a professional male talker having a standard American English dialect and digitized at 22 kHz with 16-bit resolution. For the current experiment, sentence-length variability was reduced by selecting the 81 sentences containing 3–8 words. These were intended to provide a sound sample that was long enough to generate a clear sound-quality judgment but short enough to facilitate repetitive back-and-forth comparison. The remaining 19 sentences, which were as long as 10 words or as short as 2 words, were saved for practice.

The speech was mixed with the same cafeteria noise employed in experiment 1 at the same SNR of −10 dB. Each sentence was mixed with a noise segment having a different random start point in the 10-min file, two separate times, to create 162 unique mixtures. The processing of the noisy speech by the five T-F masks was identical to that employed in experiment 1a (broadband speech).

3. Procedure

The sound-quality comparison procedure was modeled after that of Madhu et al. (2013), Koning et al. (2015), and Williamson et al. (2015). Subjects listened to pairs of stimuli, labeled A and B, and rated their preference for one over the other based on sound quality. Each of the 5 T-F masks was compared with each of the other masks and with itself, resulting in 15 comparisons. Each comparison was made 6 times, resulting in 90 trials/subject. For each subject and trial, a sentence-plus-noise was selected randomly without replacement, and the same sentence-plus-noise was used for both masks compared within each trial. The presentation order of mask comparisons was randomized, and the assignment to position A or B was counterbalanced so that each pair appeared three times in one orientation and three times in the other.

The subjects used custom presentation software that displayed two buttons labeled A and B and a seven-point Likert-type scale (Likert, 1932) labeled from left to right, “strongly prefer A; moderately prefer A; slightly prefer A; no preference; slightly prefer B; moderately prefer B; strongly prefer B.” Subjects were instructed to select how much they preferred one sentence over the other in terms of sound quality, and they were allowed to play each sentence as many times as they wished. It was suggested that they play each stimulus at least two or three times before rating. The stimuli were presented by pressing buttons A and B, and ratings were made by selecting one of the seven preferences on the scale, both using the computer mouse.

Prior to the experimental task, each subject completed practice in which each of the 15 comparisons was presented twice and the assignment to A and B was random. The practice CID sentences not used for formal testing were used for this stage. Subjects were tested while seated alone in a double-walled sound booth. As in experiment 1, stimuli were heard diotically at 65 dBA over Sennheiser HD 280 Pro headphones (Wedemark, Germany), and calibration was performed at the start of each session.

B. Results

1. Human subjects results

To quantify the sound-quality ratings, points were assigned as follows: no preference = 0; slightly prefer = 1; moderately prefer = 2; and strongly prefer = 3. Figure 4 displays the scores corresponding to each comparison averaged across subjects. Each panel corresponds to a single mask (the reference), and the columns within that panel represent the ratings for the various masks against that reference. Positive values indicate the extent to which the comparison mask was preferred over the reference, and negative values indicate the extent to which the reference was preferred. A value of 3.0 would indicate that the comparison was preferred over the reference in every trial by every subject, a value of 0.0 would indicate that no preference existed on average, and a value of −3.0 would indicate that the reference was preferred over the comparison in every trial by every subject.
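
A minimal sketch of this scoring rule, with hypothetical variable names, is:

# Convert one seven-point response into the signed score plotted in Fig. 4:
# positive when the comparison mask is preferred, negative when the reference is.
POINTS = {'no preference': 0, 'slightly prefer': 1,
          'moderately prefer': 2, 'strongly prefer': 3}

def signed_score(strength, comparison_preferred):
    return POINTS[strength] if comparison_preferred else -POINTS[strength]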

FIG. 4. Sound-quality ratings for the various T-F masks. Masks were presented in pairs and subjects rated which was preferred and by how much on a seven-point scale. “0” indicates “no preference,” “1” indicates “slight preference,” “2” indicates “moderate preference,” and “3” indicates “strong preference.” Each panel displays group mean (and standard error) ratings for each mask when compared against a given reference mask. Positive values indicate a preference for the comparison mask, and negative values indicate a preference for the reference.

It is first notable that the comparison of each mask against itself yielded a group mean rating of 0.0, corresponding to “no preference” and suggesting that the subjects were rating mask quality accurately. Second, the previously established sound-quality advantage of the IRM over the IBM is observed in these data, as the rightmost column in the top panel and the leftmost column in the bottom panel. The magnitude of this preference corresponded to “strongly prefer.”

With regard to the IQM, Fig. 4 indicates that subjects preferred its sound quality over that of the IBM. This is apparent in the top panel where IQM preference values are all positive (and also in each of the IQM reference panels where the value for the IBM is negative). The magnitude of the preference was between “moderately” and “strongly prefer” for IQM4 and IQM8. Figure 4 also indicates that the sound quality of IQM4 and IQM8 matched or was slightly preferred over that of the IRM. This is apparent in the bottom panel, where the IQM4 and IQM8 ratings are slightly positive relative to the IRM (it can also be seen in the IQM4 and IQM8 reference panels where the IRM ratings are slightly negative).

Paired replicates Wilcoxon signed rank tests were conducted to compare the mean sound-quality ratings for the ten unique comparisons among T-F masks. The difference in ratings was found to be statistically significant for eight of the ten comparisons [|W| ≥ 30.0, p ≤ 0.04]. For each significant difference, the mask with a greater number of attenuation steps was rated as preferable over the mask having fewer steps. The sound-quality ratings did not differ significantly for the IQM4 versus IRM [W = 9.0, p = 0.55], and for the IQM8 versus IRM [W = 26.0, p = 0.08]. Identical analyses on medians yielded similar results (except that the intermediate comparisons between ratings for the IQM4 versus adjacent masks IQM3 and IQM8 no longer differed).

Figure 5 displays the percentage of trials that were preferred when masks were compared that were adjacent along the number of attenuation steps continuum. For this analysis, 50% indicates a rating of no preference. Figure 5 shows that an increase in attenuation steps from two (IBM) to three (IQM3) caused the sound quality of the latter to be preferred in over 95% of the comparisons. The preference proportion is reduced as comparisons involve larger numbers of attenuation steps, with IQM4 preferred more often than IQM3 and IQM8 preferred slightly more often than IQM4. But that trend reverses once eight attenuation steps are reached, as the sound quality of the IQM8 was preferred slightly more often than that of the IRM.

FIG. 5. Percentage of trials in which sound quality was preferred for one mask over another. The comparisons shown are for masks that are adjacent along the attenuation-step continuum. 50% reflects no preference.

2. Acoustic sound-quality estimates

Sound-quality estimates corresponding to the five masks were also assessed using the perceptual evaluation of speech quality (PESQ; Rix et al., 2001). PESQ is a standard measure of speech sound quality based on acoustic measurement and has a scale ranging from −0.5 to 4.5. Like STOI, it reflects a comparison between speech-plus-noise following processing and clean unprocessed speech. Values were calculated for each of the CID sentence sound mixtures employed (two noises/sentence), in each of the mask conditions. Mean (and standard deviation) PESQ values are displayed in Fig. 6. Apparent is that the PESQ value increases as the number of attenuation steps exceeds two (IBM versus all IQMs), and values are highly similar for the IQM4 and IRM. Comparisons across the scales corresponding to STOI and PESQ are difficult to make, but unlike the STOI values in Fig. 3, the PESQ values appear to display a pattern across the five mask conditions that reflects the pattern of human-subject ratings (also see the human-subject pattern in Fig. 7).
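
For illustration, such values can be obtained with the open-source pesq package (an implementation of ITU-T P.862, as in Rix et al., 2001); the narrowband mode returns scores on the −0.5 to 4.5 scale cited above. File names and the loading step are assumptions.

import soundfile as sf
from pesq import pesq

ref, fs = sf.read('cid_sentence_clean.wav')      # clean reference sentence
deg, _ = sf.read('cid_sentence_masked.wav')      # mask-processed mixture

score = pesq(fs, ref, deg, 'nb')                 # narrowband PESQ (8- or 16-kHz input)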

FIG. 6. PESQ estimates of sound quality based on the acoustic stimuli employed in experiment 2a. Shown are means and standard deviations for CID sentences mixed with cafeteria noise and processed by the five T-F masks.

FIG. 7. Group mean subjective sound-quality rankings (and standard errors) for the five T-F masks. A ranking of “1” indicates that the sound quality was least preferred and a ranking of “5” indicates that the sound quality was most preferred. (A), (B), and (C) represent rankings produced by subjects involved in experiments 1a, 1b, and 2a, respectively. (D) displays the mean across these subgroups.

IV. EXPERIMENT 2b. CONFIRMING SIMILAR SENTENCE INTELLIGIBILITY ACROSS MASK CONDITIONS

Several steps were taken in experiment 2a to focus the judgment on subjective sound quality and control the potentially interfering influence of differential intelligibility. Those subjects participated in no intelligibility experiments, the experimenter remained outside of the sound booth to avoid exposure to another voice that could potentially influence relative sound-quality judgments, and the stimuli employed were simple sentences, which were assumed to have similar (ceiling) intelligibility across T-F mask conditions. Experiment 2b was undertaken to confirm this similarity in stimulus intelligibility.

A. Method

Ten subjects were recruited from the same population as for experiments 1a, 1b, and 2a. Nine completed the current experiment after completing experiment 4 involving different speech materials. All were NH as defined in experiment 1, ages ranged from 19 to 24 yr (average = 22.0 yr), and five were female. Again, care was taken to ensure that none had been exposed to the sentence materials employed in this experiment. The stimuli were the CID sentences employed in experiment 2a, each mixed with two cafeteria-noise segments and subjected to the five T-F masks. No sentences were excluded for length in this experiment, as the inclusion of very long and very short sentences would likely only serve to reduce intelligibility. Subjects 1–5 heard one set of mixtures (each sentence mixed with one noise segment) and subjects 6–10 heard the other set (the same sentences mixed with the other noise segment). Each subject heard 20 sentences in each of the 5 T-F mask conditions. The order of conditions was balanced such that each appeared in each serial position an equal number of times across subjects. The presentation of stimuli and collection of responses involved the same apparatus and procedures as in experiment 1. Subjects were instructed to report back each sentence after hearing it and to guess if unsure.

B. Results

The CID sentences each contain a number of scoring keywords, which generally correspond to all the content words and exclude the articles. For the current experiment, intelligibility corresponded to percentage of keywords correctly reported, with only exact matches accepted. Group-mean intelligibility was 99% for the sentences processed by the IBM and 100% for the remaining T-F masks (IQM3, IQM4, IQM8, and IRM). This result confirms the high and uniform intelligibility of the stimuli employed for sound-quality judgments in experiment 2a.

V. EXPERIMENT 3. SOUND-QUALITY RANKING OF VARIOUS T-F MASKS

In this experiment, subjects ranked the T-F mask conditions in order of subjective sound quality. The same highly intelligible everyday sentence was used for each of the five masks in order to focus the judgment on quality across masks. The use of number or letter labels for the masks was avoided because they carry inherent order characteristics and, instead, each mask was assigned an arbitrary shape. Further, these shapes were arranged in a circle on the subject interface to further diminish any implication of linear ordering.

A. Method

1. Subjects

The subjects were those employed for experiments 1a, 1b, and 2a, with the exception of one subject each from experiments 1a and 1b. There were then 28 subjects, all female, aged 19 to 29 yr (average = 20.6 yr).

2. Stimuli and procedure

This experiment was completed at the end of the same session as the other experiment in which each subject participated. The stimuli were drawn from experiment 2a and so involved the CID everyday speech sentences in cafeteria noise. The first 28 sentences used for formal testing were employed so that each subject heard a different sentence. Subjects heard that one sentence mixed with a single noise sample processed by each of the five T-F masks. The labels assigned to the masks were circle, triangle, star, diamond, and square. The correspondence between shape and mask condition was randomized for each subject, but the shapes always appeared in the same position on the screen, allowing the mask position on the screen to also be randomized for each subject. Subjects played the single sentence processed by each mask by using the computer mouse to press each of five shape-labeled buttons arranged in a circle on a computer monitor. The presentation of stimuli involved the same apparatus, presentation levels, and calibration as in experiments 1 and 2. Subjects ranked the shapes in order according to the preferred sound quality of the corresponding stimulus. They did so by ordering five paper cards, each displaying one shape, on a table in front of the computer monitor; the table was labeled “best” at one end and “worst” at the other. The subjects were instructed to play each sentence as many times as desired and they were allowed to place and move the cards as they wished. The final ordering of the cards was documented by the experimenter.

B. Results and discussion

Figure 7 displays the average rank assigned to each T-F mask by each subject subgroup, with 1 being the least preferred and 5 being the most preferred. Figures 7(A), 7(B), and 7(C) correspond to the three subject groups who performed the task at the end of experiments 1a, 1b, and 2a, respectively. Figure 7(D) displays the mean across panels. Apparent is the difference in sound-quality preference ranking for the IBM versus the IRM. The IBM value equaling 1.0 in Figs. 7(B) and 7(C) indicates that it was the least preferred of the five masks for every subject, and the value just exceeding 1.0 in Fig. 7(A) reflects that it was the least preferred by all but one subject. Also apparent is the increase in sound-quality ranking as more than two attenuation steps are introduced. On average across groups, the IQM4 ranking approximated that for the IRM. And for each subject group, the IQM8 ranking matched or exceeded that for the IRM.

It is not simple to predict the influence that a prior task can have on judgments of subjective sound quality. This is why the current experiment was repeated with each of the subjects in experiments 1a, 1b, and 2a. In these prior experiments, subjects heard speech that was similarly (experiment 1a), differently (experiment 1b), or equally intelligible (experiment 2a) in each condition, and they focused on intelligibility (experiments 1a and 1b) or sound quality (experiment 2a). Perhaps as a result of these differing prior experimental experiences, the patterns across Figs. 7(A)–7(C) differ somewhat. Obviously, if one had to choose which pattern was most representative of the population based on statistical variability and sampling theory, the mean would be selected [Fig. 7(D)]. But it is also likely that immediately prior conditions involving intelligibility (as is often done in research of this type) can influence subsequent judgments of subjective sound quality. It is reasonable to assume that a more understandable stimulus will become “preferred” after many intelligibility trials, potentially making it difficult to assess sound quality free of this preference bias in subsequent trials. Accordingly, it is possible to speculate that the subjects from experiment 2a whose immediate prior experience with the same processing involved only judgments of sound quality of equal-intelligibility stimuli best represent “pure” or uninfluenced subjective sound-quality judgments [Fig. 7(C)].

VI. EXPERIMENT 4. INTELLIGIBILITY PRODUCED BY VARIOUS T-F MASKS IN A HIGHLY MODULATED BACKGROUND

In this experiment, the intelligibility resulting from each of the five T-F masks was assessed for speech in a different background noise: interference consisting of a single competing talker. The rationale is twofold. First, it was of general interest to assess the ideal quantized masking of speech in a very different background type. Whereas cafeteria noise is a highly ecologically valid masker, single-talker interference is acoustically very different, characterized by far greater spectro-temporal modulation. More specific motivation involves the possibility that the different background type may influence the IRM − IBM intelligibility difference that the IQM is attempting to bridge (see Madhu et al., 2013; Koning et al., 2015). The stimuli in this experiment were filtered as in experiment 1b in order to eliminate ceiling effects and maximize the ability to observe intelligibility differences across the T-F masks.

A. Method

1. Subjects

A total of ten NH subjects were recruited from the population employed for experiments 1–3. Ages ranged from 19 to 24 yr (mean = 22.0 yr), and six were female. Normal hearing was defined as in experiment 1, compensation again consisted of extra course credit, and these subjects were all entirely naive to the speech materials employed.

2. Stimuli and procedure

The stimuli were highly similar to those employed in experiment 1b in order to facilitate direct comparison. The same 195 CID W-22 word recordings were employed as target stimuli. The background consisted of standard male-talker sentence recordings from the AzBio test (Spahr et al., 2012). These sentences were concatenated, and a background was selected for each target utterance by selecting a segment having a random start point and the same duration as the target speech (including carrier phrase). Each of the target utterances was mixed with background interference at both −10 and −20 dB SNR to create two sets of stimuli. The motivation for the more highly negative SNR comes in part from Madhu et al. (2013) and Koning et al. (2015), who found that the IRM − IBM intelligibility difference can be larger at more negative SNRs in modulated backgrounds. The same target speech-plus-interference pairs were employed for both SNRs in order to isolate the effect of SNR. This speech-plus-noise was subjected to the five T-F masking conditions using the same processing employed in experiment 1b, including 750–3000 Hz filtering. Also as in experiment 1, the LC was 5 dB below the input SNR.

3. Procedure

The presentation of signals and testing of subjects was accomplished using the apparatus and procedures of experiment 1. Subjects were randomly assigned to one of two groups, each hearing a different overall SNR. As in experiment 1, subjects heard 13 words in each T-F mask condition in each of 3 blocks, and condition order and word list-to-condition correspondence was randomized for each subject and block. Also as in experiment 1, practice using the same 25 CNC words preceded formal testing. In the current experiment, the first five practice words were heard unfiltered, one in each T-F mask condition, followed by four words in each filtered mask condition in order of decreasing number of attenuation steps. The SNR employed for practice was the same as that employed for formal testing.

B. Results and discussion

Figure 8 displays group mean intelligibility in each T-F mask condition. Scores for the group hearing the SNR of −10 dB are displayed in the top panel and scores for the group hearing −20 dB are in the bottom panel. The pattern of scores across T-F mask conditions is highly similar to that observed in the lower panel of Fig. 2 where the background was cafeteria noise. The IRM produced higher scores than the IBM at both SNRs, the IQM3 showed improvements over the IBM, and scores for the IQM4 and IQM8 matched or exceeded those observed for the IRM. It is potentially interesting to note that scores for the IQM8 are all highly similar (∼70% correct) across the different noise types and SNRs employed across experiments involving filtering, whereas scores for the IBM and IRM appear to depend more on noise type and SNR.

FIG. 8. Group mean W-22 word recognition (and standard errors) for NH subjects hearing speech in a single-talker background at SNRs of −10 dB (top panel) and −20 dB (bottom panel) after processing by five different T-F masks. As in the bottom panel of Fig. 2, stimuli were filtered from 750 to 3000 Hz in order to avoid ceiling recognition values.

Scores were subjected to RAU transform and a two-way mixed ANOVA (2 SNR groups × 5 mask conditions). Both main effects {SNR: [F(1,8) = 29.9, p < 0.001], mask: [F(4,32) = 44.9, p < 0.001]} and the interaction [F(4,32) = 3.8, p < 0.05] were significant. Post hoc Holm-Sidak pairwise comparisons among the five T-F mask conditions at −10 dB SNR indicated that the IQM and IRM scores were higher than the IBM score (p ≤ 0.02), but that they did not differ from one another (p ≥ 0.10). Comparisons at −20 dB SNR revealed the same pattern with the addition that scores for the IQM8 were significantly higher than those for both the IQM3 (p < 0.001) and IRM (p < 0.05).

VII. GENERAL DISCUSSION

It is first important to note that all of the T-F masks tested currently produced vast intelligibility improvements over unprocessed speech in noise. As part of another study, 12 NH subjects heard the same W-22 word recordings in the same cafeteria noise recording at an SNR of −8 dB, but otherwise unprocessed. This formal testing employing highly similar procedures revealed a group-mean word-recognition score of 6.0% (standard error = 1.3%). Accordingly, the SNR of −10 dB employed currently should be expected to yield an unprocessed score at or near zero. The upper panel of Fig. 2 therefore shows improvements relative to that zero baseline. But of primary interest in the current study were the relative intelligibilities and sound qualities produced by the different masks. More specifically, it was examined whether an IQM having only a very small number of attenuation steps could match the performance of a mask having an infinite number of steps.

Speech stimuli were employed that help alleviate ceiling effects, while still possessing phonetic diversity, coarticulation, and other aspects of normal human communication. (What the W-22 word lists lack are semantic content and its associated top-down processing, which may tend to increase reliance on bottom-up acoustic information in the signal.) Further measures to mitigate ceiling effects included band-limiting the signal to a two-octave band in the information-rich center of the speech spectrum. Whereas sentence materials tend to produce similar scores at or near 100% correct when subjected to both IBM and IRM processing, the current study revealed substantial differences between speech-recognition accuracy resulting from these two existing techniques. The top panel of Fig. 2 shows a 25-percentage-point advantage for the IRM over the IBM, even though the broadband IRM score is limited by a ceiling value. In the bottom panel of Fig. 2, where ceiling effects are absent, the advantage is 36 percentage points.

Figure 2 also shows that the IQM is capable of matching or numerically exceeding the speech-recognition performance of the IRM with as few as 4–8 steps. This is apparent in both broadband (top panel) and filtered conditions (bottom panel). These data in the ecologically valid cafeteria noise are replicated for a single interfering talker in Fig. 8. This latter background represents a more special case, and in many ways a limiting case, because it represents one of our most highly modulated naturally occurring backgrounds. The same pattern of results was found, with the IQM8 significantly outperforming the IRM at the less favorable SNR employed.

Sound quality was also assessed in the current study because it is an important consideration for a variety of applications. Most notably, HI listeners are highly sensitive to speech sound quality, and unnatural quality would serve to exacerbate the already low satisfaction and compliance with hearing-aid use (see Knudsen et al., 2010). The sound-quality advantage of the ideal soft mask over the ideal hard mask is well established (see Madhu et al., 2013, for comparison of ideal masks and Wang et al., 2014, for comparison of algorithm-estimated masks). In the current study, subjects performed numerical ratings of sound quality and also ranked the various masks according to sound quality. It was confirmed that the sound-quality advantage of the IRM is considerable (it was “strongly preferred” over the IBM). And it was found that this advantage can be entirely captured by an IQM having as few as 4–8 steps. In fact, the numerical ratings, the proportion of preferences, the objective PESQ measure, and the rankings were all better for the IQM8 than the IRM.

One question that may arise is how a quantized mask can numerically or significantly outperform a mask with infinite resolution. One possible explanation involves the decision to zero units at the lowest SNR step in the current IQM implementation. This makes the current IQM like the IBM in which units are zeroed and replaced with silence (noise floor of the system), but unlike the IRM in which units are attenuated but rarely zeroed. The decision may have served to improve the sound quality, and possibly even the intelligibility, of the IQM so that it could exceed that of the IRM. This implementation represents one example of how manipulation of attenuation scalars may serve to improve sound quality and/or intelligibility. Another example is provided by the work of Anzalone et al. (2006), who employed an IBM having scalars of 0.2 and 1 (corresponding to an attenuation difference of less than 20 dB), rather than the traditional 0 and 1, in order to avoid “musical noise.”

The current work is also reminiscent of that from Loizou et al. (2000), who quantized the temporal amplitude envelopes of speech into 2–32 steps. It was found that speech recognition asymptoted at roughly eight steps for cochlear-implant users and for NH listeners hearing a vocoder simulation, suggesting that high amplitude resolution is not necessary. But the apparent similarity is only on the surface. Loizou et al. (2000) examined the amount of amplitude-envelope resolution required to understand speech in quiet when that speech was restricted to a small number of spectral channels. Although the number-of-steps results are consistent with those found currently, there is little reason to believe that results involving envelope detail for low spectral resolution speech in quiet would apply to the current investigation of T-F unit attenuation resolution required to most effectively improve the intelligibility and quality of normal spectral-resolution speech in background noise.

Factors remaining to be considered include the particular mapping of IQM attenuation to SNR and the number of steps in the IQM. As indicated earlier, the current IQM mapping was based on current masks for comparison. Performance improvements may be possible with different mappings. With regard to the number of steps, the lower range was investigated in the current study. Because computational simplicity is a goal, maximum performance at a minimum number of steps is desirable. It is possible that an IQM having more than eight steps might yield even better performance, but the improvement ceiling may be limited given that the IQM4 and IQM8 can match or exceed the performance of the IRM. On the other hand, increasing the IQM step number above 4–8 might possibly yield better performance if the IQM is estimated, rather than calculated in ideal fashion.

The masks tested currently were all ideal. But their estimation based on only the speech-plus-noise mixture is possible using techniques of deep learning, as described in Sec. I. Two main considerations for the estimation of the masks involved in the current study include the fundamental learning scheme involved and the impact of estimation errors. As described in Sec. I, IBM estimation involves classification in which the machine learns to divide T-F units into N categories, where N = 2. The machine is essentially learning the single boundary. The scheme is fundamentally different during IRM estimation in which the machine learns to assign attenuation to each T-F unit. This latter scheme involves learning of the regression function underlying the continuous relationship between attenuation and SNR characteristics (based on features delivered to the algorithm). The algorithmic advantages of IQM estimation are yet to be established. But the nature of the IQM dictates that its estimation involves classification, and so corresponding advantages may be anticipated. This is especially true because the number of classification boundary values can be very low (IQM4 = 3 boundaries and IQM8 = 7 boundaries).
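
To make the distinction concrete, the sketch below (a conceptual illustration, not an implementation from the article) shows how IQM training targets reduce to integer class labels per T-F unit, whereas IRM targets remain continuous values for regression.

import numpy as np

def iqm_class_labels(irm, step_boundaries):
    # step_boundaries are the IQM values x_2..x_N; each T-F unit receives an
    # integer label 0..N-1 suitable for a softmax classifier with cross-entropy.
    return np.digitize(irm, step_boundaries, right=True)

# In contrast, IRM estimation would use irm itself as a continuous regression
# target (e.g., with a mean-squared-error loss).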

Another issue surrounding the machine estimation of these masks involves estimation errors. Healy et al. (2014) examined the perceptual consequences of estimated versus ideal hard masks and found that estimation by a trained deep neural network (DNN) resulted in a mask that was not deficient in any one category of speech cues but instead delivered each cue with a similar degree of imperfect accuracy. As described in Sec. I, Madhu et al. (2013), Wang et al. (2014), Koning et al. (2015), and Bentsen et al. (2018) all suggested or found that soft masks are more robust to estimation errors relative to binary masks. The explanation was that errors in the IBM are necessarily vast in magnitude (i.e., discard units that should be retained or retain units that should be discarded). But in the soft mask, errors can be large or very slight in attenuation magnitude, and so are smaller on average. The IQM likely captures this advantage as well. IQM error magnitude necessarily has a lower bound determined by the magnitude of the attenuation difference across adjacent steps. But an error of one classification step will always be smaller in the IQM (having > 2 steps) than in the IBM.

The comparisons described in the current study between the performance of the IBM and that of the IRM largely involve ideal versions of the masks. These comparisons allow inherent properties of the masks to be assessed. But comparisons have also been made between machine-estimated versions of these masks. Wang et al. (2014) and Bentsen et al. (2018) estimated both masks using the same DNN; Wang et al. (2014) found that the IRM outperformed the IBM on objective measures of sound quality, and Bentsen et al. (2018) found that the IRM outperformed the IBM in terms of human speech intelligibility. These results are important but multifaceted. At least two main factors are involved in the estimated IBM-versus-IRM comparison: those associated with inherent properties of the masks and those associated with their estimation via machine learning. The current results suggest that at least some of the IRM advantage is attributable to fundamental aspects of the mask itself. But it is also possible that some of the observed differences between estimated masks reflect differing estimation-error magnitudes or other factors associated with machine estimation. Additional work is required to establish the extent to which fundamental attributes of these masks interact with estimation aspects.

VIII. CONCLUSIONS

A. Intelligibility

  • (1)

    When ceiling effects are removed, the IBM and IRM produce different intelligibilities of speech in noise, with the IRM being superior.

  • (2)

    Intelligibility benefit is observed when more than two attenuation steps are introduced to the T-F mask. Accordingly, all IQMs displayed higher intelligibility than the IBM.

  • (3)

    The intelligibility benefit of the IRM can be entirely captured with an IQM having as few as 4–8 attenuation steps (IQM4, IQM8). The IQM8 numerically or significantly exceeded the intelligibility of the IRM in every condition.

  • (4)

    Conclusions (1)–(3) hold for the ecologically valid cafeteria noise (experiments 1a, 1b) and the far more highly modulated single-talker interference (experiment 4), and also across different SNRs, suggesting that the results are not restricted to a single set of conditions.

  • (5)

    The acoustic analysis based on STOI did not predict the magnitude of the intelligibility differences observed currently across the T-F masks tested (an illustrative computation of STOI and PESQ follows these conclusions).

B. Sound quality

  • (1)

    Sound-quality ratings improve when more than two attenuation steps are applied to the T-F mask. Accordingly, all of the IQMs were rated more favorably than the IBM. The IQM3 was preferred in over 95% of the comparisons to the IBM, and the IQM4 and IQM8 were “moderately” to “strongly” preferred over the IBM.

  • (2)

    The sound-quality advantage of the IRM over the IBM can be entirely captured by an IQM having as few as 4–8 attenuation steps (IQM4, IQM8). The numerical sound-quality ratings for the IQM4 and IQM8 were slightly above those for the IRM.

  • (3)

    The ranking of T-F masks from least to most preferred based on subjective sound quality also revealed that the sound-quality advantage of the IRM over the IBM can be entirely captured with as few as 4–8 attenuation steps. On average, the sound-quality rankings for the IQM4 and IQM8 approximated or exceeded those for the IRM.

  • (4)

    The acoustic analysis based on PESQ predicted the pattern of sound-quality ratings and rankings observed currently across the T-F masks (see the sketch following these conclusions).

  • (5)

    It is suggested that prior exposure to conditions involving an intelligibility task can influence subsequent judgments of subjective sound quality for stimuli processed in a similar fashion. But one can speculate that this influence is diminished if the prior task involves (i) the intelligibility of stimuli all having highly similar or the same intelligibilities, or (ii) only sound-quality judgments of speech stimuli having highly similar or the same intelligibilities.

C. Computational aspects

  • (1)

    Estimation of typical T-F masks using deep learning or other means involves either classification into a small number of categories or regression, with the former potentially representing a more basic form of machine learning. IQM estimation involves classification into a small number of categories, making it like the IBM and unlike the regression-based IRM. This characteristic may allow the IQM to possess computational advantages over the IRM.

  • (2)

    Soft masks (e.g., the IRM) are typically more robust to estimation errors than hard masks (e.g., the IBM) because errors are smaller in average magnitude. The IQM may also possess this advantage over the IBM.
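
For reference, the objective measures cited in conclusions A(5) and B(4) can be computed from time-aligned clean and processed waveforms. The minimal sketch below assumes the third-party pystoi and pesq Python packages, which implement the measures of Taal et al. (2011) and Rix et al. (2001), respectively; it is not the analysis pipeline used in the current study.

    from pystoi import stoi   # short-time objective intelligibility (Taal et al., 2011)
    from pesq import pesq     # perceptual evaluation of speech quality (Rix et al., 2001)

    def objective_scores(clean, processed, fs=16000):
        """STOI and wideband PESQ for one clean/processed pair at a supported rate."""
        stoi_score = stoi(clean, processed, fs, extended=False)  # roughly 0-1, higher is better
        pesq_score = pesq(fs, clean, processed, 'wb')            # roughly 1-4.5, higher is better
        return stoi_score, pesq_score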

ACKNOWLEDGMENTS

This work was supported in part by grants from the National Institute on Deafness and other Communication Disorders (Grant No. R01 DC015521) and The Ohio State University Center for Cognitive and Brain Sciences. J.L.V. was supported in part by a fellowship from The Ohio State University Graduate School. We thank DeLiang Wang for helpful comments and for sharing the code used to generate the cochleagram representations. We also thank Victoria Sevich and Masood Delfarah for assistance with the manuscript preparation.

References

  • 1. ANSI (2004). S3.21 (R2009), American National Standard Methods for Manual Pure-Tone Threshold Audiometry (Acoustical Society of America, New York).
  • 2. ANSI (2010). S3.6, American National Standard Specification for Audiometers (Acoustical Society of America, New York).
  • 3. Anzalone, M. C., Calandruccio, L., Doherty, K. A., and Carney, L. H. (2006). "Determination of the potential benefit of time-frequency gain manipulation," Ear Hear. 27, 480–492. 10.1097/01.aud.0000233891.86809.df
  • 4. Bentsen, T., May, T., Kressner, A. A., and Dau, T. (2018). "The benefit of combining a deep neural network architecture with ideal ratio mask estimation in computational speech segregation to improve speech intelligibility," PLoS One 13(5), e0196924. 10.1371/journal.pone.0196924
  • 5. Brons, I., Houben, R., and Dreschler, W. A. (2012). "Perceptual effects of noise reduction by time-frequency masking of noisy speech," J. Acoust. Soc. Am. 132, 2690–2699. 10.1121/1.4747006
  • 6. Brungart, D. S., Chang, P. S., Simpson, B. D., and Wang, D. L. (2006). "Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation," J. Acoust. Soc. Am. 120, 4007–4018. 10.1121/1.2363929
  • 7. Chen, J., Wang, Y., Yoho, S. E., Wang, D. L., and Healy, E. W. (2016). "Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises," J. Acoust. Soc. Am. 139, 2604–2612. 10.1121/1.4948445
  • 8. Davis, H., and Silverman, S. R. (1978). Hearing and Deafness, 4th ed. (Holt, Rinehart, and Winston, New York), pp. 492–495.
  • 9. Dillon, H. (2012). Hearing Aids, 2nd ed. (Boomerang, Turramurra, Australia), p. 232.
  • 10. Glasberg, B. R., and Moore, B. C. J. (1990). "Derivation of auditory filter shapes from notched-noise data," Hear. Res. 47, 103–138. 10.1016/0378-5955(90)90170-T
  • 11. Healy, E. W., Delfarah, M., Carter, B. L., Vasko, J. L., and Wang, D. L. (2017). "An algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker," J. Acoust. Soc. Am. 141, 4230–4239. 10.1121/1.4984271
  • 12. Healy, E. W., Yoho, S. E., Chen, J., Wang, Y., and Wang, D. L. (2015). "An algorithm to increase speech intelligibility for hearing-impaired listeners in novel segments of the same noise type," J. Acoust. Soc. Am. 138, 1660–1669. 10.1121/1.4929493
  • 13. Healy, E. W., Yoho, S. E., Wang, Y., Apoux, F., and Wang, D. L. (2014). "Speech-cue transmission by an algorithm to increase consonant recognition in noise for hearing-impaired listeners," J. Acoust. Soc. Am. 136, 3325–3336. 10.1121/1.4901712
  • 14. Healy, E. W., Yoho, S. E., Wang, Y., and Wang, D. L. (2013). "An algorithm to improve speech recognition in noise for hearing-impaired listeners," J. Acoust. Soc. Am. 134, 3029–3038. 10.1121/1.4820893
  • 15. Hirsh, I. J., Davis, H., Silverman, S. R., Reynolds, E. G., Eldert, E., and Benson, R. W. (1952). "Development of materials for speech audiometry," J. Speech Hear. Disord. 17, 321–337. 10.1044/jshd.1703.321
  • 16. Hu, G., and Wang, D. L. (2001). "Speech segregation based on pitch tracking and amplitude modulation," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 79–82.
  • 17. Hummersone, C., Stokes, T., and Brooks, T. (2014). "On the ideal ratio mask as the goal of computational auditory scene analysis," in Blind Source Separation, edited by G. R. Naik and W. Wang (Springer, Berlin), pp. 349–368.
  • 18. Kim, G., Lu, Y., Hu, Y., and Loizou, P. C. (2009). "An algorithm that improves speech intelligibility in noise for normal-hearing listeners," J. Acoust. Soc. Am. 126, 1486–1494. 10.1121/1.3184603
  • 19. Kjems, U., Boldt, J. B., Pedersen, M. S., Lunner, T., and Wang, D. L. (2009). "Role of mask pattern in intelligibility of ideal binary-masked noisy speech," J. Acoust. Soc. Am. 126, 1415–1426. 10.1121/1.3179673
  • 20. Knudsen, L. V., Öberg, M., Nielsen, C., Naylor, G., and Kramer, S. E. (2010). "Factors influencing help seeking, hearing aid uptake, hearing aid use and satisfaction with hearing aids: A review of the literature," Trends Amplif. 14, 127–154. 10.1177/1084713810385712
  • 21. Koning, R., Madhu, N., and Wouters, J. (2015). "Ideal time-frequency masking algorithms lead to different speech intelligibility and quality in normal-hearing and cochlear implant listeners," IEEE Trans. Biomed. Eng. 62, 331–341. 10.1109/TBME.2014.2351854
  • 22. Lehiste, I., and Peterson, G. E. (1959). "Linguistic considerations in the study of speech intelligibility," J. Acoust. Soc. Am. 31, 280–286. 10.1121/1.1907713
  • 23. Li, N., and Loizou, P. C. (2008a). "Effect of spectral resolution on the intelligibility of ideal binary masked speech," J. Acoust. Soc. Am. 123, EL59–EL64. 10.1121/1.2884086
  • 24. Li, N., and Loizou, P. C. (2008b). "Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction," J. Acoust. Soc. Am. 123, 1673–1682. 10.1121/1.2832617
  • 25. Li, Y., and Wang, D. L. (2009). "On the optimality of ideal binary time-frequency masks," Speech Commun. 51, 230–239. 10.1016/j.specom.2008.09.001
  • 26. Likert, R. (1932). "A technique for the measurement of attitudes," Arch. Psychol. 22(140), 5–55.
  • 27. Loizou, P. C. (2007). Speech Enhancement: Theory and Practice (CRC Press, Boca Raton, FL), Chaps. 5–8.
  • 28. Loizou, P. C., Dorman, M., Poroy, O., and Spahr, T. (2000). "Speech recognition by normal-hearing and cochlear implant listeners as a function of intensity resolution," J. Acoust. Soc. Am. 108, 2377–2387. 10.1121/1.1317557
  • 29. Madhu, N., Spriet, A., Jansen, S., Koning, R., and Wouters, J. (2013). "The potential for speech intelligibility improvement using the ideal binary mask and the ideal Wiener filter in single channel noise reduction systems: Application to auditory prostheses," IEEE Trans. Audio Speech Lang. Process. 21, 63–72. 10.1109/TASL.2012.2213248
  • 30. Monaghan, J. J. M., Goehring, T., Yang, X., Bolner, F., Wang, S., Wright, M. C. M., and Bleeck, S. (2017). "Auditory inspired machine learning techniques can improve speech intelligibility and quality for hearing-impaired listeners," J. Acoust. Soc. Am. 141, 1985–1998. 10.1121/1.4977197
  • 31. Moore, B. C. J. (2007). Cochlear Hearing Loss: Physiological, Psychological and Technical Issues, 2nd ed. (Wiley, Chichester, UK), pp. 45–91.
  • 32. Narayanan, A., and Wang, D. L. (2013). "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 7092–7096.
  • 33. Rix, A., Beerends, J., Hollier, M., and Hekstra, A. (2001). "Perceptual evaluation of speech quality (PESQ)—A new method for speech quality assessment of telephone networks and codecs," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 749–752.
  • 34. Silverman, S. R., and Hirsh, I. J. (1955). "Problems related to the use of speech in clinical audiometry," Ann. Otol. Rhinol. Laryngol. 64, 1234–1245. 10.1177/000348945506400424
  • 35. Sinex, D. G. (2013). "Recognition of speech in noise after application of time-frequency masks: Dependence on frequency and threshold parameters," J. Acoust. Soc. Am. 133, 2390–2396. 10.1121/1.4792143
  • 36. Spahr, A. J., Dorman, M. F., Litvak, L. M., Van Wie, S., Gifford, R. H., Loizou, P. C., Loiselle, L. M., Oakes, T., and Cook, S. (2012). "Development and validation of the AzBio sentence lists," Ear Hear. 33, 112–117. 10.1097/AUD.0b013e31822c2549
  • 37. Srinivasan, S., Roman, N., and Wang, D. L. (2006). "Binary and ratio time-frequency masks for robust speech recognition," Speech Commun. 48, 1486–1501. 10.1016/j.specom.2006.09.003
  • 38. Studebaker, G. A. (1985). "A 'rationalized' arcsine transform," J. Speech Hear. Res. 28, 455–462. 10.1044/jshr.2803.455
  • 39. Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. (2011). "An algorithm for intelligibility prediction of time–frequency weighted noisy speech," IEEE Trans. Audio Speech Lang. Process. 19, 2125–2136. 10.1109/TASL.2011.2114881
  • 40. Vasko, J. L., Healy, E. W., and Wang, D. L. (2018). "The optimal noise-rejection threshold for normal and impaired hearing," J. Acoust. Soc. Am. 143, 1940. 10.1121/1.5036346
  • 41. Wang, D. L. (2005). "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines, edited by P. Divenyi (Kluwer Academic, Norwell, MA), pp. 181–197.
  • 42. Wang, D. L. (2008). "Time-frequency masking for speech separation and its potential for hearing aid design," Trends Amplif. 12, 332–353. 10.1177/1084713808326455
  • 43. Wang, D. L., and Brown, G., eds. (2006). Computational Auditory Scene Analysis: Principles, Algorithms and Applications (Wiley-IEEE, Hoboken, NJ), pp. 1–44.
  • 44. Wang, D. L., Kjems, U., Pedersen, M., Boldt, J., and Lunner, T. (2009). "Speech intelligibility in background noise with ideal binary time-frequency masking," J. Acoust. Soc. Am. 125, 2336–2347. 10.1121/1.3083233
  • 45. Wang, Y., Narayanan, A., and Wang, D. L. (2014). "On training targets for supervised speech separation," IEEE/ACM Trans. Audio Speech Lang. Process. 22, 1849–1858. 10.1109/TASLP.2014.2352935
  • 46. Williamson, D. S., Wang, Y., and Wang, D. L. (2015). "Estimating nonnegative matrix model activations with deep neural networks to increase perceptual speech quality," J. Acoust. Soc. Am. 138, 1399–1407. 10.1121/1.4928612
