Abstract
Weak consonants (e.g., stops) are more susceptible to noise than vowels, owing partially to their lower intensity. This raises the question whether hearing-impaired (HI) listeners are able to perceive (and utilize effectively) the high-frequency cues present in consonants. To answer this question, HI listeners were presented with clean (noise absent) weak consonants in otherwise noise-corrupted sentences. Results indicated that HI listeners received significant benefit in intelligibility (4 dB decrease in speech reception threshold) when they had access to clean consonant information. At extremely low signal-to-noise ratio (SNR) levels, however, HI listeners received only 64% of the benefit obtained by normal-hearing listeners. This lack of equitable benefit was investigated in Experiment 2 by testing the hypothesis that the high-frequency cues present in consonants were not audible to HI listeners. This was tested by selectively amplifying the noisy consonants while leaving the noisy sonorant sounds (e.g., vowels) unaltered. Listening tests indicated small (∼10%), but statistically significant, improvements in intelligibility at low SNR conditions when the consonants were amplified in the high-frequency region. Selective consonant amplification provided reliable low-frequency acoustic landmarks that in turn facilitated a better lexical segmentation of the speech stream and contributed to the small improvement in intelligibility.
INTRODUCTION
Vowel perception generally poses little difficulty for hearing-impaired listeners, due partially to the fact that the level of vowels is much greater than that of consonants (Owens et al., 1968; Edwards, 2004). In contrast, consonant perception is considerably more challenging for hearing-impaired listeners (e.g., Owens et al., 1972). Among other factors, consonant perception by hearing-impaired (HI) listeners seems to be affected by multiband compression (Yund and Buckles, 1995; Edwards, 2004), which reduces spectral contrast, and by the degree of hearing loss, particularly in the high-frequency regions where some consonants (e.g., /s/, /t/) have prominent energy. This reduced spectral contrast, combined with the hearing loss, clearly influences the audibility of consonants (Owens et al., 1972; Turner and Robb, 1987). Simple amplification of consonants to restore audibility, however, may not always yield an intelligibility benefit for HI listeners. As hypothesized by many (e.g., Skinner, 1980; Hogan and Turner, 1998; Turner and Cummings, 1999; Ching et al., 1998; Moore, 2001), once the hearing loss in a particular region of the cochlea becomes too severe (beyond 55 dB HL, according to some studies), speech information is affected by distortion even when presented at suprathreshold levels.
The intensity of some consonants can be as much as 20 dB lower than that of vowels (Gordon-Salant, 1986; Freyman et al., 1991). For that reason, a number of studies have considered selective amplification of consonants, while leaving the vowel level constant, and have examined the role of the consonant-vowel intensity ratio (CVR) in consonant identification. Increasing the CVR has been found to improve consonant recognition in normal-hearing (NH) listeners (Gordon-Salant, 1986; Freyman and Nerbonne, 1989) and in patients with sensorineural hearing loss (Gordon-Salant, 1987; Montgomery and Edge, 1988). In conditions wherein listeners were forced to rely more on temporal-envelope cues than on spectral cues, Freyman et al. (1991) noted an improvement in consonant recognition, especially recognition of voiced stops, when the consonants were amplified by 10 dB. In summary, improving the CVR can potentially improve the intelligibility of some consonants for NH and HI listeners.
In most of the preceding CVR studies, isolated syllables in consonant-vowel (CV) or vowel-consonant-vowel (VCV) format were used as test material, and the amplification was applied to all consonants, including the semivowels and nasals. Such an approach raises some questions in terms of practical implications for commercial hearing aids and in terms of generalization of the studies’ outcomes and conclusions to continuous speech. For one, it presumes the existence of a consonant detection algorithm that would reliably discriminate between semivowels and vowels or between nasals and vowels, a formidable challenge not only in background noise but also in quiet conditions. Second, in real communicative situations, HI listeners make use of high-level linguistic information (e.g., context) to identify words in continuous speech. As such, if the information contained in some consonants is masked and not perceptible, listeners might be able to use supplemental cues present in the relatively less corrupted (masked) segments (e.g., vowels). Third, background noise does not typically mask all phonetic segments to the same extent, owing to the spectral differences between noise and speech (Parikh and Loizou, 2005). Low-frequency noise (e.g., car noise), for instance, will not mask the higher frequencies to the same degree as it masks the low-frequency regions of the spectrum. Consequently, a different amplification factor might be required for different consonants. Last, modifying (increasing) the CVR in continuous speech (rather than in isolated syllables) might not always be desirable, as it can affect the transmission of voicing information. A study by Port and Dalby (1982), for instance, demonstrated that when other cues to voicing are ambiguous, the CVR can provide reliable cues to the perception of voicing of word-final stops. Luce and Charles-Luce (1985) also showed that while vowel duration is very reliable in signaling voicing in syllable-final stops, the CVR remains a significant correlate of voicing. The contribution, and importance, of voicing information to the perception of continuous speech in steady background noise was demonstrated in our prior study with NH listeners (Li and Loizou, 2008). Not much work has been done, however, to understand the contribution of voicing information in continuous speech perception by HI listeners, particularly when speech is corrupted by background noise.
Taking the preceding issues and questions into account, we investigate in the present study a number of hypotheses. Weak consonants, such as the obstruent consonants (e.g., stops, fricatives), are more easily masked by noise than the more intense vowels and, more generally, the sonorant sounds (Parikh and Loizou, 2005). This suggests that in a noisy situation, listeners will have access to relatively reliable information contained in the vowels (as they are masked less by noise) and little, if any, information contained in the obstruent consonants. Listeners are thus faced with the task, and challenge, of integrating the information “glimpsed” from the vocalic phonetic segments (owing to their relatively higher SNR) to identify words in continuous speech. The first experiment tests the hypothesis that providing access to the information contained in the obstruent consonants ought to improve speech recognition by HI listeners, as that would assist them in integrating the information “glimpsed” from the vocalic segments to hear out the target speech. To test this hypothesis, listeners are presented with noise-corrupted sentences containing clean (i.e., not masked by noise) obstruent consonants but otherwise noise-corrupted sonorant sounds (e.g., vowels, semivowels, nasals). This experiment can provide important insights as to how efficiently HI listeners integrate information that is “glimpsed” across multiple segments of the utterance to hear out the target speech. This ability is particularly important in the context of perceiving target speech in competing-talker listening situations (see review by Assmann and Summerfield, 2004; Li and Loizou, 2007). The outcomes of this experiment are important as they would motivate the development of hearing-aid signal processing algorithms that either provide differential amplification to the heavily masked obstruent consonants (investigated in Experiment 2) or apply specific noise reduction techniques capable of suppressing the noise present in those segments. The second experiment investigates the possibility that the high-frequency cues present in most obstruent consonants (e.g., /t/, /s/) might not be audible to HI listeners. To test this hypothesis, noisy consonants contained in sentences were selectively amplified in the high frequencies while leaving the noisy sonorant sounds unaltered. If the high-frequency cues contained in most consonants were not accessible to HI listeners because they were not audible, then amplifying them ought to improve speech intelligibility. The high-frequency consonant amplification was done in a way that did not significantly alter the CVR, as the intent was to provide more reliable voicing information.
EXPERIMENT 1: PERCEPTION OF CONSONANT INFORMATION BY HI LISTENERS IN CONTINUOUS SPEECH EMBEDDED IN BACKGROUND NOISE
Methods
Subjects and stimuli
A total of eight HI subjects and eight elderly normal-hearing (ENH) subjects participated in this experiment. The HI subjects were 37–78 yr old (mean age of 56 yr), and the ENH subjects were 48–67 yr old (mean age of 54 yr). All subjects were native English speakers and were paid for their participation. Figure 1 shows the audiometric thresholds of the HI subjects. All HI subjects were experienced hearing-aid users and, with one exception, had bilateral sensorineural hearing loss; one subject had unilateral hearing loss. The inclusion criterion for the ENH subjects was that hearing thresholds for air-conducted pure tones not exceed 30 dB HL in either ear. Hence, all ENH subjects had thresholds lower than 30 dB HL at octave frequencies from 250 Hz to 4 kHz. Thresholds at 8 kHz were also lower than 30 dB HL, with the exception of one subject.
The speech material consisted of sentences taken from the IEEE database (IEEE, 1969). All sentences were produced by a male speaker. The sentences were recorded in a sound-attenuated booth (Acoustic Systems, Inc.) at a 25 kHz sampling rate. Details about the recording setup and copies of the recordings are available in Loizou (2007). The sentences were corrupted by a 20-talker babble (Auditec CD, St. Louis) at four SNR levels (−5 to +10 dB, in steps of 5 dB). The long-term average spectrum of the babble can be found in Parikh and Loizou (2005).
Signal processing
The IEEE sentences were manually segmented into two broad phonetic classes: (1) the obstruent sounds, which included the stops, fricatives, and affricates, and (2) the sonorant sounds, which included the vowels, semivowels, and nasals. The segmentation was carried out in a two-step process. In the first step, an initial classification of voiced and unvoiced speech segments was provided by a highly accurate F0 detector, taken from the STRAIGHT algorithm (Kawahara et al., 1999), which was applied to the stimuli at 1-ms intervals using a high-resolution fast Fourier transform (FFT). Segments with nonzero F0 values were initially classified as voiced and segments with zero F0 values as unvoiced. In the second step, the voiced and unvoiced decisions were inspected for errors, and any detected errors were manually corrected. This process is described in more detail in Li and Loizou (2008). The two-class segmentation of all the IEEE sentences was saved in text files in the same format as TIMIT's .phn files and is available from the CD-ROM provided in Loizou (2007). The preceding two phonetic classes were chosen for two reasons. First, the obstruent sounds are particularly vulnerable to steady background noise, given their low intensity relative to that of the sonorant sounds (Parikh and Loizou, 2005). Second, the majority of the obstruent sounds have prominent energy in the high frequencies. The fricatives /s/ and /sh/, for instance, have most of their energy concentrated above 4 kHz.
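As an illustration of the first (automatic) step of this segmentation, the following minimal Python sketch derives an initial obstruent/sonorant labeling from an F0 track sampled every 1 ms. The function name and array-based interface are our own assumptions for illustration; the STRAIGHT F0 detector itself is not reproduced here, and the manual correction step is omitted.

```python
import numpy as np

def initial_two_class_labels(f0_track, frame_ms=1):
    """First-pass obstruent/sonorant labeling from an F0 track.

    f0_track : F0 estimates (Hz) computed every `frame_ms` ms; a value of 0
               denotes frames where no pitch was detected (unvoiced).
    Returns a list of (start_ms, end_ms, label) segments, where 'son' marks
    voiced (sonorant) stretches and 'obs' unvoiced (obstruent) stretches.
    In the study these labels were subsequently inspected and hand-corrected.
    """
    voiced = np.asarray(f0_track) > 0
    segments, start = [], 0
    for i in range(1, len(voiced) + 1):
        # close the current segment at the end of the track or at a label change
        if i == len(voiced) or voiced[i] != voiced[start]:
            label = "son" if voiced[start] else "obs"
            segments.append((start * frame_ms, i * frame_ms, label))
            start = i
    return segments
```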
The IEEE speech stimuli were processed in two different conditions. In the first condition, which served as the control condition, the listeners were presented with noise-corrupted speech stimuli. We refer to this condition as the unprocessed (UN) condition. The second condition included sentences containing clean (uncorrupted) obstruent segments but noise-masked sonorant segments (e.g., vowels). The clean obstruent segments were extracted from the sentences prior to their mixing with the masker. We refer to this condition as the clean consonant (CC) condition.
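The construction of the UN and CC stimuli can be summarized with the sketch below, which mixes a clean sentence with babble at the target overall SNR and then splices the clean obstruent segments back into the noisy waveform. The function names and the segment-list format are assumptions made for illustration; they are not taken from the study's actual processing scripts.

```python
import numpy as np

def mix_at_snr(clean, babble, snr_db):
    """Scale the babble so that the mixture has the requested overall SNR."""
    babble = babble[:len(clean)]
    scale = np.sqrt(np.sum(clean**2) / (np.sum(babble**2) * 10.0**(snr_db / 10.0)))
    return clean + scale * babble

def make_un_and_cc_stimuli(clean, babble, snr_db, obstruent_segments, fs):
    """Build the UN (noisy control) and CC (clean-consonant) stimuli.

    obstruent_segments : list of (start_s, end_s) times from the manual
    two-class segmentation described above.
    """
    noisy = mix_at_snr(clean, babble, snr_db)      # UN condition
    cc = noisy.copy()
    for t0, t1 in obstruent_segments:
        i0, i1 = int(t0 * fs), int(t1 * fs)
        cc[i0:i1] = clean[i0:i1]                   # splice in the clean obstruent
    return noisy, cc
```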
Procedure
Sentences from the IEEE database were processed as described in Sec. 2A2 and stored for testing purposes. Multi-talker babble was added to the IEEE sentences at four different SNR levels: −5, 0, 5, and 10 dB. HI subjects were tested at all SNR levels, but ENH subjects were tested only at the −5 and 0 dB SNR levels to avoid ceiling effects at the higher SNR levels. Prior to the test, subjects listened to four lists of sentences to become familiar with the processed stimuli and the task (i.e., subjects listened to the sentences while reading their contents; this was done only during the practice session). HI subjects participated in a total of eight randomized conditions (= 4 SNR levels × 2 processing conditions), and ENH subjects participated in a total of four randomized conditions (= 2 SNR levels × 2 processing conditions). Two lists of IEEE sentences (i.e., 20 sentences) were used per condition, and none of the lists were repeated across conditions. Different lists of sentences were assigned to different listeners. Sentences were presented to the listeners in blocks of 20 sentences per condition. The sentences were presented at an average level of 72 dB SPL, as measured by a Quest precision impulse-integrating sound level meter (Model 1800). During the test, the subjects were asked to type the words they heard on a keyboard.
Because the subjects' personal hearing aids might have had noise-reduction algorithms programmed in them, the subjects were instead fit with Phonak eXtra behind-the-ear (BTE) hearing aids that were programmed according to the subjects' hearing loss and had no noise-reduction capabilities. These BTE hearing aids matched the gain and output of the subjects' personal hearing aids within 5 dB. Two subjects were fit with eXtra 211 hearing aids, three with eXtra 311 hearing aids, and three with eXtra 411 hearing aids. The hearing aids were programmed based on the NAL-NL1 algorithm using the manufacturer's fitting software (Byrne et al., 2001). This fitting algorithm is designed to maximize speech intelligibility while keeping the overall loudness at a level that does not exceed that of a normal-hearing person listening to the same sound. The gain required to achieve this varies with input level, making the procedure most appropriate for nonlinear hearing aids. The subjects' own earmolds were used, or a temporary coupling was made using tubing and a foam tip. Real-ear verification against NAL-NL1 targets was performed using the Audioscan Verifit system and a standard probe insertion depth of 5–6 mm from the tympanic membrane. The targets for soft, standard, and loud inputs (50, 70, and 85 dB SPL, respectively) were met within ±5 dB at octave frequencies from 500 to 4000 Hz. Furthermore, electroacoustical analysis according to ANSI S3.22 (2003) showed agreement between the high-frequency average maximum output and reference test gain of the experimental and personal aids within ±5 dB when run at the users' settings. Attack and release times were 1 and 10 ms, respectively. The maximum output and compression settings remained at the defaults determined by the fitting software. The hearing aids were set to experience level 4, which represents a listener who is accustomed to amplification.
The testing was conducted in a double-walled IAC sound booth (Model 102400). The processed speech files were first sent to a Crown amplifier (Model D75), which was used to adjust the presentation level of the signals and was connected to the loudspeakers in the sound booth. The speech files were played to the subjects through TOA Electric Company loudspeakers (Model 22-ME-AV), which have a frequency response extending to 20 kHz. For the single HI subject with unilateral hearing loss, the normal-hearing ear was occluded during the test. A computer monitor and a keyboard were placed in the sound booth so that the subjects could operate a graphical user interface and type in the words they heard. The entire test session was monitored from outside the sound booth with an additional computer monitor and keyboard.
Results and discussion
Figure 2 shows the mean performance obtained by the HI and ENH listeners in the various conditions. Performance was measured in terms of the percentage of keywords identified correctly. A substantial improvement in intelligibility was observed for the ENH listeners at −5 dB SNR when they had access to clean consonant information. Large improvements in intelligibility were also observed for the HI listeners at low SNR levels (≤ 0 dB).
A two-way analysis of variance (ANOVA) with repeated measures was run to assess the effects of consonant processing (UN and CC) and SNR level on sentence recognition by the HI listeners. Prior to the ANOVA, the percent correct scores were converted to rationalized arcsine units (RAU) using the transform of Studebaker (1985). The ANOVA, applied to the RAU scores, showed a significant effect of processing (F[1,7] = 66.1, p < 0.0005), a significant effect of SNR level (F[3,21] = 197.1, p < 0.0005), and a nonsignificant interaction (F[3,21] = 2.83, p = 0.063).
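For reference, the rationalized arcsine transform maps a proportion-correct score onto an approximately interval scale before the ANOVA. A minimal sketch, assuming scores are available as raw keyword counts and using the commonly cited form of the Studebaker (1985) transform (the function name is our own), is:

```python
import numpy as np

def rau(num_correct, num_items):
    """Rationalized arcsine transform (commonly cited form of Studebaker, 1985).

    num_correct : number of keywords identified correctly
    num_items   : total number of keywords scored
    Returns the score in rationalized arcsine units (RAU).
    """
    x, n = float(num_correct), float(num_items)
    theta = np.arcsin(np.sqrt(x / (n + 1.0))) + np.arcsin(np.sqrt((x + 1.0) / (n + 1.0)))
    return (146.0 / np.pi) * theta - 23.0
```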
The preceding analysis suggests that HI listeners received a significant benefit in intelligibility when they had access to clean consonant information in otherwise noise-masked sentences (sonorant segments were left corrupted by babble). The improvement ranged from 20% to 30% for most SNR levels (−5, 0, and 5 dB) and was found to be statistically significant (p < 0.005) based on post hoc tests. Based on interpolation of the psychometric functions (Fig. 2) of the UN and CC conditions, this improvement amounted to a 4-dB decrease in speech reception threshold (SRT). A comparatively smaller benefit was obtained at the highest SNR level (10 dB). Post hoc tests confirmed that the benefit observed at 10 dB SNR was significant (p = 0.019), albeit small (5% improvement). Bonferroni correction was applied to all post hoc tests.
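The SRT shift quoted above can be obtained by interpolating each psychometric function to its 50%-correct point and differencing the resulting SNRs. The sketch below illustrates this with simple linear interpolation; the score values in the usage example are hypothetical placeholders, not the measured data of Fig. 2.

```python
import numpy as np

def srt_from_scores(snr_db, pct_correct, criterion=50.0):
    """Linearly interpolate the SNR at which a psychometric function crosses
    the criterion score (speech reception threshold).  pct_correct must be
    increasing with SNR for np.interp to be valid."""
    return float(np.interp(criterion, pct_correct, snr_db))

# Hypothetical illustration: if the UN and CC functions cross 50% correct at
# SNRs differing by 4 dB, the CC benefit corresponds to a 4-dB SRT decrease.
srt_un = srt_from_scores([-5, 0, 5, 10], [30, 55, 80, 92])
srt_cc = srt_from_scores([-5, 0, 5, 10], [52, 78, 90, 96])
benefit_db = srt_un - srt_cc
```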
The same statistical analysis was used to analyze the scores obtained by ENH listeners. The ANOVA showed significant effect of processing (F[1,7] = 110.1, p < 0.0005), significant effect of SNR level (F[1,7] = 364.3, p < 0.0005) and nonsignificant interaction (F[1,7] = 1.3, p = 0.288). The preceding analysis suggests that the improvement in intelligibility brought by CC processing was significant at both SNR levels.
It is clear from Fig. 2 that HI listeners benefited when they had access to clean consonant information in otherwise noise-masked sentences. It is worth noting that the improvement in intelligibility with the utilization of CC processing was comparable to that obtained by cochlear implant (CI) listeners (Li and Loizou, 2010). In our previous study, CI listeners were presented with similar speech stimuli (CC) in steady-continuous noise at 5 and 10 dB SNR levels. Results indicated a 20% improvement in intelligibility at 5 dB and a 10% improvement at 10 dB (Li and Loizou, 2010). These improvements are nearly identical to those observed in the present experiment (see Fig. 2) with HI subjects. Hence, from this we conclude that CI listeners are able to extract, and integrate, the consonant information contained in otherwise noise-masked sentences at the same level as HI listeners.
To compare the performance of the ENH and HI listeners in the various conditions, a three-way ANOVA was run with listener group, SNR level (−5 and 0 dB), and processing condition (UN and CC) as factors and the number of words correct as the dependent variable. The processing (F[1,7] = 167.5, p < 0.001) and group (F[1,7] = 15.7, p < 0.001) effects were found to be significant, and their interaction was nonsignificant (F[1,7] = 0.35, p = 0.568). SNR was also found to have a significant effect (F[1,7] = 216.5, p < 0.005), and the three-way interaction of SNR × processing × group was nonsignificant (F[1,7] = 5.0, p = 0.06). From the preceding analysis, we conclude that changes in SNR, processing, and group all affected performance. Planned comparisons, based on independent-samples t-tests (two-tailed, unequal variances assumed), between the scores obtained by the two listener groups indicated no significant difference in benefit from CC processing at −5 dB SNR (p = 0.155) or at 0 dB SNR (p = 0.686).
On average, HI listeners were not able to utilize and integrate the consonant information to the same extent as the ENH listeners in the −5 dB SNR condition (HI listeners received only 64% of the benefit obtained by ENH listeners). There was, however, considerable variability among subjects, and Fig. 3 shows the individual HI listeners' scores in the various SNR conditions. As shown in Fig. 3 (SNR = −5 dB), some subjects (S2 and S3) received little benefit, while others (subjects S6 to S8) received a large benefit. Subject S7, in particular, received a 60% benefit with CC processing. This variability in performance among the HI listeners led us to examine whether their benefit was related to their audiometric thresholds. To do so, we computed the correlation between the CC benefit and the average audiometric thresholds (computed using the average thresholds at octave frequencies of 400 Hz to 8000 Hz). The CC benefit was computed as the difference between the score obtained in the CC condition and the score obtained in the control UN condition (no processing). The resulting correlation coefficient was r = −0.37 (p = 0.36) for SNR = −5 dB and r = −0.65 (p = 0.08) for SNR = 0 dB. Overall, no significant correlation was observed between audiometric thresholds and the benefit obtained with CC processing.
The lack of correlation did not explain why some HI listeners did not receive as much benefit as the ENH listeners, at least in the −5 dB SNR condition. There are several possible reasons for this. First, there is the issue of audibility for HI listeners with high-frequency sensorineural hearing loss. As many studies have demonstrated, high-frequency amplification might not always be beneficial, particularly for patients with “dead regions” in the cochlea (Moore, 2001) or for patients with hearing loss exceeding 55 dB HL in the high-frequency (≥ 3000 Hz) regions (e.g., Hogan and Turner, 1998; Ching et al., 1998; Turner and Cummings, 1999). We could have increased the presentation level; however, several studies have observed “rollover” effects, in that high presentation levels led to a decrease in speech intelligibility (e.g., Studebaker et al., 1999; Molis and Summers, 2003). In fact, Molis and Summers (2003) showed that increasing the intensity of the high-frequency speech bands produced more rollover than increasing the intensity of the low-frequency bands. To further examine the audibility issue, we analyzed selected high-frequency consonants using short-term acoustic measures, such as those used by Stelmachowicz et al. (1993). In particular, 1/3-octave band levels were computed for the consonants /s/ and /t/ taken directly from an IEEE sentence. The 1/3-octave band levels were computed from the spectrum of a 25-ms (Hanning-windowed) segment of each selected consonant, and they are shown in Fig. 4. From these we can deduce that for some HI listeners (e.g., S2, S3) who have low hearing thresholds in the low frequencies (< 1 kHz), portions of the consonant spectra might not have been audible. Furthermore, the spectral peaks of these two consonants were located in the range of 5–6 kHz, which might fall outside the bandwidth of most amplification devices.
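A minimal sketch of such a short-term 1/3-octave analysis is given below. It assumes a 25-ms consonant excerpt and standard 1/3-octave spacing around nominal center frequencies; absolute SPL calibration (which measurements like those in Fig. 4 would require) is omitted, so the returned levels are relative, and bands above the Nyquist frequency come back empty.

```python
import numpy as np

def third_octave_levels(segment, fs):
    """Approximate 1/3-octave band levels (dB, relative) of a short consonant
    segment (e.g., a 25-ms excerpt of /s/ or /t/) sampled at rate fs.
    A Hanning window is applied before the FFT, and power is summed within
    each 1/3-octave band around nominal center frequencies (~100 Hz-20 kHz)."""
    center_freqs = 1000.0 * 2.0 ** (np.arange(-10, 14) / 3.0)
    win = np.hanning(len(segment))
    spec = np.abs(np.fft.rfft(segment * win)) ** 2
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    levels = []
    for fc in center_freqs:
        lo, hi = fc / 2.0 ** (1.0 / 6.0), fc * 2.0 ** (1.0 / 6.0)
        band = spec[(freqs >= lo) & (freqs < hi)]
        # empty bands (above the Nyquist frequency) are reported as -inf
        levels.append(10.0 * np.log10(band.sum()) if band.size else -np.inf)
    return center_freqs, np.array(levels)
```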
Second, NH listeners are generally better than most HI listeners at integrating the information carried by consonants with otherwise corrupted vowels to understand words in continuous speech (Kewley-Port et al., 2007). When presented with sentences containing clean consonants and vowels replaced with speech-shaped noise, elderly HI listeners performed worse than young NH listeners (Kewley-Port et al., 2007). In this regard, the outcome of the present study is consistent with that of Kewley-Port et al. (2007). The audibility issue, and to some extent the consonant-information integration issue, is examined in the follow-up experiment with the use of selective consonant amplification in continuous speech.
EXPERIMENT 2: SELECTIVE AMPLIFICATION OF CONSONANTS IN CONTINUOUS SPEECH CORRUPTED BY BACKGROUND NOISE
The data from Experiment 1 indicated that some HI subjects were not able to perceive (and integrate) the information carried by consonants in continuous speech to the same extent as the ENH listeners, at least in the extremely low SNR (−5 dB) condition. The reasons for this were unclear. In this experiment, we test the hypothesis that the high-frequency cues present in consonants (e.g., /t/, /s/) were not audible to these listeners and thus not perceptible. To test this hypothesis, we selectively amplified the noisy consonants contained in sentences while leaving the noisy sonorant sounds unaltered. If the consonants were indeed not audible to HI listeners at low SNR levels, then amplifying them ought to reduce the performance difference between normal-hearing and HI listeners. Amplifying the consonants (while leaving the level of the sonorant sounds the same) alters the CVR, and we will thus also be in a position to assess the impact of CVR modification on continuous speech recognition.
Methods
Subjects and stimuli
The same subjects who participated in Experiment 1 also participated in the present experiment. Sentences taken from the IEEE corpus (available in Loizou, 2007) were used as test material. The sentence lists chosen for the present experiment differed from those used in Experiment 1. The same masker (20-talker babble) was used as in Experiment 1.
Signal processing
The IEEE sentences were manually segmented, as before, into obstruent and sonorant sounds. Segments that were classified as sonorants were left unaltered, i.e., remained corrupted by the background noise (20-talker babble). Segments that were classified as obstruent sounds were processed as follows. The fast Fourier transform (FFT) was computed for each 10-ms frame (with 2-ms update) of the obstruent sounds, and spectral components falling at frequencies of 1.5 kHz or above were amplified by a factor of 2 (6.02 dB amplification), while spectral components at frequencies below 1.5 kHz were attenuated by a factor of 0.1 (20 dB attenuation). The attenuation of the low-frequency region (0–1.5 kHz) was introduced for the following two reasons: (1) to make the vowel/consonant boundaries more distinct because the steady background noise tends to blur the low-frequency acoustic landmarks and (2) many of the obstruent sounds (e.g., /s/, /t/) inherently have low energy in the low frequencies. Following the selective consonant amplification/attenuation, which was implemented in the frequency domain, an inverse FFT was taken to reconstruct the signal in the time domain. The phase spectrum of the noisy speech was used during synthesis. The synthesized signal was finally normalized so that the overall RMS level of the processed signal was the same as that of the noisy signal. We henceforth denote the preceding processing as AC (amplification of consonants) processing.
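The following Python sketch illustrates this AC processing under a few stated assumptions: Hanning analysis/synthesis windows with weighted overlap-add (the text specifies only the 10-ms frame and 2-ms update), a per-frame decision on whether a frame falls inside a labeled obstruent segment, and our own function and variable names.

```python
import numpy as np

def ac_process(noisy, fs, obstruent_mask, f_cut=1500.0, hi_gain=2.0, lo_gain=0.1):
    """Sketch of AC (amplification of consonants) processing.

    noisy          : noise-corrupted sentence waveform
    obstruent_mask : boolean array (one value per sample), True inside the
                     manually labeled obstruent segments
    Within obstruent frames, FFT bins at or above f_cut are scaled by hi_gain
    (+6 dB) and bins below f_cut by lo_gain (-20 dB); sonorant frames pass
    through unchanged.  Frames are 10 ms long with a 2-ms update, and the
    output is rebuilt by overlap-add and RMS-normalized to the input level.
    """
    frame, hop = int(0.010 * fs), int(0.002 * fs)
    win = np.hanning(frame)
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
    gains = np.where(freqs >= f_cut, hi_gain, lo_gain)
    for start in range(0, len(noisy) - frame, hop):
        X = np.fft.rfft(noisy[start:start + frame] * win)
        if obstruent_mask[start:start + frame].any():
            X = X * gains                       # real gains: noisy phase retained
        y = np.fft.irfft(X, n=frame)
        out[start:start + frame] += y * win     # weighted overlap-add
        norm[start:start + frame] += win ** 2
    out = out / np.maximum(norm, 1e-8)
    # restore the overall RMS level of the noisy input
    return out * (np.sqrt(np.mean(noisy ** 2)) / np.sqrt(np.mean(out ** 2)))
```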
Figure 5 shows example spectrograms illustrating AC processing. In this example, the sentence was corrupted by babble at 0 dB SNR. Two distinct effects of AC processing are evident from this figure: (1) the consonants with a high-frequency prominence are accentuated, and perhaps become more audible to HI listeners, and (2) the voiced/unvoiced boundaries distinguishing vowels from consonants are more evident than in the unprocessed sentence. The latter effect is produced by the low-frequency attenuation of the consonants. It is also worth noting that the CVR decreased, rather than increased, owing to the low-frequency attenuation of the consonants.
The amplification and attenuation factors (2 and 0.1, respectively) used were chosen based on a pilot study conducted with a few ENH and HI listeners. The amplification factor was varied in the range of 1.1 to 10 and applied to sets of 20 sentences. Listening tests indicated that values below 2 did not make any notable difference in intelligibility, whereas values greater than 4 produced distortions in the synthesized signal. Similarly, the attenuation factor was varied in the range of 0.01 to 0.3 and applied to sets of 20 sentences. The value of 0.1 was found to yield the highest intelligibility benefit (relative to the intelligibility of unprocessed stimuli) and was thus adopted in this experiment.
When selective amplification was applied to the weak consonants (e.g., /t/), while leaving the intense sonorant sounds (e.g., vowels, semivowels) intact, the resulting consonant-to-vowel energy ratio (CVR) changed. To assess whether the CVR of the modified speech stimuli improved (i.e., was reduced) after the selective consonant amplification, we computed the CVR at different input SNR levels and compared the CVR values of the AC-processed and unprocessed speech stimuli. The overall CVR was computed by dividing the summed power of all unvoiced segments within the utterance by the summed power of all voiced segments. Unlike the CVR measurements made by Freyman et al. (1991), we included the stop closures in the computation of the cumulative power of the unvoiced segments. The resulting CVR values, averaged across 10 IEEE sentences, are provided in Table I. From this table, we observe that the CVR did improve (i.e., decreased in value) by about 3 dB at low SNR levels and by 1–2 dB at the higher SNR levels. We consider the decrease in CVR an improvement because it more accurately reflects the natural intensity ratios occurring in sentences in quiet. In noisy conditions, by contrast, the noise masked the silent (closure) portions of the unvoiced segments, i.e., it filled in the waveform valleys and thus reduced the inherent difference in intensity (energy) between the vowels and consonants. A CVR value close to 0 dB, for instance, would suggest that the vowels and consonants had roughly equal intensity. Such values are indeed observed at the extremely low SNR level (SNR = −5 dB in Table I) for the noisy speech stimuli.
Table I. CVRs (dB), averaged across 10 IEEE sentences.

| SNR (dB) | | Clean sentences | Noisy unprocessed | AC processed | Decrease re: noisy CVR |
|---|---|---|---|---|---|
| −5 | Mean | −19.44 | −2.41 | −5.97 | 3.56 |
| | STD | 2.02 | 0.84 | 0.92 | |
| 0 | Mean | −17.39 | −4.64 | −7.62 | 2.98 |
| | STD | 1.95 | 0.91 | 1.22 | |
| 5 | Mean | −15.58 | −7.36 | −9.42 | 2.07 |
| | STD | 3.40 | 1.55 | 2.51 | |
| 10 | Mean | −16.25 | −11.20 | −12.20 | 1.00 |
| | STD | 2.05 | 1.35 | 2.06 | |
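A minimal sketch of the overall CVR computation described above is given below. It uses summed squared samples as the power measure and the manual voiced/unvoiced (sonorant/obstruent, including stop closures) segment lists; the function names are our own.

```python
import numpy as np

def overall_cvr_db(signal, fs, voiced_segments, unvoiced_segments):
    """Overall consonant-to-vowel ratio (dB): total power of all unvoiced
    (obstruent, including stop closures) segments divided by the total power
    of all voiced (sonorant) segments in the sentence.  Segment lists hold
    (start_s, end_s) times from the manual segmentation."""
    def total_power(segments):
        return sum(np.sum(signal[int(t0 * fs):int(t1 * fs)] ** 2)
                   for t0, t1 in segments)
    return 10.0 * np.log10(total_power(unvoiced_segments) /
                           total_power(voiced_segments))
```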
Procedure
The same procedure taken in Experiment 1 was followed in the present experiment. The testing was carried out in two to three sessions (lasting a total of 4–5 h) for the HI listeners and in a single session for the ENH subjects (1–2 h).
Results and discussion
Figure 6 shows the mean intelligibility scores obtained by HI and ENH listeners when selective amplification (AC processing) was applied to the noisy sentences. Figure 7 shows the individual HI listeners’ scores obtained in the various conditions. The UN scores were taken from Experiment 1. As shown in Fig. 7, AC processing produced small improvements in intelligibility for HI listeners at the low SNR conditions (−5 and 0 dB). No improvement was noted at higher SNR levels. No improvement in intelligibility was noted for the ENH listeners at any SNR condition.
A two-way ANOVA with repeated measures was run to assess the effects of consonant processing (UN and AC) and SNR level on sentence recognition by the HI listeners. The ANOVA, applied to the RAU scores (Studebaker, 1985), showed a significant effect of processing (F[1,7] = 9.7, p = 0.017), a significant effect of SNR level (F[3,21] = 162.9, p < 0.0005), and a significant interaction (F[3,21] = 4.8, p = 0.01). The interaction reflects the fact that the improvement obtained with selective consonant amplification (AC) depended on the SNR level. Post hoc tests indicated small, but statistically significant (p < 0.05), improvements in intelligibility with AC processing at the low SNR levels (−5 and 0 dB). No significant improvement (p > 0.2) was observed, however, when AC processing was applied at the higher SNR levels (5 and 10 dB).
A similar two-way ANOVA was also run on the scores obtained by ENH listeners. The ANOVA showed a nonsignificant effect of processing (F[1,7] = 5.47, p = 0.052), significant effect of SNR level (F[1,7] = 200.4, p < 0.0005) and nonsignificant interaction (F[1,7] = 3.4, p = 0.106). This analysis suggests that ENH listeners received no benefit from AC processing.
A small, yet statistically significant, improvement (∼10%) was obtained by the HI listeners when selective consonant amplification was applied to the noisy sentences. Several factors might have contributed to this small improvement. First, following AC processing, the CVR improved (decreased) by as much as 3 dB (Table I). This change in CVR, however, was small and likely not perceptible. In the study by Freyman et al. (1991), for instance, a 10-dB change in CVR was found necessary to observe any improvement in the intelligibility of isolated VCV stimuli. Second, the high-frequency amplification of consonants might have made the consonants slightly more audible to the HI listeners. While this might be the case for some of the consonants, the high-frequency amplification might also have provided conflicting or incorrect cues to consonant identification. This is so because the high-frequency amplification, as applied in the present experiment, was done uniformly for all noise-corrupted (obstruent) consonants without regard to whether the consonant originally had a high-frequency energy prominence (e.g., /s/, /t/, /z/), a low-frequency energy prominence (e.g., /b/, /p/), a mid-frequency prominence (e.g., /g/), or a flat spectrum (e.g., /f/, /v/). In the example of Fig. 5, consonant amplification is not necessary (or desirable) in the segments near t = 2.4 s containing stop closures, but it is necessary in the segments near t = 1.9 s and t = 2.2 s, which have naturally high-frequency energy. Finally, the low-frequency attenuation of consonants provided (at least indirectly) access to somewhat reliable acoustic landmarks signifying the onsets/offsets of words or syllables. Such landmarks might facilitate better lexical access and subsequently improve word recognition (Stevens, 2002; Li and Loizou, 2008). In summary, we believe that the small improvement in intelligibility obtained with AC processing can be attributed to the weak consonants becoming more audible, at least those with a naturally high-frequency prominence in energy (e.g., /s/, /t/, /d/, /z/, /sh/). In addition, we cannot exclude the possibility that listeners had access to relatively reliable low-frequency acoustic landmarks and subsequently to better voicing information. This speculation is based on the findings from our prior study with normal-hearing listeners (Li and Loizou, 2008), in which access to low-frequency acoustic landmark information signaling the onset of stops or fricatives improved the transmission of voicing information.
In terms of practical implications, the present study showed that small improvements in intelligibility are indeed possible with selective consonant amplification, provided that the voiced/unvoiced distinction can be made automatically, and reliably, based solely on the information available in the noisy speech stimuli. That is, a simple signal processing algorithm can be devised that first identifies the voiced/unvoiced boundaries and then amplifies/attenuates accordingly the high-/low-frequency regions of the consonant spectra. A number of voiced/unvoiced detection algorithms could be used to detect the voiced/unvoiced boundaries (see review in Kondoz, 2004). Some of these algorithms exploit well-known characteristics of voiced and unvoiced sounds to make the distinction. These distinctive characteristics include speech periodicity (or lack thereof), the peakiness of the speech, the number of zero crossings of the amplitude waveform (voiced segments typically have a smaller number), and the low-frequency to full-spectrum energy ratio, among others, as illustrated in the sketch below. A higher benefit in intelligibility would be anticipated if noise reduction algorithms were also applied to the detected consonant segments to suppress the background noise present in them. As demonstrated in Experiment 1, improvements as large as 30% in intelligibility (a 4-dB improvement in SRT) are to be expected on average, and some subjects might receive a considerably larger benefit (e.g., subjects S7 and S8 in Fig. 3).
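A toy frame-level detector combining two of these cues (zero-crossing rate and the low-frequency to full-band energy ratio) might look as follows; the thresholds are illustrative values, not parameters taken from the study or from Kondoz (2004).

```python
import numpy as np

def voiced_unvoiced_frames(x, fs, frame_ms=20, hop_ms=10,
                           zcr_thresh=0.15, lf_ratio_thresh=0.5):
    """Toy voiced/unvoiced detector: a frame is labeled voiced when its
    zero-crossing rate is low (unvoiced frication produces many crossings)
    and its low-frequency (<1 kHz) to full-band energy ratio is high."""
    frame, hop = int(frame_ms * fs / 1000), int(hop_ms * fs / 1000)
    win = np.hanning(frame)
    decisions = []
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame]
        zcr = np.mean(np.abs(np.diff(np.sign(seg))) > 0)          # crossings per sample
        spec = np.abs(np.fft.rfft(seg * win)) ** 2
        freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
        lf_ratio = spec[freqs < 1000].sum() / max(spec.sum(), 1e-12)
        decisions.append(zcr < zcr_thresh and lf_ratio > lf_ratio_thresh)
    return np.array(decisions)   # True = voiced, False = unvoiced
```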
In the context of designing noise-reduction algorithms for hearing aids, the present study showed the potential of selective consonant amplification at low SNR levels. The surprising finding of our study is that small improvements (∼10%) in intelligibility were observed without applying any noise-suppression algorithm to the identified (noise-corrupted) consonant segments. Further improvements in intelligibility are thus to be expected if specific noise reduction algorithms are applied to “enhance” the consonant segments alone. It should be noted that existing modulation-based noise reduction algorithms used in commercial hearing aids fail to produce any significant improvements in intelligibility (Bentler et al., 2008), and most noise-reduction algorithms used in hearing aids (Bentler and Chiou, 2006) are applied to both the voiced (e.g., vowels) and unvoiced (e.g., stops) segments of the utterance. The outcomes from Experiments 1 and 2 taken together, however, suggest that applying noise suppression only to the (obstruent) consonant segments of the utterance ought to be sufficient to obtain large improvements in intelligibility (a 4-dB decrease in SRT). Further work is warranted to assess the effectiveness of noise-reduction algorithms in suppressing the masking (by noise) of the weak consonants.
SUMMARY AND CONCLUSIONS
The present study assessed the contribution of information carried by obstruent consonants to speech intelligibility in noise by HI listeners. Based on the outcomes of the two experiments, the following conclusions can be drawn:
(1) HI listeners received significant benefits in intelligibility when they had access to clean consonant (obstruent sound) information in otherwise noise-masked sentences (sonorant segments were left corrupted by background noise). The improvement ranged from 20% to 30% for most SNR levels (−5, 0, and 5 dB) and amounted to a 4-dB decrease in speech reception threshold (Fig. 2).

(2) At extremely low SNR levels (−5 dB), HI listeners received about 64% of the benefit obtained by ENH listeners (Fig. 2). There was, however, considerable variability among subjects (Fig. 3).

(3) Listening tests (Fig. 6) indicated that small (∼10%), but statistically significant, improvements in intelligibility were obtained by HI listeners at low SNR conditions (−5 and 0 dB) when the consonants were amplified by 6 dB in the high-frequency regions (>1.5 kHz) and attenuated by 20 dB in the low-frequency regions. This form of selective consonant amplification produced a small decrease (2–3 dB) in the consonant-to-vowel ratio (see Table I) and provided reliable low-frequency acoustic landmarks (e.g., vowel/consonant boundaries) signifying the onsets of syllables or words (see example in Fig. 5). This in turn facilitated a better lexical segmentation of the speech stream (Stevens, 2002), leading to the small improvement in intelligibility.
In the context of designing noise-reduction algorithms for hearing aids, the present study demonstrated the potential of selective consonant amplification. That is, a simple signal processing algorithm can be devised that first identifies the voiced/unvoiced boundaries and then amplifies/attenuates accordingly the high/low frequency regions of the consonant spectra.
ACKNOWLEDGMENTS
This work was completed by the first author (R.S.) as part of her Master's thesis requirement in the Department of Electrical Engineering. The work was supported by Grant No. R01 DC010494 (P.C.L.) awarded by the National Institute on Deafness and Other Communication Disorders (NIDCD) of the National Institutes of Health (NIH). The authors would like to thank Dr. Fei Chen for his help with one of the experiments.
References
- ANSI (2003). ANSI S3.22, Specification of Hearing Aid Characteristics (Acoustical Society of America, New York).
- Assmann, P. F., and Summerfield, Q. (2004). “The perception of speech under adverse acoustic conditions,” in Speech Processing in the Auditory System, edited by S. Greenberg, W. A. Ainsworth, A. N. Popper, and R. R. Fay (Springer, New York), pp. 231–308.
- Bentler, R. A., and Chiou, L.-K. (2006). “Digital noise reduction: An overview,” Trends Amplif. 10, 67–83.
- Bentler, R., Wu, Y., Kettel, J., and Hurtig, R. (2008). “Digital noise reduction: Outcomes from laboratory and field studies,” Int. J. Audiol. 47, 447–460. doi:10.1080/14992020802033091
- Byrne, D., Dillon, H., Ching, T., Katsch, R., and Keidser, G. (2001). “NAL-NL1 procedure for fitting nonlinear hearing aids: Characteristics and comparisons with other procedures,” J. Am. Acad. Audiol. 12(1), 37–51.
- Ching, T. Y., Dillon, H., and Byrne, D. (1998). “Speech recognition of hearing-impaired listeners: Predictions from audibility and the limited role of high-frequency amplification,” J. Acoust. Soc. Am. 103, 1128–1140. doi:10.1121/1.421224
- Edwards, B. (2004). “Hearing aids and hearing impairment,” in Speech Processing in the Auditory System, edited by S. Greenberg (Springer-Verlag, New York), pp. 339–421.
- Freyman, R. L., Nerbonne, G. P., and Cote, H. A. (1991). “Effect of consonant-vowel ratio modification on amplitude envelope cues for consonant recognition,” J. Speech Hear. Res. 34, 415–426.
- Freyman, R. L., and Nerbonne, G. P. (1989). “The importance of consonant–vowel intensity ratio in the intelligibility of voiceless consonants,” J. Speech Hear. Res. 32, 524–535.
- Gordon-Salant, S. (1986). “Recognition of natural and time/intensity altered CVs by young and elderly subjects with normal hearing,” J. Acoust. Soc. Am. 80, 1599–1607. doi:10.1121/1.394324
- Gordon-Salant, S. (1987). “Effects of acoustic modification on consonant recognition by elderly hearing-impaired listeners,” J. Acoust. Soc. Am. 81, 1199–1202. doi:10.1121/1.394643
- Hogan, C., and Turner, C. (1998). “High-frequency audibility: Benefits for hearing-impaired listeners,” J. Acoust. Soc. Am. 104, 432–441. doi:10.1121/1.423247
- IEEE (1969). “IEEE recommended practice for speech quality measurements,” IEEE Trans. Audio Electroacoust. AU-17, 225–246.
- Kawahara, H., Masuda-Katsuse, I., and de Cheveigne, A. (1999). “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Commun. 27, 187–207. doi:10.1016/S0167-6393(98)00085-5
- Kewley-Port, D., Burkle, Z., and Lee, J. (2007). “Contribution of consonant versus vowel information to sentence intelligibility for young normal-hearing and elderly hearing-impaired listeners,” J. Acoust. Soc. Am. 122, 2365–2375. doi:10.1121/1.2773986
- Kondoz, A. (2004). Digital Speech, 2nd ed. (Wiley, West Sussex, UK), Chap. 6, pp. 149–197.
- Li, N., and Loizou, P. (2007). “Factors influencing glimpsing of speech in noise,” J. Acoust. Soc. Am. 122, 1165–1172. doi:10.1121/1.2749454
- Li, N., and Loizou, P. (2008). “The contribution of obstruent consonants and acoustic landmarks to speech recognition in noise,” J. Acoust. Soc. Am. 124(6), 3947–3958. doi:10.1121/1.2997435
- Li, N., and Loizou, P. (2010). “Masking release and the contribution of obstruent consonants on speech recognition in noise by cochlear implant users,” J. Acoust. Soc. Am. 128(3), 1262–1271. doi:10.1121/1.3466845
- Loizou, P. C. (2007). Speech Enhancement: Theory and Practice (CRC Press, Boca Raton, FL), Appendix C, pp. 589–599.
- Luce, P., and Charles-Luce, J. (1985). “Contextual effects on vowel duration, closure duration and the consonant/vowel ratio in speech production,” J. Acoust. Soc. Am. 78(6), 1949–1957. doi:10.1121/1.392651
- Montgomery, A. A., and Edge, R. (1988). “Evaluation of two speech enhancement techniques to improve intelligibility for hearing-impaired adults,” J. Speech Hear. Res. 31, 386–393.
- Molis, M., and Summers, V. (2003). “Effects of high presentation levels on recognition of low- and high-frequency speech,” ARLO 4, 124–128. doi:10.1121/1.1605151
- Moore, B. C. J. (2001). “Dead regions in the cochlea: Diagnosis, perceptual consequences, and implications for the fitting of hearing aids,” Trends Amplif. 5, 1–34. doi:10.1177/108471380100500102
- Owens, E., Talbott, C., and Schubert, E. (1968). “Vowel discrimination of hearing impaired listeners,” J. Speech Hear. Res. 11, 648–655.
- Owens, E., Benedict, M., and Schubert, E. (1972). “Consonant phonemic errors associated with pure-tone configurations and certain kinds of hearing impairment,” J. Speech Hear. Res. 15, 308–322.
- Parikh, G., and Loizou, P. (2005). “The influence of noise on vowel and consonant cues,” J. Acoust. Soc. Am. 118, 3874–3888. doi:10.1121/1.2118407
- Port, R., and Dalby, J. (1982). “Consonant/vowel ratio as a cue of voicing in English,” Percept. Psychophys. 32, 141–152. doi:10.3758/BF03204273
- Skinner, M. W. (1980). “Speech intelligibility in noise-induced hearing loss: Effects of high-frequency compensation,” J. Acoust. Soc. Am. 67(1), 306–317. doi:10.1121/1.384463
- Stelmachowicz, P., Mace, A., Kopun, J., and Carney, E. (1993). “Long-term and short-term characteristics of speech: Implications for hearing-aid selection for young children,” J. Speech Hear. Res. 36, 609–620.
- Stevens, K. N. (2002). “Towards a model of lexical access based on acoustic landmarks and distinctive features,” J. Acoust. Soc. Am. 111(4), 1872–1891. doi:10.1121/1.1458026
- Studebaker, G. A. (1985). “A ‘rationalized’ arcsine transform,” J. Speech Hear. Res. 28, 455–462.
- Studebaker, G., Sherbecoe, R., McDaniel, M., and Gwaltney, C. (1999). “Monosyllabic word recognition at higher-than-normal speech and noise levels,” J. Acoust. Soc. Am. 105, 2431–2444. doi:10.1121/1.426848
- Turner, C. W., and Cummings, K. J. (1999). “Speech audibility for listeners with high-frequency hearing loss,” Am. J. Audiol. 8, 47–56. doi:10.1044/1059-0889(1999/002)
- Turner, C. W., and Robb, M. P. (1987). “Audibility and recognition of stop consonants in normal and hearing-impaired subjects,” J. Acoust. Soc. Am. 81, 1566–1573. doi:10.1121/1.394509
- Yund, E., and Buckles, K. (1995). “Multichannel compression in hearing aids: Effect of number of channels on speech discrimination in noise,” J. Acoust. Soc. Am. 97, 1206–1223. doi:10.1121/1.413093