Abstract
The speech signal may be divided into frequency bands, each containing temporal properties of the envelope and fine structure. For maximal speech understanding, listeners must allocate their perceptual resources to the most informative acoustic properties. Understanding this perceptual weighting is essential for the design of assistive listening devices that need to preserve these important speech cues. This study measured the perceptual weighting of young normal-hearing listeners for the envelope and fine structure in each of three frequency bands for sentence materials. Perceptual weights were obtained under two listening contexts: (1) when each acoustic property was presented individually and (2) when multiple acoustic properties were available concurrently. The processing method was designed to vary the availability of each acoustic property independently by adding noise at different levels. Perceptual weights were determined by correlating a listener’s performance with the availability of each acoustic property on a trial-by-trial basis. Results demonstrated that weights were (1) equal when acoustic properties were presented individually and (2) biased toward envelope and mid-frequency information when multiple properties were available. Results suggest a complex interaction between the available acoustic properties and the listening context in determining how best to allocate perceptual resources when listening to speech in noise.
INTRODUCTION
Examining how speech information is distributed over the spectrum has been of paramount importance for modeling the intelligibility of speech under degraded conditions, such as from reverberation, competing talkers, filtering, or hearing loss (e.g., French and Steinberg, 1947; Fletcher and Galt, 1950; Houtgast and Steeneken, 1985; ANSI, 1997). The measurement of frequency-importance functions has application to understanding the functional impact of hearing loss, as well as to designing assistive devices and speech transmission technologies. The primary method of obtaining these importance functions has been through high- and low-pass filtering studies in which the intelligibility of restricted frequency regions has been tested (e.g., French and Steinberg, 1947). Research investigating frequency-band importance functions has found that certain frequency bands contribute more to the intelligibility of uninterrupted speech, and this information has been incorporated into the speech intelligibility index (SII; ANSI, 1997). Each frequency band contains acoustic information that can be further divided into two temporal properties: envelope (E) and temporal fine structure (TFS). The spectral distribution of the perceptual use of E and TFS properties has also been examined using high- and low-pass filtering methods (Ardoint and Lorenzi, 2010). Each of these studies has examined the independent contribution of isolated spectral regions, typically defined using octave or 1∕3-octave bands.
However, in contrast to the filtered speech contexts that form the basis of these frequency-importance functions, in everyday listening environments listeners have the entire broadband speech spectrum available. Therefore, these methods index how specific adjacent frequency regions contribute independently to speech intelligibility, but they do not account for how perceptual use of these frequency regions might change when multiple spectral regions are available, particularly if these spectral regions are widely separated (for how listeners may synergistically combine non-adjacent bands, see Grant and Braida, 1991; Warren et al., 1995). How the perceptual use of E and TFS temporal properties combines across frequency regions also has yet to be fully explored.
The current study explores the perceptual use of multiple frequency and temporal acoustic channels for sentence intelligibility when these channels are presented separately and concurrently. Experiment 1 explores the perceptual use of three different frequency bands in broadband speech contexts. Furthermore, it extends investigations to the E and TFS. Experiment 2 measures the perceptual weight of the E and TFS information across three frequency bands. In particular, Experiment 2 measures these weights when multiple cues are available to listeners using the correlational method (Berg, 1989; Richards and Zhu, 1994; Lutfi, 1995).
The correlational method was developed and applied to speech (see Doherty and Turner, 1996) to determine how individual listeners perceptually weight individual acoustic components. In the general correlational method used by Doherty and Turner (1996), speech was divided into three frequency bands. The speech acoustics in each band were individually and randomly degraded on each trial through the addition of noise. For example, the signal-to-noise ratio (SNR) within each of three bands on a given trial could be −7, +1, and −1 dB. The trial-by-trial data were then analyzed by calculating the point-biserial correlation (Lutfi, 1995) between the SNR within each band and the listener’s performance (using binary coding: correct∕incorrect) on that trial. These correlations were then normalized to sum to 1.0 across the three bands to obtain relative weights, interpreted as indexing the proportion that each band contributed to the listener’s speech-recognition performance. The correlational method shares the advantage of SII-type filtering experiments in that it can determine the relative contributions of independent frequency bands but, unlike the SII approach, does so using the full bandwidth of speech. Another advantage of the correlational method over filtering experiments is that it can obtain weights when the speech information presented is separated into substantially different frequency regions.
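To make this trial-by-trial analysis concrete, the following minimal Python sketch computes point-biserial correlations between per-trial band SNRs and binary keyword scores and normalizes them into relative weights. It is illustrative only, not the custom MATLAB software used in the studies discussed here; the array names and toy data are hypothetical.

```python
import numpy as np

def relative_weights(snr_by_band, correct):
    """Point-biserial correlation of each band's per-trial SNR with the binary
    score, then normalized so the weights sum to 1.0 across bands.

    snr_by_band : (n_trials, n_bands) array of SNRs in dB presented on each trial
    correct     : (n_trials,) array of 0/1 scores (incorrect/correct)
    """
    correct = np.asarray(correct, dtype=float)
    # The point-biserial r is the Pearson r between a continuous variable (SNR)
    # and a dichotomous variable (correct/incorrect).
    r = np.array([np.corrcoef(snr_by_band[:, b], correct)[0, 1]
                  for b in range(snr_by_band.shape[1])])
    return r, r / np.sum(np.abs(r))

# Toy illustration with fabricated data (3 bands, 600 trials), not data from any study:
rng = np.random.default_rng(0)
snrs = rng.choice([-7, -1, 2, 5, 11], size=(600, 3))          # per-band SNRs in dB
p_correct = 1.0 / (1.0 + np.exp(-snrs.mean(axis=1) / 4.0))    # fake psychometric link
raw_r, weights = relative_weights(snrs, rng.binomial(1, p_correct))
```

With this normalization, a channel whose SNR does not influence performance tends toward a weight near zero, while channels that are equally informative tend toward equal weights.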
The correlational method has been applied to a host of domains, including spectral shape discrimination (Berg and Green, 1990, 1992; Lentz and Leek, 2002) and temporal envelope discrimination (Apoux and Bacon, 2004), as well as cochlear-implant weighting strategies (Mehr et al., 2001) and electrode function (Turner et al., 2002). In addition, Calandruccio and Doherty (2007) have extended the approach to obtain weighting functions for IEEE sentence materials. This project is designed specifically to address how temporal information within each of several frequency bands contributes to speech understanding.
The contribution of various frequency regions likely depends upon the type of linguistic information conveyed within each region. This may help explain why different frequency-weighting functions are obtained for different speech materials (ANSI, 1997). For example, vowels appear to contribute more than consonants in sentence contexts (Kewley-Port et al., 2007; Fogerty and Kewley-Port, 2009), but not in word contexts (Owren and Cardillo, 2006; Fogerty and Humes, 2010). Different linguistic information becomes available and is more informative in different speech contexts, such as the prosodic contour constraining the lexical access of words in sentences (Wingfield et al., 1989; Laures and Weismer, 1999). By bandpass filtering speech into three frequency regions broadly associated with prosodic, sonorant, and obstruent∕fricative linguistic categories, it is possible to map linguistic contributions onto more general acoustic properties. In addition to frequency contributions, speech within each frequency band is composed of complex temporal modulations in amplitude (i.e., E) and frequency (i.e., TFS). Amplitude modulations alone, such as those implemented by cochlear implants, convey voicing and manner cues, while frequency cues convey information regarding the place of articulation (Xu et al., 2005).
Experiment 1a investigates the contribution of different frequency regions of linguistic significance. Grant and Walden (1996) have found that high-frequency speech acoustics convey the linguistic information of syllabic number and stress while low-frequency speech acoustics convey intonation patterns. Thus, the frequency spectrum of speech may be divided into regions of predominant linguistic significance. In contrast to previous filtering studies, the contribution of these frequency regions is examined in a wideband listening context.
Experiment 1b extends these methods to examine the independent contributions of the E and TFS temporal dynamics analyzed across these three frequency regions. The different temporal timescales present in speech, slow E modulations and fast TFS modulations, also convey different types of linguistic information (Rosen, 1992). For example, the E provides information about syllabic structure and speech rate, while the TFS provides acoustic cues related to dynamic formant transitions. Experiment 1b examines the independent contributions of these two types of temporal modulation properties. Previous studies have manipulated the amount of temporal information available in several ways, such as by varying the number of analysis bands (e.g., Shannon et al., 1995; Dorman et al., 1997; Hopkins et al., 2008), filtering the modulation rate (e.g., Drullman et al., 1994a,b; van der Horst et al., 1999), or both (e.g., Xu et al., 2005). By doing so, these studies have altered the temporal information present in the speech stimulus and, in some cases, also the spectral information. In contrast to these methods, experiment 1b varies the SNR from trial to trial and uses the correlational method to derive importance estimates. In doing so, the processing of temporal information remains unchanged, but the availability of these cues is altered systematically, thereby allowing the measurement of the perceptual contributions of specific temporal modulation properties to speech intelligibility.
Finally, experiment 2 investigates the distributions of perceptual weights for the E and TFS across frequency bands. In contrast to experiment 1 that investigates individual acoustic channels, experiment 2 measures perceptual weights when speech information in multiple acoustic channels is concurrently available. For example, while listeners may be able to use all spectral information equally, they may place more weight on a particular spectral region when information across the frequency spectrum is available.
This study varies the availability of each of these acoustic cues (frequency bands or temporal properties) across a range of SNRs. In doing so, this set of experiments:
(1) measures the contributions of individual acoustic properties in a masked wideband context;
(2) examines the contributions of linguistically relevant acoustic divisions;
(3) demonstrates the feasibility of extending the SNR method to temporal properties; and
(4) measures perceptual weights for the E and TFS across frequency bands when multiple cues are available.
EXPERIMENT 1A: INDIVIDUAL FREQUENCY BAND PERCEPTUAL WEIGHTS
This experiment investigated the individual contributions of each of three frequency bands for wideband speech presentation of meaningful sentences. Performance in each test band was determined as a function of SNR while the contribution of the other two bands was limited by noise masking. The frequency regions used here are related to primary (linguistically relevant) speech information.
Listeners
Nine young normal-hearing listeners (18–21 yr) were paid to participate in the study. All participants were native speakers of American English and had pure-tone thresholds bilaterally not greater than 20 dB HL at octave intervals from 250 to 8000 Hz (ANSI, 2004).
Stimuli and design
IEEE∕Harvard sentences were selected for use in this study (IEEE, 1969). Sentences were previously recorded by a male talker and are available on CD ROM (Loizou, 2007). These stimuli are all meaningful sentences that contain five keywords, such as in the experimental sentence, “The birch canoe slid on the smooth planks.” All signal processing of the stimuli was completed in matlab using custom software and modifications of code provided by Smith et al. (2002). All experimental sentences were down-sampled to a sampling rate of 16 000 Hz and passed through a bank of bandpass filters to process speech into three different frequency bands. These frequency bands represent equal distances along the cochlea according to a cochlear map (Liberman, 1982) and correspond roughly to prosodic, sonorant, and obstruent∕fricative linguistic categories, respectively. Table 1 displays the frequency range of these bands along with the corresponding number of equivalent rectangular bandwidths (ERBs; Moore and Glasberg, 1983) and the calculated SII values assuming full contribution (i.e., band SNR > 15 dB) for each band. Note that the selected bands for this study have about the same number of ERBs and similar SII values. Fundamental frequency and formant frequency values were calculated for all experimental sentences using STRAIGHT, a speech analysis, modification, and synthesis system (Kawahara et al., 1999). Mean values for each sentence were calculated using a 50-ms sliding window and are provided in Table 2.
Table 1.
Band | Frequency range (Hz) | ERB | SII (ANSI, 1997) |
---|---|---|---|
1 | 80–528 | 8.3 | 0.32 |
2 | 528–1941 | 9.9 | 0.35 |
3 | 1941–6400 | 10.0 | 0.33 |
Table 2.
Formant | Mean (Hz) | StDev (Hz) |
---|---|---|
F0 | 131 | 7 |
F1 | 492 | 31 |
F2 | 1561 | 71 |
F3 | 2910 | 68 |
The contribution of each independent frequency band was investigated by varying the SNR of the target band of interest while maintaining a constant negative SNR in the other two bands. Thus, stimuli contained speech acoustic cues across the entire stimulus spectrum, with cues heavily masked in the two non-target (i.e., masked) bands. The noise-masked bands were included to prevent off-frequency listening and to more closely match listening conditions in which broadband speech information is available.
Figure 1 displays the processing for the independent band conditions. After passing the speech stimulus through the bank of three bandpass filters, a constant noise was added to the two masked bands to yield an SNR of −5 dB in each of those bands. A unique sample of noise was generated for each sentence to match the power spectrum of the target sentence. This noise was then bandpass-filtered and added to the corresponding speech band. For the remaining (target) speech band, the SNR was varied over a range of five SNRs (11, 5, 2, −1, −7 dB). Thus, performance on this task varied as a function of SNR in the target frequency band. All three frequency bands were investigated, creating a total of 15 conditions (3 frequency bands × 5 SNR levels). Twenty-four sentences (120 keywords) were presented per condition, with no sentences being repeated for a given listener. All 360 sentences were presented in a fully randomized order. After processing, all stimuli were up-sampled to 48 828 Hz for presentation through Tucker–Davis Technologies (TDT) System-III hardware.
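As a rough illustration of this processing chain, the sketch below filters a sentence and a spectrum-matched noise into the three bands of Table 1, masks the two non-target bands at −5 dB SNR, sets the target band to the trial SNR, and sums the bands. This is a minimal Python sketch, not the authors' MATLAB code; the Butterworth filters, filter order, and function names are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 16000                                       # Hz, processing sample rate
BANDS = [(80, 528), (528, 1941), (1941, 6400)]   # Hz, band edges from Table 1

def bandpass(x, lo, hi, fs=FS, order=4):
    # Filter type and order are illustrative; the original filters are not specified here.
    b, a = butter(order, [lo, hi], btype='bandpass', fs=fs)
    return filtfilt(b, a, x)

def scale_to_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise level difference equals snr_db."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    return noise * rms(speech) / (rms(noise) * 10.0 ** (snr_db / 20.0))

def make_exp1a_stimulus(speech, noise, target_band, target_snr_db, masked_snr_db=-5):
    """Target band at the trial SNR; the two non-target bands masked at -5 dB SNR."""
    out = np.zeros_like(speech)
    for i, (lo, hi) in enumerate(BANDS):
        s = bandpass(speech, lo, hi)
        n = bandpass(noise, lo, hi)
        snr = target_snr_db if i == target_band else masked_snr_db
        out += s + scale_to_snr(s, n, snr)
    return out
```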
Calibration
A speech-shaped noise was designed in Adobe Audition to match the long-term average speech spectrum (±2 dB) of a concatenation of 600 unprocessed sentences from the IEEE database. Note that this calibration noise, used only for calibration in all experiments described here, was different from the unique noise samples used during stimulus processing (described above), which matched the spectrum of each individual sentence. Stimuli were presented via TDT System-III hardware using 16-bit resolution at a sampling frequency of 48 828 Hz. The output of the TDT D∕A converter was passed through a headphone buffer (HB-7) and then to an ER-3A insert earphone. The calibration noise was set to 70 dB SPL through the insert earphone in a 2-cc coupler using a Larson Davis Model 2800 sound level meter with linear weighting. Therefore, the original unprocessed wideband sentences were calibrated for presentation at 70 dB SPL. However, after filtering and noise masking, the overall sound level of the combined stimulus varied according to the individual condition. This ensured that levels representative of a typical conversational level (70 dB SPL) were maintained for the individual spectral components of speech across all sentences.
Procedure
Listeners were seated individually in a sound-attenuating booth and listened to the stimulus sentences presented unilaterally to their right ear. Test stimuli were controlled by TDT System-III hardware connected to a PC running a custom-designed matlab stimulus-presentation interface. Each listener was instructed regarding the task using verbal and written instructions. Prior to each experimental test, all listeners completed familiarization trials that used the same signal processing as the experimental sentences. These familiarization sentences were selected from male talkers in the TIMIT database (Garofolo et al., 1990) and processed according to the stimulus processing procedures for that task. All listeners completed experimental testing regardless of performance on the familiarization tasks. No feedback was provided during familiarization or testing to avoid explicit learning across the many stimulus trials. All listeners received different randomizations of the 360 experimental sentences (120 sentences∕target band). Each sentence was presented only once.
During the experimental testing, each sentence was presented individually and the listener was prompted to repeat the sentence aloud as accurately as possible. Listeners were encouraged to guess, without regard to whether their responses made logical sentences. No feedback was provided. All listener responses were digitally recorded for later analysis. Only keywords repeated exactly were scored as correct (e.g., no missing or extra suffixes). In addition, keywords were allowed to be repeated back in any order to be counted as correct repetitions. Each keyword was marked as 0 or 1, corresponding to an incorrect or correct response, respectively. Three native English speakers were trained to serve as raters and scored all recorded responses. Inter-rater agreement on a 10% sample of responses was 93%.
Results and discussion
Frequency band (three bands) and SNR (five levels) were entered as repeated-measures variables in a general linear model analysis of the percent-correct keyword scores. All percent-correct scores, here and elsewhere, were transformed into rationalized arcsine units to stabilize the error variance (Studebaker, 1985) prior to analysis. A main effect of SNR [F(4,32) = 173.4, p < 0.001] and a target band × SNR interaction [F(8,64) = 6.2, p = 0.001] were obtained after Greenhouse–Geisser correction of the degrees of freedom. No main effect of frequency band was obtained. Post hoc paired-samples t tests demonstrated significant differences between the low-frequency band and both the mid- and high-frequency bands at SNRs of −1 and 5 dB (Bonferroni-corrected p < 0.003). Figure 2 plots performance in each band across SNR. Performance for each of the three bands was significantly correlated across all SNRs (Pearson's r = 0.69–0.90, p < 0.05), indicating that individual differences in listener performance were consistent across all three bands and SNRs.
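For reference, the rationalized arcsine transform applied here is commonly given by the formula in the short Python sketch below; this sketch is illustrative and is not taken from the study's analysis code.

```python
import numpy as np

def rau(num_correct, num_items):
    """Rationalized arcsine units (RAU), as the Studebaker (1985) transform is
    commonly stated: theta = asin(sqrt(X/(N+1))) + asin(sqrt((X+1)/(N+1))),
    RAU = (146/pi) * theta - 23. Scores near 0% and 100% are stretched so the
    error variance is more nearly uniform (range roughly -23 to 123)."""
    X, N = float(num_correct), float(num_items)
    theta = np.arcsin(np.sqrt(X / (N + 1.0))) + np.arcsin(np.sqrt((X + 1.0) / (N + 1.0)))
    return (146.0 / np.pi) * theta - 23.0

# Example: 90 of 120 keywords correct (75%) in one condition
print(rau(90, 120))
```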
The trial-by-trial data were analyzed by calculating the point-biserial correlation (Lutfi, 1995) between the SNR within the target band and the listener’s word score (using binary coding: correct∕incorrect) on that given trial. These correlations are provided in Table 3. Correlations contained 600 points for each frequency band (5 SNRs × 24 sentences × 5 keywords). All raw correlations were significantly above the null hypothesis of zero (for a 95% confidence interval, |r| > 1.96∕√600 ≈ 0.08; see Lutfi, 1995), indicating perceptual use of each frequency band. The correlations across the three frequency bands were then normalized to sum to one to obtain relative weights for each band (see Doherty and Turner, 1996; Turner et al., 1998). Figure 3 displays these relative correlational weights. As Turner et al. (1998) noted with nonsense syllables, the correlational weights obtained for each target frequency band, each measured in isolation, are approximately equal. No differences between frequency band weights were obtained (Bonferroni-corrected p > 0.016), indicating that the perception of each independent band was affected by the noise in a similar fashion. This is consistent with the independent, additive nature of bands in Articulation Theory (French and Steinberg, 1947) and the equivalent SII values estimated for the bands used in this study (reported in Table 1). Performance for all listeners followed the equal-weighting pattern apparent in the mean data.
Table 3.
Listener | Average (%) | Band 1 | Band 2 | Band 3 |
---|---|---|---|---|
1a_01 | 35 | 0.40 | 0.39 | 0.42 |
1a_02 | 50 | 0.39 | 0.29 | 0.33 |
1a_03 | 37 | 0.39 | 0.44 | 0.40 |
1a_04 | 47 | 0.42 | 0.31 | 0.36 |
1a_05 | 41 | 0.38 | 0.37 | 0.32 |
1a_06 | 41 | 0.44 | 0.43 | 0.33 |
1a_07 | 29 | 0.33 | 0.23 | 0.25 |
1a_08 | 25 | 0.24 | 0.24 | 0.25 |
1a_09 | 40 | 0.40 | 0.35 | 0.35 |
Mean | 38 | 0.38 | 0.34 | 0.33 |
Finally, it is important to note that the SII predicts contributions of bands down to −15 dB SNR. Therefore, the conditions presented here reflect contributions of dominant speech information in the target band in combination with other non-target information. However, in pilot testing, the masked non-target bands alone did not support more than 8% correct word identification. Furthermore, performance varied with the target-band SNR while the availability of the masked non-target bands was held constant across conditions. Overall, results suggest equal contributions of these three bands when independently tested in the presence of other wideband speech information.
EXPERIMENT 1B: INDIVIDUAL ENVELOPE AND FINE STRUCTURE PERCEPTUAL WEIGHTS
Experiment 1b was designed to test the temporal modulation properties of the E and TFS, as defined by the Hilbert transform, using methods analogous to experiment 1a. Previous research has typically measured the perceptual strength of E or TFS information by either varying the number of analysis frequency bands (e.g., Dorman et al., 1997; Shannon et al., 1998; Smith et al., 2002) or systematically adding temporal information to successively more bands (Hopkins et al., 2008). Filtering modulation frequencies of the envelope (Drullman et al., 1994a,b; Shannon et al., 1995; van der Horst et al., 1999; Xu and Pfingst, 2003) or quantifying the strength of speech envelope modulations relative to spurious non-signal modulations (Dubbelboer and Houtgast, 2008) have also been investigated as methods to evaluate the importance of temporal information. However, varying the SNR has proven to be a robust method for examining the contribution of specific speech information (e.g., Miller et al., 1951). Using this same method for the evaluation of temporal information provides a significant advantage for comparing perceptual weights between frequency and temporal dimensions, as it places both acoustic properties on the same measurement scale using the same type of speech distortion. This approach has been used increasingly to investigate the perceptual use of E information (e.g., Apoux and Bacon, 2004), and is extended here to also investigate perceptual weighting of TFS information. This method has the advantage of presenting the entire speech stimulus while selectively masking E or TFS cues. Therefore, experiment 1b obtained performance functions independently for E and TFS acoustics by varying the SNR of the target speech information. Note that the E was defined by the Hilbert transform and therefore contained even the highest modulation rates, which would include some periodicity cues. These fast-rate E-modulation cues were included so as to provide all speech information, as fast-rate cues may significantly contribute to speech intelligibility. Recent studies have similarly presented the entire Hilbert envelope as the E information (e.g., Gallun and Souza, 2008; Hopkins and Moore, 2010).
Methods
Eight new listeners participated in experiment 1b (19–23 yr). All participants were native speakers of American English and had pure-tone thresholds bilaterally no greater than 20 dB HL at octave intervals from 250 to 8000 Hz (ANSI, 2004).
Experiment 1b used the same presentation procedures as in experiment 1a. Sentences were again selected from the IEEE database for testing. The 240 sentences used here were not presented in experiment 1a. The E contains amplitude fluctuations across a range of modulation rates. Modulation rates for E peak around 4 Hz for broadband processing of these IEEE speech materials. Importantly, E and TFS contributions have been shown to be similar when processed using three frequency bands (Smith et al., 2002) and vowels and consonants are identified equally well for three-band envelope vocoders processed over a range of 150–5500 Hz (Xu et al., 2005). The metric for frequency band analysis (linear vs logarithmic) also does not influence speech perception using only envelope cues (Shannon et al., 1998). Therefore, the three frequency bands used in experiment 1a were used as the analysis bands for the current E and TFS processing.
A diagram for the stimulus processing is displayed in Fig. 4. A speech-shaped noise matching the long-term average of the target sentence was created. This noise was scaled to −5 dB SNR and added to the target sentence. Then, the masked sentence was passed through a bank of three analysis filters. Within each frequency band, the E and TFS were extracted using the Hilbert transform, resulting in isolated E and TFS properties in each of the three bands. These masked modulation properties will be referred to as −5_E and −5_TFS, respectively. The speech-shaped noise was also scaled over the range of SNR values used in experiment 1a (11, 5, 2, −1, −7 dB) and added to the original sentence, from which the target E (E_target) and target TFS (TFS_target) were extracted over the three analysis bands. For envelope testing, in each analysis band the E_target and masked −5_TFS were combined and summed across frequency bands. The same was performed for TFS testing, where the E in each band was replaced by the masked −5_E for that band and combined with the TFS_target. Thus, for both E and TFS testing, the non-test modulation property was masked at a constant −5 dB SNR while the target portion was varied over the range of test SNRs. Finally, the entire stimulus was re-filtered at 6400 Hz and upsampled to 48 828 Hz to produce the final stimulus. There were a total of ten conditions (2 temporal properties × 5 SNRs). Twenty-four sentences (120 keywords) were presented per condition.
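The band-wise recombination of target and masked temporal properties described above can be sketched as follows. This is an illustrative Python sketch, not the authors' MATLAB implementation: it reuses the hypothetical bandpass, scale_to_snr, and BANDS helpers from the experiment 1a sketch, approximates the final re-filtering step with a simple band limit, and omits the up-sampling to 48 828 Hz.

```python
import numpy as np
from scipy.signal import hilbert

def env_tfs(x):
    """Hilbert envelope and temporal fine structure of a band-limited signal."""
    analytic = hilbert(x)
    return np.abs(analytic), np.cos(np.angle(analytic))

def make_exp1b_stimulus(speech, noise, target='E', target_snr_db=2, masked_snr_db=-5):
    """Target temporal property taken per band from the mixture at the trial SNR;
    the non-target property taken from the mixture masked at -5 dB SNR."""
    mix_target = speech + scale_to_snr(speech, noise, target_snr_db)   # varied-SNR mixture
    mix_masked = speech + scale_to_snr(speech, noise, masked_snr_db)   # constant -5 dB mixture
    out = np.zeros_like(speech)
    for lo, hi in BANDS:
        env_t, tfs_t = env_tfs(bandpass(mix_target, lo, hi))
        env_m, tfs_m = env_tfs(bandpass(mix_masked, lo, hi))
        # E testing: varied-SNR envelope carried on the masked TFS; reversed for TFS testing.
        out += env_t * tfs_m if target == 'E' else env_m * tfs_t
    return bandpass(out, 80, 6400)   # approximate final band-limiting step
```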
As in experiment 1a, listeners were seated alone in a sound-attenuating booth. Each participant listened to a unique randomization of the sentences in each of the ten conditions monaurally via an ER-3A insert earphone and responded aloud by repeating as accurately as possible the sentence that was presented. Responses were digitally recorded for offline analysis and scoring. All listeners first received familiarization trials, as in experiment 1a. No feedback was provided.
Results and discussion
A two (temporal property: E and TFS) by five (SNR level) repeated-measures analysis of variance (ANOVA) resulted in significant main effects of temporal property [F(1,7) = 123.7, p < 0.001] and SNR [F(4,28) = 346.0, p < 0.001], as well as a significant interaction between these two variables [F(4,28) = 18.1, p < 0.001]. Post hoc t tests between E and TFS conditions at each SNR demonstrated significant differences at −1, 2, and 5 dB SNR (Bonferroni-corrected p < 0.01). As seen in Fig. 5, the perceptual use of E and TFS acoustic information varies similarly across SNRs, with listeners tending toward better performance, by about 9 percentage points on average, in the E conditions. As noted for the frequency band stimuli of experiment 1a, correlations between E and TFS conditions were significant across all SNRs (Pearson's r = 0.84–0.87, p < 0.01), with the exception of 11 dB SNR. Also of note is that performance appears to level off at 81%. This is in agreement with Shannon et al. (1995), who found that E-only speech reached a maximum of about 80% for sentences in quiet when processed through three frequency bands.
As in experiment 1a, point-biserial correlations were calculated between the word score and the SNR in the target temporal property for that trial; these appear in Table 4. Correlations contained 600 points for both temporal modulation property conditions (5 SNRs × 24 sentences × 5 keywords). Again, both E and TFS raw correlations were significantly above zero (p < 0.05). The correlations across the two temporal conditions were then normalized to sum to one to obtain relative weights. These weights are plotted in Fig. 6. No significant difference was obtained between E and TFS perceptual weights [t(7) = −2.84, p = 0.79]. All listeners followed this pattern, indicating that they placed equal relative perceptual weight on these two types of temporal acoustic information.
Table 4.
Listener | Average (%) | E | TFS |
---|---|---|---|
1b_01 | 57 | 0.50 | 0.44 |
1b_02 | 63 | 0.41 | 0.43 |
1b_03 | 61 | 0.41 | 0.43 |
1b_04 | 57 | 0.43 | 0.44 |
1b_05 | 51 | 0.46 | 0.49 |
1b_06 | 68 | 0.39 | 0.34 |
1b_07 | 51 | 0.45 | 0.49 |
1b_08 | 47 | 0.50 | 0.52 |
Mean | 57 | 0.44 | 0.45 |
DISCUSSION OF EXPERIMENT 1
Results of experiment 1a demonstrated that performance varies as a function of the SNR similarly within each of the three frequency bands. Furthermore, all listeners placed equal weight on each of these individual spectral regions. These frequency bands convey very different types of speech information, yet these results indicate that listeners are able to use these different acoustic cues to obtain similar speech-recognition performance for these speech materials.
Each of these bands was presented in a wideband speech context that best models the real world acoustic environment for normal-hearing listeners. As the SNR in this target band was the only independent variable, changes in performance may be attributed to the use of speech information in that target band. However, it is important to note that each target band was presented along with other wideband speech information, albeit noise masked. Therefore, it cannot be determined whether each target band provides direct information for speech recognition, or indirectly facilitates sentence intelligibility via the disambiguation of other off-band speech information.
Results of experiment 1b demonstrated that performance varies as a function of the SNR similarly for E and TFS information. The results obtained here for E and TFS masking are in good agreement with other methods that have indicated approximately equal contributions of E and TFS when processed in three frequency bands. Using a chimaeric paradigm where the E of one sentence and the TFS of another sentence were combined, Smith et al. (2002) found that listeners had the same accuracy for both the E and TFS sentences when processed through three frequency bands. The equal perceptual weights obtained in the current study support this finding. Ardoint and Lorenzi (2010) also examined the spectral distribution of E and TFS cues using high- and low-pass filtering. They found a cross-over frequency of these filters at 1500 Hz for both E and TFS information, which is in the middle of the mid-frequency band in this study. This suggests that E and TFS information were equally distributed across each of the three frequency bands used here and is consistent with the equal perceptual weights obtained.
This study introduced a method of varying the availability of temporal properties without altering the underlying analysis and extraction of E and TFS, thereby preserving the modulation and frequency spectra of the underlying speech stimulus. This method allows signal processing equivalent to that used for the frequency band information to be applied to the temporal modulation information. Results demonstrated that listeners are equally able to use acoustic information in each of these acoustic channels when presented in wideband contexts. However, it may be that E or TFS cues are more informative in certain spectral regions than others. The processing method used here allows for the examination of perceptual weights placed on E and TFS cues in different spectral regions, which experiment 2 explores.
EXPERIMENT 2: PERCEPTUAL WEIGHTING OF ENVELOPE AND FINE STRUCTURE ACROSS FREQUENCY FOR SPEECH IN NOISE
Experiment 1 demonstrated the feasibility of using noise masking and the correlational method to measure the perceptual weights assigned to E and TFS speech information when each temporal cue was independently available. Experiment 2 was designed to investigate the perceptual weight that is assigned to these two temporal modulation properties in different frequency regions when multiple acoustic cues are concurrently available.
Temporal modulation properties may be weighted differently in each of the frequency bands. It may be that the TFS is perceptually more available in the low-frequency band, as neural fine structure coding decreases at higher frequencies (reviewed by Moore, 2008). However, TFS added to individual bands across the spectrum facilitates speech understanding above what is provided by E cues alone (Hopkins and Moore, 2010). Envelope acoustics might be weighted more heavily in high-frequency regions (Apoux and Bacon, 2004), where the onsets and offsets of consonants are most prominent, as abrupt E changes are coded by slow synchronous neural responses (Moore et al., 2001). Experiment 2 investigated the perceptual weight of E and TFS properties across frequency bands when multiple sources of speech information were available. A correlational method (Richards and Zhu, 1994; Lutfi, 1995) was again used to relate the availability of speech information in a specific channel, indexed by the SNR, with performance on that given trial. Thus, perceptual weights were obtained using a trial-by-trial analysis of performance and signal information.
A number of researchers have used this correlational method to examine the perceptual weights that individuals place on speech information in different frequency regions for nonsense speech (Doherty and Turner, 1996; Turner et al., 1998), for sentences (Calandruccio and Doherty, 2007), and for hearing-impaired listeners (Calandruccio and Doherty, 2008). Apoux and Bacon (2004) have also examined E contributions in isolated band-limited stimuli. However, the current study extends these investigations by independently and concurrently varying E and TFS information in three frequency bands. Thus, the distribution of perceptual weights applied to temporal information across the spectrum in wideband speech was obtained in this experiment. That is, in experiments 1a and 1b there was only one “channel” of speech information predominantly available at a time: either one of the three spectral channels (exp 1a) or one of two possible temporal channels (exp 1b). Here, six channels of information (3 spectral bands × 2 temporal modulation properties) are available.
Listeners
Ten young normal-hearing listeners (18–26 yr) were paid to participate in the study. All participants were native speakers of American English and had pure-tone thresholds bilaterally no greater than 20 dB hearing level (HL) at octave intervals from 250 to 8000 Hz (ANSI, 2004). These listeners did not participate in experiment 1.
Stimuli and design
The experimental task presented multiple channels using 600 IEEE sentences from the same recordings used in experiment 1 (Loizou, 2007). The task consisted of varying the SNR along two different dimensions. Figure 7 displays the processing diagram used for these stimuli. The first dimension along which SNR varied was spectral frequency, i.e., frequency band. The same three frequency bands described in experiment 1 were used here. The second dimension corresponded to the temporal properties present within each frequency band: E and TFS. As noted, this resulted in six acoustic information “channels” (two temporal properties in each of three frequency bands), which will be denoted according to the temporal property and band number, as in E1 (i.e., E modulation in band 1). The speech stimulus and a matching speech-shaped noise were passed through the bank of band-pass filters, and in each band the noise level was scaled according to the SNR condition for that trial and added to copies of the filtered speech. The Hilbert transform then divided each combined speech-and-noise band into the E and TFS. For example, if the SNRs for the E and TFS in band 1 were −1 and −7 dB, respectively, two copies of the speech in band 1 were made; these were combined with filtered noise scaled to either −1 or −7 dB SNR. The Hilbert transform was then used to extract the E from the −1 dB SNR copy and the TFS from the −7 dB SNR copy. The noisy Hilbert components, masked at different SNRs, were then recombined, re-filtered at 6400 Hz, and up-sampled to 48 828 Hz for presentation through the TDT System-III hardware.
The SNR was varied independently in each of these six acoustic channels. Each sentence was combined with a unique noise matching the long-term average speech spectrum of that sentence, ensuring that the same SNR was maintained, on average, across all frequency components; thus, the SNR in each channel reflects a true SNR calibrated to the individual stimulus trial. SNR values ranged from −7 to 5 dB in 3 dB steps, resulting in five different levels. A correlational method (Richards and Zhu, 1994; Lutfi, 1995) was used that correlates the SNR of each channel on a given trial with the listener's response accuracy on that trial. In order to use this analysis method, the SNRs in the six acoustic channels were independently and randomly assigned, so on a given trial the same SNR level could be presented to more than one channel (e.g., a trial could consist of SNRs of −1, 2, −1, −7, 5, and 2 dB distributed across the six channels). Each SNR was presented in each acoustic channel a total of 120 times (120 trials × 5 SNRs = 600 total trials). Each sentence contained five keywords, resulting in 3000 keywords scored.
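A rough sketch of this per-trial, six-channel construction is given below. Again this is illustrative Python, not the study's MATLAB code; it reuses the hypothetical bandpass, scale_to_snr, and BANDS helpers from the experiment 1a sketch and omits the final re-filtering and up-sampling.

```python
import numpy as np
from scipy.signal import hilbert

SNR_LEVELS = [-7, -4, -1, 2, 5]    # dB; five levels from -7 to 5 dB in 3 dB steps

def make_exp2_stimulus(speech, noise, rng):
    """Draw an independent SNR for each of the six channels (E1-E3, TFS1-TFS3);
    per band, build two noisy copies and take the E from one and the TFS from the other."""
    snr_e = rng.choice(SNR_LEVELS, size=3)     # envelope SNR for bands 1-3
    snr_t = rng.choice(SNR_LEVELS, size=3)     # fine-structure SNR for bands 1-3
    out = np.zeros_like(speech)
    for i, (lo, hi) in enumerate(BANDS):
        s = bandpass(speech, lo, hi)
        n = bandpass(noise, lo, hi)
        env = np.abs(hilbert(s + scale_to_snr(s, n, snr_e[i])))             # E from the E-SNR copy
        tfs = np.cos(np.angle(hilbert(s + scale_to_snr(s, n, snr_t[i]))))   # TFS from the TFS-SNR copy
        out += env * tfs
    return out, np.concatenate([snr_e, snr_t])  # stimulus plus the six trial SNRs
```

The six SNRs returned for each trial are what the correlational analysis pairs with the binary keyword scores.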
Procedure
Test procedures were the same as in experiment 1, including familiarization trials processed according to the stimulus design used here. All listeners received different randomizations of the experimental sentences. All sentences were only presented once to each listener. As in experiment 1, stimuli were presented unilaterally to the right ear. The listeners’ task was to repeat aloud as accurately as possible the sentence that was presented. Responses were digitally recorded for offline analysis. The same raters as in experiment 1 scored the responses for experiment 2.
Results
Mean perceptual weights
Listeners were tested on the correlational task during two different sessions. Average keyword-recognition performance in each session was calculated to determine whether learning occurred across sessions even though feedback was not provided. No significant difference in performance between the two sessions was observed (p > 0.05); overall, listeners improved by only 0.1 percentage points in the second session. Therefore, data from both sessions were pooled for analysis.
The main experimental task investigated the relative weighting of envelope and fine structure modulations across three frequency bands. Of particular interest was how each of these six acoustic channels is weighted when speech information is simultaneously available from multiple channels. To determine the relative weighting of each acoustic channel, a point-biserial correlation was calculated trial-by-trial between the word score and the SNR in each channel; these correlations are presented in Table 5. Each correlation was calculated over 3000 points (600 sentences × 5 keywords). Correlations were then normalized to sum to one to reflect the relative perceptual weight assigned to each acoustic channel and are plotted in Fig. 8.
Table 5.
Listener | Average (%) | E1 | E2 | E3 | TFS1 | TFS2 | TFS3 |
---|---|---|---|---|---|---|---|
2_01 | 62 | 0.20 | 0.20 | 0.17 | 0.11 | 0.12 | 0.09 |
2_02 | 59 | 0.19 | 0.27 | 0.17 | 0.11 | 0.17 | 0.07 |
2_03 | 69 | 0.13 | 0.15 | 0.0 | 0.07 | 0.13 | 0.07 |
2_04 | 63 | 0.16 | 0.20 | 0.20 | 0.08 | 0.13 | 0.06 |
2_05 | 60 | 0.14 | 0.20 | 0.19 | 0.09 | 0.16 | 0.09 |
2_06 | 44 | 0.15 | 0.19 | 0.16 | 0.06 | 0.14 | 0.10 |
2_07 | 79 | 0.12 | 0.13 | 0.12 | 0.06 | 0.15 | 0.05 |
2_08 | 66 | 0.11 | 0.18 | 0.15 | 0.05 | 0.17 | 0.08 |
2_09 | 60 | 0.15 | 0.19 | 0.15 | 0.06 | 0.13 | 0.08 |
2_10 | 65 | 0.18 | 0.21 | 0.15 | 0.08 | 0.16 | 0.05 |
Mean | 63 | 0.15 | 0.19 | 0.17 | 0.08 | 0.15 | 0.07 |
A repeated-measures ANOVA was conducted on the normalized weights for keyword scoring (Fig. 8). Significant main effects of temporal modulation [F(1,9) = 194.8, p < 0.001] and frequency band [F(2,18) = 28.4, p < 0.001] were found, as well as a significant modulation × band interaction [F(2,18) = 8.0, p < 0.01]. Overall, listeners weighted E contributions more than TFS contributions (an average difference of 0.09) and band 2 more than either of the other two bands (by a difference of 0.07). Post hoc t tests demonstrated significant differences between E and TFS in all frequency bands (Bonferroni-corrected p < 0.006). Bands 1 and 2 were significantly different for E and TFS channels, while bands 2 and 3 were significantly different for the TFS channel only (Bonferroni-corrected p < 0.006). Note that no difference was obtained between bands 1 and 3 for either E or TFS channels, indicating that these are weighted similarly and that differences in frequency band weighting are driven by the mid-frequency band (528–1941 Hz). Analysis of sentence scores (where all keywords within the sentence were required to be correct) revealed the same results, with the exception that no difference was obtained between E and TFS weights in the mid band (p > 0.05).
All raw correlations (obtained across all five keywords) for each listener were significantly different from the null hypothesis of zero (p < 0.05), indicating that each channel received some perceptual weighting. However, in experiment 1 each frequency band and each temporal modulation property received the same perceptual weight. If the same held when these channels were presented concurrently, all six channels would be expected to have a relative weight of 1∕6 (approximately 0.17). Figure 8 displays a dotted horizontal line at this value as a reference. Points above this line indicate that an acoustic channel received more perceptual weight than would be expected from an equal distribution of weighting across channels (the prediction based on experiment 1 with isolated channels). Points below this line indicate less perceptual weight. Note that although each channel can be used equally well in isolation, some channels receive more weight when acoustic cues are available from multiple channels. Namely, the E2 and E3 channels received more perceptual weight (p < 0.05). However, this occurred at the perceptual cost of placing relatively less weight on the TFS1 and TFS3 channels (p < 0.05).
Individual listeners
All listeners followed this same configuration of relative weights, with few exceptions. Two listeners, subjects 2_01 and 2_07, deviated from the mean weighting pattern in three channels. The highest performing listener, 2_07, weighted all E channels equally, as would be expected from experiment 1, but placed the most weight on TFS2 acoustics. In contrast, 2_01 placed significantly less weight on the TFS2 channel, instead favoring more weight on E1. This individual performed close to the group mean in overall accuracy.
Correlations among the raw point-biserial coefficients for each channel demonstrated that E1 was significantly correlated with the E2 (r = 0.76) and TFS1 (r = 0.86) channels. Raw coefficients for E2 and TFS1 were also significantly correlated (r = 0.64). Thus, the availability of E and TFS modulations in band 1 was influenced similarly by the noise level and was predictive of E2 weighting. Perceptual weights for E1 and TFS1 were also correlated (r = 0.65), indicating that listeners who used E1 more also used TFS1 more.
Sentence context
As the stimuli used were natural sentences that may involve dynamic processing, the stability of perceptual weights across the sentence was explored. To investigate this, relative weights were calculated over the first two keywords in the sentence and compared to relative weights obtained from the last two keywords. Results indicated significant differences between the first and second halves of the sentence only for E1 and TFS3 channels (p < 0.008). In the second half of the sentence, less weight was given to E1 acoustics while more weight was placed on TFS3 cues. The configuration of perceptual weights over the entire sentence mostly reflects the pattern obtained for weights on the second half of the sentence.
Sentence context, which can facilitate the prediction of upcoming words (Grosjean, 1980), was not likely to have contributed to this perceptual shift, as performance was actually better on the first half of the sentence than on the second half for all listeners, by 16 percentage points (p < 0.001). Instead, other sentence properties, such as intonation contours, may be involved in shifting relative weights dynamically across the sentence. Alternatively, the changing perceptual weights could reflect a dynamic shift in perceptual processing, where listeners initially process the envelope equally in all frequency bands but then focus processing on E2 as well as TFS3 cues later in the sentence.
Discussion
Frequency band contributions
Individual frequency bands yielded equal perceptual weights in isolation (see experiment 1). However, when multiple bands were available, differential preference was given to the mid-frequency band in this study. Turner et al. (1998) suggested that channels weighted more heavily in broadband conditions than in isolated narrow-band conditions contribute synergistically, yielding a whole greater than the sum of its parts. This suggests that important speech information is not equally distributed across the spectrum. However, the low relative weights of certain channels do not necessarily imply that those channels make no significant contribution to speech perception, or that they could not contribute under different listening conditions. Other convergent methods are necessary to define what actually underlies the relative perceptual weights that are obtained.
Work from Turner et al. (1998) and Calandruccio and Doherty (2007) suggests that listeners weight frequencies below 1120 Hz most when multiple frequency regions are available for IEEE sentences and nonsense syllables. This frequency region includes the low band (80–528 Hz) and part of the mid-frequency band (528–1941 Hz) used in this study. Given the high weighting of the mid-frequency region for both E and TFS acoustics, it appears that the 528–1120 Hz range in this mid band supplies crucial speech information for IEEE sentences. For the current stimuli, this band contains an overlap of F1 and F2 formant frequencies. Contributions of the mid-frequency band in this study are consistent with frequency-importance studies (DePaolis et al., 1996; Sherbecoe and Studebaker, 2002) and the contribution of narrow spectral slits (Greenberg et al., 1998). Indeed, Greenberg et al. (1998) claimed the “supreme importance” of the mid-frequency region (750–2381 Hz) for speech intelligibility.
Temporal property contributions
Listeners in this study placed the most perceptual weight on E acoustics, particularly in the mid band. This is in contrast to Apoux and Bacon (2004) who used the correlational method with E-only nonsense syllables that contained high-rate E modulations up to 500 Hz. They found equal weight of E across frequencies in quiet and more weight on frequencies > 2400 Hz in noise. This higher frequency band has additional E modulation information (mid rate 10–25 Hz) that is not present in the lower frequencies (Greenberg et al., 1998). The difference in findings between this study and that of Apoux and Bacon (2004) may also reflect the difference between closed-set syllable identification and the open-set sentence tasks used here or the different frequency bands investigated.
However, the mid-frequency region in this study may be most important for E and TFS cues because of the linguistic information it conveys. Kasturi et al. (2002) found that E bands at 300–487 Hz and 791–2085 Hz are weighted most for vowel identification and contain most of the F2 information (frequency-band contributions were flat for consonant identification). Given that vowels are essential for sentence intelligibility (Kewley-Port et al., 2007; Fogerty and Kewley-Port, 2009), it is likely that this mid-frequency region contains the more important amplitude modulations for the intelligibility of the sentences used in the current study. Phatak and Allen (2007) also found that vowels are easier to recognize than consonants in continuous speech-weighted noise, perhaps because of the continued perceptual availability of this mid band (for both E and TFS). Both TFS and E convey distinct and important phonetic cues in this mid-frequency band between 1000 and 2000 Hz (Ardoint and Lorenzi, 2010), further supporting the higher perceptual weights for both E2 and TFS2 found in this study.
Also consistent with this study, Hopkins and Moore (2010) found greater perceptual benefit for adding E to the 400–2000 Hz range than to frequency bands above or below. TFS contributions above what was provided by the E alone were also noted across all frequency bands, consistent with the significant weight placed on TFS in each of the bands here (albeit significantly limited in bands 1 and 3).
The reduced weighting of TFS in the high-frequency region is consistent with reduced phase locking (Moore, 2008). Perceptually, the speech envelope contributes up to 6000 Hz, while TFS contributions begin to break down above 2000 Hz (Ardoint and Lorenzi, 2010). In contrast to the high-band TFS acoustics, it is clear that low-band TFS provides contributions to speech intelligibility (Kong and Carlyon, 2007; Li and Loizou, 2008; Hopkins and Moore, 2009). However, the mid-band TFS was more important during the continuous noise presented here. Therefore, it may be more important to provide TFS2, rather than TFS1, to listeners using combined acoustic and electric hearing methods in listening contexts with continuous noise.
Support for this is provided by listener 2_07 who had the best performance and placed the most weight on TFS2 with equal weights across frequency bands for E. Individual patterns such as this may highlight important training parameters to facilitate speech-recognition processes. TFS in the mid-frequency range may provide a greater perceptual benefit than low-frequency information, particularly because perceptual use of E1 and TFS1 was significantly correlated. Therefore, these channels may contain overlapping or correlated perceptual information. E1 and TFS1 do not appear to be completely independent perceptual channels. This may be because both E1 and TFS1 appear to provide cues to F0, coded in the repetition rate of the waveform or the resolution of harmonics (discussed by Hopkins and Moore, 2010).
GENERAL DISCUSSION
The purpose of this set of experiments was to investigate the relative importance of envelope and fine structure acoustic cues across linguistically significant frequency bands during the perception of uninterrupted speech in noise. Experiment 1 demonstrated that equal relative perceptual weight was placed on each acoustic channel when that channel's availability was individually varied across a range of SNRs. All acoustic channels were equally susceptible to degradation by varying the SNR. However, experiment 2 demonstrated that these perceptual weights do not remain equal when multiple acoustic information channels are concurrently available, perhaps due to synergistic interactions among channels. Instead, relative to other cues, listeners placed most of their perceptual weight on E and band-2 acoustic information.
Of particular importance may be the perceptual weighting observed in the mid-frequency band (500–2000 Hz). During meaningful, uninterrupted sentences, listeners weighted both E and TFS cues highest in this region. This suggests that preserving the acoustic cues presented in this band may be essential for maximal speech understanding.
Particularly interesting is the perceptual weighting of TFS acoustics. Traditionally, TFS in the low-frequency region has been viewed as being most important for speech perception (Hopkins et al., 2008; Hopkins and Moore, 2009), partly because of its involvement conveying pitch (see Moore, 2008), and therefore has been an important consideration for facilitating E-only speech (Kong and Carlyon, 2007; Li and Loizou, 2008). However, it now appears that TFS contributes to speech perception across the entire frequency spectrum of speech (Hopkins and Moore, 2010) and the current study argues that TFS between 500 and 2000 Hz may be most important for speech in continuous noise, perhaps because it codes the formant frequency dynamics important for place cues (Rosen, 1992). This may help explain why vowels—predominantly conveyed by F1 and F2 formants—are essential for sentence intelligibility (Kewley-Port et al., 2007; Fogerty and Kewley-Port, 2009).
It was surprising that prosodic band-1 information did not receive more weight. The prosodic contour of speech conveyed by the F0 is important for speech perception in quiet and in competing contexts (e.g., Laures and Weismer, 1999; Darwin et al., 2003; Wingfield et al., 1989). Therefore, it may make contributions, particularly at low SNRs or in the context of competing signals. These lower frequencies provide suprasegmental cues for syntactic units and source segregation (e.g., Lehiste, 1970; Darwin et al., 2003). Further research is needed to explore the possible contributions of low-weighted acoustic channels, such as this band.
SUMMARY AND CONCLUSIONS
This study investigated how listeners perceptually weight envelope and fine structure acoustic information across the frequency spectrum. These cues were presented under two listening conditions: (1) when each acoustic property was individually presented and (2) when multiple acoustic properties were concurrently presented. By varying the SNR in each channel independently, an estimate of the relative perceptual weight a listener placed on that cue was obtained for each listening condition. The results of this study demonstrate the following:
(1) Equal contributions to sentence intelligibility were obtained for three frequency bands conveying predominant prosodic, sonorant, and obstruent∕fricative linguistic cues when each band was tested individually in a masked wideband context.
(2) A method that manipulated the SNR in order to vary the availability of temporal-modulation information was validated and demonstrated equal perceptual weighting of envelope and fine structure acoustics processed in three frequency bands over an 80–6400 Hz range.
(3) When multiple cues were simultaneously available for perception (Experiment 2), the mid-frequency band (528–1941 Hz) was weighted the most for both the envelope and fine structure. Overall, across all spectral regions, the envelope was the predominant cue that listeners used.
(4) Little perceptual weight was placed on low-frequency envelope or fine structure cues in uninterrupted speech contexts. It may be that acoustic channels that received little perceptual weight, and therefore had performance that was not strongly associated with the variations in SNR in that channel, still provide significant contributions even when highly degraded.
ACKNOWLEDGMENTS
The author would like to thank Larry Humes, Diane Kewley-Port, David Pisoni, and Jennifer Lentz, who were instrumental in supporting and offering comments regarding this work. Larry Humes also provided helpful suggestions regarding an earlier version of this manuscript. Kate Giesen and Eleanor Barlow assisted in scoring participant responses. This work was conducted as part of a dissertation completed at Indiana University and was funded, in part, by NIA Grant No. R01 AG008293 awarded to Larry Humes.
References
- ANSI (1997). ANSI S3.5–1997, Methods for the Calculation of the Speech Intelligibility Index (American National Standards Institute, New York).
- ANSI (2004). ANSI S3.6–2004, Specifications for Audiometers (American National Standards Institute, New York).
- Apoux, F., and Bacon, S. P. (2004). “Relative importance of temporal information in various frequency regions for consonant identification in quiet and in noise,” J. Acoust. Soc. Am. 116, 1671–1680. 10.1121/1.1781329
- Ardoint, M., and Lorenzi, C. (2010). “Effects of lowpass and highpass filtering on the intelligibility of speech based on temporal fine structure or envelope cues,” Hear. Res. 260, 89–95. 10.1016/j.heares.2009.12.002
- Berg, B. G. (1989). “Analysis of weights in multiple observation tasks,” J. Acoust. Soc. Am. 86, 1743–1746. 10.1121/1.398605
- Berg, B. G., and Green, D. M. (1990). “Spectral weights in profile listening,” J. Acoust. Soc. Am. 88, 758–766. 10.1121/1.399725
- Berg, B. G., and Green, D. M. (1992). “Discrimination of complex spectra: Spectral weights and performance efficiency,” in Auditory Physiology and Perception, edited by Y. Cazals, L. Demany, and K. Horner (Pergamon, London), pp. 373–379.
- Calandruccio, L., and Doherty, K. A. (2007). “Spectral weighting strategies for sentences measured by a correlational method,” J. Acoust. Soc. Am. 121, 3827–3836. 10.1121/1.2722211
- Calandruccio, L., and Doherty, K. A. (2008). “Spectral weighting strategies for hearing-impaired listeners measured using a correlational method,” J. Acoust. Soc. Am. 123, 2367–2378. 10.1121/1.2887857
- Darwin, C. J., Brungart, D. S., and Simpson, B. D. (2003). “Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers,” J. Acoust. Soc. Am. 114, 2913–2922. 10.1121/1.1616924
- DePaolis, R. A., Janota, C. P., and Frank, T. (1996). “Frequency importance functions for words, sentences, and continuous discourse,” J. Speech Hear. Res. 39, 714–723.
- Doherty, K. A., and Turner, C. W. (1996). “Use of a correlational method to estimate a listener’s weighting function for speech,” J. Acoust. Soc. Am. 100, 3769–3773. 10.1121/1.417336
- Dorman, M. F., Loizou, P. C., and Rainey, D. (1997). “Speech intelligibility as a function of the number of channels of stimulation for signal processors using sine-wave and noise-band outputs,” J. Acoust. Soc. Am. 102, 2403–2411. 10.1121/1.419603
- Drullman, R., Festen, J. M., and Plomp, R. (1994a). “Effect of temporal envelope smearing on speech reception,” J. Acoust. Soc. Am. 95, 1053–1064. 10.1121/1.408467
- Drullman, R., Festen, J. M., and Plomp, R. (1994b). “Effect of reducing slow temporal modulations on speech reception,” J. Acoust. Soc. Am. 95, 2670–2680. 10.1121/1.409836
- Dubbelboer, F., and Houtgast, T. (2008). “The concept of signal-to-noise ratio in the modulation domain and speech intelligibility,” J. Acoust. Soc. Am. 124, 3937–3946. 10.1121/1.3001713
- Fletcher, H., and Galt, R. (1950). “Perception of speech and its relation to telephony,” J. Acoust. Soc. Am. 22, 89–151. 10.1121/1.1906605
- Fogerty, D., and Humes, L. E. (2010). “Perceptual contributions to monosyllabic word intelligibility: Segmental, lexical, and noise replacement factors,” J. Acoust. Soc. Am. 128, 3114–3125. 10.1121/1.3493439
- Fogerty, D., and Kewley-Port, D. (2009). “Perceptual contributions of the consonant-vowel boundary to sentence intelligibility,” J. Acoust. Soc. Am. 126, 847–857. 10.1121/1.3159302
- French, N. R., and Steinberg, J. C. (1947). “Factors governing the intelligibility of speech sounds,” J. Acoust. Soc. Am. 19, 90–119. 10.1121/1.1916407
- Gallun, F., and Souza, P. (2008). “Exploring the role of the modulation spectrum in phoneme recognition,” Ear Hear. 29, 800–813. 10.1097/AUD.0b013e31817e73ef
- Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., and Dahlgren, N. (1990). “DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM,” National Institute of Standards and Technology, NTIS Order No. PB91-505065.
- Grant, K. W., and Braida, L. D. (1991). “Evaluating the articulation index for auditory-visual input,” J. Acoust. Soc. Am. 89, 2952–2960. 10.1121/1.400733
- Grant, K. W., and Walden, B. E. (1996). “Spectral distribution of prosodic information,” J. Speech Hear. Res. 39, 228–238.
- Greenberg, S., Arai, T., and Silipo, R. (1998). “Speech intelligibility derived from exceedingly sparse spectral information,” in Proceedings of the International Conference on Spoken Language Processing, pp. 2803–2806.
- Grosjean, F. (1980). “Spoken word recognition processes and the gating paradigm,” Percept. Psychophys. 28, 267–283. 10.3758/BF03204386
- Hopkins, K., and Moore, B. C. J. (2009). “The contribution of temporal fine structure to the intelligibility of speech in steady and modulated noise,” J. Acoust. Soc. Am. 125, 442–446. 10.1121/1.3037233
- Hopkins, K., and Moore, B. C. J. (2010). “The importance of temporal fine structure information in speech at different spectral regions for normal-hearing and hearing-impaired subjects,” J. Acoust. Soc. Am. 127, 1595–1608. 10.1121/1.3293003
- Hopkins, K., Moore, B. C. J., and Stone, M. A. (2008). “Effects of moderate cochlear hearing loss on the ability to benefit from temporal fine structure information in speech,” J. Acoust. Soc. Am. 123, 1140–1153. 10.1121/1.2824018
- Houtgast, T., and Steeneken, H. J. M. (1985). “A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria,” J. Acoust. Soc. Am. 77, 1069–1077. 10.1121/1.392224
- IEEE (1969). “IEEE recommended practice for speech quality measurements,” IEEE Trans. Audio Electroacoust. 17, 227–246.
- Kasturi, K., Loizou, P. C., Dorman, M., and Spahr, T. (2002). “The intelligibility of speech with ‘holes’ in the spectrum,” J. Acoust. Soc. Am. 112(Pt. 1), 1102–1111. 10.1121/1.1498855
- Kawahara, H., Masuda-Katsuse, I., and de Cheveigné, A. (1999). “Restructuring speech representations using a pitch adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Commun. 27, 187–207. 10.1016/S0167-6393(98)00085-5
- Kewley-Port, D., Burkle, T. Z., and Lee, J. H. (2007). “Contribution of consonant versus vowel information to sentence intelligibility for young normal-hearing and elderly hearing-impaired listeners,” J. Acoust. Soc. Am. 122, 2365–2375. 10.1121/1.2773986
- Kong, Y.-Y., and Carlyon, R. P. (2007). “Improved speech recognition in noise in simulated binaurally combined acoustic and electric stimulation,” J. Acoust. Soc. Am. 121, 3717–3727. 10.1121/1.2717408
- Laures, J., and Weismer, G. (1999). “The effect of flattened F0 on intelligibility at the sentence-level,” J. Speech Lang. Hear. Res. 42, 1148–1156.
- Lehiste, I. (1970). Suprasegmentals (MIT Press, Cambridge, MA), pp. 1–194.
- Lentz, J. J., and Leek, M. R. (2002). “Decision strategies of hearing-impaired listeners in spectral shape discrimination,” J. Acoust. Soc. Am. 111, 1389–1398. 10.1121/1.1451066
- Li, N., and Loizou, P. C. (2008). “A glimpsing account for the benefit of simulated combined acoustic and electric hearing,” J. Acoust. Soc. Am. 123, 2287–2294. 10.1121/1.2839013
- Liberman, M. C. (1982). “The cochlear frequency map for the cat: Labeling auditory-nerve fibers of known characteristic frequency,” J. Acoust. Soc. Am. 72, 1441–1449. 10.1121/1.388677
- Loizou, P. C. (2007). Speech Enhancement: Theory and Practice (CRC Press, Taylor and Francis, Boca Raton, FL), pp. 1–608.
- Lutfi, R. A. (1995). “Correlation coefficients and correlation ratios as estimates of observer weights in multiple-observation tasks,” J. Acoust. Soc. Am. 97, 1333–1334. 10.1121/1.412177
- Mehr, M. A., Turner, C. W., and Parkinson, A. (2001). “Channel weights for speech recognition in cochlear implant users,” J. Acoust. Soc. Am. 109, 359–366. 10.1121/1.1322021
- Miller, G. A., Heise, G. A., and Lichten, W. (1951). “The intelligibility of speech as a function of the context of test material,” J. Exp. Psychol. 41, 329–335. 10.1037/h0062491
- Moore, B. C. J. (2008). “The role of temporal fine structure processing in pitch perception, masking, and speech perception for normal-hearing and hearing-impaired people,” J. Assoc. Res. Otolaryngol. 9, 399–406. 10.1007/s10162-008-0143-x
- Moore, B. C. J., and Glasberg, B. R. (1983). “Suggested formulae for calculating auditory-filter bandwidths and excitation pattern,” J. Acoust. Soc. Am. 74, 750–753. 10.1121/1.389861
- Moore, D. R., Schnupp, J. W. H., and King, A. J. (2001). “Coding the temporal structure of sounds in auditory cortex,” Nat. Neurosci. 4, 1055–1056. 10.1038/nn1101-1055
- Owren, M. J., and Cardillo, G. C. (2006). “The relative roles of vowels and consonants in discriminating talker identity versus word meaning,” J. Acoust. Soc. Am. 119, 1727–1739. 10.1121/1.2161431
- Phatak, S. A., and Allen, J. B. (2007). “Consonant and vowel confusions in speech-weighted noise,” J. Acoust. Soc. Am. 121, 2312–2326. 10.1121/1.2642397
- Richards, V. M., and Zhu, S. (1994). “Relative estimates of combination weights, decision criteria, and internal noise based on correlation coefficients,” J. Acoust. Soc. Am. 95, 423–434. 10.1121/1.408336
- Rosen, S. (1992). “Temporal information in speech: Acoustic, auditory, and linguistic aspects,” Philos. Trans. R. Soc. London, Ser. B 336, 367–373. 10.1098/rstb.1992.0070
- Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science 270, 303–304. 10.1126/science.270.5234.303
- Shannon, R. V., Zeng, F.-G., and Wygonski, J. (1998). “Speech recognition with altered spectral distribution of envelope cues,” J. Acoust. Soc. Am. 104, 2467–2476. 10.1121/1.423774
- Sherbecoe, R. L., and Studebaker, G. A. (2002). “Audibility-index functions for the connected speech test,” Ear Hear. 23, 385–398. 10.1097/00003446-200210000-00001
- Smith, Z. M., Delgutte, B., and Oxenham, A. J. (2002). “Chimaeric sounds reveal dichotomies in auditory perception,” Nature 416, 87–90. 10.1038/416087a
- Turner, C. W., Kwon, B. J., Tanaka, C., Knapp, J., Hubbartt, J. L., and Doherty, K. A. (1998). “Frequency-weighting functions for broadband speech as estimated by a correlational method,” J. Acoust. Soc. Am. 104, 1580–1585. 10.1121/1.424370
- Turner, C., Mehr, M., Hughes, M., Brown, C., and Abbas, P. (2002). “Within-subject predictors of speech recognition in cochlear implants: A null result,” ARLO 3, 95–100. 10.1121/1.1477875
- van der Horst, R., Leeuw, A. R., and Dreschler, W. A. (1999). “Importance of temporal-envelope cues in consonant recognition,” J. Acoust. Soc. Am. 105, 1801–1809. 10.1121/1.426718
- Warren, R. M., Reiner, K. R., Bashford, J. A., Jr., and Brubaker, B. S. (1995). “Spectral redundancy: Intelligibility of sentences heard through narrow spectral slits,” Percept. Psychophys. 57, 175–182. 10.3758/BF03206503
- Wingfield, A., Lahar, C. J., and Stine, E. A. L. (1989). “Age and decision strategies in running memory for speech: Effects of prosody and linguistic structure,” J. Gerontol. Psychol. Sci. 44, 106–113.
- Xu, L., and Pfingst, B. E. (2003). “Relative importance of temporal envelope and fine structure in lexical-tone perception,” J. Acoust. Soc. Am. 114(Pt. 1), 3024–3027. 10.1121/1.1623786
- Xu, L., Thompson, C. S., and Pfingst, B. E. (2005). “Relative contributions of spectral and temporal cues for phoneme recognition,” J. Acoust. Soc. Am. 117, 3255–3267. 10.1121/1.1886405