The Journal of the Acoustical Society of America. 2016 Aug 3;140(2):EL197–EL203. doi: 10.1121/1.4960074

Speech recognition interference by the temporal and spectral properties of a single competing talker a)

Daniel Fogerty 1,b), Jiaqian Xu 1,c)
PMCID: PMC6910003  PMID: 27586780

Abstract

This study investigated how speech recognition during speech-on-speech masking may be impaired due to the interaction between amplitude modulations of the target and competing talker. Young normal-hearing adults were tested in a competing talker paradigm where the target and/or competing talker was processed to primarily preserve amplitude modulation cues. Effects of talker sex and linguistic interference were also examined. Results suggest that performance patterns for natural speech-on-speech conditions are largely consistent with the same masking patterns observed for signals primarily limited to temporal amplitude modulations. However, results also suggest a role for spectral cues in talker segregation and linguistic competition.

1. Introduction

The problem of talker segregation has frequently been investigated using the coordinate response measure (Bolia et al., 2000). This relatively context-free corpus has proven to be a useful tool for investigating how individual acoustic properties of the talker facilitate speech segregation, particularly in single competing talker conditions. These materials follow a constrained context in the form of “Ready [call sign] go to [color] [number] now” in which one of eight call signs, four colors, and eight numbers could occur. The database also contains recordings of four male and four female speakers. Factors such as differences between the two competing talkers in sex (Brungart, 2001; Brungart et al., 2001), vocal tract size (Darwin et al., 2003), fundamental frequency (Lee and Humes, 2012), and speech onset times (Lee and Humes, 2012) have received considerable attention. In addition, linguistic interference from the competing message has also received considerable attention, particularly by using time-reversed versions of the competing talker that preserve spectral and temporal properties, but remove semantic content that could potentially interfere with the processing of the target message. Such interference by overlapping sound energy or by linguistic content has commonly been referred to as energetic and informational forms of masking, respectively. Recently, Stone and colleagues (Stone et al., 2012; Stone and Moore, 2014) have advanced a modulation-based form of interference whereby the amplitude modulations of the competing speech interfere with processing the amplitude modulations of the target speech. Indeed, they have even described the random amplitude fluctuations intrinsic to “steady-state” noise, traditionally defined as an energetic masker, as also, or perhaps primarily, imposing modulation-based interference. 
Considerable evidence now suggests that primary information regarding the speech message is conveyed by the amplitude modulations of speech (e.g., Shannon et al., 1995; Fogerty, 2011; Apoux and Healy, 2013). The importance of amplitude modulations (i.e., temporal envelope) for speech recognition therefore warrants a direct investigation to examine the likely potential for modulation interference from a competing talker. While previous investigations have examined speech recognition in the presence of speech-shaped and envelope-modulated noise (e.g., Brungart, 2001; Brungart et al., 2001), these studies have not isolated temporal properties of the target talker in order to directly measure modulation interference with the target talker's temporal envelope. Toward that end, this study used the coordinate response measure to examine temporal interference between the temporal envelopes of the target and competing talkers. Given the highly constrained sentence frames, all sentences contain highly similar modulation spectra, and therefore, maximum modulation interference would be expected under these conditions. Additional factors related to talker sex, spectral detail, and linguistic interference were also examined.

2. Methods

2.1. Listeners

Thirteen young adults (mean age = 20.8 years; range = 18–26 years, 12 female) participated in the experiment. All participants were native speakers of American English and had audiometric thresholds below 20 dB hearing level at octave frequencies from 250 to 8000 Hz.

2.2. Stimuli and design

The target and competing talker sentences from the coordinate response measure were either naturally preserved (Nat) or degraded to preserve primarily temporal envelope (ENV) cues, i.e., amplitude modulation. Competing sentences were always spoken by a talker different from the target talker and were of either the same or a different sex. Finally, linguistic content was examined by using time-forward or time-reversed competing sentences. Thus, this study used a 2 (target speech preservation) × 2 (competing speech preservation) × 2 (competing sex) × 2 (temporal playback) design. Thirty-two sentences were presented in each of the 16 conditions, resulting in a total of 512 trials. On half of the trials the target talker was male and on the other half, the target talker was female.
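The trial count follows directly from the factorial design. A minimal sketch of that arithmetic is given here; the factor and level names are illustrative labels, not identifiers taken from the study materials.

```python
from itertools import product

# Illustrative labels for the four two-level factors described above.
FACTORS = {
    "target": ("Nat", "ENV"),
    "competitor": ("Nat", "ENV"),
    "competitor_sex": ("same", "different"),
    "playback": ("forward", "reversed"),
}
SENTENCES_PER_CONDITION = 32

# Every combination of factor levels is one experimental condition.
conditions = [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]
total_trials = len(conditions) * SENTENCES_PER_CONDITION

print(len(conditions), total_trials)  # 16 512
```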

2.3. Envelope processing

Sentences were processed using a modification of the chimeric paradigm (Smith et al., 2002) as implemented by Fogerty (2011). Processing selectively preserved amplitude modulations (ENV) and degraded frequency modulations conveyed by the temporal fine structure (TFS). This method has the advantage of limiting cues provided by the acoustic TFS while avoiding confounds of vocoder processing that use the same carrier for both target and competing stimuli (Apoux and Healy, 2013). First, the speech signal was combined with noise matching the power spectrum of the target speech at 11 dB signal-to-noise ratio (SNR). The noise was generated by randomizing the phase of the Fourier speech spectrum. This signal was filtered into three contiguous frequency bands with cutoff frequencies of 80, 528, 1941, and 6400 Hz. Each frequency band represented an equal distance along the cochlea. The ENV was extracted from each band using the Hilbert transform. Second, the same procedure was conducted for the same speech sample, this time combined with noise at −5 dB SNR. In this second case, the Hilbert transform was used to extract the TFS from each band. Third, the ENV from the 11 dB SNR copy and the TFS from the −5 dB SNR copy were combined within each frequency band and then summed across frequency. The combined stimulus resulted in restoration of the full speech signal, but with greater relative preservation of the acoustic ENV component compared to the noise-degraded acoustic TFS component.
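The three processing steps can be sketched in Python with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the band split uses brick-wall FFT masks because the filter type is not given in the text, the same noise token is reused for both SNR copies, and all function names are hypothetical.

```python
import numpy as np

BAND_EDGES_HZ = (80.0, 528.0, 1941.0, 6400.0)  # cutoffs given in the text

def phase_randomized_noise(x, rng):
    """Noise with the magnitude spectrum of x and uniformly random phase."""
    spec = np.fft.rfft(x)
    phase = rng.uniform(0.0, 2.0 * np.pi, spec.shape)
    phase[0] = 0.0                      # DC bin must stay real
    if x.size % 2 == 0:
        phase[-1] = 0.0                 # Nyquist bin must stay real
    return np.fft.irfft(np.abs(spec) * np.exp(1j * phase), n=x.size)

def mix_at_snr(speech, noise, snr_db):
    """Add noise scaled so that speech power / noise power equals snr_db."""
    gain = np.sqrt(np.mean(speech**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
    return speech + gain * noise

def analytic_bands(x, fs, edges=BAND_EDGES_HZ):
    """Analytic signal of each band (assumption: brick-wall FFT filters)."""
    spec = np.fft.fft(x)
    freqs = np.fft.fftfreq(x.size, d=1.0 / fs)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)   # positive frequencies only
        bands.append(np.fft.ifft(2.0 * spec * mask))
    return bands

def env_chimera(speech, fs, rng, env_snr_db=11.0, tfs_snr_db=-5.0):
    """ENV from the high-SNR copy carried by the TFS of the low-SNR copy."""
    noise = phase_randomized_noise(speech, rng)  # assumption: one shared token
    env_src = analytic_bands(mix_at_snr(speech, noise, env_snr_db), fs)
    tfs_src = analytic_bands(mix_at_snr(speech, noise, tfs_snr_db), fs)
    out = np.zeros(speech.size)
    for a_env, a_tfs in zip(env_src, tfs_src):
        envelope = np.abs(a_env)                  # Hilbert envelope
        fine_structure = np.cos(np.angle(a_tfs))  # Hilbert fine structure
        out += envelope * fine_structure
    return out
```

The sampling rate must exceed 12 800 Hz for the upper 6400 Hz cutoff to be meaningful; a call such as `env_chimera(x, 16000, np.random.default_rng(0))` returns a signal of the same length as `x`.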

2.4. ENV modulation analysis

Performance in this task may, in part, be determined by the temporal properties of the target and competing stimulus. Therefore, an ENV analysis was conducted to determine the relative ENV preservation following processing and time-reversal. For this analysis, the ENV was extracted from the stimulus file via half-wave rectification and low-pass filtering using a sixth-order Butterworth filter at 50 Hz and downsampled to 1000 Hz. Correlations were then conducted between the stimulus-extracted ENV of the various conditions for 128 sentences spoken by one of the male CRM talkers. The correlation for each sentence was calculated over at least 1449 points (range: 1449–1797 points). This analysis revealed a significant correlation between the natural and ENV-processed sentence versions, r = 0.95, with much weaker correlations, as expected, between time-forward and time-reversed sentence versions, r = 0.25 to 0.35. Note that these correlations were conducted between two stimulus versions of the same sentence, and therefore indicate how the stimulus manipulation altered the temporal properties of the original recording. Results demonstrate significant preservation of the original ENV following ENV stimulus processing. Actual correlations between the target and competing stimulus, which had different keywords and were spoken by a different talker, were much reduced. Correlations for time-forward maskers averaged 0.19 and 0.23 for different and same sex talkers, respectively (p < 0.05). Correlations with time-reversed maskers did not reach significance (r = −0.09 and −0.10, respectively, p > 0.05).
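The ENV extraction and correlation described above can be sketched as follows. This sketch makes several assumptions that the text does not specify: the sixth-order Butterworth response is applied as a zero-phase magnitude weighting in the frequency domain rather than as a time-domain filter, downsampling is plain decimation (so the input rate must be an integer multiple of 1000 Hz), and the function names are hypothetical.

```python
import numpy as np

def extract_envelope(x, fs, cutoff_hz=50.0, order=6, fs_out=1000):
    """Half-wave rectify, low-pass (Butterworth magnitude), and decimate."""
    rectified = np.maximum(x, 0.0)                      # half-wave rectification
    spec = np.fft.rfft(rectified)
    freqs = np.fft.rfftfreq(rectified.size, d=1.0 / fs)
    # Butterworth magnitude response |H(f)| = 1 / sqrt(1 + (f/fc)^(2n)),
    # applied with zero phase (an assumption made for simplicity).
    gain = 1.0 / np.sqrt(1.0 + (freqs / cutoff_hz) ** (2 * order))
    smoothed = np.fft.irfft(spec * gain, n=rectified.size)
    step = int(round(fs / fs_out))                      # assumes fs % fs_out == 0
    return smoothed[::step]

def envelope_correlation(env_a, env_b):
    """Pearson r between two envelopes, trimmed to their common length."""
    n = min(env_a.size, env_b.size)
    return float(np.corrcoef(env_a[:n], env_b[:n])[0, 1])
```

For a time-forward versus time-reversed comparison, the second argument would simply be the envelope of the reversed waveform, for example `extract_envelope(x[::-1], fs)`.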

Modulation masking, determined by the similarity between the modulation spectrum of the target and masker sentence, is believed to play an essential role in speech recognition during modulated maskers such as speech (e.g., Stone et al., 2012; Stone and Moore, 2014). Therefore, correlations between the modulation spectra of the various conditions were also calculated for the same sentence comparisons as the above analysis. Modulation spectra were determined from the stimulus-extracted ENV by calculating the fast Fourier transform (FFT), normalizing amplitudes based on the DC, and summing into octave band modulation bins between 1 and 32 Hz. Correlations between sentences were determined based on the resulting DC-normalized modulation index calculated across the octave band modulation frequencies. Correlations were r = 0.95 between natural and ENV and r = 1.0 between time-forward and time-reversed versions of the same sentence. These results demonstrate significant similarity in modulation spectra for the different target and masker comparisons. Correlations between the target and competing stimulus pairs (i.e., different talker and message) were somewhat reduced, but averaged between 0.83 and 0.86. The smaller correlation coefficients were obtained for competitors of a different sex; similar magnitudes were obtained for time-forward and time-reversed stimuli.
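A minimal sketch of the modulation-spectrum computation is given below. It assumes that "octave band modulation bins between 1 and 32 Hz" means five bands with edges at 1, 2, 4, 8, 16, and 32 Hz; the text does not state the band edges, and the function name is hypothetical. Because the magnitude spectrum of a real signal is unchanged by time reversal, the sketch also shows why the time-forward and time-reversed versions of a sentence correlate at r = 1.0.

```python
import numpy as np

def modulation_index(env, fs_env=1000.0, lo_hz=1.0, hi_hz=32.0):
    """DC-normalised FFT magnitude of an envelope, summed into octave bands."""
    mag = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(env.size, d=1.0 / fs_env)
    mag = mag / mag[0]                                   # normalise by the DC term
    n_bands = int(round(np.log2(hi_hz / lo_hz)))
    edges = lo_hz * 2.0 ** np.arange(n_bands + 1)        # 1, 2, 4, ..., 32 Hz
    return np.array([mag[(freqs >= a) & (freqs < b)].sum()
                     for a, b in zip(edges[:-1], edges[1:])])

# Synthetic envelope with 4 Hz and 12 Hz modulation (illustrative only).
t = np.arange(2000) / 1000.0
env = 1.0 + 0.5 * np.sin(2 * np.pi * 4 * t) + 0.2 * np.sin(2 * np.pi * 12 * t)

forward = modulation_index(env)
backward = modulation_index(env[::-1])   # time-reversed copy
print(np.allclose(forward, backward))    # True
```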

2.5. Procedure

Participants were tested at individual computer terminals in a sound-attenuating booth. Stimuli were presented monaurally via Sennheiser HD 280 Pro headphones. Headphones were calibrated so that the average level of the target speaker was presented at 70 dB sound pressure level using a Larson Davis 800B sound level meter. The target-to-masker ratio was set to 0 dB. Participant responses were recorded using a custom MATLAB response interface that required participants to press a button corresponding to the color and number spoken by the target speaker. The target speaker was always identified according to the call sign “Baron.” Conditions were blocked and counterbalanced among the participants. All participants completed a familiarization block of eight trials without feedback immediately prior to each experimental condition.

3. Results and discussion

3.1. Accuracy analysis

Results were first analyzed according to listeners' accuracy in correctly identifying the number and color of the target sentence and are displayed in Fig. 1(a). Results were analyzed using a 2 (target: natural, ENV) × 2 (competitor: natural, ENV) × 2 (sex: same, different) × 2 (playback: forward, reversed) repeated-measures analysis of variance (ANOVA). Statistical results of the analysis are displayed in Table 1. Consistent with previous studies, listeners performed significantly better with time-reversed compared to time-forward stimuli as well as for different sex versus same sex conditions. There were also significant main effects for the processing of the target and competitor. In general, listeners performed better for the natural target talker and when the competitor was spectrally degraded.

Fig. 1.

Results across the experimental conditions. (a) Accuracy in rationalized arcsine units (RAU) for identifying the correct color and number of the target sentence. (b) Effect of masking (ENV-processing) or unmasking (natural-processing) the acoustic TFS in the target or competing talker. Positive difference scores indicate improvements in performance relative to the Nat-Nat condition for masking or the ENV-ENV condition for unmasking. DR = different sex, time-reversed; SR = same sex, time-reversed; DF = different sex, time-forward; SF = same sex, time-forward.
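The text does not say how the RAU scores in Fig. 1 were computed. The sketch below assumes the standard rationalized arcsine transform of Studebaker (1985), which is not cited in this paper, applied to the number of correct trials out of the 32 presented per condition.

```python
import math

def rau(n_correct, n_trials):
    """Rationalized arcsine units (assumed: Studebaker's 1985 formulation)."""
    theta = (math.asin(math.sqrt(n_correct / (n_trials + 1)))
             + math.asin(math.sqrt((n_correct + 1) / (n_trials + 1))))
    return (146.0 / math.pi) * theta - 23.0

# 16 of 32 correct (50%) maps to 50 RAU; the transform stretches the extremes.
print(round(rau(16, 32), 2))
```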

Table 1.

F values and effect sizes (η2) from the ANOVA conducted on the accuracy data. The degrees of freedom (df) = 1,12 for all F values in this table. Significant F values are marked; ns = not significant.

Term             F(1,12)    η2
Target (T)       47.9 a     0.80
Competitor (C)   27.0 a     0.69
Sex (S)          71.2 a     0.86
Playback (P)     105.9 a    0.90
T × C            21.0 a     0.64
T × S            30.7 a     0.72
C × S            5.0 b      0.29
T × C × S        18.6 a     0.61
T × P            ns
C × P            44.1 a     0.79
T × C × P        ns
S × P            ns
T × S × P        ns
C × S × P        ns
T × C × S × P    5.3 b      0.31

a p < 0.01.
b p < 0.05.
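The effect sizes in Table 1 are labeled η2, and the text does not give the formula used. They are numerically consistent with partial eta squared recovered from each F value and its degrees of freedom, η2p = F·df_effect / (F·df_effect + df_error). The sketch below checks this reading against the significant terms in the table; treating the values as partial eta squared is an assumption.

```python
def partial_eta_squared(f_value, df_effect=1, df_error=12):
    """Partial eta squared recovered from an F statistic and its df."""
    return f_value * df_effect / (f_value * df_effect + df_error)

# (F, reported effect size) for the significant terms in Table 1.
TABLE_1 = {
    "T": (47.9, 0.80), "C": (27.0, 0.69), "S": (71.2, 0.86), "P": (105.9, 0.90),
    "T x C": (21.0, 0.64), "T x S": (30.7, 0.72), "C x S": (5.0, 0.29),
    "T x C x S": (18.6, 0.61), "C x P": (44.1, 0.79), "T x C x S x P": (5.3, 0.31),
}

for term, (f_value, reported) in TABLE_1.items():
    print(f"{term:14s} computed={partial_eta_squared(f_value):.2f} reported={reported:.2f}")
```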

However, a three-way interaction with the temporal playback demonstrates that the spectral resolution of the competitor did not matter as much for time-reversed competitors, particularly when the target talker was also spectrally degraded. This may suggest that the spectral resolution of the competitor plays a more significant role when it occurs within a meaningful masker, possibly suggesting less of a role in segregating talkers. Instead, performance with time-reversed competitors may be mostly governed by properties such as modulation masking.

There was also a significant three-way interaction with talker sex. This interaction indicated that when the target is spectrally degraded, improved spectral resolution of the competing talker does not provide significant benefit in either same (p > 0.05) or different (p > 0.05) sex conditions. In contrast, when the target spectrum is preserved, listeners do better with a spectrally degraded competitor (p < 0.05), unless that competitor is also time-reversed (and therefore already near ceiling performance). In addition, for natural targets, talker sex only plays a role for natural competitors, as performance was similar for same and different sex conditions when the competitor was spectrally degraded. Interestingly, talker sex did play a significant role when both target and competitor talkers were spectrally degraded (p < 0.05), indicating that sex information was still retained following ENV processing as implemented here.

3.2. Contribution of the acoustic TFS

The contribution of the acoustic TFS to performance may be viewed by comparison of the Nat-ENV and ENV-Nat conditions to the baseline conditions in which both target and competitor signals were preserved (Nat-Nat) or were both degraded (ENV-ENV). Comparison to the natural case (Nat-Nat) determines how masking the acoustic TFS of the target (ENV-Nat) or competitor (Nat-ENV) affected intelligibility. Likewise, similar comparisons to the ENV-ENV case allow for determining how unmasking the acoustic TFS, of either the target or competitor, contributed to performance. These difference scores are displayed in Fig. 1(b). Here, positive scores indicate improvements in performance due to the masking or unmasking of the acoustic TFS. Results demonstrate that acoustic TFS information in the target generally improves performance and masking it decreases performance. In contrast, acoustic TFS information in the competitor decreases performance, and performance is improved when that information is masked. This general pattern is observed for both time-forward and time-reversed conditions. Interestingly, the presence of acoustic TFS in the target also appears to interact with the talker sex of the masker. Acoustic TFS is more important in the target when the competitor is of the same sex. In contrast, the competing acoustic TFS affects performance the most for time-forward maskers. These results indicate that for this study, the preservation of acoustic TFS in the target talker may facilitate source segregation from a spectrally degraded competitor. In contrast, acoustic TFS in the competitor resulted in poorer performance, potentially due to greater linguistic processing of the competing message. This possibility was explicitly investigated through an analysis of keyword intrusions from the competing talker, that is, participant responses that provided keywords spoken by the competing talker.

3.3. Intrusions from competing talker

In addition to correct identification of target keywords, we were also interested in confusions introduced by the competing talker. Therefore, we analyzed how often the listener's incorrect responses occurred from processing competing color-number information from the unattended talker. An intrusion was defined as an incorrect response corresponding to the selection of the color, number, or both from the competing stimulus. These intrusions have been viewed as an index of informational masking as opposed to random guessing due to inaudibility (i.e., energetic masking) from the competing signal (e.g., Lee and Humes, 2012) and likely also represent a failure to appropriately segregate and attend to the target talker. For this analysis, the proportion of intrusions was first calculated relative to the total incorrect responses. Only forward playback conditions were analyzed, as reversed competing sentences were unintelligible and therefore resulted in a negligible number of intrusions due to chance selection. Average results for these intrusion data are displayed in Fig. 2. The stacked bar graphs indicate the proportion of responses in which listeners (a) responded correctly to both the number and color spoken by the target talker, (b) responded by selecting one or two keywords from the competitor (i.e., intrusions), or (c) did not select a color or number spoken by either talker (i.e., errors). Combined, these response types account for all responses made by the participant.
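The response scoring described above can be sketched as follows. The function names and the example trials are hypothetical. The sketch also assumes that an incorrect response containing no competitor keyword counts as an error even if one target keyword was reported, a case the text does not spell out.

```python
def classify_response(response, target, competitor):
    """Label one trial; each argument is a (color, number) tuple."""
    if response == target:
        return "correct"
    color_from_competitor = response[0] == competitor[0]
    number_from_competitor = response[1] == competitor[1]
    if color_from_competitor or number_from_competitor:
        return "intrusion"
    return "error"  # assumption: covers every other incorrect response

def intrusion_proportion(trials):
    """Proportion of incorrect responses that are intrusions."""
    labels = [classify_response(*trial) for trial in trials]
    incorrect = [label for label in labels if label != "correct"]
    if not incorrect:
        return 0.0
    return incorrect.count("intrusion") / len(incorrect)

# Hypothetical trials: (response, target, competitor).
example_trials = [
    (("blue", 4), ("blue", 4), ("red", 7)),    # correct
    (("red", 4), ("blue", 4), ("red", 7)),     # intrusion (competitor color)
    (("red", 7), ("blue", 4), ("red", 7)),     # intrusion (both keywords)
    (("green", 2), ("blue", 4), ("red", 7)),   # error
]
print(intrusion_proportion(example_trials))
```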

Fig. 2.

(Color online) Proportion of responses that were correct, included a keyword from the competing talker (intrusion), or did not report any keywords from either the target or competing talker (errors). D = different sex, S = same sex.

A 2 target × 2 competitor × 2 talker sex repeated-measures ANOVA was conducted on the proportion of intrusions (green bars in Fig. 2). Results, provided in Table 2, demonstrate that more intrusions were observed for spectrally degraded targets, natural competitors, and competitors that matched the sex of the target talker. Significant interactions were explained by the Nat-ENV condition where talker sex did not matter due to the large proportion of correct responses. In addition, when ENV was the target, spectral resolution of the competitor did not provide a significant benefit for the same sex condition. Combined, these results suggest greater linguistic competition when the competitor preserves spectral information and that the effect of talker sex is determined, in part, by the availability of this spectral information. This occurred even though the ENV processing did preserve some periodicity cues.

Table 2.

F values and effect sizes (η2) from the ANOVA conducted on the intrusion data. The degrees of freedom (df) = 1,12 for all F values in this table. Significant F values are marked.

Term             F(1,12)    η2
Target (T)       11.8 a     0.50
Competitor (C)   25.7 a     0.68
Sex (S)          34.4 a     0.74
T × C            9.9 a      0.45
T × S            25.0 a     0.68
C × S            5.9 b      0.33
T × C × S        16.4 a     0.58

a p < 0.01.
b p < 0.05.

Total remaining errors after intrusions (red bars in Fig. 2) were also analyzed by a 2 target × 2 competitor × 2 talker sex repeated-measures ANOVA. Only a significant main effect of the target was observed [F(1,12) = 24.1, p < 0.001, η2 = 0.67], with better performance for natural targets. All other differences between conditions were accounted for by correct and intrusion responses. The average difference between the two types of targets was 6.6 percentage points.

4. Summary and conclusions

This study reports on speech intelligibility in the presence of a competing talker for low-context stimuli (i.e., CRM sentences). The effect of temporal and spectral properties of the target and competing sentence was examined by spectrally degrading the target and/or competing signal. Factors related to talker sex and linguistic interference were also examined by either matching or mismatching the sex between the target and competing talker or by time-reversing the competing sentence. An initial acoustic analysis demonstrated that the modulation spectra for the target and competing stimuli were highly correlated. Temporal envelopes were also significantly associated for time-forward, but not time-reversed stimuli. This indicates that temporal amplitude dips within the time-reversed stimulus could also introduce additional opportunities to “glimpse” the target signal during brief temporal intervals at improved target-to-masker ratios that were not available for time-forward competitors. Thus, the better performance for time-reversed competitors may be related, in part, to enhanced glimpsing opportunities in addition to a release from linguistic interference that occurs for time-forward speech.

Overall, results demonstrated significant performance patterns for when the target speech was preserved versus when it was limited to primarily temporal cues. Results are also consistent with the previous literature on the effects of talker sex (i.e., better performance with different sex competition) and time-reversal (e.g., Brungart, 2001; Brungart et al., 2001). While results for both natural and ENV conditions are largely consistent with the previous literature, comparison between these conditions is informative regarding the relative role of spectral cues present in the target and competing stimulus. These results suggest that target TFS information may facilitate source segregation, particularly for same sex conditions. However, competing TFS information may also increase linguistic interference from the competing message. This observation was supported by an analysis of intrusion errors from the competing sentence. Combined with the above evidence, this suggests that while TFS may enhance source segregation, it still appears to contribute to linguistic interference of segregated signals. This finding is significant regarding recent discussions related to the role of TFS for source segregation and linguistic processing (Apoux and Healy, 2013; Fogerty and Entwistle, 2015).

Acknowledgments

This work was supported, in part, by the South Carolina Honors College Exploration Scholars Program (J.X.) and by National Institutes of Health/National Institute on Deafness and Other Communication Disorders Grant No. R03-DC012506 (D.F.).

a) Portions of these data were presented at the 2013 American Speech-Language-Hearing Association Convention, Chicago, IL.

References and links

1. Apoux, F., and Healy, E. W. (2013). “A glimpsing account of the role of temporal fine structure information in speech recognition,” in Basic Aspects of Hearing (Springer, New York), pp. 119–126.
2. Bolia, R. S., Nelson, W. T., Ericson, M. A., and Simpson, B. D. (2000). “A speech corpus for multitalker communications research,” J. Acoust. Soc. Am. 107(2), 1065–1066. doi: 10.1121/1.428288
3. Brungart, D. S. (2001). “Informational and energetic masking effects in the perception of two simultaneous talkers,” J. Acoust. Soc. Am. 109(3), 1101–1109. doi: 10.1121/1.1345696
4. Brungart, D. S., Simpson, B. D., Ericson, M. A., and Scott, K. R. (2001). “Informational and energetic masking effects in the perception of multiple simultaneous talkers,” J. Acoust. Soc. Am. 110(5), 2527–2538. doi: 10.1121/1.1408946
5. Darwin, C. J., Brungart, D. S., and Simpson, B. D. (2003). “Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers,” J. Acoust. Soc. Am. 114, 2913–2922. doi: 10.1121/1.1616924
6. Fogerty, D. (2011). “Perceptual weighting of individual and concurrent cues for sentence intelligibility: Frequency, envelope, and fine structure,” J. Acoust. Soc. Am. 129(2), 977–988. doi: 10.1121/1.3531954
7. Fogerty, D., and Entwistle, J. L. (2015). “Level considerations for chimeric processing: Temporal envelope and fine structure contributions to speech intelligibility,” J. Acoust. Soc. Am. 138(5), EL459–EL464. doi: 10.1121/1.4935079
8. Lee, J. H., and Humes, L. E. (2012). “Effect of fundamental-frequency and sentence-onset differences on speech-identification performance of young and older adults in a competing-talker background,” J. Acoust. Soc. Am. 132(3), 1700–1717. doi: 10.1121/1.4740482
9. Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science 270(5234), 303–304. doi: 10.1126/science.270.5234.303
10. Smith, Z. M., Delgutte, B., and Oxenham, A. J. (2002). “Chimaeric sounds reveal dichotomies in auditory perception,” Nature 416(6876), 87–90. doi: 10.1038/416087a
11. Stone, M. A., Füllgrabe, C., and Moore, B. C. (2012). “Notionally steady background noise acts primarily as a modulation masker of speech,” J. Acoust. Soc. Am. 132(1), 317–326. doi: 10.1121/1.4725766
12. Stone, M. A., and Moore, B. C. (2014). “On the near non-existence of ‘pure’ energetic masking release for speech,” J. Acoust. Soc. Am. 135(4), 1967–1977. doi: 10.1121/1.4868392

Articles from The Journal of the Acoustical Society of America are provided here courtesy of Acoustical Society of America