The Journal of the Acoustical Society of America. 2018 Aug 22;144(2):EL131–EL137. doi: 10.1121/1.5051051

Effects of age and duration of deafness on Mandarin speech understanding in competing speech by normal-hearing and cochlear implant children

Duo-Duo Tao 1, Yang-Wenyi Liu 2,a), Ye Fei 1, John J Galvin III 3, Bing Chen 2,b), Qian-Jie Fu 4
PMCID: PMC6909997  PMID: 30180674

Abstract

Due to poor perception of fundamental frequency (F0) cues that are important for lexical tone perception and talker segregation, pediatric Chinese cochlear implant (CI) users may be especially susceptible to informational masking. Here, speech reception thresholds (SRTs) were measured in steady noise or competing speech in Mandarin-speaking CI and normal-hearing (NH) children. CI children were more susceptible to informational masking and were unable to use F0 cues to segregate talkers. SRTs were significantly correlated with chronological age in NH children and with duration of deafness in CI children, suggesting that auditory deprivation may limit developmental processes important for talker segregation.

1. Introduction

While cochlear implant (CI) users are often able to understand speech under ideal listening conditions, speech understanding in noise remains quite poor. Unlike normal-hearing (NH) listeners, CI users do not experience release from masking with fluctuating noise1,2 or competing speakers,3,4 relative to steady-state noise (SSN). Compared to SSN, which is thought to produce more “energetic” masking at the auditory periphery, competing speech produces both energetic and “informational” masking, in which competing talkers (despite differences in timing or pitch) interfere with each other at more central levels of auditory processing.5 Due to its dynamic temporal and spectral properties, speech is less predictable than SSN. However, these dynamics also allow NH listeners to perceive target speech in the temporal and/or spectral gaps of the masker speech. Due to the limited electric dynamic range, coarse spectral resolution, broad analysis filters, and interactions among the implanted electrodes, CI users are less able to take advantage of differences in the spectral-temporal properties to segregate competing talkers.

Unlike English (where voice pitch cues can convey talker characteristics, prosody, vocal emotion, etc.), Mandarin Chinese is a tonal language in which voice pitch cues convey linguistic meaning.6 Fundamental frequency (F0) is the primary cue for lexical tone perception, although listeners can also make use of duration and amplitude cues that co-vary with F0 to recognize lexical tones.6,7 Due to the coarse spectral resolution, F0 cues are poorly perceived by CI users, which limits understanding of pitch-mediated speech information such as voice gender, prosody, vocal emotion, and, for Mandarin-speaking CI users, lexical tones.8–10 Because of poor perception of F0 cues, Mandarin-speaking CI users depend more strongly on the co-varying amplitude and duration cues to recognize lexical tones.11 Since amplitude contour cues may be more negatively affected by the dynamics of competing speech than by SSN, Mandarin-speaking CI users may be highly susceptible to competing speech.

The ability to segregate target speech from interfering speech is important for children (e.g., paying attention to a teacher's voice when there is interfering speech in the classroom). Reference 20 evaluated informational masking of speech in children using a closed-set speech recognition paradigm (Coordinate Response Measure, or CRM). They found that performance was dominated by informational masking when the target and masker speech were the same sex, and that children were more susceptible to informational masking than were adults. They also found a significant correlation between age at testing and the effect of informational masking. However, when signals are spectro-temporally degraded as in the CI case, it is unclear how CI children will be affected by informational masking and how development may factor into talker segregation. For Mandarin-speaking CI children, poor perception of important F0 cues may pose further challenges in segregating target and masker speech. In this study, speech understanding for a target talker was measured in SSN or in the presence of a competing talker (of the same or a different gender than the target) in pediatric Mandarin-speaking CI and NH listeners. We hypothesized that CI children would be more susceptible to informational maskers than would NH children. We further hypothesized that age at testing (i.e., auditory development) would be associated with susceptibility to both energetic (SSN) and informational masking (competing speech) in both CI and NH children.

2. Methods

2.1. Subjects

Sixteen Mandarin-speaking Chinese CI users participated in the study (10 males and 6 females). The mean age at testing was 9.1 yrs (range = 7–14 yrs), the mean duration of deafness was 3.4 yrs (range = 1–8 yrs), and the mean CI experience was 5.7 yrs (range = 2–13 yrs). CI subject demographic information is shown in Table 1. Twelve NH children (5 males and 7 females; mean age = 10.3 yrs, range = 7–14 yrs) served as experimental controls for the CI children. Additionally, 10 NH adults (6 males and 4 females; mean age = 24.2 yrs, range = 23–26 yrs) served as another set of experimental controls for the NH children to observe potential age effects on energetic and informational masking. All NH subjects had pure tone thresholds <20 dB hearing level at all audiometric frequencies between 125 and 8000 Hz. In compliance with ethical standards for human subjects, written informed consent was obtained from all participants before proceeding with any of the study procedures. This study was approved by the Institutional Review Board in The First Affiliated Hospital of Soochow University, Suzhou, China.

Table 1.

CI subject demographic information.

Subject  Gender  Age at test (yrs)  Duration of deafness (yrs)  CI experience (yrs)  CI ear  Device
S1       F       7                  5                           2                    R       MED-EL
S2       F       7                  5                           2                    R       MED-EL
S3       M       7                  1                           6                    L       Cochlear
S4       M       7                  1                           6                    R       MED-EL
S5       M       7                  3                           4                    R       MED-EL
S6       M       7                  1                           6                    R       MED-EL
S7       M       7                  2                           5                    R       Cochlear
S8       F       8                  2                           6                    R       MED-EL
S9       M       8                  4                           4                    R       Cochlear
S10      F       9                  4                           5                    L       MED-EL
S11      F       10                 8                           2                    R       MED-EL
S12      M       10                 2                           8                    R       AB
S13      M       11                 8                           3                    R       AB
S14      F       13                 4                           9                    R       Cochlear
S15      M       14                 1                           13                   R       Cochlear
S16      M       14                 4                           10                   R       Cochlear
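
As a quick consistency check, the group summaries quoted in Sec. 2.1 can be recomputed from the values in Table 1. The sketch below is illustrative only; the tuple layout and the function name `summarize` are choices made here, not part of the study.

```python
from statistics import mean

# (subject, gender, age at test, duration of deafness, CI experience), all in years,
# transcribed from Table 1.
TABLE_1 = [
    ("S1", "F", 7, 5, 2), ("S2", "F", 7, 5, 2), ("S3", "M", 7, 1, 6),
    ("S4", "M", 7, 1, 6), ("S5", "M", 7, 3, 4), ("S6", "M", 7, 1, 6),
    ("S7", "M", 7, 2, 5), ("S8", "F", 8, 2, 6), ("S9", "M", 8, 4, 4),
    ("S10", "F", 9, 4, 5), ("S11", "F", 10, 8, 2), ("S12", "M", 10, 2, 8),
    ("S13", "M", 11, 8, 3), ("S14", "F", 13, 4, 9), ("S15", "M", 14, 1, 13),
    ("S16", "M", 14, 4, 10),
]

def summarize(rows):
    """Return group size, gender counts, and mean values for the CI group."""
    return {
        "n": len(rows),
        "males": sum(1 for r in rows if r[1] == "M"),
        "females": sum(1 for r in rows if r[1] == "F"),
        "mean_age": mean(r[2] for r in rows),
        "mean_dur_deaf": mean(r[3] for r in rows),
        "mean_ci_exp": mean(r[4] for r in rows),
    }

summary = summarize(TABLE_1)
print({k: round(v, 1) for k, v in summary.items()})
```

Rounded to one decimal place, the means agree with the 9.1, 3.4, and 5.7 yrs given in the text. In every row of the table, duration of deafness plus CI experience also equals age at test.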

2.2. CMS test materials

The Closed-set Mandarin Sentence (CMS; Ref. 12) test materials were used to test speech understanding with the different maskers. The CMS test materials consist of familiar words selected to represent the natural distribution of vowels, consonants, and lexical tones found in Mandarin Chinese. Ten key words in each of five categories (Name, Verb, Number, Color, and Fruit) were produced by a native Mandarin talker, resulting in a total of 50 words that can be combined to produce 100 000 unique sentences.

2.3. Test conditions and procedures

Speech recognition for a target talker was measured in the presence of SSN or a competing male or female talker. The target sentences were produced by a single male talker (mean F0 across all CMS stimuli = 136 Hz). The SSN masker was white noise that was bandpass filtered to match the average spectrum (across all CMS words) of the target speech. Speech maskers were produced by a male talker (mean F0 = 178 Hz; different from the male target talker) or a female talker (mean F0 = 246 Hz).

All listening conditions were similar to CRM tests5,12 using CMS sentence materials. Like CRM tests, two keywords (randomly selected from the Number and Color categories) were embedded in a carrier phrase uttered by the male target talker. The first word in the target phrase was always the Name “Xiaowang,” followed by randomly selected words from the remaining categories. Subjects were instructed to pick the correct words for the target talker from only the Number and Color categories; no selections could be made from the remaining categories, which were grayed out. For the Male and Female maskers, the first word was randomly selected from the Name category, excluding Xiaowang. Thus, the target phrase could be (keywords in bold type) “Xiaowang Sold Four Blue Strawberries,” “Xiaowang Chose One Green Banana,” etc., while the masker could be “Xiaozhang Sold Three Red Kumquats,” “Xiaodeng Took Six White Papayas,” etc. Figure 1 shows spectrograms for a male target sentence (top) and female masker sentence (bottom), along with the waveforms mixed at 0 dB signal-to-noise ratio (SNR) for the target (blue) and masker (red); the stimuli can be heard in Mm1. Figure 1 shows clear spectral and temporal differences that might be used to segregate the target and masker.
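
For readers who want to build a mixture like the 0 dB SNR example described above, the following is a minimal sketch of combining a target and a masker at a chosen SNR. It assumes that SNR is defined on the RMS levels of the two whole waveforms and that the masker, not the target, is rescaled; the source specifies neither detail, and the function names `rms` and `mix_at_snr` are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def mix_at_snr(target, masker, snr_db):
    """Scale the masker so that 20*log10(rms(target)/rms(masker)) equals snr_db,
    then add it to the target. Returns (mixture, scaled_masker)."""
    if len(target) != len(masker):
        raise ValueError("target and masker must have the same length")
    masker_rms = rms(masker)
    if masker_rms == 0:
        raise ValueError("masker is silent")
    # Assumption: the target level is held fixed and only the masker gain changes.
    gain = rms(target) / (masker_rms * 10 ** (snr_db / 20))
    scaled = [gain * m for m in masker]
    mixture = [t + m for t, m in zip(target, scaled)]
    return mixture, scaled

# Toy signals standing in for the recorded sentences.
target = [math.sin(0.10 * i) for i in range(1000)]
masker = [0.3 * math.sin(0.37 * i + 1.0) for i in range(1000)]
mixture, scaled = mix_at_snr(target, masker, snr_db=0.0)
print(round(20 * math.log10(rms(target) / rms(scaled)), 6))
```

Because the study aligned masker onset and offset with the target, the equal-length requirement in this sketch mirrors the stimuli; a real implementation would also need to handle calibration of the overall presentation level.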

Fig. 1.

(Color online) Spectrograms and waveforms for example target and masker stimuli. The top spectrogram is for the target sentence produced by a male talker. The bottom spectrogram is for the masker sentence produced by a female talker. The middle panel shows the target (blue) and masker (red) sentences at 0 dB SNR. The target prompt name (Xiaowang) is underlined and the ovals show the keywords (capitalized and underlined) for the target.

Mm. 1.

The CMS stimuli sample. The stimuli shown in Fig. 1 are presented here. First, the male target sentence is presented, followed by the female masker sentence, followed by the two sentences combined at 0 dB SNR. This is a file of type “wav” (2.45 Mb).

DOI: 10.1121/1.5051051.1

Speech reception thresholds (SRTs) for CMS sentences in noise, defined as the SNR that produced 50% correct keyword recognition,13,14 were adaptively measured using a closed-set test paradigm. Three masker conditions were tested: (1) SRTs in SSN, (2) SRTs with a male target and male masker (M + M), and (3) SRTs with a male target and a female masker (M + F). The onset and offset of the masker were always the same as those of the target (i.e., small time corrections were made to the duration of each sentence). Target sentences were randomly generated from among the ten words in each of the Verb, Number, Color, and Fruit categories, but always used Xiaowang from the Name category. Masker sentences were randomly generated from among the Name (excluding Xiaowang), Verb, Number (excluding the target keyword), Color (excluding the target keyword), and Fruit categories. A total of 20 sentences were presented in each test run, with no keyword presented more than twice during the run.
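
The sentence-generation constraints described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' software: words are represented by indices 0 to 9, index 0 of the Name category stands in for Xiaowang, and the "no more than twice" rule is read as applying to the target keywords, which for 20 trials and ten words means each keyword occurs exactly twice. All names (`make_run`, `keyword_schedule`, `pick_excluding`) are hypothetical.

```python
import random

N_WORDS = 10      # words per category
N_TRIALS = 20     # sentences per test run
TARGET_NAME = 0   # assumed index standing in for "Xiaowang"

def keyword_schedule(rng):
    """20 keyword indices in random order, each of the 10 words used exactly twice."""
    pool = list(range(N_WORDS)) * (N_TRIALS // N_WORDS)
    rng.shuffle(pool)
    return pool

def pick_excluding(rng, excluded):
    """Random word index in 0..9 other than `excluded`."""
    return rng.choice([i for i in range(N_WORDS) if i != excluded])

def make_run(rng):
    """Return a list of (target, masker) sentences, each a dict of word indices."""
    run = []
    for number, color in zip(keyword_schedule(rng), keyword_schedule(rng)):
        target = {
            "Name": TARGET_NAME,               # target always starts with Xiaowang
            "Verb": rng.randrange(N_WORDS),
            "Number": number,                  # keyword
            "Color": color,                    # keyword
            "Fruit": rng.randrange(N_WORDS),
        }
        masker = {
            "Name": pick_excluding(rng, TARGET_NAME),   # never Xiaowang
            "Verb": rng.randrange(N_WORDS),
            "Number": pick_excluding(rng, number),      # never the target keyword
            "Color": pick_excluding(rng, color),        # never the target keyword
            "Fruit": rng.randrange(N_WORDS),
        }
        run.append((target, masker))
    return run

run = make_run(random.Random(0))
print(len(run), "trials;", N_WORDS ** 5, "possible five-word sentences")
```

Drawing the keywords from a shuffled pool, rather than sampling independently on each trial, is one simple way to guarantee the repetition limit without rejection sampling.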

All CI subjects were tested with their clinical processors and settings; these were not changed during the course of testing. For all subjects, stimuli were presented in the sound field at 65 dBA via a single loudspeaker; subjects were seated directly facing the loudspeaker at a distance of 1 m. During each test trial, a sentence was presented at the target SNR; the initial SNR was 10 dB. The subject clicked on one of the ten responses for each of the Number and Color categories. If the subject correctly identified both keywords, the SNR was reduced by 4 dB (step size). If the subject did not correctly identify both keywords, the SNR was increased by 4 dB. After two reversals, the step size was reduced to 2 dB for both correct and incorrect responses. The SRT was calculated by averaging the SNRs at the last six reversals. If there were fewer than six reversals within 20 trials, the test run was discarded and another run was measured. Two test runs were completed for each condition and the SRT was averaged across runs.
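
The adaptive rule described above is a one-down/one-up staircase, which converges on the 50% point. A minimal sketch follows. Two details are assumptions, since the text does not pin them down: a reversal is logged at the SNR of the trial on which the response changed direction, and the 2 dB step takes effect immediately after the second reversal. The `respond` callback is a hypothetical stand-in for the listener.

```python
def run_staircase(respond, n_trials=20, start_snr=10.0,
                  big_step=4.0, small_step=2.0, n_avg=6):
    """Run one adaptive track. `respond(snr)` returns True when both keywords
    are identified correctly. Returns the SRT in dB, or None when the run has
    fewer than `n_avg` reversals and would be discarded."""
    snr = start_snr
    step = big_step
    last_direction = None
    reversal_snrs = []
    for _ in range(n_trials):
        correct = respond(snr)
        direction = -1 if correct else +1   # correct: harder (lower SNR)
        if last_direction is not None and direction != last_direction:
            reversal_snrs.append(snr)       # assumed definition of a reversal point
            if len(reversal_snrs) == 2:
                step = small_step           # 4 dB -> 2 dB after two reversals
        last_direction = direction
        snr += direction * step
    if len(reversal_snrs) < n_avg:
        return None
    return sum(reversal_snrs[-n_avg:]) / n_avg

def srt_for_condition(respond, n_runs=2, max_attempts=10):
    """Average the SRT over `n_runs` valid runs, repeating discarded runs.
    `max_attempts` is a safety limit added for this sketch only."""
    srts = []
    for _ in range(max_attempts):
        srt = run_staircase(respond)
        if srt is not None:
            srts.append(srt)
        if len(srts) == n_runs:
            return sum(srts) / n_runs
    return None

# Idealized listener who is correct whenever the SNR is at or above -3 dB.
print(srt_for_condition(lambda snr: snr >= -3.0))
```

With the idealized listener, the track descends in 4 dB steps, then oscillates between -2 and -4 dB in 2 dB steps, so the estimate settles midway between them. Real listeners respond probabilistically, which is why the procedure averages several reversals and two runs.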

3. Results

Figure 2 shows boxplots of SRTs for NH adults, NH children, and CI children for the different test conditions. A split-plot repeated-measures analysis of variance (RM ANOVA) was used to compare NH adult and NH child data, with test condition (SSN, M + M, M + F) as the within-subject factor and subject group (NH adult, NH child) as the between-subject factor. Results showed significant effects for test condition [F(2,40) = 51.078, p < 0.001; η2 = 0.719] and subject group [F(1,20) = 12.827, p = 0.002; η2 = 0.391]; there was no significant interaction [F(2,40) = 2.963, p = 0.063; η2 = 0.129]. Post hoc Bonferroni pairwise comparisons showed that SRTs were significantly better for M + F than for M + M and SSN (p < 0.001 in both cases), with no significant difference between M + M and SSN (p > 0.05). Post hoc Bonferroni pairwise comparisons showed that SRTs were significantly better for NH adults than for NH children (p < 0.05).

Fig. 2.

(Color online) Boxplots of SRTs for NH adults, NH children, and CI subjects for the different masker conditions. The boxes show the 25th and 75th percentiles, the error bars show the 5th and 95th percentiles, the symbols show outliers, the solid horizontal line shows the median, and the dashed horizontal line shows the mean.

A split-plot RM ANOVA was used to compare NH child and CI child data, with test condition (SSN, M + M, M + F) as the within-subject factor and subject group (NH child, CI child) as the between-subject factor. Results showed significant effects for test condition [F(2,52) = 29.299, p < 0.001; η2 = 0.530] and subject group [F(1,26) = 317.433, p < 0.001; η2 = 0.924]; there was a significant interaction [F(2,52) = 53.69, p < 0.001; η2 = 0.674]. For NH children, post hoc Bonferroni pairwise comparisons showed that SRTs were significantly better for M + F than for SSN and M + M (p < 0.001 in both cases), with no significant difference between M + M and SSN (p > 0.05). For CI children, SRTs were significantly better for SSN than for M + M and M + F (p < 0.001 in both cases) with no significant difference between M + M and M + F (p > 0.05). SRTs were significantly better for NH than for CI children for all conditions (p < 0.001 in all cases).
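
The degrees of freedom and effect sizes in the two ANOVAs can be cross-checked from the group sizes and F statistics alone. The sketch below assumes the reported η2 values are partial eta squared; the text does not say which variant was computed, but the values are consistent with that reading. Function names are illustrative.

```python
def split_plot_dfs(n_subjects, n_groups, n_conditions):
    """(effect df, error df) for each term of a two-way split-plot design."""
    df_subjects = n_subjects - n_groups
    df_condition = n_conditions - 1
    df_group = n_groups - 1
    return {
        "condition": (df_condition, df_condition * df_subjects),
        "group": (df_group, df_subjects),
        "interaction": (df_condition * df_group, df_condition * df_subjects),
    }

def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its df."""
    return f_value * df_effect / (f_value * df_effect + df_error)

# NH adults (n = 10) vs NH children (n = 12), three masker conditions.
print(split_plot_dfs(22, 2, 3))
print(round(partial_eta_squared(51.078, 2, 40), 3))   # test condition
print(round(partial_eta_squared(12.827, 1, 20), 3))   # subject group

# NH children (n = 12) vs CI children (n = 16), three masker conditions.
print(split_plot_dfs(28, 2, 3))
print(round(partial_eta_squared(317.433, 1, 26), 3))  # subject group
```

The same function applied to the remaining F values gives the other effect sizes quoted in the text to three decimal places.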

Age at testing was compared to SRTs for all testing conditions in NH and CI pediatric subjects. Significant correlations were observed for NH children only for M + M (r = −0.67, p = 0.017) and M + F (r = −0.88, p < 0.001). There were no significant correlations for any of the testing conditions for CI subjects. For CI subjects, duration of deafness and CI experience were compared to SRT data. A significant correlation was observed between duration of deafness and SRTs only for M + F (r = 0.66, p = 0.005); there were no significant correlations for the remaining conditions. There were no significant correlations between CI experience and SRTs for any of the conditions.
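
The significance of a correlation coefficient follows from the sample size through a t statistic with n − 2 degrees of freedom. The sketch below illustrates this for the coefficients reported above, assuming Pearson correlations, two-tailed tests, and sample sizes equal to the full groups (12 NH children, 16 CI children); none of these details is stated explicitly in the text. The p value is obtained by numerically integrating the t density with Simpson's rule so that only the standard library is needed.

```python
import math

def t_pdf(x, df):
    """Probability density of Student's t distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p value: 1 minus the area between -|t| and +|t|.
    `steps` must be even for Simpson's rule."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_area = total * h / 3          # area from 0 to |t|
    return max(0.0, 1 - 2 * half_area)

def correlation_test(r, n):
    """Return (t, p) for a correlation coefficient r from n paired observations."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r * r))
    return t, two_tailed_p(t, df)

for label, r, n in [("NH, M + M", -0.67, 12),
                    ("NH, M + F", -0.88, 12),
                    ("CI, duration of deafness vs M + F", 0.66, 16)]:
    t, p = correlation_test(r, n)
    print(f"{label}: t({n - 2}) = {t:.2f}, p = {p:.4f}")
```

Under these assumptions the computed p values are close to the reported ones (approximately 0.017 for r = −0.67 with n = 12, and approximately 0.005 for r = 0.66 with n = 16); small differences are expected because r is rounded to two decimals.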

4. Discussion

In this study, understanding of Mandarin Chinese in noise or competing speech was significantly poorer for CI children than for NH children, consistent with previous English1,2 and Chinese studies15 with adult CI users. CI performance was poorer with the speech maskers while NH performance was better with the speech maskers, relative to SSN. Performance was significantly better for NH adults than for NH children for all maskers.

Consistent with our first hypothesis, CI children were more susceptible to informational masking than were NH children. Unlike NH children, CI children experienced no release from masking with the dynamic speech maskers compared to SSN, consistent with many previous CI studies with English-speaking adult listeners.1–4 For NH children, SRTs with the male target were significantly better with the competing female talker than with the competing male talker. CI children were unable to use talker gender differences to segregate competing speech. This finding is consistent with Ref. 16, which found that adult Chinese CI users were unable to use F0 differences to segregate concurrent Mandarin vowels.

However, the present data are not consistent with some previous studies (e.g., Refs. 4 and 17), which found that adult, English-speaking CI listeners were able to use target-masker voice gender pitch differences to segregate competing talkers. The importance of F0 cues to Mandarin speech perception may partly explain the inability of the present Mandarin-speaking CI children to use voice pitch differences in the competing speech conditions. Because amplitude and duration cues provide lexical tone information11 when F0 information is not available (limiting the ability to segregate talkers), these lexically important cues may be less accessible for Mandarin-speaking than for English-speaking CI users in competing speech, regardless of the voice pitch difference between target and masker speech. As such, Mandarin-speaking CI users may be more susceptible to informational masking and unable to use voice pitch differences to segregate talkers.

There was a mixed result for our second hypothesis. Age at testing was correlated with SRTs only for NH children, and only for the informational masking conditions (M + M, M + F). This result is consistent with Ref. 20, which found a significant correlation between age at testing and the effect of informational masking in pediatric listeners. There was no correlation between age at testing and SRTs in any of the masker conditions for CI children. This suggests that development may contribute to susceptibility to informational masking when F0 information is available, as in the NH case. NH adults performed significantly better than NH children in all testing conditions, with much better performance for the competing speech conditions. Thus, beyond the significant correlations between SRTs and age at testing observed in NH children, immature auditory development appeared to be a limiting factor in overall susceptibility to informational masking and the ability to use F0 cues to segregate competing talkers. For CI children, duration of deafness was correlated with SRTs only for the M + F condition. This suggests that extended auditory deprivation may negatively affect segregating talkers even with large F0 differences. Taken together, these results suggest that the ability to segregate talkers may improve with age but worsen with spectral degradation and increased duration of deafness.

The mean difference in SRTs between NH children and CI children was 6.80, 11.75, and 19.66 dB for the SSN, M + M, and M + F conditions, respectively. It is difficult to know how central processing might have differed across NH and CI children for the speech maskers, which represent some combination of energetic and informational masking. In terms of energetic masking, dips in the spectral and temporal envelopes allow for “glimpses” of the target speech. As such, there was less energetic masking with speech than with SSN. However, given the coarse spectral resolution and other signal processing components (e.g., compressive amplitude mapping, automatic gain control, channel interaction, etc.), CI children may have experienced more energetic masking with speech than did NH children. References 18 and 19 suggested that limited spectral resolution (e.g., due to broadened cochlear filters, current spread, etc.) may effectively flatten the modulation spectrum (i.e., sensitivity to amplitude fluctuations across modulation rates) and smooth temporal envelope cues. If glimpses in the spectral and temporal envelope were reduced for CI users, informational masker differences (e.g., voice pitch, lexical cues) may have been less available.

5. Summary and conclusions

Understanding of Mandarin Chinese in the presence of steady noise (energetic masking) or competing speech (energetic + informational masking) was measured in Mandarin-speaking pediatric CI and NH listeners, and in adult NH listeners. Major findings include:

(1) For all maskers, performance was significantly poorer for CI children than for NH children. CI children were also more susceptible to informational masking.

(2) CI children were unable to benefit from voice pitch differences between the target and masker speech.

(3) For NH children, there was a significant correlation between age at testing and SRTs for speech maskers, regardless of the F0 difference between target and masker talkers. There was no such correlation in CI children, suggesting that the coarse spectro-temporal resolution may have been a more limiting factor than age at testing in terms of informational masking.

(4) Duration of deafness in CI children was significantly correlated with SRTs for competing speech when the target and masker talker gender was different. Thus, early implantation may allow for better segregation of competing talkers.

Acknowledgments

We thank the subjects for their participation in this study. We thank two anonymous reviewers for helpful comments. This work was supported by the National Institutes of Health (Grant No. R01-004792); the National Natural Science Foundation of China (Grant Nos. 81600796, 81570914, and 81371087); the Natural Science Foundation of Jiangsu Province for Young Scholar (Grant No. 20160344); the Natural Science Foundation of Jiangsu Province for Colleges and Universities (Grant No. 16KJB320011); the Key Medical Discipline Program of Suzhou (Grant No. SZXK201503); and the China Scholarship Council (Grant No. 201706100146).

References and links

1. Nelson P. B., Jin S. H., Carney A. E., and Nelson D. A., “Understanding speech in modulated interference: Cochlear implant users and normal-hearing listeners,” J. Acoust. Soc. Am. 113, 961–968 (2003). doi: 10.1121/1.1531983
2. Fu Q.-J. and Nogaki G., “Noise susceptibility of cochlear implant users: The role of spectral resolution and smearing,” J. Assoc. Res. Otolaryngol. 6, 19–27 (2005). doi: 10.1007/s10162-004-5024-3
3. Stickney G. S., Zeng F. G., Litovsky R., and Assmann P., “Cochlear implant speech recognition with speech maskers,” J. Acoust. Soc. Am. 116, 1081–1091 (2004). doi: 10.1121/1.1772399
4. Cullington H. E. and Zeng F. G., “Speech recognition with varying numbers and types of competing talkers by normal-hearing, cochlear-implant, and implant simulation subjects,” J. Acoust. Soc. Am. 123, 450–461 (2008). doi: 10.1121/1.2805617
5. Brungart D. S., “Informational and energetic masking effects in the perception of two simultaneous talkers,” J. Acoust. Soc. Am. 109, 1101–1109 (2001). doi: 10.1121/1.1345696
6. Liang Z. A., “The auditory perception of Mandarin Tones,” Acta. Phys. Sin. 26, 85–91 (1963).
7. Lin M. C., “The acoustic characteristics and perceptual cues of tones in Standard Chinese,” Chinese Yuwen 204, 182–193 (1988).
8. Fu Q.-J., Chinchilla S., Nogaki G., and Galvin J. J. III, “Voice gender identification by cochlear implant users: The role of spectral and temporal resolution,” J. Acoust. Soc. Am. 118, 1711–1718 (2005). doi: 10.1121/1.1985024
9. Chatterjee M. and Peng S. C., “Processing F0 with cochlear implants: Modulation frequency discrimination and speech intonation recognition,” Hear. Res. 235, 143–156 (2008). doi: 10.1016/j.heares.2007.11.004
10. Luo X., Fu Q.-J., and Galvin J. J. III, “Vocal emotion recognition by normal-hearing listeners and cochlear implant users,” Trends Amplif. 11, 301–315 (2007). doi: 10.1177/1084713807305301
11. Fu Q.-J. and Zeng F. G., “Identification of temporal envelope cues in Chinese tone recognition,” Asia. Pac. J. Speech, Lang. Hear. 5, 45–57 (2005).
12. Tao D.-D., Fu Q.-J., Galvin J. J. III, and Yu Y.-F., “The development and validation of the Closed-set Mandarin Sentence (CMS) test,” Speech Commun. 92, 125–131 (2017). doi: 10.1016/j.specom.2017.06.008
13. Plomp R. and Mimpen A. M., “Speech-reception threshold for sentences as a function of age and noise level,” J. Acoust. Soc. Am. 66, 1333–1342 (1979). doi: 10.1121/1.383554
14. Bolia R. S., Nelson W. T., Ericson M. A., and Simpson B. D., “A speech corpus for multitalker communications research,” J. Acoust. Soc. Am. 107, 1065–1066 (2000). doi: 10.1121/1.428288
15. Liu Y.-W., Tao D.-D., Jiang Y., Galvin J. J. III, Fu Q.-J., Yuan Y.-S., and Chen B., “Effect of spatial separation and noise type on sentence recognition by Mandarin-speaking cochlear implant users,” Acta Otolaryngol. 137, 829–836 (2017). doi: 10.1080/00016489.2017.1292050
16. Luo X., Fu Q.-J., Wu H. P., and Hsu C. J., “Concurrent-vowel and tone recognition by Mandarin-speaking cochlear implant users,” Hear. Res. 256, 75–84 (2009). doi: 10.1016/j.heares.2009.07.001
17. Visram A. S., Kluk K., and McKay C. M., “Voice gender differences and separation of simultaneous talkers in cochlear implant users with residual hearing,” J. Acoust. Soc. Am. 132, EL135–EL141 (2012). doi: 10.1121/1.4737137
18. Oxenham A. J. and Kreft H. A., “Speech perception in tones and noise via cochlear implants reveals influence of spectral resolution on temporal processing,” Trends Hear. 18, 1–14 (2014).
19. Oxenham A. J. and Kreft H. A., “Speech masking in normal and impaired hearing: Interactions between frequency selectivity and inherent temporal fluctuations in noise,” Adv. Exp. Med. Biol. 894, 125–132 (2016). doi: 10.1007/978-3-319-25474-6
20. Wightman F. L. and Kistler D. J., “Informational masking of speech in children: Effects of ipsilateral and contralateral distracters,” J. Acoust. Soc. Am. 118, 3164–3176 (2005). doi: 10.1121/1.2082567

Articles from The Journal of the Acoustical Society of America are provided here courtesy of Acoustical Society of America
