Abstract
For bimodal cochlear implant users, acoustic and electric hearing have been shown to contribute differently to speech and music perception. However, differences in test paradigms and stimuli between speech and music testing can make it difficult to assess the relative contributions of each device. To address these concerns, the Sung Speech Corpus (SSC) was created. The SSC contains 50 monosyllable words sung over an octave range and can be used to test both speech and music perception using the same stimuli. Here, SSC data are presented for normal-hearing listeners, and any advantage of musicianship is examined.
1. Introduction
For cochlear implant (CI) users with residual acoustic hearing, combined use of the CI and a hearing aid (HA) in the contralateral ear has been shown to improve speech and music perception (Kong et al., 2005; Dorman et al., 2008; Crew et al., 2015). The benefits of bimodal listening are somewhat variable, often with listeners attending to the "better ear" for different speech and music perception tasks (e.g., Crew et al., 2015). In previous bimodal studies, very different stimuli (e.g., speech versus piano notes) and perceptual tests (e.g., speech understanding in noise versus melodic contour identification) have been used to evaluate bimodal performance. The results indicate that CIs convey speech information quite well, while HAs convey pitch information quite well. As such, a subject is likely to focus on a single device during a particular task because the relevant cue is better represented by that device, making it difficult to observe a bimodal benefit. It seems preferable to combine varying musical pitch and speech information within a stimulus such that both devices are needed to perform the task. This may allow better observation of the contributions of, and interactions between, acoustic and electric hearing.
To address these concerns, we have created a database of sung monosyllable words that contain both musical pitch and speech information, called the Sung Speech Corpus (SSC). The SSC allows sentence recognition to be measured with or without variations in pitch cues (fundamental frequency, or F0), and melodic pitch perception to be measured with or without variations in word or timbre cues, both using the same stimuli. Thus, the SSC may help elucidate the contributions of pitch and timbre to speech and music perception. The SSC may be especially useful for evaluating speech and music perception in unilateral and bilateral CI users, bimodal listeners, and normal hearing (NH) listeners with pitch processing deficits. The size of the database (100,000 possible sentences and 27 possible contours, i.e., nine contour shapes at three semitone spacings) makes the SSC useful for comparing performance across numerous experimental conditions. In this paper, we present speech and music perception results with the SSC in adult NH listeners. Such data are important for future comparison to performance in hearing-impaired listeners (e.g., HA users, CI users, bimodal listeners, etc.).
Long-term musical experience has been shown to improve both speech and music perception (Peretz et al., 2003; Kraus et al., 2009; Parbery-Clark et al., 2009), possibly because musicians are able to extract and track pitch in complex listening environments. This “musician effect” has not been consistently observed across studies (Ruggles et al., 2014) and may depend on the listening task (Fuller et al., 2014). The SSC contains acoustically complex stimuli in terms of pitch and timbre cues that may be weighted differently depending on the listening task (speech versus music perception); it is even possible that pitch and timbre cues may not be optimally integrated in some listeners. As such, we hypothesized that an advantage may be observed for musicians as the stimuli became more complex (e.g., melodic pitch perception with varying timbre cues, sentence recognition with varying pitch cues).
2. Methods
2.1. Subjects
Sixteen NH subjects participated in the study. All subjects had pure-tone thresholds of 20 dB HL or better at all audiometric frequencies between 125 and 4000 Hz. Subjects were divided into two categories of eight subjects each: musicians (mean age, 30.5 years; age range, 24–47 years) and non-musicians (mean age, 27.8 years; age range, 24–30 years). Musicians were defined as regularly playing a musical instrument at the time of recruitment. Non-musicians were defined as never having had any formal musical training and never having informally learned to play an instrument (e.g., guitar lessons, singing in a choir). Potential subjects who had some music training but did not meet the musician criteria were excluded, as they had too much training to be considered non-musicians but played too infrequently to be considered musicians for this study.
2.2. SSC
The SSC consists of 50 sung monosyllable words produced by a single adult male that can be used to create a simple sentence with the following syntax: "name" "verb" "number" "color" "clothing" (e.g., "Bob sells three blue ties"). Each of the five categories contains ten words, and each word was sung at all 13 pitches from A2 (110 Hz) to A3 (220 Hz) in discrete semitone steps. As such, a five-word sentence can be constructed to contain a five-note melody, allowing sentence recognition and melodic contour identification (MCI) to be measured using the same stimuli. Natural speech utterances were also produced for each word to allow comparison between naturally produced and sung speech. All stimuli were 500 ms in duration. Minimal adjustments were made to the stimuli after recording to achieve the exact target F0, amplitude, and duration. Figure 1 shows the response screen for the sentence recognition test (left panel) and the MCI test (right panel).
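The 13 pitches follow the standard equal-tempered relation, in which each semitone step multiplies F0 by 2^(1/12); twelve steps above A2 (110 Hz) is exactly A3 (220 Hz). A minimal sketch of this computation:

```python
def semitone_f0(base_f0: float, steps: int) -> float:
    """F0 (Hz) of a note `steps` equal-tempered semitones above base_f0."""
    return base_f0 * 2 ** (steps / 12)

# The 13 pitches from A2 (110 Hz) up to A3 (220 Hz) in discrete semitone steps:
pitches = [semitone_f0(110.0, k) for k in range(13)]
```

Note that the octave endpoint is exact (110 × 2^(12/12) = 220 Hz), which is why the one-octave range yields exactly 13 semitone-spaced pitches.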
Fig. 1.

(Color online) Response screens for sentence recognition (left panel) and MCI (right panel). There are five categories with ten words each for the sentence recognition test; there are nine possible contours for MCI.
2.3. Test procedures and conditions
Sentence recognition was measured using a closed-set matrix procedure, similar to other matrix sentence testing studies (Rader et al., 2015). To create a test sentence, a word was randomly selected from each category. Depending on the test condition, the F0 for each word was selected to create a target pitch contour. For the "flat contour" condition, the F0 was the same across all words. For the "fixed contour" condition, one of four dynamic contours (rising, rising-falling, falling-rising, and falling) was used for all sentences during testing. For the "random contour" condition, all nine possible contours were presented during testing. Sentences were also tested using naturally produced speech ("spoken"). During testing, a test sentence was presented to the subject, who responded by clicking on the word within each category that best matched the word presented (left panel of Fig. 1). Subjects were allowed to repeat the sentence up to three times. Performance was scored based on complete sentence recognition. Each run of the sentence recognition test took approximately 6–8 min to complete. Audio demo Mm. 1 presents example stimuli for each of the four sentence test conditions.
Mm. 1.
Audio examples of the sentence test stimuli for the spoken, flat contour, fixed contour, and random contour conditions. This is a file of type “wav” (5355 kB).
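The matrix sentence generation described above can be sketched briefly. The word lists here are largely hypothetical placeholders: only the words from the paper's example sentences are real SSC entries, and the actual corpus contains ten words per category (hence 10^5 = 100,000 possible sentences).

```python
import random

# Hypothetical word lists (only the example words from the text are real SSC
# entries; the actual corpus has ten words per category).
CATEGORIES = {
    "name":     ["Bob", "John"],
    "verb":     ["sells", "wants"],
    "number":   ["three", "five"],
    "color":    ["blue", "brown"],
    "clothing": ["ties", "shoes"],
}

def random_sentence() -> list:
    """Draw one word per category, preserving the fixed matrix syntax."""
    return [random.choice(words) for words in CATEGORIES.values()]

print(" ".join(random_sentence()))  # e.g., "Bob wants five blue shoes"
```

Because a word is drawn independently from each category, every trial is syntactically well-formed but semantically unpredictable, which is what makes the closed-set matrix test resistant to contextual guessing.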
MCI was also measured using the SSC stimuli, following the methods described in Galvin et al. (2007). The F0 spacing between notes in the contour was varied between one and three semitones. For the "fixed word" condition, 1 of the 50 words was randomly chosen, and this word was used for all notes in the contour (e.g., "Bob Bob Bob Bob Bob"). For the "fixed sentence" condition, one word from each category was randomly chosen to construct a single sentence (e.g., "Bob sells three blue ties") that was used for all contours during MCI testing. For the "random sentence" condition, words were randomly chosen from each category to create a different sentence for each contour (e.g., "Bob sells three blue ties," "John wants five brown shoes," etc.). As a control condition, MCI was also measured with the MIDI piano sample used in Galvin et al. (2008) and Crew et al. (2015). During testing, a contour was presented to the subject, who responded by clicking on one of the nine response boxes shown onscreen (right panel of Fig. 1). Subjects were allowed to repeat the contour up to three times. MCI performance was scored in terms of overall percent correct, as well as percent correct for each semitone spacing condition. Each run of the MCI test took approximately 4–5 min to complete. Audio demo Mm. 2 presents example stimuli in pairs for each of the four MCI test conditions: piano, fixed word, fixed sentence, and random sentence.
Mm. 2.
Audio examples of the MCI test stimuli with three-semitone spacing for the piano, fixed word, fixed sentence, and random sentence conditions. This is a file of type “wav” (6703 kB).
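The contour construction above can be sketched as follows. The note patterns shown are illustrative assumptions for a few of the nine shapes (exact definitions follow Galvin et al., 2007); the semitone spacing scales each pattern, so a rising contour at three-semitone spacing spans the full A2–A3 octave.

```python
def contour_f0s(shape: str, spacing: int, base_f0: float = 110.0) -> list:
    """F0s (Hz) for a five-note contour; `spacing` is the semitone step
    between successive notes (1-3 in the experiment). Only a few
    illustrative shapes are sketched; the full MCI set has nine."""
    patterns = {
        "flat":           [0, 0, 0, 0, 0],
        "rising":         [0, 1, 2, 3, 4],
        "falling":        [4, 3, 2, 1, 0],
        "rising-falling": [0, 1, 2, 1, 0],
        "falling-rising": [2, 1, 0, 1, 2],
    }
    # Convert semitone offsets to Hz via the equal-tempered relation.
    return [base_f0 * 2 ** (s * spacing / 12) for s in patterns[shape]]
```

With three-semitone spacing, a rising contour climbs 4 × 3 = 12 semitones, ending exactly one octave (220 Hz) above the 110 Hz base, which is why the recorded octave range suffices for all 27 contour/spacing combinations.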
All subjects were tested while seated in a sound-treated booth, directly facing a single loudspeaker. All stimuli were presented in sound field at 65 dBA. The four sentence conditions and the four MCI conditions were tested in separate blocks, and the test block order was randomized across subjects. No preview or trial-by-trial feedback was provided. A minimum of two test blocks was run for each condition; if the difference in performance between them was greater than 10%, a third run was tested.
3. Results
Figure 2 shows sentence recognition (top panels) and MCI performance (bottom panels) for musicians and non-musicians for the different test conditions. For sentence recognition, performance for both groups was very good for all conditions, with musicians and most non-musicians scoring near 100% correct. A split-plot analysis of variance (ANOVA) was performed on the sentence recognition data, with subject group (musicians or non-musicians) as the across-group factor and test condition (spoken, flat contour, fixed contour, or random contour) as the within-group factor. Results showed no significant effects for subject group [F(1,14) = 3.34, p = 0.089] or test condition [F(3,14) = 1.22, p = 0.288]; there was no significant interaction [F(3,14) = 0.28, p = 0.688].
Fig. 2.
Box plots for sentence recognition (top panels) and MCI (bottom panels) scores for musicians (M) and non-musicians (NM). Each panel shows data for different test conditions. The boxes show the 25th and 75th percentile, the error bars show the 10th and 90th percentiles, the solid line shows the median, the dashed line shows the mean, and the symbols show outliers.
For MCI, there was a strong musician advantage. Musician performance was nearly perfect in all conditions, while non-musician performance was generally poorer and more variable. A split-plot ANOVA was performed on the MCI data, with subject group (musicians or non-musicians) as the across-group factor and test condition (piano, fixed word, fixed sentence, and random sentence) as the within-group factor. Results showed significant effects for subject group [F(1,42) = 17.604, p = 0.001] and test condition [F(3,42) = 23.05, p < 0.001]; there was a significant interaction [F(3,42) = 22.0, p < 0.001].
Because musicians scored nearly 100% correct in all test conditions, post hoc pairwise comparisons (with Bonferroni corrections) were performed only on non-musician data. Performance in the piano and fixed word conditions was significantly better than in the fixed sentence or random sentence conditions (p < 0.001 in all four comparisons); there were no significant differences among the remaining conditions. Performance with three-semitone spacing was significantly better than with one-semitone spacing (p = 0.008), with no other significant differences.
4. Discussion
There was no significant musician effect for sentence recognition, even for the most complex condition (random contour), possibly due to ceiling performance effects. Still, the present data are in line with previous studies that tend to show small musician effects for speech, if any (Parbery-Clark et al., 2009; Fuller et al., 2014; Ruggles et al., 2014). Supporting our hypothesis, the musician effect for MCI became stronger as the stimuli became more complex. Musicians performed nearly perfectly in all test conditions, suggesting that musicians were better able to extract pitch information despite variations in timbre (in this case, words). Variations in timbre clearly affected non-musicians' melodic pitch perception. The present results are in agreement with a previous study that showed that MCI performance in CI users was significantly affected by instrument timbre, with performance decreasing as the timbre complexity increased (Galvin et al., 2008). In that study, NH performance was also more variable as the timbre complexity increased, similar to the present results.
It is possible that semantic differences across trials may have added to the complexity of the MCI task. If so, performance for the fixed sentence condition (the same sentence across trials) should have been better than for the random sentence condition (different sentences across trials). For non-musicians, performance was similar for the fixed and random conditions, and both were significantly poorer than for the fixed word condition (the same word across all notes and across all trials). Further, there was no significant difference between the fixed word and piano conditions, suggesting that consistent timbre cues allowed non-musicians to better extract pitch information.
The SSC may be an effective tool with which to probe the relative contributions of acoustic and electric hearing to speech and music perception in bimodal CI listeners. Pitch perception is poor with the CI alone, and the poor spectral resolution may lead to confusion between pitch and timbre cues, causing a deficit in speech and/or music perception when both cues are varied. Adding a contralateral HA may improve pitch perception, which in turn may improve speech and music performance when pitch and/or timbre cues are varied. As such, the SSC may reveal greater bimodal benefit to speech performance than observed in previous studies. While CI signal processing may be modified to improve melodic pitch perception (e.g., semitone-spaced frequency allocation), speech perception may be negatively affected; optimizing CI signal processing for music perception must not come at the expense of speech perception. The SSC may be used to evaluate both speech and music perception using stimuli that contain both pitch and timbre cues; as such, improvements and decrements in speech and/or music perception can be easily observed.
Acknowledgments
The authors thank the subjects for their participation. This work was supported by the NSF GK-12 Body Engineering Los Angeles program and NIDCD R01-DC004993 and R01-DC004792.
References and links
- 1. Crew, J. D., Galvin, J. J. 3rd, Landsberger, D. M., and Fu, Q.-J. (2015). "Contributions of electric and acoustic hearing to bimodal speech and music perception," PLoS One 10, e0120279. doi:10.1371/journal.pone.0120279
- 2. Dorman, M. F., Gifford, R. H., Spahr, A. J., and McKarns, S. A. (2008). "The benefits of combining acoustic and electric stimulation for the recognition of speech, voice and melodies," Audiol. Neurotol. 13, 105–112. doi:10.1159/000111782
- 3. Fuller, C. D., Galvin, J. J. 3rd, Free, R. H., and Baskent, D. (2014). "Musician effect in cochlear implant simulated gender categorization," J. Acoust. Soc. Am. 135, EL159–EL165. doi:10.1121/1.4865263
- 4. Galvin, J. J. 3rd, Fu, Q.-J., and Nogaki, G. (2007). "Melodic contour identification by cochlear implant listeners," Ear Hear. 28, 302–319. doi:10.1097/01.aud.0000261689.35445.20
- 5. Galvin, J. J. 3rd, Fu, Q.-J., and Oba, S. (2008). "Effect of instrument timbre on melodic contour identification by cochlear implant users," J. Acoust. Soc. Am. 124, EL189–EL195. doi:10.1121/1.2961171
- 6. Kong, Y.-Y., Stickney, G. S., and Zeng, F.-G. (2005). "Speech and melody recognition in binaurally combined acoustic and electric hearing," J. Acoust. Soc. Am. 117, 1351–1361. doi:10.1121/1.1857526
- 7. Kraus, N., Skoe, E., Parbery-Clark, A., and Ashley, R. (2009). "Experience-induced malleability in neural encoding of pitch, timbre and timing: Implications for language and music," Ann. N. Y. Acad. Sci. 1169, 543–557. doi:10.1111/j.1749-6632.2009.04549.x
- 8. Parbery-Clark, A., Skoe, E., Lam, C., and Kraus, N. (2009). "Musician enhancement for speech in noise," Ear Hear. 30, 653–661. doi:10.1097/AUD.0b013e3181b412e9
- 9. Peretz, I., Champod, A. S., and Hyde, K. (2003). "Varieties of musical disorders: The Montreal battery of evaluation of amusia," Ann. N. Y. Acad. Sci. 999, 58–75. doi:10.1196/annals.1284.006
- 10. Rader, T., Adel, Y., Fastl, H., and Baumann, U. (2015). "Speech perception with combined electric-acoustic stimulation: A simulation and model comparison," Ear Hear., in press.
- 11. Ruggles, D. R., Freyman, R. L., and Oxenham, A. J. (2014). "Influence of musical training on understanding voiced and whispered speech in noise," PLoS One 9, e86980. doi:10.1371/journal.pone.0086980

