Abstract
The identity of a speech sound can be affected by the long-term spectrum of a preceding stimulus. Poor spectral resolution of cochlear implants (CIs) may affect such context effects. Here, spectral contrast effects on a phoneme category boundary were investigated in CI users and normal-hearing (NH) listeners. Surprisingly, larger contrast effects were observed in CI users than in NH listeners, even when spectral resolution in NH listeners was limited via vocoder processing. The results may reflect a different weighting of spectral cues by CI users, based on poorer spectral resolution, which in turn may enhance some spectral contrast effects.
1. Introduction
Our perception of speech sounds can be affected by the acoustic spectrum of the preceding context. For instance, the categorization of a vowel along the continuum /i/ - /ε/ (as in “bit” to “bet”) can be shifted in a contrastive manner, so that /ε/ is more likely to be perceived if the long-term spectrum of the preceding sentence has more energy near the formant frequencies of /i/, and vice versa (Watkins, 1991; Stilp et al., 2015; Feng and Oxenham, 2018). Such spectral contrast effects may assist in our ability to maintain perceptual constancy of speech sounds by compensating for the varying voice characteristics of different speakers (Ladefoged and Broadbent, 1957) or the spectral coloration caused by different listening environments (Watkins, 1991). Cochlear implant (CI) users suffer from poor spectral resolution, due to the limited number of independent spectral channels and to interactions between channels caused by current spread, which degrades some of the auditory cues used in speech perception. A recent study of spectral contrast effects in normal-hearing (NH) listeners using vocoder-based simulations of CI processing found poorer discrimination between vowels, presumably due to the loss of spectral resolution, along with seemingly larger contrast effects (Stilp, 2017). In this paper, spectral contrast effects were estimated in CI users by measuring the effect of the long-term average spectrum of a preceding sentence on the phoneme category boundary between /i/, as in “bit,” and /ε/, as in “bet.” The results were compared with those of NH listeners using vocoder simulations with and without simulated current spread.
2. Method
2.1. Subjects
Twelve NH listeners and 12 post-lingually deafened Advanced Bionics and Cochlear CI users were enrolled in this study. All were native speakers of American English. The NH participants were between 18 and 26 years old. Of the CI users, three were bilaterally implanted, resulting in a total of 15 ears. One unilaterally implanted CI user and one ear of a bilaterally implanted user failed the screening described in Sec. 2.3. Therefore, a total of 11 CI users (13 ears) were tested. Their demographic information is provided in Table 1. All participants provided written informed consent and were compensated for their time. All protocols were approved by the Institutional Review Board of the University of Minnesota.
Table 1.
Demographic information for the CI participants. The following pairs denote the left and right ears of the same participants, respectively: D19/D26 and P12/P13. * subjects excluded from data analysis due to poor fitting of psychometric functions.
| Subject Code | Gender | Age (Yrs) | CI use (Yrs) | Etiology | Duration HL prior to implant (Yrs) | Device |
|---|---|---|---|---|---|---|
| D02 | F | 66.8 | 15.0 | Unknown | 1 | AB |
| D10 | F | 62.5 | 13.9 | Unknown | 8 | AB |
| D19 | F | 57.1 | 12.3 | Unknown | 11 | AB |
| D26 | F | 57.1 | 7.5 | Unknown | 11 | AB |
| D27 | F | 64.9 | 7.4 | Otosclerosis | 13 | AB |
| D28 | F | 71.2 | 13.8 | Familial progressive SNHL | 7 | AB |
| D39 | M | 69.3 | 7.8 | Unknown | 7 | AB |
| D41 | F | 67.8 | 4.3 | Familial progressive SNHL | 41 | AB |
| *D42 | M | 60.8 | 2.9 | Familial progressive SNHL | 2 | AB |
| *D44 | F | 69.8 | 9.2 | Familial progressive SNHL | 18 | AB |
| N25 | M | 50.2 | 20.5 | Maternal rubella | <1 | Cochlear |
| P12 | F | 27.1 | 2.1 | Sudden hearing loss | 3 | Cochlear |
| P13 | F | 27.1 | 4.1 | Sudden hearing loss | 1 | Cochlear |
2.2. Stimuli
The protocol and stimuli were the same as those described by Feng and Oxenham (2018). The context sentence was “The last word you hear is…” (1874 ms long) and the two test words were “bit” and “bet.” All stimuli were spoken by a female native speaker of American English with a mean fundamental frequency (F0) of 224 Hz. The vowel portions of the test words were extracted from the recorded words using the software package Praat (Boersma and Weenink, 2015). Each vowel was separated from the consonant /b/ by manually truncating the onset of the word in steps of 0.1 ms until the /b/ was no longer perceived. The formant contours of the vowels were then extracted using Praat. For the /i/, the first formant (F1) increases from 584 to 617 Hz, while the second formant (F2) increases from 2474 to 2501 Hz. For the /ε/, F1 increases from 820 to 829 Hz, while F2 decreases from 2171 to 2138 Hz. A 10-step continuum between /i/ and /ε/ was generated by filtering the voice source [extracted from the /i/ endpoint by inverse filtering with the linear prediction coefficients (LPCs) of the vowel] with formant contour grids created by linearly interpolating between the F1 and F2 trajectories of the two vowel endpoints, also using Praat. Each vowel was then spliced back between the onset portion containing the consonant /b/, extracted from the original recording of “bit,” and the final portion containing the silent gap and the consonant /t/, to form the target word.

The LPCs of the two vowel endpoints were also used to construct inverse filters, with frequency responses that were the inverse of the spectral envelopes of the vowels. Spectral envelope difference (SED) filters were created by placing the inverse filter of one vowel in series with the (all-pole) filter of the other, or vice versa (Watkins, 1991). The context sentence was filtered by either the /i/ - /ε/ filter [Fig. 1(a)], which results in a perceptual enhancement of /ε/ in the following target word, or by its inverse (/ε/ - /i/) [Fig. 1(b)], which results in a perceptual enhancement of /i/ in the target. There was a 100-ms silent gap between the sentence and the target word.
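The SED-filter construction can be illustrated with a short MATLAB sketch, given that LPC analysis was used. This is a minimal sketch and not the authors' code: the variable names (vI, vE, sentence) and the LPC order are assumptions.

```matlab
% Minimal sketch of SED filtering via LPC. vI and vE are the recorded /i/ and
% /E/ vowel segments and sentence is the context sentence, all at fs = 22050 Hz.
p = 44;                    % assumed LPC order; not specified in the text
aI = lpc(vI, p);           % 1/A_I(z) approximates the /i/ spectral envelope
aE = lpc(vE, p);           % 1/A_E(z) approximates the /E/ spectral envelope

% /i/ - /E/ SED filter: the inverse filter of /E/ in series with the filter of
% /i/, i.e., H(z) = A_E(z)/A_I(z), which boosts energy near the /i/ formants
% (and thereby perceptually enhances /E/ in the following target word).
sentIE = filter(aE, aI, sentence);

% The /E/ - /i/ filter is the inverse: swap numerator and denominator.
sentEI = filter(aI, aE, sentence);
```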
Fig. 1.
(Color online) The frequency response of SED filters that enhance the /ε/ vowel (a) and /i/ (b). The solid black lines show the frequency responses that were used in the experiment to filter the context sentence. The red dotted and blue dashed lines show the spectra after applying a low- or high-pass spectral modulation filter, respectively. The cutoff spectral modulation rate in both cases was 1.37 ripples per octave. Adapted from Feng and Oxenham (2018).
The context sentence and target word were low-pass filtered with a cutoff frequency of 5 kHz and played at equal root-mean-square (RMS) amplitude. All sounds were sampled at 22 050 Hz. For CI users, stimuli were presented via the auxiliary input directly to the speech processor (bypassing the microphone) at their most comfortable levels; for NH listeners, stimuli were presented monaurally to the right ear at 60 dB sound pressure level via Sennheiser (Old Lyme, CT) HD650 headphones. The stimuli for NH listeners were first passed through a 16-channel tone-excited envelope vocoder: each stimulus was divided into 16 frequency sub-bands using bandpass filters generated with the function “fir1” in Matlab (Mathworks, Natick, MA), with cutoff and center frequencies equal to those of the standard Advanced Bionics clinical map. The temporal envelope of each sub-band was extracted using a Hilbert transform, low-pass filtered with a fourth-order Butterworth filter with a cutoff frequency of 50 Hz, and then used to modulate a pure-tone carrier at the center frequency of the channel. In the condition simulating channel interactions, each carrier was instead modulated by the weighted sum of the intensity envelopes from all 16 channels (Oxenham and Kreft, 2014). The weights were selected to produce 12 dB/oct slopes on both sides of the center frequency of each channel.
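The vocoder chain can be sketched in MATLAB as follows. This is a minimal sketch under stated assumptions: the log-spaced band edges stand in for the Advanced Bionics clinical map (which is not reproduced here), and the FIR order and variable names are placeholders.

```matlab
% Minimal 16-channel tone-excited envelope vocoder with optional simulated
% channel interaction. x is the input stimulus (column vector) at fs Hz.
fs  = 22050;
nCh = 16;
edges = logspace(log10(250), log10(8700), nCh + 1);  % assumed stand-in band edges
fcs   = sqrt(edges(1:end-1) .* edges(2:end));        % geometric-mean center freqs
[bLP, aLP] = butter(4, 50/(fs/2));                   % 50-Hz envelope smoother

env = zeros(length(x), nCh);
for ch = 1:nCh
    bBP = fir1(256, edges(ch:ch+1)/(fs/2));          % bandpass analysis filter
    sub = filter(bBP, 1, x);
    env(:, ch) = filter(bLP, aLP, abs(hilbert(sub)));% smoothed Hilbert envelope
end

% Simulated channel interaction: mix the intensity envelopes across channels
% with 12 dB/oct slopes on both sides of each channel's center frequency.
spread = true;
if spread
    octDist = abs(log2(fcs' ./ fcs));                % nCh-by-nCh octave distances
    W = 10 .^ (-12 * octDist / 10);                  % 12 dB/oct intensity weights
    env = sqrt(env.^2 * W);                          % weighted sum of intensities
end

t = (0:length(x)-1)'/fs;
y = sum(env .* sin(2*pi*t*fcs), 2);                  % pure-tone carriers at fcs
y = y * rms(x)/rms(y);                               % restore the overall level
```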
2.3. Procedure
Listeners were seated in a double-walled sound-attenuating booth. They were asked to categorize the target word at the end of the sentence as either “bit” or “bet.” No feedback was provided. The context sentence was always one of three possibilities: the original sentence without any filtering, the sentence filtered to emphasize /i/ and de-emphasize /ε/ (/i/ - /ε/ SED filter), or the sentence filtered to emphasize /ε/ and de-emphasize /i/ (/ε/ - /i/ SED filter). The experiment consisted of 600 trials per listener (10 targets × 3 context conditions × 20 repetitions), presented in random order for each listener and divided into 10 blocks of 60 trials each. Listeners took short breaks between blocks. All 12 NH listeners were tested under vocoded conditions both with and without spectral spread. All listeners had to pass a screening test in which they were required to achieve at least 80% correct in a discrimination task using only the two endpoint target words. All NH listeners passed the screening in the conditions both with and without spectral spread.
2.4. Analysis
For individual listeners, the proportion of /ε/ responses to each stimulus was analyzed using a generalized linear model with the function “glmfit” in Matlab. The binomial distribution was used to reflect the fact that responses were coded in a binary manner, and a probit link function was used to fit the psychometric function of each listener in each condition. Two of the 11 CI users were excluded from further data analysis due to poor fits of the psychometric functions (the model coefficient for slope was not significantly different from 0, p > 0.05). This left data from a total of nine CI listeners (two of whom were tested in both ears) for analysis. For the CI listeners tested in both ears, the results from the two ears were averaged and treated as the responses of a single listener. However, the statistical outcomes remained the same when the results from each ear were treated as independent observations (Donaldson et al., 2015; Bierer and Litvak, 2016).
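A minimal sketch of the per-listener fit follows, assuming count vectors nEps (number of /ε/ responses) and nTrials (number of presentations) at each of the ten continuum steps; these names and the derived quantities are illustrative rather than the authors' code.

```matlab
% Fit a probit psychometric function to one listener's responses in one
% condition. nEps and nTrials are 10-by-1 vectors of counts (assumed names).
step = (1:10)';
[b, ~, stats] = glmfit(step, [nEps nTrials], 'binomial', 'link', 'probit');

slope    = b(2);        % steepness of the fitted function (per vowel step)
midpoint = -b(1)/b(2);  % 50% point, i.e., the phoneme category boundary

% Exclusion criterion described above: the slope coefficient is not
% significantly different from zero.
excluded = stats.p(2) > 0.05;

pFit = normcdf(b(1) + b(2)*step);  % fitted proportion of /E/ responses
```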
3. Results
Figure 2(a) shows the psychometric functions for phoneme categorization fitted to the mean data when no filtering was applied to the preceding sentence. For the NH listeners, the slope of the psychometric function was shallower in the condition that incorporated the simulated channel interactions (greater spectral spread). A paired-samples t-test on the individually fitted functions from the NH listeners confirmed a significant effect of spectral spread on the slope of the psychometric function [t(11) = 7.22, p < 0.001, Cohen's d = 2.47]. The average slope of the psychometric function for the CI users was 0.47 log-odds per vowel step, comparable to the NH listeners' average slope of 0.46 in the condition with spectral spread [independent-samples t-test, t(19) = −0.17, p = 0.87, Cohen's d = 0.08]. Figure 2(b) shows the psychometric functions when one of the SED filters was applied to the preceding sentence. In general, the word “bet” was perceived more often when the SED filter emphasizing the spectral features of /i/ was applied to the sentence (filled symbols, solid curves), whereas “bit” was more likely to be perceived when the inverse filter was applied (open symbols, dashed curves). The difference between the two curves within each panel reflects a spectral contrast effect. Typically, the size of the effect is quantified as the difference in the phoneme boundaries, defined as the shift in the midpoints of the two functions. The average shift of the phoneme boundary in the NH listeners was larger with channel interactions (2.23 steps on the vowel continuum ± 1.05 standard deviation, s.d.) than without (1.09 steps ± 0.66 s.d.). A paired-samples t-test on the shift in the phoneme boundary in NH listeners revealed a significant effect of the channel interactions [t(11) = −1.13, p = 0.006, Cohen's d = 1.27]. The average shift in CI users (4.90 steps ± 3.00 s.d.) was larger than that in the NH listeners, even in the condition with channel interactions. An independent-samples t-test on the shift in the phoneme boundary, comparing CI users with the NH listeners in the condition with channel interactions, confirmed that the effect in CI users was larger [t(19) = −2.88, p = 0.010, Cohen's d = 1.19].
Fig. 2.
Proportion of /ε/ responses to a target word along a ten-step continuum between “bit” and “bet.” The left column shows mean data from 12 NH listeners with a vocoder with no channel interactions (NH-VC No Spread); the middle column shows mean data from the same 12 NH listeners with a vocoder with channel interactions (NH-VC Spread); and the right column shows mean data from 9 CI users. (a) Results with the context sentence with no filtering. (b) Results with the context sentence filtered with the SED filter that emphasizes the spectrum of /i/ (/i/ - /ε/) (filled circles and solid curves) or the SED filter that emphasizes the spectrum of /ε/ (/ε/ - /i/) (open circles and dashed curves). Slope and midpoint (mp) values are the means of the individually fitted functions, whereas the solid and dashed curves represent functions fitted to the mean data. Error bars represent ±1 standard error of the mean across listeners.
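For completeness, the group comparisons above amount to standard paired and independent-samples t-tests on the per-listener fitted parameters; a brief sketch, with assumed variable names for the per-listener slope and boundary-shift vectors:

```matlab
% Paired t-test on NH slopes with vs. without simulated spread, and an
% independent-samples t-test on boundary shifts (CI vs. NH with spread).
% slopeNoSpread, slopeSpread, shiftCI, and shiftNH are assumed vectors.
[~, pSlope, ~, statSlope] = ttest(slopeNoSpread, slopeSpread);  % paired, df = 11
[~, pShift, ~, statShift] = ttest2(shiftCI, shiftNH);           % unpaired, df = 19
```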
Although the results suggest a larger spectral contrast effect when spectral resolution is degraded, it is possible that this increase is at least in part due to the shallower slope of the psychometric function observed in the CI users and in the NH listeners with simulated channel interactions. Because shallow slopes indicate less sensitivity in discrimination, a one-step shift on the continuum is not necessarily equivalent to the same shift with a steeper underlying psychometric function. To account for the effect of different underlying psychometric functions, an alternative measure of the spectral contrast effect was derived, based on the effect size (the difference in means, normalized by the s.d. of the underlying distribution). The difference between the midpoints of the two psychometric functions (which can be thought of as cumulative normal distributions) was normalized by the average s.d. (the reciprocal of the slope), in the same way that sensitivity, or d′, is calculated:
$$ \text{normalized shift} = \frac{m_2 - m_1}{\frac{1}{2}\left(\frac{1}{S_1} + \frac{1}{S_2}\right)}, \tag{1} $$
where m indicates the midpoint, S indicates the slope, and the subscripts denote the two context conditions. Considering first the data from the NH listeners, a paired-samples t-test on the normalized midpoint shifts showed that the effect of channel interactions was not significant [t(11) = 0.15, p = 0.88, Cohen's d = 0.054]. Thus, although increasing the spectral spread via vocoding led to a shallower psychometric function and a concomitant increase in the difference between the midpoints, the size of the spectral contrast effect, quantified as the normalized midpoint shift, was not affected.
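In code, the normalized shift of Eq. (1) for one listener could be computed from the two fitted functions as follows; m1, S1 and m2, S2 (the fitted midpoints and slopes in the two context conditions) are assumed variable names.

```matlab
% Normalized midpoint shift [Eq. (1)]: the boundary shift expressed in units
% of the underlying distribution's s.d., analogous to d-prime.
sd1 = 1/S1;                               % s.d. = reciprocal of fitted slope
sd2 = 1/S2;
normShift = (m2 - m1) / ((sd1 + sd2)/2);
```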
The results from the CI users were compared with those from the NH listeners using the vocoder with the 12 dB/oct spread. An independent-samples t-test on the normalized midpoint shift between the two contexts found that the shift was significantly different between the two groups [t(19) = −3.06, p = 0.001, Cohen's d = 1.27]. In other words, even though the CI and NH groups in the condition with spread had comparable spectral resolution, as suggested by the similarly shallow psychometric functions and by earlier studies with similar spread in speech perception experiments (Oxenham and Kreft, 2014), the CI users showed larger spectral contrast effects than the NH listeners. Whereas the size of the context effect (in terms of normalized midpoint shift) for the NH listeners was 0.85 (s.d. 0.43) and 0.88 (s.d. 0.43) in the spread and no-spread conditions, respectively, it was 1.83 (s.d. 1.00) for the CI users.
4. Discussion
Our results suggest that CI users may experience increased effects of spectral compensation relative to those observed in NH listeners. This increased effect of context does not seem to be explained solely by the poorer spectral resolution experienced by CI users: when poorer spectral resolution was simulated in NH listeners, the slopes of the psychometric functions were similar to those found in CI users, but the spectral contrast effect, measured as the normalized midpoint shift, remained larger in the CI users. It should be noted that on average the CI users were older than the NH listeners, meaning that we cannot rule out potential age effects, although the one younger CI user (P12/P13) had a normalized spectral contrast effect size of 3.50, which was larger than the mean for the CI users and thus much larger than that of the NH listeners.
Earlier studies have, if anything, reported reduced or absent context effects in CI users. For instance, Aravamudhan and Lotto (2005) reported absent contrastive context effects in CI users for /da/ - /ga/ targets following /al/ - /ar/ contexts. Winn et al. (2013) found reduced context effects in CI users in distinguishing the fricatives /s/ and /ʃ/. In a vowel-identification paradigm similar to that of Summerfield et al. (1984), CI users were also found to benefit from a precursor that had the inverse spectrum of the target vowel, leading to enhancement of the vowel formants, but not to the same extent as NH listeners (Wang et al., 2012). In those cases, it has been assumed that reduced spectral resolution led to reduced spectral contrast effects. On the other hand, one study reported larger effects in NH listeners when the stimuli were presented through a noise-excited envelope vocoder (Stilp, 2017), although the influence of the shallower psychometric function was not considered in that study. Another study reported larger-than-normal speech context effects in hearing-impaired listeners (Stilp and Alexander, 2016), who may also suffer from poor spectral resolution due to broader cochlear filters (Moore et al., 1999).
One possible explanation for the conflicting findings regarding how spectral degradation changes the size of the effect is the use of different time scales. The studies that reported absent (Aravamudhan and Lotto, 2005) or reduced (Winn et al., 2013) contrast effects in CI users measured the effect of the spectral properties of adjacent phonemes, whereas our study and the studies using vocoding (Stilp, 2017) and hearing-impaired listeners (Stilp and Alexander, 2016) used whole sentences as context. However, Wang et al. (2012) used precursors of 1-s duration, and CI users still showed less enhancement than NH listeners, suggesting that precursor duration cannot completely account for the differences observed across studies.
Another explanation is that the enhanced contrast effects observed here in CI users may reflect different spectral cues used by CI users to distinguish vowels. The spectra of the SED filters used in our study, as shown in Fig. 1, contain both high-rate spectral modulations, such as the sharp peaks and troughs at the formant frequencies of the two vowels (blue dashed lines in Fig. 1), and low-rate spectral modulations, such as the coarse spectral change (red dotted lines in Fig. 1). Since CI users rely on coarser spectral cues in speech perception (Winn et al., 2012; Winn and Litovsky, 2015), their perceptual compensation might also be based on the coarse spectral change, rather than on the fine spectral changes presumably used by NH listeners (Stilp et al., 2015). The fact that NH listeners did not show the same enhanced context effects when listening through a vocoder, once the change in psychometric-function slope is taken into account, highlights a limitation of vocoders in simulating CI perception. It also suggests that the listening strategies of CI users (including which spectral cues are used to distinguish vowels) may develop over longer time periods, and so may not be reflected in the listening strategies of NH listeners presented acutely with vocoder simulations.
Acknowledgments
This study was supported by NIH Grant No. R01 DC012262. We thank Heather Kreft for assistance in recruiting the CI participants, and our participants for providing their time. Christian Stilp and one anonymous reviewer provided helpful comments on an earlier version of this paper.
References and links
- 1. Aravamudhan, R., and Lotto, A. J. (2005). “Phonetic context effects in adult listeners with cochlear implant,” J. Acoust. Soc. Am. 118, 1962–1963. 10.1121/1.4781551
- 2. Bierer, J. A., and Litvak, L. (2016). “Reducing channel interaction through cochlear implant programming may improve speech perception: Current focusing and channel deactivation,” Trends Hear. 20, 1–12. 10.1177/2331216516653389
- 3. Boersma, P., and Weenink, D. (2015). “Praat: Doing phonetics by computer,” http://www.praat.org/ (Last viewed September 30, 2018).
- 4. Donaldson, G. S., Rogers, C. L., Johnson, L. B., and Oh, S. H. (2015). “Vowel identification by cochlear implant users: Contributions of duration cues and dynamic spectral cues,” J. Acoust. Soc. Am. 138, 65–73. 10.1121/1.4922173
- 5. Feng, L., and Oxenham, A. J. (2018). “Spectral contrast effects produced by competing speech contexts,” J. Exp. Psychol. Hum. Percept. Perform. (published online 2018). 10.1037/xhp0000546
- 6. Ladefoged, P., and Broadbent, D. E. (1957). “Information conveyed by vowels,” J. Acoust. Soc. Am. 29, 98–104. 10.1121/1.1908694
- 7. Moore, B. C., Vickers, D. A., Plack, C. J., and Oxenham, A. J. (1999). “Inter-relationship between different psychoacoustic measures assumed to be related to the cochlear active mechanism,” J. Acoust. Soc. Am. 106, 2761–2778. 10.1121/1.428133
- 8. Oxenham, A. J., and Kreft, H. A. (2014). “Speech perception in tones and noise via cochlear implants reveals influence of spectral resolution on temporal processing,” Trends Hear. 18, 1–14. 10.1177/2331216514553783
- 9. Stilp, C. E. (2017). “Acoustic context alters vowel categorization in perception of noise-vocoded speech,” J. Assoc. Res. Otolaryngol. 18, 465–481. 10.1007/s10162-017-0615-y
- 10. Stilp, C. E., and Alexander, J. M. (2016). “Spectral contrast effects in vowel categorization by listeners with sensorineural hearing loss,” Proc. Mtgs. Acoust. 26, 060003. 10.1121/2.0000233
- 11. Stilp, C. E., Anderson, P. W., and Winn, M. B. (2015). “Predicting contrast effects following reliable spectral properties in speech perception,” J. Acoust. Soc. Am. 137, 3466–3476. 10.1121/1.4921600
- 12. Summerfield, Q., Haggard, M., Foster, J., and Gray, S. (1984). “Perceiving vowels from uniform spectra: Phonetic exploration of an auditory aftereffect,” Percept. Psychophys. 35, 203–213. 10.3758/BF03205933
- 13. Wang, N., Kreft, H., and Oxenham, A. J. (2012). “Vowel enhancement effects in cochlear-implant users,” J. Acoust. Soc. Am. 131, EL421–EL426. 10.1121/1.4710838
- 14. Watkins, A. J. (1991). “Central auditory mechanisms of perceptual compensation for spectral-envelope distortion,” J. Acoust. Soc. Am. 90, 2942–2955. 10.1121/1.401769
- 15. Winn, M. B., Chatterjee, M., and Idsardi, W. J. (2012). “The use of acoustic cues for phonetic identification: Effects of spectral degradation and electric hearing,” J. Acoust. Soc. Am. 131, 1465–1479. 10.1121/1.3672705
- 16. Winn, M. B., and Litovsky, R. Y. (2015). “Using speech sounds to test functional spectral resolution in listeners with cochlear implants,” J. Acoust. Soc. Am. 137, 1430–1442. 10.1121/1.4908308
- 17. Winn, M. B., Rhone, A. E., Chatterjee, M., and Idsardi, W. J. (2013). “The use of auditory and visual context in speech perception by listeners with normal hearing and listeners with cochlear implants,” Front. Psychol. 4, 824. 10.3389/fpsyg.2013.00824