Abstract
Cochlear implant (CI) listeners typically perform poorly on tasks involving the pitch of complex tones. This limitation in performance is thought to be mainly due to the restricted number of active channels and the broad current spread that leads to channel interactions and subsequent loss of precise spectral information, with temporal information limited primarily to temporal-envelope cues. Little is known about the degree of spectral resolution required to perceive combinations of multiple pitches, or a single pitch in the presence of other interfering tones in the same spectral region. This study used noise-excited envelope vocoders that simulate the limited resolution of CIs to explore the perception of multiple pitches presented simultaneously. The results show that the resolution required for perceiving multiple complex pitches is comparable to that found in a previous study using single complex tones. Although relatively high performance can be achieved with 48 channels, performance remained near chance when even limited spectral spread (with filter slopes as steep as 144 dB/octave) was introduced to the simulations. Overall, these tight constraints suggest that current CI technology will not be able to convey the pitches of combinations of spectrally overlapping complex tones.
Keywords: cochlear implants, complex pitch, vocoder
Introduction
Pitch is a fundamental property of sound that is integral to the perception of music and speech, as well as being a primary cue used in auditory scene analysis (McDermott 2004; Oxenham 2012, 2018). Pitch perception is generally poorer than normal in people with cochlear hearing loss, partly due to poorer frequency selectivity and a loss of spectrally resolved harmonics (Glasberg and Moore 1986; Bernstein and Oxenham 2006), and is poorer still in listeners with cochlear implants (CIs) (McDermott 2004; Fu et al. 2004; Chatterjee and Peng 2008; Looi et al. 2008; Deroche et al. 2016; Gaudrain and Başkent 2018). For complex harmonic tones, such as those found in voiced speech and musical sounds, CI users do not have access to the low-numbered resolved harmonics that provide normal-hearing listeners with a salient pitch (Houtsma and Smurzynski 1990; Bernstein and Oxenham 2003). Instead, they are limited to the pitch conveyed via the periodicity of the temporal envelope (McDermott 2004; Oxenham 2018). Temporal-envelope pitch is not only weaker and less accurate than that conveyed by low-numbered resolved harmonics, it is also more susceptible to interference by other sounds and reverberation (Carlyon 1996; Qin and Oxenham 2005; Micheyl et al. 2010a; Oxenham 2018). Most importantly for the purposes of this study, it is thought that temporal-envelope pitch cannot convey more than one pitch at a time when sounds are presented simultaneously in the same spectral region (Carlyon 1996; Micheyl et al. 2010a; Kreft et al. 2013), making the accurate perception of harmony in music extremely difficult.
The lack of access to low-numbered resolved harmonics affects pitch perception not only in music (Kong et al. 2004; Galvin et al. 2007; Zeng et al. 2014; Fielden et al. 2015; Mehta and Oxenham 2017) but also in several aspects of speech perception, including speaker identification and discrimination (Gaudrain and Başkent 2018), the perception of prosody in speech produced by dynamic changes in pitch (Deroche et al. 2016), and the understanding of speech in situations with multiple talkers (Stickney et al. 2004; Rosen et al. 2013).
An important question is how much spectral resolution is needed to extract pitch from a mixture using a spectral F0 cue, and whether that differs from the resolution needed to extract the pitch from a single complex tone. A previous study using vocoder simulations showed that a relatively high number of channels (between 32 and 64) was needed to convey a pitch strong enough to enable the discrimination of small pitch changes within a short melody (Mehta and Oxenham 2017). Most importantly, the vocoder channels also required very sharp spectral slopes, exceeding 72 dB/octave, to produce accurate pitch perception. Although these specifications are already well beyond the capabilities of current CIs, it is not clear if the resolution needed to convey one pitch at a time is the same as that needed to convey multiple pitches simultaneously. Mixtures of three or more complex tones are common in natural listening environments, as well as in music, where multiple tones often begin and end at the same time (e.g., Parncutt et al. 2019). More importantly, mixtures of concurrent complex tones (such as multiple instruments played simultaneously) lead to a drastic decrease in peripheral resolvability of the resulting mixture, even if all the component complexes consist of low-numbered resolved harmonics (e.g., Micheyl et al. 2010b; Graves and Oxenham 2019). Studies have shown that judgments of music quality in both CI users and hearing-impaired listeners decrease when multiple instruments are involved (Looi et al. 2007; Nemer et al. 2017). Hence, it is important to investigate whether the resolution deemed sufficient for single tones is the same or higher for mixtures of complex tones.
A recent study with normal-hearing listeners suggested that the pitches from multiple harmonic complex tones could be extracted so long as each individual tone contained some resolved harmonics, even if no resolved harmonics remained after the three complex tones were combined (Graves and Oxenham 2019). It is possible that normal-hearing listeners have access to temporal fine structure (TFS) cues from individual harmonics that may remain accessible even if the individual harmonics are no longer spectrally resolved (Moore 2019). Such TFS cues are not available to CI users, and are also not available via vocoder simulations that involve a discrete number of frequency channels, because the TFS of the vocoder carrier does not accurately reflect the TFS of the original stimulus within the passband of the vocoder channel.
The present study asks how pitch is extracted from simultaneous mixtures of complex tones, and whether the limits for doing so are different from the limits determined for single complex tones. The paradigm, adapted from Graves and Oxenham (2019), involved harmonic complex tones that were gated on and off synchronously and were bandpass filtered into the same spectral region. This approach helps to rule out the use of acoustic cues (such as spectral or temporal “glimpsing” of individual harmonic tones, and temporal-envelope cues) that may be sometimes, but not always, available in everyday listening environments. The overall pattern of results suggests that the minimum resolution required to perceive combinations of three pitches is similar to that needed to perceive single pitches, and confirms that the required resolution is so high as to be essentially unattainable using current CI technology.
General Methods
Participants
Fourteen participants (5 male, 9 female, aged 19–30) took part in both the experiments in the study. All participants had normal hearing, defined as audiometric thresholds for octave frequencies between 250 and 8000 Hz no greater than 20 dB hearing level (HL). Participants varied in their years of musical training, ranging from 0 to 18 years, and reported no history of neurological or hearing disorders. All participants provided written informed consent and were compensated for their time. The experiments were conducted at the University of Minnesota, Twin Cities. The University of Minnesota Institutional Review Board approved all protocols.
Vocoder Parameters
For both experiments, the stimuli were processed with noise-excited envelope vocoders. Although tone-excited vocoders may better represent CI users’ performance in speech-perception tasks (Whitmal et al. 2007), they can also potentially add extraneous pitch cues, which would be problematic for tasks that explicitly investigate pitch perception. All the stimuli were first passed through an analysis filterbank stage, consisting of high-order (N > 1000) FIR bandpass filters with cutoff frequencies evenly spaced on a logarithmic scale between 200 and 6000 Hz. The number of channels was varied based on the experimental condition. A flat overall spectral response was confirmed (< 0.1 dB ripple in the passband) with negligible spectral overlap between the channels. The impulse responses of the filters were time aligned with a latency of about 20 ms. Following the analysis filter stage, the temporal envelope in each channel was computed using a Hilbert transform, followed by a fourth-order Butterworth low-pass filter with a cutoff frequency of 50 Hz to reduce periodicity cues from the temporal envelope. The resulting envelopes were then used to modulate independent samples of Gaussian white noise. These were then re-filtered through a synthesis filterbank. The slopes of the synthesis filterbank were varied according to the parameters in the various experiments. For the conditions with no (or negligible) spectral overlap, the same high-order FIR filters were used as for the initial stimulus analysis. For all the conditions with spectral overlap, the synthesis filterbank consisted of symmetrical, zero-phase-shift Butterworth filters ranging from 12th (72 dB/oct) to 24th order (144 dB/oct), with the cutoff frequencies being the 6-dB down points. The synthesis filters were essentially triangular with no flat bandpass region.
The final output stimulus was generated by summing the outputs from all channels of the synthesis filterbank and rescaling the normalized amplitude to the target sound level. For all the multiple pitch stimuli, the complex tones were generated and combined before being processed through the vocoder.
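The processing chain just described can be sketched as follows. This is a minimal Python approximation, not the study's actual implementation: the high-order FIR analysis and synthesis filters are replaced here by modest Butterworth band-passes (so only the no-overlap case is loosely approximated), and the 44.1-kHz sampling rate is taken from the Stimulus Presentation section.

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

FS = 44100  # sampling rate (Hz), as in the study

def band_sos(f_lo, f_hi, order=4):
    """Butterworth band-pass (second-order sections); a simplified stand-in
    for the high-order FIR filters described in the text."""
    return butter(order, [f_lo, f_hi], btype='bandpass', fs=FS, output='sos')

def vocode_channel(x, f_lo, f_hi, rng):
    """One channel: band-pass, Hilbert envelope, 50-Hz fourth-order
    Butterworth smoothing, then modulate fresh Gaussian noise and re-filter."""
    band = sosfiltfilt(band_sos(f_lo, f_hi), x)
    env = np.abs(hilbert(band))                    # Hilbert envelope
    lp = butter(4, 50.0, btype='low', fs=FS, output='sos')
    env = np.maximum(sosfiltfilt(lp, env), 0.0)    # 50-Hz smoothing
    carrier = rng.standard_normal(len(x))          # independent noise sample
    return sosfiltfilt(band_sos(f_lo, f_hi), env * carrier)

def vocode(x, n_channels=48, f_min=200.0, f_max=6000.0, seed=0):
    """Sum all channel outputs and rescale to the RMS of the input."""
    rng = np.random.default_rng(seed)
    edges = np.geomspace(f_min, f_max, n_channels + 1)  # log-spaced cutoffs
    y = sum(vocode_channel(x, lo, hi, rng)
            for lo, hi in zip(edges[:-1], edges[1:]))
    return y * np.sqrt(np.mean(x ** 2) / np.mean(y ** 2))
```

The final RMS rescaling corresponds to the normalization to the target sound level described above; in the actual experiments, the multiple-tone mixtures were combined before this processing was applied.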
Stimulus Presentation
Stimuli were presented diotically at a sampling rate of 44.1 kHz through an L22 soundcard (Lynx Studio Technology, Costa Mesa, CA) to HD 650 headphones (Sennheiser, Old Lyme, CT). Each of the complex tones was presented at an overall level of 65 dB SPL before mixing with the other stimuli. Stimulus presentation and response collection were controlled with the AFC software package (Ewert 2013) via a MATLAB platform (MathWorks, Natick, MA). For all experiments, the vocoded stimuli were presented in a background of threshold-equalizing noise (TEN), bandpass filtered between 20 Hz and 10 kHz and presented at a level of 55 dB SPL per equivalent rectangular bandwidth (ERB) around 1 kHz (Moore et al. 2000); the TEN was gated on and off with 20-ms ramps.
Experiment 1: Effect of Number of Spectral Channels
Rationale
The aim of experiment 1 was to determine the number of spectral channels required for listeners to accurately perceive the pitch of one complex tone in the presence of other complex tones. In this experiment, there was no spectral overlap between adjacent channels of the vocoder.
Stimuli
A schematic diagram of the stimuli used in experiment 1 is shown in Fig. 1. Each trial consisted of three reference complex tones, termed cue tones, followed by a mixture of the target complex tone and two masker complex tones. The F0 of the cue tones was roved between trials by ± 1 semitone around 353 Hz with uniform distribution on a semitone scale. The target tone in the mixture was presented either 1 semitone above or below the cue tone. Relative to the two masker tones, the target had either the lowest (LOW condition), middle (MID condition), or highest (HIGH condition) F0 within the mixture. The spacing between the adjacent F0s in the mixture was between 3 and 6 semitones, selected at random on each trial with uniform distribution on a semitone scale. All tones were generated as broadband harmonic complex tones with equal-amplitude harmonics added in sine phase. The target complex tones as well as the cue tones were bandpass filtered with corner frequencies of 1 and 4 kHz, and the masker complex tones were bandpass filtered with corner frequencies of 700 and 5000 Hz. In both cases, 48 dB/octave spectral slopes were imposed outside the passbands. The wider bandwidth of the masker tones ensured that the highest and lowest components of the mixture belonged to the masker tones, and so prevented any usable cues regarding the target based on the lowest or highest audible components in the mixture. The tones were 250 ms in duration, including 10-ms onset and offset ramps separated by 50-ms gaps. The TEN background was gated on 300 ms before the first cue tone and was gated off 200 ms after the final tone. All the stimuli (except for the TEN) were vocoded with the noise vocoder described earlier before being presented to the listeners. The number of channels was 32, 48, or 64. 
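The tone-generation steps above can be sketched as follows. This is illustrative Python, not the study's actual code; in particular, the 48 dB/octave bandpass filtering is approximated here by scaling each harmonic component directly, rather than by filtering the waveform.

```python
import numpy as np

FS = 44100  # sampling rate (Hz), as in the study

def harmonic_complex(f0, dur=0.25, f_lo=1000.0, f_hi=4000.0,
                     slope_db_oct=48.0, ramp_dur=0.01):
    """Equal-amplitude, sine-phase harmonic complex, 'filtered' by scaling
    each component: 0 dB between f_lo and f_hi, attenuated by slope_db_oct
    per octave outside the passband (a stand-in for the study's filters)."""
    t = np.arange(int(dur * FS)) / FS
    x = np.zeros_like(t)
    for h in range(1, int(FS / 2 / f0) + 1):   # all harmonics below Nyquist
        f = h * f0
        if f < f_lo:
            gain_db = -slope_db_oct * np.log2(f_lo / f)
        elif f > f_hi:
            gain_db = -slope_db_oct * np.log2(f / f_hi)
        else:
            gain_db = 0.0
        x += 10.0 ** (gain_db / 20.0) * np.sin(2 * np.pi * f * t)  # sine phase
    # 10-ms raised-cosine onset/offset ramps
    n = int(ramp_dur * FS)
    env = np.ones_like(x)
    env[:n] = 0.5 * (1 - np.cos(np.pi * np.arange(n) / n))
    env[-n:] = env[:n][::-1]
    return x * env
```

For the cue tones, f0 would be roved by up to one semitone around 353 Hz (i.e., 353 * 2 ** (r / 12) with r drawn uniformly from [-1, 1]), and the masker tones would use corner frequencies of 700 and 5000 Hz, per the parameters above.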
With the logarithmic spacing of the channels between 200 and 6000 Hz, the number of channels corresponds to channel filter bandwidths of roughly 1/7th, 1/10th, and 1/15th octaves (corresponding to 1.7, 1.2, and 0.8 semitones), respectively. For comparison, the estimated equivalent rectangular bandwidth (ERB) of the normal auditory filter at 2 kHz (the center of the target passband) is about 1/6th octave according to Glasberg and Moore (1990), based on measures using simultaneous masking, and is about 1/10th octave according to Oxenham and Shera (2003), based on measures using a fixed low-level signal under forward masking, which avoid potentially confounding effects of suppression (e.g., Houtgast 1972; Moore 1978). In other words, the estimated ERBs for the human auditory filters are similar to those of the vocoder in the 32- and 48-channel conditions. The same FIR filters as the analysis filters were used for the synthesis filters in this experiment.
Fig. 1.
Schematic representation of the stimulus paradigm (left-side panels) and the spectrum of the mixtures (right-side panels). The top row shows the HIGH condition where the maskers are both lower in frequency than the target. The middle and bottom rows illustrate the MID and LOW conditions, respectively. The vertical grey lines (left-side panels) indicate the start and stop of the background noise.
Procedure
The F0 of the target tone was always one semitone higher or lower than the F0 of the cue tones with equal a priori probability. The participants’ task was to determine whether the pitch of the target tone was higher or lower than that of the preceding cue tones. All participants carried out a screening/training session before moving on to the experimental tasks. The screening task consisted of non-vocoded complex tones. Participants completed 3 blocks, each corresponding to a different position of the target tone (LOW, MID, or HIGH). Each block contained 4 runs with 20 trials per run. Visual feedback of “correct” or “wrong” was displayed after each trial response. Verbal feedback from the research assistant or experimenter was also given after each block. All three blocks needed to be completed with at least 80 % accuracy in order for the participant to continue with the experiment. The order of blocks was randomized across participants. If the participant did not meet the criterion on their first round of screening in any block, they were reinstructed verbally with the schematic (see Fig. 1), and the level of the target was increased to 5 dB above the masker to make the target tone more salient. After this block, the target level was reduced again to be equal to that of each masker. Fourteen of the initial 20 participants passed the screening task. There was no significant difference between the participants who passed the screening and those who did not in terms of years of musical training. Although all the initial participants could perform the task at a minimum of 65–70 % accuracy, the high-performance cutoff of 80 % on all non-vocoded blocks was set to ensure that poor performance in the subsequent vocoded conditions was not due to participants’ inability to do the task in the absence of vocoding.
The test task consisted of the same stimuli described above but vocoded with the non-overlapping spectral channels. Eighty trials were completed for each channel condition (32, 48, and 64 channels) in each target position (LOW, MID, HIGH). These 80 trials were broken up into four runs of 20 trials each. Within each run, the vocoder condition remained the same. The entire experiment consisted of 36 runs (3 target positions × 3 channel conditions × 4 runs). The order of these runs was randomized for each participant and each set of runs. Most participants completed the task within one 2-h session.
Results and Discussion
The results from the 14 participants are shown in Fig. 2 in terms of percentage of correct responses. A generalized linear mixed model (GLMM) with logit link function was fitted to the binary scores of individual trials. The models were implemented in R (R Core Team 2018) using the lme4 package (Bates et al. 2014). To find the minimal adequate model (MAM), we followed the backward model selection method (Jaeger 2008) to drop factors with non-significant fixed effects from a full model including all possible fixed effects. The full model was first fitted with number of channels (channel: 32, 48, 64), position of target tone (position: LOW, MID, HIGH), and their interaction as fixed effects and intercept per subject as the random effect, i.e., in lme4 syntax as follows: response ~ channel × position + (1|subject). Because the likelihood ratio test (LRT) can be unreliable in testing fixed effects with relatively small sample sizes like our dataset (Bolker et al. 2009), the significance of fixed effects in the global model was tested with a parametric bootstrapping method using the afex package (Singmann et al. 2018), which has been found to be more robust than the LRT method (Luke 2017). To calculate the p value using a parametric bootstrap method, a reference distribution was formed with 1000 simulations, as suggested by the afex package documentation. A significant main effect of number of channels (p = 0.002), target position (p = 0.001), and their interaction (p = 0.04) was found, so the full model was kept as the MAM, and all following analyses were based on the MAM.
Fig. 2.
Percent correct scores for experiment 1 are plotted for each channel condition (32, 48, and 64 channels) for each position of the target (high, mid, and low). Each dot within the bar represents an individual subject. The black line in the middle of each bar denotes the mean, the lighter portions represent standard error of the mean, and the darker outer portions represent standard deviations.
Additional analyses included comparing the accuracy of each condition with chance level (50 % accuracy) using the lsmeans package (Lenth 2016) on the logit scale. Performance for all conditions was significantly above chance (p < 0.0001 for all comparisons, right-tailed, Bonferroni adjusted α = 0.05/9 = 0.0055) except when the target was in the LOW position and the number of channels was 32 (z.ratio = 2.228, p = 0.012, right-tailed). Because of the significant interaction between target tone position and number of channels, pairwise comparisons were performed between the 32-, 48-, and 64-channel conditions within each level of target position, with all pairwise comparison p values adjusted with Tukey’s method for comparing a family of three estimates. Among all pairs of channel number conditions, performance in conditions with more channels was significantly better than that with fewer channels (p < 0.02 for all comparisons), with the exception of no significant difference between the accuracy for the LOW target position for 48 vs 64 channels (z.ratio = − 1.849, p = 0.154).
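The comparisons against chance above were made with GLMM contrasts on the logit scale. As a simpler, purely illustrative analogue (not the analysis the study used), an exact right-tailed binomial test of an 80-trial score against 50 % correct, with the same Bonferroni-adjusted criterion, behaves as follows:

```python
from scipy.stats import binomtest

N_TRIALS = 80       # trials per condition, as in the experiment
ALPHA = 0.05 / 9    # the same Bonferroni-adjusted criterion (nine comparisons)

def above_chance(n_correct, n_trials=N_TRIALS, alpha=ALPHA):
    """Right-tailed exact binomial test of a score against 50 % correct.
    Returns the p value and whether it beats the adjusted criterion."""
    p = binomtest(n_correct, n_trials, p=0.5, alternative='greater').pvalue
    return p, p < alpha
```

Under this check, for example, 53/80 correct (about 66 %) clears the adjusted criterion, whereas 50/80 (62.5 %) does not, illustrating how strict the α = 0.0055 cutoff is for 80-trial runs.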
In the MAM, both the number of channels and the target position affected response accuracy when there was no spectral overlap between adjacent channels of the vocoder. Increasing the number of channels resulted in improved accuracy. However, even with 32 channels, participants could achieve performance significantly above chance for the MID and HIGH conditions. This is consistent with the findings of our previous study using single complex tones (Mehta and Oxenham 2017), where 32 channels were also required to produce performance significantly above chance when there was no spectral overlap between channels.
Experiment 2: Effect of Number of Spectral Channels and Spectral Overlap
Rationale
Although 32 channels were sufficient to produce above-chance performance, and good performance could be obtained with 48 channels, the conditions in experiment 1 contained no spectral overlap between vocoder channels. This is a very different situation from that experienced in CIs, where there is considerable interaction and overlap between adjacent electrodes. This overlap has been simulated in normal-hearing listeners with vocoders using filters with slopes as shallow as 8–12 dB/octave (e.g., Oxenham and Kreft 2014; O’Neill et al. 2019). Our earlier study (Mehta and Oxenham 2017) suggested that much steeper filter slopes (> 72 dB/octave) were required to accurately perceive the pitch of single complex tones. The aim of experiment 2 was to determine what filter slopes would be necessary to accurately hear out the pitch of one complex tone in the presence of other complex tones.
Stimuli and Procedure
The stimuli and procedure were the same as for experiment 1. The conditions differed based on the vocoder parameters. The number of channels was again 32, 48, or 64. However, instead of using essentially rectangular synthesis filters, the synthesis filters were Butterworth filters with slopes on either side of the center frequency of 72, 96, 120, or 144 dB/octave. As in experiment 1, the target F0 was presented in three different positions relative to the two masker complex tones (LOW, MID, HIGH). Eighty trials were completed for each channel condition and each filter-slope condition in each target position. These 80 trials were broken up into four runs of 20 trials each. Within each run, the vocoder condition remained the same. The entire experiment consisted of 144 runs (3 target positions × 3 channel conditions × 4 slopes × 4 runs). The order of these runs was randomized for each participant. Most participants completed the task within two 2-h sessions.
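The correspondence between Butterworth filter order and spectral slope used here (12th order for 72 dB/octave up to 24th order for 144 dB/octave, per the vocoder description) follows from the 6 dB/octave-per-order asymptote of a Butterworth response. A small numerical check, using an analog prototype so the asymptote is not distorted by sampling effects:

```python
import numpy as np
from scipy.signal import butter, freqs

def butterworth_slope_db_oct(order):
    """Asymptotic stop-band slope (dB/octave) of an order-N Butterworth
    low-pass, measured over one octave far above the normalized cutoff,
    where the response is close to its 20*N*log10(2) dB/octave asymptote."""
    b, a = butter(order, 1.0, btype='low', analog=True)
    _, h = freqs(b, a, worN=np.array([4.0, 8.0]))  # 2 and 3 octaves above fc
    mag_db = 20 * np.log10(np.abs(h))
    return mag_db[0] - mag_db[1]   # dB drop over one octave
```

This returns approximately 72 dB/octave for a 12th-order filter and 144 dB/octave for a 24th-order filter, matching the slope values tested in this experiment.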
Results and Discussion
The results of this experiment are shown in Fig. 3. Similar to the analysis of experiment 1, we fitted a global GLMM and dropped factors with non-significant fixed effects to obtain the MAM. The global GLMM included main effects of number of channels (channel: 32, 48, 64), position of target tone (position: LOW, MID, HIGH), and filter slope (slope: 72, 96, 120, and 144 dB/octave). All two- and three-way interactions were kept as fixed effects, and the random effect was the intercept per subject, i.e., in lme4 syntax as follows: response ~ channel × position × slope + (1|subject). Parametric bootstrap testing with a reference distribution formed with 1000 simulations showed a significant main effect of filter slope (p = 0.001), target position (p = 0.001), and number of channels (p = 0.01). Similar to experiment 1, the interaction of number of channels and target position was significant (p = 0.02), while the other interactions did not reach significance, including the interaction of filter slope and number of channels (p = 0.13), the interaction of filter slope and target position (p = 0.05), and the three-way interaction of all factors (p = 0.14). Therefore, the reduced model included all main effects and the interaction between number of channels and target position, in lme4 syntax as follows: response ~ channel + position + slope + position:channel + (1|subject). Although the marginally significant interactions were dropped, the reduced model was better than the global model according to both the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) (global model: AIC = 3072.3, BIC = 3228.5; reduced model: AIC = 3063.5, BIC = 3118.4; note that a lower AIC or BIC value indicates a better model). Thus, the reduced model was chosen to be the MAM for further analysis.
Fig. 3.
Percent correct scores for experiment 2 are plotted. Each subplot shows data for different amounts of spectral overlap conditions (72, 96, 120, 144 dB/octave), for each channel condition (32, 48, and 64 channels) and target positions (high, mid, and low) shown within each panel. Each dot within the bar represents an individual subject. The black line in the middle of each bar denotes the mean, the lighter portions represent standard error of the mean, and the darker outer portions represent standard deviations.
In the MAM, there was no interaction between slope and the other factors. Therefore, in the following analyses, any results for slope were averaged over target position and channel, while results for channels and target position were averaged over slope. Response accuracy was significantly above chance for all values of slope (p < 0.0001 for all conditions, right-tailed, Bonferroni adjusted α = 0.05/4 = 0.0125) except 72 dB/octave (z.ratio = 1.574, p = 0.0595, right-tailed). Response accuracy was also significantly above chance for all combinations of target position and channel number (p < 0.0004 for all, right-tailed, Bonferroni adjusted α = 0.05/9 = 0.0055), with the exception of the LOW target position in the 64-channel condition (z.ratio = 1.213, p > 0.11, right-tailed). Pairwise comparisons between filter-slope conditions with Tukey’s method for comparing a family of 4 estimates showed that performance in the 72 dB/octave condition was significantly worse than in all other conditions (p < 0.0001), and no significant difference was found between the other filter slopes (p > 0.5 for all). Similarly, because of the significant interaction term, pairwise comparisons between numbers of channels were performed within each level of target position, and p values were adjusted with Tukey’s method for comparing a family of 3 estimates. There was no significant difference between pairs of channel numbers when the target tone location was HIGH or MID (p > 0.06 for all). When the target tone location was LOW, performance with 64 channels was significantly worse than performance with 32 channels (z.ratio = 2.375, p = 0.046) or 48 channels (z.ratio = 3.608, p = 0.0009). There was no significant difference between 32 channels and 48 channels (z.ratio = − 1.233, p = 0.433).
The MAM suggested that all three factors affected perception of the target pitch and that the effect of slope was independent of (i.e., did not interact with) the other two factors. Participants’ accuracy was at or close to chance level with 72 dB/octave filter slopes. When the filter slopes were increased to 96 dB/octave, participants performed above chance level on average, but further increasing the filter slope to 120 or 144 dB/octave did not lead to significantly better accuracy. For those conditions with accuracy above chance level, changes in the number of channels did not affect the accuracy of detecting a target at any of the target positions.
Overall, even with 64 channels and filter slopes of 144 dB/octave, performance was above chance but still mediocre, averaging less than 60 % correct. It seems that for the accurate extraction of a pitch among other competing pitches (as found in experiment 1 for 48 channels or more), filter slopes greater than 144 dB/octave are required.
General Discussion
Summary of Results
The aim of this study was to determine the minimum spectral resolution (in terms of the number of channels and the steepness of the spectral slopes of each channel) needed to convey spectral pitch information via the resolved harmonics of a complex tone in the presence of other spectrally and temporally overlapping masker complex tones. The results suggest that 48 channels (corresponding to filter bandwidths of ~ 1/10th octave) are sufficient for high levels of performance (> 80 % correct), but only when there is no spectral overlap between channels (experiment 1). When spectral overlap was introduced, performance was above chance but remained relatively poor, even with slopes as steep as 144 dB/octave. In these cases, performance was limited by the filter slopes, so that increasing the number of channels from 32 to 64 had no significant effect (experiment 2).
Comparison of Vocoder Characteristics with Normal Auditory Filters
In experiment 1, good performance was achieved with 48 channels or more, corresponding to filter bandwidths of around 1/10th octave. These filters are somewhat narrower than the estimated human ERBs using simultaneous masking at moderate sound levels, which are around 1/6th octave at 2 kHz (Glasberg and Moore 1990), but are comparable to the human ERBs estimated using forward masking at low sound levels (Oxenham and Shera 2003). This finding suggests that the tolerance for pitch extracted from resolved harmonics is similar to the limits of the normal human auditory system, and may also explain why animals with broader auditory filters (such as the chinchilla or ferret) seem to rely more on the temporal-envelope pitch provided by unresolved harmonics (Shofner and Chaney 2013; Walker et al. 2019).
One important distinction between the vocoder simulations and the normal auditory system is the resolution with which the spectrum is sampled along the frequency axis. With the vocoder, resolution is limited to the spacing between channels (corresponding to about 10 channels or samples per octave in the case of the 48-channel vocoder), whereas resolution in the normal auditory system is limited by the number of inner hair cells along the length of the cochlea, which is estimated to be about 3500 in total (e.g., Wright et al. 1987), or about 350 per octave, assuming roughly even distribution on a logarithmic scale between 20 and 20,000 Hz. Therefore, even if the filter bandwidths are comparable between the vocoder and the normal auditory system, the filter density is greater in the normal auditory system than in the vocoder by more than an order of magnitude, leading to a potentially severe undersampling of the spectrum.
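The sampling-density comparison in this paragraph amounts to a short back-of-envelope calculation, sketched here under the stated assumptions (logarithmic spacing in both cases, and roughly 3500 inner hair cells spanning 20 to 20,000 Hz):

```python
import math

def per_octave(n, f_lo, f_hi):
    """Filters (or hair cells) per octave, assuming logarithmic spacing
    of n elements between f_lo and f_hi."""
    return n / math.log2(f_hi / f_lo)

vocoder_density = per_octave(48, 200, 6000)      # 48-channel vocoder
cochlear_density = per_octave(3500, 20, 20000)   # ~3500 inner hair cells
ratio = cochlear_density / vocoder_density
```

This gives roughly 10 channels per octave for the vocoder versus roughly 350 hair cells per octave for the cochlea, i.e., the cochlear sampling is denser by well over an order of magnitude, consistent with the figures quoted above.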
Perhaps the most important finding in the present study is that filter slopes of 144 dB/octave were still not sufficiently steep to enable good performance in this challenging pitch task (experiment 2). The effect of filter slopes on the spectrum is illustrated in Fig. 4, which compares the spectra produced by the vocoder with essentially no spectral overlap between channels (experiment 1) and the vocoder with 144 dB/octave filter slopes. It can be seen that even the relatively steep slopes of 144 dB/octave result in a strongly degraded representation of the spectrum, relative to the no-overlap conditions. Psychophysical tuning curves (PTCs) are probably the most direct way of assessing auditory filter slopes in the normal human auditory system. Again, the method can affect the estimates, but forward masking is believed to most accurately reflect the underlying cochlear tuning, as estimated with neural tuning curves (e.g., Sumner et al. 2018). Moore (1978) provided mean estimates of filter slopes under forward masking at 1 kHz of 135 dB/octave (range: 90–180 dB/octave) on the low-frequency slope and 390 dB/octave (range: 310–650 dB/octave) on the high-frequency slope. It is likely that slopes continue to steepen with increasing frequency (Shera et al. 2002). Thus, the 144 dB/octave slopes used in our vocoder may not have matched the selectivity of the normal auditory system, and seem not to have been sufficient to produce high levels of performance in this challenging task of hearing out one complex tone in the presence of other spectrally overlapping ones.
Fig. 4.
Spectrum of the three-tone mixture stimulus after vocoding with the No Overlap and 144 dB/oct conditions (upper and lower rows, respectively). The F0s of the tones were 353.55, 471.93, and 561.23 Hz.
Limitations and Future Directions
The purpose of this study was to test the limits of the ability of listeners to use low-numbered spectrally resolved harmonics to hear out the pitch of one complex tone in the presence of other interfering complex tones. The task was deliberately made challenging by filtering all the tones into the same spectral region and using equal-amplitude harmonics to rule out the possibility of detecting the pitch of the target via “spectral glimpsing,” which might have been possible had the spectral envelopes of the target and maskers differed. Of course, in real music (and speech), a target tone is rarely completely spectrally and temporally overlapped with any interfering tones. For instance, in the case of music, different instruments have different spectral envelopes so that one instrument may dominate in any given spectral region. Nevertheless, there are numerous conditions in which instruments do overlap in time and in spectrum. In addition, pitch based just on temporal-envelope cues is highly susceptible to interference, even when there is no spectral overlap between the target and the interferer (e.g., Kreft et al. 2013). It may, however, be informative to use the types of vocoders tested here with more natural stimuli (such as real music performances) to determine the degree to which the limits tested here generalize to more natural settings. We expect that the findings will at least generalize to different timbres (e.g., different spectral envelopes for all tones), as most studies have found that pitch (and pitch dominance of the lower harmonics) does not depend strongly on the shape of the spectral envelope (e.g., Plomp 1967; Ritsma 1967), unless the timbre changes from trial to trial (e.g., Allen and Oxenham 2014).
The discussion of pitch extraction in the current set of experiments has focused mainly on interpretations based on spectral (or place) pitch, although alternative pitch codes could be purely temporal or spectrotemporal. There are several reasons why our study focuses on the spectral, or place, interpretation of pitch. First, CI users are generally insensitive to pulse-timing cues beyond about 300 Hz (Carlyon et al. 2010), suggesting that timing cues from the temporal fine structure (TFS) of resolved harmonics would generally not be available to them. Second, studies with normal-hearing listeners have suggested that timing cues based on TFS are neither sufficient (Oxenham et al. 2004) nor necessary (Oxenham et al. 2011; Lau et al. 2017) for complex pitch perception. Therefore, although a role for TFS information in the vocoded stimuli cannot be completely ruled out, it seems reasonable to assume for the purposes of this study that the spectral representation of the harmonics is of most relevance for CI users.
A final limitation to consider is a general one concerning the use of vocoder simulations with acoustic stimuli in normal-hearing listeners. Although some early studies in anesthetized animals showed broad activation of auditory cortex from electrical stimulation of the cochlea via a CI (Raggio and Schreiner 1999; Bierer and Middlebrooks 2002), in line with expectations from acoustic simulations of CI stimulation in humans (e.g., Oxenham and Kreft 2014; O’Neill et al. 2019), recent studies in awake marmosets have reported that CI stimulation produces much less auditory cortical activation than comparable acoustic stimulation (Johnson et al. 2016, 2017). Such differences may in part reflect the poor spectral resolution provided by the CI, but they may also indicate more fundamental differences in the response of the auditory system to electrical stimulation, which would not be captured by a vocoder simulation. Nevertheless, if vocoders with appropriate spectral smearing are considered a “best-case” simulation of CI processing, the main conclusion of this study, that perception of multiple simultaneous pitches is unlikely to be achievable with current CI technology, remains valid. In the longer term, sufficient spectral resolution may be achievable via other approaches, such as different implantation sites (Middlebrooks and Snyder 2007, 2010) and/or the use of neurotrophic agents (Pinyon et al. 2014; Wise et al. 2016) to reduce the physical distance between the electrodes and the stimulated neurons.
Conclusions
The results from this study confirm that pitch perception based on spectrally resolved harmonics is highly unlikely to be achieved with current CIs, and extend that conclusion to conditions with multiple overlapping harmonic tones. This finding implies that CI users will remain dependent on the pitch conveyed by periodicity in the temporal envelope (Fu et al. 2004; Kong et al. 2011), which is generally limited to F0s below about 300 Hz (Carlyon et al. 2010; Kong and Carlyon 2010; Macherey et al. 2011; Cosentino et al. 2016) and cannot convey more than one pitch from stimuli presented simultaneously in the same spectral region (Carlyon 1996; Kreft et al. 2013; Mehta and Oxenham 2018). Because multiple-pitch perception is key to many aspects of music perception, our results suggest that presenting music adequately will remain a challenge for current CI technology, providing further impetus to seek alternative stimulation methods and/or sites that overcome the limitations of poor spectral resolution.
Funding Information
This work was supported by NIH grant R01DC005216 (AJO).
Compliance with Ethical Standards
All participants provided written informed consent and were compensated for their time. The experiments were conducted at the University of Minnesota, Twin Cities. The University of Minnesota Institutional Review Board approved all protocols.
Conflict of Interest
The authors declare that they have no conflict of interest.
References
- Allen EJ, Oxenham AJ. Symmetric interactions and interference between pitch and timbre. J Acoust Soc Am. 2014;135:1371–1379. doi: 10.1121/1.4863269
- Bates D, Mächler M, Bolker B, Walker S (2014) Fitting linear mixed-effects models using lme4
- Bernstein JG, Oxenham AJ. Pitch discrimination of diotic and dichotic tone complexes: harmonic resolvability or harmonic number? J Acoust Soc Am. 2003;113:3323–3334. doi: 10.1121/1.1572146
- Bernstein JGW, Oxenham AJ. The relationship between frequency selectivity and pitch discrimination: sensorineural hearing loss. J Acoust Soc Am. 2006;120:3929–3945. doi: 10.1121/1.2372452
- Bierer JA, Middlebrooks JC. Auditory cortical images of cochlear-implant stimuli: dependence on electrode configuration. J Neurophysiol. 2002;87:478–492. doi: 10.1152/jn.00212.2001
- Bolker BM, Brooks ME, Clark CJ, Geange SW, Poulsen JR, Stevens MH, White JS. Generalized linear mixed models: a practical guide for ecology and evolution. Trends Ecol Evol. 2009;24:127–135. doi: 10.1016/j.tree.2008.10.008
- Carlyon RP. Encoding the fundamental frequency of a complex tone in the presence of a spectrally overlapping masker. J Acoust Soc Am. 1996;99:517–524. doi: 10.1121/1.414510
- Carlyon RP, Deeks JM, McKay CM. The upper limit of temporal pitch for cochlear-implant listeners: stimulus duration, conditioner pulses, and the number of electrodes stimulated. J Acoust Soc Am. 2010;127:1469–1478. doi: 10.1121/1.3291981
- Chatterjee M, Peng S-C. Processing F0 with cochlear implants: modulation frequency discrimination and speech intonation recognition. Hear Res. 2008;235:143–156. doi: 10.1016/j.heares.2007.11.004
- Cosentino S, Carlyon RP, Deeks JM, Parkinson W, Bierer JA. Rate discrimination, gap detection and ranking of temporal pitch in cochlear implant users. J Assoc Res Otolaryngol. 2016;17:371–382. doi: 10.1007/s10162-016-0569-5
- Deroche MLD, Kulkarni AM, Christensen JA, et al. Deficits in the sensitivity to pitch sweeps by school-aged children wearing cochlear implants. Front Neurosci. 2016;10. doi: 10.3389/fnins.2016.00073
- Ewert SD. AFC - a modular framework for running psychoacoustic experiments and computational perception models. Proc Conf Acoust AIA-DAGA. 2013;2013:1326–1329.
- Fielden CA, Kluk K, Boyle PJ, McKay CM. The perception of complex pitch in cochlear implants: a comparison of monopolar and tripolar stimulation. J Acoust Soc Am. 2015;138:2524–2536. doi: 10.1121/1.4931910
- Fu Q-J, Chinchilla S, Galvin JJ. The role of spectral and temporal cues in voice gender discrimination by normal-hearing listeners and cochlear implant users. J Assoc Res Otolaryngol. 2004;5:253–260. doi: 10.1007/s10162-004-4046-1
- Galvin JJ, Fu Q-J, Nogaki G. Melodic contour identification by cochlear implant listeners. Ear Hear. 2007;28:302–319. doi: 10.1097/01.aud.0000261689.35445.20
- Gaudrain E, Başkent D. Discrimination of voice pitch and vocal-tract length in cochlear implant users. Ear Hear. 2018;39:226–237. doi: 10.1097/AUD.0000000000000480
- Glasberg BR, Moore BCJ. Auditory filter shapes in subjects with unilateral and bilateral cochlear impairments. J Acoust Soc Am. 1986;79:1020–1033. doi: 10.1121/1.393374
- Glasberg BR, Moore BCJ. Derivation of auditory filter shapes from notched-noise data. Hear Res. 1990;47:103–138. doi: 10.1016/0378-5955(90)90170-T
- Graves JE, Oxenham AJ. Pitch discrimination with mixtures of three concurrent harmonic complexes. J Acoust Soc Am. 2019;145:2072–2083. doi: 10.1121/1.5096639
- Houtgast T. Psychophysical evidence for lateral inhibition in hearing. J Acoust Soc Am. 1972;51:1885–1894. doi: 10.1121/1.1913048
- Houtsma AJM, Smurzynski J. Pitch identification and discrimination for complex tones with many harmonics. J Acoust Soc Am. 1990;87:304–310. doi: 10.1121/1.399297
- Jaeger TF. Categorical data analysis: away from ANOVAs (transformation or not) and towards logit mixed models. J Mem Lang. 2008;59:434–446. doi: 10.1016/j.jml.2007.11.007
- Johnson LA, Santina CCD, Wang X. Selective neuronal activation by cochlear implant stimulation in auditory cortex of awake primate. J Neurosci. 2016;36:12468–12484. doi: 10.1523/JNEUROSCI.1699-16.2016
- Johnson LA, Santina CCD, Wang X. Representations of time-varying cochlear implant stimulation in auditory cortex of awake marmosets (Callithrix jacchus). J Neurosci. 2017;37:7008–7022. doi: 10.1523/JNEUROSCI.0093-17.2017
- Kong Y-Y, Carlyon RP. Temporal pitch perception at high rates in cochlear implants. J Acoust Soc Am. 2010;127:3114–3123. doi: 10.1121/1.3372713
- Kong Y-Y, Cruz R, Jones AJ, Zeng F-G. Music perception with temporal cues in acoustic and electric hearing. Ear Hear. 2004;25:173–185. doi: 10.1097/01.AUD.0000120365.97792.2F
- Kong Y-Y, Mullangi A, Marozeau J, Epstein M. Temporal and spectral cues for musical timbre perception in electric hearing. J Speech Lang Hear Res. 2011;54:981–994. doi: 10.1044/1092-4388
- Kreft HA, Nelson DA, Oxenham AJ. Modulation frequency discrimination with modulated and unmodulated interference in normal hearing and in cochlear-implant users. J Assoc Res Otolaryngol. 2013;14:591–601. doi: 10.1007/s10162-013-0391-2
- Lau BK, Mehta AH, Oxenham AJ. Superoptimal perceptual integration suggests a place-based representation of pitch at high frequencies. J Neurosci. 2017;37:9013–9021. doi: 10.1523/JNEUROSCI.1507-17.2017
- Lenth RV (2016) Least-squares means: the R package lsmeans. J Stat Softw 69. doi: 10.18637/jss.v069.i01
- Looi V, McDermott H, McKay C, Hickson L. Music perception of cochlear implant users compared with that of hearing aid users. Ear Hear. 2008;29:421–434. doi: 10.1097/AUD.0b013e31816a0d0b
- Looi V, McDermott H, McKay C, Hickson L. Comparisons of quality ratings for music by cochlear implant and hearing aid users. Ear Hear. 2007;28:59S–61S. doi: 10.1097/AUD.0b013e31803150cb
- Luke SG. Evaluating significance in linear mixed-effects models in R. Behav Res Methods. 2017;49:1494–1502. doi: 10.3758/s13428-016-0809-y
- Macherey O, Deeks JM, Carlyon RP. Extending the limits of place and temporal pitch perception in cochlear implant users. J Assoc Res Otolaryngol. 2011;12:233–251. doi: 10.1007/s10162-010-0248-x
- McDermott HJ. Music perception with cochlear implants: a review. Trends Amplif. 2004;8:49–82. doi: 10.1177/108471380400800203
- Mehta AH, Oxenham AJ. Vocoder simulations explain complex pitch perception limitations experienced by cochlear implant users. J Assoc Res Otolaryngol. 2017;18:789–802. doi: 10.1007/s10162-017-0632-x
- Mehta AH, Oxenham AJ. Fundamental-frequency discrimination based on temporal-envelope cues: effects of bandwidth and interference. J Acoust Soc Am. 2018;144:EL423–EL428. doi: 10.1121/1.5079569
- Micheyl C, Hunter C, Oxenham AJ. Auditory stream segregation and the perception of across-frequency synchrony. J Exp Psychol Hum Percept Perform. 2010;36:1029–1039. doi: 10.1037/a0017601
- Micheyl C, Keebler MV, Oxenham AJ. Pitch perception for mixtures of spectrally overlapping harmonic complex tones. J Acoust Soc Am. 2010;128:257–269. doi: 10.1121/1.3372751
- Middlebrooks JC, Snyder RL. Auditory prosthesis with a penetrating nerve array. J Assoc Res Otolaryngol. 2007;8:258–279. doi: 10.1007/s10162-007-0070-2
- Middlebrooks JC, Snyder RL. Selective electrical stimulation of the auditory nerve activates a pathway specialized for high temporal acuity. J Neurosci. 2010;30:1937–1946. doi: 10.1523/JNEUROSCI.4949-09.2010
- Moore BCJ. The roles of temporal envelope and fine structure information in auditory perception. Acoust Sci Technol. 2019;40:61–83. doi: 10.1250/ast.40.61
- Moore BCJ. Psychophysical tuning curves measured in simultaneous and forward masking. J Acoust Soc Am. 1978;63:524–532. doi: 10.1121/1.381752
- Moore BCJ, Huss M, Vickers DA, Glasberg BR, Alcántara JI. A test for the diagnosis of dead regions in the cochlea. Br J Audiol. 2000;34:205–224. doi: 10.3109/03005364000000131
- Nemer JS, Kohlberg GD, Mancuso DM, Griffin BM, Certo MV, Chen SY, Chun MB, Spitzer JB, Lalwani AK. Reduction of the harmonic series influences musical enjoyment with cochlear implants. Otol Neurotol. 2017;38:31–37. doi: 10.1097/MAO.0000000000001250
- O’Neill ER, Kreft HA, Oxenham AJ. Speech perception with spectrally non-overlapping maskers as measure of spectral resolution in cochlear implant users. J Assoc Res Otolaryngol. 2019;20:151–167. doi: 10.1007/s10162-018-00702-2
- Oxenham AJ. Pitch perception. J Neurosci. 2012;32:13335–13338. doi: 10.1523/JNEUROSCI.3815-12.2012
- Oxenham AJ. How we hear: the perception and neural coding of sound. Annu Rev Psychol. 2018;69:27–50. doi: 10.1146/annurev-psych-122216-011635
- Oxenham AJ, Bernstein JGW, Penagos H. Correct tonotopic representation is necessary for complex pitch perception. Proc Natl Acad Sci U S A. 2004;101:1421–1425. doi: 10.1073/pnas.0306958101
- Oxenham AJ, Kreft HA. Speech perception in tones and noise via cochlear implants reveals influence of spectral resolution on temporal processing. Trends Hear. 2014;18:2331216514553783. doi: 10.1177/2331216514553783
- Oxenham AJ, Micheyl C, Keebler MV, Loper A, Santurette S. Pitch perception beyond the traditional existence region of pitch. Proc Natl Acad Sci. 2011;108:7629–7634. doi: 10.1073/pnas.1015291108
- Oxenham AJ, Shera CA. Estimates of human cochlear tuning at low levels using forward and simultaneous masking. J Assoc Res Otolaryngol. 2003;4:541–554. doi: 10.1007/s10162-002-3058-y
- Parncutt R, Reisinger D, Fuchs A, Kaiser F. Consonance and prevalence of sonorities in Western polyphony: roughness, harmonicity, familiarity, evenness, diatonicity. J New Music Res. 2019;48:1–20. doi: 10.1080/09298215.2018.1477804
- Pinyon JL, Tadros SF, Froud KE, Wong ACY, Tompson IT, Crawford EN, Ko M, Morris R, Klugmann M, Housley GD. Close-field electroporation gene delivery using the cochlear implant electrode array enhances the bionic ear. Sci Transl Med. 2014;6:233ra54. doi: 10.1126/scitranslmed.3008177
- Plomp R. Pitch of complex tones. J Acoust Soc Am. 1967;41:1526–1533. doi: 10.1121/1.1910515
- Qin MK, Oxenham AJ. Effects of envelope-vocoder processing on F0 discrimination and concurrent-vowel identification. Ear Hear. 2005;26:451–460. doi: 10.1097/01.aud.0000179689.79868.06
- Raggio MW, Schreiner CE. Neuronal responses in cat primary auditory cortex to electrical cochlear stimulation. III. Activation patterns in short- and long-term deafness. J Neurophysiol. 1999;82:3506–3526. doi: 10.1152/jn.1999.82.6.3506
- R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
- Ritsma RJ. Frequencies dominant in the perception of the pitch of complex sounds. J Acoust Soc Am. 1967;42:191–198. doi: 10.1121/1.1910550
- Rosen S, Souza P, Ekelund C, Majeed AA. Listening to speech in a background of other talkers: effects of talker number and noise vocoding. J Acoust Soc Am. 2013;133:2431–2443. doi: 10.1121/1.4794379
- Shera CA, Guinan JJ, Oxenham AJ. Revised estimates of human cochlear tuning from otoacoustic and behavioral measurements. Proc Natl Acad Sci. 2002;99:3318–3323. doi: 10.1073/pnas.032675099
- Shofner WP, Chaney M. Processing pitch in a nonhuman mammal (Chinchilla laniger). J Comp Psychol. 2013;127:142–153. doi: 10.1037/a0029734
- Singmann et al. (2018) afex: analysis of factorial experiments. R package version 0.13-145, 1–44. Comprehensive R Archive Network (CRAN)
- Stickney GS, Zeng F-G, Litovsky R, Assmann PF. Cochlear implant speech recognition with speech maskers. J Acoust Soc Am. 2004;116:1081–1091. doi: 10.1121/1.1772399
- Sumner CJ, Wells TT, Bergevin C, Sollini J, Kreft HA, Palmer AR, Oxenham AJ, Shera CA. Mammalian behavior and physiology converge to confirm sharper cochlear tuning in humans. Proc Natl Acad Sci. 2018;115:11322–11326. doi: 10.1073/pnas.1810766115
- Walker KM, Gonzalez R, Kang JZ, et al. Across-species differences in pitch perception are consistent with differences in cochlear filtering. eLife. 2019;8:e41626. doi: 10.7554/eLife.41626
- Whitmal NA, Poissant SF, Freyman RL, Helfer KS. Speech intelligibility in cochlear implant simulations: effects of carrier type, interfering noise, and subject experience. J Acoust Soc Am. 2007;122:2376–2388. doi: 10.1121/1.2773993
- Wise AK, Tan J, Wang Y, Caruso F, Shepherd RK. Improved auditory nerve survival with nanoengineered supraparticles for neurotrophin delivery into the deafened cochlea. PLoS ONE. 2016;11:e0164867. doi: 10.1371/journal.pone.0164867
- Wright et al. (1987) Hair cell distributions in the normal human cochlea. Acta Otolaryngol (Stockh) 104:4–48. doi: 10.3109/00016488709098971
- Zeng F-G, Tang Q, Lu T. Abnormal pitch perception produced by cochlear implant stimulation. PLoS ONE. 2014;9:e88662. doi: 10.1371/journal.pone.0088662