Abstract
Previously, selection of l channels was prioritized according to formant frequency locations in an l-of-n-of-m–based signal processing strategy to provide important voicing information independent of listening environment for cochlear implant (CI) users. In this study, ideal, or ground truth, formants were incorporated into the selection stage to determine the effect of estimation accuracy on (1) subjective speech intelligibility, (2) objective channel selection patterns, and (3) objective stimulation patterns (current). An average +11% improvement (p < 0.05) was observed across six CI users in quiet, but not in noise or reverberation conditions. Analogous increases in channel selection and stimulated current were observed in the upper range of F1, together with a decrease across mid-frequencies accompanied by higher corresponding current, both at the expense of noise-dominant channels. Objective channel selection patterns were analyzed a second time to determine the effects of estimation approach and number of selected channels (n). A significant effect of estimation approach was observed only in the combined noise and reverberation condition, with minor differences in channel selection and significantly decreased stimulated current. Results suggest that estimation method, accuracy, and number of channels in the proposed strategy using ideal formants may improve intelligibility when the corresponding stimulated current of formant channels is not masked by noise-dominant channels.
I. INTRODUCTION
Cochlear implants (CI) serve as a solution to sensorineural hearing loss by simulating the sensation of sound with the delivery of electrical current directly to an intracochlear electrode array. In order to generate an electric signal representative of the acoustic input, various signal processing strategies are used to perform time-frequency analysis, map amplitudes to current values, map filterbands to corresponding electrodes, and maintain a balance between the presentation of spectral and temporal information in order to optimize speech recognition. A subset of popular signal processing strategies, referred to as n-of-m strategies, selects n channels of m total channels on a frame-by-frame basis instead of including all channels in the selection [i.e., only n channels are stimulated in an interleaved manner using the Continuous Interleaved Sampling (CIS) method instead of stimulating m channels either interleaved or simultaneously] (Holden et al., 2005; McKay et al., 1992; Skinner et al., 1991; Skinner et al., 2002; Wilson et al., 1991; Wilson et al., 1993). The selection of n channels in a commercial strategy, based on the Advanced Combination Encoding (ACE) from Cochlear Ltd. (Macquarie University, Sydney, Australia), is determined by spectral energy across m total channels synonymous with the total number of electrodes. In most acoustic conditions, the n-highest channels (out of m total) represent the source sound or target speaker, similar to peak-picking strategies. While CI speech recognition with ACE is high in quiet conditions, performance has been shown to decrease in the presence of noise and reverberation (Fetterman and Domico, 2002; Fu and Nogaki, 2005; Hazrati and Loizou, 2012; Neuman et al., 2010). 
It is hypothesized that spectral selection of channels in this manner may not be appropriate for speech-in-noise listening situations, or that the spectral information in the subset of selected channels (e.g., peaks of the signal) is insufficient for adequate speech-in-noise perception.
Researchers have proposed various selection criteria for n channels to improve intelligibility deficits in naturalistic, noisy conditions (Büchner et al., 2008; Hazrati and Loizou, 2013; Kals et al., 2010; Kludt et al., 2021; Nogueira et al., 2005, 2016; Saba et al., 2018; Tabibi et al., 2020). One successful method of re-addressing selection was proposed by Nogueira et al., who used a masking function to prevent selection of adjacent channels, reducing channel interactions that result in stimulation of similar regions of the cochlea; this is referred to as the psychoacoustic ACE (PACE) strategy (Büchner et al., 2008; Nogueira et al., 2005). An average improvement of +8% was observed in CI listeners at +15 dB SNR in speech-shaped noise. Comparable or slightly better performance was achieved with a smaller number of channels using the PACE (or MP3000) strategy as compared to ACE (Büchner et al., 2008; Buechner et al., 2011; Kludt et al., 2021; Nogueira et al., 2005). Kals et al. further proposed a selection criterion that assigns adjacent channels into "Selected Groups," where only a single channel within a limited number of "Selected Groups" is stimulated to ensure the maximum spatial distribution of channels. While no significant improvement in intelligibility was found in speech-shaped noise conditions, comparable performance to the CIS strategy was achieved with about a third of the total stimulated channels (Kals et al., 2010). Previously, Saba et al. proposed a selection criterion based on the location of formant frequencies to increase the amount of salient voicing information (Saba, 2021; Saba et al., 2018). A subset of the n selected channels, l, was reserved for channels spanning the location of the formant estimate, and the remaining channels were selected using the energy-based criterion used in ACE processing (Saba et al., 2018).
Significant improvements of +12.3% were observed for CI listeners in babble noise using the formant-ACE (FACE) strategy, but not for speech shaped noise or reverberation. These studies demonstrate how channel selection can affect spectral representation and intelligibility for CI listeners.
Formant frequencies, which are resonances within the speech signal related to the vocal tract structure, provide important phonemic knowledge and voicing characteristics to the listener. These cues, in addition to formant transitions (Iverson et al., 2006), spectral contrast (Loizou and Poroy, 2001), and duration (Donaldson et al., 2015), contribute to better recognition of vowels, which is known to impact speech understanding more than the recognition of consonants (Kewley-Port et al., 2007). Many formant estimation techniques and approaches exist and can be employed across a broad range of computational resources (e.g., real-time operation with minimum perceptual delays, offline server-dependent processing, etc.). Linear predictive coding (LPC) (Snell and Milinazzo, 1993), line spectral pairs (LSP) (Deller et al., 2000), all-pole representation of speech (El-Jaroudi and Makhoul, 1991), peak-picking (Chen and Loizou, 2004), short-term spectrum sampling (Flanagan, 1956), and the chirp z-transform (CZT) (Rabiner et al., 1969; Schafer and Rabiner, 1970) are common in the field of speech processing and have been shown to operate either in real time or with minimal processing delays. These approaches are fairly robust for quiet or clean speech signals, but accuracy may decrease in the presence of noise (Chen and Loizou, 2004) or with inappropriate parameters, such as filter order, bandwidth, and erroneous root-solving (Vallabha and Tuller, 2002). In addition to noise-related estimation errors, explicit estimation can be difficult when two formants are in close proximity (e.g., back vowels, /a/) (Vallabha and Tuller, 2002), especially for spectral-based approaches in CI signal processing, where the bandwidths of filters are broad and the resolution of bandpass filters is constrained within a limited dynamic range (Friesen et al., 2001; Henry and Turner, 2003; Rubinstein, 2004).
It is important to note that spectral representation is degraded for CI listeners due to a number of factors (Zeng et al., 2008), such as a dynamic range <30 dB (Fu and Shannon, 1999), inactive or dead regions of spiral ganglion cells (Moore, 2004), broad filter banks and/or a limited number of electrode stimulation sites (Friesen et al., 2001; Fu and Shannon, 2002), tonotopic mismatch (Faulkner et al., 2003; Oxenham et al., 2004), current spreading and/or adjacent channel/electrode interaction (Pfingst et al., 2001; Stickney et al., 2006; Zhu et al., 2012), and spectral smearing (Fu and Nogaki, 2005; Shannon et al., 1995). Specifically for n-of-m strategies and other envelope-based techniques, formant information is provided to the CI listener via amplitude-based cues across multiple channels. Of the many factors that can affect the saliency of these cues, researchers have developed solutions focused on the signal-to-noise ratio (SNR) and the number of stimulated channels (Hu and Loizou, 2008; Li and Loizou, 2008; Xu and Pfingst, 2008). At low SNRs, these important cues may not be as effective due to informational masking and decreased spectral contrast, or may not be provided to the implant listener within the subset of n channels using an energy-based selection criterion. The number of available channels (corresponding to electrodes) for stimulation and the spectral resolution of the filter banks defining each channel also affect the delivery of harmonic and pitch information. Varying the number of channels has been shown to impact vowel and consonant recognition differently. Early studies with older implant systems reported that a minimum of four to six channels is sufficient to provide spectral information to CI listeners; however, asymptotic performance was achieved after eight channels (Dorman et al., 1997; Friesen et al., 2001; Shannon et al., 1995).
A recent study by Croghan et al. observed that increasing the number of channels is more beneficial for speech recognition in CI subjects with better spectrotemporal resolution (Croghan et al., 2017).
In the present study, a formant-based selection framework initially proposed by Saba et al. is used to determine the effects of estimation accuracy, estimation approach, and the number of selected channels dictated by n-maxima. In the previous study by the authors, minor improvements in speech intelligibility (2.0–7.7%) were observed at two different SNRs for speech-shaped noise and negligible differences in intelligibility (−4.3 to +7.0%) in simulated reverberation conditions (Saba et al., 2018). When voice activity detection was used prior to formant estimation, intelligibility increased, as did the number of differences in channel selection in the mid-frequency range most commonly associated with the second formant (F2), measured across all noise and noise-free conditions. The results from the previous study suggest that improved estimation accuracy may affect the selection of channels and may thus impact speech intelligibility. Here, a subjective evaluation of intelligibility is proposed to assess ceiling-level performance for CI listeners when the estimation accuracy is as close to ground truth as possible, in an attempt to remove the dependence on estimation errors or non-optimal estimation accuracy. Additionally, an objective evaluation is proposed to investigate the effects of estimation approach and number of selected channels by analyzing channel selection and stimulated current specific to each simulated listening condition (e.g., +10 dB SNR speech-shaped noise, reverberation with T60 = 600 ms, etc.). Analysis of different estimation approaches may provide suggestions for real-time performance, whereas analysis of different n-maxima values may provide insight into the impact of the l-of-n-of-m strategy across the CI population with varying numbers of active electrodes for selection.
Thus, the results of the proposed subjective and objective experiments can be used to determine the feasibility of a signal processing strategy using the formant-priority channel selection criterion. The investigation of feasibility can also be used to indicate whether this criterion can improve intelligibility for CI listeners broadly in challenging listening environments with noise and/or reverberation.
II. METHODS
A. Experimental design
An analysis of channel selection was performed subjectively and objectively. For the subjective evaluation, speech intelligibility was recorded with six CI subjects for two variations of a formant-based channel selection criterion embedded within the n-of-m signal processing framework: ideal formant-ACE (ID-FACE) and ideal formant-ACE-enhanced (ID-FACE+). The selection of n channels and the resulting channel-specific stimulation current were quantified objectively to assess patterns associated with various levels of simulated speech-shaped noise and reverberation conditions known to reduce speech intelligibility. This experiment tested the hypothesis that robust formant estimates in noise-free conditions, unlike degraded estimates in the presence of noise and reverberation, will ensure the selection of channels associated with all three formants. As a follow-up to Saba et al. (2018), the use of ideal formant frequencies is hypothesized to demonstrate ceiling-level performance for the formant-priority selection criterion. For the objective evaluation, estimation approaches and the total number of selected channels were varied to determine the effect on channel selection in comparison to the ID-FACE strategy. Together, these data are used to determine which factors of the proposed selection criterion ensure selection (and/or stimulation) of channels associated with formant frequencies.
B. Signal processing
1. Cochlear implant signal processing strategies
In this study, a commercial signal processing strategy, the Advanced Combination Encoding (ACE), is used as the n-of-m framework, where n channels are selected out of m electrodes based on the n highest spectral energy bands corresponding to each electrode. The input signal is pre-emphasized to balance energy differences between high and low frequencies. Time-frequency analysis is used in addition to envelope detection and spectral energy calculation to map the power spectral densities from the fast Fourier transform (FFT) to individual channels, which correspond to individual intracochlear electrodes. Prior to radio frequency generation and transmission, the signal is passed through a logarithmic compression function that normalizes the power spectral density into current, represented as clinical levels within the dynamic range specified in the clinical mapping parameters (MAP) of each CI user.
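The maxima-selection stage described above can be sketched as follows. This is a minimal illustration of n-of-m selection on a single analysis frame, not Cochlear's implementation; the function name and inputs are hypothetical:

```python
def select_n_of_m(channel_energies, n):
    """ACE-style maxima selection for one analysis frame: return the indices
    of the n channels with the highest spectral energy, in channel order."""
    ranked = sorted(range(len(channel_energies)),
                    key=lambda i: channel_energies[i], reverse=True)
    return sorted(ranked[:n])
```

Only the n selected channels would then be compressed, mapped to clinical levels, and stimulated in an interleaved manner.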
Previously, a channel prioritization algorithm was used in the FACE strategy to ensure the selection of the channels (ranging from one to three) where formant frequencies were present (Saba et al., 2018). After formant channels are identified and selected, the channel prioritization algorithm selects the remaining number of channels up to n-maxima according to the highest spectral energy, as in the ACE strategy. Thus, channel selection for this formant-based strategy uses an l-of-n-of-m selection criterion (Saba et al., 2018), where l refers to the number of channels corresponding to the formants, ranging between 1 and 3. Figures 1(E) and 1(F) illustrate differences in channel selection for the word "around," where the energy-based selection strategy, ACE, did not stimulate the channels where F2 and F3 were located. Ideal (ID) formant frequency estimates are used here instead of the enhanced formant estimation algorithm in Saba et al. (2018). This selection criterion was embedded within the framework of ACE and is referred to as ID-FACE; thus, the only difference between ACE and ID-FACE is the channel selection stage. The block diagram of ID-FACE is shown in Fig. 2. Next, all power spectral density values are passed through the same logarithmic compression function as in ACE, where the output is constrained between base and saturation levels. Therefore, it is possible that ID-FACE selects channels prior to the compression stage that will not result in a current value above the base level and thus will not be stimulated. The resulting current values corresponding to the selected channels using ID-FACE are not modified; that is, channel selection does not ensure the channel is stimulated. Figures 1(B)–1(D) provide the electric representation of speech using electrodes and current (or amplitude) values.
Stimulation common to both strategies is denoted by black vertical stimulus lines; stimulation from priority formant channels is illustrated in green, and red lines indicate a channel that was removed from the original spectral energy–based selection to ensure the selection of the formant channel (Saba, 2021). Channel and electrode selection are referred to interchangeably, such that channels are ordered in an increasing sequence from low to high frequencies (i.e., 1–22) and electrodes are ordered in a decreasing sequence (i.e., from 22–1). Both naming conventions refer to positions from the apex to the base of the cochlea, or from apical electrodes (low frequency) to basal electrodes (high frequency).
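The formant-priority (l-of-n-of-m) criterion can be sketched in the same minimal form: the l formant channels are reserved first, and the remainder of the n maxima are filled by spectral energy, as in ACE. Function names and inputs are illustrative, not the authors' code:

```python
def formant_priority_select(energies, formant_channels, n):
    """Reserve up to l formant channels, then fill the remaining n - l
    slots with the highest-energy channels (ACE-style maxima)."""
    selected = sorted(set(formant_channels))[:n]
    remaining = sorted((i for i in range(len(energies)) if i not in selected),
                       key=lambda i: energies[i], reverse=True)
    return sorted(selected + remaining[:n - len(selected)])
```

With an empty formant set, this reduces to plain energy-based n-of-m selection, which is why the only difference between ACE and ID-FACE is the selection stage.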
FIG. 1.
(Color online) (A) Spectrogram of the word "around" with formant plots from the ideal estimates in red, green, and blue for F1, F2, and F3, respectively. Difference in stimulation between ACE and ID-FACE for an 8 ms frame of the vowel /a/ at 2.18 ms (highlighted in yellow) shown using electrodograms for (B) the F1 range (electrodes 22–17, 0.25–0.875 kHz), (C) the F2 range (electrodes 17–7, 0.875–3.3 kHz), and (D) the F3 range (electrodes 12–6, 1.7–3.8 kHz), where green stimulus lines indicate the addition of a new electrode with ID-FACE and red stimulus lines indicate the removal of an electrode selected using ACE. Individual electrode selection using (E) ACE and (F) ID-FACE. Shaded regions and individual bars indicate F1, F2, and F3 in red, blue, and green, respectively. The estimate from a 12th-order LPC is shown as the line above the bars.
FIG. 2.
(Color online) (A) Block diagram of the signal processing strategies ID-FACE and ID-FACE+. The schematic inside the l channel boosting block, used in ID-FACE+, occurs in three stages: l channels below the base level (BL) for stimulation are increased to the base level, the amount of increase is applied to all l channels to preserve the spectral slope, and the remaining (n-l) channels are reduced by the amount that the l channels were increased.
To ensure formant channels are both selected and stimulated, a three-stage channel boosting algorithm was designed and incorporated after l-of-n-of-m channel selection; this strategy is referred to as ID-FACE+ (Saba, 2021). The boosting algorithm (see the l channel boosting block in Fig. 2) ensures that the corresponding current of formant channels is above the threshold constraint (base level for stimulation) after the compression and channel mapping stages. The input to the channel boosting algorithm is the l channels associated with the ideal formant frequency values. First, the resulting power spectral density (PSD) of each formant channel is compared to the overall stimulation base level set by Cochlear Ltd. for electrical stimulation according to the amplitude of the acoustic signal. If the PSD of the formant channel is below base level, the channel is assigned to base level. For frames where more than one channel is above the base level, the individual channel is boosted during the compression stage according to the ratio of the PSD below base level to the minimum PSD value of the frame. When boosting occurs for one formant channel, the spectral slopes of the remaining formant channels are preserved. For example, if the formant channel associated with F1 is below base level, the channels associated with F2 and F3 are boosted in a similar manner to preserve the spectral slope. Boosting is applied to the l formant channels by weighting the original logarithmic growth function as in Saba and Hansen (2022), where the remaining n-l channels are passed through the standard logarithmic compression function in ACE. Here, an exponential weight of 0.5 is used because the range of inputs to the weighted compression function is less than 1. Boosting is performed exponentially to increase the PSD without reaching the saturation limit. The ID-FACE+ strategy uses the same selection as ID-FACE, but compensates for channels that were selected but not stimulated.
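The three stages shown in Fig. 2 can be sketched as follows. This is a simplified illustration only: the paper boosts via a weighted logarithmic growth function (exponent 0.5), whereas here a linear raise-and-redistribute rule stands in for it, and all names and the redistribution rule are assumptions:

```python
def boost_formant_channels(levels, formant_idx, base_level):
    """Simplified sketch of the three-stage boost: (1) find how far the lowest
    formant channel falls below base level, (2) raise all formant channels by
    that amount so their spectral slope is preserved, (3) reduce the remaining
    channels by the same total amount (linear stand-in for the weighted
    logarithmic compression used in the actual strategy)."""
    out = list(levels)
    deficit = max((base_level - out[i] for i in formant_idx), default=0.0)
    if deficit <= 0:
        return out  # all formant channels already stimulable
    for i in formant_idx:
        out[i] += deficit
    others = [i for i in range(len(out)) if i not in formant_idx]
    if others:
        per_channel = deficit * len(formant_idx) / len(others)
        for i in others:
            out[i] = max(0.0, out[i] - per_channel)
    return out
```

After this step, every formant channel is at or above base level, so selection implies stimulation, which is the defining difference between ID-FACE and ID-FACE+.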
2. Ideal formant frequency estimation
To generate ideal (ID) formant frequencies, spectrograms of the quiet sentence tokens of the test battery were visually inspected using an open-source program, Wavesurfer (Shue et al., 2009; Sjölander and Beskow, 2000). An automated function to draw formant plots (indicative of the formant tracks and/or formant transitions) within the software was adapted to reflect the same frame-by-frame analysis used in the proposed formant-based strategies (i.e., windowing, model order, frequency constraints, and FFT parameters). Automated formant tracks for all three formants were verified and adjusted by hand to resolve largely discontinuous contours, pitch estimates mistaken for F1, or other visual abnormalities in the formant tracks. Formants (F1, F2, F3) were exported from Wavesurfer (Sjölander and Beskow, 2000) into matlab (MathWorks Inc., Natick, MA) and are referred to as ideal (ID) formant frequencies. Formant tracks were recorded from the noise-free, quiet condition and time-aligned offline to the corresponding noisy condition within the subjective test battery. Figure 1(A) illustrates a spectrogram for the word "around" with the formant plots for F1 (red), F2 (green), and F3 (blue). These frequencies are used in the ID-FACE and ID-FACE+ strategies to select formant channels. The method for identifying ideal estimates in the quiet condition was developed so as to provide estimates as close to ground truth as possible without the negative effects of estimation error due to the introduction of noise.
3. Estimation approaches
To investigate the effects of estimation approach on channel selection, three alternate estimation approaches were utilized. Individual estimates according to each approach were used to generate channel selection patterns and compared objectively in a side-by-side manner with the ID-FACE strategy and the control strategy, ACE. In all three approaches, estimates were generated offline, stored individually within the matlab environment, and imported into the framework accordingly during signal processing. These estimation approaches were analyzed objectively for channel selection behavior and were not evaluated with CI subjects.
Linear predictive coding (LPC) approach. A traditional LPC approach was adapted from Chen and Loizou (2004) in matlab and was selected to represent the fundamental baseline method used in the enhanced formant estimation algorithm of the FACE strategy in the previous study by the authors (Saba et al., 2018). Six model orders (N = 6, 8, 10, 12, 16, 20) were tested in an offline experiment to determine the highest accuracy of the traditional root-solving method of LPC in quiet conditions, evaluated against the ideal formant frequency tracks. From this evaluation, coefficients from a 12th-order polynomial were generated using a Hamming window of 8 ms with 87.5% overlap. Formants and corresponding bandwidths were calculated from coefficients meeting a bandwidth criterion of 500 Hz and frequencies within a range of 200–3200 Hz. Formant frequencies from the previous frame were used to ensure continuity within 150 Hz between frames.
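The root-solving LPC procedure can be sketched as follows, assuming the autocorrelation (Yule–Walker) method. Parameter defaults mirror the text (12th order, 500 Hz bandwidth criterion, 200–3200 Hz range), but the frame-to-frame 150 Hz continuity rule is omitted, and the function is illustrative rather than the authors' matlab code:

```python
import numpy as np

def lpc_formants(x, fs, order=12, bw_max=500.0, fmin=200.0, fmax=3200.0):
    """Estimate up to three formants for one frame by root-solving the LPC
    polynomial obtained from the autocorrelation method."""
    x = np.asarray(x, float) * np.hamming(len(x))
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Solve the normal equations R a = -r for the prediction coefficients.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:order + 1])
    roots = np.roots(np.concatenate(([1.0], a)))
    formants = []
    for z in roots:
        if z.imag <= 0:
            continue  # keep one root from each conjugate pair
        f = np.angle(z) * fs / (2 * np.pi)
        bw = -fs / np.pi * np.log(np.abs(z))  # -3 dB bandwidth of the pole
        if fmin < f < fmax and bw < bw_max:
            formants.append(float(f))
    return sorted(formants)[:3]
```

On a clean frame with a single strong resonance, the root nearest the unit circle recovers the resonance frequency; in noise, spurious roots and widened bandwidths are exactly the failure modes the bandwidth and frequency criteria try to filter.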
Burg (BURG) algorithm. Short-term spectral analysis using the Burg algorithm (Anderson, 1974; Burg, 1972) was automated through PRAAT, software used for audio and phonetic analyses (Boersma and Weenink, 2001). This algorithm performs spectral analysis by minimizing the forward and backward prediction errors in a least-squares sense via the Levinson–Durbin recursion. To generate three formant frequencies, a 4 ms analysis window with 75% overlap was used with an upper constraint of 4.2 kHz, without pre-emphasis. This approach is referred to as the BURG algorithm for the remainder of the study and was selected because this software is widely used for the analysis of speech in the field of speech science.
Unconstrained line spectral pairs (U-LSP). An approach to calculate line spectral frequencies, or line spectral pairs (LSP), was implemented using a Toeplitz inversion method (Crosmer and Barnwell, 1985; Itakura, 1975; Markel and Gray, 1976). Linear prediction was used to generate the coefficients of the forward and backward prediction polynomials. An optimal model order of eight was selected using an iterative design approach based on the highest estimation accuracy in relation to the ideal formant frequency tracks. Toeplitz matrices were then inverted and flipped using samples from an 8 ms Hamming window with 87.5% overlap. Formant frequencies and bandwidths were calculated using the arctangent of the imaginary and real parts of the roots of the backward polynomial. Unlike the roots of LPC coefficients, all roots of LSP coefficients lie on the unit circle with natural ordering across frequency (e.g., 0–4 kHz, 0–8 kHz). Similar to the bandwidth continuity clause used to enhance the LPC estimate, the same algorithm was applied to sift through each of the LSP frequencies in the order "w0, w1, … wn" if bandwidths exceeded 200 Hz; however, the original LSP constraints were not incorporated into the approach (Crosmer and Barnwell, 1985; Crosmer, 1985). Therefore, the method is considered unconstrained and is referred to as unconstrained line spectral pairs, or U-LSP. This method was selected because it is widely utilized in speech technology for speaker identification, automatic speech recognition, speech coding, and speech synthesis.
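A minimal sketch of the LSP property used here: the sum and difference polynomials built from the LPC coefficients have all roots on the unit circle, and their angles give naturally ordered line spectral frequencies. This illustrates the property only; it does not reproduce the Toeplitz-inversion implementation or the 200 Hz bandwidth sifting rule, and the function name is hypothetical:

```python
import numpy as np

def lsp_frequencies(a, fs):
    """Line spectral frequencies from LPC coefficients a = [1, a1, ..., ap].
    P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) - z^-(p+1) A(z^-1);
    for a minimum-phase A(z), their roots lie on the unit circle and
    interlace in frequency."""
    a = np.asarray(a, float)
    P = np.append(a, 0.0) + np.append(0.0, a[::-1])  # sum polynomial
    Q = np.append(a, 0.0) - np.append(0.0, a[::-1])  # difference polynomial
    freqs = []
    for poly in (P, Q):
        for z in np.roots(poly):
            ang = np.angle(z)
            # Discard the trivial roots at z = 1 and z = -1.
            if 1e-6 < ang < np.pi - 1e-6:
                freqs.append(float(ang * fs / (2 * np.pi)))
    return sorted(freqs)
```

Because the roots are guaranteed to lie on the unit circle, no bandwidth estimate is needed to locate them, which is what makes LSP root-finding attractive relative to general LPC root-solving.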
C. Subjective evaluation: Speech intelligibility
1. Speech battery
A random set of 240 IEEE sentences was phonetically transcribed by hand for voiced portions of speech, originally aided by forced alignment (IEEE, 1969; Ochshorn and Hawkins, 2016; Spahr et al., 2012). Vowels /i/, /I/, /e/, /E/, /@/, /a/, /o/, /U/, /u/, /R/, /A/, /Y/, /W/, /O/; consonants /c/, /p/, /b/, /t/, /d/, /k/, /g/, /f/, /T/, /D/, /s/, /z/, /v/, /S/, /Z/, /h/, /Q/, /w/, /y/, /r/, /C/, /J/; semi-vowels /l/, /w/, /y/, /r/; and nasals /m/, /n/, /G/ were labeled according to the Carnegie Mellon University (CMU) Dictionary. Start and stop times were generated and applied to each time-aligned noisy acoustic sentence token. Two interference types, speech-shaped noise (SSN) and reverberation, were used individually and in combination to evaluate the efficacy of the proposed strategies. Five simulated acoustic conditions were used as the speech test battery for subject evaluation in addition to a noise-free condition: +10 dB SNR SSN, +5 dB SNR SSN, T60 = 300 ms, T60 = 600 ms, and +10 dB SNR with T60 = 600 ms, where T60 values reference the amount of reverberation. Room impulse responses (RIR) from Neuman et al. (2010) were convolved with the input signal to develop the reverberant conditions. For the combined reverberation and SSN condition, SSN was added to the signal prior to reverberation to simulate how this noise type would occur in a naturalistic environment.
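The condition-generation order described above (SSN mixed at the target SNR first, then reverberation applied) can be sketched as follows; function names are illustrative, not the authors' code:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then add."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

def add_noise_then_reverb(speech, noise, snr_db, rir):
    """Per the order described in the text: mix SSN at the target SNR first,
    then convolve the noisy signal with the room impulse response."""
    return np.convolve(mix_at_snr(speech, noise, snr_db), rir)
```

Reversing the two steps would reverberate only the speech, which is why the mixing order matters for a naturalistic simulation.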
2. Procedure
Call-and-response listening procedures were used to determine speech intelligibility for ACE and ID-FACE and for ACE and ID-FACE+. Five simulated acoustic conditions were evaluated in addition to the baseline quiet condition. The test battery was processed offline using each of the strategies to generate the electrode and current information for each strategy and stored as individual matlab files. During the listening procedure, the offline-processed files were randomized and presented to subjects through a unilateral, direct-connect setup using the CCi-MOBILE Research Platform (developed by UT-Dallas CRSS-CILab, Richardson, TX) in a double-walled sound booth (Ghosh et al., 2022; Hansen et al., 2019). The presentation of electric stimuli in this manner bypasses the clinical speech processor as well as the auditory input from the microphones (i.e., the research platform was not configured to operate in real time). Bilateral subjects were asked to subjectively select their better ear for testing and to remove the contralateral processor. For subjects with residual hearing in the non-implanted ear (bimodal or single-sided deafness subjects) or the contralateral implanted ear (bilateral subjects), ear plugs were provided to force subjects to focus on the electric delivery of sentence tokens. Intelligibility was scored as the total number of words correct. Subjects were given a training set of 10 sentences for each of the proposed strategies prior to beginning the test. A total of 20 sentences was used to calculate intelligibility for each strategy-condition pair. Each subject was presented 240 sentences during the test phase and 20 sentences during the training phase. The average test duration was 134 min with a single 10 min break, and subjects were allowed to request additional breaks at any time.
The same procedure was used to evaluate ACE and ID-FACE+ in a follow-up experiment 3 months after participation in the first experiment only for speech-shaped noise conditions with a subset of the subject population (N = 2; S1, S5).
3. Subjects
Inclusion criteria consisted of CI users with implants manufactured by Cochlear Ltd. who use ACE processing as their routine signal processing strategy and have at least 6 months of experience with their device. Subject demographics for the six participants are shown in Table I. Twenty sentences, not included in the test battery, were used to determine clinical baseline performance (control) within a single-walled sound booth using each subject's clinical processor. Subjects S1 and S5 yielded baseline performance less than 75%, while subjects S2 and S4 yielded performance less than 80%. Clinical maps (MAP) were provided for each subject by their audiologist to ensure comparable performance to their clinical processor using the CCi-MOBILE Research Platform (Ghosh et al., 2022; Hansen et al., 2019). S1 and S5 from the first experiment with ID-FACE were tested in the follow-up experiment using the ID-FACE+ strategy.
TABLE I.
Cochlear implant subject clinical processor specifications and demographics for the subjective evaluation, where stimulation rate is denoted as stim. rate in Hz, or pulses per second per channel. Baseline speech intelligibility was calculated as the average intelligibility in the quiet, noise-free condition. All subjects use Cochlear Ltd. implant systems.
| ID | Implant type (Cochlear Ltd.) | Device experience (yrs) | Active electrodes | Stim. Rate (Hz) | n-maxima | Age (yrs) | Baseline performance (quiet conditions) (%) |
|---|---|---|---|---|---|---|---|
| S1 | CI24RE | 10 | 21 | 1000 | 8 | 72 | 65.2 |
| S2 | CI24RE | 9 | 22 | 900 | 8 | 66 | 79.3 |
| S3 | CI512 | 12 | 21 | 900 | 8 | 71 | 98.8 |
| S4 | CI422 | 7 | 22 | 500 | 8 | 57 | 78.7 |
| S5 | CI24R | 16 | 18 | 900 | 12 | 66 | 54.3 |
| S6 | CI24RE | 7 | 20 | 500 | 8 | 66 | 87.1 |
4. Statistical analysis
For analysis of subjective data (average sentence intelligibility scores), a repeated-measures, two-way analysis of variance (ANOVA) was used to determine the effects of signal processing strategy (ACE vs ID-FACE) and simulated noise/reverberation condition (six total: quiet; +10 and +5 dB SNR SSN; T60 = 300 ms and 600 ms; +10 dB SNR SSN with T60 = 600 ms). Bonferroni multiple comparisons tests were used to determine significance at the 0.05 level between signal processing strategies for each noise/reverberation condition.
D. Objective evaluation: Strategy-specific selection patterns
1. Channel selection and stimulated current patterns
The same dataset from the subjective evaluation was analyzed objectively. Electrical stimuli (electrode and current pairs) were processed and analyzed offline for a total of 360 sentences (60 sentences per subject) corresponding to the randomized presentation of conditions from the subjective evaluation dataset. Channel selection and stimulated current were used to determine: (1) individual channel selection and stimulation patterns of ID-FACE for each of the six simulated listening conditions, and (2) how estimation approach and number of selected channels (n-maxima) affect channel selection across three of the six listening conditions. Channel selection was quantified on an individual channel basis as the total number of voiced frames identified using phonetic transcription for the 22 vowels defined in the CMU dictionary. Stimulated current of the selected channels was quantified as a percentage of the dynamic range for each CI subject, constrained between 0 and 1, for voiced frames identified from phonetic transcription. The three estimation approaches (LPC, U-LSP, and BURG) embedded in the framework of ID-FACE were compared against ID-FACE and ACE. Three subsets of n-maxima were analyzed to determine the effect of the total number of selected channels using the formant-based channel selection criterion. N-maxima was adjusted objectively on a subject-by-subject basis for the following options: N - 4, N + 3, and N + 6, where "N" refers to the subject's clinical n-maxima. All subjects, with the exception of S5, had an n-maxima value of eight channels; therefore, the options for n in the n-of-m approach were decreased by four, increased by three, and increased by six, respectively. For S5, whose n-maxima is set to 12, the N + 6 condition is representative of N + 4 due to the lack of available electrodes in the subject's clinical MAP.
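The normalization of stimulated current to each subject's dynamic range can be sketched as follows, assuming clinical current levels bounded by threshold (T) and comfort (C) levels; the function name is hypothetical:

```python
def percent_dynamic_range(level, t_level, c_level):
    """Express a stimulated current level as a fraction of the T-C dynamic
    range, clipped to [0, 1] as in the analysis described above."""
    frac = (level - t_level) / (c_level - t_level)
    return min(1.0, max(0.0, frac))
```

Normalizing per subject in this way makes current values comparable across MAPs with very different absolute T and C levels.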
2. Statistical analysis
Objective data consist of two measures: (1) the cumulative total of how many times an individual channel was stimulated in a sentence token, averaged channel-wise across 20 sentences of the same noise condition and then across each of the six subject MAPs; and (2) the corresponding average stimulated current of individual stimulated channels, represented as a percentage of the total dynamic range constrained between the threshold and comfort levels defined in each subject MAP, averaged in the same manner as (1). These objective data were used to determine: (1) the effect of strategy (ACE, ID-FACE), (2) the effect of estimation approach (LPC, U-LSP, BURG), and (3) the effect of n-maxima (number of selected channels) across individual channels (1–22) for each noise condition. Post hoc Bonferroni-corrected multiple comparisons tests were used to determine significant differences between strategies across the same channel within a single noise condition. Subjective data (speech intelligibility) and objective data (channel selection, current values) were found to be normally distributed and homoscedastic; therefore, p-values and F-statistics were not adjusted, as no assumptions of the ANOVA were violated.
III. RESULTS
1. Subjective evaluation of intelligibility with ID-FACE and ID-FACE+
Speech intelligibility for the ACE and ID-FACE strategies is shown in Fig. 3(A), where the number above each box represents the average percentage point difference compared to the control strategy (ACE). Results from a repeated-measures, two-way ANOVA revealed a significant effect of listening condition (noise/reverb) (F[2,25] = 44.66, p < 0.0001) and a significant interaction (strategy × noise) (F[5,25] = 4.102, p = 0.0074), but the effect of strategy failed to reach significance (F[1,5] = 0.4522, p = 0.5311). As the SNR of SSN decreased, intelligibility decreased, with the average at the +5 dB SNR level falling below 50%. Average intelligibility (N = 6) with ID-FACE was 88.2% in quiet, +11.0 percentage points higher than the control at 77.2%, which was found to be significant (p < 0.01). Individual improvements of +15.5%, +16.6%, and +19.0% in this condition were observed for S1, S4, and S5, respectively, the subjects with baseline performance in quiet below 80% (shown in Table I); improvements for subjects with performance above 80% were 11.2% and 5.6% for S2 and S6, but not for S3 at –2.0%. Results were comparable (within 3 percentage points) with the proposed strategy across the SSN conditions, where average intelligibility was 67.0% (ACE) and 64.2% (ID-FACE) at +10 dB and 43.3% (ACE) and 42.7% (ID-FACE) at +5 dB SNR. Comparable performance (within 6 percentage points) was also observed for the two reverberation-only conditions and the reverberation and noise condition, where intelligibility was higher with ACE but not significantly different (p > 0.05). Figure 3(B) illustrates the average (N = 2) and individual percentage point improvement for ID-FACE and ID-FACE+ with individual formant channel boosting for low-performing subjects (baseline <80%), S1 and S5. ID-FACE+ only resulted in higher average intelligibility at the +10 dB SNR level, but not for the quiet or +5 dB SNR conditions.
Average improvement was +7.4%, +2.8%, and +4.1% points with ID-FACE+ and +17.2%, –5.1%, and +8.8% points with ID-FACE in quiet, +10 dB, and +5 dB SNR, respectively.
FIG. 3.
(A) Speech intelligibility (N = 6) for quiet (baseline), speech-shaped-noise (SSN), reverberation (T60), and the combination of noise and reverberation (SSN, T60) conditions for ACE (light gray) and ID-FACE (dark gray). Individual subject performance is illustrated using symbols, median values are denoted as the line within the box, and whiskers represent the minimum and maximum values. Significance was calculated at the 0.05 level. (B) Percentage point improvement in speech intelligibility for S1 and S5 for ID-FACE and ID-FACE+ (hatched) with individual channel enhancement where boxes denote the minimum and maximum values.
2. Objective analysis of selection patterns with ID-FACE
Tables II and III provide results from a two-way ANOVA for channel selection and stimulated current, respectively. There was a significant effect of strategy on channel selection for both SSN conditions and for the reverberation and SSN condition. A significant effect of strategy on stimulated current was observed only for the SSN and reverberation condition; all other conditions failed to reach significance. Channel selection for each strategy, calculated as the number of voiced frames resulting in stimulation current above the base (threshold) level, is shown for each condition in Fig. 4, where line plots indicate the average number of frames. Surface plots are constrained between the first and third quartiles. In most cases, selection with ID-FACE was found to be similar to or negligibly lower than ACE as the listening environment became more challenging. In the quiet condition, the proposed strategy selected more electrodes in the lower (22–17, 0.2–0.9 kHz) frequency ranges than the control strategy, which was found to be significant for electrodes representative of the upper and lower range of F1 [shown in the red shaded area in Fig. 4(A)]. Heat maps above each surface plot in Fig. 4 illustrate the difference in stimulated current (ΔI) between ID-FACE and ACE, where negative values indicate higher current with ACE (shown as light gray) and positive values indicate higher current with ID-FACE (shown as dark gray). While selection of the mid-frequency electrodes (16–10, 1–2.2 kHz) was slightly lower than ACE, the corresponding current, provided in Table IV and shown as heat maps above the surface plots in Fig. 4, was significantly higher for the upper range of F1 (19–18, 0.5–0.8 kHz), a small portion of the F2 range (12, 1.5–1.8 kHz), and across the upper range of F3 (9–6, 2.3–4 kHz).
A significant (p < 0.01) decrease in selection and significant (p < 0.05) increase in current was observed for electrode 13 (Fc = 1.45 kHz) and 12 (Fc = 1.7 kHz), respectively.
TABLE II.
Results from a repeated-measures, two-way analysis of variance (ANOVA) of channel selection quantified as the total number of voiced frames (as a summation across 23 vowels) for ACE and ID-FACE strategies at each of the six acoustic conditions, where "sig." refers to significance, "ns" refers to no significant difference (p > 0.05), and bold type refers to a significant difference (p < 0.05).
| Effects of strategy-specific selection patterns: Individual channel selection (number of stimulated frames) | ||||
|---|---|---|---|---|
| Condition | Strategy (ACE, ID-FACE) F-statistic, p-value, sig. | Channel (1–22) | Interaction (strategy x channel) | |
| Quiet | F(1,5) = 0.5640, p = 0.4865, ns | F(21, 105) = 14.96, p < 0.0001 | F(21, 105) = 1.576, p = 0.0692, ns | |
| Speech-shaped noise | ||||
| +10 dB SNR | F(1,5) = 9.945, p = 0.0253 | F(21, 105) = 3.855, p < 0.0001 | F(21, 105) = 0.9433, p = 0.5379, ns | |
| +5 dB SNR | F(1,5) = 12.93, p = 0.0156 | F(21, 105) = 5.347, p < 0.0001 | F(21, 105) = 5.777, p < 0.0001 | |
| Reverberation | ||||
| T60 = 300 ms | F(1,5) = 2.752, p = 0.1581, ns | F(21, 105) = 16.77, p < 0.0001 | F(21, 105) = 1.278, p = 0.2072, ns | |
| T60 = 600 ms | F(1,5) = 0.9899, p = 0.3655, ns | F(21, 105) = 14.76, p < 0.0001 | F(21, 105) = 3.540, p < 0.0001 | |
| SSN, Reverberation | ||||
| +10 dB, T60 = 600 ms | F(1,5) = 48.24, p = 0.001 | F(21, 105) = 10.22, p < 0.0001 | F(21, 105) = 2.959, p = 0.0001 | |
TABLE III.
Results from a repeated-measures, two-way ANOVA of average stimulated current (represented as a percentage of the total dynamic range) calculated for voiced frames of speech (averaged across 23 vowels) for ACE and ID-FACE strategies at each of the six acoustic conditions, where "sig." refers to significance, "ns" refers to no significant difference (p > 0.05), and bold type refers to a significant difference (p < 0.05).
| Effects of strategy-specific selection patterns: Stimulated current of individual channels (normalized clinical levels) | ||||
|---|---|---|---|---|
| Condition | Strategy (ACE, ID-FACE) F-Statistic, p-value, sig. | Channel (1–22) | Interaction (strategy x channel) | |
| Quiet | F(1,5) = 6.289, p = 0.054, ns | F(21, 105) = 10.39, p < 0.0001 | F(21, 105) = 2.829, p = 0.0003 | |
| Speech-shaped noise | ||||
| +10 dB SNR | F(1,5) = 0.0687, p = 0.804, ns | F(21, 105) = 9.823, p < 0.0001 | F(21, 105) = 1.972, p = 0.0132 | |
| +5 dB SNR | F(1,5) = 1.930, p = 0.2234, ns | F(21, 105) = 3.672, p < 0.0001 | F(21, 105) = 5.448, p < 0.0001 | |
| Reverberation | ||||
| T60 = 300 ms | F(1,5) = 0.419, p = 0.546, ns | F(21, 105) = 15.82, p < 0.0001 | F(21, 105) = 2.368, p = 0.0022 | |
| T60 = 600 ms | F(1,5) = 1.312, p = 0.3038, ns | F(21, 105) = 15.35, p < 0.0001 | F(21, 105) = 1.806, p = 0.027 | |
| SSN, Reverberation | ||||
| +10 dB, T60 = 600 ms | F(1,5) = 21.30, p = 0.0058 | F(21, 105) = 17.34, p < 0.0001 | F(21, 105) = 1.553, p = 0.0758, ns | |
FIG. 4.
(Color online) Channel/electrode selection patterns, calculated as the number of stimulated frames for voiced-only segments of the speech battery (25th–75th quartile), and corresponding stimulated current for (A) quiet, (B) +10 dB SNR SSN, (C) +5 dB SNR SSN, (D) T60 = 300 ms, (E) T60 = 600 ms, and (F) +10 dB SNR SSN, T60 = 600 ms with ACE (light gray) and ID-FACE (dark gray). Heat maps above each plot indicate the average percent difference in current (ΔI) between ID-FACE and ACE, where lighter electrodes represent more stimulation with ACE and less stimulation with ID-FACE, and darker electrodes represent more stimulation with ID-FACE and less stimulation with ACE, constrained between –7.5% and 5% according to the overall differences across each condition. Overlap (gray) in the contour plot represents similar selection. Line plots represent average selection with each strategy. Significance from Bonferroni multiple comparisons tests at the 0.05 level is represented as "*".
TABLE IV.
Results from Bonferroni multiple comparison tests of average stimulated current (represented as a percentage of the total dynamic range, calculated for the voiced frames of speech averaged across each of the 23 vowels) in quiet, +5 dB SNR SSN, and +10 dB SNR SSN, T60 = 600 ms for (i) ID-FACE vs ACE and (ii) LPC, U-LSP, and BURG vs ID-FACE. Significance is reported as either an increase (↑) or decrease (↓), where the number of symbols denotes p < 0.0332, p < 0.0021, p < 0.0002, and p < 0.0001, respectively, for each electrode number (No.) assigned according to center frequencies (freq.) in Hz.
| Center freq. (Hz) | Electrode No. | Quiet | | | | +5 dB SSN | | | | +10 dB SNR, T60 = 600 ms | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | ID-FACE | LPC | U-LSP | BURG | ID-FACE | LPC | U-LSP | BURG | ID-FACE | LPC | U-LSP | BURG |
| 250 | 22 | ns | ns | ns | |||||||||
| 375 | 21 | ↓↓↓↓ | ↑↑ | ↑↑ | |||||||||
| 500 | 20 | ↓ | ↓↓↓↓ | ↑ | |||||||||
| 625 | 19 | ↑↑ | ↓↓↓ | ↓ | |||||||||
| 750 | 18 | ↑↑↑ | ↓↓ | ↓↓ | ↓ | ||||||||
| 875 | 17 | ↓↓↓↓ | ↓↓ | ↓ | |||||||||
| 1000 | 16 | ↓ | |||||||||||
| 1125 | 15 | ↓↓ | |||||||||||
| 1250 | 14 | ↓↓ | |||||||||||
| 1450 | 13 | ↑ | ↓↓↓↓ | ||||||||||
| 1700 | 12 | ↑ | ↓↓↓↓ | ↓↓ | |||||||||
| 1950 | 11 | ↓↓↓↓ | ↓↓ | ↓↓↓ | |||||||||
| 2200 | 10 | ↓↓↓↓ | ↑ | ↑↑↑↑ | |||||||||
| 2500 | 9 | ↓ | ↓ | ↓↓↓ | |||||||||
| 2900 | 8 | ↓↓↓ | ↓↓↓↓ | ||||||||||
| 3300 | 7 | ↓↓↓ | ↓↓↓↓ | ||||||||||
| 3800 | 6 | ↑ | ↓↓ | ||||||||||
| 4400 | 5 | ↓ | |||||||||||
| 5000 | 4 | ||||||||||||
| 5700 | 3 | ||||||||||||
| 6500 | 2 | ||||||||||||
| 7500 | 1 | ||||||||||||
As SNR decreased from +10 to +5 dB, the selection of some low-frequency (21–22, 0.2–0.4 kHz), mid-frequency (17–15, 0.8–1.2 kHz), and all high-frequency (8–1, 2–7.9 kHz) electrodes significantly decreased, as shown by the asterisks above the gray surface lines. The decrease in selection in these ranges did result in significant differences in stimulated current for the +10 dB condition. In contrast to selection in the quiet condition, an increase in selection of mid-range electrodes (13–9, 1.3–2.7 kHz) was observed and found to be significant for electrodes 10–9 (2–2.7 kHz) at the +5 dB level. Here, the stimulated current of the mid-range electrodes was significantly lower than ACE by an average of 8%.
As the T60 value increased from 300 to 600 ms, selection of electrodes 8–6 (2.7–4 kHz) and 12–10 (1.5–2 kHz) was significantly greater with ACE, whereas selection of electrodes 21–20 (0.2–0.4 kHz) was significantly greater with ID-FACE. Selection for the combination of SSN and reverberation mimicked that of the individual +5 dB SNR and T60 = 600 ms conditions but with more pronounced peaks and a larger increase in selection with ACE for the lower range of F2 (electrodes 15–18, 0.75–1.125 kHz). Significant differences in current were observed in the combinational condition across the majority of the speech spectrum up to 4.4 kHz.
Electrodograms for each subject are provided in Fig. 5 for quiet [Fig. 5(A)] and the SSN and reverberation condition [Fig. 5(B)]. Black lines represent the same stimulation patterns as ACE, where differences appear in green or red for channels that are introduced or removed with ID-FACE, respectively. In the quiet condition, more channels were removed than added with the ideal formant-based selection criteria. Increases in channel selection were visually located within the ranges of F1 (electrodes 22–18) and F2 (electrodes 17–7). For the SSN and reverberation condition, ID-FACE increased the selection of electrodes 14–10 (1.3–2.3 kHz), indicative of the second formant, at the expense of high-frequency, noise-dominant electrodes 8–1 (2.7–7.9 kHz). The unbalanced ratio of removed channels to added channels, or rather channels that were prioritized with ID-FACE and not selected with ACE, indicates that some selected formant channels did not produce electrical stimulation above the threshold or base level.
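The electrodogram differences in Fig. 5 amount to a set comparison between the two strategies' selection matrices. A minimal sketch (the function name and the boolean-matrix representation are illustrative, not from the study):

```python
import numpy as np

def diff_electrodograms(ace_sel, idface_sel):
    """Classify stimulation pulses between two electrodograms.

    ace_sel, idface_sel : (n_channels, n_frames) boolean selection matrices.
    Returns masks for pulses common to both strategies (plotted black),
    introduced by ID-FACE (green), and removed by ID-FACE (red).
    """
    shared = ace_sel & idface_sel
    added = idface_sel & ~ace_sel     # e.g., prioritized formant channels
    removed = ace_sel & ~idface_sel   # e.g., dropped noise-dominant channels
    return shared, added, removed

# Toy example: 2 channels x 2 frames
ace = np.array([[True, True], [False, True]])
idf = np.array([[True, False], [True, True]])
shared, added, removed = diff_electrodograms(ace, idf)
```

Counting entries in `added` and `removed` per channel gives the removed-to-added ratio discussed above.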
FIG. 5.
(Color online) Individual electrodograms for each of the six CI subjects (S1–S6) in (A) quiet and (B) +5 dB SNR SSN for the IEEE sentence, “Sell your gift to the buyer at a good gain”. Black lines represent electrical stimulation for ACE, red or green lines represent stimulation that was either removed or introduced with the ID-FACE strategy, respectively. The height of stimulus bars represents the current (amplitude values). Electrodes span logarithmically from 0.1–7.5 kHz from the apex of the cochlea to the base (low-to-high frequency).
3. Effect of estimation approach on channel selection
Table IV provides individual channel differences in stimulated current across three selected conditions for each estimation approach (i.e., LPC, U-LSP, and BURG) with respect to ACE and ID-FACE. Results from a two-way ANOVA indicated no significant effect of estimation approach on channel selection (F[4,20] = 1.758, p = 0.1771) or average stimulated current (F[4,20] = 2.564, p = 0.0699) in the quiet condition. However, a significant effect was observed for both measures at +5 dB SNR (channel selection: F[4,20] = 14.47, p < 0.0001; current: F[4,20] = 5.111, p = 0.0053) and at +10 dB SNR, T60 = 600 ms (channel selection: F[4,20] = 37.22, p < 0.0001; current: F[4,20] = 19.98, p < 0.0001). While a decrease in current in the upper range of F2 (electrodes 12–9) was observed for ID-FACE in the low-level SSN condition, current with the LPC and U-LSP estimation approaches was found to be significantly higher for electrodes 13 (Fc = 1.45 kHz) and 10 (Fc = 2.2 kHz). For the reverberation and SSN condition, a significant (p < 0.05) reduction was observed in the following regions: (i) the higher F3 range (9–5, 2.3–4.7 kHz), (ii) the majority of the F1 range (21–18, 0.3–0.8 kHz), and (iii) the lower F2 range (17–11, 0.8–2 kHz). This pattern was consistent for U-LSP and BURG as compared to ACE, but an increase in current was observed in the F1 region for electrode 21 with LPC and electrodes 21–20 with U-LSP, and lower current for electrodes 19–18 with BURG and electrode 17 with U-LSP, when compared directly to ID-FACE.
Differences in channel selection across each of the estimation approaches are illustrated in Fig. 6. For the quiet condition [Fig. 6(A)], little deviation in the number of frames across the three estimation approaches was recorded. However, in the low-level SSN condition [Fig. 6(B)], each of the estimation approaches selected high-frequency electrodes 7–1 (3.1–7.5 kHz) less, on average, than ACE, as shown by the higher peaks in light gray. This is the same selection pattern observed with the ID-FACE strategy. Selection patterns for the BURG and LPC formant estimation methods were relatively similar to each other and to that of ID-FACE, where any increase in selection occurred for mid-frequency electrodes (11–9, 1.8–2.7 kHz) associated with the mid-range of F2 and a decrease in selection occurred for the remainder of the electrodes. Unlike the two previous approaches, U-LSP indicated higher selection of electrodes 18–15 (0.6–1.2 kHz), as shown by the light blue peaks, associated with the upper F1 and lower F2 range.
FIG. 6.
(Color online) Quartile surface plots of channel selection for the estimation approaches in blue (EST-TYPE) within the framework of ID-FACE against the control, ACE in gray for (A) quiet and (B) +5 dB SNR SSN. “*” denotes significant differences (p < 0.05) and “ns” denotes no significant difference between the estimation approach and ID-FACE from Bonferroni multiple comparisons test.
4. Effect of the number of channels on channel selection
Figure 7 illustrates the effect of n-maxima for a higher and a lower subset of electrodes. Three subsets were objectively analyzed: N − 4, N + 3, and N + 6, where N represents n-maxima for each subject, for two conditions: quiet and the low-level SSN condition. For S5, the N + 6 configuration is representative of N + 4 due to the lack of available channels in the subject MAP. A two-way ANOVA revealed significant effects of n-maxima for the quiet (channel selection: F[5,25] = 25.72, p < 0.0001; current: F[5,25] = 9.841, p < 0.0001), +5 dB SNR SSN (channel selection: F[5,25] = 59.61, p < 0.0001; current: F[5,25] = 109.8, p < 0.0001), and +10 dB SNR SSN and T60 = 600 ms (channel selection: F[5,25] = 31.67, p < 0.0001; current: F[5,25] = 44.5, p < 0.0001) conditions. A higher number of frames was recorded for the N + 3 and N + 6 subsets and a lower number for N − 4, as expected, but the increase/reduction was not proportional across each of the electrodes in comparison to the original n-maxima. Independent of the condition, as n-maxima was varied, larger differences were observed for all electrodes outside the range of 12–10 (1.5–2.3 kHz).
FIG. 7.
(Color online) Quartile surface plots of channel selection as a function of n-maxima for (A) quiet and (B) +5 dB SNR SSN. Red surface plots indicate a reduction (Nmax - 4) of selected channels and green surface plots indicate an increase (Nmax + 3, Nmax + 6) in selected channels. Average selection is represented as line plots between both strategies.
IV. DISCUSSION
The goal of the subjective evaluation of ID-FACE was to determine whether increasing formant frequency estimation accuracy with the use of ideal formant frequencies, or as close to ground truth as possible, increases speech intelligibility beyond that of the predecessor strategy, FACE, from Saba et al. (2018). This l-of-n-of-m selection criterion to identify and prioritize the selection of formants was hypothesized to improve intelligibility for CI listeners in difficult listening conditions. Results from the acute listening experiment yielded comparable performance with the proposed strategies, ID-FACE and ID-FACE+, as compared to the control strategy (ACE) in the noise and reverberation combinations tested, but demonstrated a significant improvement for the quiet condition. The goal of the objective evaluation of ID-FACE was to determine the resulting effect on overall selection and stimulation patterns. Without any background noise, ID-FACE increased the selection of the five most apical (low frequency) electrodes. However, as the SNR of SSN decreased, the pattern of selection was reversed: the formant-based strategy significantly decreased the selection of the eight most apical and eight most basal electrodes and increased the selection of the mid-range electrodes (9–13, 1.45–2.5 kHz). Regardless of observed increases in channel selection for the mid-range frequencies known to be associated with the second formant (Assmann and Summerfield, 2004; Kewley-Port and Watson, 1994; Parikh and Loizou, 2005), the null hypothesis was supported in the most difficult noise condition: an increase in selection of formant-based channels did not result in intelligibility improvements at low SNRs. This is attributed to the fact that stimulated current was significantly lower than ACE for the same subset of channels.
An increase in channel selection with a corresponding decrease in stimulated current, as observed in the low-SNR conditions, suggests that the selected formant-based channels: (a) were previously masked by the noise-dominant channels, (b) resulted in stimulation current below the threshold and/or base level, such that channels were "selected" but did not result in stimulation of the corresponding electrode, and (c) may have benefited from individual channel boosting exceeding the overall energy to compensate for the overall current loss of the particular frame. However, this was not the case for the quiet condition, where differences in selection and stimulated current occurred simultaneously (i.e., selection increased and stimulated current increased) or where a decrease in selection corresponded to an increase in stimulated current. Selection patterns observed in the two contrasting simulated listening conditions (quiet vs +5 dB SNR SSN) provide evidence that the energy-based n-maxima selection strategy can neglect some of the important spectral information conveyed by formant frequencies.
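As a concrete illustration of the selection behavior discussed above, the l-of-n-of-m criterion can be sketched for a single frame; this is a hypothetical reimplementation (the function name, the exact tie-breaking, and the fill rule are assumptions, not the study's code):

```python
import numpy as np

def select_l_of_n_of_m(envelope, formant_channels, n_max):
    """Formant-prioritized n-of-m channel selection for one frame.

    envelope         : (m,) per-channel envelope energy
    formant_channels : channel indices mapped from the F1/F2/F3 estimates (l set)
    n_max            : total number of channels to stimulate (n-maxima)
    """
    l_set = list(dict.fromkeys(formant_channels))   # de-duplicate, keep order
    # Fill the remaining n - l slots with the highest-energy channels (ACE-style)
    by_energy = np.argsort(envelope)[::-1]
    rest = [int(c) for c in by_energy if c not in l_set][: n_max - len(l_set)]
    return sorted(l_set + rest)

# Frame where the formant channel at index 0 is weak: energy-only selection
# would skip it, but formant priority keeps it.
picked = select_l_of_n_of_m(np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7]),
                            formant_channels=[0, 2], n_max=4)
```

With an energy-only criterion, the top four channels here would be 1, 3, 5, and 2; the formant priority instead retains channels 0 and 2 and drops the louder channel 5, which is how a formant channel can displace a noise-dominant one.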
Individual channel boosting in the follow-up listening experiment with ID-FACE+ was performed by increasing the energy for the formant bands using an exponential weighting value during the compression stage at the expense of the other selected bands within the frame. This type of boosting attempts to reach saturation level faster within the logarithmic compression stage to provide the CI listener with slightly enhanced vocalic cues. Manipulating power spectral densities within the electric domain is limited as it is only applicable for l-of-m channels and is constrained by the overall energy of the particular frame. This means the amount of boosting in each frame is dependent on the logarithmic relationship of the pre-boosted channel and the saturation level, the number of channels not above base level, and the amount of boosting needed to bring channels above base level while preserving spectral slope. Therefore, individual channel boosting does not guarantee the CI listener will perceive these enhanced cues as amplitude changes for l channels are variable in each frame, (i.e., not constant or above a specific audiometric threshold as in Loizou and Poroy, 2001). The study by Loizou and Poroy reported that contrasting the maximum and minimum amplitudes of channels within a frame by 4–6 dB constrained between the base and saturation levels produced significant intelligibility improvements with both NH and CI listeners (Loizou and Poroy, 2001). This type of enhancement ensures that channels are boosted similarly in each frame and that the amount of boosting is within a perceivable range for CI users, unlike in ID-FACE+. In a speech enhancement scheme by Lyzenga et al., spectral expansion was achieved by manipulating the amplitude spectrum using an exponential weighting factor within the range of formant frequencies and using linear filters to enhance F2 and F3 cues. 
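A rough sketch of this kind of compression-stage boosting, assuming an ACE-style logarithmic loudness growth function; the exponent w, the steepness c, and the renormalization scheme are illustrative assumptions, not the published ID-FACE+ parameters:

```python
import numpy as np

def compress_with_boost(env, formant_idx, base, sat, c=416.0, w=1.2):
    """Logarithmic compression with exponential boosting of formant bands.

    env         : per-channel envelope amplitudes for one frame
    formant_idx : indices of the l formant channels to boost
    base, sat   : base and saturation levels of the input range
    c           : steepness of the loudness growth function (assumed value)
    w           : exponential weight (> 1) pushing formant channels toward
                  saturation; energy is renormalized so the frame total matches
                  the un-boosted frame (boost at the expense of other bands)
    """
    x = np.clip((env - base) / (sat - base), 0.0, 1.0)
    boosted = x.copy()
    boosted[formant_idx] = x[formant_idx] ** (1.0 / w)
    if boosted.sum() > 0:                      # preserve overall frame energy
        boosted *= x.sum() / boosted.sum()
    # Map through the compressive loudness growth function
    return np.log1p(c * np.clip(boosted, 0.0, 1.0)) / np.log1p(c)

out = compress_with_boost(np.array([0.2, 0.5, 0.8]), formant_idx=[2],
                          base=0.0, sat=1.0)
```

Because the renormalization pulls energy from the non-formant bands, the achievable boost in any frame is bounded by that frame's total energy, which is the limitation discussed above.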
The authors found significant improvements in speech reception thresholds for NH listeners and a slight improvement for hearing impaired listeners using an expansion and lift approach (Lyzenga et al., 2002). It should be noted that neither ID-FACE nor ID-FACE+ includes loudness compensation for the current differences attributed to differences in channel selection. ID-FACE+ boosts individual channels at the expense of energy from other channels such that there is no difference in dB SPL. Loudness was not among the factors investigated as affecting the l-of-n-of-m framework. Lyzenga et al. (2002) also reported larger benefits for speech spoken by a female as opposed to speech spoken by a male. In the present experiment, the speech battery consisted only of male speech. During the experiment, one subject (S1) stated that sentences produced with ID-FACE+ appeared to have a higher pitch than sentences processed with ACE and credited intelligibility benefits in the noisy conditions to this phenomenon. This could suggest that the spectral slope of the frame after the weighting compression function may shift the perceived F0 contour; however, further analysis is needed to determine whether boosting with the preservation of spectral slope disrupts pitch perception.
Combining the objective and subjective results for ID-FACE, a significant improvement in intelligibility was observed in the quiet condition with unique channel selection and corresponding current patterns. There are four distinct regions of differences in current and two distinct regions of differences in channel selection for the quiet condition. Channel selection was significantly higher for three of the five most apical (low frequency) electrodes; however, stimulating current was lower for the first three electrodes and significantly higher for electrodes 18–19 (0.625–0.75 kHz). Therefore, an increase in current and channel selection only occurred simultaneously for the electrodes associated with the upper region of F1. Channel selection was lower for mid-frequency electrodes (10–17, 0.875–2.2 kHz) in the lower half of the F2 region (significant for one of seven electrodes), whereas stimulating current was higher across the same region (significant for one of seven electrodes). A peak of increased channel selection was observed for electrode 9 (Fc = 2.5 kHz) corresponding to a significant decrease in stimulating current. Conversely, a peak of decreased channel selection was observed for electrode 6 (3.8 kHz) corresponding to a significant increase in stimulating current. These findings differ from Saba et al. (2018), where the formant-based strategies reduced the selection of apical electrodes and increased the selection of mid-frequency electrodes. In the previous study, channel selection was averaged across all nine acoustic conditions (consisting of babble and speech-shaped noise types and reverberation) and isolated according to the selected electrodes, i.e., disregarding which channels were not selected to prioritize the formant channels.
In the present study, the analysis of channel selection is performed in greater detail: channel selection is considered for all 22 electrodes in order to determine how the selection of formant bands reduces the number of channels selected using the energy-based criteria. Assuming CI listeners are most accustomed to ACE processing, where channel selection has been identified as forming low-frequency clusters (Büchner et al., 2008; Kludt et al., 2021; Lai et al., 2018; Nogueira et al., 2005; Tabibi et al., 2020), a decrease in selection in the mid-frequency range paired with higher current values for the selected channels in that region could suggest that the perception of the second peak (F2) is stronger. This further suggests that adjacent electrodes are not stimulated in the region, so the peak appears prominent. The presentation of prominent peaks for noise-free speech can contribute to higher speech intelligibility (Assmann and Summerfield, 2004) and may explain the average 11.0 percentage point benefit observed with ID-FACE.
The patterns for the speech-shaped noise and reverberation conditions were considerably different from those observed in the quiet condition, which increased selection of apical, low frequency electrodes (22–18, 0.3–0.8 kHz). One of the following three patterns was observed: (1) higher selection of mid-range electrodes (9–13, 1.45–2.5 kHz) with lower stimulating current; (2) lower selection of the eight most apical and basal electrodes with higher stimulating current; or (3) higher selection of the five most basal electrodes with lower stimulating current. The differences in patterns across individual electrodes between quiet and noisy/reverberant conditions may also be related to the ability of implant listeners to perceive small-scale changes. Many factors can contribute to the known decreased frequency sensitivity of CI recipients. Previous literature suggests that as frequency increases, difference limens (DLs) also increase (Chen and Loizou, 2004; Rogers et al., 2006). Therefore, the variance of F2 and F3 cues provided to the implant listener using either the ground truth estimates or any of the three estimation approaches (or theoretically any other estimation approach) must exceed these DLs to potentially make an impact on channel selection and, in turn, speech intelligibility. In a study by Chen and Zeng, NH listeners were found to be more sensitive to frequency changes, demonstrated by DLs within a range of 5–10 Hz, whereas DLs for implant listeners ranged from 10–550 Hz, reflecting a decreased ability to perceive fine spectral changes (Chen and Zeng, 2004). This phenomenon, coupled with the little to no significant channel selection differences observed between the estimation approaches and ground truth or ideal formant frequencies, provides evidence that the incorporation of less accurate estimates, such as in the FACE strategy from Saba et al. (2018), can be made without compromising speech intelligibility.
This may also suggest that algorithms used to generate these estimates may not need the full computational power required to drive an accurate estimate. In contrast, the robustness of the formant estimate may depend more on its proximity to the cut-off frequencies of the bandpass filters than on the accuracy of the estimate itself (Croghan et al., 2017; Donaldson and Nelson, 2000), because the estimate is only used to drive the selection of a particular electrode. Likewise, robustness may also depend on the spectrotemporal sensitivity of CI listeners to perceive individual absences or additions of a single channel. For example, if the difference in stimulation between the two strategies involves only an individual electrode on which the CI listener does not rely for perceptual purposes, then the proposed strategy may not be as successful for that particular CI user. This supports the overall conclusion that the success of the l-of-n-of-m approach may depend on the perceptual ability of CI listeners to perceive small-scale changes.
The goal of the objective analysis regarding differences in estimation approach and number of selected channels was to determine the feasibility of the formant-based selection criterion. Selection from the various estimation approaches in noise was not found to be significantly different from that in noise-free environments. This suggests that the maximum potential for robustness exists in a noise-free environment. Since no effect was observed in the selection of individual channels, an analysis must also be done to identify the effects of the corresponding current of the selected channels. Ideal frequency-based selection resulted in lower current, where it is assumed that the estimate did not correspond to a spectral peak. However, for LPC- and U-LSP–based selection, a significant increase in current was observed on various individual channels across the F2 region. Specifically, the energy of the l band selection was found to be higher and is conjectured to more closely resemble peaks observed in the noisy signal as opposed to peaks observed in the noise-free signal (i.e., ideal estimates that were measured in quiet). For selection criteria based on energy, the identification of global spectral peaks becomes more challenging as local spectral peaks increase with decreasing SNR. The inability to effectively select peaks using an energy-based criterion for channel selection will occur regardless of the estimation approach in the presence of noise.
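The failure mode described here can be demonstrated with a toy per-channel energy profile; the numbers are entirely synthetic (peak locations, noise shape, and n are arbitrary illustrations, not measured data):

```python
import numpy as np

n_ch = 22
clean = np.full(n_ch, 0.1)
clean[[6, 12]] = 1.0                  # "formant" peaks carried by the target speech
noise = np.linspace(0.0, 1.5, n_ch)   # noise energy rising toward basal channels

def top_n(profile, n=4):
    """Energy-based n-of-m peak picking for one frame."""
    return {int(i) for i in np.argsort(profile)[::-1][:n]}

for scale in (0.0, 0.5, 1.0):         # 0 = quiet, 1 = heavy noise
    picked = top_n(clean + scale * noise)
    print(f"noise x{scale}: formant channels kept = {sorted(picked & {6, 12})}")
```

As the noise scale grows, the weaker formant channel (index 6 here) is displaced from the n selected channels by noise-dominant basal channels, which is exactly the situation the formant-priority criterion is meant to counteract.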
The objective analyses of the subset of selected channels, or n-maxima, revealed a non-uniform increase in selection across the 22 electrodes, but followed the same trends as the original subsets according to the subject MAPs shown in Table I for N + 3 and N + 6. For N - 4, however, channel selection was found to be significantly different for the quiet and +5 dB SNR noise conditions. As n-maxima (or n in n-of-m, the number of channels selected for stimulation in each frame) decreases, the impact of priority selection of the l channels (three, representing F1, F2, and F3) increases, especially in situations where selection using ACE does not include channels representative of formant frequencies. For example, for a MAP with an n-maxima of six, 50% (three of six) of the selection is prioritized to formant channels, whereas only 25% (three of 12) is prioritized when n-maxima is 12. This is further amplified when ACE does not include two or more formants. It should be noted that in some studies (Kals et al., 2010; Nogueira et al., 2005), better performance was achieved with a smaller number of channels, such that the proposed selection criterion was more effective when the selection was limited. Similarly, the selection with ID-FACE in this particular case, which includes at least two to three differences in channel selection, may elicit larger perceptual differences through the selection of non-dominant, information-bearing channels.
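The dilution of the priority channels as n-maxima grows can be made concrete with a small sketch (the n-maxima values are the examples from the text; l = 3 priority channels for F1, F2, and F3):

```python
l = 3  # priority channels representing F1, F2, and F3
# Fraction of each frame's selection reserved for formant channels:
# it shrinks as n-maxima grows, diluting the influence of the l channels.
fractions = {n: l / n for n in (6, 12)}
# fractions[6] is 0.5 (three of six); fractions[12] is 0.25 (three of 12)
```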
While the goal of the present study was to analyze the influence of various factors on the channel selection behavior of a simple l-of-n-of-m criterion, the study also provides a framework for analyzing channel selection behavior more broadly. The use of ideal (or as close to ground truth as possible) formant frequencies is one solution for disrupting the selection of neighboring channels in any listening condition: it effectively decreased selection of commonly clustered low-frequency electrodes (22–13, 0.25–1.125 kHz) in favor of formant-bearing channels, at the expense of noise-dominant channels or adjacent peaks. Other solutions have been proposed that incorporate masking functions (Büchner et al., 2008; Kludt et al., 2021; Nogueira et al., 2005), refractory periods (Babacan, 2010; El Boghdady et al., 2016; Lai et al., 2018), and spatial separation (Bolner et al., 2020; Kals et al., 2010), to name a few. Because formant estimates are generated and stored offline, unlike traditional estimation methods and real-time approaches, the selection of the l channels is not negatively influenced by the presence of noise or reverberation. However, the selection of priority channels in low-SNR listening situations does not account for the effects of noise within the particular channel and may result in lower energy than noise-dominant channels.
V. CONCLUSIONS
In this study, the effects of (i) formant estimation accuracy, (ii) formant estimation approach, and (iii) number of selected channels were studied within the context of a formant-based channel selection criterion proposed within an n-of-m–type strategy and compared against a control energy-based criterion. Six CI users were evaluated using ideal (or as close to ground truth as possible) formant estimates to determine the effect of estimation accuracy on speech intelligibility. Significant improvements were observed for the noise-free condition, but not for the speech-shaped noise or reverberation conditions. An objective analysis of stimulation components, including channel selection and stimulated current, considered the following: (i) simultaneous increases in current and selection, (ii) increases in current coupled with a decrease in selection, and (iii) increases in selection without increases in current, where the former two were attributed to higher speech intelligibility. The effect of estimation approach resulted in significant differences in channel selection for a few channels relative to the baseline strategy, but followed the same overall selection patterns as ID-FACE. The number of channels, n-maxima, was found to play a significant role in channel selection, as it controls the impact of the l-channel differences. Therefore, when the l channels in the l-of-n-of-m selection criterion provide a representation of speech for CI users that (a) is significantly different from the baseline strategy, (b) is perceivable to the implant listener according to their MAP as defined by the degree of sensorineural hearing loss, (c) contains salient voicing information, and (d) is not masked by adjacent or noise-dominant channels, benefits in speech intelligibility can be achieved.
ACKNOWLEDGMENTS
This work was supported by Grant No. R01-DC016839 from the National Institute on Deafness and Other Communication Disorders, National Institutes of Health. The authors would like to thank Dr. Salim Saba for his expertise on various statistical analyses performed in this study. Additionally, special thanks to Colin Brochtrup for his help extracting formant estimates from open-source approaches and to transcription specialist Fajhr Qureshi and speech scientist Salar Jafarlou for their work on phonetic transcription and phoneme level forced alignment.
References
- 1. Anderson, N. (1974). “On the calculation of filter coefficients for maximum entropy spectral analysis,” Geophysics 39, 69–72. 10.1190/1.1440413
- 2. Assmann, P. F., and Summerfield, Q. (2004). “The perception of speech under adverse conditions,” in Springer Handbook of Auditory Research, edited by Greenberg W., Ainsworth W. A., Popper A. N., and Fay R. R. (Springer, New York), pp. 231–308.
- 3. Babacan, O. (2010). “Implementation of a Neurophysiology-Based Coding Strategy for the Cochlear Implant,” ZORA, pp. 11–41. 10.5167/uzh-46000
- 4. Boersma, P., and Weenink, D. (2001). “Praat, a system for doing phonetics by computer,” Glot Int. 5, 341–345.
- 5. Bolner, F., Magits, S., van Dijk, B., and Wouters, J. (2020). “Precompensating for spread of excitation in a cochlear implant coding strategy,” Hear. Res. 395, 107977. 10.1016/j.heares.2020.107977
- 6. Büchner, A., Nogueira, W., Edler, B., Battmer, R. D., and Lenarz, T. (2008). “Results from a psychoacoustic model-based strategy for the nucleus-24 and freedom cochlear implants,” Otol. Neurotol. 29, 189–192. 10.1097/mao.0b013e318162512c
- 7. Buechner, A., Beynon, A., Szyfter, W., Niemczyk, K., Hoppe, U., Hey, M., Brokx, J., Eyles, J., Van de Heyning, P., Paludetti, G., Zarowski, A., Quaranta, N., Wesarg, T., Festen, J., Olze, H., Dhooge, I., Müller-Deile, J., Ramos, A., Roman, S., Piron, J.-P., Cuda, D., Burdo, S., Grolman, W., Roux Vaillard, S. R., Huarte, A., Frachet, B., Morera, C., Garcia-Ibáñez, L., Abels, D., Walger, M., Müller-Mazotta, J., Leone, C. A., Meyer, B., Dillier, N., Steffens, T., Gentine, A., Mazzoli, M., Rypkema, G., Killian, M., and Smoorenburg, G. (2011). “Clinical evaluation of cochlear implant sound coding taking into account conjectural masking functions, MP3000™,” Cochlear Implants Int. 12, 194–204. 10.1179/1754762811Y0000000009
- 8. Burg, J. P. (1972). “The relationship between maximum entropy spectra and maximum likelihood spectra,” Geophysics 37, 375–376. 10.1190/1.1440265
- 9. Chen, B., and Loizou, P. C. (2004). “Formant frequency estimation in noise,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Quebec, Canada (IEEE, New York), pp. I-581–I-584. 10.1109/ICASSP.2004.1326052
- 10. Chen, H., and Zeng, F.-G. (2004). “Frequency modulation detection in cochlear implant subjects,” J. Acoust. Soc. Am. 116, 2269–2277. 10.1121/1.1785833
- 11. Croghan, N. B. H., Duran, S. I., and Smith, Z. M. (2017). “Re-examining the relationship between number of cochlear implant channels and maximal speech intelligibility,” J. Acoust. Soc. Am. 142, EL537–EL543. 10.1121/1.5016044
- 12. Crosmer, J., and Barnwell, T. (1985). “A low bit rate segment vocoder based on line spectrum pairs,” in ICASSP '85, IEEE International Conference on Acoustics, Speech, and Signal Processing (Institute of Electrical and Electronics Engineers, Tampa, FL), pp. 240–243. 10.1109/ICASSP.1985.1168223
- 13. Crosmer, J. R. (1985). Very Low Bit Rate Speech Coding Using the Line Spectrum Pair Transformation of the LPC Coefficients (Georgia Institute of Technology, Atlanta, GA).
- 14. Deller, J. R., Hansen, J. H. L., and Proakis, J. G. (2000). “Short-term processing of speech,” in Discrete-Time Processing of Speech Signals, edited by Herrick R. J. (Wiley-IEEE Press, New York), 1st ed., pp. 225–263.
- 15. Donaldson, G. S., and Nelson, D. A. (2000). “Place-pitch sensitivity and its relation to consonant recognition by cochlear implant listeners using the MPEAK and SPEAK speech processing strategies,” J. Acoust. Soc. Am. 107, 1645–1658. 10.1121/1.428449
- 16. Donaldson, G. S., Rogers, C. L., Johnson, L. B., and Oh, S. H. (2015). “Vowel identification by cochlear implant users: Contributions of duration cues and dynamic spectral cues,” J. Acoust. Soc. Am. 138, 65–73. 10.1121/1.4922173
- 17. Dorman, M. F., Loizou, P. C., and Rainey, D. (1997). “Speech intelligibility as a function of the number of channels of stimulation for signal processors using sine-wave and noise-band outputs,” J. Acoust. Soc. Am. 102, 2403–2411. 10.1121/1.419603
- 18. El Boghdady, N., Kegel, A., Lai, W. K., and Dillier, N. (2016). “A neural-based vocoder implementation for evaluating cochlear implant coding strategies,” Hear. Res. 333, 136–149. 10.1016/j.heares.2016.01.005
- 19. El-Jaroudi, A., and Makhoul, J. (1991). “Discrete all-pole modeling,” IEEE Trans. Signal Process. 39, 411–423. 10.1109/78.80824
- 20. Faulkner, A., Rosen, S., and Stanton, D. (2003). “Simulations of tonotopically mapped speech processors for cochlear implant electrodes varying in insertion depth,” J. Acoust. Soc. Am. 113, 1073–1080. 10.1121/1.1536928
- 21. Fetterman, B. L., and Domico, E. H. (2002). “Speech recognition in background noise of cochlear implant patients,” Otolaryngol. Head Neck Surg. 126, 257–263. 10.1067/mhn.2002.123044
- 22. Flanagan, J. L. (1956). “Automatic extraction of formant frequencies from continuous speech,” J. Acoust. Soc. Am. 28, 110–118. 10.1121/1.1908188
- 23. Friesen, L. M., Shannon, R. V., Baskent, D., and Wang, X. (2001). “Speech recognition in noise as a function of the number of spectral channels: Comparison of acoustic hearing and cochlear implants,” J. Acoust. Soc. Am. 110, 1150–1163. 10.1121/1.1381538
- 24. Fu, Q., and Shannon, R. V. (1999). “Effect of acoustic dynamic range on phoneme recognition in quiet and noise by cochlear implant users,” J. Acoust. Soc. Am. 106, L65–L70. 10.1121/1.428148
- 25. Fu, Q. J., and Nogaki, G. (2005). “Noise susceptibility of cochlear implant users: The role of spectral resolution and smearing,” J. Assoc. Res. Otolaryngol. 6, 19–27. 10.1007/s10162-004-5024-3
- 26. Fu, Q. J., and Shannon, R. V. (2002). “Frequency mapping in cochlear implants,” Ear Hear. 23, 339–348. 10.1097/00003446-200208000-00009
- 27. Ghosh, R., Ali, H., and Hansen, J. H. L. (2022). “CCi-MOBILE: A portable real time speech processing platform for cochlear implant and hearing research,” IEEE Trans. Biomed. Eng. 69, 1251–1263. 10.1109/TBME.2021.3123241
- 28. Hansen, J. H. L., Ali, H., Saba, J. N., Charan, M. C. R., Mamun, N., Ghosh, R., and Brueggeman, A. (2019). “CCi-MOBILE: Design and evaluation of a cochlear implant and hearing aid research platform for speech scientists and engineers,” in Proceedings of the International Conference on Biomedical and Health Informatics.
- 29. Hazrati, O., and Loizou, P. C. (2012). “The combined effects of reverberation and noise on speech intelligibility by cochlear implant listeners,” Int. J. Audiol. 51, 437–443. 10.3109/14992027.2012.658972
- 30. Hazrati, O., and Loizou, P. C. (2013). “Comparison of two channel selection criteria for noise suppression in cochlear implants,” J. Acoust. Soc. Am. 133, 1615–1624. 10.1121/1.4788999
- 31. Henry, B. A., and Turner, C. W. (2003). “The resolution of complex spectral patterns by cochlear implant and normal-hearing listeners,” J. Acoust. Soc. Am. 113, 2861–2873. 10.1121/1.1561900
- 32. Holden, L. K., Skinner, M. W., and Holden, T. A. (2005). “Speech recognition with the advanced combination encoder and transient emphasis spectral maxima strategies in nucleus 24 recipients,” J. Speech Lang. Hear. Res. 48, 681–702. 10.1044/1092-4388(2005/047)
- 33. Hu, Y., and Loizou, P. C. (2008). “A new sound coding strategy for suppressing noise in cochlear implants,” J. Acoust. Soc. Am. 124, 498–509. 10.1121/1.2924131
- 34. IEEE (1969). “IEEE recommended practice for speech quality measurements,” IEEE Trans. Audio Electroacoust. 17, 225–246. 10.1109/TAU.1969.1162058
- 35. Itakura, F. (1975). “Line spectrum representation of linear predictor coefficients of speech signals,” J. Acoust. Soc. Am. 57, S35. 10.1121/1.1995189
- 36. Iverson, P., Smith, C. A., and Evans, B. G. (2006). “Vowel recognition via cochlear implants and noise vocoders: Effects of formant movement and duration,” J. Acoust. Soc. Am. 120, 3998–4006. 10.1121/1.2372453
- 37. Kals, M., Schatzer, R., Krenmayr, A., Vermeire, K., Visser, D., Bader, P., Neustetter, C., Zangerl, M., and Zierhofer, C. (2010). “Results with a cochlear implant channel-picking strategy based on ‘Selected Groups,’” Hear. Res. 260, 63–69. 10.1016/j.heares.2009.11.012
- 38. Kewley-Port, D., Burkle, T. Z., and Lee, J. H. (2007). “Contribution of consonant versus vowel information to sentence intelligibility for young normal-hearing and elderly hearing-impaired listeners,” J. Acoust. Soc. Am. 122, 2365–2375. 10.1121/1.2773986
- 39. Kewley-Port, D., and Watson, C. S. (1994). “Formant-frequency discrimination for isolated English vowels,” J. Acoust. Soc. Am. 95, 485–496. 10.1121/1.410024
- 40. Kludt, E., Nogueira, W., Lenarz, T., and Buechner, A. (2021). “A sound coding strategy based on a temporal masking model for cochlear implants,” PLoS ONE 16, e0244433. 10.1371/journal.pone.0244433
- 41. Lai, W. K., Dillier, N., and Killian, M. (2018). “A neural excitability based coding strategy for cochlear implants,” J. Biomed. Sci. Eng. 11, 159–181. 10.4236/jbise.2018.117014
- 42. Li, N., and Loizou, P. C. (2008). “Factors influencing intelligibility of ideal binary-masked speech: Implication for noise reduction,” J. Acoust. Soc. Am. 123, 1673–1682. 10.1121/1.2832617
- 43. Loizou, P. C., and Poroy, O. (2001). “Minimum spectral contrast needed for vowel identification by normal hearing and cochlear implant listeners,” J. Acoust. Soc. Am. 110, 1619–1627. 10.1121/1.1388004
- 44. Lyzenga, J., Festen, J. M., and Houtgast, T. (2002). “A speech enhancement scheme incorporating spectral expansion evaluated with simulated loss of frequency selectivity,” J. Acoust. Soc. Am. 112, 1145–1157. 10.1121/1.1497619
- 45. Markel, J. D., and Gray, A. H. (1976). “Speech synthesis structures,” in Linear Prediction of Speech (Springer-Verlag, Berlin, Germany), pp. 92–128.
- 46. McKay, C. M., McDermott, H. J., Vandali, A. E., and Clark, G. M. (1992). “A comparison of speech perception of cochlear implantees using the spectral maxima sound processor (SMSP) and the MSP (MULTIPEAK) processor,” Acta Otolaryngol. 112, 752–761. 10.3109/00016489209137470
- 47. Moore, B. C. J. (2004). “Dead regions in the cochlea: Conceptual foundations, diagnosis, and clinical applications,” Ear Hear. 25, 98–116. 10.1097/01.AUD.0000120359.49711.D7
- 48. Neuman, A. C., Wroblewski, M., Hajicek, J., and Rubinstein, A. (2010). “Combined effects of noise and reverberation on speech recognition performance of normal-hearing children and adults,” Ear Hear. 31, 336–344. 10.1097/AUD.0b013e3181d3d514
- 49. Nogueira, W., Büchner, A., Lenarz, T., and Edler, B. (2005). “A psychoacoustic ‘NofM’-type speech coding strategy for cochlear implants,” EURASIP J. Adv. Signal Process. 2005, 101672. 10.1155/ASP.2005.3044
- 50. Nogueira, W., Rode, T., and Büchner, A. (2016). “Spectral contrast enhancement improves speech intelligibility in noise for cochlear implants,” J. Acoust. Soc. Am. 139, 728–739. 10.1121/1.4939896
- 51. Ochshorn, R. M., and Hawkins, M. (2016). “Gentle: A robust yet lenient forced aligner built on Kaldi,” https://lowerquality.com/gentle/ (Last viewed December 18, 2019).
- 52. Oxenham, A. J., Bernstein, J. G. W., and Penagos, H. (2004). “Correct tonotopic representation is necessary for complex pitch perception,” Proc. Natl. Acad. Sci. U.S.A. 101, 1421–1425. 10.1073/pnas.0306958101
- 53. Parikh, G., and Loizou, P. C. (2005). “The influence of noise on vowel and consonant cues,” J. Acoust. Soc. Am. 118, 3874–3888. 10.1121/1.2118407
- 54. Pfingst, B. E., Franck, K. H., Xu, L., Bauer, E. M., and Zwolan, T. A. (2001). “Effects of electrode configuration and place of stimulation on speech perception with cochlear prostheses,” J. Assoc. Res. Otolaryngol. 2, 87–103. 10.1007/s101620010065
- 55. Rabiner, L. R., Schafer, R., and Rader, C. M. (1969). “The chirp z-transform algorithm,” IEEE Trans. Audio Electroacoust. 17, 86–92. 10.1109/TAU.1969.1162034
- 56. Rogers, C. F., Healy, E. W., and Montgomery, A. A. (2006). “Sensitivity to isolated and concurrent intensity and fundamental frequency increments by cochlear implant users under natural listening conditions,” J. Acoust. Soc. Am. 119, 2276–2287. 10.1121/1.2167150
- 57. Rubinstein, J. T. (2004). “How cochlear implants encode speech,” Curr. Opin. Otolaryngol. Head Neck Surg. 12, 444–448. 10.1097/01.moo.0000134452.24819.c0
- 58. Saba, J. N. (2021). Leveraging Landmark Acoustic Features in Cochlear Implant Signal Processing (The University of Texas at Dallas, Dallas, TX), available at https://hdl.handle.net/10735.1/9445.
- 59. Saba, J. N., Ali, H., and Hansen, J. H. L. (2018). “Formant priority channel selection for an ‘n-of-m’ sound processing strategy for cochlear implants,” J. Acoust. Soc. Am. 144, 3371–3380. 10.1121/1.5080257
- 60. Saba, J. N., and Hansen, J. H. L. (2022). “Speech modification for intelligibility in cochlear implant listeners: Individual effects of vowel- and consonant-boosting,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), Incheon, Korea, pp. 5473–5477.
- 61. Schafer, R. W., and Rabiner, L. R. (1970). “System for automatic formant analysis of voiced speech,” J. Acoust. Soc. Am. 47, 634–648. 10.1121/1.1911939
- 62. Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). “Speech recognition with primarily temporal cues,” Science 270, 303–304. 10.1126/science.270.5234.303
- 63. Shue, Y., Keating, P., and Vicenik, C. (2009). “VOICESAUCE: A program for voice analysis,” J. Acoust. Soc. Am. 126(4 Suppl.), 2221. 10.1121/1.3248865
- 64. Sjölander, K., and Beskow, J. (2000). “Wavesurfer - an open source speech tool,” in Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP 2000), Vol. 4, pp. 464–467.
- 65. Skinner, M. W., Arndt, P. L., and Staller, S. J. (2002). “Nucleus® 24 advanced encoder conversion study: Performance versus preference,” Ear Hear. 23, 2–17. 10.1097/00003446-200202001-00002
- 66. Skinner, M. W., Holden, L. K., Holden, T. A., Dowell, R. C., Seligman, P. M., Brimacombe, J. A., and Beiter, A. L. (1991). “Performance of postlinguistically deaf adults with the wearable speech processor (WSP III) and mini speech processor (MSP) of the nucleus multi-electrode cochlear implant,” Ear Hear. 12, 3–22. 10.1097/00003446-199102000-00002
- 67. Snell, R. C., and Milinazzo, F. (1993). “Formant estimation from LPC analysis data,” IEEE Trans. Speech Audio Process. 1, 129–134. 10.1109/89.222882
- 68. Spahr, A. J., Dorman, M. F., Litvak, L. M., Van Wie, S., Gifford, R. H., Loizou, P. C., Loiselle, L. M., Oakes, T., and Cook, S. (2012). “Development and validation of the AzBio sentence lists,” Ear Hear. 33, 112–117. 10.1097/AUD.0b013e31822c2549
- 69. Stickney, G. S., Loizou, P. C., Mishra, L. N., Assmann, P. F., Shannon, R. V., and Opie, J. M. (2006). “Effects of electrode design and configuration on channel interactions,” Hear. Res. 211, 33–45. 10.1016/j.heares.2005.08.008
- 70. Tabibi, S., Kegel, A., Lai, W. K., and Dillier, N. (2020). “A bio-inspired coding (BIC) strategy for cochlear implants,” Hear. Res. 388, 107885. 10.1016/j.heares.2020.107885
- 71. Vallabha, G. K., and Tuller, B. (2002). “Systematic errors in the formant analysis of steady-state vowels,” Speech Commun. 38, 141–160. 10.1016/S0167-6393(01)00049-8
- 72. Wilson, B. S., Finley, C. C., Lawson, D. T., Wolford, R. D., Eddington, D. K., and Rabinowitz, W. M. (1991). “Better speech recognition with cochlear implants,” Nature 352, 236–238. 10.1038/352236a0
- 73. Wilson, B. S., Finley, C. C., Lawson, D. T., Wolford, R. D., and Zerbi, M. (1993). “Design and evaluation of a continuous interleaved sampling (CIS) processing strategy for multichannel cochlear implants,” J. Rehabil. Res. Dev. 30, 110–116. PMID: 8263821.
- 74. Xu, L., and Pfingst, B. E. (2008). “Spectral and temporal cues for speech recognition: Implications for auditory prostheses,” Hear. Res. 242, 132–140. 10.1016/j.heares.2007.12.010
- 75. Zeng, F.-G., Rebscher, S., Harrison, W., Sun, X., and Feng, H. (2008). “Cochlear implants: System design, integration, and evaluation,” IEEE Rev. Biomed. Eng. 1, 115–142. 10.1109/RBME.2008.2008250
- 76. Zhu, Z., Tang, Q., Zeng, F. G., Guan, T., and Ye, D. (2012). “Cochlear-implant spatial selectivity with monopolar, bipolar and tripolar stimulation,” Hear. Res. 283, 45–58. 10.1016/j.heares.2011.11.005