Abstract
Cochlear implant (CI) recipients often struggle to understand speech in reverberant environments. Speech enhancement algorithms could restore speech perception for CI listeners by removing reverberant artifacts from the CI stimulation pattern. Listening studies, either with cochlear-implant recipients or normal-hearing (NH) listeners using a CI acoustic model, provide a benchmark for speech intelligibility improvements conferred by the enhancement algorithm but are costly and time consuming. To reduce the associated costs during algorithm development, speech intelligibility could be estimated offline using objective intelligibility measures. Previous evaluations of objective measures that considered CIs primarily assessed the combined impact of noise and reverberation and employed highly accurate enhancement algorithms. To facilitate the development of enhancement algorithms, we evaluate twelve objective measures in reverberant-only conditions characterized by a gradual reduction of reverberant artifacts, simulating the performance of an enhancement algorithm during development. Measures are validated against the performance of NH listeners using a CI acoustic model. To enhance compatibility with reverberant CI-processed signals, measure performance was assessed after modifying the reference signal and spectral filterbank. Measures leveraging the speech-to-reverberant ratio, cepstral distance and, after modifying the reference or filterbank, envelope correlation are strong predictors of intelligibility for reverberant CI-processed speech.
I. INTRODUCTION
Cochlear implants (CIs) restore hearing by electrically stimulating the auditory nerve using an electrode array inserted into the cochlea. In quiet listening environments, CIs can restore the perception of speech (Firszt et al., 2004; Skinner et al., 2002). Despite this benefit, speech perception can be challenging for CI recipients in listening environments containing background noise or acoustic reverberation, such as in classrooms or busy social settings (Cullington and Zeng, 2008; Dorman and Gifford, 2017; Kressner et al., 2018). To alleviate communication difficulties in these settings, several studies have pursued the development of speech enhancement algorithms that selectively remove the influence of noise or reverberant artifacts on the CI stimulation pattern (Bolner et al., 2016; Crowson et al., 2020; Goehring et al., 2017; Goehring et al., 2019; Hazrati et al., 2013a; Hazrati et al., 2013b; Hazrati and Loizou, 2013). While algorithms for noise mitigation have been deployed in commercially available CI devices (Carlyon and Goehring, 2021; Gifford and Revit, 2010; Henry et al., 2023), no such algorithms for reverberation mitigation are yet available.
A widely adopted benchmark for speech enhancement, time-frequency masking, applies a matrix of gain values to a time-frequency (T-F) representation of the noisy or reverberant signal. The ideal T-F mask indicates where the artifacts are, informed by a priori knowledge of the acoustic distortion in each T-F unit of a degraded speech signal relative to its clean counterpart. In real-world listening scenarios, a priori knowledge of the local distortion is often unavailable, motivating the development of algorithms for estimating the ideal T-F mask from the noisy or reverberant signal. Previous works have demonstrated the feasibility of estimating T-F masks from the reverberant signal using an ideal T-F mask as the objective during algorithm development (Chu et al., 2022; Hazrati et al., 2013a; Hazrati et al., 2013b; Hazrati and Loizou, 2013). These works either leveraged knowledge of the reverberant environment (Hazrati et al., 2013a; Hazrati et al., 2013b; Hazrati and Loizou, 2013) or the linguistic content (Chu et al., 2022) to achieve statistically significant improvements in speech intelligibility after the application of the estimated mask. In real-world listening scenarios, not only is knowledge of the reverberant environment or the linguistic content unavailable, but the characteristics of the listening environment often change, such as when a listener moves within a reverberant environment or from one reverberant environment to another. Further work is needed to develop robust T-F mask estimation algorithms that optimally restore intelligibility in a range of reverberant environments without knowledge of the listening scenario.
During algorithm development, estimation of the ideal mask is often assessed using traditional measures of machine learning algorithm performance, such as the hit rate and false alarm rate (Kim et al., 2009) or accuracy (Chu et al., 2022). These measures assess the classification or regression performance of the machine learning algorithm for each T-F unit independently. Although the intelligibility of the signal mitigated by the estimated T-F mask should approach the intelligibility of the signal mitigated with the ideal T-F mask as mask estimation improves, assessing improvements in mask estimation using traditional measures of algorithm performance may not necessarily reflect proportional improvements in the intelligibility of the mitigated signal (Kressner et al., 2016a). Different T-F components of the CI stimulus have different levels of importance for speech understanding (Bosen and Chatterjee, 2016) and, importantly, the type of mask estimation errors produced by the speech enhancement algorithm and their spectro-temporal distribution can have a significant impact on the speech intelligibility of the mitigated signal (Kressner et al., 2016b; Kressner and Rozell, 2015). To assess the efficacy of a speech enhancement algorithm during development, the intelligibility of the mitigated stimulus must be assessed with consideration for the contribution of each T-F unit to higher-order features of the speech signal, such as those encompassing acoustic-phonetic cues.
Speech intelligibility is reliably assessed using subjective measurement of speech perception, in which a score is assigned to individual speech utterances based on a participant's recall of the speech content. Speech perception testing with CI recipients represents the gold standard for assessing a speech enhancement algorithm intended for CIs. However, during the initial phases of speech enhancement algorithm development, it can be advantageous to thoroughly explore the algorithmic parameter space to maximize gains in intelligibility while minimizing the distortions imposed by the algorithm. Subjective speech perception testing would be impractical for algorithmic parameter search as it would require many conditions per listener, leading to prohibitively long algorithm development and testing times. When testing with CI listeners, generalization of performance trends can often be hindered by variability in speech perception outcomes across CI recipients, due to differences in physiological and cognitive factors related to hearing loss and cochlear implantation, and by small sample sizes, due to challenges in recruitment and the limited availability of volunteers. Further, subjective speech perception testing may not be warranted during the initial phases of algorithm development when mask estimation and thus mitigated speech intelligibility is poor, particularly for CI listeners who often require speech in background noise to be present at a positive signal-to-noise ratio before the speech signal is intelligible (Dorman and Gifford, 2017).
Alternatively, initial testing of speech enhancement algorithms can be conducted with normal-hearing (NH) listeners using an acoustic model of CI processing referred to as vocoding (Whitmal et al., 2007). Although the acoustic model may not realistically depict all aspects of CI listening (Başkent, 2012; Bhargava et al., 2014; Karoui et al., 2019; Laneau et al., 2006; Svirsky et al., 2021), initial testing with NH listeners could reduce the impact of inter-subject variability and of ceiling and floor effects on speech intelligibility outcomes, enabling preliminary evaluations of speech enhancement algorithms to isolate the restoration of acoustic-phonetic speech content with greater precision. In addition, testing with NH listeners would enable larger sample sizes and assessment of algorithm performance across a wide range of listening scenarios, ensuring the enhancement algorithm is robust. Despite these benefits, subjective speech perception testing remains costly and time-intensive, making it poorly suited for fine-tuning speech enhancement algorithms during their development.
An appealing alternative to subjective speech perception testing is objective speech intelligibility measurement, in which a mathematical comparison is made between the original and enhanced speech signal. Objective intelligibility measures model aspects of auditory perception to compute a score from the enhanced speech signal that reflects the anticipated perceptual trends. Objective measures could be used to estimate speech intelligibility outcomes across various speech enhancement algorithm configurations without requiring listener effort, thus facilitating rapid, repeatable, and cost-effective algorithm development. Before objective intelligibility measures can be used during the development of reverberant speech enhancement algorithms for CIs, objective measures must demonstrate that they can accurately reflect speech perception outcomes in reverberant conditions where intelligibility is restored with varying levels of efficacy. Further, subjective intelligibility trends must reflect aspects of CI listening, either by acquiring speech perception outcomes with CI recipients or with signals that reflect the processing that is performed by the CI device.
CI devices provide considerably reduced spectral information when compared to broadband or natural speech. Some speech intelligibility measures have been proposed that incorporate the spectral resolution available in the CI stimulation pattern during the computation of the objective measure score. Such measures include the envelope correlation measure (ECM) (Yousefian and Loizou, 2012), which was designed to predict the speech reception threshold for CI listeners, and the speech-to-reverberant modulation energy ratio (SRMR) for CIs, SRMR-CI (Santos et al., 2013; Santos and Falk, 2014), an extension of the SRMR measure (Falk et al., 2010) for predicting intelligibility trends of reverberant speech for CI listeners. The reduction in spectral resolution imposed by CI processing has also been incorporated during the evaluation of objective measure performance. Cosentino et al. (2012) evaluated objective measures on reverberant vocoded speech and demonstrated that the vocoded signals could reflect perceptual trends in reverberant speech intelligibility for CI listeners. Cosentino et al. additionally investigated how changes to the vocoder parameters impacted objective measure performance, finding that the use of noise-band carriers degraded performance, but that most measures robustly represented speech intelligibility outcomes for CI listeners with changes to the vocoder type and the number of channels. To identify measures that reflected CI intelligibility trends, a few studies (Chen et al., 2013; Falk et al., 2015; Santos et al., 2012; Santos et al., 2013; Santos and Falk, 2014) validated the scores produced by a range of objective measures against intelligibility data acquired with eleven CI listeners from Hazrati and Loizou (2012b), one of the few studies to test the speech perception of CI listeners in both noise and reverberation. 
Measures were identified that reflected subjective intelligibility trends in increasingly reverberant, noisy, reverberant-noisy, and T-F masked reverberant-noisy conditions. The studies of Chen et al. (2013), Falk et al. (2015), Santos et al. (2012), Santos et al. (2013), and Santos and Falk (2014) identified objective measures that reflect CI intelligibility in challenging listening conditions, although the limited availability of reverberant-only and non-optimally mitigated conditions made it difficult to isolate the effects of reverberation and speech enhancement.
Several objective intelligibility measures have been validated for use with speech spoken in the presence of background noise (Holube and Kollmeier, 1996; Kates, 2005; Ma et al., 2009; Taal et al., 2011; Yousefian and Loizou, 2012) and for speech enhancement algorithms that target the removal of background noise from noisy speech mixtures (Hu and Loizou, 2006). However, trends in objective measure performance obtained in noise may not necessarily reflect trends that would be obtained in reverberation, as noise and reverberation impact speech signal characteristics differently. For instance, the short-time spectro-temporal structure of reverberant speech resembles that of anechoic speech, while noise often does not share the same spectro-temporal structure. Additionally, speech distortions introduced by background noise are typically additive, while distortions introduced by reverberation can combine constructively or destructively with the target speech signal. Due to the different mechanisms by which noise and reverberation degrade important speech information, some studies have proposed speech intelligibility measures specifically for use with reverberant stimuli (Chen et al., 2013; Kokkinakis and Loizou, 2011; Santos et al., 2013; Santos and Falk, 2014). Most of these validation studies employed noisy-reverberant conditions (Chen et al., 2013; Falk et al., 2015; Santos et al., 2012; Santos et al., 2013; Santos and Falk, 2014), although a few studies validated objective measure performance using solely reverberant stimuli (Cosentino et al., 2012; Kokkinakis and Loizou, 2011). Given the previously mentioned differences between noise and reverberation, objective measures that perform well in noisy and reverberant-noisy conditions may not necessarily perform well in solely reverberant conditions.
Of the studies using reverberant or noisy-reverberant stimuli (Chen et al., 2013; Cosentino et al., 2012; Falk et al., 2015; Kokkinakis and Loizou, 2011; Santos et al., 2012; Santos et al., 2013; Santos and Falk, 2014), none employed algorithms which resulted in variable levels of restoration of intelligibility, as would typically be observed during the development of speech enhancement algorithms. When stimuli were used that spanned a range of intelligibility levels, those stimuli were generated from increasingly reverberant and noisy listening environments (Chen et al., 2013; Cosentino et al., 2012; Falk et al., 2015; Santos et al., 2012; Santos et al., 2013; Santos and Falk, 2014) and likely do not reflect the signal characteristics of speech after ineffective mitigation by a speech enhancement algorithm. For example, increasing amounts of reverberation will result in gradual changes in the modulation spectrum whereas enhancement algorithms such as T-F masking often impose non-linear distortions via spectral subtraction. Although some analyses included enhanced speech conditions (Chen et al., 2013; Falk et al., 2015; Santos and Falk, 2014), the enhanced speech conditions used ideal T-F mask conditions where mask parameters were manipulated to yield high restorations in speech intelligibility, reflecting an upper bound for speech enhancement algorithm performance. These conditions were likely not reflective of the speech enhancement efficacy that would be observed during the initial stages of algorithm development and parameter tuning. Errors made by mask estimation algorithms are often structured in time and frequency and objective measures can fail to predict trends in T-F masked speech, particularly when differences between masks are not uniformly distributed (Kressner et al., 2016a). 
To use objective measures to assess the efficacy of reverberation mitigation algorithms during development, analysis of objective measure performance must use a validation dataset that represents structured and realistic changes in important speech content across T-F mask-mitigated conditions.
In this study, we validate objective intelligibility measures using a subjective intelligibility dataset that encompasses a gradual reduction of reverberant artifacts and acoustic-phonetic information from the reverberant speech signal, leading to progressive changes in subjective intelligibility across conditions. By encompassing gradual and systematic changes in intelligibility, the subjective dataset simulates trends in signal removal that would likely be observed during algorithm development. To capture the spectral resolution provided by the CI, the subjective dataset describes intelligibility outcomes for NH individuals listening to CI-processed and vocoded speech. First, we compare the performance of a range of objective measures using the subjective intelligibility dataset. Then, we evaluate changes in measure performance after modifying the computation of objective intelligibility measures to better account for reverberation or the spectral resolution available after CI processing. The article concludes with recommendations for objective measures that accurately reflect the restoration of acoustic-phonetic information in reverberant vocoded speech to facilitate the development of reverberant speech-enhancement algorithms for CIs.
II. METHODS
A. Subjective intelligibility dataset
The subjective intelligibility dataset comprises speech perception scores obtained from 20 NH listeners using a simulation of the Advanced Combination Encoder (ACE) processing strategy (Holden et al., 2002; Vandali et al., 2000), a widely used CI speech processing strategy. The dataset includes sentences and intelligibility scores for unenhanced reverberant speech, direct-path reverberant speech, and reverberant speech enhanced by one of two T-F masking algorithms, employing either hard (binary mask, BM) or soft (ratio mask, RM) attenuation of signal components. The binary and ratio T-F masking algorithms have been used extensively in the literature to improve the intelligibility of reverberant speech for both NH (Roman and Woodruff, 2011, 2013; Zhao et al., 2016; Zhao et al., 2017; Zhao et al., 2018) and CI (Hazrati et al., 2013a; Hazrati et al., 2013b; Hazrati and Loizou, 2012a, 2013; Kokkinakis et al., 2011) listeners.
T-F masking algorithms remove reverberant distortions from reverberant speech signals by applying a matrix of gain values to a T-F representation of the reverberant signal. The amount of reverberant distortion in each T-F unit of the reverberant signal was quantified using the speech-to-reverberant ratio (SRR), defined as
$$\mathrm{SRR}(t,f) = 10\log_{10}\!\left(\frac{|D(t,f)|^{2}}{|R(t,f)|^{2}}\right), \tag{1}$$
where $D(t,f)$ and $R(t,f)$ are T-F representations of the direct path and reverberant signals, respectively, and $t$ and $f$ represent time and frequency (Naylor and Gaubitch, 2010). Binary gain values were obtained by thresholding the SRR, resulting in the BM,
$$\mathrm{BM}(t,f) = \begin{cases} 1, & \mathrm{SRR}(t,f) > T, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$
where $T$ is the threshold in dB (Kokkinakis et al., 2011; Roman and Woodruff, 2013). Alternatively, SRR values could be mapped to continuous gain values to generate a RM,
$$\mathrm{RM}(t,f) = \left(\frac{\xi(t,f)}{\xi(t,f) + 1}\right)^{\beta}, \tag{3}$$
where $\xi(t,f)$ is the SRR on a linear scale [i.e., the operand of the logarithm function in Eq. (1)] and $\beta$ controls the slope of the gain function (Lim and Oppenheim, 1979). Several parameter values were implemented for $T$ (with values of −50, −15, −12, −9, −6, −3, 0, 3, 6, and 12 dB) and $\beta$ (with values of 0.005, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2.5, and 5), resulting in ten conditions for each masking algorithm.
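The two masking rules can be illustrated with a short sketch. It is not the implementation used to build the dataset: the function names are hypothetical, the strict inequality in the binary mask is an assumption, and the ratio mask assumes a parametric Wiener-type gain in the spirit of Lim and Oppenheim (1979), in which the linear SRR divided by itself plus one is raised to the slope parameter.

```python
import math

def srr_db(direct_power, reverb_power, eps=1e-12):
    """SRR in dB for one T-F unit, from direct-path and reverberant power.

    eps is an illustrative guard against division by zero and log(0).
    """
    return 10.0 * math.log10((direct_power + eps) / (reverb_power + eps))

def binary_mask(srr, threshold_db):
    """Hard gain: 1 where the SRR (dB) exceeds the threshold, else 0.

    srr is a list of rows (time frames), each a list of channel SRRs in dB.
    The strict '>' comparison is an assumption.
    """
    return [[1.0 if v > threshold_db else 0.0 for v in row] for row in srr]

def ratio_mask(srr, beta):
    """Soft gain (assumed form): (xi / (xi + 1)) ** beta, xi = linear SRR."""
    gains = []
    for row in srr:
        linear = [10.0 ** (v / 10.0) for v in row]
        gains.append([(xi / (xi + 1.0)) ** beta for xi in linear])
    return gains

# Two frames by two channels of SRR values in dB (made-up numbers).
example_srr = [[0.0, -20.0], [10.0, -6.0]]
bm = binary_mask(example_srr, threshold_db=-6.0)
rm = ratio_mask(example_srr, beta=1.0)
```

Under this assumed form, a larger slope parameter pushes every gain below one toward zero, which matches the description of increasingly attenuated signals as the mask parameter grows.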
The tolerance for reverberant distortions in the reverberant signal was progressively reduced by increasing the value of the mask parameter (the BM threshold or the RM slope parameter), yielding increasingly attenuated reverberant signals and a range of speech intelligibility outcomes. The range of intelligibility scores encompassed the extremes of under- to over-attenuation of the reverberant signal and represents potential outcomes of an enhancement algorithm during development. For both the BM and RM T-F masking strategies, there was a concave relationship between speech intelligibility and mask parameter value, indicating parameters at which mitigated signal intelligibility was maximized with the respective mask. The subjective intelligibility dataset used to evaluate objective measure performance is described in complete detail in Shahidi et al. (2022).
Reverberant speech material was created by convolving sentences from the Hearing in Noise Test (HINT) speech corpus (Nilsson et al., 1994) with a room impulse response (RIR) from the Aachen Impulse Response (AIR) database (Jeub et al., 2009), thereby simulating the effect of the reverberant environment. The RIR was recorded in a lecture theatre with a reverberation time (RT60) of 0.8 seconds and dimensions of 10.8 by 10.9 by 3.15 meters (length by width by height). The source-to-microphone distance was 7.1 meters, which is well beyond the critical distance of the room, approximated as 1.2 meters (Naylor and Gaubitch, 2010). To create the direct-path speech material, the RIR was truncated to remove impulses occurring more than 5 milliseconds after the initial impulse and convolved with each anechoic sentence. All speech files were sampled at 16 kHz prior to convolution with the RIR.
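The construction of the reverberant and direct-path material can be sketched as follows. This is a minimal illustration and not the code used to build the dataset: the function names are hypothetical, the direct-path impulse is assumed to be the largest-magnitude sample of the RIR, and a plain nested-loop convolution stands in for whatever convolution routine was actually used.

```python
def convolve(signal, kernel):
    """Full linear convolution of two sequences (direct form)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(kernel):
            out[i + j] += s * h
    return out

def direct_path_rir(rir, fs=16000, window_ms=5.0):
    """Truncate an RIR to the samples up to window_ms after the initial impulse.

    Assumption: the initial (direct-path) impulse is the sample with the
    largest absolute amplitude.
    """
    peak = max(range(len(rir)), key=lambda n: abs(rir[n]))
    keep = peak + int(round(window_ms * 1e-3 * fs))
    return rir[: keep + 1]

# Toy RIR: a direct impulse at sample 1 and one late reflection at sample 102.
toy_rir = [0.0, 1.0] + [0.0] * 100 + [0.5]
toy_speech = [1.0, 2.0, 3.0]

reverberant = convolve(toy_speech, toy_rir)
direct_path = convolve(toy_speech, direct_path_rir(toy_rir))
```

At 16 kHz, a 5 millisecond window corresponds to 80 samples, so the late reflection in the toy RIR is removed while the direct impulse is preserved.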
The ACE processor transforms an acoustic signal into a pattern of stimulation pulses for each electrode in the electrode array over time, where each electrode encodes information pertaining to a discrete range of frequencies. ACE processing was simulated using the Nucleus MATLAB Toolbox (Swanson and Mauch, 2006). Signals were transformed into the frequency domain using a discrete Fourier transform with a window length of 128 samples and a frame shift of 32 samples, resulting in a short-time Fourier transform with 64 frequency channels, each with a bandwidth of 125 Hz. A spectral weighting was then used to group the magnitude-squared Fourier coefficients into 22 electrode channels and a square root was applied to extract the envelope in each channel. To attenuate artifact-dominant stimulus pulses, T-F masking was applied to the envelope magnitudes after the spectral weighting. Then, the eight electrode channels with the largest magnitudes (of the 22 available) were retained in each processing cycle, and the stimulus was scaled to fall between base and saturation levels and logarithmically compressed. Vocoded waveform stimuli of the mitigated reverberant stimulus patterns were generated using the resulting CI pulse amplitudes as the amplitudes of sinusoidal carriers, with the frequencies of the sinusoidal carriers determined by the center frequencies of electrode channels in the default ACE program. Waveforms were resynthesized by summing sinusoidal carriers across frequency channels.
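Two steps of the processing chain above lend themselves to a brief sketch: the arithmetic linking the analysis window to the channel bandwidth, and the selection of the eight largest of 22 channel envelopes in each cycle. The sketch is not the Nucleus MATLAB Toolbox implementation; the function name is hypothetical, and representing unselected channels by a zero amplitude is an assumption made for illustration.

```python
SAMPLE_RATE_HZ = 16000
WINDOW_SAMPLES = 128
FRAME_SHIFT_SAMPLES = 32

# Spacing between DFT bins and number of retained frequency channels.
BIN_WIDTH_HZ = SAMPLE_RATE_HZ / WINDOW_SAMPLES        # 125.0 Hz
N_FREQUENCY_CHANNELS = WINDOW_SAMPLES // 2            # 64
FRAMES_PER_SECOND = SAMPLE_RATE_HZ / FRAME_SHIFT_SAMPLES  # 500.0

def select_maxima(envelopes, n_maxima=8):
    """Keep the n_maxima largest channel envelopes of one frame.

    Unselected channels are set to 0.0 (an illustrative choice).
    """
    order = sorted(range(len(envelopes)), key=lambda c: envelopes[c], reverse=True)
    keep = set(order[:n_maxima])
    return [e if c in keep else 0.0 for c, e in enumerate(envelopes)]

# One frame of 22 made-up channel envelopes.
frame = [float(c) for c in range(1, 23)]
selected = select_maxima(frame)
```

Because T-F masking is applied before this selection, a channel attenuated by the mask is less likely to be among the eight retained channels, so the mask also influences which electrodes are stimulated in each cycle.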
Twenty native speakers of American English with self-reported normal hearing were recruited to participate in the subjective intelligibility experiment. The participants' ages ranged from 19 to 35 years. Each participant was presented with 20 lists of ten sentences selected randomly from the HINT corpus, with each list corresponding to one experimental condition. Since each of the HINT lists was formed to match the phonemic distribution of the entire HINT corpus (Nilsson et al., 1994), randomly selecting one list for each condition and participant provided a thorough sampling of the phonetic information represented in the HINT database. During testing, participants were seated at a computer in a soundproof booth and vocoded speech stimuli were presented to the listener diotically through Sony MDR7506 headphones at a sound pressure level (SPL) of 65 dB. Each sentence was presented once, and participants were instructed to type the words they were able to hear. Participants were encouraged to correct typos before submitting their responses. The percentage of correctly identified phonemes in each sentence was automatically scored by comparison with the phonemes corresponding to HINT-corpus vocabulary.
B. Objective speech intelligibility prediction
Several measures are evaluated in the present study for predicting the intelligibility trends of reverberant vocoded speech after mitigation by T-F masking. These measures were included in the analysis to provide consistency with previous analyses (Chen et al., 2013; Falk et al., 2015; Santos et al., 2012; Santos et al., 2013) or because they demonstrated good intelligibility prediction performance on CI-processed (Yousefian and Loizou, 2012) or reverberant (Kinoshita et al., 2016) speech.
1. Description of objective measures
Objective measures estimate the quality or intelligibility of speech signals using perceptually relevant features of speech. Measures can be classified as intrusive or non-intrusive depending on whether they require a reference signal. The computation of intrusive objective measures involves the comparison of the degraded speech utterance, here referred to as the target signal, with a signal exemplifying highly intelligible speech for the same speech utterance, referred to as the reference signal. The measures evaluated here include a quality standard based on audible distortions (ITU, 2001; Kokkinakis and Loizou, 2011), as well as measures based on linear-predictive coding (LPC) (Quackenbush et al., 1988), frequency modulations (Chen et al., 2013; Falk et al., 2010; Santos et al., 2013; Santos and Falk, 2014), estimates of the signal-to-noise ratio (SNR) (Holube and Kollmeier, 1996; Kates, 2005; Ma et al., 2009), and correlations of the speech envelope within frequency bands (Taal et al., 2011; Yousefian and Loizou, 2012). In this study, reverberant vocoded signals processed by a simulation of the ACE processor are used as the input to objective measure calculation. Of the measures tested, only measures using frequency modulation features do not require a reference signal. Details of the measures analyzed in this work are provided in the following.
a. Perceptual Evaluation of Speech Quality (PESQ).
The PESQ measure is the International Telecommunication Union Recommendation for speech quality assessment (ITU-T, 2001). Although PESQ is a speech quality measure, highly intelligible utterances should also garner high-quality ratings, suggesting that PESQ may indicate intelligibility trends to some extent. PESQ is based on a sensory model combining two distortion-related factors: average symmetrical disturbance ($D_{\mathrm{ind}}$) and average asymmetrical disturbance ($A_{\mathrm{ind}}$),
$$\mathrm{PESQ} = a_0 + a_1 D_{\mathrm{ind}} + a_2 A_{\mathrm{ind}}. \tag{4}$$
The distortion factors are estimated by comparing target and reference signals after auditory representation mapping and compressive loudness scaling. The coefficients $a_0$, $a_1$, and $a_2$ were optimized in the original standard using conventional telephony data (ITU-T Rec. P.862, 2001). More recently, a variant of PESQ was developed for the overall quality of reverberant speech by fitting the PESQ coefficients to NH listener data, resulting in the overall Perceptual Evaluation of Speech Quality (oPESQ) measure (Kokkinakis and Loizou, 2011). PESQ and oPESQ produce scores between 1 and 4.5, with larger values indicating better speech quality.
b. Derivative LPC Measures.
Changes to the resonances of the speech signal can be assessed by comparing the LPC coefficients of target and reference speech. Using the LPC coefficients, the log-likelihood ratio (LLR) measures the likelihood that the LPC coefficients of the target signal are derived from the same distribution as those of the reference signal (Quackenbush et al., 1988),
$$\mathrm{LLR}(\mathbf{a}_r, \mathbf{a}_t) = \log\!\left(\frac{\mathbf{a}_t \mathbf{R}_r \mathbf{a}_t^{\mathsf{T}}}{\mathbf{a}_r \mathbf{R}_r \mathbf{a}_r^{\mathsf{T}}}\right), \tag{5}$$
where $\mathbf{a}_r$ and $\mathbf{a}_t$ are the LPC vectors of the reference and target speech signals, respectively; and $\mathbf{R}_r$ is the autocorrelation matrix of the reference speech signal. Smaller LLR values indicate more similarity between the LPC coefficients of the target and reference signal. To incorporate ceiling and floor effects, LLR is typically limited between 0 and 2.
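The LLR of a single frame can be sketched as a ratio of two quadratic forms followed by limiting to the range from 0 to 2. The sketch assumes that the LPC vectors and the reference autocorrelation matrix have already been estimated, that the LPC vectors include the leading unit coefficient, and that all names are hypothetical.

```python
import math

def quadratic_form(a, R):
    """Compute a R a^T for a vector a (list) and square matrix R (list of rows)."""
    n = len(a)
    return sum(a[i] * R[i][j] * a[j] for i in range(n) for j in range(n))

def llr(a_ref, a_tgt, R_ref, lower=0.0, upper=2.0):
    """Frame-level log-likelihood ratio, limited to [lower, upper]."""
    value = math.log(quadratic_form(a_tgt, R_ref) / quadratic_form(a_ref, R_ref))
    return min(max(value, lower), upper)

# Toy example with an identity autocorrelation matrix (made-up numbers).
R = [[1.0, 0.0], [0.0, 1.0]]
same = llr([1.0, 0.0], [1.0, 0.0], R)
different = llr([1.0, 0.0], [1.0, 1.0], R)
```

The numerator is the prediction-error energy obtained when the target LPC vector is applied to the reference signal, so it can never fall below the denominator when the reference LPC vector is the least-squares optimum, which is why the measure is floored at zero.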
Unlike the log-likelihood measure, the cepstrum distance (CD) (Kitawaki et al., 1988; Quackenbush et al., 1988) transforms the LPC coefficients into cepstrum coefficients, facilitating an estimate of the distance between two spectra,
$$\mathrm{CD}(\mathbf{c}_r, \mathbf{c}_t) = \frac{10}{\ln 10}\sqrt{2\sum_{k=1}^{p}\left[c_r(k) - c_t(k)\right]^{2}}, \tag{6}$$
where $p$ is the order of the LPC analysis; and $\mathbf{c}_r$ and $\mathbf{c}_t$ are vectors of the cepstrum coefficients for the reference and target speech signals, respectively. As the CD captures the distance between two spectra, CD values are non-negative, with lower values representing more similar spectra. CD is typically limited to the range from 0 to 10 dB. Both the LLR and CD use summary statistics (the mean and the median, respectively) to incorporate temporal context within each speech utterance.
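The conversion from LPC coefficients to cepstrum coefficients and the resulting distance can be sketched as follows. The recursion is the standard one for an all-pole model written as 1 / (1 − Σ a_k z^−k); this sign convention, the restriction to the first p cepstrum coefficients, and the function names are assumptions made for illustration.

```python
import math

def lpc_to_cepstrum(a):
    """Cepstrum coefficients c_1..c_p of the all-pole model 1 / (1 - sum_k a_k z^-k).

    a[k - 1] holds the predictor coefficient a_k (assumed sign convention).
    """
    p = len(a)
    c = []
    for m in range(1, p + 1):
        acc = a[m - 1]
        for k in range(1, m):
            acc += (k / m) * c[k - 1] * a[m - k - 1]
        c.append(acc)
    return c

def cepstrum_distance(c_ref, c_tgt, lower=0.0, upper=10.0):
    """Frame-level cepstrum distance in dB, limited to [lower, upper]."""
    squared = sum((r - t) ** 2 for r, t in zip(c_ref, c_tgt))
    value = (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)
    return min(max(value, lower), upper)

# First-order check: for a single pole at 0.5, c_1 = 0.5 and c_2 = 0.5**2 / 2.
c_example = lpc_to_cepstrum([0.5, 0.0])
cd_example = cepstrum_distance(c_example, [0.0, 0.0])
```

For a single pole at 0.5 the cepstrum of the model is 0.5^n / n, which gives a quick sanity check of the recursion.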
c. Frequency Modulation Measures.
Late reverberant reflections tend to smear the speech signal envelope, leading to changes in the modulation spectrum. These changes in the modulation spectrum can be incorporated into an objective measure to infer intelligibility trends. The modulation spectrum area (ModA) (Chen et al., 2013) exploits the observation that the area of the modulation spectrum decreases with increasing amounts of reverberation. The target signal is first decomposed into four frequency bands with center frequencies from 300 to 7600 Hz. Then, temporal envelopes are extracted using the Hilbert transform, downsampled, and mean subtracted. The modulation spectrum is computed for each acoustic frequency band by passing envelopes through a 1/3-octave filterbank with center frequencies ranging between 0.5 and 8 Hz. The area under the modulation spectrum ($A_i$) is found within each acoustic frequency band (indicated by $i$) after summing 13 modulation indices (spanning 0.5–10 Hz) across each spectrum. ModA results from the average of the modulation spectrum areas over all $N = 4$ frequency bands,
$$\mathrm{ModA} = \frac{1}{N}\sum_{i=1}^{N} A_i. \tag{7}$$
ModA was originally validated using intelligibility data from CI users under noisy reverberant and T-F masked noisy reverberant conditions (Chen et al., 2013).
The SRMR (Falk et al., 2010) exploits the fact that reverberant signals contain increased high-frequency modulation energy. The SRMR estimates the ratio of the spectral modulation energy as a result of early and late reverberant reflections. First, the modulation spectral energy for each acoustic frequency band within the target signal is computed, the modulation frequency bins are grouped into eight bands, and each band is averaged over time, resulting in the average modulation energy $\bar{E}_{j,k}$ in frequency band $j$ for modulation filter $k$. The average modulation-band energy $\bar{E}_k$ is then found over frequencies, and a ratio of low to high modulation frequency content is computed,
$$\mathrm{SRMR} = \frac{\sum_{k=1}^{4}\bar{E}_k}{\sum_{k=5}^{K}\bar{E}_k}, \tag{8}$$
where $K$ is the number of modulation filters. SRMR uses an initial 23-channel gammatone filterbank to bandpass filter the target signal. A CI-specific variant of the SRMR (SRMR-CI) was created to predict CI speech intelligibility by replacing the gammatone filterbank with a 22-channel filterbank emulating the filterbank used in Nucleus CI devices and by using a smaller range of modulation bands (4–64 Hz compared to 4–128 Hz) (Santos et al., 2013; Santos and Falk, 2014). Validation of the SRMR-CI measure used intelligibility data from CI users under reverberant, noisy reverberant, and T-F masked noisy reverberant conditions (Santos et al., 2013; Santos and Falk, 2014). ModA, SRMR, and SRMR-CI produce scores with positive values, with larger values indicating better intelligibility. Unlike the other measures included in this analysis, the modulation frequency-based measures do not require a reference signal as they rely on only the modulation frequency content of the target signal.
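The final ratio step of the SRMR can be sketched as follows, assuming the time-averaged modulation energies have already been computed for each acoustic band and modulation band. Following Falk et al. (2010), the first four modulation bands form the numerator and the remaining bands the denominator; the function name and the toy energies are hypothetical.

```python
def srmr_ratio(energy, n_low=4):
    """Ratio of low to high modulation-band energy.

    energy[j][k] is the time-averaged modulation energy in acoustic band j
    and modulation band k. Summing rather than averaging over acoustic bands
    leaves the ratio unchanged, because the same scale factor appears in the
    numerator and the denominator.
    """
    n_mod = len(energy[0])
    per_mod_band = [sum(row[k] for row in energy) for k in range(n_mod)]
    return sum(per_mod_band[:n_low]) / sum(per_mod_band[n_low:])

# 23 acoustic bands by 8 modulation bands, with made-up energies in which
# the four low modulation bands carry four times the energy of the high ones.
toy_energy = [[4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0] for _ in range(23)]
toy_score = srmr_ratio(toy_energy)
```

Because reverberation adds energy to the higher modulation bands, the denominator grows and the score falls as reverberation increases, consistent with larger scores indicating better intelligibility.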
d. SNR Estimation Measures.
A standard approach to objective speech perception measurement leverages local estimates of the SNR to determine what portion of the spectral information is audible to the listener (ANSI, 1997). The SNR estimates are combined across frequency channels, often using a band importance weighting, $W_j$, over frequencies $j$,
$$\frac{\sum_{j=1}^{J} W_j\,\mathrm{SNR}_j}{\sum_{j=1}^{J} W_j}, \tag{9}$$
where $J$ is the number of frequency channels. The articulation index weightings defined in the ANSI standard (ANSI, 1997) are often used to weight the estimated SNR over frequencies, although more recently a weighting function based on the reference spectrum raised to a power was shown to provide superior performance for some measures (Ma et al., 2009). Before being combined across frequency channels, the SNR estimate is limited to remove segments at the upper and lower bounds of intelligibility perception and then mapped to fall in the range of 0 to 1.
The normalized covariance measure (NCM) (Hollube and Kollmeier, 1996) uses the normalized covariance between the target and reference signal envelopes, $r_j$, in frequency band $j$ to estimate the apparent SNR, with envelopes obtained using the Hilbert transform,
| $\mathrm{SNR}_j = \left[ 10 \log_{10} \left( \dfrac{r_j^{2}}{1 - r_j^{2}} \right) \right]_{-15}^{15}$ | (10) |
where $[\,\cdot\,]_{-15}^{15}$ indicates limiting the SNR estimate to the range from −15 to 15 dB. In contrast, the coherence speech intelligibility index (CSII) (Kates, 2005) leverages information from the spectrum in short time frames to estimate the SNR using the normalized cross-spectral density summarized by the magnitude squared coherence (MSC) to weight the target spectrum,
| (11) |
where are critical passband filters used to group the values into 25 frequency bands; and index the output and input frequency channels, respectively; and is the number of temporal frames. Particular to reverberant distortions, the frequency-weighted segmental speech-to-reverberant ratio (fwSRR) (Ma et al., 2009) leverages the SRR as the estimate of distortion within frequency bands by considering the difference between the target and reference signals as indicative of reverberant distortions. The fwSRR weights the reference and target signal using a Gaussian-shaped window and then combines the resulting signals in an intermediate ratio of signal distortion within each frequency band,
| (12) |
where is the number of frequency bands; is the number of temporal frames; and and denote the reference and target signal spectra, respectively, after Gaussian-window weighting for frequency band at frame . The fwSRR uses the median to incorporate temporal context within each speech utterance. The NCM and CSII produce scores between 0 and 1 while the fwSRR produces positive-valued scores. Larger values indicate better intelligibility for all SNR-estimate measures.
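The SNR-estimate pipeline described above (a band-wise SNR estimate, limiting, mapping to the range 0 to 1, and band-importance weighting) can be sketched for the NCM as follows. This is an illustration under stated assumptions, not the implementation used in the study: the envelopes are taken as already extracted, the mapping from the limited SNR to the range 0 to 1 is assumed to be linear, and the function names and weights are hypothetical.

```python
import math


def band_snr_from_envelopes(target_env, reference_env, lo=-15.0, hi=15.0):
    """Apparent SNR (dB) in one band from the normalized envelope covariance."""
    n = len(target_env)
    mt = sum(target_env) / n
    mr = sum(reference_env) / n
    cov = sum((t - mt) * (r - mr) for t, r in zip(target_env, reference_env))
    var_t = sum((t - mt) ** 2 for t in target_env)
    var_r = sum((r - mr) ** 2 for r in reference_env)
    if var_t == 0 or var_r == 0:
        return lo                      # a flat envelope carries no information
    r2 = cov * cov / (var_t * var_r)   # squared normalized covariance
    r2 = min(r2, 1.0 - 1e-12)          # avoid dividing by zero when r2 == 1
    if r2 <= 0.0:
        return lo
    snr_db = 10.0 * math.log10(r2 / (1.0 - r2))
    return max(lo, min(hi, snr_db))    # limit to [lo, hi] dB


def weighted_index(band_snrs_db, weights, lo=-15.0, hi=15.0):
    """Map limited band SNRs to [0, 1] (assumed linear) and apply band weights."""
    mapped = [(s - lo) / (hi - lo) for s in band_snrs_db]
    return sum(w * m for w, m in zip(weights, mapped)) / sum(weights)
```

Dividing by the sum of the weights keeps the final score between 0 and 1 even when the band-importance weights do not sum to one.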
e. Spectral Envelope Correlation Measures.
The final approach to objective evaluation leverages the correlation between the temporal envelopes of the reference and target signals. The short-time objective intelligibility (STOI) measure segments each signal into one-third octave bands and overlapping temporal frames, normalizes and clips the target signal, correlates the target and reference signals, and then averages correlations over all frequency channels (Taal et al., 2011). Similarly, the ECM extracts the temporal envelopes for each signal from ACE CI processing, downsamples the envelopes to 50 Hz to limit the modulation frequencies to those that contribute the most to intelligibility in CI users, correlates the target and reference signals, and then averages correlations across frequency channels (Yousefian and Loizou, 2012). By leveraging the ACE processed signals, ECM implicitly incorporates information about the compression function, number of active electrodes, and electrical dynamic range that characterize individual CI users' settings. Both STOI and ECM produce scores between 0 and 1, with larger values indicating better intelligibility. STOI limits the target signal to be lower bounded at −15 dB relative to the reference signal, while ECM contains no such dynamic range limitation stage.
2. Modifications to the objective intelligibility measure
Most objective intelligibility measures aim to model typical hearing in additive noise conditions. When evaluating objective measures, the ability of a measure to predict speech intelligibility trends for a particular listener or acoustic scenario can be greatly impacted by changes to the computation of the objective measure, such as alterations to the band-importance function or spectral filterbank (Ma et al., 2009; Santos et al., 2012). After a baseline evaluation of each measure, we examine two changes to the objective measure computation that may improve speech intelligibility prediction for mitigated-reverberant and CI-processed signals.
The CI stimulation pattern is characterized by limited spectral resolution, due to the limited number of electrodes contained in the electrode array. Most of the measures examined in this study utilize spectral representations that reflect the fine frequency resolution available to normal hearing individuals, such as the spectral representation produced by the one-third octave bands during the computation of the STOI measure. Given that the spectral resolution conferred by the CI stimulation pattern is considerably reduced in comparison to the resolution available to NH individuals, we investigate the use of CI-specific spectral filterbanks during the computation of the spectral representation. The CI spectral representation was extracted as described for the ACE processing simulation in Sec. II A, aligning the computed spectral features used by the objective measure with the frequency bands used during CI processing, vocoding, and T-F mask application.
For intrusive objective measures, measure computation could also be altered by modifying the reference signal which exemplifies a highly intelligible version of the target speech utterance. To our knowledge, changes to the reference signal have not been previously considered in the literature, where typically the clean speech signal is used as the reference signal (e.g., Cosentino et al., 2012; Hollube and Kollmeier, 1996; Hu and Loizou, 2006; ITU-T, 2001; Kates, 2005; Kinoshita et al., 2016; Ma et al., 2009; Quackenbush et al., 1988; Santos et al., 2012; Taal et al., 2011; Yousefian and Loizou, 2012). When considering reverberant conditions, reverberant signals will contain some temporal delay and energetic decay due to the propagation of the original signal along the direct path between speaker and listener. This will lead to temporal and amplitude incongruences between the target and reference signals, even for ideally mitigated reverberant signals. In this study, we examine the effect of using the direct path of the reverberant signal as a reference to account for the temporal delay and amplitude attenuation due to reverberant reflections specific to each utterance. We also investigated the contributions of temporal delay to objective measure performance to determine if performance changes arising from the use of the direct path reference are solely due to the time alignment of the reference and target signals. Temporal alignment is imposed by delaying the clean reference signal by the temporal lag with the largest cross correlation between the target and reference signals.
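The time-alignment step described above can be sketched in a few lines. The sketch assumes that the reverberant target only ever lags the clean reference, so only non-negative lags are searched. The function names and the `max_lag` search bound are hypothetical, and a practical implementation would likely use an FFT-based cross correlation for long signals.

```python
def best_lag(target, reference, max_lag):
    """Return the delay (in samples) of `reference` that maximizes its
    cross correlation with `target`, searching lags 0..max_lag."""
    best, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate target[n] with reference[n - lag] over the overlap.
        stop = min(len(target), len(reference) + lag)
        score = sum(target[n] * reference[n - lag] for n in range(lag, stop))
        if score > best_score:
            best, best_score = lag, score
    return best


def delay(signal, lag):
    """Delay a signal by `lag` samples, zero-padding the start and
    truncating the end so that the length is unchanged."""
    return ([0.0] * lag + list(signal))[:len(signal)]


# Usage: align a clean reference with a delayed, attenuated target.
clean = [0, 1, 0, -1, 0, 0, 0, 0]
target = [0, 0, 0, 0.5, 0, -0.5, 0, 0]
aligned_reference = delay(clean, best_lag(target, clean, max_lag=4))
```

Note that alignment alone leaves the amplitude of the clean reference unchanged, which is what distinguishes this reference from the direct-path reference.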
3. Performance criteria
Four performance criteria were used to evaluate the ability of the objective measures to predict speech intelligibility, as in previous analyses of objective measures (Chen et al., 2013; Cosentino et al., 2012; Falk et al., 2015; Feng and Chen, 2022; Santos et al., 2012; Santos et al., 2013). The first criterion, Pearson's correlation coefficient (Pearson, 1894), quantifies the linear relationship between the subjective intelligibility results and the objective measure score. To probe the relative performance between testing conditions, the second criterion assesses the ranking capabilities of the objective measure using Spearman's correlation coefficient. Often the mapping from intelligibility performance to objective score is monotonic but non-linear (Santos et al., 2013; Taal et al., 2011). To address this possibility, we calculate a sigmoid mapping from the objective score to the subjective intelligibility space (Plomp, 1986) and use the sigmoid-transformed objective score to calculate a sigmoidal Pearson's correlation as our third criterion. To facilitate statistical analysis, the negatives of the CD and LLR measures, for which smaller values indicate greater speech intelligibility, were used in the analysis. The fourth criterion is an estimate of the standard deviation of the error that would result if the sigmoid-transformed objective measure were used in place of the subjective intelligibility score. To reduce intra- and inter-subject variability, subjective and objective scores were averaged within each condition and reported on a per-condition basis prior to calculation of the performance criteria (Möller et al., 2011). To assess whether the correlation-based performance criteria of two or more measures differed significantly, a Fisher transformation z-test was used with a significance level of 0.05.
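The correlation-based criteria and the significance test described above can be sketched with the Python standard library. This is a minimal illustration rather than the analysis code used in the study: the sigmoid fit is omitted, the function names are hypothetical, and the z-test shown treats the two correlations as independent samples of sizes `n1` and `n2` (the number of per-condition scores), which is a simplifying assumption.

```python
import math
from statistics import NormalDist, mean


def pearson(x, y):
    """Pearson's correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def fisher_z_test(r1, r2, n1, n2):
    """Two-sided p-value for the difference between two correlations,
    assuming independent samples (a simplification)."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

A difference between two measures would be declared significant when the returned p-value falls below 0.05.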
III. RESULTS
A. Overall trends
Table I shows the performance criteria of the objective intelligibility measures for predicting trends in subjective intelligibility outcomes. Since T-F masking strategies vary in their manner of information removal, we anticipated that objective measures would exhibit different performance trends when the T-F masking strategies were considered individually. The performance of each measure was therefore analyzed over all conditions and for conditions mitigated by the binary and ratio masking strategies separately. When all conditions from the subjective study were considered, the CD and fwSRR measures best reflected the subjective intelligibility trends, achieving the highest Pearson's, Spearman's, and sigmoidal Pearson's correlations (CD: 0.91, 0.95, and 0.93, respectively; fwSRR: 0.92, 0.94, and 0.92, respectively), and the lowest standard deviation of the estimation error (CD: 0.05, fwSRR: 0.05) among the tested measures.
TABLE I.
Per-condition performance criteria for all measures. Measure performance is evaluated over all conditions, only BM conditions, or only RM conditions in the subjective intelligibility dataset. The criteria in bold represent the set of measures that performed significantly better (p < 0.05) than the remaining measures. Within each group of conditions, the four columns give, in order, Pearson's correlation, Spearman's correlation, the sigmoidal Pearson's correlation, and the standard deviation of the estimation error of the sigmoid-transformed intelligibility function.
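| Objective measure | All: Pearson | All: Spearman | All: sigmoidal Pearson | All: error SD | BM: Pearson | BM: Spearman | BM: sigmoidal Pearson | BM: error SD | RM: Pearson | RM: Spearman | RM: sigmoidal Pearson | RM: error SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|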
| ECM | 0.72 | 0.61 | 0.74 | 0.07 | 0.56 | 0.15 | 0.61 | 0.05 | 0.46 | 0.52 | 0.46 | 0.04 |
| STOI | 0.74 | 0.53 | 0.74 | 0.07 | 0.40 | 0.07 | 0.45 | 0.05 | 0.63 | 0.66 | 0.63 | 0.04 |
| SRMR | 0.84 | 0.77 | 0.83 | 0.06 | 0.74 | 0.58 | 0.84 | 0.05 | 0.95 | 0.99 | 0.94 | 0.03 |
| SRMR-CI | 0.82 | 0.71 | 0.81 | 0.06 | 0.72 | 0.62 | 0.82 | 0.05 | 0.97 | 0.99 | 0.97 | 0.02 |
| ModA | −0.26 | −0.21 | −0.26 | 0.00 | 0.01 | −0.07 | 0.27 | 0.06 | −0.31 | −0.55 | −0.31 | 0.00 |
| NCM | 0.24 | 0.22 | 0.22 | 0.04 | 0.01 | 0.10 | 0.13 | 0.06 | 0.39 | 0.59 | 0.39 | 0.04 |
| CSII | 0.21 | 0.20 | 0.20 | 0.03 | −0.06 | 0.07 | 0.10 | 0.06 | 0.39 | 0.55 | 0.39 | 0.04 |
| fwSRR | 0.92 | 0.94 | 0.92 | 0.05 | 0.96 | 0.89 | 0.99 | 0.01 | 0.82 | 0.95 | 0.82 | 0.04 |
| CD | 0.91 | 0.95 | 0.93 | 0.05 | 0.98 | 0.98 | 0.98 | 0.02 | 0.85 | 0.89 | 0.85 | 0.04 |
| LLR | 0.78 | 0.86 | 0.81 | 0.07 | 0.86 | 0.93 | 0.86 | 0.04 | 0.72 | 0.76 | 0.72 | 0.04 |
| PESQ | 0.79 | 0.64 | 0.78 | 0.07 | 0.50 | 0.58 | 0.51 | 0.05 | 0.49 | 0.52 | 0.49 | 0.04 |
| oPESQ | 0.70 | 0.40 | 0.71 | 0.07 | 0.37 | 0.46 | 0.37 | 0.00 | −0.26 | 0.08 | −0.26 | 0.00 |
For the modulation frequency-based measures, the SRMR and the SRMR-CI indicated intelligibility trends well, resulting in Pearson's correlations of 0.84 and 0.82, respectively, while the ModA performed poorly, obtaining a Pearson's correlation of −0.26. These results suggest that the relative ratio of modulation energies is a better indicator of intelligibility trends than the area of the modulation spectrum. The NCM and CSII measures also demonstrated very poor performance (Pearson's correlations of 0.24 and 0.21, respectively) in this analysis. The third SNR-estimate measure included in this analysis, the fwSRR, demonstrated considerably better performance than the NCM and CSII measures, achieving a Pearson's correlation of 0.92. This was the first evaluation of fwSRR to use CI-processed signals.
For the LPC coefficient measures, the CD achieved good agreement with trends in the subjective intelligibility conditions, earning a Pearson's correlation of 0.91. Performance of the LLR was modest, resulting in a Pearson's correlation of 0.78, indicating that the intelligibility-relevant differences between the target and reference LPC coefficients are better described by cepstral distance than by the log-likelihood ratio.
The oPESQ measure of Kokkinakis and Loizou (2011) performed worse than its original counterpart, PESQ (Pearson's correlations of 0.70 and 0.79 for oPESQ and PESQ, respectively), despite oPESQ using parameters that were fit to subjective quality outcomes of reverberant speech. This result further emphasizes that a variety of conditions must be represented in the dataset used in the development and evaluation of objective measures to ensure robust measure performance.
When evaluated on conditions mitigated by the BM, the LPC coefficient-based measures, particularly the CD measure, and the fwSRR measure had the highest correlation with intelligibility trends. The CD, fwSRR, and LLR achieved Pearson's correlations of 0.98, 0.96, and 0.86, respectively, on binary-mask mitigated conditions. The LPC coefficients of speech indicate the resonances of the vocal tract that shape the broader scale spectro-temporal features of the speech signal. The wholesale deletion of T-F units by the BM perturbs the global spectro-temporal structure, appearing as prominent deviations in LPC-based measures.
For conditions mitigated by the RM, the SRMR and SRMR-CI measures achieved the highest correlations with intelligibility trends, with Pearson's correlations of 0.95 and 0.97, respectively, followed by the CD measure and the fwSRR measure, which achieved Pearson's correlations of 0.85 and 0.82, respectively. Since the SRMR, SRMR-CI, and fwSRR measures capture information in the local temporal modulations within each T-F unit, they are likely sensitive to perturbations by the continuous values of the RM.
B. Per-measure trends
We examined measure performance individually to investigate how the performance was impacted by the conditions included in the subjective intelligibility dataset. Figure 1 presents subjective intelligibility scores as a function of the predicted objective intelligibility scores, along with the sigmoidal mapping used to calculate the sigmoidal Pearson's correlation criterion. Each masking strategy included multiple listening conditions resulting from different amounts of attenuation of the reverberant signal, with one level of attenuation resulting in the best outcomes in subjective intelligibility for each masking strategy.
FIG. 1.
Scatterplots of objective versus subjective intelligibility scores for the: (A) ECM, (B) STOI, (C) SRMR, (D) SRMR-CI, (E) ModA, (F) NCM, (G) CSII, (H) fwSRR, (I) CD, (J) log LLR, (K) PESQ, and (L) oPESQ. BM and RM conditions are indicated by circle and triangle markers, respectively. The shading of markers indicating binary- or ratio-masking conditions describes the proportion of the reverberant signal removed in that condition. Direct-path reverberant and unmitigated reverberant conditions are indicated by plus-sign and cross-markers, respectively.
Measures exploiting temporal envelope correlation features achieved modest correlations (Pearson's correlations of 0.72 and 0.74 for the ECM and STOI, respectively) with the subjective intelligibility outcomes. The ECM [Fig. 1(A)] resulted in slightly smaller correlations with the subjective intelligibility scores than the STOI measure [Fig. 1(B)], suggesting that the one-third octave frequency filterbank and band-limiting used by the STOI measure isolated speech features that better indicated intelligibility trends for the T-F masked reverberant signals after CI processing and vocoding. Both the ECM and STOI measures tended to overestimate the intelligibility benefits of sparse, over-mitigated signals, as demonstrated by markers with lighter shading earning larger objective intelligibility scores. For instance, in the BM-mitigated conditions, both the ECM and STOI indicated further increases in intelligibility after 53.3% of the signal was attenuated, despite the greatest restoration in subjective intelligibility occurring after 44.6% of the reverberant signal was attenuated.
The performance of the measures using frequency modulation information is presented in Figs. 1(C)–1(E). The SRMR [Fig. 1(C)] and SRMR-CI [Fig. 1(D)] measures achieved good correlation with the subjective intelligibility scores (Pearson's correlations of 0.84 and 0.82, respectively), while the ModA measure [Fig. 1(E)] did not (Pearson's correlation of −0.26). The SRMR and SRMR-CI measures demonstrated a linear, monotonically increasing relationship with subjective intelligibility scores in the ratio-masked conditions. In the binary-masked conditions, both the SRMR and SRMR-CI measures overestimated the intelligibility benefits of under-attenuated reverberant signals, indicating the signals with 25.2% and 30.5% attenuation as having the best objective intelligibility, respectively. In contrast, the ModA measure gave a poor indication of intelligibility trends in both the BM and RM conditions. As shown in Fig. 1(E), the ModA measure confounded increasing attenuation with improved intelligibility outcomes in both ratio- and binary-masked conditions, leading to poor prediction of subjective intelligibility trends.
The performance of measures which use estimates of the SNR to approximate perceptual outcomes is presented in Figs. 1(F)–1(H). The NCM [Fig. 1(F)] and CSII [Fig. 1(G)] measures achieved poorer correlations with subjective intelligibility trends (Pearson's correlations of 0.24 and 0.21, respectively) than the fwSRR measure [Fig. 1(H); Pearson's correlation of 0.92]. Both the NCM and CSII measures interpreted the presence of reverberant artifacts as an improvement in the objective intelligibility of both binary- and ratio-masked speech. In contrast, the fwSRR measure demonstrated a linear relationship with subjective intelligibility scores, with fwSRR scores accurately capturing trends across the mitigated conditions. For the RM-mitigated conditions, the subjectively most-intelligible condition (after 53.5% attenuation of reverberant artifacts) earned the largest objective fwSRR score. In some BM-mitigated conditions, the fwSRR measure overestimated the intelligibility of the mitigated speech (e.g., after 53.3% and 62.2% attenuation of the stimulus), although these conditions achieved objective measure scores similar to that of the most intelligible binary-mask condition (occurring after 44.6% of the stimulus was removed).
The LPC coefficient-based measures, the CD [Fig. 1(I)] and the LLR [Fig. 1(J)], achieved good correlations with subjective intelligibility scores (Pearson's correlations of 0.91 and 0.78, respectively). Both the CD and LLR measures obtained better performance on the binary-masked speech than the ratio-masked speech, as evidenced by a more linear relationship between objective and subjective scores for binary-masked conditions than ratio-masked conditions in Figs. 1(I) and 1(J). For speech mitigated by the RM, the LLR overestimated the intelligibility of one condition containing reverberant artifacts that were detrimental to speech intelligibility (after 30.6% attenuation of the reverberant signal) relative to the most-intelligible ratio-masked condition (after 53.5% attenuation), while accurately representing trends among the under- and over-mitigated ratio-mask conditions. In contrast, the CD accurately reflected intelligibility trends for all under- and over-mitigated conditions after mitigation by either the BM or RM.
Finally, the performance of speech quality measures leveraging symmetric and asymmetric audible distortions is presented in Figs. 1(K) and 1(L). Although the PESQ [Fig. 1(K)] and oPESQ [Fig. 1(L)] achieved moderate correlations with subjective intelligibility scores (Pearson's correlations of 0.79 and 0.70, respectively), on par with the LLR, STOI, and ECM, both the PESQ and oPESQ measures tended to overestimate the intelligibility outcomes of under-attenuated signals. This outcome is demonstrated in the plot of PESQ scores in Fig. 1(K), where the under-attenuated binary and ratio-masked conditions (characterized by less than 36.9% and 30.61% attenuation, respectively) tended to achieve larger PESQ scores than their over-attenuated counterparts. The oPESQ measure resulted in a similar objective score across all mitigated conditions, yielding poor performance when distinguishing between over- and under-attenuated conditions for either the BM or RM.
C. Impact of objective measure modifications
To improve the intelligibility prediction for signals characterized by reverberation or CI processing, we investigated modifications to the objective measures intended to better capture the changes in speech content imposed by reverberant enclosures and CI processing. The change in measure performance with the proposed modifications is highlighted in Fig. 2 in terms of Pearson's correlation and presented in Table II in terms of the change in Pearson's correlation, Spearman's correlation, the sigmoidal Pearson's correlation, and the standard deviation of the estimation error.
FIG. 2.
(Color online) Changes in Pearson's correlation coefficient over all masking conditions after modifications to the reference signal or spectral filterbank, either: (A) when the direct path (Direct) signal is used as the reference signal, (B) when the clean reference signal is aligned temporally with the target signal, and (C) when the CI filterbank is used to extract spectral features when using the clean reference signal. Marker shape and color distinguish groups of objective measures leveraging the same category of speech features. Line style differentiates measures within each category of speech features.
TABLE II.
Change from the baseline performance criteria presented in Table I after modification to the objective measure under evaluation. The modifications included: (i) using the direct path signal as the reference signal; (ii) using the time-aligned clean signal as the reference signal; and (iii) replacing the spectral filterbank with the CI spectral filterbank during speech feature computation. For the correlation criteria, a positive value indicates an improvement from the baseline to the modified performance criteria; for the standard deviation of the estimation error, a negative value indicates a reduction in error. The criteria in bold represent the set of measures that performed significantly better (p < 0.05) after the indicated modification. Within each modification, the four columns give, in order, the change in Pearson's correlation, Spearman's correlation, the sigmoidal Pearson's correlation, and the standard deviation of the estimation error of the sigmoid-transformed intelligibility function.
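| Objective measure | Direct path: Δ Pearson | Direct path: Δ Spearman | Direct path: Δ sigmoidal Pearson | Direct path: Δ error SD | Time-aligned clean: Δ Pearson | Time-aligned clean: Δ Spearman | Time-aligned clean: Δ sigmoidal Pearson | Time-aligned clean: Δ error SD | CI filterbank: Δ Pearson | CI filterbank: Δ Spearman | CI filterbank: Δ sigmoidal Pearson | CI filterbank: Δ error SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|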
| ECM | 0.24 | 0.31 | 0.22 | −0.04 | 0.20 | 0.33 | 0.21 | −0.03 | — | — | — | — |
| STOI | 0.16 | 0.29 | 0.16 | −0.02 | 0.15 | 0.26 | 0.15 | −0.01 | 0.21 | 0.37 | 0.22 | −0.03 |
| NCM | 0.40 | 0.08 | 0.41 | 0.03 | 0.47 | 0.41 | 0.55 | 0.04 | 0.50 | 0.44 | 0.57 | 0.04 |
| CSII | 0.38 | 0.08 | 0.39 | 0.04 | 0.20 | 0.08 | 0.20 | 0.03 | 0.25 | 0.09 | 0.25 | 0.03 |
| fwSRR | 0.03 | −0.04 | 0.03 | −0.01 | −0.03 | −0.03 | −0.01 | 0.00 | −0.03 | −0.04 | −0.02 | 0.01 |
| CD | 0.03 | −0.01 | 0.04 | −0.01 | −0.01 | −0.03 | 0.00 | 0.00 | −0.06 | −0.04 | −0.05 | 0.01 |
| LLR | 0.15 | 0.01 | 0.11 | −0.02 | 0.01 | −0.04 | 0.01 | 0.00 | −0.01 | −0.06 | −0.02 | 0.00 |
| PESQ | −0.07 | 0.03 | 0.03 | 0.00 | 0.01 | 0.03 | 0.01 | 0.00 | — | — | — | — |
| oPESQ | −0.01 | 0.29 | 0.07 | 0.00 | 0.01 | 0.19 | 0.04 | 0.01 | — | — | — | — |
First, the reference signal used by intrusive objective intelligibility measures was modified to accommodate the temporal characteristics of reverberant speech. All intrusive measures were evaluated first using the clean reference signal as a baseline and then using the direct path of the reverberant signal as the reference. The change in performance with the direct path reference signal is presented in Fig. 2(A) and in the left-most section of Table II. Using the direct path reference, the intelligibility prediction performance of the ECM, STOI, LLR, NCM, and CSII measures, as assessed by Pearson's correlation, was raised to 0.96, 0.90, 0.92, 0.64, and 0.59, respectively, as shown in Fig. 2(A). When using the direct path reference signal, statistically significant improvements in performance from the clean reference signal were noted for the ECM (of 0.24, 0.31, and 0.22, for the Pearson's, Spearman's, and sigmoidal Pearson's correlation criteria, respectively) and LLR (of 0.15 for the Pearson's correlation criterion) measures, accompanied by decreases in the standard deviation of the estimation error (of 0.04 and 0.02, respectively). Although substantial improvements in the performance of the STOI, NCM, and CSII measures were observed (for example, improvements in Pearson's correlation of 0.16, 0.40, and 0.38, respectively), statistical significance was not reached after correlation coefficients were normalized using the Fisher z transformation. The LLR, NCM, and CSII measures yielded improvements in the Pearson's and sigmoidal Pearson's correlation criteria, but not the Spearman's correlation criterion, suggesting that the use of the direct path reference improved the overall linear relationship of objective scores to subjective outcomes while having little effect on the ranking of listening conditions in terms of subjective difficulty. This fact, combined with the observation that the standard deviation of the estimation error increased for the NCM and CSII measures (by 0.03 and 0.04, respectively), prompted an inspection of the measure-specific performance per listening condition.
This inspection revealed that improvements in correlation-based criteria were largely driven by a correction in the objective score and ranking assigned to the direct path condition, and not by improvements in intelligibility prediction of the T-F masked conditions. Thus, the use of the direct path reference signal did not ameliorate the errors in intelligibility prediction observed for the NCM and CSII measures, in which the intelligibility of under-attenuated mitigated conditions was over-estimated (see Fig. 1).
Notably, most of the measures demonstrating performance gains with the direct-path reference employ some form of temporal correlation of the reference and target signals during speech feature computation. As the speech features contained in the reverberant signal are delayed in time by an amount that reflects propagation along the direct path from source to listener, performance improvements using the direct-path reference could have largely resulted from an improved temporal alignment of the target and reference signals. To investigate this hypothesis, a second analysis was conducted with the clean reference signal after temporal alignment with the target signal. The changes in measure performance using the temporally aligned clean reference signal are presented in Fig. 2(B) and the middle section of Table II. After temporally aligning the clean reference signal with each target utterance, the ECM, STOI, NCM, and CSII measures demonstrated improvements in Pearson's correlation with subjective intelligibility scores of 0.20, 0.15, 0.47, and 0.20, respectively, leading to final Pearson's correlations of 0.92, 0.89, 0.71, and 0.41, respectively, as shown in Fig. 2(B). After applying the Fisher transformation to the correlation metrics, statistically significant improvements were noted for only the ECM and NCM measures (Table II). Unlike when using the direct path reference signal, the performance of the LLR measure did not improve when using the temporally aligned clean reference signal, suggesting that the improvements previously noted for the LLR measure were likely due to the amplitude attenuations captured by the direct path reference signal after propagation through the reverberant enclosure. Interestingly, although the ECM, STOI, NCM, and CSII measures all explicitly rely on the correlation of the temporal envelopes within frequency bands, only the ECM and NCM resulted in statistically significant improvements in the correlation criteria.
For the NCM, improvements using the time-aligned clean reference were greater than those using the direct path reference [compare Fig. 2(A) and Fig. 2(B)]. Despite benefitting from the time-alignment of the target and reference signals inherently imposed by the direct-path reference, the covariance-based estimate of the SNR employed by the NCM better reflected intelligibility trends when calculated using the clean reference signal.
For the second modification, the spectral filterbank used during speech feature computation was modified to resemble the spectral filterbank used during CI processing and vocoding, reflecting the spectral resolution available in the mitigated reverberant signals. This modification was implemented only for the STOI, NCM, CSII, fwSRR, CD, and LLR measures. The SRMR-CI, ModA, and ECM measures already employ spectral filterbanks matching the spectral resolution used during speech processing by a CI, and implementation of the CI spectral filterbank within the PESQ computation was infeasible. The change in measure performance using the CI spectral filterbank is presented in the rightmost section of Table II and in Fig. 2(C). When using the CI spectral filterbank, the STOI, NCM, and CSII all demonstrated improvements in intelligibility prediction, as demonstrated by improvements in Pearson's correlation of 0.21, 0.50, and 0.25, respectively, yielding final overall Pearson's correlations of 0.95, 0.74, and 0.46, respectively, as shown in Fig. 2(C). Only the STOI improvements in all three correlation criteria and the NCM improvements in two of the correlation criteria were statistically significant, although the standard deviation of the prediction error incurred by STOI increased after modifications to the spectral filterbank.
IV. DISCUSSION
This study validated objective intelligibility measures against subjective intelligibility scores obtained with NH listeners and vocoded reverberant speech material to facilitate the efficient development of reverberant speech enhancement algorithms for CIs. Speech material was processed by a simulation of CI processing to extract a stimulation pattern over time and electrode channel and then subsequently vocoded, capturing the spectro-temporal resolution provided by the CI. To simulate intelligibility outcomes resulting from a speech enhancement algorithm during development, reverberant speech components were gradually removed from the reverberant signal and speech intelligibility was measured as the percent of phonemes correctly identified. Gradual removal of reverberant speech components was achieved by varying the parameters of T-F masking algorithms, allowing the subjective intelligibility dataset to reflect the restoration of phonemic information with increasingly aggressive attenuation of reverberant signal components.
A. Comparison to previous work—Differences in the listener population
The objective measure performance presented in Sec. III A differs in a few notable ways from measure performance presented in previous studies employing subjective data from CI listeners (Chen et al., 2013; Falk et al., 2015; Santos et al., 2012; Santos et al., 2013; Santos and Falk, 2014). Previous analyses, using data from CI listeners in unmitigated noisy and reverberant conditions (Santos et al., 2013) and optimally-mitigated noisy and reverberant conditions (Chen et al., 2013; Santos and Falk, 2014), suggested that the NCM, CSII, and ModA measures were good candidates for probing the intelligibility of reverberant and enhanced reverberant speech for CI listeners, with the NCM achieving a correlation of 0.96 (Chen et al., 2013; Santos et al., 2013), CSII achieving a correlation of 0.93 (Santos et al., 2013), and ModA achieving correlations of 0.78, 0.82, and 0.98 across the three studies (Chen et al., 2013; Santos et al., 2013; Santos and Falk, 2014). In our analysis, the NCM, CSII, and ModA measures demonstrated considerably worse performance, earning Pearson's correlations of 0.24, 0.21, and −0.26, respectively. Smaller differences in performance were noted for the STOI, SRMR, SRMR-CI, and fwSRR measures, with these measures achieving Pearson's correlations of 0.74, 0.84, 0.82, and 0.92, respectively, in this analysis and correlations of 0.81 (Falk et al., 2015), 0.93 (Santos et al., 2013), 0.96 (Santos et al., 2013), and 0.70 (Santos et al., 2012), respectively, in previous analyses employing subjective data from CI listeners.
Differences in the listener population used to obtain the subjective intelligibility dataset could have contributed to discrepancies in measure performance in several ways. Although vocoding can capture the spectro-temporal resolution of the CI stimulus, vocoding does not accurately represent the perception of speech after electrical stimulation of the auditory nerve and subsequent integration at higher cortical levels (Karoui et al., 2019; Svirsky et al., 2021). For example, CI recipients exhibit differences in top-down restoration of words from phonemes when compared to NH individuals listening to vocoded speech (Bhargava et al., 2014). These differences suggest that the vocoder and phoneme-based scoring employed here may not accurately reflect trends in speech intelligibility for CI listeners, particularly across the mitigated conditions in which important speech information may be obfuscated by reverberant artifacts or attenuated by the T-F mask.
Subjective data acquired with NH listeners will also not capture the variability in speech outcomes often observed across CI listeners. When evaluating objective measure performance on CI listener data, Santos et al. observed reductions in measure performance when conditions demonstrating high subjective variability were included in the analysis, such as when both noise and reverberation were present in the speech signal (Santos et al., 2012). In this study, the use of subjective data acquired with NH listeners may have reduced the variability in subjective outcomes within conditions, potentially leading, for instance, to the higher correlation achieved by the fwSRR measure in this analysis (0.92) when compared to an earlier analysis using CI listener data (0.70; Santos et al., 2012). Given the qualitative differences in subjective intelligibility for NH and CI listeners, the analyses presented in this work cannot capture the perception of CI listeners but rather reflect the speech information remaining in a signal with the same spectro-temporal resolution as the CI stimulus after the application of an enhancement algorithm. Further analyses of objective intelligibility measure performance using subjective data acquired with CI listeners are needed to inform the development of speech enhancement algorithms for CIs.
B. Comparison to previous work—Differences in the listening conditions
Differences in measure performance between this study and earlier works will also be influenced by differences in the listening conditions of the subjective intelligibility dataset employed in each analysis. For instance, the listening conditions may contain reverberation or additive noise, which impose qualitatively different distortions on speech. In previous analyses using listening conditions containing additive noise, the ECM and STOI measures demonstrated high correlation with subjective intelligibility trends, with the ECM earning a correlation of 0.96 for intelligibility trends obtained with CI listeners (Yousefian and Loizou, 2012) and the STOI earning a correlation of 0.96 for intelligibility trends obtained with NH listeners (Taal et al., 2011). In our analysis, in which the listening conditions contained a range of mitigated reverberant conditions, the ECM and STOI measures performed modestly, earning correlations of 0.72 and 0.74, respectively. As presented in Sec. III B, these correlation-based measures were only able to restore high correlation with the target speech signal after aggressive attenuation of reverberant components, suggesting that the presence of reverberant artifacts had a detrimental effect on measure performance.
Some measures demonstrated better performance when evaluated in reverberant conditions in our analysis than when previously evaluated in noisy conditions. The fwSRR, CD, and LLR measures, which were originally introduced to probe the quality of speech coding systems (Kitawaki et al., 1988; Quackenbush et al., 1988), demonstrated modest to poor correlations with subjective trends (fwSRR: 0.81; CD: −0.49; LLR: −0.56) when evaluated on noisy speech after the application of one of several noise reduction algorithms (Ma et al., 2009). In our analysis using mitigated reverberant speech material, the fwSRR, CD, and LLR measures correlated well with subjective trends (yielding correlations of 0.92, 0.91, and 0.78, respectively). In fact, the fwSRR and CD measures earned the highest correlations with subjective scores of all the measures tested in this study, indicating that the speech features leveraged by these measures captured the restoration in phonemic information across the enhanced reverberant conditions.
Across studies employing reverberation in the subjective intelligibility dataset, differences in the reverberant listening scenario could also influence differences in objective measure performance. For instance, the degradations due to early and late reverberant reflections will differ with the reverberant environment and with the method used to obtain the RIR (either simulated or recorded). Measures that reflect the unique degradations imposed by one reverberant environment may not perform well in other reverberant environments. Such changes in measure performance with the reverberant environment are apparent for the PESQ and oPESQ measures, which demonstrated correlations of 0.77 and 0.91 when evaluated on NH listener data and reverberant speech generated with recorded RIRs with RT60s of 0.291 and 0.447 seconds (Kokkinakis and Loizou, 2011) and correlations of 0.95 and 0.99 when evaluated on CI listener data and vocoded reverberant speech using simulated RIRs with RT60s in the range of 0.4 to 2.0 seconds (Cosentino et al., 2012). In this analysis, in which reverberant speech material was generated with a recorded RIR with an RT60 of 0.8 seconds, the PESQ and oPESQ measures earned correlations of 0.79 and 0.70, respectively, which are lower than those previously demonstrated in other reverberant scenarios, particularly for the oPESQ measure. Although parameters of the oPESQ measure were selected to reflect perceptual outcomes for NH listeners in reverberation (Kokkinakis and Loizou, 2011), the poorer performance of the oPESQ measure in this analysis suggests that oPESQ does not accurately represent perceptual trends when the characteristics of the reverberant listening scenario change.
Differences in the listening conditions also likely contributed to discrepancies in measure performance observed between this study and previous studies using CI listener data (Chen et al., 2013; Falk et al., 2015; Santos et al., 2012, 2013; Santos and Falk, 2014), as the CI listener data, derived from Hazrati and Loizou (2012a,b), included additive noise at several SNRs and reverberation modelled using recorded RIRs with RT60s of 0.3, 0.6, 0.8, and 1.0 seconds. Although there is some evidence to suggest that measures evaluated on vocoded reverberant speech can reflect intelligibility trends for CI listeners (Cosentino et al., 2012), the unique degradations imposed by the reverberant environment will interact with the compression implemented within CI processing, as both reverberation and compression impose distortions on the temporal envelope of speech (Kerber and Seeber, 2013; Reinhart et al., 2016; Reinhart et al., 2019). These distortions could impact the performance of objective measures that leverage the temporal envelope, such as the ECM and STOI, making it harder to generalize the results from evaluations of objective measures with CI-processed and vocoded speech material across different reverberant conditions. Given the variability in measure performance across studies employing different reverberant listening scenarios, a thorough investigation of measure performance is needed using speech spoken in a range of reverberant conditions.
C. Comparison to previous work—Differences in the speech enhancement algorithm
For evaluations of objective measure performance that included enhanced speech conditions in the subjective dataset, measure performance will also be influenced by characteristics of the speech enhancement algorithm. Of the studies that evaluated objective measures with CI listener data, those that included enhanced-speech conditions (Chen et al., 2013; Falk et al., 2015; Santos and Falk, 2014) all employed T-F masking as the speech enhancement method, in which the BM was implemented with a threshold of either 0 dB or −8 dB (Hazrati and Loizou, 2012a). Of the studies that evaluated objective intelligibility measures with enhanced reverberant speech, none employed the RM as the speech enhancement method. When examining changes in measure performance with the inclusion of BM-enhanced conditions in the evaluation of objective measures, Chen et al. (2013) noted a minor reduction in performance for the ModA measure, while Falk et al. (2015) observed larger reductions in performance for the SRMR, SRMR-CI, STOI, and NCM measures. In our analysis, both ratio- and binary-mask mitigated conditions were included, with several values of the BM threshold (−50 to 12 dB) and the RM parameter (0.005 to 5) employed across conditions. When only RM-mitigated conditions were included in our analysis, the correlations achieved by the SRMR and SRMR-CI measures were larger (0.95 and 0.97, respectively) than when only BM-mitigated conditions were included (0.74 and 0.72, respectively). The STOI and NCM measures also achieved larger correlations with ratio-mask mitigated speech (0.63 and 0.39, respectively) than with binary-mask mitigated speech (0.40 and 0.01, respectively).
These observations are consistent with the reduction in performance for the SRMR, SRMR-CI, STOI, and NCM measures noted by Falk et al. (2015) after the inclusion of binary-mask mitigated conditions, despite the differences in the listener populations across the two studies. When only BM-mitigated conditions were included in our analysis, the CD, fwSRR, and LLR measures demonstrated better performance (correlations of 0.98, 0.96, and 0.86, respectively) than when only RM-mitigated conditions were included in the analysis (correlations of 0.85, 0.82, and 0.72, respectively). These results indicate that the type of T-F masking strategy and therefore the manner of artifact removal (either continuous or discrete) must be considered when evaluating or applying objective intelligibility measures to enhanced speech.
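The mask-specific comparison described above can be illustrated with a short sketch that computes the correlation between objective and subjective scores separately for BM- and RM-mitigated conditions. This is only an illustration: the use of Pearson's coefficient, the function names `pearson` and `correlation_by_mask`, and all scores below are assumptions made for the example, not the statistic, code, or data of this study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_by_mask(conditions):
    """Group (mask_type, objective, subjective) records by mask type and
    return the objective-subjective correlation within each group."""
    groups = {}
    for mask_type, objective, subjective in conditions:
        objs, subs = groups.setdefault(mask_type, ([], []))
        objs.append(objective)
        subs.append(subjective)
    return {m: pearson(objs, subs) for m, (objs, subs) in groups.items()}

# Made-up illustrative scores (objective score, percent correct); NOT study data.
# The last BM condition mimics an over-attenuated signal that a measure rates
# highly even though listeners score poorly on it.
conditions = [
    ("BM", 0.20, 10.0), ("BM", 0.40, 35.0), ("BM", 0.55, 60.0), ("BM", 0.50, 30.0),
    ("RM", 0.25, 15.0), ("RM", 0.45, 40.0), ("RM", 0.60, 62.0), ("RM", 0.70, 75.0),
]

per_mask = correlation_by_mask(conditions)
pooled = pearson([c[1] for c in conditions], [c[2] for c in conditions])
print(per_mask, pooled)
```

Reporting the pooled value alongside the per-mask values makes it easier to see when a single mask type is driving, or masking, the overall correlation.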
The change in speech intelligibility after the application of a speech enhancement algorithm will also differ with the parameters employed by the T-F masking gain function, which influence the amount of attenuation applied to the reverberant speech signal. By employing several values of the mask parameters for the BMs and RMs, the subjective dataset used in this study captured the gradual restoration and subsequent attenuation of the target speech signal after T-F masking. For the BM, the threshold was varied from −50 to 12 dB, capturing 1.5% to 86.3% attenuation of the reverberant speech signal after the application of the mask. Note that these values encompass the threshold parameterizations used by Hazrati and Loizou (2012b) and in evaluations of objective measure performance using CI listener data (Chen et al., 2013; Falk et al., 2015; Santos and Falk, 2014). For the RM, the mask parameter was varied from 0.01 to 5, capturing 2.1% to 92.3% attenuation of the reverberant speech signal. Of the studies that validate objective measures using reverberant speech material, none have employed ratio-mask mitigated reverberant speech. By employing several values of the mask parameters, the subjective dataset used in this study contained both optimally mitigated conditions, which resulted in the largest restorations in subjective intelligibility, and non-optimally mitigated conditions, which resulted in lower subjective intelligibility due to under- or over-attenuation of the reverberant signal. The inclusion of non-optimally mitigated conditions provided a model for objective measure performance during the development of reverberant speech enhancement algorithms. The non-optimally mitigated conditions revealed that the NCM, CSII, and ModA measures are not suitable for evaluating reverberant speech enhancement algorithms during development, as they consistently failed to capture intelligibility trends across the gradually attenuated conditions.
The NCM and CSII measures overestimated the intelligibility of all under-attenuated conditions containing reverberant artifacts, suggesting that the modulations conferred by the reverberant artifacts were interpreted as improvements in SNR by these measures. The ModA measure, on the other hand, overestimated the intelligibility of all sparse over-mitigated signals, interpreting the non-linear distortions imposed by T-F masking as an increased level of modulation, leading to an increase in the area under the modulation spectrum. The inclusion of non-optimally mitigated conditions also highlighted shortcomings in the ECM, STOI, SRMR, SRMR-CI, and PESQ measures, which tended to interpret either over-attenuation (in the case of ECM and STOI) or under-attenuation (in the case of SRMR and SRMR-CI for BM-mitigated conditions and PESQ for all conditions) as improvements in intelligibility. While these measures could still be employed during the development of reverberant speech enhancement algorithms, the previously noted limitations should be considered when interpreting the objective scores assigned by these measures to mitigated speech signals.
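To make the notion of gradual under- and over-attenuation concrete, the following sketch sweeps a BM threshold and an RM parameter over a toy T-F representation and reports how much signal energy each mask removes. Several pieces are assumptions made for illustration and are not taken from this study: the RM gain form `(target / (target + reverb)) ** exponent`, the definition of percent attenuation as the share of mixture energy removed, the additive-energy mixture, and the random energies standing in for real CI channel envelopes.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log of zero

def binary_mask(target_energy, reverb_energy, threshold_db):
    """Keep T-F units whose speech-to-reverberant ratio exceeds the threshold."""
    srr_db = 10.0 * np.log10((target_energy + EPS) / (reverb_energy + EPS))
    return (srr_db > threshold_db).astype(float)

def ratio_mask(target_energy, reverb_energy, exponent):
    """ASSUMED gain form: (target / (target + reverb)) ** exponent."""
    return (target_energy / (target_energy + reverb_energy + EPS)) ** exponent

def percent_attenuation(mask, mixture_energy):
    """ASSUMED definition: share of mixture energy removed by the mask gain."""
    kept = np.sum(mask ** 2 * mixture_energy)
    return 100.0 * (1.0 - kept / np.sum(mixture_energy))

# Hypothetical energies (channels x frames); random stand-ins, not speech.
rng = np.random.default_rng(0)
target = rng.exponential(1.0, size=(8, 100))
reverb = rng.exponential(1.0, size=(8, 100))
mixture = target + reverb  # simplification: energies assumed to add

bm_sweep = [percent_attenuation(binary_mask(target, reverb, t), mixture)
            for t in (-50.0, -10.0, 0.0, 12.0)]
rm_sweep = [percent_attenuation(ratio_mask(target, reverb, b), mixture)
            for b in (0.01, 0.5, 1.0, 5.0)]
print(bm_sweep, rm_sweep)
```

Under these assumptions, attenuation grows monotonically with either parameter, so a sweep from the smallest to the largest value moves from an essentially unmitigated signal to a sparse, over-attenuated one.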
The range of mask parameterizations employed in this study encompassed the range of possible mitigated speech intelligibility outcomes, revealing the mask parameters that resulted in the largest restorations in speech intelligibility. Although mask parameters are often selected heuristically (Hazrati et al., 2013a,b; Hazrati and Loizou, 2013), recent works suggest that parameters can be selected to maximize improvements in reverberant speech intelligibility under conditions characterized by CI processing (Kokkinakis et al., 2011; Kokkinakis and Stohl, 2021; Shahidi et al., 2022) and that objective measures could be used to determine the mask parameterizations that best restore intelligibility (Kokkinakis and Stohl, 2021). Towards this application, the current analysis identified objective measures that accurately indicated the parameterizations of the BM and RM that resulted in the largest gains in subjective intelligibility. The CD and LLR indicated the most-intelligible parameterizations for the BM, while the SRMR, SRMR-CI, and fwSRR indicated the most-intelligible parameterizations for the RM; these measures are distinct from the STOI and PESQ measures used by Kokkinakis and Stohl (2021). Future work should investigate whether the measures indicated by this analysis could automate parameter selection for T-F masks, further facilitating the development of reverberant speech enhancement algorithms.
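A minimal sketch of how an objective measure might drive such parameter selection is given below. The helper `best_parameter` and every score in the dictionaries are hypothetical and serve only to show the selection step; the one detail worth noting is that distance-type measures such as the CD and LLR improve as they decrease, whereas ratio-type measures such as the fwSRR improve as they increase.

```python
def best_parameter(scores, higher_is_better=True):
    """Return the mask parameter whose objective score is best.

    scores maps a mask parameter value to the objective score obtained
    after masking with that value.
    """
    pick = max if higher_is_better else min
    return pick(scores, key=scores.get)

# Hypothetical objective scores; NOT values from this study.
cd_by_bm_threshold_db = {-50.0: 5.1, -20.0: 4.2, -10.0: 3.6, 0.0: 3.9, 12.0: 4.8}
fwsrr_by_rm_parameter = {0.01: 1.2, 0.5: 3.4, 1.0: 4.0, 5.0: 2.1}

# CD is a distance, so the smallest value marks the preferred threshold.
best_bm = best_parameter(cd_by_bm_threshold_db, higher_is_better=False)
# fwSRR is a ratio, so the largest value marks the preferred parameter.
best_rm = best_parameter(fwsrr_by_rm_parameter)
print(best_bm, best_rm)
```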
D. Modifications to objective measure evaluation
Using the same subjective intelligibility dataset for validation, modifications to objective measure evaluation were investigated to exploit characteristics of reverberant or CI-processed speech. When considering the choice of the reference signal in reverberant environments, several intrusive measures benefitted from using the direct-path signal as the reference. These measures included the ECM, STOI, NCM, CSII, and LLR, which demonstrated improvements in correlation with subjective data of 0.24, 0.16, 0.40, 0.38, and 0.15, respectively, with the change in correlation achieving significance for the ECM and LLR measures. Despite improvements in correlation for the NCM and CSII measures, the use of the direct-path reference did not improve the ranking of mitigated reverberant conditions for these measures (the change was not significant for either the NCM or the CSII). When the direct-path signal is unavailable, some, but not all, of the performance gains conferred by the direct-path reference could be realized by temporally aligning the clean reference and reverberant target signals. Temporal alignment of the reference signal yielded improvements in correlation for the ECM, STOI, NCM, and CSII measures of 0.20, 0.15, 0.47, and 0.20, respectively, with the ECM and NCM achieving significant improvements in correlation. Applying the spectral filterbank used in CI processing during measure computation also improved intelligibility prediction performance for the STOI, NCM, and CSII measures under the conditions tested. Using the CI spectral filterbank, the STOI, NCM, and CSII measures demonstrated improvements in correlation of 0.21, 0.50, and 0.25, respectively, with the improvements noted for the STOI and NCM measures attaining significance.
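The two reference-signal modifications discussed above can be sketched as follows. This is an illustrative sketch rather than the procedure used in this study: isolating the direct path by windowing the RIR around its largest peak, the window length `window_ms`, estimating the delay from the peak of the cross-correlation, and the synthetic RIR and noise signal are all assumptions chosen to keep the example small.

```python
import numpy as np

def direct_path_reference(clean, rir, fs, window_ms=0.5):
    """Convolve clean speech with only the direct-path portion of the RIR.

    ASSUMPTION: the direct path is the part of the RIR that lies within
    window_ms of its largest absolute peak.
    """
    peak = int(np.argmax(np.abs(rir)))
    half = int(round(window_ms * 1e-3 * fs))
    direct = np.zeros_like(rir)
    lo, hi = max(0, peak - half), min(len(rir), peak + half + 1)
    direct[lo:hi] = rir[lo:hi]
    return np.convolve(clean, direct)[: len(clean)]

def align_reference(clean, reverberant):
    """Delay the clean reference by the lag that maximizes its
    cross-correlation with the reverberant signal (equal lengths assumed)."""
    xcorr = np.correlate(reverberant, clean, mode="full")
    lag = max(int(np.argmax(xcorr)) - (len(clean) - 1), 0)
    aligned = np.zeros_like(clean)
    aligned[lag:] = clean[: len(clean) - lag]
    return aligned, lag

# Synthetic stand-ins: white noise for speech and a toy RIR with a direct
# path at 40 samples (gain 0.8) followed by a weak, decaying tail.
rng = np.random.default_rng(1)
fs = 16000
clean = rng.standard_normal(4000)
rir = np.zeros(800)
rir[40] = 0.8
rir[120:] = 0.05 * rng.standard_normal(680) * np.exp(-np.arange(680) / 200.0)

reverberant = np.convolve(clean, rir)[: len(clean)]
direct_ref = direct_path_reference(clean, rir, fs)
aligned_ref, lag = align_reference(clean, reverberant)
print(lag)
```

In this toy case the aligned clean reference recovers the delay of the direct path but not its gain, which is one way to see why alignment alone might realize only part of the benefit of a direct-path reference.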
E. Recommendations for objective measure use
Taken together, the analyses presented in this study yield recommendations for objective intelligibility measures that reflect subjective trends for reverberant mitigated speech with the same spectro-temporal resolution as that provided by a CI. Overall, the ECM performed the best when using the direct-path signal as the reference (with outcomes of 0.96, 0.92, and 0.97 across the three evaluation metrics, respectively). The next best-performing measure was the STOI using the CI spectral filterbank (0.95, 0.91, and 0.96, respectively), followed by the fwSRR using the direct-path reference (0.95, 0.90, and 0.95, respectively) and the CD using the direct-path reference (0.94, 0.93, and 0.97, respectively). If knowledge of the reverberant environment is unavailable, and thus the direct-path RIR cannot be constructed, the STOI could be implemented with the CI spectral filterbank (0.95, 0.91, and 0.96, respectively) or the ECM could be employed with time-alignment of the clean reference signal (0.92, 0.94, and 0.95, respectively). If no modifications to measure evaluation are feasible, then the fwSRR or CD provide sufficient intelligibility prediction performance over all reverberant T-F masked and CI-processed conditions (fwSRR: 0.92, 0.94, and 0.92, respectively; CD: 0.91, 0.95, and 0.93, respectively), warranting their use during the development of speech enhancement algorithms for reverberant CI-processed speech. While this study identifies objective intelligibility measures that reflect the restoration of phonemic information with gradual enhancement of the reverberant speech signal, further investigation of objective measures is needed using subjective scores obtained with CI listeners in a variety of mitigated reverberant listening scenarios.
V. CONCLUSIONS
To facilitate the development of reverberant speech-enhancement algorithms for CIs, the present study evaluated the ability of twelve objective intelligibility measures to predict trends in a subjective intelligibility dataset. The subjective dataset captured speech intelligibility scores acquired from 20 NH individuals listening to reverberant speech after CI processing and vocoding, limiting generalization of the findings to speech intelligibility outcomes for CI recipients. To simulate the under- or over-attenuation of the reverberant speech signal that may result from a speech enhancement algorithm during development, a gradual reduction in reverberant distortions was achieved across conditions by varying a parameter of the binary or ratio T-F masking algorithm. The fwSRR and CD provided good predictions of intelligibility trends over all masked conditions and when evaluated only on binary-masked speech. When evaluated only on ratio-masked speech, the SRMR and SRMR-CI measures predicted intelligibility trends with high correlation to subjective outcomes. Other measures, including the NCM, CSII, and ModA, were poor predictors of intelligibility trends, particularly for over- and under-mitigated reverberant speech signals.
To improve intrusive measure performance in reverberant conditions, the direct path of the reverberant signal was used as the reference signal in place of the typically used clean signal, capturing the temporal delay and amplitude attenuation imposed by the reverberant enclosure. Using the direct-path reference, significant improvements in measure performance were observed, particularly for the ECM and LLR measures. To adapt measure computation to CI-processed speech, the spectral filterbank used during measure computation was changed to match the spectral resolution of the CI stimulation pattern, resulting in significant improvements in measure performance for the STOI and NCM measures.
Several objective measures achieved high correlations with the subjective data and are recommended for use during the development of reverberant speech enhancement algorithms for CIs. The recommended measures, beginning with the best-performing measure, are the ECM using the direct-path reference signal, the STOI using the CI spectral filterbank, the fwSRR using the direct-path reference signal, and the CD using the direct-path reference signal. If it is not feasible to modify the method used to evaluate the objective measure, the CD and fwSRR measures provide good predictions of intelligibility trends for binary-masked speech and the SRMR-CI provides good predictions of intelligibility trends for ratio-masked speech. Additional evaluations of objective intelligibility measures are needed to validate the results presented in this analysis with subjective intelligibility data acquired with CI listeners and in a variety of reverberant listening scenarios.
ACKNOWLEDGMENTS
This work was funded by the Katherine Goodman Stern Fellowship, provided by Duke University, and by the National Institutes of Health, via a grant administered by the National Institute on Deafness and Other Communication Disorders (#R01-DC014290-05).
AUTHOR DECLARATIONS
Conflict of Interest
The authors have no conflicts of interest to disclose.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
- 1.ANSI (1997). S3.5–1997, American National Standards Methods for the Calculation of the Speech Intelligibility Index ( American National Standards Institute, New York: ). [Google Scholar]
- 2. Başkent, D. (2012). “ Effect of speech degradation on top-down repair: Phonemic restoration with simulations of cochlear implants and combined electric-acoustic stimulation,” J. Assoc. Res. Otolaryngol. 13, 683–692. 10.1007/s10162-012-0334-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Bhargava, P. , Gaudrain, E. , and Başkent, D. (2014). “ Top-down restoration of speech in cochlear-implant users,” Hear. Res. 309, 113–123. 10.1016/j.heares.2013.12.003 [DOI] [PubMed] [Google Scholar]
- 4. Bolner, F. , Goehring, T. , Monaghan, J. , van Dijk, B. , Wouters, J. , and Bleeck, S. (2016). “ Speech enhancement based on neural networks applied to cochlear implant coding strategies,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, March 20–25, Shanghai, China, pp. 6520–6524. 10.1109/ICASSP.2016.7472933 [DOI] [Google Scholar]
- 5. Bosen, A. K. , and Chatterjee, M. (2016). “ Band importance functions of listeners with cochlear implants using clinical maps,” J. Acoust. Soc. Am. 140, 3718–3727. 10.1121/1.4967298 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Carlyon, R. P. , and Goehring, T. (2021). “ Cochlear Implant Research and Development in the Twenty-first Century: A Critical Update,” J. Assoc. Res. Otolaryngol. 22, 481–508. 10.1007/s10162-021-00811-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Chen, F. , Hazrati, O. , and Loizou, P. C. (2013). “ Predicting the intelligibility of reverberant speech for cochlear implant listeners with a non-intrusive intelligibility measure,” Biomed. Signal Process Control 8, 311–314. 10.1016/j.bspc.2012.11.007 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Chu, K. , Collins, L. , and Mainsah, B. (2022). “ Suppressing reverberation in cochlear implant stimulus patterns using time-frequency masks based on phoneme groups,” Proc. Mtgs. Acoust. 50(1), 050002. 10.1121/2.0001698 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Cosentino, S. , Marquardt, T. , Mcalpine, D. , and Falk, T. H. (2012). “ Towards objective measures of speech intelligibility for cochlear implant users in reverberant environments,” in Proceedings of the 11th ISSPA, July 2–5, Montreal, Canada, pp. 666–671. [Google Scholar]
- 10. Crowson, M. G. , Lin, V. , Chen, J. M. , and Chan, T. C. Y. (2020). “ Machine learning and cochlear implantation—A structured review of opportunities and challenges,” Otol. Neurotol. 41, E36–E45. 10.1097/MAO.0000000000002440 [DOI] [PubMed] [Google Scholar]
- 11. Cullington, H. E. , and Zeng, F.-G. (2008). “ Speech recognition with varying numbers and types of competing talkers by normal-hearing, cochlear-implant, and implant simulation subjects,” J. Acoust. Soc. Am. 123, 450–461. 10.1121/1.2805617 [DOI] [PubMed] [Google Scholar]
- 12. Dorman, M. F. , and Gifford, R. H. (2017). “ Speech understanding in complex listening environments by listeners fit with cochlear implants,” J. Speech. Lang. Hear. Res. 60, 3019–3026. 10.1044/2017_JSLHR-H-17-0035 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Falk, T. H. , Parsa, V. , Santos, J. F. , Arehart, K. , Hazrati, O. , Huber, R. , Kates, J. M. , and Scollie, S. (2015). “ Objective quality and intelligibility prediction for users of assistive listening devices: Advantages and limitations of existing tools,” IEEE Signal Process. Mag. 32, 114–124. 10.1109/MSP.2014.2358871 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Falk, T. H. , Zheng, C. , and Chan, W. Y. (2010). “ A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech,” IEEE Trans. Audio. Speech. Lang. Process. 18, 1766–1774. 10.1109/TASL.2010.2052247 [DOI] [Google Scholar]
- 15. Feng, Y. , and Chen, F. (2022). “ Nonintrusive objective measurement of speech intelligibility: A review of methodology,” Biomed. Signal Process. Control 71, 103204. 10.1016/j.bspc.2021.103204 [DOI] [Google Scholar]
- 16. Firszt, J. B. , Holden, L. K. , Skinner, M. W. , Tobey, E. A. , Peterson, A. , Gaggl, W. , Runge-Samuelson, C. L. , and Wackym, P. A. (2004). “ Recognition of speech presented at soft to loud levels by adult cochlear implant recipients of three cochlear implant systems,” Ear Hear. 25, 375–387. 10.1097/01.AUD.0000134552.22205.EE [DOI] [PubMed] [Google Scholar]
- 17. Gifford, R. H. , and Revit, L. J. (2010). “ Speech perception for adult cochlear implant recipients in a realistic background noise: Effectiveness of preprocessing strategies and external options for improving speech recognition in noise,” J. Am. Acad. Audiol. 21, 441–488. 10.3766/jaaa.21.7.3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Goehring, T. , Bolner, F. , Monaghan, J. J. M. , van Dijk, B. , Zarowski, A. , and Bleeck, S. (2017). “ Speech enhancement based on neural networks improves speech intelligibility in noise for cochlear implant users,” Hear. Res. 344, 183–194. 10.1016/j.heares.2016.11.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Goehring, T. , Keshavarzi, M. , Carlyon, R. P. , and Moore, B. C. J. (2019). “ Using recurrent neural networks to improve the perception of speech in non-stationary noise by people with cochlear implants,” J. Acoust. Soc. Am. 146, 705–718. 10.1121/1.5119226 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Hazrati, O. , Lee, J. , and Loizou, P. C. (2013a). “ Blind binary masking for reverberation suppression in cochlear implants,” J. Acoust. Soc. Am. 133, 1607–1614. 10.1121/1.4789891 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Hazrati, O. , and Loizou, P. C. (2012a). “ Tackling the combined effects of reverberation and masking noise using ideal channel selection,” J. Speech. Lang. Hear. Res. 55, 500–510. 10.1044/1092-4388(2011/11-0073) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Hazrati, O. , and Loizou, P. C. (2012b). “ The combined effects of reverberation and noise on speech intelligibility by cochlear implant listeners,” Int. J. Audiol. 51, 437–443. 10.3109/14992027.2012.658972 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Hazrati, O. , and Loizou, P. C. (2013). “ Reverberation suppression in cochlear implants using a blind channel-selection strategy,” J. Acoust. Soc. Am. 133, 4188–4196. 10.1121/1.4804313 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Hazrati, O. , Omid Sadjadi, S. , Loizou, P. C. , and Hansen, J. H. L. (2013b). “ Simultaneous suppression of noise and reverberation in cochlear implants using a ratio masking strategy,” J. Acoust. Soc. Am. 134, 3759–3765. 10.1121/1.4823839 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Henry, F. , Glavin, M. , and Jones, E. (2023). “ Noise reduction in cochlear implant signal processing: A review and recent developments,” IEEE Rev. Biomed. Eng. 16, 319–331. 10.1109/RBME.2021.3095428 [DOI] [PubMed] [Google Scholar]
- 26. Holden, L. K. , Skinner, M. W. , Holden, T. A. , and Demorest, M. E. (2002). “ Effects of stimulation rate with the Nucleus 24 ACE speech-coding strategy,” Ear Hear. 23, 463–476. 10.1097/00003446-200210000-00008 [DOI] [PubMed] [Google Scholar]
- 27. Hollube, I. , and Kollmeier, K. (1996). “ Speech intelligibility prediction in hearing-impaired listeners based on a psychoacoustically motivated perception model,” J. Acoust. Soc. Am. 100, 1703–1715. 10.1121/1.417354 [DOI] [PubMed] [Google Scholar]
- 28. Hu, Y. , and Loizou, P. C. (2006). “ Evaluation of objective measures for speech enhancement,” in Proceedings of the Annual Conference on International Speech Communication Association, September 17–21, Pittsburgh, PA, pp. 1447–1450. [Google Scholar]
- 29.ITU (2001). ITU-T Rec. P.862, Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs ( International Telecommunications Union, Geneva, Switzerland: ). [Google Scholar]
- 30. Jeub, M. , Schafer, M. , and Vary, P. (2009). “ A binaural room impulse response database for the evaluation of dereverberation algorithms,” in Proceedings of the 16th International Conference on Digital Signal Process, July 5–7, Santorini, Greece. [Google Scholar]
- 31. Karoui, C. , James, C. , Barone, P. , Bakhos, D. , Marx, M. , and Macherey, O. (2019). “ Searching for the sound of a cochlear implant: Evaluation of different vocoder parameters by cochlear implant users with single-sided deafness,” Trends Hear. 23, 233121651986602. 10.1177/2331216519866029 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Kates, I. M. (2005). “ Coherence and the speech intelligibility index,” J. Acoust. Soc. Am. 117, 2224–2237. 10.1121/1.1862575 [DOI] [PubMed] [Google Scholar]
- 33. Kerber, S. , and Seeber, B. U. (2013). “ Localization in reverberation with cochlear implants: Predicting performance from basic psychophysical measures,” J. Assoc. Res. Otolaryngol. 14, 379–392. 10.1007/s10162-013-0378-z [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Kim, G. , Lu, Y. , Hu, Y. , and Loizou, P. C. (2009). “ An algorithm that improves speech intelligibility in noise for normal-hearing listeners,” J. Acoust. Soc. Am. 126, 1486–1494. 10.1121/1.3184603 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Kinoshita, K. , Delcroix, M. , Gannot, S. , Emanuël, E. A. , Haeb-Umbach, R. , Kellermann, W. , Leutnant, V. , Maas, R. , Nakatani, T. , Raj, B. , Sehr, A. , and Yoshioka, T. (2016). “ A summary of the REVERB challenge: State-of-the-art and remaining challenges in reverberant speech processing research,” EURASIP J. Adv. Signal Process. 2016, 1–19. 10.1186/s13634-016-0306-6 [DOI] [Google Scholar]
- 36. Kitawaki, N. , Nagabuchi, H. , and Itoh, K. (1988). “ Objective quality evaluation for low-bit-rate speech coding systems,” IEEE J. Select. Areas Commun. 6, 242–248. 10.1109/49.601 [DOI] [Google Scholar]
- 37. Kokkinakis, K. , Hazrati, O. , and Loizou, P. C. (2011). “ A channel-selection criterion for suppressing reverberation in cochlear implants,” J. Acoust. Soc. Am. 129, 3221–3232. 10.1121/1.3559683 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Kokkinakis, K. , and Loizou, P. C. (2011). “ Evaluation of objective measures for quality assessment of reverberant speech,” in Proceedings of the ICASSP 2011, May 22–27, Prague, Czech Republic, pp. 2420–2423. [Google Scholar]
- 39. Kokkinakis, K. , and Stohl, J. S. (2021). “ Optimized gain functions in ideal time-frequency masks and their application to dereverberation for cochlear implants,” JASA Express Lett. 1, 084401. 10.1121/10.0005740 [DOI] [PubMed] [Google Scholar]
- 40. Kressner, A. A., May, T., and Rozell, C. J. (2016a). “Outcome measures based on classification performance fail to predict the intelligibility of binary-masked speech,” J. Acoust. Soc. Am. 139, 3033–3036. doi:10.1121/1.4952439
- 41. Kressner, A. A., and Rozell, C. J. (2015). “Structure in time-frequency binary masking errors and its impact on speech intelligibility,” J. Acoust. Soc. Am. 137, 2025–2035. doi:10.1121/1.4916271
- 42. Kressner, A. A., Westermann, A., and Buchholz, J. M. (2018). “The impact of reverberation on speech intelligibility in cochlear implant recipients,” J. Acoust. Soc. Am. 144, 1113–1122. doi:10.1121/1.5051640
- 43. Kressner, A. A., Westermann, A., Buchholz, J. M., and Rozell, C. J. (2016b). “Cochlear implant speech intelligibility outcomes with structured and unstructured binary mask errors,” J. Acoust. Soc. Am. 139, 800–810. doi:10.1121/1.4941567
- 44. Laneau, J., Moonen, M., and Wouters, J. (2006). “Factors affecting the use of noise-band vocoders as acoustic models for pitch perception in cochlear implants,” J. Acoust. Soc. Am. 119, 491–506. doi:10.1121/1.2133391
- 45. Lim, J. S., and Oppenheim, A. V. (1979). “Enhancement and bandwidth compression of noisy speech,” Proc. IEEE 67(12), 1586–1604. doi:10.1109/PROC.1979.11540
- 46. Ma, J., Hu, Y., and Loizou, P. C. (2009). “Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions,” J. Acoust. Soc. Am. 125, 3387–3405. doi:10.1121/1.3097493
- 47. Möller, S., Chan, W., Côté, N., Falk, T. H., Raake, A., and Wältermann, M. (2011). “Speech quality estimation: Models and trends,” IEEE Signal Process. Mag. 28, 18–28. doi:10.1109/MSP.2011.942469
- 48. Naylor, P. A., and Gaubitch, N. D. (2010). Speech Dereverberation (Springer-Verlag, Berlin).
- 49. Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). “Development of the hearing in noise test for the measurement of speech reception thresholds in quiet and in noise,” J. Acoust. Soc. Am. 95, 1085–1099. doi:10.1121/1.408469
- 50. Pearson, K. (1894). “Contributions to the mathematical theory of evolution,” Philos. Trans. R. Soc. London 185, 71–110. doi:10.1098/rspl.1893.0079
- 51. Plomp, R. (1986). “A signal-to-noise ratio model for the speech-reception threshold of the hearing impaired,” J. Speech. Lang. Hear. Res. 29, 146–154. doi:10.1044/jshr.2902.146
- 52. Quackenbush, S., Barnwell, T., and Clements, M. (1988). Objective Measures of Speech Quality (Prentice-Hall, Englewood Cliffs, NJ).
- 53. Reinhart, P. N., Souza, P. E., Srinivasan, N. K., and Gallun, F. J. (2016). “Effects of reverberation and compression on consonant identification in individuals with hearing impairment,” Ear Hear. 37, 144–152. doi:10.1097/AUD.0000000000000229
- 54. Reinhart, P. N., Zahorik, P., and Souza, P. E. (2019). “Effects of reverberation on the relationship between compression speed and working memory for speech-in-noise perception,” Ear Hear. 40, 1098–1105. doi:10.1097/AUD.0000000000000696
- 55. Roman, N., and Woodruff, J. (2011). “Intelligibility of reverberant noisy speech with ideal binary masking,” J. Acoust. Soc. Am. 130, 2153–2161. doi:10.1121/1.3631668
- 56. Roman, N., and Woodruff, J. (2013). “Speech intelligibility in reverberation with ideal binary masking: Effects of early reflections and signal-to-noise ratio threshold,” J. Acoust. Soc. Am. 133, 1707–1717. doi:10.1121/1.4789895
- 57. Santos, J. F., Cosentino, S., Hazrati, O., Loizou, P. C., and Falk, T. H. (2013). “Objective speech intelligibility measurement for cochlear implant users in complex listening environments,” Speech Commun. 55, 815–824. doi:10.1016/j.specom.2013.04.001
- 58. Santos, J. F., Cosentino, S., Hazrati, O., Loizou, P. C., and Falk, T. H. (2012). “Performance comparison of intrusive objective speech intelligibility and quality metrics for cochlear implant users,” in Thirteenth Annual Conference of the International Speech Communication Association (ISCA, Riverside, CA).
- 59. Santos, J. F., and Falk, T. H. (2014). “Updating the SRMR-CI metric for improved intelligibility prediction for cochlear implant users,” IEEE/ACM Trans. Audio. Speech. Lang. Process. 22, 2197–2206. doi:10.1109/TASLP.2014.2363788
- 60. Shahidi, L. K., Collins, L. M., and Mainsah, B. O. (2022). “Parameter tuning of time-frequency masking algorithms for reverberant artifact removal within the cochlear implant stimulus,” Cochlear Implants Int. 23, 309–316. doi:10.1080/14670100.2022.2096182
- 61. Skinner, M. W., Holden, L. K., Whitford, L. A., Plant, K. L., Psarros, C., and Holden, T. A. (2002). “Speech recognition with the Nucleus 24 SPEAK, ACE, and CIS speech coding strategies in newly implanted adults,” Ear Hear. 23, 207–223. doi:10.1097/00003446-200206000-00005
- 62. Svirsky, M. A., Capach, N. H., Neukam, J. D., Azadpour, M., Sagi, E., Hight, A. E., Glassman, E. K., Lavender, A., Seward, K. P., Miller, M., Ding, N., Tan, C.-T., and Fitzgerald, M. B. (2021). “Valid acoustic models of cochlear implants: One size does not fit all,” Otol. Neurotol. 42, S2–S10. doi:10.1097/MAO.0000000000003373
- 63. Swanson, B., and Mauch, H. (2006). “E10511DD: Nucleus MATLAB toolbox 4.20 software user manual,” Cochlear Ltd, Lane Cove, New South Wales, Australia.
- 64. Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. (2011). “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Trans. Audio. Speech. Lang. Process. 19, 2125–2136. doi:10.1109/TASL.2011.2114881
- 65. Vandali, A. E., Whitford, L. A., Plant, K. L., and Clark, G. M. (2000). “Speech perception as a function of electrical stimulation rate: Using the nucleus 24 cochlear implant system,” Ear Hear. 21, 608–624. doi:10.1097/00003446-200012000-00008
- 66. Whitmal, N. A., Poissant, S. F., Freyman, R. L., and Helfer, K. S. (2007). “Speech intelligibility in cochlear implant simulations: Effects of carrier type, interfering noise, and subject experience,” J. Acoust. Soc. Am. 122, 2376–2388. doi:10.1121/1.2773993
- 67. Yousefian, N., and Loizou, P. C. (2012). “Predicting the speech reception threshold of cochlear implant listeners using an envelope-correlation based measure,” J. Acoust. Soc. Am. 132, 3399–3405. doi:10.1121/1.4754539
- 68. Zhao, Y., Wang, D., Merks, I., and Zhang, T. (2016). “DNN-based enhancement of noisy and reverberant speech,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, March 20–25, Shanghai, China.
- 69. Zhao, Y., Wang, Z.-Q., and Wang, D. (2017). “A two-stage algorithm for noisy and reverberant speech enhancement,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, March 5–9, New Orleans, LA, pp. 5580–5584.
- 70. Zhao, Y., Xu, B., Giri, R., and Zhang, T. (2018). “Perceptually guided speech enhancement using deep neural networks,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, April 15–20, Calgary, Canada.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.


