Published in final edited form as: Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2020, pp. 6929–6933. doi: 10.1109/ICASSP40776.2020.9054450

USING AUTOMATIC SPEECH RECOGNITION AND SPEECH SYNTHESIS TO IMPROVE THE INTELLIGIBILITY OF COCHLEAR IMPLANT USERS IN REVERBERANT LISTENING ENVIRONMENTS

Kevin Chu 1, Leslie Collins 1, Boyla Mainsah 1

Abstract

Cochlear implant (CI) users experience substantial difficulties in understanding reverberant speech. A previous study proposed a strategy that leverages automatic speech recognition (ASR) to recognize reverberant speech and speech synthesis to translate the recognized text into anechoic speech. However, the strategy was trained and tested on the same reverberant environment, so it is unknown whether the strategy is robust to unseen environments. Thus, the current study investigated the performance of the previously proposed algorithm in multiple unseen environments. First, an ASR system was trained on anechoic and reverberant speech using different room types. Next, a speech synthesizer was trained to generate speech from the text predicted by the ASR system. Experiments were conducted in normal hearing listeners using vocoded speech, and the results showed that the strategy improved speech intelligibility in previously unseen conditions. These results suggest that the ASR-synthesis strategy can potentially benefit CI users in everyday reverberant environments.

Keywords: Automatic speech recognition, Cochlear implants, Reverberation, Speech synthesis

1. INTRODUCTION

A cochlear implant (CI) is a prosthetic device that aims to restore speech perception to individuals with sensorineural hearing loss [1]. While CI users generally have excellent sentence recognition in quiet and anechoic environments [1], they experience much more difficulty in reverberation and noise [2]. Several approaches have been proposed to improve the reverberant speech intelligibility of CI users. Reverberation is modeled by convolving an anechoic signal with a room impulse response (RIR) [3]. Thus, the conventional method to enhance reverberant speech is via inverse filtering, which aims to recover the anechoic signal by convolving the reverberant signal with the inverse RIR [4]. However, inverse filtering is not feasible in CIs because it requires blind RIR estimation, which is a difficult task as RIRs constantly change within the same room due to changing source and receiver positions. Another mitigation strategy involves time-frequency (T-F) masking, where a gain is applied to each T-F unit of speech to simultaneously retain speech dominant units and suppress reverberant artifacts [5]. While T-F masks have improved the intelligibility for CI users, even ideal masks can introduce musical noise or retain reverberant artifacts, setting an upper bound on speech intelligibility [6].
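As a concrete illustration of T-F masking, the sketch below applies an oracle, ratio-style gain computed from the anechoic reference to the STFT of the reverberant signal. It is a minimal, hypothetical example of the general idea (an "ideal mask" upper bound), not the specific masks evaluated in [5, 6].

```python
# Minimal sketch of oracle T-F masking for dereverberation. The gain is
# computed from the anechoic reference, so this illustrates the ideal-mask
# upper bound rather than a deployable algorithm.
import numpy as np
from scipy.signal import stft, istft

def oracle_mask_enhance(anechoic, reverberant, fs, n_fft=512):
    n = min(len(anechoic), len(reverberant))            # align lengths
    _, _, S_dry = stft(anechoic[:n], fs, nperseg=n_fft)
    _, _, S_rev = stft(reverberant[:n], fs, nperseg=n_fft)
    # Gain per T-F unit: near 1 where direct speech dominates,
    # near 0 where reverberant energy dominates
    gain = np.clip(np.abs(S_dry) / (np.abs(S_rev) + 1e-8), 0.0, 1.0)
    # Apply the gain to the reverberant STFT and resynthesize
    _, enhanced = istft(gain * S_rev, fs, nperseg=n_fft)
    return enhanced
```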

Automatic speech recognition (ASR) enables spoken communication between humans and machines. ASR systems recognize word sequences by leveraging statistical models of both acoustics and language. ASR research is extensive, with applications ranging from voice-controlled personal assistants to in-vehicle infotainment systems [7]. In recent years, ASR systems have become increasingly robust to reverberation [8]. In IARPA’s 2015 ASpIRE challenge [9], one of the top submissions achieved a word error rate (WER) of 26.5% on the reverberant development set [8]. Despite the high performance and ubiquity of ASR systems, ASR research has not transitioned to CIs because ASR systems are non-causal. To determine the potential benefit of ASR for CI users, the first step is to test whether non-causal ASR can enhance speech within a CI framework.

Hazrati et al. [10] proposed a strategy that uses ASR to recognize reverberant speech and speech synthesis to generate anechoic speech based on text estimated by the ASR system. Hazrati et al. [10] tested the strategy in CI users, and observed an improvement in speech intelligibility using a recorded RIR. However, the ASR system was trained and tested on the same RIR, so it is unclear how the strategy would generalize to the diverse acoustic environments that CI users could encounter.

The present work expands on the work in Hazrati et al. [10] by investigating the performance of the ASR-synthesis strategy in multiple reverberant environments. Two major modifications were made to the experimental setup used in Hazrati et al. [10]. First, an ASR system was trained on anechoic and reverberant speech using three room types. The ASR system was then tested on a separate set of rooms. Second, the ASR system was trained and tested using binaural RIRs, which capture differences in acoustic waveforms that arrive at each ear. To evaluate the effectiveness of the ASR-synthesis strategy in the proposed experimental conditions, listening tests were performed in normal hearing listeners using vocoded speech.

2. METHODS

2.1. Reverberation Model

Reverberation was modeled by convolving clean, anechoic speech with RIRs [3] from the Aachen Impulse Response database [11]. The RIRs were recorded in a low-reverberant studio booth, a meeting room, an office, a lecture hall, a stairway, and Aula Carolina (a former church with strong reverberant effects). The use of recorded rather than simulated RIRs allowed the present study to evaluate the ASR-synthesis strategy in realistic listening environments. Recordings were made at various source-receiver distances and azimuthal angles, resulting in multiple RIRs per environment. To mimic the head shadowing effect, recordings were made from the left and right channels of a dummy head [11]. Table 1 summarizes details about the room dimensions, source-receiver distances (obtained from [11] and [12]), and direct-to-reverberant ratios (obtained directly from the RIR as described in [13]).

Table 1: Description of training, adaptation, and testing data used in this study.

The table specifies the sentence database, as well as the acoustic environment used for different subsets of the data. For each acoustic environment, information is provided about the room dimensions, source-receiver distance, and direct-to-reverberant ratio (DRR), where applicable.

Dataset    | Sentence Database                        | Acoustic Environment | Room Dimensions (L × W × H) | Source-Receiver Distance | DRR (dB) Left | DRR (dB) Right
-----------|------------------------------------------|----------------------|-----------------------------|--------------------------|---------------|---------------
Training   | LibriSpeech train-clean-100 (100 hours)  | Anechoic             | -                           | -                        | -             | -
           |                                          | Meeting room         | 8.00 m × 5.00 m × 3.10 m    | -                        | -             | -
           |                                          | Lecture hall         | 10.80 m × 10.90 m × 3.15 m  | -                        | -             | -
           |                                          | Stairway             | 7.00 m × 5.20 m             | -                        | -             | -
Adaptation | HINT (10 lists)                          | Anechoic             | -                           | -                        | -             | -
Testing    | HINT (14 lists)                          | Anechoic             | -                           | -                        | -             | -
           |                                          | Office room          | 5.00 m × 6.40 m × 2.90 m    | 3 m                      | −0.1          | −0.2
           |                                          | Aula Carolina        | 19.00 m × 30.00 m           | 5 m                      | 2.5           | −0.7
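For reference, a minimal sketch of this reverberation model follows, assuming the binaural RIR has been loaded as a two-channel (left/right) array; the file names are placeholders rather than actual AIR database paths.

```python
# Minimal sketch of the reverberation model in Sec. 2.1: reverberant speech
# is the convolution of anechoic speech with a recorded binaural RIR.
# File names are placeholders, not the actual database paths.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

anechoic, fs = sf.read("hint_sentence.wav")     # mono anechoic sentence
rir, fs_rir = sf.read("air_office_3m.wav")      # 2-channel (L/R) binaural RIR
assert fs == fs_rir, "resample if the sampling rates differ"

# Convolve the same anechoic signal with the left- and right-ear RIRs
left = fftconvolve(anechoic, rir[:, 0])
right = fftconvolve(anechoic, rir[:, 1])
reverberant = np.stack([left, right], axis=1)

# Normalize to avoid clipping before writing the stimulus
reverberant /= np.max(np.abs(reverberant)) + 1e-9
sf.write("hint_sentence_office.wav", reverberant, fs)
```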

2.2. Automatic Speech Recognition Model

The Kaldi speech recognition toolkit [14] was used to decode speech using an acoustic model, language model, and pronunciation dictionary. The acoustic model learns the relationship between acoustic features and phones, the language model determines the probability of word sequences, and the pronunciation dictionary defines the sequence of phones that comprise each word [7]. Using Kaldi’s LibriSpeech s5 recipe [15], a GMM-HMM (Gaussian mixture model-hidden Markov model) acoustic model was trained using 13-dimensional mel-frequency cepstral coefficients (MFCCs) [16], as well as their deltas and delta-deltas, computed over 25 ms frames with a 10 ms shift [14]. The features were normalized by applying cepstral mean and variance normalization (CMVN) [17] on a per-speaker basis to reduce channel distortion. The normalized features were spliced across ±3 neighboring frames to incorporate contextual information [18]. The spliced features were reduced to 40 dimensions using linear discriminant analysis (LDA), where the classes were context-dependent HMM states generated using forced alignment [19]. Forced alignment automatically assigns HMM states to frames given the acoustic features and known transcription [20]. The dimensionality-reduced features were then transformed using a maximum likelihood linear transformation (MLLT), which allows the features to be modeled more accurately by diagonal-covariance Gaussians [21]. Finally, the features were adapted to each speaker by applying feature-space maximum likelihood linear regression (fMLLR) transforms [22].
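The feature pipeline above is implemented inside Kaldi's recipe; the sketch below only approximates the same sequence of transforms in Python (librosa/scikit-learn) to make the steps concrete. The `state_labels` array is a hypothetical per-frame alignment, and CMVN is applied per utterance here for brevity.

```python
# Rough Python approximation of the front-end described above (13 MFCCs +
# deltas/delta-deltas, CMVN, +/-3 frame splicing, LDA to 40 dimensions).
# The actual system used Kaldi's LibriSpeech s5 recipe; this only
# illustrates the sequence of transforms.
import numpy as np
import librosa
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def front_end(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms frames
                                hop_length=int(0.010 * sr))  # 10 ms shift
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])  # 39 x T
    return feats.T                                              # T x 39

def cmvn(feats):
    # Cepstral mean and variance normalization (per utterance here;
    # per speaker in the actual system)
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def splice(feats, context=3):
    # Stack +/-3 neighboring frames (edges repeated): 7 * 39 = 273 dims
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)]
                      for i in range(2 * context + 1)])

# LDA to 40 dims, with context-dependent HMM states from forced alignment
# as class labels; `state_labels` is a hypothetical per-frame label vector.
# lda = LinearDiscriminantAnalysis(n_components=40).fit(spliced, state_labels)
# feats_40 = lda.transform(spliced)
```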

The acoustic model was trained on sentences from the LibriSpeech corpus [15], which contains labeled English speech designed for training and testing ASR systems. The corpus contains a training set for training acoustic models, a development set for tuning language model parameters, and a testing set for assessing ASR performance. The 100-hour training set was randomly divided into two subsets with 75% of the data in one subset and 25% in the other. Reverberation was applied to the sentences in the larger subset using RIRs from the lecture hall, meeting room, and stairway. Table 1 lists the sentence database and acoustic environment used for different subsets of the data.

The language model used in this study was an off-the-shelf trigram model based on Project Gutenberg texts [23]. The language model weight and word insertion penalty (WIP) were tuned by performing a grid search over language model weights of 7–17 and WIPs of 0, 0.5, and 1. The language model weight and WIP were selected to minimize the WER of the trained ASR system on the LibriSpeech development set (dev-clean) [15], resulting in a language model weight of 13 and a WIP of 1. Pronunciations were defined using the CMU pronouncing dictionary [24].
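A sketch of this grid search is shown below. The `score_dev_set` function is a hypothetical helper standing in for rescoring the dev-clean decoding output at a given language model weight and WIP and returning the WER.

```python
# Sketch of the decoding-parameter grid search described above.
import itertools

def score_dev_set(lm_weight, wip):
    """Hypothetical helper: rescore the dev-clean decoding lattices with the
    given LM weight and word insertion penalty and return the WER (%)."""
    raise NotImplementedError

lm_weights = range(7, 18)              # LM weights 7-17 inclusive
wips = [0.0, 0.5, 1.0]                 # word insertion penalties

best = None
for lmwt, wip in itertools.product(lm_weights, wips):
    wer = score_dev_set(lmwt, wip)
    if best is None or wer < best[0]:
        best = (wer, lmwt, wip)
# The grid search in the paper selected a LM weight of 13 and a WIP of 1.
```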

The acoustic model was adapted to the speaker in the Hearing in Noise Test (HINT) database, which contains 24 phonetically balanced lists of everyday sentences spoken by a male speaker [25]. To reduce the potential for overfitting, the acoustic model was adapted and tested on separate subsets of the HINT database. The sentences were randomly divided into two subsets with 10 lists for the adaptation data and the remaining 14 lists for the testing data. The fMLLR transforms were estimated from anechoic speech in the adaptation set. During testing, the fMLLR transforms were applied on top of LDA+MLLT transformed features [26].

2.3. Speech Synthesis Model

The Merlin speech synthesizer [27] was used to generate speech waveforms using a duration model and an acoustic model. The duration model predicts the durations of phonetic units, while the acoustic model learns the relationship between linguistic features and spectral parameters. The input features of the duration model consist of 416-dimensional linguistic features, including quinphone identity, part of speech, and positional information within a syllable, word, and phrase. The output features are the state durations of a 5-state monophone HMM. The true state durations were obtained using an HMM-based forced aligner [27].

The input features of the acoustic model consist of 416-dimensional linguistic features and 9-dimensional subphone features. The linguistic features are the same as those used for the duration model. The subphone features are the forward and backward position of a frame in an HMM state, the forward and backward position of a frame in a phone, the forward and backward position of the HMM state in a phone, the state duration, the phone duration, and the fraction of the phone occupied by the current state. During synthesis, the subphone features were calculated from the durations predicted by the duration model. The output features are acoustic parameters consisting of 60-dimensional generalized mel-cepstral coefficients, the band aperiodicity, and the logarithm of the fundamental frequency, as well as their deltas and delta-deltas, computed over 5 ms frames. An additional binary feature was included to specify whether the phone is voiced [27]. The true acoustic features were calculated directly from the training data using the WORLD vocoder [28]. The input features were normalized to the range [0.01, 0.99], and the output features were normalized to zero mean and unit variance.
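A minimal sketch of these two normalization steps, assuming feature matrices with one row per frame and statistics computed on the training set:

```python
# Min-max scaling of the linguistic/subphone inputs to [0.01, 0.99] and
# z-scoring of the output features, as described above.
import numpy as np

def minmax_normalize(x, lo=0.01, hi=0.99, eps=1e-8):
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return lo + (hi - lo) * (x - x_min) / (x_max - x_min + eps)

def zscore_normalize(y, eps=1e-8):
    return (y - y.mean(axis=0)) / (y.std(axis=0) + eps)
```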

Both the duration and acoustic models were implemented as deep feedforward neural networks containing 6 hidden layers with 1024 hyperbolic tangent units per layer. The learning rate was initially set to 0.002 and decayed exponentially on subsequent epochs. The networks were trained to minimize the mean squared error between the predicted and true output features [27]. Both models were trained using the BDL voice (American male) from the CMU Arctic database [29]. Speech was synthesized by passing the acoustic features into the WORLD vocoder, which generates acoustic waveforms by convolving the excitation signal with the minimum-phase response [28]. For the listening tests, speech was synthesized from the word sequences estimated by the ASR model.
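The sketch below shows one way such a network could be set up in PyTorch; Merlin's own implementation differs in detail. The 425-dimensional input (416 linguistic + 9 subphone features), the 187-dimensional output, and the decay factor are assumptions made for illustration.

```python
# Minimal PyTorch sketch of a Merlin-style feedforward network: 6 hidden
# layers of 1024 tanh units, MSE loss, and an exponentially decaying
# learning rate starting at 0.002.
import torch
import torch.nn as nn

def make_dnn(in_dim=425, out_dim=187, hidden=1024, n_layers=6):
    # out_dim assumes 60*3 MGC + 3 BAP + 3 log-F0 + 1 voicing flag
    layers, d = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(d, hidden), nn.Tanh()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))   # linear output for regression
    return nn.Sequential(*layers)

model = make_dnn()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.002)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)  # assumed decay factor

def train_epoch(loader):
    for x, y in loader:                    # normalized input/output features
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                       # decay the learning rate per epoch
```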

2.4. Listening Test Procedure

Twenty native speakers of American English aged 18–53 years (mean = 23.8 years, standard deviation = 10.2) with self-reported normal hearing were recruited for the study. The study was approved by Duke University’s Institutional Review Board. Sentences were vocoded using the Advanced Combination Encoder (ACE) strategy [30]. Subjects were presented with a series of sentences monaurally over headphones at 65 dB sound pressure level and were asked to type all the words they could hear. To discourage random guessing, subjects were instructed to type “I don’t know” for unintelligible sentences.

Subjects were first trained to understand vocoded speech using sentences from the CUNY database [31]. During the testing phase, subjects were presented with sentences from the HINT database [25]. Subjects were tested on anechoic speech and on reverberant speech generated using RIRs from the office room and Aula Carolina, both without the strategy (natural speech) and with the strategy (speech synthesized from the word sequences estimated by the ASR system). An additional condition tested the intelligibility of synthesized speech generated by assuming perfect ASR performance (ideal ASR). Because different acoustic signals arrive at a listener’s two ears in a real-world environment, the strategy could potentially estimate and present different word sequences to each channel. To eliminate this possibility, subjects were tested separately in their left and right ears, resulting in a total of 14 testing conditions. The order of the testing conditions and the assignment of HINT sentence lists to conditions were randomized across participants.

2.5. Data Analysis

The ASR system was evaluated using accuracy, or 1 − WER. The listening tests were evaluated by calculating the proportion of correct words; words in a subject’s response were required to match those in the known transcript to be counted as correct. Variations in articles (a/the) and verb tense (is/was, are/were) were allowed, as in [25]. For statistical analysis, the scores were transformed using the rationalized arcsine transform [32] so that they approximately follow a normal distribution.
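For concreteness, the sketch below shows the two metrics: WER via word-level edit distance (accuracy = 1 − WER) and the rationalized arcsine transform of a number-correct score, following the form given by Studebaker [32].

```python
# Word error rate via Levenshtein distance over word sequences, and the
# rationalized arcsine transform of `correct` out of `total` words.
import math

def wer(ref, hyp):
    """Word error rate between reference and hypothesis word lists."""
    r, h = ref, hyp
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

def rau(correct, total):
    """Rationalized arcsine units for `correct` out of `total` words."""
    t = math.asin(math.sqrt(correct / (total + 1))) + \
        math.asin(math.sqrt((correct + 1) / (total + 1)))
    return (146.0 / math.pi) * t - 23.0
```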

3. RESULTS

3.1. ASR Experiments

Fig. 1 shows the accuracy of the ASR model in each listening environment. The accuracy was 90±1% in the anechoic condition, 83±1% and 84±1% for the left and right channels in the office, and 29±1% and 28±1% for the left and right channels in Aula Carolina. A one-way, repeated-measures analysis of variance (ANOVA) showed that the effect of condition was statistically significant at a significance level of 0.05 (F(2.69, 51.11) = 13208, p < 0.001, ηG² = 0.998, Greenhouse-Geisser correction). Post hoc Tukey tests showed that performance in both reverberant environments was significantly lower than in the anechoic condition (p < 0.001 for all comparisons).
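The paper does not state which statistical software was used; the sketch below shows one possible reproduction of this analysis with the pingouin package, assuming a long-format DataFrame `df` with columns 'split' (the 20 random adaptation/test splits), 'condition', and 'accuracy', and using Bonferroni-corrected pairwise tests as a stand-in for the Tukey procedure.

```python
# One possible reproduction of the repeated-measures ANOVA with pingouin
# (a hypothetical reimplementation; not necessarily the software used).
import pingouin as pg

aov = pg.rm_anova(data=df, dv="accuracy", within="condition",
                  subject="split", correction=True)
print(aov)   # F, Greenhouse-Geisser corrected p, generalized eta-squared

posthoc = pg.pairwise_tests(data=df, dv="accuracy", within="condition",
                            subject="split", padjust="bonf")
print(posthoc)
```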

Fig. 1: Accuracy of the ASR system in an anechoic environment, an office, and Aula Carolina. Results are displayed for the left (L) and right (R) channels. Bar heights and error bars denote the mean and ±1 standard deviation, respectively, across 20 adaptation/test splits.

3.2. Listening Experiment

Fig. 2 shows the performance of each subject with and without the ASR-synthesis strategy. Statistical inference was performed using a three-way, repeated-measures ANOVA with factors room (anechoic, office, and Aula Carolina), channel (left and right), and strategy (natural speech and ASR-synthesis), as well as all two- and three-way interactions. The main effect of room was significant (F(2, 38) = 356.1, p < 0.001, ηG² = 0.776), as was the interaction between room and strategy (F(2, 38) = 62.0, p < 0.001, ηG² = 0.315) and the three-way interaction (F(2, 38) = 3.3, p = 0.0498, ηG² = 0.020). In the anechoic environment, subjects achieved high mean intelligibility scores with naturally produced speech (L = 94±5%; R = 96±4%), while performance dropped significantly when the strategy was applied (L = 77±8%, p < 0.001; R = 70±14%, p < 0.001). In contrast, the ASR-synthesis strategy generally improved performance in both reverberant environments. In the office, mean intelligibility scores increased significantly from 48±19% to 69±13% for the left channel (p < 0.001) and from 56±20% to 72±12% for the right channel (p = 0.001). In Aula Carolina, mean scores increased non-significantly from 21±18% to 22±8% for the left channel (p = 0.191) and significantly from 16±17% to 26±10% for the right channel (p < 0.001).

Fig. 2: Percent of correct words across 20 subjects. Results are displayed for an anechoic environment, an office, and Aula Carolina. Circles (o) denote the right channel, and triangles (Δ) denote the left channel. Data points above the parity line indicate that the ASR-synthesis strategy improves intelligibility.

To compare natural anechoic speech with speech synthesized under the ideal ASR condition, a two-way, repeated-measures ANOVA was conducted with factors channel and strategy, as well as their interaction. Compared to natural anechoic speech, the synthesized speech was significantly less intelligible (L = 81±9%; R = 83±9%; F(1, 19) = 60.6, p < 0.001, ηG² = 0.465).

4. CONCLUSION

This study extended previous work by investigating whether an ASR-synthesis based speech processing strategy could improve speech intelligibility for CI users in a variety of reverberant listening environments. First, the ASR system was tested on a set of previously unseen environments, and results showed comparable recognition performance between the anechoic environment and a moderately reverberant room, but substantially lower performance for the highly reverberant room. Next, the strategy was tested in normal hearing subjects using vocoded speech, and results showed that the strategy improved intelligibility in reverberant environments. These results extend the findings of Hazrati et al. [10] by suggesting that the ASR-synthesis strategy generalizes to a wide variety of reverberant environments. The main limitation of this strategy is that it is non-causal as the ASR system requires knowledge of the entire acoustic signal to estimate a word sequence. Future work will aim to develop a causal ASR-based strategy that is more compatible with testing in CI users.

5. ACKNOWLEDGEMENTS

This work was supported by the National Institute on Deafness and Other Communication Disorders under Award Number R01DC014290-04. The authors would like to thank the subjects who participated in this study.

6. REFERENCES

  • [1] Zeng FG, “Trends in cochlear implants,” Trends Amplif., vol. 8, no. 1, pp. 1–34, 2004.
  • [2] Hazrati O and Loizou PC, “The combined effects of reverberation and noise on speech intelligibility by cochlear implant listeners,” Int. J. Audiol., vol. 51, no. 6, pp. 437–443, June 2012.
  • [3] Kuttruff H, “Design considerations and design procedures,” in Room Acoustics. London: Spon Press, 2000, p. 305.
  • [4] Miyoshi M and Kaneda Y, “Inverse filtering of room acoustics,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 2, pp. 145–152, February 1988.
  • [5] Wang D and Brown G, “Fundamentals of computational auditory scene analysis,” in Computational Auditory Scene Analysis. Wiley-IEEE Press, 2006, pp. 22–23.
  • [6] Grais E, Roma G, Simpson A, and Plumbley M, “Combining mask estimates for single channel audio source separation using deep neural networks,” in Proc. INTERSPEECH, 2016, pp. 3339–3343.
  • [7] Yu D and Deng L, “Introduction,” in Automatic Speech Recognition. London: Springer, 2015, pp. 1–4.
  • [8] Peddinti V, Chen G, Manohar V, Ko T, Povey D, and Khudanpur S, “JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs,” in Proc. IEEE ASRU, 2015, pp. 539–546.
  • [9] Harper M, “The Automatic Speech Recognition in Reverberant Environments (ASpIRE) challenge,” in Proc. IEEE ASRU, 2015, pp. 547–554.
  • [10] Hazrati O, Ghaffarzadegan S, and Hansen JHL, “Leveraging automatic speech recognition in cochlear implants for improved speech intelligibility under reverberation,” in Proc. IEEE ICASSP, 2015, pp. 5093–5097.
  • [11] Jeub M, Schafer M, and Vary P, “A binaural room impulse response database for the evaluation of dereverberation algorithms,” in Proc. IEEE Int. Conf. Digital Signal Process., 2009, pp. 1–5.
  • [12] Jeub M, “Joint Dereverberation and Noise Reduction for Binaural Hearing Aids and Mobile Phones,” Ph.D. thesis, RWTH Aachen University, Aachen, Germany, 2012.
  • [13] Jeub M, Nelke C, Beaugeant C, and Vary P, “Blind estimation of the coherent-to-diffuse energy ratio from noisy speech signals,” in Proc. EUSIPCO, 2011, pp. 1347–1351.
  • [14] Povey D et al., “The Kaldi speech recognition toolkit,” in Proc. IEEE ASRU, 2011.
  • [15] Panayotov V, Chen G, Povey D, and Khudanpur S, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. IEEE ICASSP, 2015, pp. 5206–5210.
  • [16] Davis SB and Mermelstein P, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, 1980.
  • [17] Viikki O and Laurila K, “Cepstral domain segmental feature vector normalization for noise robust speech recognition,” Speech Commun., vol. 25, no. 1–3, pp. 133–147, August 1998.
  • [18] Bahl LR, de Souza PV, Gopalakrishnan PS, Nahamoo D, and Picheny MA, “Robust methods for using context-dependent features and models in a continuous speech recognizer,” in Proc. IEEE ICASSP, 1994, pp. I/533–I/536.
  • [19] Haeb-Umbach R and Ney H, “Linear discriminant analysis for improved large vocabulary continuous speech recognition,” in Proc. IEEE ICASSP, 1992, vol. 1, pp. 13–16.
  • [20] Yu D and Deng L, “Deep neural network-hidden Markov model hybrid systems,” in Automatic Speech Recognition. London: Springer, 2015, p. 103.
  • [21] Gales MJF, “Semi-tied covariance matrices for hidden Markov models,” IEEE Trans. Speech Audio Process., vol. 7, no. 3, pp. 272–281, May 1999.
  • [22] Gales MJF, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998.
  • [23] Project Gutenberg [Online]. Available: www.gutenberg.org (accessed September 16, 2019).
  • [24] CMU Pronouncing Dictionary [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmudict? (accessed October 11).
  • [25] Nilsson M, Soli SD, and Sullivan JA, “Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise,” J. Acoust. Soc. Am., vol. 95, no. 2, pp. 1085–1099, February 1994.
  • [26] Rath S, Povey D, Veselý K, and Černocký J, “Improved feature processing for deep neural networks,” in Proc. INTERSPEECH, 2013, pp. 109–113.
  • [27] Wu Z, Watts O, and King S, “Merlin: An open source neural network speech synthesis system,” in Proc. SSW, 2016, pp. 202–207.
  • [28] Morise M, Yokomori F, and Ozawa K, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. E99-D, no. 7, pp. 1877–1884, July 2016.
  • [29] Kominek J and Black AW, “The CMU Arctic speech databases,” in Proc. SSW, 2004, pp. 223–224.
  • [30] Vandali AE, Whitford LA, Plant KL, and Clark GM, “Speech perception as a function of electrical stimulation rate: Using the Nucleus 24 cochlear implant system,” Ear Hear., vol. 21, no. 6, pp. 608–624, December 2000.
  • [31] Boothroyd A, Hanin L, and Hnath T, “A sentence test of speech perception: Reliability, set equivalence, and short term learning,” City University of New York, New York, 1985.
  • [32] Studebaker GA, “A ‘rationalized’ arcsine transform,” J. Speech Hear. Res., vol. 28, no. 3, pp. 455–462, September 1985.
