Abstract
This work presents a single-channel speech enhancement (SE) framework based on the super-Gaussian extension of the joint maximum a posteriori (SGJMAP) estimation rule. The developed SE algorithm is implemented as an open-source, smartphone-based research application for hearing improvement studies. In this algorithm, the SGJMAP-based gain estimate for the noisy speech mixture is smoothed along the frequency axis by a Mel filter-bank, resulting in a Mel-warped frequency-domain SGJMAP estimate. The impulse response of this Mel-warped estimate is obtained by applying a Mel-warped inverse discrete cosine transform (Mel-IDCT), which filters out the background noise and enhances the speech signal. The proposed application is implemented on an iPhone (Apple, Cupertino, CA) to operate in real time and is tested with normal-hearing (NH) and hearing-impaired (HI) listeners using different types of hearing aids through wireless connectivity. Objective speech quality and intelligibility test results are used to compare the performance of the proposed algorithm with existing conventional single-channel SE methods. Additionally, test results from NH and HI listeners show substantial improvement in speech recognition with the developed method in simulated real-world noisy conditions at different signal-to-noise ratio levels.
I. INTRODUCTION
In the last three decades, significant developments in speech enhancement (SE) algorithms have been achieved. Researchers continue to show interest in developing novel, computationally efficient methods that can help suppress the background noise with no or minimum distortion to speech. SE algorithms have numerous applications for hearing aid devices (HADs), cochlear implants (CIs), and other assistive devices such as remote microphone technology (RMT). SE remains a challenging problem to resolve in the presence of nonstationary background noises and reverberant conditions. In the case of complex and dominant background noise, understanding speech and words is difficult for normal-hearing (NH) listeners and nearly intolerable for hearing-impaired (HI) listeners (Kochkin, 2010).
Researchers have provided many solutions for HI listeners, including the implementation of signal processing algorithms on suitable HADs and CIs. However, in the presence of numerous and often strong environmental noises, the performance of HADs and CIs deteriorates. Many of these devices also lack the computational power to run complex signal processing algorithms because of physical design constraints. HAD manufacturers use RMTs to increase the signal-to-noise ratio (SNR) by placing a separate microphone at the talker while the speech signal is wirelessly transmitted to the listener's hearing device (Thibodeau, 2020). The main limitation of RMTs is that they add expense on top of the cost of the HAD. Smartphones, with their built-in microphones and Bluetooth (Bluetooth Special Interest Group, Kirkland, WA), can replace these traditional RMTs for small adult group conversations as stand-alone devices with no external component or additional hardware. Nearly 96% of Americans own a mobile phone, and 81% own an Advanced RISC Machine processor-based smartphone that allows the user to download applications with access to the built-in microphone.1 Nowadays, hearing aids are released with Bluetooth Low Energy connectivity to smartphones running the iOS (Apple, Cupertino, CA) or Android (Open Handset Alliance, Mountain View, CA) operating systems.2 To give HI listeners more control over the HAD, manufacturers also offer their own smartphone-based applications. The smartphone can stream phone calls and media sound directly to the HADs, and some of these applications enable real-time fine-tuning of the HADs.
Most hearing aid smartphone applications use a single microphone to capture audio (noisy speech) with minimal input/output (I/O) latency.3 A block diagram of the speech processing pipeline used with smartphones and HADs is shown in Fig. 1, which depicts our proposed open-source, smartphone-based adaptive signal processing pipeline for hearing research. The microphone array (1, 2, or 3 microphones) of the smartphone captures the noisy speech signal. The voice activity detector (VAD) block determines whether the input frame is a noisy speech frame or a noise-only frame. The output of the VAD helps separate the noisy speech frames from the noise-only frames so that they can be used for the SE and other stages of the HAD signal processing pipeline. The noisy input speech is then passed through an adaptive acoustic feedback cancellation block, an SE block to suppress the background noise and extract the speech with minimal or no distortion, and a multi-channel dynamic range audio-compression or automatic gain control block (Patel et al., 2019). The direction of arrival (DOA) estimation (Küçük and Panahi, 2020) allows the user to find the direction of the desired speaker. In the proposed smartphone-based HAD signal processing pipeline, the SE aims to suppress the noise and enhance the quality and intelligibility of speech for optimum speech perception, thus improving speech communication performance for the listener.
FIG. 1.
(Color online) Block diagram of the proposed smartphone-based HAD signal processing pipeline.
Boll (1979) introduced the noise attenuation principle in the spectral domain by subtracting the spectral magnitude of the noise from the magnitude spectrum of the noisy speech signal, which is referred to as the spectral subtraction method. That is, an estimate of the noise spectrum is obtained by averaging the instantaneous noise spectral magnitudes and is subtracted from the noisy speech spectrum. This approach often distorts the speech signal by producing undesired audible artifacts (also known as musical noise). Ephraim and Malah (1984, 1985) proposed a novel SE technique to estimate the spectrum of clean speech by minimizing a statistical error criterion. This method, when operated at low SNR levels, has been shown to be more robust to musical noise and, therefore, to introduce less speech distortion. Assuming a uniform phase distribution, the estimate of the speech phase is taken to be the phase of the noisy speech signal itself, which is the approach used in most statistical model-based SE methods; as such, many SE methods regard the speech phase as perceptually unimportant (Vary and Eurasip, 1985). Most advanced single microphone (single-channel) SE techniques help minimize the background noise, whereas inaccuracy in the noise estimate during nonspeech activity contributes to speech distortion (Loizou, 2013). Such inaccuracy leads to inaccurate SE gain function estimates and can mask important speech components, which degrades speech intelligibility despite significant attenuation of the background noise.
SE techniques are expected to reduce the background noise without compromising perceptual speech quality and intelligibility. Statistical model-based single microphone SE techniques are mostly based on the decomposition of noisy speech into complex exponentials using short-time Fourier transforms (STFTs) to obtain the signal spectrum. The signal spectral magnitude is then multiplied by a nonlinear SE gain function. The gain function is derived by optimizing a cost function that minimizes the speech distortion while suppressing the background noise. Other well-known SE techniques include the minimum mean square error (MMSE) and log-MMSE (Ephraim and Malah, 1984, 1985) estimators of the speech magnitude spectrum, maximum likelihood estimation (McAulay and Malpass, 1980), and the joint maximum a posteriori (JMAP) estimation method (Wolfe and Godsill, 2003). A super-Gaussian extension of the joint maximum a posteriori (SGJMAP) was proposed in Karadagur Ananda Reddy et al. (2017) and Lotter and Vary (2005) and has outperformed other SE algorithms (Ephraim and Malah, 1984, 1985; Wolfe and Godsill, 2003). The super-Gaussian statistical model of the clean speech and noise spectral components attains a lower mean squared error than the Gaussian model. The real-valued time-frequency gain function in most statistical model-based SE methods is derived as a function of the a priori SNR and the a posteriori SNR. Hu and Loizou (2006) conducted a subjective speech quality assessment of the most widely used SE algorithms to evaluate speech distortion, noise reduction, and overall speech quality, and showed that the statistical model-based SE methods perform better than other approaches.
The popularity of machine learning (ML) and neural network techniques for solving multivariate, complex, nonlinear problems is growing. By using such techniques, several issues related to the processing of speech signals can also be addressed. Recent SE innovations with deep neural networks (DNNs; Bhat et al., 2019; Shankar et al., 2020a,b) achieve superior noise reduction even in nonstationary noisy environments. Although these methods result in better objective test results after intensive training with a large dataset, the enhanced speech is dependent on the training procedure and is application specific (Nossier et al., 2020). Besides, the computational complexity and processing power needed for running the ML/DNN-based SE algorithms in real time can also pose some limitations. However, for real-time SE applications, researchers are developing computationally efficient DNN models which are less complex (Fedorov et al., 2020).
In this research note, a statistical model-based SE method is proposed that runs on a smartphone (with no external component) in real time. The proposed SE algorithm has two stages based on the SGJMAP cost function (Karadagur Ananda Reddy et al., 2017; Lotter and Vary, 2005). In the first stage, the SGJMAP gain estimate of the noisy speech signal is smoothed along the frequency axis by a Mel filter-bank, leading to a Mel-warped frequency-domain SGJMAP estimate (Agarwal and Cheng, 1999; Processing, 2003). By applying a Mel-warped inverse discrete cosine transform (Mel-IDCT), we derive the impulse response of the Mel-warped estimate, which filters out the background noise from the input noisy speech signal. The traditional SGJMAP SE (Lotter and Vary, 2005) is used in the second stage as a post-filter to minimize the residual noise present in the first-stage output. The proposed two-stage SE pipeline suppresses the background noise with minimal speech distortion in real time. The developed algorithm is implemented on a smartphone, which records the noisy speech and processes the signal using the proposed adaptive SE method. The enhanced output speech is then transmitted to the user's HAD/headset through a wired or wireless connection. As a result, the proposed smartphone-based application is an effective assistive platform for NH and HI users.
The authors performed objective speech quality and intelligibility tests using standard computer simulation-based measurements for machinery, traffic, and multi-talker babble noise mixed with clean speech at SNR levels of −5, 0, and +5 dB. Real-life noise signals recorded on the smartphone are used in all of the testing methods. The perceptual evaluation of speech quality (PESQ; Rix et al., 2001) was used as the objective test of speech quality, and the coherence speech intelligibility index (CSII; Loizou, 2013) and short-time objective intelligibility (STOI; Taal et al., 2011) were used as the objective tests of speech intelligibility. The proposed method was compared with several baseline statistical model-based single microphone SE techniques. The validation of the proposed SE smartphone application was completed using speech-in-noise tests: along with the objective test measures, participants with normal hearing and hearing loss completed speech recognition tests. The obtained clinical test results reflect substantial improvements in both speech quality and understanding when using the proposed method compared to listening through commercially available HADs alone without the proposed SE method.
II. PROPOSED SE
This section describes the proposed two-stage SE pipeline. In Fig. 2, the block diagram of the proposed algorithm reflects the usability and real-time implementation of the developed SE method with a smartphone and a HAD. We consider the time-domain noisy speech signal y(t) to be an additive mixture model of clean speech s(t) and noise z(t).
$$y(t) = s(t) + z(t). \tag{1}$$
The input noisy speech signal is transformed from the time-domain into the frequency-domain by taking the STFT.
$$Y(\lambda,k) = S(\lambda,k) + Z(\lambda,k), \tag{2}$$

where $Y(\lambda,k)$, $S(\lambda,k)$, and $Z(\lambda,k)$ represent the STFT of y(t), s(t), and z(t), respectively, for the frame λ and frequency bin k. In polar coordinates, Eq. (2) can be written as

$$R(\lambda,k)\,e^{j\theta_y(\lambda,k)} = A(\lambda,k)\,e^{j\theta_s(\lambda,k)} + B(\lambda,k)\,e^{j\theta_z(\lambda,k)}, \tag{3}$$

where $R(\lambda,k)$, $A(\lambda,k)$, and $B(\lambda,k)$ are the magnitude spectra of the noisy speech, clean speech, and noise, respectively, and $\theta_y(\lambda,k)$, $\theta_s(\lambda,k)$, and $\theta_z(\lambda,k)$ represent the phases of the noisy speech, clean speech, and noise, respectively.
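As a minimal illustration of this analysis/synthesis model, the following Python sketch computes the STFT of the noisy mixture and resynthesizes a time-domain signal. The frame length, 50% overlap, and sampling rate follow the parameters given later in Secs. III and IV; the Hann window, the function names, and the use of scipy are assumptions rather than the paper's on-device implementation.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000                   # sampling rate used throughout the paper
frame_len = int(0.016 * fs)  # 16 ms frames -> 256 samples (Sec. III)
hop = frame_len // 2         # 50% overlap

def analyze(y):
    """STFT of the noisy mixture y(t) = s(t) + z(t) of Eq. (1), giving Y(lambda, k)."""
    _, _, Y = stft(y, fs=fs, window="hann", nperseg=frame_len, noverlap=hop)
    return Y                 # shape: (frame_len // 2 + 1, num_frames)

def synthesize(Y):
    """Overlap-add inverse STFT back to the time domain."""
    _, y_hat = istft(Y, fs=fs, window="hann", nperseg=frame_len, noverlap=hop)
    return y_hat
```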
FIG. 2.
(Color online) Block diagram representation of the proposed single microphone SE method.
A. SGJMAP gain estimation
A non-Gaussian property of the spectral-domain noise reduction framework is considered, i.e., a super-Gaussian speech model is used (Karadagur Ananda Reddy et al., 2017; Shankar et al., 2020c). The purpose of the SE is to obtain an estimate $\hat{A}(\lambda,k)$ of the clean speech magnitude spectrum $A(\lambda,k)$. The frame index λ is omitted for brevity in the following derivations. The JMAP estimator jointly maximizes the probability of the magnitude and phase spectra conditioned on the observed complex coefficient,
$$\{\hat{A}_k, \hat{\theta}_{s,k}\} = \underset{A_k,\,\theta_{s,k}}{\arg\max}\; p(A_k, \theta_{s,k} \mid Y_k) \tag{4}$$

$$= \underset{A_k,\,\theta_{s,k}}{\arg\max}\; \frac{p(Y_k \mid A_k, \theta_{s,k})\, p(A_k)\, p(\theta_{s,k})}{p(Y_k)}. \tag{5}$$

$p(\cdot)$ denotes the probability density function (PDF) of its argument. By approximating the PDF of the speech spectral magnitude with respect to the individualized parameters μ and ν, the super-Gaussian PDF (Martin, 2002) of the magnitude spectral coefficient is given as

$$p(A_k) = \frac{\mu^{\nu+1}}{\Gamma(\nu+1)}\,\frac{A_k^{\nu}}{\sigma_{s,k}^{\nu+1}}\,\exp\!\left(-\mu\,\frac{A_k}{\sigma_{s,k}}\right), \tag{6}$$

where $\Gamma(\cdot)$ denotes the Gamma function. The logarithm of Eq. (4) is differentiated with respect to $A_k$ and equated to zero. By considering $\hat{\theta}_{s,k} = \theta_{y,k}$, we get

$$\frac{2\left[R_k\cos(\theta_{y,k} - \hat{\theta}_{s,k}) - A_k\right]}{\sigma_{z,k}^2} + \frac{\nu}{A_k} - \frac{\mu}{\sigma_{s,k}} = 0. \tag{7}$$
Further simplification of Eq. (7) yields a quadratic equation,

$$\frac{2A_k^2}{\sigma_{z,k}^2} + A_k\left(\frac{\mu}{\sigma_{s,k}} - \frac{2R_k}{\sigma_{z,k}^2}\right) - \nu = 0. \tag{8}$$

Solving the obtained quadratic equation in terms of the a priori SNR $\xi_k$ and the a posteriori SNR $\gamma_k$ gives

$$\hat{A}_k = R_k\left(u_k + \sqrt{u_k^2 + \frac{\nu}{2\gamma_k}}\right), \qquad u_k = \frac{1}{2} - \frac{\mu}{4\sqrt{\gamma_k\,\xi_k}}, \tag{9}$$

where $\xi_k = \sigma_{s,k}^2/\sigma_{z,k}^2$ is the a priori SNR and $\gamma_k = R_k^2/\sigma_{z,k}^2$ is the a posteriori SNR. $\sigma_{z,k}^2$ is the noise power estimated with the help of a VAD (Sohn et al., 1999), and $\sigma_{s,k}^2$ is the estimated power spectral density of the clean speech. The values of μ and ν recommended in Lotter and Vary (2005) are shown to give good results. The output speech magnitude spectrum estimate is $\hat{A}_k = G_k R_k$, where the SE gain function is

$$G_k = u_k + \sqrt{u_k^2 + \frac{\nu}{2\gamma_k}}. \tag{10}$$
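For concreteness, a hedged Python sketch of the gain of Eqs. (9) and (10) is given below; gamma and xi are the per-bin a posteriori and a priori SNRs, and the default mu and nu are illustrative placeholders only (the values reported in Lotter and Vary (2005) would be substituted in practice).

```python
import numpy as np

def sgjmap_gain(gamma, xi, mu=1.74, nu=0.126):
    """SGJMAP gain of Eq. (10): G = u + sqrt(u**2 + nu / (2 * gamma)),
    with u = 1/2 - mu / (4 * sqrt(gamma * xi)).  The default mu and nu are
    illustrative placeholders, not values prescribed by this paper."""
    gamma = np.maximum(np.asarray(gamma, dtype=float), 1e-8)
    xi = np.maximum(np.asarray(xi, dtype=float), 1e-8)
    u = 0.5 - mu / (4.0 * np.sqrt(gamma * xi))
    return u + np.sqrt(u ** 2 + nu / (2.0 * gamma))
```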
B. Mel filter-bank and Mel-IDCT
Mel-frequency coefficients have been widely used in both speaker and speech recognition. The Mel-warped frequency scale is designed to replicate how the human ear perceives sound. Because the ear's spectral resolution decreases with increasing frequency, the Mel scale down-samples information in the higher frequency range (Zhou et al., 2011). In addition, several past studies have shown that reducing the estimation error in a perceptually significant domain is more beneficial for speech improvement (Cheng and O'Shaughnessy, 1991; Tsoukalas et al., 1997), which is the approach taken in this smartphone application. Hence, in this paper, the Mel-frequency scale is considered to be the perceptual domain. The SGJMAP SE gain coefficients computed in Eq. (10) are smoothed and transformed to the Mel-frequency scale. The Mel-warped frequency-domain SGJMAP coefficients $G_{\text{mel}}$ are estimated by using triangular-shaped, half-overlapped frequency windows (Liu et al., 2010; Shah et al., 2004). The relation between the Mel scale and the frequency domain is given by
$$\mathrm{mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right), \tag{11}$$

$$f_{c_i} = \mathrm{mel}^{-1}\!\left(\frac{i}{FB+1}\,\mathrm{mel}\!\left(\frac{f_s}{2}\right)\right), \quad i = 1,\ldots,FB. \tag{12}$$

Here, Eq. (11) computes the Mel-frequency coefficient from the frequency f in Hz. Equation (12) represents the central frequencies of the filter-bank bands (Rao and Manjunath, 2017; Yu and Deng, 2016), where $\mathrm{mel}^{-1}(m) = 700\,(10^{m/2595} - 1)$ is the inverse of Eq. (11) and FB = 23 is the number of filter-bank bands for our test results (Processing, 2003). The sampling frequency $f_s$ is set to 16 kHz, and the upper frequency f in Eq. (11) is limited to 8 kHz. In addition to the 23 filter-bank bands, 2 marginal filter-bank bands with central frequencies $f_{c_0} = 0$ and $f_{c_{FB+1}} = f_s/2$ are considered for the purpose of the discrete cosine transform (DCT) to the time-domain. The frequency bin index corresponding to the central frequencies is obtained as

$$k_{c_i} = \mathrm{round}\!\left(\frac{f_{c_i}}{f_s}\,N\right), \quad i = 0,\ldots,FB+1, \tag{13}$$

where N is the STFT length (N = 256 here). Due to the complex conjugate symmetry of the STFT, only the frequency bins $k = 0,\ldots,N/2$ need to be processed. The Mel-spaced filter-bank is given by (Bhat et al., 2019)

$$W_i(k) = \begin{cases} \dfrac{k - k_{c_{i-1}}}{k_{c_i} - k_{c_{i-1}}}, & k_{c_{i-1}} \le k \le k_{c_i},\\[4pt] \dfrac{k_{c_{i+1}} - k}{k_{c_{i+1}} - k_{c_i}}, & k_{c_i} < k \le k_{c_{i+1}}. \end{cases} \tag{14}$$
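As a rough illustration of Eqs. (11)–(14), the Python sketch below builds the Mel-spaced central frequencies, their STFT bin indices, and the half-overlapped triangular windows. The exact rounding and window construction follow the ETSI-style front end only approximately and are assumptions; function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # Eq. (11): Mel value of a frequency f in Hz
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # inverse of Eq. (11)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(fs=16000, n_fft=256, FB=23):
    """Central frequencies (Eq. (12)), their STFT bin indices (Eq. (13)),
    and the half-overlapped triangular windows of Eq. (14).  Returns the
    central frequencies, bin indices, and a list of (bins, weights) pairs."""
    mel_max = hz_to_mel(fs / 2.0)
    centers = mel_to_hz(np.arange(FB + 2) * mel_max / (FB + 1))  # incl. 2 marginal bands
    bins = np.round(centers * n_fft / fs).astype(int)
    windows = []
    for i in range(1, FB + 1):
        left, mid, right = bins[i - 1], bins[i], bins[i + 1]
        k = np.arange(left, right + 1)
        w = np.where(k <= mid,
                     (k - left) / max(mid - left, 1),
                     (right - k) / max(right - mid, 1))
        windows.append((k, w))
    return centers, bins, windows
```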
The Mel-warped SGJMAP coefficients are given by
$$G_{\text{mel}}(i) = \frac{\sum_{k=0}^{N/2} W_i(k)\,G_k}{\sum_{k=0}^{N/2} W_i(k)}, \tag{15}$$

where $G_k$ are the computed SGJMAP SE gain values in Eq. (10) and $i = 0,\ldots,FB+1$. By applying the Mel-IDCT to Eq. (15), the time-domain impulse response for the SGJMAP SE is

$$h(n) = \sum_{i=0}^{FB+1} G_{\text{mel}}(i)\,M_{\text{idct}}(i,n), \quad n = 0,\ldots,FB+1, \tag{16}$$

where $M_{\text{idct}}$ is the Mel-IDCT, which is defined as

$$M_{\text{idct}}(i,n) = \cos\!\left(\frac{2\pi n f_{c_i}}{f_s}\right)\Delta f_i, \tag{17}$$

where $f_{c_i}$ for $i = 0,\ldots,FB+1$ are the central frequencies corresponding to the Mel filter-bank, with $f_{c_0} = 0$ and $f_{c_{FB+1}} = f_s/2$. The $\Delta f_i$ is computed as

$$\Delta f_i = \begin{cases} \dfrac{f_{c_1} - f_{c_0}}{f_s}, & i = 0,\\[4pt] \dfrac{f_{c_{i+1}} - f_{c_{i-1}}}{2 f_s}, & 1 \le i \le FB,\\[4pt] \dfrac{f_{c_{FB+1}} - f_{c_{FB}}}{f_s}, & i = FB+1. \end{cases} \tag{18}$$
Finally, the time-domain impulse response of the SGJMAP SE in Eq. (16) is mirrored to obtain a causal impulse response (Processing, 2003). The impulse response is then weighted using a Hanning window, and the input noisy speech time-domain signal y(t) is filtered using the weighted impulse response. The filtered output is the enhanced speech signal $\hat{s}_1(t)$, which goes to the post-filter SGJMAP SE block.
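Continuing the sketch above, and again only as an illustration under the same assumptions (the normalization of Eqs. (17) and (18) is approximated, and scipy's lfilter stands in for the on-device FIR filtering), the smoothing, Mel-IDCT, mirroring, windowing, and filtering steps of the first stage might look as follows.

```python
import numpy as np
from scipy.signal import lfilter

def mel_warp_gains(G, bins, windows):
    """Eq. (15): smooth the per-bin SGJMAP gains G with the triangular Mel
    windows; the two marginal bands copy the gains at 0 Hz and fs/2."""
    G_mel = np.empty(len(bins))
    G_mel[0], G_mel[-1] = G[bins[0]], G[bins[-1]]
    for i, (k, w) in enumerate(windows, start=1):
        G_mel[i] = np.sum(w * G[k]) / max(np.sum(w), 1e-8)
    return G_mel

def mel_idct_impulse(G_mel, centers, fs=16000):
    """Eqs. (16)-(18): Mel-warped IDCT of the smoothed gains, giving a short
    time-domain impulse response h(n), n = 0..FB+1."""
    n_bands = len(centers)
    df = np.empty(n_bands)
    df[0] = (centers[1] - centers[0]) / fs
    df[-1] = (centers[-1] - centers[-2]) / fs
    df[1:-1] = (centers[2:] - centers[:-2]) / (2.0 * fs)
    n = np.arange(n_bands)
    basis = np.cos(2.0 * np.pi * np.outer(n, centers) / fs) * df  # basis[n, i]
    return basis @ G_mel

def first_stage_filter(y, h_half):
    """Mirror h(n) into a causal symmetric impulse response, weight it with a
    Hanning window, and filter the noisy input y(t) (first SE stage)."""
    h = np.concatenate((h_half[:0:-1], h_half))
    h = h * np.hanning(len(h))
    return lfilter(h, [1.0], y)
```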
C. Post-filter SGJMAP
The single microphone SGJMAP SE explained in Sec. II A is used as the post-filter to eliminate the residual background noise present in $\hat{s}_1(t)$. The SGJMAP gain values in Eq. (10) are calculated for the enhanced signal $\hat{s}_1(t)$. The estimate of the magnitude spectrum of the final clean speech is given by

$$\hat{A}_{\text{out}}(\lambda,k) = G(\lambda,k)\,|\hat{S}_1(\lambda,k)|, \tag{19}$$

where $\hat{S}_1(\lambda,k)$ represents the STFT of $\hat{s}_1(t)$ and $G(\lambda,k)$ is the SGJMAP gain of Eq. (10) recomputed for $\hat{S}_1(\lambda,k)$. In Vary and Eurasip (1985), the speech phase is considered to be perceptually insignificant. The phase of the noisy speech signal and $\hat{A}_{\text{out}}(\lambda,k)$ from Eq. (19) are used to obtain the time-domain enhanced speech signal using the inverse fast Fourier transform (IFFT). The enhanced clean speech signal is then transmitted from the smartphone to the HADs (or earphones) through a wired or wireless connection as shown in Fig. 2.
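To show how the second stage fits on top of the first, a hedged per-frame sketch of the post-filter of Eq. (19) follows. It reuses the sgjmap_gain sketch from Sec. II A; the decision-directed a priori SNR tracking and the smoothing factor alpha are assumptions, since the paper does not state how the a priori SNR is estimated.

```python
import numpy as np

def sgjmap_post_filter(Y1, noise_psd, alpha=0.98):
    """Second-stage SGJMAP post-filter (Eq. (19)): re-apply the gain of
    Eq. (10) to the STFT Y1 of the first-stage output and keep its phase.
    noise_psd is the VAD-based noise power per frequency bin; the
    decision-directed a priori SNR tracking below is an assumption."""
    n_bins, n_frames = Y1.shape
    S_hat = np.zeros_like(Y1)
    A_prev = np.zeros(n_bins)            # previous-frame amplitude estimate
    for t in range(n_frames):
        R = np.abs(Y1[:, t])
        gamma = np.maximum(R ** 2 / np.maximum(noise_psd, 1e-10), 1e-8)
        xi = alpha * A_prev ** 2 / np.maximum(noise_psd, 1e-10) \
            + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
        G = sgjmap_gain(gamma, np.maximum(xi, 1e-8))  # gain sketch from Sec. II A
        A_prev = G * R
        S_hat[:, t] = A_prev * np.exp(1j * np.angle(Y1[:, t]))
    return S_hat
```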
III. OBJECTIVE TEST RESULTS
Standard computer simulation-based measures are used for the objective tests. Performance of the proposed two-stage SE algorithm is compared with the input noisy speech itself and two benchmark methods using a single microphone statistical model-based SE approach. The first method for comparison, called the direct benchmark technique, is based on the use of the SGJMAP estimator given by Lotter and Vary (2005). The second method for comparison is the log-MMSE SE method proposed by Ephraim and Malah (1985), which is known to perform well across different noise types (Wolfe and Godsill, 2003; Hu and Loizou, 2006). In Tables I–III, the comparisons use objective test results such that “noisy” denotes the unprocessed noisy speech signal, “logMMSE” denotes the use of the SE method of Ephraim and Malah (1985), “SGJMAP” denotes the use of the SE method of Lotter and Vary (2005), and “proposed” denotes the use of the SE method presented in this paper.
TABLE I.
Comparison of the SE methods in terms of the PESQ.
| SNR (dB) | Method | Machinery | Babble | Traffic |
|---|---|---|---|---|
| −5 | Noisy | 1.09 | 1.06 | 1.07 |
|  | logMMSE | 1.11 | 1.06 | 1.09 |
|  | SGJMAP | 1.10 | 1.08 | 1.07 |
|  | Proposed | 1.22 | 1.09 | 1.11 |
| 0 | Noisy | 1.10 | 1.06 | 1.08 |
|  | logMMSE | 1.14 | 1.12 | 1.12 |
|  | SGJMAP | 1.18 | 1.09 | 1.09 |
|  | Proposed | 1.25 | 1.13 | 1.15 |
| +5 | Noisy | 1.12 | 1.09 | 1.11 |
|  | logMMSE | 1.48 | 1.21 | 1.24 |
|  | SGJMAP | 1.33 | 1.13 | 1.17 |
|  | Proposed | 1.62 | 1.26 | 1.37 |
TABLE II.
Comparison of the SE methods in terms of the CSII.
| SNR (dB) | Method | Machinery | Babble | Traffic |
|---|---|---|---|---|
| −5 | Noisy | 0.54 | 0.34 | 0.38 |
|  | logMMSE | 0.50 | 0.28 | 0.41 |
|  | SGJMAP | 0.46 | 0.19 | 0.32 |
|  | Proposed | 0.59 | 0.34 | 0.45 |
| 0 | Noisy | 0.70 | 0.50 | 0.54 |
|  | logMMSE | 0.72 | 0.48 | 0.55 |
|  | SGJMAP | 0.66 | 0.43 | 0.48 |
|  | Proposed | 0.75 | 0.52 | 0.60 |
| +5 | Noisy | 0.84 | 0.66 | 0.69 |
|  | logMMSE | 0.86 | 0.66 | 0.71 |
|  | SGJMAP | 0.90 | 0.64 | 0.68 |
|  | Proposed | 0.88 | 0.69 | 0.76 |
TABLE III.
Comparison of the SE methods in terms of the STOI.
| SNR (dB) | Method | Machinery | Babble | Traffic |
|---|---|---|---|---|
| −5 | Noisy | 0.63 | 0.56 | 0.58 |
|  | logMMSE | 0.63 | 0.46 | 0.58 |
|  | SGJMAP | 0.58 | 0.38 | 0.49 |
|  | Proposed | 0.64 | 0.51 | 0.59 |
| 0 | Noisy | 0.74 | 0.69 | 0.70 |
|  | logMMSE | 0.74 | 0.61 | 0.68 |
|  | SGJMAP | 0.72 | 0.55 | 0.64 |
|  | Proposed | 0.76 | 0.65 | 0.72 |
| +5 | Noisy | 0.84 | 0.80 | 0.81 |
|  | logMMSE | 0.83 | 0.75 | 0.79 |
|  | SGJMAP | 0.85 | 0.75 | 0.78 |
|  | Proposed | 0.87 | 0.79 | 0.82 |
The experimental evaluations are performed using three different background noise types which are often encountered by the listener: machinery (e.g., factory), multi-talker babble (e.g., restaurant), and traffic (e.g., street) noise. The three types of recorded background noise represent a wide range of temporal and spectral characteristics. All of the noise types show nonstationary behavior. The machinery noise used in our experiments contains some quasiperiodic and periodic components. The spectral coefficients of recorded babble noise vary from being Gaussian to super-Gaussian depending on the number of speakers. The recorded traffic noise is mixed with the wind noise. It also includes the Doppler effect as a result of the approaching or receding vehicles. Smartphones have been used to capture actual noise signals via their microphones and record the data samples in real-life environments. The objective test results are the average of ten clean speech sentences from the Hearing in Noise Test (HINT; Nilsson et al., 1994) database mixed with the abovementioned noise types at a SNR of −5, 0, and +5 dB. All of the data files are obtained at a 16 kHz sampling rate, and 16 ms data frames with 50% overlap are used for the signal processing. Thus, the STFT of each frame contains 256 points.
The PESQ for speech quality measurement and the CSII and STOI for speech intelligibility measurements are used as objective criteria. The PESQ score ranges between −0.5 and 4.5, where 4.5 indicates high speech quality. The PESQ is calculated as follows: the clean and degraded signals are equalized to a standard listening level and filtered through a telephone-system-equivalent filter. The signals are then time-aligned and processed through an auditory transform to obtain the loudness spectra. The loudness difference between the signals is computed and averaged over time and frequency to predict the subjective quality rating (Rix et al., 2001). The CSII ranges between zero and one, where one indicates highly intelligible speech. The speech intelligibility index serves as the basis for the CSII calculation; the score depends on a signal-to-distortion ratio term, which is computed using the coherence between the input and output signals (Loizou, 2013). The STOI computes the correlation between the clean reference signal and its distorted version (Taal et al., 2011). In some conditions, the intelligibility scores of the SE methods are close to those of the unprocessed noisy speech, indicating that the speech is not distorted by the SE method. Tables I, II, and III show the PESQ, CSII, and STOI scores at SNRs of −5, 0, and +5 dB for the unprocessed noisy speech and the three SE methods under the three background noise types. Although the existing baseline SE algorithms suppress stationary noise types, it remains a challenge to suppress nonstationary noise without distorting the speech. The objective test results show that the proposed SE method performs better than the baseline techniques in babble noise (a nonstationary noise type), making the proposed SE application suitable for suppressing background noise in real-life noisy environments. The proposed SE is on par with the other SE algorithms for stationary noise cases, which most SE algorithms can suppress more easily. These objective test results are generated with matlab (The MathWorks, Natick, MA) code on a computer.
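The scores in Tables I–III were generated with matlab implementations of these measures. Purely as an illustration, comparable open-source reference implementations of PESQ and STOI are available in the Python packages pesq and pystoi (CSII is not commonly packaged); the file names and the choice of wideband PESQ mode below are assumptions.

```python
import soundfile as sf
from pesq import pesq       # ITU-T P.862 PESQ (pip package "pesq")
from pystoi import stoi     # short-time objective intelligibility

fs = 16000
clean, _ = sf.read("clean_hint_sentence.wav")   # hypothetical file names
enhanced, _ = sf.read("enhanced_output.wav")

pesq_score = pesq(fs, clean, enhanced, "wb")    # wideband mode assumed here
stoi_score = stoi(clean, enhanced, fs, extended=False)
print(f"PESQ = {pesq_score:.2f}, STOI = {stoi_score:.2f}")
```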
Because the proposed application operates in real time and is to be used in noisy environments, reverberation is always present to some degree. Hence, we test the SE algorithms with the abovementioned noisy speech data mixed with reverberation in terms of the PESQ and STOI. Multi-talker babble noise (a commonly encountered background noise) at an SNR of 0 dB is considered for testing. Simulated room impulse responses are generated using the image source model (ISM) method (Lehmann et al., 2007); a simulation sketch follows Table IV. The room size considered is 5 m × 5 m × 5 m with a reverberation time (RT60) of 0.5 s. From Table IV, we notice that the SE algorithms degrade in the presence of reverberation. However, the effect of reverberation on speech perception has been studied for years, and numerous ways to mitigate it have been explored (Hazrati et al., 2013). Thus, to improve the quality and performance of real-time SE applications, a dereverberation block can be added to the signal processing pipeline, which helps improve the performance of the SE algorithm.
TABLE IV.
The effect of reverberation.
| Method | PESQ | STOI |
|---|---|---|
| Noisy | 1.03 | 0.44 |
| logMMSE | 1.04 | 0.42 |
| SGJMAP | 1.02 | 0.42 |
| Proposed | 1.06 | 0.43 |
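The paper's room impulse responses use the ISM method of Lehmann et al. (2007). As a substitute sketch only, the pyroomacoustics package (also image-source based) can simulate a comparable 5 m × 5 m × 5 m room with RT60 ≈ 0.5 s; the source and microphone positions, and the use of pyroomacoustics itself, are assumptions.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
room_dim = [5.0, 5.0, 5.0]     # 5 m x 5 m x 5 m room from Sec. III
rt60 = 0.5                     # target reverberation time in seconds

# Sabine-based absorption and reflection order for the image-source model
e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
room = pra.ShoeBox(room_dim, fs=fs,
                   materials=pra.Material(e_absorption), max_order=max_order)

speech = np.random.randn(fs)                     # placeholder for a HINT sentence
room.add_source([2.5, 3.5, 1.5], signal=speech)  # hypothetical source position
room.add_microphone([2.5, 1.5, 1.5])             # hypothetical microphone position
room.simulate()
reverberant = room.mic_array.signals[0]          # reverberant signal for testing
```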
In addition to objective test results, Fig. 3 shows the signal spectrogram plots. The clean speech is degraded with machinery noise at a SNR of 0 dB. The plots include the spectrograms of clean speech, noisy speech, and the enhanced speech produced by the proposed SE method.
FIG. 3.
(Color online) Spectrogram plots of clean speech, noisy speech (machinery noise, SNR 0 dB), and enhanced speech using the proposed SE.
IV. REAL-TIME IMPLEMENTATION
The proposed SE algorithm can be implemented on any ARM-based processing platform to operate in real time. In this paper, we use an iPhone XR running iOS 13.1.1 (Apple, Cupertino, CA) as the processing platform. No external or additional microphones are needed apart from the smartphone's built-in microphone; the proposed method benefits from the smartphone's computational power, features, and built-in microphones. The input noisy speech data are captured on the smartphone at a 48 kHz sampling rate and down-sampled to 16 kHz by lowpass filtering and decimation by a factor of 3. Thus, 256 samples are available for each processing frame (16 ms frames with 50% overlap), and the STFT size is set to 256 points. A snapshot of the user interface (smartphone application) is shown in Fig. 4. Xcode is used for coding and debugging the SE algorithm.4 Core Audio, Apple's audio framework (Cupertino, CA), is used for the data I/O.5
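A minimal sketch of this down-sampling step, assuming scipy (the app itself performs the equivalent filtering in its audio I/O callback, and the function name is hypothetical):

```python
from scipy.signal import decimate

def capture_to_16k(buffer_48k):
    """Down-sample a 48 kHz microphone buffer to the 16 kHz processing rate:
    anti-aliasing lowpass (FIR) followed by decimation by a factor of 3.
    zero_phase=False keeps the filter causal, as needed for real-time use."""
    return decimate(buffer_48k, 3, ftype="fir", zero_phase=False)
```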
FIG. 4.
(Color online) Snapshot of the developed SE application running on a smartphone.
When the switch button on the phone touch screen is set to the "off" mode, the application simply plays out the input noisy speech from the smartphone's microphone without any processing. Setting the switch to "on" allows the proposed SE module to process the input noisy speech signal. The enhanced output signal is then transmitted to the HAD or earphone unit through a wired or wireless connection [via the smartphone's Bluetooth (Bluetooth Special Interest Group, Kirkland, WA)]. When the switch is first turned on, the SE algorithm uses the initial 1–2 s to estimate the noise power; it is therefore assumed that there is no speech activity during this period. A volume control slider lets the user adjust the output volume to a comfortable listening level. The smartphone considered for implementation meets the requirements set by the Federal Communications Commission (FCC) and has an M3, T4 hearing aid compatibility rating. An audio/video demonstration6 shows the proposed iOS application running on the iPhone XR in real time.
A. Smartphone testing and computational complexity
The measured overall I/O audio latency of the proposed application is ≈15 ms, of which the I/O latency of the iPhone XR (Apple, Cupertino, CA) itself is ≈9 ms.7 The processing delay of the proposed SE alone is ≈6 ms. All of these are measured on the smartphone for an input frame size of 16 ms. Figure 5 illustrates the central processing unit (CPU) usage, memory usage, and energy impact of the proposed real-time smartphone application. The CPU consumption is low (7%), and the memory usage is approximately 47.2 MB. The iPhone XR has 4 GB of random access memory (RAM), so the proposed application uses only slightly more than 1% of its memory. This demonstrates that the smartphone CPU and memory are not overwhelmed by the proposed real-time SE application. The developed application uses minimal smartphone resources and can, therefore, be used while other applications are running in the background. The proposed SE application also has a low energy impact: a fully charged iPhone XR with a battery capacity of 2942 mAh runs for approximately 9 h while the SE application is running continuously.
FIG. 5.
(Color online) The CPU, memory consumption, and energy impact of the proposed smartphone real-time application.
V. SUBJECTIVE TEST RESULTS
Tittle et al. (2020) describe the results from the proposed real-time SE smartphone application with NH and HI listeners when used with three sets of HADs, a summary of which follows. A detailed explanation of the amount of hearing loss, test arrangement, procedure, and results can be found in Tittle et al. (2020). The participants are asked to type the speech sentences played through a speaker in the presence of background noise. Two test conditions are considered: (i) the use of HAD-alone and (ii) the use of HAD with the proposed SE application running on the smartphone in real time (HAD + SE). In condition (ii), the HAD is connected to the smartphone and the microphone of the smartphone is used. The HAD-alone condition uses the microphone of the HAD and the noise reduction features on the HAD are turned off. The HAD + SE condition is referred to as the Smartphone Hearing Aid Research Project Version 2 application (SHARP-2 app) by the authors in Tittle et al. (2020). Section V A summarizes their findings.
A. Participants
The test results are those of 33 listeners from 20 to 78 years of age: 20 NH listeners and 13 listeners with moderate-to-severe bilateral hearing loss (Tittle et al., 2020). The primary language of all participants is English, and everyone provided fully informed consent to the evaluation process as approved by the institutional review board (IRB) of the University of Texas at Dallas.
B. Equipment and test arrangement
The hardware tools used in the analysis include an iPhone XR (Apple, Cupertino, CA), two sets of made-for-iPhone (MFi) HADs, and one set of made-for-all (MFA) HADs. The iPhone volume is manually set according to the user's comfortable listening level. Starkey Halo II (Starkey Hearing Technologies, Eden Prairie, MN), Oticon Opn 1 (Oticon, Copenhagen, Denmark), and Phonak Audeo Marvel (Phonak, Zurich, Switzerland) HADs are considered. The Phonak HAD can be connected to both the iOS and Android platforms (MFA HAD), whereas the Starkey and Oticon HADs work only with iOS-based iPhone devices (MFi HADs). For the HAD + SE test condition, all of the listeners wore noise-cancellation earmuffs during the trials to reduce the effect of their natural hearing, ensuring that the participants listened to the output signal transmitted to the HAD from the developed SE smartphone application. The advanced signal processing features on the HADs are switched off to assess the performance of the proposed SE algorithm alone, which runs on the smartphone.
As seen in Fig. 6, the test is conducted in a double-wall insulated audiometric test booth. A loudspeaker presents the background noise at 180 deg, and the KEMAR (GRAS Sound and Vibration, Holte, Denmark) with a mouth simulator presents the HINT sentences at 0 deg. The smartphone is placed on a table in front of the KEMAR as shown in Fig. 6, and the SNR is measured at the listener's head. The listener is provided with a dedicated graphical user interface (GUI) on a laptop to type in responses and interact with the examiner. This easy-to-use GUI was developed by our team as described in Tokgoz et al. (2018).
FIG. 6.
(Color online) The clinical test setup for the HAD + SE test condition includes the sound booth and listener.
C. Stimuli
The HINT (Nilsson et al., 1994) is a standardized word recognition test that measures speech recognition in the presence of background noise. The HINT is composed of 250 sentences, which are further categorized into 25 lists. The background noise type considered is the restaurant noise (multi-talker babble) at different SNRs.
D. Procedure
The participants were asked to type the speech sentences played through a speaker in the presence of background noise. Two conditions were considered for testing: one condition consists of the HAD-alone and the other condition comprises the proposed SE application running on the smartphone (HAD + SE). In the HAD + SE condition, the HAD is connected to the smartphone running the proposed SE application, and the microphone of the smartphone is used. The HAD-alone condition used the microphone of the HAD and the noise reduction features on the HAD were turned off. A constant signal level of 65 dBA was presented to the listeners, and the SNR ranged from −10 to +10 dB. One HINT list made up of ten sentences was presented at each technology and SNR condition.
E. Test results with NH listeners
Figure 7 shows the overall average test results with eight NH listeners using the Oticon Opn 1 (Oticon, Copenhagen, Denmark) HAD-alone and the proposed smartphone-based SE application (HAD + SE). Two SNRs, −5 and −10 dB, were considered. The mean test results show a marginal benefit of about 7% in performance when using the HAD + SE (smartphone plus proposed SE app) versus the Oticon Opn 1 HAD-alone. At −5 dB SNR, the average test scores were 85.66% for the HAD-alone and 93.20% for the HAD + SE; at −10 dB SNR, they were 76.45% and 83.65%, respectively.
FIG. 7.
(Color online) The average test results with NH listeners for all SNRs.
Figure 7 also shows the overall average test results when the Phonak Audeo Marvel (Phonak, Zurich, Switzerland) HAD was used for performance comparison. In this experiment, 12 NH listeners participated. The overall average test results indicate a substantial performance improvement of about 30% when using the proposed SE application running on the smartphone (HAD + SE) versus using the Phonak HAD-alone; the overall average test scores at −10 dB SNR were 49.81% for the HAD-alone and 83.37% for the HAD + SE. Figure 7 shows the overall average test results and standard error bars with NH listeners for all of the considered SNRs. The detailed test results and statistical analyses of the proposed algorithm are published in Tittle et al. (2020).
F. Test results with HI listeners
Figure 8 shows the overall average test results with five HI listeners using the Starkey Halo II (Starkey hearing technologies, Eden Prairie, MN) HAD-alone and the proposed smartphone-based SE application (HAD + SE). The SNR values are −5 dB and −10 dB. When the use of the Starkey Halo II HAD was compared with the proposed SE application running on the smartphone (HAD + SE), a performance improvement of 9% was obtained using the HAD + SE versus the HAD-alone at −10 dB. However, at −5 dB SNR, the HI listeners seem to perform better under the HAD-alone condition than with the HAD + SE condition.
FIG. 8.
(Color online) The average test results with HI listeners for all SNRs.
When the Phonak Audeo Marvel (Phonak, Zurich, Switzerland) HAD was used for performance comparison with SNRs from 0 dB to −15 dB, the average test results indicated a performance increase of 21% when using the proposed smartphone-based SE application (HAD + SE). That is a substantial performance improvement, with an average test score of 78.20% for the HAD + SE versus 39.35% for the HAD-alone. Figure 8 shows the average test results and standard error bars with HI listeners for all of the considered SNRs. The performance improvement of the HAD + SE listening condition over the HAD-alone condition with HI listeners is discussed with a more detailed analysis and figures in Tittle et al. (2020). In Tittle et al. (2020), the relative benefits across all three HAD manufacturers were also compared, and a benefit score was determined, showing the average benefit of using the proposed smartphone-based SE application with each manufacturer's HAD; the improvements observed for NH and HI listeners were considered clinically significant.
VI. CONCLUSION
The authors proposed a novel two-stage single-channel SE algorithm by smoothing the SGJMAP gain estimation along the frequency axis by a Mel filter-bank. The proposed algorithm is implemented as a real-time application on a smartphone. The proposed smartphone-based SE application is computationally efficient with minimal input-output delay and can function as an assistive tool for hearing devices. The objective test results show the effectiveness of the proposed algorithm in terms of the speech quality and intelligibility in comparison with the benchmark single-channel SE methods. The word recognition test results using human participants indicate that the proposed smartphone-based SE application has a substantial speech recognition benefit for people with normal hearing and hearing disorders when compared to HADs alone without any processing in background noise environments.
ACKNOWLEDGMENTS
This work was supported by The National Institute of the Deafness and Other Communication Disorders (NIDCD) of the National Institutes of Health (NIH) under Award No. 1R01DC015430-05. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Footnotes
Pew Research Center, Mobile Fact Sheet, available at https://www.pewresearch.org/internet/fact-sheet/mobile/ (Last viewed January 18, 2021).
Jacoti App, available at https://www.jacoti.com/ (Last viewed January 18, 2021).
Ear Machine App, available at http://www.earmachine.com/ (Last viewed January 22, 2021).
Xcode, available at https://developer.apple.com/xcode/ (Last viewed January 22, 2021).
Core Audio Overview, available at https://developer.apple.com/library/content/documentation/MusicAudio/Conceptual/CoreAudioOverview/WhatisCoreAudio/WhatisCoreAudio.html (Last viewed January 22, 2021).
SSPRL SHARP 2 Video Demo, available at https://ssprl.utdallas.edu/hearing-aid-project/video-demonstration/ (Last viewed March 25, 2021).
Superpowered, available at https://superpowered.com/latency (Last viewed January 29, 2021).
References
1. Agarwal, A., and Cheng, Y. M. (1999). "Two-stage Mel-warped Wiener filter for robust speech recognition," in Proc. ASRU (Citeseer), Vol. 99, pp. 67–70.
2. Bhat, G. S., Shankar, N., Reddy, C. K. A., and Panahi, I. M. S. (2019). "A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone," IEEE Access 7, 78421–78433. 10.1109/ACCESS.2019.2922370
3. Boll, S. (1979). "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process. 27(2), 113–120. 10.1109/TASSP.1979.1163209
4. Cheng, Y., and O'Shaughnessy, D. (1991). "Speech enhancement based conceptually on auditory evidence," in IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE Computer Society), pp. 961–964.
5. Ephraim, Y., and Malah, D. (1984). "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process. 32(6), 1109–1121. 10.1109/TASSP.1984.1164453
6. Ephraim, Y., and Malah, D. (1985). "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process. 33(2), 443–445. 10.1109/TASSP.1985.1164550
7. Fedorov, I., Stamenovic, M., Jensen, C., Yang, L.-C., Mandell, A., Gan, Y., Mattina, M., and Whatmough, P. N. (2020). "TinyLSTMs: Efficient neural speech enhancement for hearing aids," arXiv:2005.11138.
8. Hazrati, O., Lee, J., and Loizou, P. C. (2013). "Blind binary masking for reverberation suppression in cochlear implants," J. Acoust. Soc. Am. 133(3), 1607–1614. 10.1121/1.4789891
9. Hu, Y., and Loizou, P. C. (2006). "Subjective comparison of speech enhancement algorithms," in 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, Vol. 1.
10. Karadagur Ananda Reddy, C., Shankar, N., Shreedhar Bhat, G., Charan, R., and Panahi, I. (2017). "An individualized super-Gaussian single microphone speech enhancement for hearing aid users with smartphone as an assistive device," IEEE Signal Process. Lett. 24(11), 1601–1605. 10.1109/LSP.2017.2750979
11. Kochkin, S. (2010). "MarkeTrak VIII: Consumer satisfaction with hearing aids is slowly increasing," Hear. J. 63(1), 19–20. 10.1097/01.HJ.0000366912.40173
12. Küçük, A., and Panahi, I. M. S. (2020). "Convolutional recurrent neural network based direction of arrival estimation method using two microphones for hearing studies," in 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6.
13. Lehmann, E. A., Johansson, A. M., and Nordholm, S. (2007). "Reverberation-time prediction method for room impulse responses simulated with the image-source model," in 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE, New York), pp. 159–162.
14. Liu, D., Wang, X., Zhang, J., and Huang, X. (2010). "Feature extraction using Mel frequency cepstral coefficients for hyperspectral image classification," Appl. Opt. 49(14), 2670–2675. 10.1364/AO.49.002670
15. Loizou, P. C. (2013). Speech Enhancement: Theory and Practice (CRC Press, Boca Raton, FL).
16. Lotter, T., and Vary, P. (2005). "Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model," EURASIP J. Adv. Signal Process. 2005(7), 354850. 10.1155/ASP.2005.1110
17. Martin, R. (2002). "Speech enhancement using MMSE short time spectral estimation with gamma distributed speech priors," in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. I-253–I-256.
18. McAulay, R., and Malpass, M. (1980). "Speech enhancement using a soft-decision noise suppression filter," IEEE Trans. Acoust., Speech, Signal Process. 28(2), 137–145. 10.1109/TASSP.1980.1163394
19. Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). "Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise," J. Acoust. Soc. Am. 95(2), 1085–1099. 10.1121/1.408469
20. Nossier, S. A., Wall, J., Moniri, M., Glackin, C., and Cannings, N. (2020). "An experimental analysis of deep learning architectures for supervised speech enhancement," Electronics 10(1), 17. 10.3390/electronics10010017
21. Patel, K., Shankar, N., and Panahi, I. M. (2019). "Frequency-based multiband adaptive compression for hearing aid application," J. Acoust. Soc. Am. 146(4), 2959. 10.1121/1.5137279
22. Processing, S. (2003). "Transmission and quality aspects (STQ); distributed speech recognition; extended advanced front-end feature extraction algorithm; compression algorithms; back-end speech reconstruction algorithm," ETSI ES 202(212), version 1, available at https://www.etsi.org/deliver/etsi_es/202200_202299/202212/01.01.01_50/es_202212v010101m.pdf (Last viewed January 20, 2021).
23. Rao, K. S., and Manjunath, K. (2017). Speech Recognition Using Articulatory and Excitation Source Features (Springer).
24. Rix, A. W., Beerends, J. G., Hollier, M. P., and Hekstra, A. P. (2001). "Perceptual evaluation of speech quality (PESQ)—A new method for speech quality assessment of telephone networks and codecs," in Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 2, pp. 749–752.
25. Shah, J. K., Iyer, A. N., Smolenski, B. Y., and Yantorno, R. E. (2004). "Robust voiced/unvoiced classification using novel features and Gaussian mixture model," in IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 17–21.
26. Shankar, N., Bhat, G. S., and Panahi, I. M. (2020a). "Real-time single-channel deep neural network-based speech enhancement on edge devices," in Proceedings of Interspeech 2020, pp. 3281–3285, available at https://www.isca-speech.org/archive/Interspeech_2020/pdfs/1901.pdf (Last viewed January 28, 2021).
27. Shankar, N., Bhat, G. S., and Panahi, I. M. S. (2020b). "Efficient two-microphone speech enhancement using basic recurrent neural network cell for hearing and hearing aids," J. Acoust. Soc. Am. 148(1), 389–400. 10.1121/10.0001600
28. Shankar, N., Bhat, G. S., Reddy, C. K., and Panahi, I. (2020c). "Noise dependent super Gaussian-coherence based dual microphone speech enhancement for hearing aid application using smartphone," arXiv:2001.09571.
29. Sohn, J., Kim, N. S., and Sung, W. (1999). "A statistical model-based voice activity detection," IEEE Signal Process. Lett. 6(1), 1–3. 10.1109/97.736233
30. Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. (2011). "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. Audio, Speech, Lang. Process. 19(7), 2125–2136. 10.1109/TASL.2011.2114881
31. Thibodeau, L. M. (2020). "Benefits in speech recognition in noise with remote wireless microphones in group settings," J. Am. Acad. Audiol. 31, 404–411. 10.3766/jaaa.19060
32. Tittle, S., Thibodeau, L. M., Panahi, I., Tokgoz, S., Shankar, N., Bhat, G. S., and Patel, K. (2020). "Behavioral validation of the smartphone for remote microphone technology," in Seminars in Hearing (Thieme Medical Publishers, Inc.), Vol. 41, pp. 291–301.
33. Tokgoz, S., Hao, Y., and Panahi, I. M. (2018). "A hearing test simulator GUI for clinical testing," J. Acoust. Soc. Am. 143(3), 1815. 10.1121/1.5035952
34. Tsoukalas, D. E., Mourjopoulos, J. N., and Kokkinakis, G. (1997). "Speech enhancement based on audible noise suppression," IEEE Trans. Speech Audio Process. 5(6), 497–514. 10.1109/89.641296
35. Vary, P., and Eurasip, M. (1985). "Noise suppression by spectral magnitude estimation 'mechanism and theoretical limits,'" Signal Process. 8(4), 387–400. 10.1016/0165-1684(85)90002-7
36. Wolfe, P. J., and Godsill, S. J. (2003). "Efficient alternatives to the Ephraim and Malah suppression rule for audio signal enhancement," EURASIP J. Adv. Signal Process. 2003(10), 910167. 10.1155/S1110865703304111
37. Yu, D., and Deng, L. (2016). Automatic Speech Recognition (Springer).
38. Zhou, X., Garcia-Romero, D., Duraiswami, R., Espy-Wilson, C., and Shamma, S. (2011). "Linear versus Mel frequency cepstral coefficients for speaker recognition," in 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 559–564.