Scientific Reports. 2025 Jul 2;15:22892. doi: 10.1038/s41598-025-05057-2

End-to-end feature fusion for jointly optimized speech enhancement and automatic speech recognition

Mohamed Medani 1, Nasir Saleem 2, Fethi Fkih 3, Manal Abdullah Alohali 4, Hela Elmannai 5, Sami Bourouis 6
PMCID: PMC12217813  PMID: 40594394

Abstract

Speech enhancement (SE) and automatic speech recognition (ASR) in real-time processing involve improving the quality and intelligibility of speech signals on the fly, ensuring accurate transcription as the speech unfolds. SE removes unwanted background noise from the target speech in environments with high noise levels, which is crucial for real-time ASR. This study first proposes a speech enhancement network based on an attentional-codec model. Its primary objective is to suppress noise in the target speech with minimal distortion. However, excessive noise suppression in the enhanced speech can diminish the effectiveness of downstream ASR systems by discarding crucial latent information. While joint SE and ASR techniques have shown promise for achieving robust end-to-end ASR, they traditionally rely on using the enhanced features alone as inputs to the ASR system. To address this limitation, our study uses a dynamic fusion approach that integrates both the enhanced features and the raw noisy features, aiming to remove noise from the enhanced target speech while simultaneously learning fine details from the noisy signals. This fusion approach mitigates speech distortions and improves the overall performance of the ASR system. The proposed model consists of an attentional codec equipped with a causal attention mechanism for SE, a GRU-based fusion network, and an ASR system. The SE network uses a modified Gated Recurrent Unit (GRU) in which the traditional hyperbolic tangent (tanh) is replaced by an attention-based rectified linear unit (AReLU). The SE experiments consistently show better speech quality, intelligibility, and noise suppression in matched and unmatched conditions than the baselines. With the LibriSpeech database, the proposed SE achieves STOI and PESQ improvements of 19.81% and 28.97% in matched conditions and of 17.27% and 27.51% in unmatched conditions. The joint training framework for robust end-to-end ASR is evaluated using the character error rate (CER). The ASR results show that joint training reduces the error rate from 32.99% (average over noisy signals) to 13.52% (with the proposed SE and joint training for ASR).

Keywords: Speech enhancement, Speech recognition, Deep learning, End-to-end processing, Attentional GRU, Feature fusion, Joint optimization

Subject terms: Electrical and electronic engineering, Biomedical engineering, Information technology

Introduction

Joint speech enhancement (SE) and speech recognition are vital for improving the accuracy and robustness of automatic speech recognition (ASR) systems. By removing background noise and enhancing speech quality, SE techniques enable ASR systems to better understand and transcribe spoken words, especially in noisy environments such as crowded rooms or outdoor settings. This advancement is essential for the practical deployment of ASR in everyday applications, such as mobile devices and virtual assistants, ensuring reliable and effective communication with these technologies. Background noises frequently contaminate speech signals, which can significantly impact speech-related applications, particularly ASR1–3. Background noises and competing speakers are the primary sources of target signal distortion. To mitigate the impact of noise, a speech enhancement system can restore the quality and improve the intelligibility of degraded signals. An SE model should perform well in various noisy backgrounds; however, developing a model that can handle different noisy backgrounds with minimal complexity and latency is a challenging task.

Traditional SE techniques, such as spectral subtraction4, Wiener filtering5, and statistical models6,7, perform well in stationary noisy backgrounds. However, they perform poorly in nonstationary noisy backgrounds. Deep learning (DL) has become a mainstream approach for speech enhancement8. Deep learning techniques learn to transform noisy speech into clean speech by training on a dataset of paired clean and noisy samples. These techniques may use mapping-based training objectives9–11 or masking-based training objectives to estimate the spectrum or time-frequency masks12–16. The commonly used deep learning techniques for speech enhancement include fully connected networks (FCN)17, RNNs18, and CNNs19. To enhance noisy speech, Long Short-Term Memory (LSTM) is used to develop a noise- and speaker-independent model20. The model is trained using a four-layered LSTM network on speech samples from different speakers combined with various types of background noise. Another approach21 proposes a CNN architecture applying gated and dilated convolutions. Another trend uses an attention mechanism to enhance noisy speech22. In23,24, an LSTM model is proposed for speech enhancement in which an attention gate replaces the forget gate. The study in22 proposes a self-attention dense CNN for better feature extraction and uses feature reuse. The study in25 proposes a dual-path RNN with self-attention to improve the processing of long sequences. Several studies use attention mechanisms to enhance speech signals with promising results26–28. These existing RNN models have good noise-suppression capability but suffer from complex structures and long training times. The Gated Recurrent Unit (GRU) is a recurrent neural network that can be used in speech enhancement to learn long-term temporal dependencies29–32.

Nevertheless, speech enhancement focuses on refining models to estimate the target speech, which is distinct from the speech recognition task. Consequently, speech enhancement approaches often do not align with the ultimate objective, resulting in suboptimal outcomes33. Moreover, the output speech from these enhancement techniques tends to be excessively over-smoothed, leading to post-enhancement speech distortion. This distortion can significantly impact the effectiveness of ASR systems34. Consequently, the success of this approach relies heavily on the success of the front-end enhancement35. To enhance the noise robustness of ASR, three primary approaches are commonly used. The first approach integrates a speech enhancement component at the front end of the ASR system. The second employs multi-condition training to enhance the noise robustness of ASR; this involves training the speech recognition model on various types of data, including both clean and noisy speech. However, this approach leads to increased complexity and computational costs. Moreover, it often yields underwhelming results when faced with unmatched conditions36, and its performance can be impacted by speech distortion37. The third prevalent approach involves joint training techniques38,39, which use a unified framework to optimize both speech enhancement and recognition simultaneously. The rationale behind this approach is that speech enhancement and recognition are intertwined tasks that can mutually enhance each other's performance. For instance, to improve the noise robustness of end-to-end ASR, a joint adversarial enhancement training method was proposed in40. This method leverages the joint training framework to refine both the mask-based enhancement network and the attention-based encoder-decoder speech recognition network. Furthermore, even on the noisy AISHELL-1 dataset40, the CER remains above 50%, indicating a need for improvement. On the other hand, concerning end-to-end (E2E) speech recognition, speech transformer models have demonstrated remarkable performance, achieving state-of-the-art results. The self-attention network41,42 stands out as a crucial element of the speech transformer, offering greater capability in capturing long-term dependencies compared to sequence-to-sequence models based on recurrent neural networks (RNNs). More recent SE and ASR literature can be found in studies such as43–49.

To understand the problem and the need to jointly optimize SE and ASR, we analyze the spectrograms illustrated in Fig. 1, which depicts an example spectrogram of a test speech sample. The spectrogram of the enhanced speech, processed by the enhancement network, exhibits notable leaks, as shown in Fig. 1 (right), indicated by the highlighted boxes, resulting in speech distortion. These boxes indicate significant leaks, primarily due to the dominance of noise signals in these time-frequency bins, overshadowing the target speech. Consequently, the enhancement network interprets these time-frequency bins as noise signals and eliminates the relevant information, such as formants. Although the enhancement network manages to reduce noise signals to some extent, these leaks remain unrecognized by the ASR system, leading to substantial loss of essential speech details. These factors explain how speech distortion adversely affects the performance of automatic speech recognition.

Fig. 1. An example spectrogram of a test speech sample: clean speech (left), noisy speech (middle), and enhanced speech without joint optimization (right). The boxes highlight the spectral leaks (over-smoothed distortions).

This paper presents a jointly optimized speech enhancement and automatic speech recognition model that aims to automatically acquire more robust representations that are well-suited for the recognition task. The contributions of this study are twofold.

  • This study proposes a speech enhancement network based on an attentional-codec model to effectively reduce noise in the target speech while minimizing distortions, such as over-smoothed spectrograms. The proposed SE network improves noisy speech using an attention process that mirrors human focus on specific speech components amidst surrounding noise. By employing this attention process within the codec (coder-decoder), the model achieves enhanced sequential modelling, allowing learned weights from past input features to predict current features accurately. This attention mechanism actively manages the correlation between preceding and current frames, assigning attention weights to earlier speech frames. Experimental results demonstrate that the proposed SE model surpasses baseline methods in terms of speech quality, intelligibility, noise reduction, and speech distortion.

  • Traditionally, ASR systems have often depended on utilizing enhanced features as inputs. In our study, however, we use a dynamic fusion approach to overcome this limitation. This approach integrates both the enhanced features and the raw noisy features to filter out noise signals from the enhanced target speech while simultaneously capturing fine details from the noisy signals. By employing this fusion approach, we aim to reduce speech distortions and enhance the overall performance of the ASR system.

The paper is structured as follows: “Proposed speech enhancement” presents the proposed speech enhancement approach. “Speech enhancement experiments” details the experiments, results, and discussions about speech enhancement. “Joint optimization and ASR” discusses jointly optimized speech enhancement and automatic speech recognition with corresponding results. Finally, “Summary and conclusion” provides the conclusion of this study.

Proposed speech enhancement

Figure 2 shows the diagram of the proposed SE. Clean speech and background noise are denoted by s(t) and d(t), respectively. The resulting noisy speech y(t) is obtained by mixing s(t) and d(t), given as:

y(t) = s(t) + d(t)   (1)

where 1 ≤ t ≤ M and M is the number of speech samples. The speech enhancement network recovers an estimate ŝ(t) of the underlying clean speech s(t) from the noisy speech y(t). The SE network is fed with inputs Y_t and X_t, where Y and X represent the magnitudes of the noisy mixture and the underlying clean speech at frame t. The encoder extracts features h, given as:

(h, Q, K) = f_enc(Y_t)   (2)

where f_enc(·) denotes the GRU encoder and the parameters Q and K represent the query and key. Our study uses a Gated Recurrent Unit (GRU) encoder-decoder, which models sequential information with lower computational cost and improved performance compared to LSTM, as reported in50. To generate fixed-length context vectors, the attention process is applied to the key and query inputs:

c_t = Attention(Q, K)   (3)

Fig. 2. The proposed speech enhancement pipeline.

With the context vectors c_t and the encoder features h, the decoder output recovers the enhanced speech X̂_t:

X̂_t = f_dec(c_t, h)   (4)

Figure 3 shows the attentional-GRU codec. The encoder extracts features from the speech spectrum; to accomplish this, the extracted features are first provided to the input layer:

x_t = f_in(Y_t)   (5)

Fig. 3. The architecture of the proposed stacked attentional encoder-decoder unidirectional GRU.

The input-layer output x_t is the input to the GRU cell as:

h_t = f_GRU(x_t, h_{t−1})   (6)

where f_GRU(·) represents the GRU function and h_t is the GRU output; z_t and r_t below denote the update and reset gates. The output h_t can be computed as:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ AReLU( W_h · [r_t ⊙ h_{t−1}, x_t] )   (7)

Unidirectional attentional-GRU encoder

A gated recurrent unit (GRU) is an RNN type that includes a gating mechanism for controlling the flow of information. The unidirectional GRU processes the sequence in one direction and is commonly used for sequence-to-sequence learning. Since the GRU has fewer parameters to optimize, it can mitigate the vanishing-gradient problem and trains faster than the LSTM. This study employs a modified GRU in which the classical hyperbolic tangent (tanh) is replaced with an attention-based ReLU (AReLU)51, a learnable activation function that leverages an element-wise attention approach. The hyperbolic tangent incurs high complexity because of dense activation computations. AReLU employs learned data-adaptive parameters to amplify positive elements and diminish negative elements. The training process remains robust against gradient vanishing since the attention mechanism in AReLU learns element-wise residues of the active region. The attention activation learning through AReLU leads to well-focused activations in significant areas of the feature map. Having additional learnable parameters (α and β) per layer enables fast network training at low learning rates. According to study51, AReLU is denoted as F(x, α, β), a combination of an element-by-element sign-based attention function L(x, α, β) and the classical ReLU R(x), as follows:

L(x_i, α, β) = C(α)·x_i if x_i < 0;  σ(β)·x_i if x_i ≥ 0   (8)
F(x_i, α, β) = R(x_i) + L(x_i, α, β) = C(α)·x_i if x_i < 0;  (1 + σ(β))·x_i if x_i ≥ 0   (9)

where x_i is the input to the activation layer, C(·) clamps α to [0.01, 0.99] so that the negative-region scaling does not become zero, and σ(·) denotes the sigmoid activation. The inclusion of AReLU in GRUs can assist in capturing long-term contextual dependencies between features, which is critical in SE. As a result, in addition to preventing gradient vanishing, the use of AReLU in the GRU aids in capturing these long-term dependencies and improves SE performance. The attention-ReLU-based GRU cell structure is depicted in Fig. 4.
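For concreteness, the AReLU of Eqs. (8)-(9) can be written as a small PyTorch module. This is a minimal sketch; the initial values of α and β follow the defaults suggested in51 and are assumptions rather than the authors' exact settings.

```python
import torch
import torch.nn as nn

class AReLU(nn.Module):
    """Attention-based ReLU (Eqs. 8-9): negative inputs are scaled by a clamped
    learnable alpha, positive inputs by (1 + sigmoid(beta))."""
    def __init__(self, alpha: float = 0.90, beta: float = 2.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alpha = torch.clamp(self.alpha, 0.01, 0.99)   # C(alpha): keeps the negative-region scaling non-zero
        gain = 1.0 + torch.sigmoid(self.beta)         # amplification of the active (positive) region
        return torch.where(x >= 0, gain * x, alpha * x)
```

In the modified GRU cell, a module of this kind would stand in for the tanh used to form the candidate hidden state.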

Fig. 4. Attention-ReLU in GRU cell structure.

Attentional process

The attention process plays a crucial role in creating fixed-length context vectors by processing information about the key and query inputs. An attention mechanism can process preceding and future speech frames. However, the speech enhancement in this study avoids processing latency and therefore uses only previous speech frames. To achieve this, the model uses both causal dynamic and causal local attention. In causal dynamic attention, the model uses the entire previous sequence of encoder states {h_1, ..., h_{t−1}} together with the current input for computing attention weights; that is, all preceding frames are utilized to enhance the present frame. However, for long speech sequences, the attention weights of many preceding frames may become almost zero. To address this, the model uses the causal local attention process, where only the z most recent states {h_{t−z}, ..., h_{t−1}} are utilized for computing attention weights. The model learns the attention weights a_{t,τ} as:

a_{t,τ} = exp(e_{t,τ}) / Σ_{τ'} exp(e_{t,τ'}),  τ ∈ {1, ..., t−1} (dynamic) or τ ∈ {t−z, ..., t−1} (local)   (10)

where the range τ ∈ {1, ..., t−1} corresponds to causal dynamic attention and τ ∈ {t−z, ..., t−1} denotes causal local attention, with z a constant. The correlation scores are computed as:

e_{t,τ} = score(q_t, k_τ) = q_t^T k_τ   (11)

The attention-weighted context vectors are given as:

c_t = Σ_τ a_{t,τ} h_τ   (12)

The proposed SE model determines the attention process for a speech frame with attention-weighted context vectors.
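A minimal sketch of the causal local attention described above: for every frame, attention weights are computed only over the z preceding frames and used to form the context vector of Eq. (12). Tensor shapes, the dot-product score, and all function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def causal_local_attention(queries: torch.Tensor,
                           keys: torch.Tensor,
                           values: torch.Tensor,
                           z: int = 5) -> torch.Tensor:
    """queries, keys, values: (T, D) tensors; returns (T, D) context vectors."""
    T, _ = queries.shape
    contexts = torch.zeros_like(values)
    for t in range(T):
        lo = max(0, t - z)
        if lo == t:                                   # no past frames yet: fall back to the current frame
            contexts[t] = values[t]
            continue
        scores = keys[lo:t] @ queries[t]              # correlation with past keys (Eq. 11)
        weights = F.softmax(scores, dim=0)            # attention weights over the local window (Eq. 10)
        contexts[t] = weights @ values[lo:t]          # attention-weighted context vector (Eq. 12)
    return contexts
```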

Unidirectional attentional-GRU decoder

The GRU decoder reconstructs the speech with the input features, encoder outputs, and attention-weighted context vectors. The enhanced vector v_t is computed using the attention-weighted context vectors and the features:

v_t = f_dec(c_t, h_t)   (13)

The context vectors and feature vectors are concatenated as [c_t ; v_t]. A time-frequency mask is finally estimated from the concatenated vectors; its training target is the ideal ratio mask (IRM), given as:

M(t, f) = ( |S(t, f)|² / ( |S(t, f)|² + |D(t, f)|² ) )^(1/2)   (14)

where M(t, f) is the ideal ratio mask, and the magnitude spectra of the clean speech s(t) and the noise d(t) are denoted as |S(t, f)| and |D(t, f)|, respectively. To reconstruct the enhanced speech, the noisy features are multiplied by the estimated mask and the resulting spectrum is converted back to the time domain by the inverse Short-Time Fourier Transform (ISTFT):

ŝ(t) = ISTFT( M̂ ⊙ |Y| )   (15)

The enhanced features along with raw noisy features are further used for joint SE and ASR optimization.
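A minimal SciPy sketch of the reconstruction step in Eq. (15): the estimated mask scales the noisy magnitude and the waveform is recovered with the inverse STFT. For brevity the noisy phase is reused here, whereas the paper applies an estimated phase54; the function name and parameters are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct(noisy: np.ndarray, mask: np.ndarray,
                fs: int = 16000, n_fft: int = 512, hop: int = 128) -> np.ndarray:
    """Apply an estimated T-F mask (shape: freq_bins x frames) to the noisy spectrum
    and invert it back to a time-domain waveform."""
    _, _, Y = stft(noisy, fs=fs, window='hann', nperseg=n_fft, noverlap=n_fft - hop)
    enhanced_mag = mask * np.abs(Y)                             # Eq. (15): masked magnitude
    enhanced_spec = enhanced_mag * np.exp(1j * np.angle(Y))     # noisy phase reused for simplicity
    _, s_hat = istft(enhanced_spec, fs=fs, window='hann', nperseg=n_fft, noverlap=n_fft - hop)
    return s_hat
```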

Speech enhancement experiments

Dataset and data generation

The experiments utilized the LibriSpeech dataset52, and noise sources were selected from the AURORA database53. LibriSpeech is a database of approximately 1000 h of English speech. The database includes 16,000 audio files, each 10 s in length, derived from public-domain audiobooks from the LibriVox project. The database is divided into several subsets, including test-clean, test-other, train-clean-100, train-clean-360, and train-other-500, intended for development, testing, and training purposes. The speech dataset D includes training and testing subsets, denoted as D_tr and D_te, respectively, whose clean sentences are labelled S_tr and S_te. The noisy sentences are created by mixing background noises with S_tr and S_te:

Y_tr = S_tr + d   (16)
Y_te = S_te + d   (17)

where Y_tr and Y_te denote the training and testing noisy data, and d denotes the background noise.
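The data generation of Eqs. (16)-(17) amounts to mixing each clean sentence with a noise segment scaled to a target SNR. A simple NumPy sketch is shown below; the function name and the power-based scaling are our own choices.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Return clean speech mixed with noise scaled to the requested SNR (in dB)."""
    if len(noise) < len(clean):                       # loop the noise if it is shorter than the speech
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```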

Feature extraction

The noisy-clean pairs y(t), s(t) are transformed to the frequency domain by applying the Short-Time Fourier Transform (STFT), given as:

Y(t, f) = Σ_{n=0}^{N−1} y(n + tH) w(n) e^{−j2πfn/N}   (18)

where Y ∈ ℂ^{F×T}, F is the number of frequency bins, T is the number of frames, w(n) is the analysis window, H is the hop size, and N is the FFT length. This study uses the STFT magnitude |Y| as the input feature.
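The feature extraction can be sketched in a few lines: a 512-point Hanning-window STFT with 75% overlap (Table 1) whose 257-bin magnitude spectrogram is the network input. Parameter names are illustrative.

```python
import numpy as np
from scipy.signal import stft

def magnitude_features(wave: np.ndarray, fs: int = 16000,
                       n_fft: int = 512, hop: int = 128) -> np.ndarray:
    """Return |Y| with shape (257, T) for a 512-point Hanning window and 75% overlap."""
    _, _, Y = stft(wave, fs=fs, window='hann', nperseg=n_fft, noverlap=n_fft - hop)
    return np.abs(Y)
```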

SE network architecture

The architecture of the proposed GRU-based codec consists of the input layer, three GRU layers each containing 256 units, and an output layer containing 257 sigmoidal-activated units. The hyperparameters include 160 epochs, a learning rate of 0.0001, and randomly initialized weights. The training process utilized mini-batches of 32 sequences, employing back-propagation through time with an Adam optimizer. The GRU layer configuration is given as (161/256/256/256/257) units. Table 1 provides the details of the hyperparameters. Noisy sentences are generated at −5 dB, 0 dB, and 5 dB SNRs. The sentences of both genders are repeated for all SNRs and mixed with all noises, resulting in 21,600 sentences (approximately 18 h). These sentences are used to train the proposed SE model. During testing, half of the speech sentences are used in matched and half in mismatched noisy conditions. All noises are tested with distinct sentences. Sentences are sampled at 16 kHz, and a Hanning window (512 points) with 75% overlap is used in the experiments. Usually, the noisy phase is used during speech reconstruction; however, this study uses an estimated phase54 to reconstruct the final speech. A loss function quantifies the difference between a predefined mask and an estimated mask in masking-based SE, and the goal of training is to minimize this error. Typically, the MSE is used as the loss function in TF-masking-based SE, defined as

MSE = (1/K) Σ_{k=1}^{K} ( f(S_k) − Y_k )²   (19)

where f(S), S, and Y represent the model output, the input, and the ground-truth label, respectively, and K is the number of training examples. The MSE in Eq. (19) can be expressed in terms of the masks as:

MSE = (1/K) Σ_{k=1}^{K} ( M̂(s_k) − M(s_k) )²   (20)

Table 1.

Hyperparameters and initial conditions.

Component Parameter Value/range Description
STFT preprocessing Window Size 512 points Hanning window
Hop Length 128 points (75% overlap) Frame overlap
FFT Bins 257 Frequency bins
SE Network GRU Layers 3 Stacked layers
Units per Layer 256 Hidden units
Attention Window (Z) 5 frames Local attention span
AReLU Init (α, β) Learnable (data-adaptive) Learned parameters
Training Batch Size 32 Mini-batch size
Learning Rate 0.0001 Adam optimizer
Epochs 160 Training iterations
Loss Weight (λ) Adaptive (init = 1.0) SE-ASR balance
WMSE Threshold (B) 10 Dynamic weighting
Fusion Network GRU Hidden Units 256 Fusion layer size
Fusion Steps (p) 3 Iteration count
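The layer configuration listed in Table 1 can be assembled into a compact model skeleton. The PyTorch sketch below renders the stated (161/256/256/256/257) stack, three GRU layers of 256 units and a 257-unit sigmoid output that predicts the mask; the causal attention and AReLU gating of the full model are omitted, and all names are illustrative.

```python
import torch
import torch.nn as nn

class GRUMaskEstimator(nn.Module):
    """Skeleton of the stated (161/256/256/256/257) configuration: stacked GRU layers
    followed by a sigmoid output layer estimating the time-frequency mask."""
    def __init__(self, n_in: int = 161, hidden: int = 256, n_out: int = 257, layers: int = 3):
        super().__init__()
        self.gru = nn.GRU(n_in, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, n_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, n_in) noisy magnitude features
        h, _ = self.gru(x)
        return torch.sigmoid(self.out(h))             # mask values in [0, 1] per T-F bin
```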

The estimated mask and the predefined mask are represented as M̂(s) and M(s), respectively. A dynamically weighted loss function is employed to enhance network learning. This loss function multiplies the learning errors by weight values; with such a process, the loss function emphasizes larger errors, enhancing overall performance. The weighted Mean Squared Error (WMSE) is calculated by multiplying the MSE function by a weight variable w.

WMSE = w · MSE   (21)

To give more importance to situations with significant errors, the weighting variable w in Eq. (21) is modified based on the absolute error:

e = | M̂(s) − M(s) |   (22)

The following conditions are applied to select the weights, given as:

w = e if e ≥ B;  w = 0.5·e if e < B   (23)

When the absolute divergence falls below the constant B (experimentally set to 10, since the model performs best at B = 10), the weighting factor is reduced by half, so the model does not focus as much on small errors. Conversely, when the error is greater than or equal to B, the weight w is set equal to the error magnitude, so the model focuses strongly on these larger errors.
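A minimal sketch of the dynamically weighted loss in Eqs. (21)-(23), assuming the weighting is applied element-wise to the mask error; the exact form of the weight update is our reading of the text, and the threshold B = 10 follows the experimental setting.

```python
import torch

def weighted_mse(mask_est: torch.Tensor, mask_ref: torch.Tensor, B: float = 10.0) -> torch.Tensor:
    """Dynamically weighted MSE: errors below B receive half weight, errors at or
    above B are weighted by their own magnitude (Eqs. 21-23)."""
    err = torch.abs(mask_est - mask_ref)               # Eq. (22)
    w = torch.where(err >= B, err, 0.5 * err)          # Eq. (23), as we read it
    return torch.mean(w * (mask_est - mask_ref) ** 2)  # Eq. (21)
```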

SE evaluation metrics and related models

To examine the proposed SE, this study uses well-adopted metrics including Short-Time Objective Intelligibility (STOI)55, Perceptual Evaluation of Speech Quality (PESQ)56, and Source-to-Distortion Ratio (SDR). We chose LSTM20 and DNN17 as baseline models for estimating the Ideal Ratio Mask (IRM). The models are denoted as LSTM+IRM (an LSTM estimating the IRM), DNN+IRM (a fully connected DNN estimating the IRM), and GRN+IRM (the proposed SE estimating the IRM as its training objective).
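For reproducibility, STOI and PESQ can be computed with the commonly used third-party pystoi and pesq packages; the helper below is illustrative and is not the authors' exact scoring pipeline (SDR is omitted).

```python
from pystoi import stoi        # pip install pystoi
from pesq import pesq          # pip install pesq

def evaluate(clean, enhanced, fs: int = 16000) -> dict:
    """Score one utterance with STOI and wide-band PESQ; inputs are 1-D waveforms."""
    return {
        "STOI": stoi(clean, enhanced, fs, extended=False),
        "PESQ": pesq(fs, clean, enhanced, "wb"),
    }
```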

SE results and discussions

The study first compares the performance of the proposed SE against the baselines. Tables 2 and 3 display the average test results of three metrics (STOI, PESQ, and SDR) across four testing noises and three SNRs for matched and unmatched noisy conditions, respectively. It is important to highlight that the proposed GRN+IRM consistently outperforms the baselines across all noisy testing scenarios.

Table 2.

SE performance in matched testing conditions.

Noise Metric STOI in % PESQ SDR in dB
Types SNR -5dB 0dB 5dB Avg -5dB 0dB 5dB Avg -5dB 0dB 5dB Avg
Airport Noise Noisy Speech 63.05 69.76 83.95 72.25 1.64 1.86 2.14 1.88 Inline graphic 0.11 5.07 0.13
DNN+IRM 80.85 84.54 90.74 85.38 1.83 2.21 2.59 2.21 3.98 6.86 8.38 6.41
LSTM+IRM 83.55 87.65 92.36 87.85 2.01 2.34 2.67 2.34 4.09 7.1 9.54 6.91
GRN+IRM 86.25 89.58 94.55 90.13 2.19 2.47 2.78 2.48 4.21 7.33 10.7 7.41
Babble noise Noisy speech 57.75 68.05 79.67 68.52 1.52 1.75 2.07 1.78 Inline graphic 0.13 5.08 0.16
DNN+IRM 74.22 78.35 85.93 79.51 1.91 2.24 2.48 2.22 3.82 6.31 8.74 6.29
LSTM+IRM 76.85 80.64 87.45 81.65 2.04 2.35 2.66 2.35 3.95 6.58 9.05 6.52
GRN+IRM 80.22 82.25 89.27 83.91 2.15 2.46 2.75 2.45 4.08 7.4 9.36 6.94
Car Noise Noisy Speech 58.84 68.92 79.60 69.12 1.37 1.62 1.92 1.63 Inline graphic 0.08 5.05 0.09
DNN+IRM 78.65 81.74 86.77 82.38 1.74 2.18 2.48 2.13 3.81 6.42 8.83 6.35
LSTM+IRM 80.23 84.47 89.18 84.62 1.99 2.39 2.55 2.31 4.2 7.21 9.92 7.11
GRN+IRM 85.48 86.56 92.68 88.24 2.09 2.47 2.71 2.42 4.56 7.59 10.8 7.65
Factory Noise Noisy Speech 58.44 67.44 78.80 68.22 1.31 1.61 1.92 1.61 Inline graphic 0.12 5.07 0.17
DNN+IRM 78.25 80.71 85.25 81.41 1.66 2.15 2.47 2.09 3.66 6.34 8.72 6.24
LSTM+IRM 79.45 82.45 88.17 83.35 1.89 2.33 2.55 2.26 3.85 5.53 9.52 6.3
GRN+IRM 81.62 82.15 90.44 84.73 2.11 2.45 2.77 2.44 4.01 6.69 10.3 6.99

Table 3.

SE performance in unmatched testing conditions.

Noise Metric STOI in % PESQ SDR in dB
Types SNR -5dB 0dB 5dB Avg -5dB 0dB 5dB Avg -5dB 0dB 5dB Avg
Airport Noise Noisy Speech 60.95 71.84 82.24 71.67 1.55 1.85 2.14 1.84 Inline graphic 0.11 5.07 0.14
DNN+IRM 77.36 83.24 87.48 82.69 1.81 2.25 2.58 2.21 3.9 6.78 8.22 6.3
LSTM+IRM 80.14 85.46 89.87 85.15 1.94 2.34 2.67 2.31 4.01 7.06 9.46 6.84
GRN+IRM 82.77 87.91 92.37 87.68 2.08 2.42 2.75 2.42 4.13 7.34 10.7 7.39
Babble Noise Noisy Speech 54.64 65.94 77.48 66.02 1.39 1.74 2.04 1.72 Inline graphic 0.16 5.1 0.19
DNN+IRM 71.15 78.47 80.77 76.79 1.77 2.01 2.56 2.11 3.88 5.85 8.66 6.13
LSTM+IRM 73.55 79.87 82.33 78.58 1.88 2.17 2.64 2.23 3.98 6.01 8.94 6.31
GRN+IRM 75.96 81.26 85.84 81.02 2.01 2.33 2.71 2.35 4.09 6.18 9.23 6.5
Car Noise Noisy Speech 57.27 67.48 78.44 67.73 1.39 1.63 1.93 1.65 Inline graphic 0.1 5.07 0.13
DNN+IRM 75.38 80.29 83.21 79.62 1.72 2.22 2.44 2.12 3.99 6.83 8.31 6.38
LSTM+IRM 78.19 83.37 87.55 83.03 1.95 2.31 2.67 2.31 4.14 7.11 9.51 6.92
GRN+IRM 80.91 86.42 90.81 86.04 2.14 2.45 2.73 2.44 4.3 7.45 10.6 7.45
Factory Noise Noisy Speech 55.24 65.93 77.24 66.13 1.31 1.6 1.92 1.61 Inline graphic 0.12 5.08 0.18
DNN+IRM 70.74 75.39 81.25 75.79 1.71 2.09 2.44 2.08 3.43 5.98 8.34 5.92
LSTM+IRM 73.92 77.41 84.47 78.61 1.84 2.25 2.59 2.22 3.68 6.22 9.27 6.39
GRN+IRM 77.35 79.49 87.64 81.49 1.97 2.41 2.71 2.36 3.94 6.47 10.2 6.87

Table 2 displays the results of speech enhancement under matched conditions, where the proposed GRN+IRM demonstrates better values for all objective measures in all background noises. Specifically, at low SNR (−5 dB), the proposed SE network achieves the highest STOI (86.25%) and PESQ (2.19) values for airport noise, whereas for babble noise the best SDR gain is obtained at −5 dB. Taking the babble noisy case (matched condition) at −5 dB SNR, STOI improves from 63.05% with noisy speech to 86.25% with GRN+IRM, a 23.2% improvement in STOI. Furthermore, STOI improves from 80.85% with DNN+IRM to 86.25% with GRN+IRM, a 5.4% improvement in STOI. Similarly, in the case of factory noise (matched condition) at 0 dB SNR, PESQ increases from 1.61 with unprocessed noisy speech (UnP) to 2.45 with the proposed GRN+IRM, a 0.84 (34.28%) improvement. Additionally, in car noise, PESQ increases from 2.18 with DNN+IRM to 2.47 with the proposed GRN+IRM, a 0.29 (11.74%) improvement over DNN+IRM. In a matched condition, consider street noise at 5 dB as another case, where the SDR value increases from 0.11 dB with UnP to 6.82 dB with GRN+IRM, an improvement of 6.71 dB. On average, at low SNR (−5 dB) in matched conditions, the proposed GRN+IRM increases STOI (by 16.34%), PESQ (by 0.71), and SDR (by 7.17 dB) over noisy unprocessed speech, demonstrating the effectiveness of the proposed SE model.

Table 3 presents the results of speech enhancement conducted under unmatched conditions, where the proposed GRN+IRM with the IRM training objective achieves better average values for all objective measures in all background noises. Specifically, at low SNR (−5 dB), the proposed model achieves the highest STOI and PESQ in street noise, whereas the best SDR is achieved in car noise. In the case of babble noise at 0 dB SNR under unmatched conditions, the STOI improves from 71.84% with noisy speech to 87.91% with GRN+IRM, a 16.07% improvement in STOI. Also, for the factory noise case (unmatched condition) at 0 dB SNR, the PESQ improves from 1.61 with UnP to 2.41 with the proposed GRN+IRM, a 0.80 (33.19%) improvement. Furthermore, in car noise at 5 dB, the PESQ improves from 2.44 with DNN+IRM to 2.73 with the proposed GRN+IRM, a 0.29 (11.74%) improvement over the DNN+IRM baseline. In the unmatched condition of street noise at 5 dB, the SDR value increases from 5.12 dB with UnP to 9.71 dB with GRN+IRM, an improvement of 4.59 dB. On average, at low SNR (−5 dB) under unmatched conditions, the proposed GRN+IRM significantly improves the STOI, PESQ, and SDR over noisy unprocessed speech. Table 4 provides average scores encompassing all background noises for matched conditions (GRN+IRM-Matched), unmatched conditions (GRN+IRM-Unmatched), and the average of both conditions.

Table 4.

The overall results (Matched and Unmatched), averaged over all testing SNRs and noises.

Condition GRN+IRM-Matched GRN+IRM-Unmatched Average
Metric STOI PESQ SDR STOI PESQ SDR STOI PESQ SDR
Results 86.75 2.45 7.20 82.27 2.39 6.97 84.51 2.42 7.09

Table 5 shows the performance of the causal local attention (CLA), for which the value of the attention window Z is varied over 5, 15, and 25. The results (PESQ and STOI) indicate that values of Z greater than 15 yield no competitive results, and the best SE results are obtained with Z = 5. Therefore, Z = 5 is fixed for the proposed SE. It was observed that the causal local attention outperformed the causal dynamic attention. These findings support the assumption that substantial preceding information is not necessary for effective speech enhancement, as noisy conditions can change rapidly over time. These observations apply to the attention networks, as the attentional GRU performed better than the baseline GRU.

Table 5.

Causal local attention with different weight values.

Z STOI (in%) PESQ
SNR -5dB 0dB 5dB -5dB 0dB 5dB
5 86.21 89.56 94.54 2.18 2.47 2.77
15 82.45 84.74 89.83 2.15 2.42 2.69
25 79.56 82.74 88.12 2.08 2.34 2.66

Table 6 shows a comparison of the learning errors and predicted results (STOI and PESQ) for GRN+IRM trained with and without the weighted loss function. The results indicate that the weighted loss function improved the PESQ and STOI of the proposed GRN+IRM. The use of the weighted mean square error (WMSE) reduces the learning error by 10.52% compared with the non-weighted MSE. Due to limitations in computational resources in practical applications, it is crucial to establish an optimal balance between the model's performance improvement and parameter efficiency. Table 7 illustrates the parameter efficiency of the proposed speech enhancement model. The comparison reveals that integrating the attention process into the GRU does not significantly affect the parameter count (2.138 M) or parameter size (2.71 MB) compared to LSTM (4.672 M, 5.43 MB) and residual LSTM (RLSTM, 10 M)57. To employ the proposed GRN+IRM on embedded systems, it is essential to minimize hardware memory usage; consequently, we present a summary of multiply-accumulate operations (MACs). The proposed GRN+IRM model achieves 0.245 G/s MACs with the attention process, ensuring efficiency without compromising SE performance. The integration of the GRU has significantly reduced the parameter count, parameter size, and MACs. We further analyzed the convergence of the proposed model after incorporating the weighted MSE, as shown in Fig. 5. It can be observed that the weighted MSE converges faster than the traditional MSE.

Table 6.

Dynamically-Weighted vs. non-dynamically-weighted loss.

Model Metric Error
GRN+MSE PESQ: 2.32, STOI: 83.87% Inline graphic
GRN+WMSE PESQ: 2.45, STOI: 86.21% Inline graphic
Improvement PESQi: 4.91%, STOIi: 2.34% 10.52%

Table 7.

Computational efficiency.

Model Para# MACs Param size
GRN+IRM 2.138 M 0.245 G/s 2.71 MB
LSTM20 4.672 M 0.412 G/s 5.43 MB
RLSTM57 10.0 M 1.347 G/s 13.55 MB

Fig. 5. Learning error with and without the dynamically weighted MSE (DW-MSE) loss.

Comparison with related studies

This study compares the performance of the GRN+IRM with several selected studies from the literature to showcase its superiority. The comparison is performed for three SNR levels (−5 dB, 0 dB, and 5 dB), and the results are presented in Table 8. The study finds that the GRN+IRM model, with the IRM training objective, performs highly competitively compared to recent models, except PL-CRNN61, which performs slightly better at the less adverse SNR (5 dB). CRN-BLSTM60 gained 0.48 (21.62%) PESQ over the noisy mixture, which indicates 7.36% lower performance than the proposed GRN+IRM. Similarly, CNN-GRU62 gained 0.59 (25.32%) PESQ over the noisy mixture, which is 3.64% lower than the proposed GRN+IRM. Furthermore, the STOI gain of MCBNet59 over the noisy mixture is 8.31%, indicating 8.03% less STOI gain than the GRN+IRM. Additionally, the STOI improves from 84.25% with DCCRN64 to 86.75% with GRN+IRM. The proposed GRN+IRM outperforms related models by significant margins, such as a PESQ improvement of 0.31 (14.28%) and an STOI improvement of 11.55% over the state-of-the-art GRN67 and AECNN68 models at the −5 dB SNR level.

Table 8.

Comparison with related SE models, where the symbol "Δ" indicates improvement over noisy speech.

Metric PESQ STOI
SNR (in dB) -5dB 0dB 5dB Average ΔPESQ -5dB 0dB 5dB Average ΔSTOI
Noisy unprocessed 1.46 1.74 2.01 1.74 60.25 69.76 81.21 70.41
DeepResGRU30 2.09 2.29 2.49 2.29 0.55 74.13 81.81 85.51 80.48 10.07
CFN-GCFU58 1.98 2.24 2.62 2.28 0.54 71.61 78.19 86.21 78.67 8.26
MCBNet59 2.01 2.32 2.52 2.28 0.54 72.81 79.15 84.15 78.71 8.30
CRN-BLSTM60 1.93 2.23 2.51 2.22 0.48 70.31 77.08 81.96 76.45 6.04
PL-CRNN61 2.06 2.51 2.85 2.47 0.73 73.16 84.42 90.15 82.57 12.16
CNN-GRU62 2.01 2.34 2.65 2.33 0.59 74.61 83.11 90.11 82.61 12.20
DTLN63 1.91 2.34 2.67 2.31 0.57 72.72 85.19 90.68 82.86 12.45
DCCRN64 1.85 2.34 2.78 2.32 0.58 74.51 85.87 92.38 84.25 13.84
DNN-TGSA65 2.01 2.31 2.58 2.30 0.56 74.41 81.21 84.12 79.91 9.50
DeepXi66 1.99 2.21 2.41 2.20 0.46 72.01 81.21 91.99 81.73 11.32
GRN67 1.86 2.16 2.42 2.15 0.41 69.76 76.89 81.42 76.02 5.61
AECNN68 1.92 2.19 2.45 2.19 0.45 72.01 77.78 82.51 77.43 7.02
CRN69 1.92 2.22 2.49 2.21 0.41 70.11 76.95 81.88 76.31 5.90
GAN70 1.72 2.15 2.44 2.11 0.37 65.01 75.71 82.61 74.44 4.03
LSTM20 1.82 2.15 2.44 2.14 0.40 68.78 75.81 81.54 75.37 4.96
GRN+IRM (Proposed) 2.17 2.48 2.75 2.45 0.71 83.29 85.86 92.08 86.75 16.34

Significant values are in bold.

Subjective evaluation

Furthermore, we conducted subjective listening tests to evaluate the quality of the enhanced speech. We randomly selected 300 sentences from different background noises at −5 dB, 0 dB, and 5 dB to assess the performance of the DNN, LSTM, and proposed GRN+IRM. The participants were asked to rate the speech quality on a scale from 0 to 5. The subjective tests were performed in a soundproof room using high-quality headphones. Before the tests, training sessions were conducted to familiarize the listeners with the procedures. Figure 6 displays the results of the Mean Opinion Score (MOS), a numerical measure of human-judged overall quality, also known as the subjective listening test, where the proposed GRN+IRM model demonstrated superior MOS performance. The average MOS score was higher than 2.80 at negative SNRs (MOS > 2.80 at −5 dB), indicating considerable SE performance. For SNR ≥ 0 dB, the GRN+IRM model yielded a MOS score greater than 3.0. ANOVA statistical analysis of the average MOS scores at −5 dB, 0 dB, and 5 dB yielded [F(3,10) = 44.5, p < 0.0001], [F(3,10) = 35.8, p < 0.0001], and [F(3,10) = 27.2, p < 0.0001], indicating the statistical significance of the MOS scores achieved by the GRN+IRM model. The DNN and LSTM baselines also demonstrated improved performance, as deep learning can produce better speech quality. Figure 6 shows the average MOS score of all listeners, where the y-axis shows the MOS score and the x-axis indicates the input SNRs.

Fig. 8. Noisy and enhanced features fusion with GRU.

Fig. 6. Average MOS of all participants at SNRs.

Joint optimization and ASR

In conventional joint speech enhancement and ASR, the noisy magnitude spectrum |Y| is used as the input feature. The conventional joint training method comprises two main components: speech enhancement and speech recognition. Initially, the model is trained using noisy and clean parallel data to enhance speech quality. Subsequently, the improved speech output serves as the sole input feature for the speech recognition model71–73. To optimize the entire system, a combined loss function for both enhancement and speech recognition is employed, enabling the simultaneous training of the enhancement and ASR models. However, this approach makes the speech recognition model depend entirely on the enhanced features, which may still be affected to some extent by speech distortions. Therefore, this study follows the joint optimization approach shown in Fig. 7.

Fig. 7. Schematic of joint SE and ASR.

The spectrograms produced by the speech enhancement network can often display noticeable distortions in the resulting speech. This problem arises when noise dominates specific time-frequency bands, overshadowing the intended speech signals. Consequently, the speech enhancement identifies these noisy time-frequency bands and removes a significant amount of information, resulting in distortions that lead to the loss of important speech elements such as formants. Even though the speech enhancement effectively reduces background noise, these distortions persist undetected by the ASR system. Such distortions ultimately contribute to the decline in ASR performance. To tackle this challenge, we implement the fused GRU (F_GRU) approach to combine noisy and enhanced features, as illustrated in Fig. 8. This method aims to mitigate the impact of these distortions and enhance the overall performance of the ASR system.

Regarding the feature fusion network, our approach employs two GRUs simultaneously, denoted by G(·), as shown in Fig. 9. The goal is to derive deep representations for the enhanced (F_enh) and noisy (F_noisy) features. In the initial stage of fusing the noisy features F_noisy with the enhanced features F_enh at step p = 1, the hidden state h_0 is initialized randomly. At the reset gate of the GRU for step p, the hidden state h_{p−1} and the noisy input features F_noisy decide the status of the reset gate. The status of the update gate is likewise decided by h_{p−1} and F_noisy, given as:

r_p = σ( W_r · [h_{p−1}, F_noisy] )   (24)
z_p = σ( W_z · [h_{p−1}, F_noisy] )   (25)

where W_r and W_z are the weights of the reset and update gates. The reset gate r determines the memorization of past information through the element-wise product ⊙, given as:

h̃_p = tanh( W_h · [r_p ⊙ h_{p−1}, F_noisy] )   (26)
h_p = (1 − z_p) ⊙ h_{p−1} + z_p ⊙ h̃_p   (27)

Fig. 9. Features fusion block with GRUs.

The candidate state h̃_p helps in remembering long-term information. The selective fusion of features combines F_noisy and F_enh at step p, given as:

F_p^fused = z_p ⊙ F_noisy + (1 − z_p) ⊙ F_enh   (28)

The above Eq. (28) connects the input gate (1 − z) and the forget gate z. Finally, after three stages of F_GRU, the features are concatenated to obtain the final features F_final, given as:

F_final = [ F_1^fused ; F_2^fused ; F_3^fused ]   (29)
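One fusion step of Eqs. (24)-(28) can be sketched as follows; the gate equations follow the standard GRU form, while the selective-fusion line is our reading of Eq. (28) and Fig. 9, so the exact wiring should be treated as an assumption.

```python
import torch
import torch.nn as nn

class FusionGRUStep(nn.Module):
    """One step of the noisy/enhanced feature fusion (Eqs. 24-28 style)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.W_r = nn.Linear(2 * dim, dim)   # reset-gate weights
        self.W_z = nn.Linear(2 * dim, dim)   # update-gate weights
        self.W_h = nn.Linear(2 * dim, dim)   # candidate-state weights

    def forward(self, f_noisy, f_enh, h_prev):
        x = torch.cat([h_prev, f_noisy], dim=-1)
        r = torch.sigmoid(self.W_r(x))                                             # Eq. (24)
        z = torch.sigmoid(self.W_z(x))                                             # Eq. (25)
        h_tilde = torch.tanh(self.W_h(torch.cat([r * h_prev, f_noisy], dim=-1)))   # Eq. (26)
        h = (1 - z) * h_prev + z * h_tilde                                          # Eq. (27)
        fused = z * f_noisy + (1 - z) * f_enh                                       # Eq. (28): selective fusion
        return fused, h
```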

The fused features F_final are used as input to the Transformer-based ASR system. To jointly train the ASR and the proposed SE, the loss function is given as:

L_joint = L_ASR + λ · L_SE   (30)

The parameter λ controls the enhancement loss L_SE. In the SE module, λ is not fixed but is adaptively optimized during training to dynamically balance the multiple objectives within the composite loss function. Instead of manually setting a static value, λ is learned alongside the model parameters, allowing the model to adjust its focus based on the training dynamics and data characteristics. By learning λ adaptively, the model can prioritize noise suppression or speech fidelity at different stages of training, ultimately converging to an optimal balance that enhances overall performance. This approach leads to a more flexible and efficient enhancement process, tailoring the loss weights to suit the complexities of the input data and task requirements.
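A minimal sketch of the joint objective in Eq. (30) with a learnable balance weight λ, initialized to 1.0 as listed in Table 1; enforcing positivity with a softplus parameterization is our choice, not a detail stated in the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLoss(nn.Module):
    """Joint SE + ASR objective (Eq. 30) with a learnable, positive balance weight lambda."""
    def __init__(self, init_lambda: float = 1.0):
        super().__init__()
        raw_init = math.log(math.expm1(init_lambda))   # inverse softplus so the effective weight starts at init_lambda
        self.raw_lambda = nn.Parameter(torch.tensor(raw_init))

    def forward(self, loss_asr: torch.Tensor, loss_se: torch.Tensor) -> torch.Tensor:
        lam = F.softplus(self.raw_lambda)              # keeps lambda > 0 during training
        return loss_asr + lam * loss_se
```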

ASR results

For the noisy training dataset, speech sentences are selected from the LibriSpeech training set. These sentences are mixed with different noises at randomly selected SNRs ranging from 0 dB to 20 dB. The inference set contains noises mixed with speech sentences from LibriSpeech at SNRs of 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB. Tables 9 and 10 show the results of the joint speech enhancement and transformer ASR. The joint training approach improves the efficiency of end-to-end ASR, illustrating the efficacy of the joint training technique. We report the character error rate (CER) for ASR-Enhanced (the concatenation of the noisy features and the features enhanced by the proposed SE) and ASR-Enhanced-Fused (the concatenation of the fused features F_final, the noisy features, and the features enhanced by the proposed SE). In addition, we provide results for ASR-Enhanced-LSTM (the concatenation of the noisy features and the features enhanced by an LSTM-based SE) and ASR-Enhanced-GRU (the concatenation of the noisy features and the features enhanced by a GRU-based SE). Table 9 shows the CERs for the testing dataset, whereas Table 10 provides results for the development set. With the proposed speech enhancement and joint ASR, the CERs are improved significantly. Since the proposed SE shows less speech distortion (a better SDR of 7.09 dB compared with the LSTM and GRU baselines), the average CER is improved from 14.30% (with ASR-Enhanced-LSTM) to 13.01% with the proposed ASR-Enhanced-Fused.
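The character error rate reported in Tables 9 and 10 is the character-level edit distance normalized by the reference length; a self-contained sketch of that computation is given below.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein distance between character sequences / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    dist = list(range(len(hyp) + 1))                  # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, 1):
            prev, dist[j] = dist[j], min(dist[j] + 1,          # deletion
                                         dist[j - 1] + 1,      # insertion
                                         prev + (r != h))      # substitution (0 if chars match)
    return dist[-1] / max(len(ref), 1)
```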

Table 9.

CER results for Joint Speech Enhancement and ASR on Testing Set.

Model 0dB 5dB 10dB 15dB 20dB Average Clean
Noisy 51.44 39.78 28.15 23.11 22.47 32.99
ASR-enhanced 24.35 16.69 13.29 12.01 11.11 15.49 9.32
ASR-enhanced-GRU 22.02 15.75 12.37 11.65 10.84 14.52 9.32
ASR-enhanced-LSTM 21.92 15.65 11.97 11.22 10.74 14.30 9.32
ASR-enhanced-fused 20.02 14.75 10.37 10.05 9.84 13.01 9.32

Table 10.

CER results for joint speech enhancement and ASR on development set.

Model 0dB 5dB 10dB 15dB 20dB Average Clean
Noisy 51.44 39.78 28.15 23.11 22.47 32.99
ASR-Enhanced 22.18 14.54 11.35 10.09 9.72 11.42 8.21
ASR-Enhanced-GRU 20.13 13.98 11.45 10.02 9.65 11.27 8.21
ASR-Enhanced-LSTM 20.01 13.77 11.02 10.78 9.94 11.37 8.21
ASR-Enhanced-Fused 18.14 12.24 10.01 9.74 8.94 10.23 8.21

Summary and conclusion

This paper proposes a model that optimizes both speech enhancement and automatic speech recognition simultaneously. The objective is to seamlessly enhance speech quality while also refining representations to better suit the recognition task. While the integration of joint speech enhancement and automatic speech recognition techniques has displayed potential in achieving robust end-to-end ASR systems, conventional approaches typically rely on utilizing enhanced features as inputs for ASR systems. To overcome this limitation, our study adopted a dynamic fusion methodology. This approach combines both the enhanced features and the raw noisy features, to eliminate noise signals from the enhanced target speech while simultaneously capturing fine details from the noisy signals. By employing this fusion strategy, we alleviate speech distortions, thereby enhancing the overall performance of the ASR system. Our proposed model comprises an attentional codec equipped with a causal attention mechanism for SE, a fusion network based on Gated Recurrent Units (GRUs), and an ASR system. In the SE network, we utilize a modified GRU architecture where the traditional hyperbolic tangent (tanh) activation function is replaced with an attention-based rectified linear unit (AReLU).

The proposed speech enhancement (GRN+IRM) consistently outperforms the baselines across noisy testing scenarios. Specifically, under low-SNR (−5 dB) conditions, our SE network achieves superior STOI, PESQ, and SDR scores in both matched and unmatched conditions. Notably, causal local attention outperforms causal dynamic attention, suggesting that extensive preceding information might not be necessary for effective speech enhancement, given the rapid changes in noisy conditions. Minimizing hardware memory usage is crucial to ensure the feasibility of deploying the proposed GRN+IRM on embedded systems; therefore, we examined multiply-accumulate operations (MACs). The proposed model requires 0.245 G/s MACs with the attention process, ensuring efficiency without compromising SE performance. Our study concludes that the GRN+IRM model, trained with the IRM objective, stands competitively against recent models. With our proposed speech enhancement and joint ASR, significant improvements are observed in character error rates (CERs). Due to reduced speech distortion (a better SDR of 7.09 dB compared to the LSTM and GRU baselines), the average CER improves from 14.30% (with ASR-Enhanced-LSTM) to 13.01% with our proposed ASR-Enhanced-Fused model.

The limitations of this study include that performance may degrade for highly non-stationary noises (e.g., sudden bursts, overlapping speakers) due to the fixed attention window (Z = 5). Future work will explore adaptive window sizing or hybrid attention mechanisms. Further, the model is trained on LibriSpeech (English), and its generalizability to low-resource languages with different phonetic structures is untested. To address this in the future, transfer learning with limited labelled data could be investigated.

Acknowledgements

The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through large group Research Project under grant number RGP2/607/46. The Researchers would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University for financial support (QU-APC-2025). Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R747), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia

Funding

The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through large group Research Project under grant number RGP2/607/46. The researchers would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University for financial support (QU-APC-2025). The Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R747), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Data availability

The datasets generated and analysed during the current study are available in the LibriSpeech and AURORA repositories at https://www.openslr.org/12 and http://aurora.hsnr.de/aurora-2.html. The code for the attention-GRU is available at https://github.com/NasirSaleem/Speech-Enhancement-ASR.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Reza, S., Ferreira, M. C., Machado, J. J. M. & Tavares, J. M. R. S. A customized residual neural network and bi-directional gated recurrent unit-based automatic speech recognition model. Expert Syst. Appl.215, 119293 (2023). [Google Scholar]
  • 2.El-Shafai, W. et al. Optical ciphering scheme for cancellable speaker identification system. Comput. Syst. Sci. Eng.45(1), 563–578 (2023). [Google Scholar]
  • 3.Passos, L. A., Papa, J. P., Hussain, A. & Adeel, A. Canonical cortical graph neural networks and its application for speech enhancement in audio-visual hearing aids. Neurocomputing527, 196–203 (2023). [Google Scholar]
  • 4.Windowing, F. F. T. Research article speech enhancement with geometric advent of spectral subtraction using connected time-frequency regions noise estimation. Res. J. Appl. Sci. Eng. Technol.6(6), 1081–1087 (2013). [Google Scholar]
  • 5.Jannu, C., & Vanambathina, S.D. Weibull and nakagami speech priors based regularized nmf with adaptive wiener filter for speech enhancement. Int. J. Speech Technol. 1–13 (2023).
  • 6.Ephraim, Y. & Malah, David. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.32(6), 1109–1121 (1984). [Google Scholar]
  • 7.Chen, Bin & Loizou, Philipos C. A laplacian-based mmse estimator for speech enhancement. Speech Commun.49(2), 134–143 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Michelsanti, D. et al. An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Trans. Audio Speech Lang. Process.29, 1368–1396 (2021). [Google Scholar]
  • 9.Yong, X., Jun, D., Dai, L.-R. & Lee, C.-H. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process.23(1), 7–19 (2014). [Google Scholar]
  • 10.Wang, Z.-Q., Wang, P. & Wang, D. Complex spectral mapping for single-and multi-channel speech enhancement and robust asr. IEEE/ACM Trans. Audio Speech Lang. Process.28, 1778–1787 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Li, A., Liu, W., Zheng, C., Fan, C. & Li, X. Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process.29, 1829–1843 (2021). [Google Scholar]
  • 12.Abdullah, Salinna, Zamani, Majid & Demosthenous, Andreas. Towards more efficient dnn-based speech enhancement using quantized correlation mask. IEEE Access9, 24350–24362 (2021). [Google Scholar]
  • 13.Saleem, N., Mustafa, E., Nawaz, A. & Khan, A. Ideal binary masking for reducing convolutive noise. Int. J. Speech Technol.18, 547–554 (2015). [Google Scholar]
  • 14.Bao, Feng & Abdulla, Waleed H. A new ratio mask representation for casa-based speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process.27(1), 7–19 (2018). [Google Scholar]
  • 15.Saleem, N., Khattak, M. I., Al-Hasan, M. & Qazi, A. B. On learning spectral masking for single channel speech enhancement using feedforward and recurrent neural networks. IEEE Access8, 160581–160595 (2020). [Google Scholar]
  • 16.Fan, C. et al. Gated recurrent fusion with joint training framework for robust end-to-end speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process.29, 198–209 (2020). [Google Scholar]
  • 17.Saleem, N. & Khattak, M. I. Deep neural networks for speech enhancement in complex-noisy environments. Int. J. Interactive Multimed. Artif. Intell.6(1), 84–91 (2020). [Google Scholar]


Data Availability Statement

The datasets generated and analysed during the current study are available in the LibriSpeech and AURORA repositories at https://www.openslr.org/12 and http://aurora.hsnr.de/aurora-2.html. The source code for the attention-GRU is available at https://github.com/NasirSaleem/Speech-Enhancement-ASR.

