Abstract
Speech enhancement (SE) and automatic speech recognition (ASR) in real-time processing involve improving the quality and intelligibility of speech signals on the fly, ensuring accurate transcription as the speech unfolds. SE removes unwanted background noise from the target speech, which is crucial for real-time ASR in high-noise environments. This study first proposes a speech enhancement network based on an attentional-codec model, whose primary objective is to suppress noise in the target speech with minimal distortion. However, excessive noise suppression in the enhanced speech can diminish the effectiveness of downstream ASR systems by discarding crucial latent information. While joint SE and ASR techniques have shown promise for robust end-to-end ASR, they traditionally rely on using the enhanced features alone as inputs to the ASR system. To address this limitation, our study uses a dynamic fusion approach that integrates both the enhanced features and the raw noisy features, aiming to remove noise from the enhanced target speech while simultaneously learning fine details from the noisy signals. This fusion mitigates speech distortions and improves the overall performance of the ASR system. The proposed model consists of an attentional codec equipped with a causal attention mechanism for SE, a GRU-based fusion network, and an ASR system. The SE network uses a modified Gated Recurrent Unit (GRU) in which the traditional hyperbolic tangent (tanh) is replaced by an attention-based rectified linear unit (AReLU). The SE experiments consistently show better speech quality, intelligibility, and noise suppression than the baselines in both matched and unmatched conditions. On the LibriSpeech database, the proposed SE improves STOI by 19.81% and PESQ by 28.97% in matched conditions, and STOI by 17.27% and PESQ by 27.51% in unmatched conditions.
The joint training framework for robust end-to-end ASR is evaluated using the character error rate (CER). The ASR results show that joint training reduces the error rate from 32.99% (averaged over noisy signals) to 13.52% (with the proposed SE and joint training for ASR).
Keywords: Speech enhancement, Speech recognition, Deep learning, End-to-end processing, Attentional GRU, Feature fusion, Joint optimization
Subject terms: Electrical and electronic engineering, Biomedical engineering, Information technology
Introduction
Joint speech enhancement (SE) and speech recognition are vital for improving the accuracy and robustness of automatic speech recognition (ASR) systems. By removing background noise and enhancing speech quality, SE techniques enable ASR systems to better understand and transcribe spoken words, especially in noisy environments such as crowded rooms or outdoor settings. This advancement is essential for the practical deployment of ASR in everyday applications, such as mobile devices and virtual assistants, ensuring reliable and effective communication with these technologies. Background noises frequently contaminate speech signals, which can significantly impact speech-related applications, particularly ASR1–3. Background noises and competing speakers are the primary sources of target-signal distortion. To mitigate the impact of noise, an SE system can restore the quality and improve the intelligibility of degraded signals. An ideal SE model performs well across diverse noisy backgrounds; however, developing such a model with minimal complexity and latency remains challenging.
Traditional SE techniques, such as spectral subtraction4, Wiener filtering5, and statistical models6,7, perform better in stationary noisy backgrounds. However, they perform poorly in nonstationary, noisy backgrounds. Deep learning (DL) has become a mainstream approach for speech enhancement8. Deep learning techniques learn to transform a noisy speech into a clean speech by training on a dataset of paired clean and noisy samples. These techniques may use mapping-based training objectives9–11 or masking-based training objectives to estimate the spectrum or time-frequency masks12–16. The commonly used deep learning techniques for speech enhancement include fully connected networks (FCN)17, RNNs18, and CNNs19. To enhance noisy speech, Long Short Term Memory (LSTM) is used to develop a noise- and speaker-independent model20. The model is trained using a four-layered LSTM network on speech samples from different speakers combined with various types of background noise. Another approach21 proposes a CNN architecture applying gated and dilated convolution. Another trend uses an attention mechanism to enhance noisy speech22. In23,24, the LSTM model is proposed for speech enhancement by applying the attention gate to replace the forget gate. The study in22 proposes a self-attention dense CNN for better feature extraction and uses feature reusing. The study in25 proposes a dual-path RNN with self-attention such that the processing of long sequences is improved. Several studies use attention mechanisms to enhance speech signals with promising results26–28. These existing RNN models have good capability for noise suppression but suffer from their complex structure and long training times. GRU (Gated Recurrent Unit) is a recurrent neural network that can be used in speech enhancement for learning long-term temporal dependencies29–32.
Nevertheless, speech enhancement focuses on refining the models to estimate the target speech, distinct from the speech recognition aspect. Consequently, speech enhancement approaches often do not align with the ultimate objective, resulting in suboptimal outcomes33. Moreover, the output speech from these enhancement techniques tends to be excessively over-smoothed, leading to post-enhancement speech distortion. This distortion can significantly impact the effectiveness of ASR systems34. Consequently, the success of this approach relies heavily on the success of the front-end enhancement35. To enhance the noise robustness of ASR, three primary approaches are commonly used. The first approach involves integrating a speech enhancement component at the front end of the ASR system. The second method employs multi-condition training to enhance the noise robustness of ASR. This involves training the speech recognition model on various types of data, including both clean and noisy speech. However, this approach leads to increased complexity and computational costs. Moreover, it often yields underwhelming results when faced with unmatched conditions36, and its performance can be impacted by speech distortion37. The third prevalent approach involves joint training techniques38,39, which utilize a unified framework to optimize both speech enhancement and recognition simultaneously. The rationale behind this approach is that speech enhancement and recognition are intertwined tasks that can mutually enhance each other's performance. For instance, to improve the noise robustness of end-to-end ASR, a joint adversarial enhancement training method was proposed in40. This method leverages the joint training framework to refine both the mask-based enhancement network and the attention-based encoder-decoder speech recognition network. Furthermore, even on the noisy AISHELL-1 dataset40, the CER remains above 50%, indicating a need for improvement.
On the other hand, concerning E2E speech recognition, speech transformer models have demonstrated remarkable performance, achieving state-of-the-art results. The self-attention network41,42 stands out as a crucial element of the speech transformer, offering a greater capability to capture long-term dependencies than sequence-to-sequence models based on recurrent neural networks (RNNs). More recent SE and ASR literature can be found in studies such as43–49.
To understand the problem and the need to jointly optimize SE and ASR, we analyze the spectrograms illustrated in Fig. 1, which depicts an example spectrogram of a test speech sample. The spectrogram of the enhanced speech, processed by the enhancement network, exhibits notable leaks, as shown in Fig. 1 (right), indicated by the highlighted boxes, resulting in speech distortion. These boxes indicate significant leaks, primarily due to the dominance of noise signals in these time-frequency bins, overshadowing the target speech. Consequently, the enhancement network interprets these time-frequency bins as noise signals and eliminates the relevant information, such as formants. Although the enhancement network manages to reduce noise signals to some extent, these leaks remain unrecognized by the ASR system, leading to substantial loss of essential speech details. These factors explain how speech distortion adversely affects the performance of automatic speech recognition.
Fig. 1.
An example spectrogram of a test speech sample. Clean speech (left), noisy speech (middle), and enhanced speech without joint optimization (right). The boxes highlight the spectral leaks (over-smoothed distortions).
This paper presents a jointly optimized speech enhancement and automatic speech recognition model that aims to automatically acquire more robust representations that are well-suited for the recognition task. The contributions of this study are twofold.
This study proposes a speech enhancement approach based on an attentional-codec model to effectively reduce noise in the target speech while minimizing distortions, such as over-smoothed spectrograms. The proposed Speech Enhancement (SE) network improves noisy speech using an attention process that mirrors human focus on specific speech components amidst surrounding noise. By employing this attention process within the codec (encoder-decoder), the model achieves enhanced sequential modelling, allowing learned weights from past input features to predict current features accurately. This attention mechanism actively manages the correlation between preceding and current frames, assigning attention weights to earlier speech frames. Experimental results demonstrate that the proposed SE model surpasses baseline methods in terms of speech quality, intelligibility, noise reduction, and speech distortion.
Traditionally, ASR systems have often depended on utilizing enhanced features as inputs. In our study, however, we use a dynamic fusion approach to overcome this limitation. This approach integrates both the enhanced features and the raw noisy features to filter out noise signals from the enhanced target speech while simultaneously capturing fine details from the noisy signals. By employing this fusion approach, we aim to reduce speech distortions and enhance the overall performance of the ASR system.
The paper is structured as follows: “Proposed speech enhancement” presents the proposed speech enhancement approach. “Speech enhancement experiments” details the experiments, results, and discussions about speech enhancement. “Joint optimization and ASR” discusses jointly optimized speech enhancement and automatic speech recognition with corresponding results. Finally, “Summary and conclusion” provides the conclusion of this study.
Proposed speech enhancement
Figure 2 shows the diagram of the proposed SE. Let the clean speech and the background noise be denoted by s(t) and d(t). The resulting noisy speech y(t) is obtained by mixing s(t) and d(t), given as:

$$y(t) = s(t) + d(t), \qquad t = 1, \ldots, M \tag{1}$$

where M is the number of speech samples. The speech enhancement network recovers an estimate $\hat{s}(t)$ of the underlying clean speech s(t) from the noisy speech y(t). The SE network is fed with the inputs $Y_t$ and $X_t$, where Y and X represent the magnitudes of the noisy mixture and the underlying clean speech at frame t. The encoder extracts the features h, given as:

$$h_t = f_{\mathrm{enc}}(Y_t) \tag{2}$$

where the parameters Q and K, representing the query and key, are derived from the encoder outputs. Our study uses a Gated Recurrent Unit (GRU) encoder-decoder, which models sequential information at a lower computational cost and with improved performance compared to LSTM, as reported in 50. To generate fixed-length context vectors, the attention process is applied to the key and query inputs:

$$c_t = \mathrm{Att}(Q_t, K) \tag{3}$$
Fig. 2.

The proposed speech enhancement pipeline.
With the context vector $c_t$ and the encoder states $h_t$, the output of the decoder recovers the enhanced speech $\hat{s}(t)$:

$$\hat{s}_t = f_{\mathrm{dec}}(c_t, h_t) \tag{4}$$
Figure 3 shows the attentional-GRU codec. The encoder extracts features from the speech spectrum; to accomplish this, the extracted features are provided to the input layer:

$$h_t = f_{\mathrm{GRU}}(Y_t, h_{t-1}) \tag{5}$$

where $f_{\mathrm{GRU}}$ represents the neural network (GRU) function and $h_t$ is the GRU output.
Fig. 3.
The architecture of the proposed stacked-attention encoder-decoder unidirectional GRU.
The feature $Y_t$, together with the previous state $h_{t-1}$, is the input to the GRU cell. The update and reset gates are computed as:

$$z_t = \sigma(W_z Y_t + U_z h_{t-1}), \qquad r_t = \sigma(W_r Y_t + U_r h_{t-1}) \tag{6}$$

where $\sigma$ denotes the sigmoid function. The GRU output $h_t$ can then be computed as:

$$\tilde{h}_t = \mathrm{AReLU}\big(W_h Y_t + U_h (r_t \odot h_{t-1})\big), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{7}$$
Unidirectional attentional-GRU encoder
A gated recurrent unit (GRU) is an RNN type that includes a gating mechanism for controlling the flow of information. The unidirectional GRU processes the sequence in one direction and is commonly used for sequence-to-sequence learning. Since the GRU has fewer parameters to optimize, it can mitigate the vanishing-gradient problem and trains faster than LSTM. This study employs a modified GRU in which the classical hyperbolic tangent (tanh) is replaced with the attention-based ReLU (AReLU)51, a learnable activation function that leverages an element-wise attention approach. The hyperbolic tangent incurs high complexity because of dense activation computations. AReLU employs learned, data-adaptive parameters to amplify positive elements and diminish negative elements. The training process remains robust against vanishing gradients since the attention mechanism in AReLU learns element-wise residues of the active region. The attention activation learning through AReLU leads to well-focused activations in significant areas of the feature map. Having additional learnable parameters ($\alpha$ and $\beta$) per layer enables fast network training at low learning rates. Following51, AReLU is denoted as $\mathcal{F}(x, \alpha, \beta)$, a combination of an element-wise sign-based attention function $\mathcal{L}(x, \alpha, \beta)$ and the classical ReLU $\mathcal{R}(x)$, as follows:

$$\mathcal{L}(x_i, \alpha, \beta) = \begin{cases} C(\alpha)\, x_i, & x_i < 0 \\ \sigma(\beta)\, x_i, & x_i \ge 0 \end{cases} \tag{8}$$

$$\mathcal{F}(x_i, \alpha, \beta) = \mathcal{R}(x_i) + \mathcal{L}(x_i, \alpha, \beta) = \begin{cases} C(\alpha)\, x_i, & x_i < 0 \\ \big(1 + \sigma(\beta)\big)\, x_i, & x_i \ge 0 \end{cases} \tag{9}$$

where $x_i$ is the input to the activation layer, $\alpha$ and $\beta$ are the learnable parameters, $C(\cdot)$ clamps its argument to [0.01, 0.99] to prevent the negative-branch gain from becoming zero, and $\sigma$ denotes the sigmoid activation. The inclusion of AReLU in GRUs can assist in capturing long-term contextual dependencies between features, which is critical in SE. As a result, in addition to preventing gradient vanishing, the use of AReLU in the GRU can aid in capturing these long-term dependencies and improving the performance of SE. The attention-ReLU-based GRU cell structure is depicted in Fig. 4.
Fig. 4.

Attention-ReLU in GRU cell structure.
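As a concrete illustration, the AReLU activation of Eqs. (8)–(9) and its use as the candidate-state nonlinearity in a GRU cell can be sketched in NumPy; the weight shapes, random initialisation, and the α, β values here are illustrative assumptions, not the trained configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def arelu(x, alpha=0.9, beta=2.0):
    """AReLU: clamped alpha scales negative elements, (1 + sigmoid(beta)) scales positives."""
    c_alpha = np.clip(alpha, 0.01, 0.99)   # keep the negative-branch gain non-zero
    return np.where(x >= 0, (1.0 + sigmoid(beta)) * x, c_alpha * x)

def gru_cell_arelu(x_t, h_prev, W, U, b, alpha=0.9, beta=2.0):
    """One GRU step with AReLU replacing tanh in the candidate state.
    W: (3, n_in, n_h), U: (3, n_h, n_h), b: (3, n_h) for update/reset/candidate."""
    z = sigmoid(x_t @ W[0] + h_prev @ U[0] + b[0])              # update gate
    r = sigmoid(x_t @ W[1] + h_prev @ U[1] + b[1])              # reset gate
    h_tilde = arelu(x_t @ W[2] + (r * h_prev) @ U[2] + b[2], alpha, beta)
    return (1.0 - z) * h_prev + z * h_tilde                     # new hidden state

# toy step: one 257-bin magnitude frame into a 256-unit cell
rng = np.random.default_rng(0)
n_in, n_h = 257, 256
W = rng.standard_normal((3, n_in, n_h)) * 0.01
U = rng.standard_normal((3, n_h, n_h)) * 0.01
b = np.zeros((3, n_h))
h = gru_cell_arelu(rng.standard_normal(n_in), np.zeros(n_h), W, U, b)
```

Replacing tanh with AReLU keeps positive pre-activations scaled by 1 + σ(β) and negatives by the clamped α, so the active region does not saturate during training.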
Attentional process
The attention process plays a crucial role in creating fixed-length context vectors by processing the key and query inputs. An attention mechanism can, in principle, process both preceding and future speech frames; however, to avoid processing latency, the speech enhancement in this study uses only previous frames. To achieve this, the model uses both causal dynamic and causal local attention. In causal dynamic attention, the model uses the entire preceding sequence $\{h_1, \ldots, h_{t-1}\}$ together with the current input for computing the attention weights; that is, all preceding frames are utilized to enhance the present frame. However, for long speech sequences, the attention weights of many distant frames may become almost zero. To address this, the model uses the causal local attention process, where only $\{h_{t-Z}, \ldots, h_{t-1}\}$ is utilized for computing the attention weights. The model learns the attention weights $\alpha_{t,i}$ as:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j \in \mathcal{W}_t} \exp(e_{t,j})} \tag{10}$$

where $\mathcal{W}_t = \{1, \ldots, t-1\}$ for causal dynamic attention and $\mathcal{W}_t = \{t-Z, \ldots, t-1\}$ for causal local attention, with Z a constant. According to the correlation computation:

$$e_{t,i} = Q_t^{\top} K_i \tag{11}$$

The attention-weighted context vectors are given as:

$$c_t = \sum_{i \in \mathcal{W}_t} \alpha_{t,i}\, h_i \tag{12}$$
The proposed SE model determines the attention process for a speech frame with attention-weighted context vectors.
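The causal local attention of Eqs. (10)–(12) can be sketched as follows; the dot-product score is an illustrative choice for the correlation computation, and the window size Z = 5 follows the hyperparameter table:

```python
import numpy as np

def causal_local_attention(h, t, Z=5):
    """Context vector for frame t from at most the Z preceding encoder states.
    h: (T, d) encoder outputs; returns (attention weights, context vector)."""
    lo = max(0, t - Z)
    keys = h[lo:t]                          # causal window: frames t-Z .. t-1
    scores = keys @ h[t]                    # dot-product correlation with the query frame
    scores -= scores.max()                  # numerical stability before softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ keys                # attention-weighted context vector
    return weights, context

rng = np.random.default_rng(1)
h = rng.standard_normal((20, 8))            # toy encoder outputs: 20 frames, dim 8
w, c = causal_local_attention(h, t=10, Z=5)
```

Restricting the sum to the last Z frames is what distinguishes causal local from causal dynamic attention, where the window would span all preceding frames.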
Unidirectional attentional-GRU decoder
The GRU decoder reconstructs the speech from the input features, encoder outputs, and attention-weighted context vectors. The enhanced vector $o_t$ is computed using the attention-weighted context vectors and the features:

$$o_t = f_{\mathrm{dec}}\big([c_t; h_t]\big) \tag{13}$$

where the context vectors and feature vectors are concatenated as $[c_t; h_t]$. A time-frequency mask is finally estimated from the final vectors, with the ideal ratio mask as the training target:

$$M(t, f) = \sqrt{\frac{|S(t, f)|^2}{|S(t, f)|^2 + |D(t, f)|^2}} \tag{14}$$

where M denotes the ideal ratio mask, and $|S|$ and $|D|$ are the magnitude spectra of the clean speech s(t) and the noise signals d(t), respectively. To reconstruct the enhanced speech, the noisy features are multiplied by the estimated mask, and the resulting signal is obtained by performing the inverse Short-Time Fourier Transform (STFT):

$$\hat{s}(t) = \mathrm{iSTFT}\big(\hat{M} \odot |Y|\big) \tag{15}$$
The enhanced features along with raw noisy features are further used for joint SE and ASR optimization.
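A minimal sketch of the ideal ratio mask and its application to the noisy magnitude (Eqs. 14–15); the random magnitudes, the mixture stand-in, and the omitted iSTFT/phase step are simplifying assumptions:

```python
import numpy as np

def ideal_ratio_mask(S_mag, D_mag):
    """IRM: sqrt(|S|^2 / (|S|^2 + |D|^2)), bounded in [0, 1]."""
    return np.sqrt(S_mag**2 / (S_mag**2 + D_mag**2 + 1e-12))

rng = np.random.default_rng(2)
S = np.abs(rng.standard_normal((100, 257)))   # clean magnitude (T x F), toy data
D = np.abs(rng.standard_normal((100, 257)))   # noise magnitude, toy data
Y = np.abs(S + D)                             # crude stand-in for the mixture magnitude
M = ideal_ratio_mask(S, D)
S_hat = M * Y   # enhanced magnitude; iSTFT with the estimated phase recovers the waveform
```

Because the mask lies in [0, 1], the sigmoid output layer of the decoder is a natural fit for estimating it.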
Speech enhancement experiments
Dataset and data generation
The experiments utilized the LibriSpeech dataset52, and noise sources were selected from the AURORA database53. LibriSpeech is a database of approximately 1000 h of English speech. The database includes 16,000 audio files, each 10 s in length, derived from public-domain audiobooks from the LibriVox project. The database is divided into several subsets, including test-clean, test-other, train-clean-100, train-clean-360, and train-other-500, intended for development, testing, and training purposes. The speech dataset D includes training and testing sentences, denoted as $s_{tr}(t)$ and $s_{te}(t)$, respectively. The noisy sentences are created by mixing background noises d(t) with $s_{tr}(t)$ and $s_{te}(t)$:

$$y_{tr}(t) = s_{tr}(t) + d(t) \tag{16}$$

$$y_{te}(t) = s_{te}(t) + d(t) \tag{17}$$

where $y_{tr}(t)$ and $y_{te}(t)$ denote the noisy training and testing data.
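The mixing of Eqs. (16)–(17) can be sketched as follows; the SNR-scaling step is an assumption reflecting the −5/0/5 dB conditions used in the experiments, and the signals here are synthetic stand-ins:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the mixture has the requested SNR, then add (y = s + d)."""
    noise = noise[:len(speech)]
    p_s = np.mean(speech**2)                 # speech power
    p_d = np.mean(noise**2)                  # noise power before scaling
    scale = np.sqrt(p_s / (p_d * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(3)
s = rng.standard_normal(16000)               # 1 s of 16 kHz "speech" stand-in
d = rng.standard_normal(16000)               # background noise stand-in
y = mix_at_snr(s, d, snr_db=0.0)             # 0 dB mixture
```

Repeating this over all sentences, noise types, and SNR levels yields the 21,600-sentence training pool described below.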
Feature extraction
The noisy-clean pairs y(t), s(t) are transformed to the frequency domain by applying the Short-Time Fourier Transform (STFT), given as:

$$Y(t, f) = \sum_{n=0}^{N-1} y(n + tH)\, w(n)\, e^{-j 2\pi f n / N} \tag{18}$$

where w(n) is the analysis window of length N, H is the hop size, $f \in \{0, \ldots, F-1\}$, F is the number of frequency bins, and T is the number of frames. This study uses the STFT magnitude |Y| as the input feature.
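Eq. (18) can be sketched with a framed FFT; this mirrors the 512-point Hanning window with 75% overlap (hop 128) used in the experiments, giving F = 257 magnitude bins per frame:

```python
import numpy as np

def stft_magnitude(y, n_fft=512, hop=128):
    """|STFT| with a Hann window; returns an array of shape (frames, n_fft//2 + 1)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))   # real-input FFT: 257 bins for n_fft=512

y = np.random.default_rng(4).standard_normal(16000)   # 1 s at 16 kHz
Y_mag = stft_magnitude(y)
```

A 16 kHz, 1 s signal yields 122 frames of 257 bins with these settings, matching the 257-unit output layer of the mask estimator.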
SE network architecture
The architecture of the proposed GRU-based codec consists of an input layer, three GRU layers of 256 units each, and an output layer of 257 sigmoid-activated units. The hyperparameters include 160 epochs, a learning rate of 0.0001, and randomly initialized weights. The training process utilized mini-batches of 32 sequences, employing back-propagation through time with the Adam optimizer. The layer configuration is (161/256/256/256/257) units. Table 1 provides the details of the hyperparameters. Noisy sentences are generated at -5 dB, 0 dB, and 5 dB SNRs. The sentences of both genders are repeated for all SNRs and mixed with all noises, resulting in 21,600 sentences (approximately 18 h). These sentences are used to train the proposed SE model. During testing, half of the speech sentences are used in matched and half in mismatched noisy conditions. All noises are tested with distinct sentences. Sentences are sampled at 16 kHz, and a Hanning window (512 points) with 75% overlap is used in the experiments. Usually, the noisy phase is used during speech reconstruction; however, this study uses an estimated phase54 to reconstruct the final speech. In masking-based SE, a loss function quantifies the difference between the predefined mask and the estimated mask, and the goal of training is to minimize this error. Typically, the MSE is used as the loss function in TF-masking-based SE, defined as:

$$L_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \big(f(S_i) - Y_i\big)^2 \tag{19}$$

where f(S), S, and Y represent the model output, input, and ground-truth label. The MSE in Eq. (19) can be expressed as:

$$L_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \big(\hat{M}(s_i) - M(s_i)\big)^2 \tag{20}$$
Table 1.
Hyperparameters and initial conditions.
| Component | Parameter | Value/range | Description |
|---|---|---|---|
| STFT preprocessing | Window size | 512 points | Hanning window |
| | Hop length | 128 points (75% overlap) | Frame overlap |
| | FFT bins | 257 | Frequency bins |
| SE network | GRU layers | 3 | Stacked layers |
| | Units per layer | 256 | Hidden units |
| | Attention window (Z) | 5 frames | Local attention span |
| | AReLU init (α, β) | Learnable | Learned parameters |
| Training | Batch size | 32 | Mini-batch size |
| | Learning rate | 0.0001 | Adam optimizer |
| | Epochs | 160 | Training iterations |
| | Loss weight | Adaptive (init = 1.0) | SE-ASR balance |
| | WMSE threshold (B) | 10 | Dynamic weighting |
| Fusion network | GRU hidden units | 256 | Fusion layer size |
| | Fusion steps (p) | 3 | Iteration count |
The estimated mask and predefined mask are represented as $\hat{M}(s)$ and M(s), respectively. A dynamically weighted loss function is employed to enhance network learning. This loss function multiplies the learning errors by weighted values; with such a process, the loss function emphasizes larger errors, enhancing overall performance. The weighted Mean Squared Error (WMSE) is calculated by multiplying the MSE terms by a weight variable $w_i$:

$$L_{\mathrm{WMSE}} = \frac{1}{N} \sum_{i=1}^{N} w_i\, \big(\hat{M}(s_i) - M(s_i)\big)^2 \tag{21}$$

To give more importance to the situations with significant errors, the weighting variable $w_i$ in Eq. (21) is adapted to the absolute divergence:

$$e_i = \big|\hat{M}(s_i) - M(s_i)\big| \tag{22}$$

The following conditions are applied to select the weights:

$$w_i = \begin{cases} e_i / 2, & e_i < B \\ e_i, & e_i \ge B \end{cases} \tag{23}$$

When the absolute divergence falls below the constant B (experimentally set to 10, since the model performs best at B = 10), the weighting is reduced by half, indicating that the model does not focus as much on small errors. Conversely, when the error is greater than or equal to B, the weight $w_i$ is set equal to the error magnitude, so the model focuses strongly on these larger errors.
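The dynamic weighting of Eqs. (21)–(23) can be sketched as follows, assuming the halving rule applies to the error magnitude and B = 10 as stated above:

```python
import numpy as np

def weighted_mse(m_hat, m_ref, B=10.0):
    """WMSE: errors below B get half weight; errors >= B are weighted by their magnitude."""
    err = np.abs(m_hat - m_ref)              # element-wise absolute divergence
    w = np.where(err < B, err / 2.0, err)    # dynamic weighting rule, Eq. (23)
    return np.mean(w * (m_hat - m_ref) ** 2)

m_ref = np.zeros(4)
small = weighted_mse(np.full(4, 1.0), m_ref)    # errors of 1 < B: halved weight
large = weighted_mse(np.full(4, 20.0), m_ref)   # errors of 20 >= B: full-magnitude weight
```

Relative to plain MSE, large divergences are amplified by their own magnitude while small ones are damped, which is what drives the faster convergence reported below.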
SE evaluation metrics and related models
To evaluate the proposed SE, this study uses the well-adopted metrics Short-Time Objective Intelligibility (STOI)55, Perceptual Evaluation of Speech Quality (PESQ)56, and Source-to-Distortion Ratio (SDR). We chose LSTM20 and DNN17 as baseline models for estimating the Ideal Ratio Mask (IRM). The models are denoted as follows: LSTM+IRM indicates that an LSTM estimates the IRM, DNN+IRM indicates that a fully connected DNN estimates the IRM, and GRN+IRM indicates that the proposed SE estimates the IRM as its training objective.
SE results and discussions
The study first compares the performance of the proposed SE against the baselines. Tables 2 and 3 display the average test results for the three metrics (STOI, PESQ, and SDR) across four testing noises and three SNRs for matched and unmatched noisy conditions, respectively. Notably, the proposed GRN+IRM consistently outperforms the baselines across all noisy testing scenarios.
Table 2.
SE performance in matched testing conditions.
| Noise | Metric | STOI in % | | | | PESQ | | | | SDR in dB | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Types | SNR | -5dB | 0dB | 5dB | Avg | -5dB | 0dB | 5dB | Avg | -5dB | 0dB | 5dB | Avg |
| Airport Noise | Noisy Speech | 63.05 | 69.76 | 83.95 | 72.25 | 1.64 | 1.86 | 2.14 | 1.88 | -4.79 | 0.11 | 5.07 | 0.13 |
| DNN+IRM | 80.85 | 84.54 | 90.74 | 85.38 | 1.83 | 2.21 | 2.59 | 2.21 | 3.98 | 6.86 | 8.38 | 6.41 | |
| LSTM+IRM | 83.55 | 87.65 | 92.36 | 87.85 | 2.01 | 2.34 | 2.67 | 2.34 | 4.09 | 7.1 | 9.54 | 6.91 | |
| GRN+IRM | 86.25 | 89.58 | 94.55 | 90.13 | 2.19 | 2.47 | 2.78 | 2.48 | 4.21 | 7.33 | 10.7 | 7.41 | |
| Babble noise | Noisy speech | 57.75 | 68.05 | 79.67 | 68.52 | 1.52 | 1.75 | 2.07 | 1.78 | -4.73 | 0.13 | 5.08 | 0.16 |
| DNN+IRM | 74.22 | 78.35 | 85.93 | 79.51 | 1.91 | 2.24 | 2.48 | 2.22 | 3.82 | 6.31 | 8.74 | 6.29 | |
| LSTM+IRM | 76.85 | 80.64 | 87.45 | 81.65 | 2.04 | 2.35 | 2.66 | 2.35 | 3.95 | 6.58 | 9.05 | 6.52 | |
| GRN+IRM | 80.22 | 82.25 | 89.27 | 83.91 | 2.15 | 2.46 | 2.75 | 2.45 | 4.08 | 7.4 | 9.36 | 6.94 | |
| Car Noise | Noisy Speech | 58.84 | 68.92 | 79.60 | 69.12 | 1.37 | 1.62 | 1.92 | 1.63 | -4.86 | 0.08 | 5.05 | 0.09 |
| DNN+IRM | 78.65 | 81.74 | 86.77 | 82.38 | 1.74 | 2.18 | 2.48 | 2.13 | 3.81 | 6.42 | 8.83 | 6.35 | |
| LSTM+IRM | 80.23 | 84.47 | 89.18 | 84.62 | 1.99 | 2.39 | 2.55 | 2.31 | 4.2 | 7.21 | 9.92 | 7.11 | |
| GRN+IRM | 85.48 | 86.56 | 92.68 | 88.24 | 2.09 | 2.47 | 2.71 | 2.42 | 4.56 | 7.59 | 10.8 | 7.65 | |
| Factory Noise | Noisy Speech | 58.44 | 67.44 | 78.80 | 68.22 | 1.31 | 1.61 | 1.92 | 1.61 | -4.68 | 0.12 | 5.07 | 0.17 |
| DNN+IRM | 78.25 | 80.71 | 85.25 | 81.41 | 1.66 | 2.15 | 2.47 | 2.09 | 3.66 | 6.34 | 8.72 | 6.24 | |
| LSTM+IRM | 79.45 | 82.45 | 88.17 | 83.35 | 1.89 | 2.33 | 2.55 | 2.26 | 3.85 | 5.53 | 9.52 | 6.3 | |
| GRN+IRM | 81.62 | 82.15 | 90.44 | 84.73 | 2.11 | 2.45 | 2.77 | 2.44 | 4.01 | 6.69 | 10.3 | 6.99 | |
Table 3.
SE performance in unmatched testing conditions.
| Noise | Metric | STOI in % | | | | PESQ | | | | SDR in dB | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Types | SNR | -5dB | 0dB | 5dB | Avg | -5dB | 0dB | 5dB | Avg | -5dB | 0dB | 5dB | Avg |
| Airport Noise | Noisy Speech | 60.95 | 71.84 | 82.24 | 71.67 | 1.55 | 1.85 | 2.14 | 1.84 | -4.76 | 0.11 | 5.07 | 0.14 |
| DNN+IRM | 77.36 | 83.24 | 87.48 | 82.69 | 1.81 | 2.25 | 2.58 | 2.21 | 3.9 | 6.78 | 8.22 | 6.3 | |
| LSTM+IRM | 80.14 | 85.46 | 89.87 | 85.15 | 1.94 | 2.34 | 2.67 | 2.31 | 4.01 | 7.06 | 9.46 | 6.84 | |
| GRN+IRM | 82.77 | 87.91 | 92.37 | 87.68 | 2.08 | 2.42 | 2.75 | 2.42 | 4.13 | 7.34 | 10.7 | 7.39 | |
| Babble Noise | Noisy Speech | 54.64 | 65.94 | 77.48 | 66.02 | 1.39 | 1.74 | 2.04 | 1.72 | -4.69 | 0.16 | 5.1 | 0.19 |
| DNN+IRM | 71.15 | 78.47 | 80.77 | 76.79 | 1.77 | 2.01 | 2.56 | 2.11 | 3.88 | 5.85 | 8.66 | 6.13 | |
| LSTM+IRM | 73.55 | 79.87 | 82.33 | 78.58 | 1.88 | 2.17 | 2.64 | 2.23 | 3.98 | 6.01 | 8.94 | 6.31 | |
| GRN+IRM | 75.96 | 81.26 | 85.84 | 81.02 | 2.01 | 2.33 | 2.71 | 2.35 | 4.09 | 6.18 | 9.23 | 6.5 | |
| Car Noise | Noisy Speech | 57.27 | 67.48 | 78.44 | 67.73 | 1.39 | 1.63 | 1.93 | 1.65 | -4.78 | 0.1 | 5.07 | 0.13 |
| DNN+IRM | 75.38 | 80.29 | 83.21 | 79.62 | 1.72 | 2.22 | 2.44 | 2.12 | 3.99 | 6.83 | 8.31 | 6.38 | |
| LSTM+IRM | 78.19 | 83.37 | 87.55 | 83.03 | 1.95 | 2.31 | 2.67 | 2.31 | 4.14 | 7.11 | 9.51 | 6.92 | |
| GRN+IRM | 80.91 | 86.42 | 90.81 | 86.04 | 2.14 | 2.45 | 2.73 | 2.44 | 4.3 | 7.45 | 10.6 | 7.45 | |
| Factory Noise | Noisy Speech | 55.24 | 65.93 | 77.24 | 66.13 | 1.31 | 1.6 | 1.92 | 1.61 | -4.66 | 0.12 | 5.08 | 0.18 |
| DNN+IRM | 70.74 | 75.39 | 81.25 | 75.79 | 1.71 | 2.09 | 2.44 | 2.08 | 3.43 | 5.98 | 8.34 | 5.92 | |
| LSTM+IRM | 73.92 | 77.41 | 84.47 | 78.61 | 1.84 | 2.25 | 2.59 | 2.22 | 3.68 | 6.22 | 9.27 | 6.39 | |
| GRN+IRM | 77.35 | 79.49 | 87.64 | 81.49 | 1.97 | 2.41 | 2.71 | 2.36 | 3.94 | 6.47 | 10.2 | 6.87 | |
Table 2 displays the results of speech enhancement under matched conditions, where the proposed GRN+IRM demonstrates better values for all objective measures in all background noises. Specifically, at low SNR (-5 dB), the proposed SE network achieves the highest STOI (86.25%) and PESQ (2.19) values for airport noise, whereas the best SDR (9.36 dB) for babble noise is achieved at 5 dB. Taking the babble noisy case (matched condition) at -5 dB SNR, STOI improves from 63.05% with noisy speech to 86.25% with GRN+IRM, a 23.2% improvement. Furthermore, STOI improves from 80.85% with DNN+IRM to 86.25% with GRN+IRM, a 5.4% improvement. Similarly, in the case of factory noise (matched condition) at 0 dB SNR, PESQ increases from 1.61 with unprocessed noisy speech (UnP) to 2.45 with the proposed GRN+IRM, a 0.84 (34.28%) improvement. Additionally, PESQ increases from 2.18 with DNN+IRM to 2.47 with the proposed GRN+IRM in car noise, a 0.29 (11.74%) improvement over DNN+IRM. In a matched condition, consider street noise at 5 dB as another case, where the SDR value increases from 0.11 dB with UnP to 6.82 dB with GRN+IRM, an improvement of 6.71 dB. On average, at low SNR (-5 dB) in matched conditions, the proposed GRN+IRM increases STOI (by 16.34%), PESQ (by 0.71), and SDR (by 7.17 dB) over noisy unprocessed speech, demonstrating the effectiveness of the proposed SE model.
Table 3 presents the results of speech enhancement conducted under unmatched conditions, where the proposed GRN+IRM with an IRM training objective achieves better average values for all objective measures in all background noises. Specifically, at low SNR (-5 dB), the proposed model achieves the highest STOI and PESQ in street noise, whereas the best SDR (4.30 dB) is achieved in car noise. In the case of babble noise at 0 dB SNR under unmatched conditions, STOI improves from 71.84% with noisy speech to 87.91% with GRN+IRM, a 16.07% improvement. Also, for the factory noisy case (unmatched condition) at 0 dB SNR, PESQ improves from 1.61 with UnP to 2.41 with the proposed GRN+IRM, a 0.80 (33.19%) improvement. Furthermore, in car noise at 5 dB, PESQ improves from 2.44 with DNN+IRM to 2.73 with the proposed GRN+IRM, a 0.29 (11.74%) improvement over the DNN+IRM baseline. In the unmatched condition of street noise at 5 dB, the SDR value increases from 5.12 dB with UnP to 9.71 dB with GRN+IRM, an improvement of 4.59 dB. On average, at low SNR (-5 dB) under unmatched conditions, the proposed GRN+IRM significantly improves STOI, PESQ, and SDR over noisy unprocessed speech. Table 4 provides average scores encompassing all background noises for matched conditions (GRN+IRM-Matched), unmatched conditions (GRN+IRM-Unmatched), and the average of both conditions.
Table 4.
The overall results (Matched and Unmatched), averaged over all testing SNRs and noises.
| Condition | GRN+IRM-Matched | GRN+IRM-Unmatched | Average | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Metric | STOI | PESQ | SDR | STOI | PESQ | SDR | STOI | PESQ | SDR |
| Results | 86.75 | 2.45 | 7.20 | 82.27 | 2.39 | 6.97 | 84.51 | 2.42 | 7.09 |
Table 5 shows the performance of the causal local attention (CLA), for which the value of Z is varied between 5 and 25. The results (PESQ and STOI) indicate that values of Z greater than 15 are not competitive, and the best SE results are obtained with Z = 5. Therefore, Z = 5 is fixed for the proposed SE. It was observed that the causal local attention outperformed the causal dynamic attention. These findings support the assumption that substantial preceding information is not necessary for effective speech enhancement, as noisy conditions can change rapidly over time. These observations also apply to the attention networks, as the attention GRU performed better than the baseline GRU.
Table 5.
Causal local attention with different window sizes (Z).
| Z | STOI (in %) | | | PESQ | | |
|---|---|---|---|---|---|---|
| SNR | -5dB | 0dB | 5dB | -5dB | 0dB | 5dB |
| 5 | 86.21 | 89.56 | 94.54 | 2.18 | 2.47 | 2.77 |
| 15 | 82.45 | 84.74 | 89.83 | 2.15 | 2.42 | 2.69 |
| 25 | 79.56 | 82.74 | 88.12 | 2.08 | 2.34 | 2.66 |
Table 6 compares errors and predicted results (STOI and PESQ) between GRN+IRM with and without the weighted loss function. The results indicate that the weighted loss function improved PESQ and STOI. The weighted mean squared error (WMSE) reduces the learning error by 10.52% compared to the non-weighted MSE. Due to limitations in computational resources in practical applications, it is crucial to establish an optimal balance between the model's performance improvement and parameter efficiency. Table 7 illustrates the parameter efficiency of the proposed speech enhancement model: the integration of the attention process into the GRU does not significantly affect the parameter count (2.138 M) or parameter size (2.71 MB) compared to LSTM (4.672 M, 5.43 MB) and residual LSTM (RLSTM) (10 M)57. To deploy the proposed GRN+IRM on embedded systems, it is essential to minimize hardware memory usage; we therefore also report multiply-accumulate operations (MACs). The proposed GRN+IRM model achieves 0.245 G/s MACs with the attention process, ensuring efficiency without compromising SE performance. The integration of GRU significantly reduces the parameter count, parameter size, and MACs. We further analyzed the convergence of the proposed model after incorporating the weighted MSE, as shown in Fig. 5: the weighted MSE converges faster than the traditional MSE.
Table 6.
Dynamically-Weighted vs. non-dynamically-weighted loss.
| Model | Metric | Error |
|---|---|---|
| GRN+MSE | PESQ: 2.32, STOI: 83.87% | – |
| GRN+WMSE | PESQ: 2.45, STOI: 86.21% | – |
| Improvement | PESQi: 4.91%, STOIi: 2.34% | 10.52% |
Table 7.
Computational efficiency.
Fig. 5.

Learning error with and without the DW-MSE loss.
Comparison with related studies
This study compares the performance of the GRN+IRM with several studies selected from the literature to demonstrate its advantages. The comparison is performed at three SNR levels (-5dB, 0dB, and 5dB), and the results are presented in Table 8. The GRN+IRM model, with the IRM training objective, performs highly competitively against recent models, except PL-CRNN61, which performs slightly better at the less adverse SNR (5dB). CRN-BLSTM60 gained 0.48 (21.62%) PESQ over the noisy mixture, 7.36% lower than the proposed GRN+IRM. Similarly, CNN-GRU62 gained 0.59 (25.32%) PESQ over the noisy mixture, 3.64% lower than the proposed GRN+IRM. Furthermore, the STOI gain of MCBNet59 over the noisy mixture is 8.31%, which is 8.03% less than that of the GRN+IRM. Additionally, the average STOI improves from 84.25% with DCCRN64 to 86.75% with GRN+IRM. The proposed GRN+IRM outperforms related models by significant margins, such as a PESQ improvement of 0.31 (14.28%) and an STOI improvement of 11.55% over the GRN67 and AECNN68 models at the -5dB SNR level.
Table 8.
Comparison with related SE models; the PESQi and STOIi columns indicate improvement over noisy speech.
| Metric | PESQ |  |  |  |  | STOI |  |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|
| SNR (in dB) | -5 | 0 | 5 | Average | PESQi | -5 | 0 | 5 | Average | STOIi |
| Noisy unprocessed | 1.46 | 1.74 | 2.01 | 1.74 | – | 60.25 | 69.76 | 81.21 | 70.41 | – |
| DeepResGRU30 | 2.09 | 2.29 | 2.49 | 2.29 | 0.55 | 74.13 | 81.81 | 85.51 | 80.48 | 10.07 |
| CFN-GCFU58 | 1.98 | 2.24 | 2.62 | 2.28 | 0.54 | 71.61 | 78.19 | 86.21 | 78.67 | 8.26 |
| MCBNet59 | 2.01 | 2.32 | 2.52 | 2.28 | 0.54 | 72.81 | 79.15 | 84.15 | 78.71 | 8.30 |
| CRN-BLSTM60 | 1.93 | 2.23 | 2.51 | 2.22 | 0.48 | 70.31 | 77.08 | 81.96 | 76.45 | 6.04 |
| PL-CRNN61 | 2.06 | 2.51 | 2.85 | 2.47 | 0.73 | 73.16 | 84.42 | 90.15 | 82.57 | 12.16 |
| CNN-GRU62 | 2.01 | 2.34 | 2.65 | 2.33 | 0.59 | 74.61 | 83.11 | 90.11 | 82.61 | 12.20 |
| DTLN63 | 1.91 | 2.34 | 2.67 | 2.31 | 0.57 | 72.72 | 85.19 | 90.68 | 82.86 | 12.45 |
| DCCRN64 | 1.85 | 2.34 | 2.78 | 2.32 | 0.58 | 74.51 | 85.87 | 92.38 | 84.25 | 13.84 |
| DNN-TGSA65 | 2.01 | 2.31 | 2.58 | 2.30 | 0.56 | 74.41 | 81.21 | 84.12 | 79.91 | 9.50 |
| DeepXi66 | 1.99 | 2.21 | 2.41 | 2.20 | 0.46 | 72.01 | 81.21 | 91.99 | 81.73 | 11.32 |
| GRN67 | 1.86 | 2.16 | 2.42 | 2.15 | 0.41 | 69.76 | 76.89 | 81.42 | 76.02 | 5.61 |
| AECNN68 | 1.92 | 2.19 | 2.45 | 2.19 | 0.45 | 72.01 | 77.78 | 82.51 | 77.43 | 7.02 |
| CRN69 | 1.92 | 2.22 | 2.49 | 2.21 | 0.41 | 70.11 | 76.95 | 81.88 | 76.31 | 5.90 |
| GAN70 | 1.72 | 2.15 | 2.44 | 2.11 | 0.37 | 65.01 | 75.71 | 82.61 | 74.44 | 4.03 |
| LSTM20 | 1.82 | 2.15 | 2.44 | 2.14 | 0.40 | 68.78 | 75.81 | 81.54 | 75.37 | 4.96 |
| GRN+IRM (Proposed) | 2.17 | 2.48 | 2.75 | 2.45 | 0.71 | 83.29 | 85.86 | 92.08 | 86.75 | 16.34 |
Significant values are in bold.
Subjective evaluation
Furthermore, we conducted subjective listening tests to evaluate the quality of the enhanced speech. We randomly selected 300 sentences mixed with different background noises at -5dB, 0dB, and 5dB to assess the performance of the DNN, LSTM, and proposed GRN+IRM. The participants were asked to rate the speech quality on a scale from 0 to 5. The subjective tests were performed in a soundproof room using high-quality headphones, and training sessions were conducted beforehand to familiarize the listeners with the procedure. The Mean Opinion Score (MOS), a numerical measure of human-judged overall quality, showed that the proposed GRN+IRM model achieved superior performance. The average MOS was higher than 2.80 at the negative SNR (-5dB), indicating considerable SE performance, and exceeded 3.0 at 0dB. ANOVA of the average MOS scores at -5dB, 0dB, and 5dB yielded [F(3,10) = 44.5, p < 0.0001], [F(3,10) = 35.8, p < 0.0001], and [F(3,10) = 27.2, p < 0.0001], indicating the statistical significance of the MOS scores achieved by the GRN+IRM model. FDNN and LSTM also demonstrated improved performance, as deep learning can produce better speech quality. Figure 6 shows the average MOS of all listeners, where the y-axis shows the MOS score and the x-axis indicates the input SNRs.
Fig. 8.

Noisy and enhanced features fusion with GRU.
Fig. 6.

Average MOS of all participants at SNRs.
Joint optimization and ASR
In conventional joint speech enhancement and ASR, the noisy magnitude spectrum |Y| is used as the input feature. The conventional joint training method comprises two main components: speech enhancement and speech recognition. Initially, the model is trained on parallel noisy and clean data to enhance speech quality; the enhanced speech then serves as the sole input feature for the speech recognition model71–73. To optimize the entire system, a combined loss function covering both enhancement and recognition is employed, enabling simultaneous training of the enhancement and ASR models. However, this approach depends entirely on the enhanced features as input to the recognition model, which may still be affected to some extent by speech distortions. Therefore, this study follows the joint optimization approach shown in Fig. 7.
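The conventional pipeline described above can be sketched as follows; `se_model`, `asr_loss`, `enh_loss`, and the signal lengths are stand-ins for illustration, not the paper's actual networks.

```python
import numpy as np

# Stand-in modules: real SE and ASR networks would replace these.
def se_model(noisy):
    """Dummy enhancement: a fixed scaling stands in for a trained SE net."""
    return 0.9 * noisy

def asr_loss(features):
    """Stand-in for a CTC/attention ASR loss on the input features."""
    return float(np.mean(features ** 2))

def enh_loss(enhanced, clean):
    """Enhancement loss: MSE between enhanced and clean signals."""
    return float(np.mean((enhanced - clean) ** 2))

def conventional_joint_loss(noisy, clean, lam=0.5):
    """Conventional joint training: the ASR branch sees ONLY the
    enhanced features, and both losses are combined with weight lam."""
    enhanced = se_model(noisy)
    return asr_loss(enhanced) + lam * enh_loss(enhanced, clean)

rng = np.random.default_rng(2)
clean = rng.standard_normal(16000)
noisy = clean + 0.3 * rng.standard_normal(16000)
loss = conventional_joint_loss(noisy, clean)
print(loss > 0)  # True
```

Note that because `asr_loss` only ever receives `enhanced`, any distortion introduced by the SE stage propagates directly into recognition, which is exactly the limitation the fusion approach targets.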
Fig. 7.

Schematic of joint SE and ASR.
The spectrograms produced by the speech enhancement network can often display noticeable distortions in the resulting speech. This problem arises when noise dominates in specific time-frequency bands, overshadowing the intended speech signals. Consequently, the speech enhancement identifies these noisy time-frequency bands and removes a significant amount of information, resulting in distortions that lead to the loss of important speech elements such as formants. Even though the speech enhancement effectively reduces background noise, these distortions persist undetected by the ASR system and ultimately contribute to the decline in ASR performance. To tackle this challenge, we implement the fused GRU (FGRU) approach to combine noisy and enhanced features, as illustrated in Fig. 8. This method aims to mitigate the impact of these distortions and enhance the overall performance of the ASR system.
Regarding the feature fusion network, our approach involves employing two GRUs simultaneously, denoted by $G(\cdot)$, as demonstrated in Fig. 9. The goal is to derive deep representations for the enhanced features $\hat{s}$ and the noisy features $y$. In the initial stage of fusing the noisy features $y_p$ with the enhanced features $\hat{s}_p$ at step $p$, the hidden state $h_0$ is initialized randomly. At the reset gate of the GRU for step $p$, the hidden state $h_{p-1}$ and the noisy input features $y_p$ decide the status of the reset gate. The status of the update gate is also decided by $h_{p-1}$ and $y_p$, given as:

$$r_p = \sigma\left(W_r [h_{p-1}, y_p]\right) \tag{24}$$

$$z_p = \sigma\left(W_z [h_{p-1}, y_p]\right) \tag{25}$$

where $W_r$ and $W_z$ are the weights of the reset and update gates. The reset gate $r_p$ determines the memorization of past information by using the element-wise product $r_p \odot h_{p-1}$, given as:

$$\tilde{h}_p = \tanh\left(W_h [r_p \odot h_{p-1}, y_p]\right) \tag{26}$$

$$h_p = (1 - z_p) \odot h_{p-1} + z_p \odot \tilde{h}_p \tag{27}$$
Fig. 9.
Features fusion block with GRUs.
The update gate $z_p$ helps in remembering long-term information. The selective fusion of features combines the noisy representation $h_p^{y}$ and the enhanced representation $h_p^{\hat{s}}$ at step $p$, given as:

$$h_p^{f} = z_p \odot h_p^{\hat{s}} + (1 - z_p) \odot h_p^{y} \tag{28}$$

Equation (28) connects the input gate $(1 - z_p)$ and the forget gate $z_p$. Finally, after three stages of FGRU, the features are concatenated to obtain the final features $F$, given as:

$$F = \mathrm{Concat}\left(h_1^{f}, h_2^{f}, h_3^{f}\right) \tag{29}$$
The fused features $F$ are used as input to the Transformer-based ASR system. To jointly train the ASR and the proposed SE, the loss function is given as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{ASR}} + \lambda\, \mathcal{L}_{\mathrm{enh}} \tag{30}$$

The parameter $\lambda$ controls the enhancement loss $\mathcal{L}_{\mathrm{enh}}$. In the SE module, $\lambda$ is not fixed but is adaptively optimized during training to dynamically balance the objectives within the composite loss function. Instead of manually setting a static value, $\lambda$ is learned alongside the model parameters, allowing the model to adjust its focus based on the training dynamics and data characteristics. By learning $\lambda$ adaptively, the model can prioritize noise suppression or speech fidelity at different stages of training, ultimately converging to an optimal balance that enhances overall performance. This leads to a more flexible and efficient enhancement process, tailoring the loss weights to the complexity of the input data and the task requirements.
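An adaptively learned loss weight can be realized in several ways; since the exact scheme is not specified in this excerpt, the sketch below assumes uncertainty-style weighting, where lambda = exp(-s) for a learnable scalar s and the +s term keeps s from growing without bound.

```python
import numpy as np

def combined_loss(l_asr, l_enh, s):
    """Uncertainty-style weighting (one possible way to learn lambda):
    lambda = exp(-s); the +s regularizer penalizes down-weighting l_enh."""
    return l_asr + np.exp(-s) * l_enh + s

# toy gradient descent on s alone, with fixed component losses
l_asr, l_enh, s, lr = 1.0, 4.0, 0.0, 0.1
for _ in range(200):
    grad = -np.exp(-s) * l_enh + 1.0   # d/ds of combined_loss
    s -= lr * grad
print(round(np.exp(-s), 3))  # learned lambda converges to 1/l_enh = 0.25
```

In joint training, s would be updated by the same optimizer as the network weights, so the effective lambda shrinks as the enhancement loss stays large and grows when it becomes small.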
ASR results
For the noisy training dataset, speech sentences are selected from the LibriSpeech training set and mixed with different noises at SNRs randomly selected between 0dB and 20dB. The inference set contains noises mixed with LibriSpeech sentences at SNRs of 0dB, 5dB, 10dB, 15dB, and 20dB. Tables 9 and 10 show the results of the joint speech enhancement and Transformer ASR. The joint training approach improves the efficiency of end-to-end ASR, illustrating the efficacy of the joint training technique. We present the character error rate (CER) for ASR-Enhanced (the concatenation of the noisy features and the features enhanced by the proposed SE) and ASR-Enhanced-Fused (the concatenation of the fused features, the noisy features, and the features enhanced by the proposed SE). In addition, we provide results for ASR-Enhanced-LSTM (the concatenation of the noisy features and the features enhanced by an LSTM-based SE) and ASR-Enhanced-GRU (the concatenation of the noisy features and the features enhanced by a GRU-based SE). Table 9 shows the CERs for the test set, whereas Table 10 provides results for the development set. With the proposed speech enhancement and joint ASR, the CERs improve significantly. Since the proposed SE shows less speech distortion (a better SDR of 7.09dB compared with LSTM and GRU), the average CER improves from 14.30% (with ASR-Enhanced-LSTM) to 13.01% with the proposed ASR-Enhanced-Fused.
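The CER used in Tables 9 and 10 is the character-level Levenshtein edit distance normalized by the reference length; a minimal self-contained implementation:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (%): Levenshtein edit distance between the
    reference and hypothesis transcripts, divided by reference length."""
    r, h = list(reference), list(hypothesis)
    # dp[i][j]: edits to turn the first i ref chars into the first j hyp chars
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(r)][len(h)] / len(r)

# one substitution + one deletion over an 18-character reference
print(round(cer("speech enhancement", "speach enhancment"), 2))  # 11.11
```

Practical toolkits often use an equivalent dynamic-programming routine; the key detail is that CER can exceed 100% when the hypothesis contains many insertions.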
Table 9.
CER results for joint speech enhancement and ASR on test set.
| Model | 0dB | 5dB | 10dB | 15dB | 20dB | Average | Clean |
|---|---|---|---|---|---|---|---|
| Noisy | 51.44 | 39.78 | 28.15 | 23.11 | 22.47 | 32.99 | – |
| ASR-Enhanced | 24.35 | 16.69 | 13.29 | 12.01 | 11.11 | 15.49 | 9.32 |
| ASR-Enhanced-GRU | 22.02 | 15.75 | 12.37 | 11.65 | 10.84 | 14.52 | 9.32 |
| ASR-Enhanced-LSTM | 21.92 | 15.65 | 11.97 | 11.22 | 10.74 | 14.30 | 9.32 |
| ASR-Enhanced-Fused | 20.02 | 14.75 | 10.37 | 10.05 | 9.84 | 13.01 | 9.32 |
Table 10.
CER results for joint speech enhancement and ASR on development set.
| Model | 0dB | 5dB | 10dB | 15dB | 20dB | Average | Clean |
|---|---|---|---|---|---|---|---|
| Noisy | 51.44 | 39.78 | 28.15 | 23.11 | 22.47 | 32.99 | – |
| ASR-Enhanced | 22.18 | 14.54 | 11.35 | 10.09 | 9.72 | 11.42 | 8.21 |
| ASR-Enhanced-GRU | 20.13 | 13.98 | 11.45 | 10.02 | 9.65 | 11.27 | 8.21 |
| ASR-Enhanced-LSTM | 20.01 | 13.77 | 11.02 | 10.78 | 9.94 | 11.37 | 8.21 |
| ASR-Enhanced-Fused | 18.14 | 12.24 | 10.01 | 9.74 | 8.94 | 10.23 | 8.21 |
Summary and conclusion
This paper proposes a model that optimizes both speech enhancement and automatic speech recognition simultaneously. The objective is to seamlessly enhance speech quality while also refining representations to better suit the recognition task. While the integration of joint speech enhancement and automatic speech recognition techniques has displayed potential in achieving robust end-to-end ASR systems, conventional approaches typically rely on utilizing enhanced features as inputs for ASR systems. To overcome this limitation, our study adopted a dynamic fusion methodology. This approach combines both the enhanced features and the raw noisy features, to eliminate noise signals from the enhanced target speech while simultaneously capturing fine details from the noisy signals. By employing this fusion strategy, we alleviate speech distortions, thereby enhancing the overall performance of the ASR system. Our proposed model comprises an attentional codec equipped with a causal attention mechanism for SE, a fusion network based on Gated Recurrent Units (GRUs), and an ASR system. In the SE network, we utilize a modified GRU architecture where the traditional hyperbolic tangent (tanh) activation function is replaced with an attention-based rectified linear unit (AReLU).
The proposed speech enhancement model (GRN+IRM) consistently outperforms the baselines across noisy testing scenarios. Under low-SNR (-5dB) conditions, our SE network achieves superior STOI, PESQ, and SDR scores in matched conditions, and likewise achieves the highest STOI, PESQ, and SDR values at low SNRs in unmatched conditions. Notably, causal local attention outperforms causal dynamic attention, suggesting that extensive preceding information might not be necessary for effective speech enhancement, given the rapid changes in noisy conditions. Minimizing hardware memory usage is crucial to ensure the feasibility of deploying the proposed GRN+IRM on embedded systems; we therefore examined multiply-accumulate operations (MACs). The proposed model achieves 0.245 G/s MACs with the attention process, ensuring efficiency without compromising SE performance. Our study concludes that the GRN+IRM model, trained with the IRM objective, stands competitively against recent models. With our proposed speech enhancement and joint ASR, significant improvements are observed in character error rates (CERs). Due to reduced speech distortion (a better SDR of 7.09dB compared with LSTM and GRU), the average CER improves from 14.30% (with ASR-Enhanced-LSTM) to 13.01% with our proposed ASR-Enhanced-Fused model.
This study has limitations: performance may degrade for highly non-stationary noises (e.g., sudden bursts, overlapping speakers) due to the fixed attention window (Z = 5); future work will explore adaptive window sizing or hybrid attention mechanisms. Furthermore, the model is trained on LibriSpeech (English), and its generalizability to low-resource languages with different phonetic structures is untested; transfer learning with limited labelled data could be investigated to address this.
Acknowledgements
The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through the large group Research Project under grant number RGP2/607/46. The researchers would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University for financial support (QU-APC-2025), and the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R747), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Funding
The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through large group Research Project under grant number RGP2/607/46. The researchers would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University for financial support (QU-APC-2025). The Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R747), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Data availability
The datasets generated and analysed during the current study are available in the LibriSpeech and AURORA repositories at https://www.openslr.org/12 and http://aurora.hsnr.de/aurora-2.html. The code for the attention-GRU is available at https://github.com/NasirSaleem/Speech-Enhancement-ASR.
Declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Reza, S., Ferreira, M. C., Machado, J. J. M. & Tavares, J. M. R. S. A customized residual neural network and bi-directional gated recurrent unit-based automatic speech recognition model. Expert Syst. Appl. 215, 119293 (2023).
- 2. El-Shafai, W. et al. Optical ciphering scheme for cancellable speaker identification system. Comput. Syst. Sci. Eng. 45(1), 563–578 (2023).
- 3. Passos, L. A., Papa, J. P., Hussain, A. & Adeel, A. Canonical cortical graph neural networks and its application for speech enhancement in audio-visual hearing aids. Neurocomputing 527, 196–203 (2023).
- 4. Windowing, F. F. T. Research article speech enhancement with geometric advent of spectral subtraction using connected time-frequency regions noise estimation. Res. J. Appl. Sci. Eng. Technol. 6(6), 1081–1087 (2013).
- 5. Jannu, C. & Vanambathina, S. D. Weibull and Nakagami speech priors based regularized NMF with adaptive Wiener filter for speech enhancement. Int. J. Speech Technol. 1–13 (2023).
- 6. Ephraim, Y. & Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32(6), 1109–1121 (1984).
- 7. Chen, B. & Loizou, P. C. A Laplacian-based MMSE estimator for speech enhancement. Speech Commun. 49(2), 134–143 (2007).
- 8. Michelsanti, D. et al. An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 1368–1396 (2021).
- 9. Yong, X., Jun, D., Dai, L.-R. & Lee, C.-H. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 23(1), 7–19 (2014).
- 10. Wang, Z.-Q., Wang, P. & Wang, D. Complex spectral mapping for single- and multi-channel speech enhancement and robust ASR. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1778–1787 (2020).
- 11. Li, A., Liu, W., Zheng, C., Fan, C. & Li, X. Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 1829–1843 (2021).
- 12. Abdullah, S., Zamani, M. & Demosthenous, A. Towards more efficient DNN-based speech enhancement using quantized correlation mask. IEEE Access 9, 24350–24362 (2021).
- 13. Saleem, N., Mustafa, E., Nawaz, A. & Khan, A. Ideal binary masking for reducing convolutive noise. Int. J. Speech Technol. 18, 547–554 (2015).
- 14. Bao, F. & Abdulla, W. H. A new ratio mask representation for CASA-based speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 27(1), 7–19 (2018).
- 15. Saleem, N., Khattak, M. I., Al-Hasan, M. & Qazi, A. B. On learning spectral masking for single channel speech enhancement using feedforward and recurrent neural networks. IEEE Access 8, 160581–160595 (2020).
- 16. Fan, C. et al. Gated recurrent fusion with joint training framework for robust end-to-end speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 198–209 (2020).
- 17. Saleem, N. & Khattak, M. I. Deep neural networks for speech enhancement in complex-noisy environments. Int. J. Interactive Multimed. Artif. Intell. 6(1), 84–91 (2020).
- 18. Sun, L., Du, J., Dai, L.-R. & Lee, C.-H. Multiple-target deep learning for LSTM-RNN based speech enhancement. In 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA), 136–140 (IEEE, 2017).
- 19. Yechuri, S. & Vanambathina, S. A nested U-net with efficient channel attention and D3Net for speech enhancement. Circ. Syst. Signal Process. 1–21 (2023).
- 20. Chen, J., Wang, Y., Yoho, S. E., Wang, D. & Healy, E. W. Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises. J. Acoust. Soc. Am. 139(5), 2604–2612 (2016).
- 21. Huang, X., Chen, H. & Wei, L. A two-stage frequency-time dilated dense network for speech enhancement. Appl. Acoust. 201, 109107 (2022).
- 22. Pandey, A. & Wang, D. L. Dense CNN with self-attention for time-domain speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 1270–1279 (2021).
- 23. Saleem, N. et al. U-shaped low-complexity type-2 fuzzy LSTM neural network for speech enhancement. IEEE Access 11, 20814–20826 (2023).
- 24. Liang, R., Kong, F., Xie, Y., Tang, G. & Cheng, J. Real-time speech enhancement algorithm based on attention LSTM. IEEE Access 8, 48464–48476 (2020).
- 25. Pandey, A. & Wang, D. L. Dual-path self-attention RNN for real-time speech enhancement. arXiv preprint arXiv:2010.12713 (2020).
- 26. Yechuri, S. & Vanambathina, S. A nested U-net with efficient channel attention and D3Net for speech enhancement. Circ. Syst. Signal Process. 1–21 (2023).
- 27. Xu, X. & Hao, J. U-Former: Improving monaural speech enhancement with multi-head self and cross attention. In 2022 26th International Conference on Pattern Recognition (ICPR), 663–369 (IEEE, 2022).
- 28. Chen, J., Wang, Z., Tuo, D., Wu, Z., Kang, S. & Meng, H. FullSubNet+: Channel attention FullSubNet with complex spectrograms for speech enhancement. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7857–7861 (IEEE, 2022).
- 29. Guochen, Y. et al. DBT-Net: Dual-branch federative magnitude and phase estimation with attention-in-attention transformer for monaural speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 2629–2644 (2022).
- 30. Saleem, N. et al. DeepResGRU: Residual gated recurrent neural network-augmented Kalman filtering for speech enhancement and recognition. Knowl.-Based Syst. 238, 107914 (2022).
- 31. Wang, Y., Han, J., Zhang, T. & Qing, D. Speech enhancement from fused features based on deep neural network and gated recurrent unit network. EURASIP J. Adv. Signal Process. 2021, 1–19 (2021).
- 32. Yuan, W. Incorporating group update for speech enhancement based on convolutional gated recurrent network. Speech Commun. 132, 32–39 (2021).
- 33. Seltzer, M. L. Bridging the gap: Towards a unified framework for hands-free speech recognition using microphone arrays. In 2008 Hands-Free Speech Communication and Microphone Arrays, 104–107 (IEEE, 2008).
- 34. Wang, Z.-Q. & Wang, D. A joint training framework for robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24(4), 796–806 (2016).
- 35. Han, K. et al. Learning spectral mapping for speech dereverberation and denoising. IEEE/ACM Trans. Audio Speech Lang. Process. 23(6), 982–992 (2015).
- 36. Li, F., Nidadavolu, P. S. & Hermansky, H. A long, deep and wide artificial neural net for robust speech recognition in unknown noise. In Interspeech, 358–362 (2014).
- 37. Seltzer, M. L., Yu, D. & Wang, Y. An investigation of deep neural networks for noise robust speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 7398–7402 (IEEE, 2013).
- 38. Liu, B., Nie, S., Zhang, Y., Ke, D., Liang, S. & Liu, W. Boosting noise robustness of acoustic model via deep adversarial training. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5034–5038 (IEEE, 2018).
- 39. Chang, X., Zhang, W., Qian, Y., Le Roux, J. & Watanabe, S. End-to-end multi-speaker speech recognition with Transformer. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6134–6138 (IEEE, 2020).
- 40. Bu, H., Du, J., Na, X., Wu, B. & Zheng, H. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), 1–5 (IEEE, 2017).
- 41. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
- 42. Saleem, N., Gunawan, T. S., Dhahbi, S. & Bourouis, S. Time domain speech enhancement with CNN and time-attention transformer. Digital Signal Process. 104408 (2024).
- 43. Vanambathina, S. D., Nandyala, S., Jannu, C., Devi, J. S., Yechuri, S. & Parisae, V. Speech enhancement using U-net-based progressive learning with squeeze-TCN. In International Conference on Advances in Distributed Computing and Machine Learning, 419–432 (Springer, 2024).
- 44. Parisae, V. & Bhavanam, S. N. Multi scale encoder-decoder network with time frequency attention and S-TCN for single channel speech enhancement. J. Intell. Fuzzy Syst. 46(4), 10907–10907 (2024).
- 45. Nakadai, K., Hidai, K., Okuno, H. G. & Kitano, H. Real-time speaker localization and speech separation by audio-visual integration. In Proceedings 2002 IEEE International Conference on Robotics and Automation, vol. 1, 1043–1049 (IEEE, 2002).
- 46. Jannu, C. & Vanambathina, S. D. An overview of speech enhancement based on deep learning techniques. Int. J. Image Graph. 25(01), 2550001 (2025).
- 47. Jannu, C. & Vanambathina, S. D. Multi-stage progressive learning-based speech enhancement using time-frequency attentive squeezed temporal convolutional networks. Circ. Syst. Signal Process. 42(12), 7467–7493 (2023).
- 48. Jannu, C. & Vanambathina, S. D. DCT based densely connected convolutional GRU for real-time speech enhancement. J. Intell. Fuzzy Syst. 45(1), 1195–1208 (2023).
- 49. Jannu, C. & Vanambathina, S. D. Convolutional transformer based local and global feature learning for speech enhancement. Int. J. Adv. Comput. Sci. Appl. 14(1) (2023).
- 50. Ullah, R. et al. End-to-end deep convolutional recurrent models for noise robust waveform speech enhancement. Sensors 22(20), 7782 (2022).
- 51. Rajamani, S. T., Rajamani, K. T., Mallol-Ragolta, A., Liu, S. & Schuller, B. A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition. 6294–6298 (2021).
- 52. Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. 5206–5210 (2015).
- 53. Macho, D., Mauuary, L., Noé, B., Cheng, Y. M., Ealey, D., Jouvet, D., Kelleher, H., Pearce, D. & Fabien, S. Evaluation of a noise-robust DSR front-end on Aurora databases (2002).
- 54. Zheng, N. & Zhang, X.-L. Phase-aware speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 27(1), 63–76 (2018).
- 55. Taal, C. H., Hendriks, R. C., Heusdens, R. & Jensen, J. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 4214–4217 (IEEE, 2010).
- 56. Beerends, J. G., Hekstra, A. P., Rix, A. W. & Hollier, M. P. Perceptual evaluation of speech quality (PESQ) the new ITU standard for end-to-end speech quality assessment part II: Psychoacoustic model. J. Audio Eng. Soc. 50(10), 765–778 (2002).
- 57. Kim, J., El-Khamy, M. & Lee, J. Residual LSTM: Design of a deep recurrent architecture for distant speech recognition. Proc. Interspeech 2017, 1591–1595 (2017).
- 58. Xian, Y., Sun, Y., Wang, W. & Naqvi, S. M. Convolutional fusion network for monaural speech enhancement. Neural Netw. 143, 97–107 (2021).
- 59. Lan, T. et al. Multi-scale informative perceptual network for monaural speech enhancement. Appl. Acoust. 195, 108787 (2022).
- 60. Wang, Z., Zhang, T., Shao, Y. & Ding, B. LSTM-convolutional-BLSTM encoder-decoder network for minimum mean-square error approach to speech enhancement. Appl. Acoust. 172, 107647 (2021).
- 61. Li, A., Yuan, M., Zheng, C. & Li, X. Speech enhancement using progressive learning-based convolutional recurrent neural network. Appl. Acoust. 166, 107347 (2020).
- 62. Hasannezhad, M., Ouyang, Z., Zhu, W.-P. & Champagne, B. An integrated CNN-GRU framework for complex ratio mask estimation in speech enhancement. 764–768 (2020).
- 63. Westhausen, N. L. & Meyer, B. T. Dual-signal transformation LSTM network for real-time noise suppression (2020).
- 64. Hu, Y. et al. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. Proc. Interspeech 2020, 2472–2476 (2020).
- 65. Kim, J., El-Khamy, M. & Lee, J. T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement. 6649–6653 (2020).
- 66. Zhang, Q., Nicolson, A., Wang, M., Paliwal, K. K. & Wang, C. DeepMMSE: A deep learning approach to MMSE-based noise power spectral density estimation. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1404–1415 (2020).
- 67. Tan, K. & Wang, D. Complex spectral mapping with a convolutional recurrent network for monaural speech enhancement. 6865–6869 (2019).
- 68. Pandey, A. & Wang, D. L. TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain. 6875–6879 (2019).
- 69. Tan, K. & Wang, D. A convolutional recurrent neural network for real-time speech enhancement. Proc. Interspeech 2018, 3229–3233 (2018).
- 70. Shah, N., Patil, H. A. & Soni, M. H. Time-frequency mask-based speech enhancement using convolutional generative adversarial network. 1246–1251 (2018).
- 71. Bhardwaj, V. et al. Automatic speech recognition (ASR) systems for children: A systematic literature review. Appl. Sci. 12(9), 4419 (2022).
- 72. Rahman, A. et al. Arabic speech recognition: Advancement and challenges. IEEE Access (2024).
- 73. Hadwan, M., Alsayadi, H. A. & Al-Hagree, S. An end-to-end transformer-based automatic speech recognition for Qur'an reciters. Comput. Mater. Continua 74(2) (2023).