PeerJ Comput Sci. 2024 Feb 28;10:e1901. doi: 10.7717/peerj-cs.1901

Table 1. Summary of deep learning-based algorithms used for single-channel and multi-channel speech enhancement systems in filtering different types of noises.

Columns: Deep learning method | References | Dataset | Evaluation metrics | Results | Advantages/Disadvantages
DNN (deep neural network)
- Zhao et al. (2018). Dataset: NOISEX and IEEE corpus. Metrics: SDR, PESQ, STOI. Results: averaged over mismatched SNRs (−3 to 3 dB), PESQ is 1.99, SDR is 11.35, and STOI is 90.61%.
- Karjol, Kumar & Ghosh (2018). Dataset: TIMIT + noises from the AURORA dataset. Metrics: STOI, SegSNR, PESQ. Results: for seen noise, the average best PESQ is 2.65; for unseen noise, it is 2.19.
- Saleem & Khattak (2020). Dataset: environmental noises. Metrics: SegSNR, PESQ, LLR, STOI. Results: PESQ is 2.27, SegSNR is 4.24, LLR is 0.53, and STOI is 84%.
Advantages: the architecture is easy to understand, since the networks are typically simple.
Disadvantages: a DNN has a relatively large number of parameters, because every node in a layer is connected to every node in the preceding layer.
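Several rows in this table report segmental SNR (SegSNR): the signal-to-noise ratio is computed per frame in dB, clamped to a fixed range, and averaged over frames. A minimal Python sketch (the 256-sample frame length and the [−10, 35] dB clamping bounds are common conventions, not values taken from the cited papers):

```python
import math

def seg_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Segmental SNR: average of per-frame SNRs in dB, clamped to [lo, hi]."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        sig = sum(x * x for x in s)
        err = sum((x - y) ** 2 for x, y in zip(s, e))
        if sig == 0 or err == 0:
            continue  # skip silent or perfectly reconstructed frames
        snrs.append(min(hi, max(lo, 10 * math.log10(sig / err))))
    return sum(snrs) / len(snrs)

# Toy example: a sinusoid as the "clean" signal, a constant offset as residual noise.
clean = [math.sin(0.01 * n) for n in range(2048)]
enhanced = [x + 0.05 for x in clean]
print(round(seg_snr(clean, enhanced), 1))
```

The frame-level clamping is what distinguishes SegSNR from a global SNR: a few very good or very bad frames cannot dominate the average.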
DAE-MFCC (deep autoencoder based on MFCC features)
- Feng, Zhang & Glass (2014). Dataset: CHiME-2. Metric: WER. Results: error rate of 34%.
- Lu et al. (2013). Dataset: Japanese corpus + environmental noises. Metric: PESQ. Results: average PESQ is 3.13 for factory noise and 4.08 for car noise.
Advantages: the DAE performs dimensionality reduction, and the features at its bottleneck layer can be useful.
Disadvantages: a DNN-based DAE struggles to learn temporal information.
RNN-LSTM (recurrent neural network with long short-term memory)
- Gao et al. (2018). Metrics: SDR, STOI. Results: STOI of 0.86 and SDR of 9.46 on average.
- Weninger et al. (2013). Dataset: CHiME-2. Metrics: WA, WER. Results: average accuracy is 85%.
- Wollmer et al. (2013). Dataset: Buckeye (spontaneous speech) + CHiME noises. Metric: WA. Results: average WA using BN-BLSTM is 43.55%.
- Maas et al. (2012). Dataset: AURORA-2. Metrics: MSE, WER. Results: the average error rate (SNR 0–20 dB) is 10.28% for seen noise and 12.90% for unseen noise.
- Wang & Wang (2019). Dataset: CHiME-2 + environmental noises. Metric: WER. Results: magnitude features provide the best average error rate of 7.8% (accuracy of 92.2%).
- Park & Lee (2017). Dataset: TIMIT + environmental noises. Metrics: PESQ, STOI, SDR. Results: CNN outperformed DNN and RNN in accuracy, with PESQ 2.34, STOI 0.83, and SDR 8.62.
- Plantinga, Bagchi & Fosler-Lussier (2019). Dataset: CHiME-2. Metric: WER. Results: using ResNet and mimic loss, a WER of 9.3% is achieved.
- Rownicka, Bell & Renals (2020). Dataset: AMI and Aurora-4. Metric: WER. Results: 8.31% WER on Aurora-4.
- Pandey & Wang (2019). Dataset: NOISEX + TIMIT + SSN. Metrics: STOI, PESQ, SI-SDR. Results: the autoencoder CNN performed better than SEGAN.
- Germain, Chen & Koltun (2019). Dataset: Voice Bank + DEMAND. Metrics: SNR, SIG, BAK, OVL. Results: SNR 19.00, SIG 3.86, BAK 3.33, OVL 3.22.
- Fu et al. (2018). Dataset: TIMIT + environmental noises. Metrics: PESQ, STOI. Results: the fully convolutional network yields the best STOI, while the DNN achieves the best PESQ.
- Donahue, Li & Prabhavalkar (2018). Dataset: WSJ + environmental and music noise. Metric: WER. Results: 17.6% WER.
- Baby & Verhulst (2019). Dataset: Voice Bank + DEMAND. Metrics: STOI, PESQ, SegSNR. Results: PESQ 2.62, SegSNR 17.68, STOI 0.942.
Advantages:
- Best suited to sequential data such as speech signals.
- RNN-LSTM can exploit contextual information.
Disadvantages: learning the RNN parameters is known to be challenging and time-consuming.
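Word error rate (WER), used by several of the rows above, is the word-level Levenshtein (edit) distance between the recognizer output and the reference transcript, divided by the number of reference words. A minimal sketch (the example sentences are invented for illustration):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words: WER = 1/6.
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))
```

Note that WER can exceed 100% when the recognizer inserts more words than the reference contains.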
CNN (convolutional neural network)
- Ochiai, Delcroix & Nakatani (2020). Datasets: CHiME-4, Aurora-4. Metrics: WER, SDR. Results: CHiME-4: SDR 14.24; Aurora-4: 6.3%; WER 8.3% (real data) and 10.8% (simulated).
- Xu, Elshamy & Fingscheidt (2020). Dataset: Grid corpus + CHiME-3 noises. Metrics: PESQ, STOI. Results: for seen noises, PESQ is 2.60 and STOI is 0.70; for unseen noises, they are 2.63 and 0.74.
- Choi et al. (2019). Dataset: Voice Bank + DEMAND. Metrics: PESQ, CSIG, CBAK, COVL, SSNR. Results: PESQ 3.24, CSIG 4.34, CBAK 4.10, COVL 3.81, SSNR 16.85.
- Soleymanpour et al. (2023). Dataset: babble noise. Metrics: PESQ, STOI. Results: PESQ ranges from 1.35 to 1.78 at −8 dB to 0 dB; STOI is 0.56.
- Saleem et al. (2023). Dataset: VoiceBank-DEMAND corpus + LibriSpeech. Metrics: PESQ, STOI. Results: PESQ is 2.28, STOI is 84.5%.
Advantages:
- CNN can detect patterns in neighbouring speech structures.
- CNN is more effective than RNN and standard DNN.
Disadvantages: inability to maintain invariance when the input data changes.
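SDR and SI-SDR compare the energy of the target component with the energy of everything else in the estimate; the scale-invariant variant (SI-SDR, reported by Pandey & Wang above) first projects the estimate onto the reference, so a simple gain change does not affect the score. A minimal sketch with toy signals (not data from the cited experiments):

```python
import math

def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy                  # optimal gain for the target term
    target = [scale * r for r in reference]   # component explained by the reference
    noise = [e - t for e, t in zip(estimate, target)]
    return 10 * math.log10(sum(t * t for t in target) /
                           sum(n * n for n in noise))

reference = [math.sin(0.1 * n) for n in range(1000)]
estimate = [r + 0.01 * (-1) ** n for n, r in enumerate(reference)]
print(round(si_sdr(reference, estimate), 1))
# Doubling the estimate's gain leaves SI-SDR unchanged:
print(round(si_sdr(reference, [2 * e for e in estimate]), 1))
```

Plain SDR, by contrast, is computed against the unscaled reference, so it penalizes gain mismatches that SI-SDR deliberately ignores.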
GAN (generative adversarial network)
- Soni, Shah & Patil (2018). Dataset: Voice Bank + DEMAND. Metrics: PESQ, CSIG, CBAK, MOS, STOI. Results: PESQ 2.53, SIG 3.80, BAK 3.12, MOS 3.14, STOI 0.93.
Advantages: if the GAN is trained correctly, its combined generator and discriminator networks can be very strong.
Disadvantages: adversarial training is typically challenging and unstable.
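The DNN drawback noted at the top of the table (every node connected to every node in the preceding layer) can be made concrete by counting weights and biases; the layer sizes below are illustrative, not taken from any cited system:

```python
def dense_params(layer_sizes):
    """Total weights + biases of a fully connected (dense) network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# A 257-bin spectrogram frame mapped through three 1024-unit hidden layers
# back to 257 outputs: about 2.6 million parameters.
print(dense_params([257, 1024, 1024, 1024, 257]))
```

A convolutional layer, sharing a small kernel across all positions, needs orders of magnitude fewer parameters for the same input size, which is one reason the table finds CNNs more effective than standard DNNs.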