Table 1. Summary of deep learning-based algorithms used for single-channel and multi-channel speech enhancement systems in filtering different types of noises.
| Deep learning method | References | Dataset | Evaluation metrics | Results | Advantages/Disadvantages |
|---|---|---|---|---|---|
| DNN (Deep Neural Network) | Zhao et al. (2018) | NOISEX and IEEE corpus | SDR, PESQ, and STOI | Averaged over mismatched SNRs (−3 to 3 dB): PESQ 1.99, SDR 11.35, STOI 90.61%. | Advantages: the architecture is simple and easy to understand. Disadvantages: DNNs have a relatively large number of parameters, since every node in each layer is connected to every node in the preceding layer. |
| | Karjol, Kumar & Ghosh (2018) | TIMIT + noises from AURORA dataset | STOI, SegSNR, and PESQ | For seen noise, the average best PESQ is 2.65, whereas for unseen noise it is 2.19. | |
| | Saleem & Khattak (2020) | Environmental noises | SegSNR, PESQ, LLR, and STOI | PESQ 2.27, SegSNR 4.24, LLR 0.53, STOI 84%. | |
| Deep autoencoder based on MFCC (DAE-MFCC) | Feng, Zhang & Glass (2014) | CHiME-2 | WER | Error rate of 34%. | Advantages: DAEs perform dimensionality reduction, and the bottleneck layer's features can be useful. Disadvantages: DNN-based DAEs struggle to learn temporal information. |
| | Lu et al. (2013) | Japanese corpus + environmental noises | PESQ | Average PESQ for factory noise is 3.13, whereas it is 4.08 for car noise. | |
| Recurrent neural network-long short-term memory (RNN-LSTM) | Gao et al. (2018) | | SDR, STOI | STOI 0.86 and SDR 9.46 on average. | Advantages: well suited to sequential data such as speech signals, and able to exploit contextual information. Disadvantages: learning RNN parameters is known to be difficult and time-consuming. |
| | Weninger et al. (2013) | CHiME-2 | WA, WER | Average accuracy is 85%. | |
| | Wollmer et al. (2013) | Buckeye (spontaneous speech) + CHiME noises | WA | Average WA using BN-BLSTM: 43.55%. | |
| | Maas et al. (2012) | AURORA-2 | MSE and WER | The average error rate (SNR 0–20 dB) is 10.28% for seen noise and 12.90% for unseen noise. | |
| | Wang & Wang (2019) | CHiME-2 + environmental noises | WER | Magnitude features give the best average error rate of 7.8% (accuracy of 92.2%). | |
| | Park & Lee (2017) | TIMIT + environmental noises | PESQ, STOI, SDR | CNN outperformed DNN and RNN in terms of accuracy, with PESQ 2.34, STOI 0.83, and SDR 8.62. | |
| | Plantinga, Bagchi & Fosler-Lussier (2019) | CHiME-2 | WER | Using ResNet and mimic loss, a WER of 9.3% is achieved. | |
| | Rownicka, Bell & Renals (2020) | AMI and Aurora-4 | WER | 8.31% WER on Aurora-4. | |
| | Pandey & Wang (2019) | NOISEX + TIMIT + SSN | STOI, PESQ, and SI-SDR | Results indicate that the autoencoder CNN performed better than SEGAN. | |
| | Germain, Chen & Koltun (2019) | Voice Bank + DEMAND | SNR, SIG, BAK, OVL | SNR 19.00, SIG 3.86, BAK 3.33, OVL 3.22. | |
| | Fu et al. (2018) | TIMIT + environmental noises | PESQ, STOI | The fully convolutional network yields the best STOI, while the DNN achieves the best PESQ. | |
| | Donahue, Li & Prabhavalkar (2018) | WSJ + environmental and music noise | WER | 17.6% WER. | |
| | Baby & Verhulst (2019) | Voice Bank + DEMAND | STOI, PESQ, SegSNR | PESQ 2.62, SegSNR 17.68, STOI 0.942. | |
| CNN (Convolutional neural network) | Ochiai, Delcroix & Nakatani (2020) | CHiME-4, Aurora-4 | WER, SDR | CHiME-4: SDR 14.24, WER 8.3% (real data) and 10.8% (simulated); Aurora-4: WER 6.3%. | Advantages: CNNs can detect patterns in neighbouring speech structures, and are more efficient than RNNs and standard DNNs. Disadvantages: CNNs may fail to remain invariant when the input data changes. |
| | Xu, Elshamy & Fingscheidt (2020) | Grid corpus + CHiME-3 noises | PESQ, STOI | For seen noises, PESQ is 2.60 and STOI is 0.70; for unseen noises, they are 2.63 and 0.74. | |
| | Choi et al. (2019) | Voice Bank + DEMAND | PESQ, CSIG, CBAK, COVL, SSNR | PESQ 3.24, CSIG 4.34, CBAK 4.10, COVL 3.81, SSNR 16.85. | |
| | Soleymanpour et al. (2023) | Babble noise | PESQ, STOI | PESQ ranges from 1.35 to 1.78 at −8 dB to 0 dB, and STOI is 0.56. | |
| | Saleem et al. (2023) | VoiceBank-DEMAND corpus + LibriSpeech | PESQ, STOI | PESQ is 2.28, STOI is 84.5%. | |
| GAN (generative adversarial network) | Soni, Shah & Patil (2018) | Voice Bank + DEMAND | PESQ, CSIG, CBAK, MOS, STOI | PESQ 2.53, CSIG 3.80, CBAK 3.12, MOS 3.14, STOI 0.93. | Advantages: when trained correctly, the combined networks of a GAN can be very powerful. Disadvantages: adversarial training is typically challenging and unstable. |
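Two of the evaluation metrics tabulated above, SegSNR and WER, are simple enough to illustrate directly. The sketch below is a minimal reference implementation, not the code used by any of the cited papers; the frame length (256 samples) and the per-frame SNR clamping range (−10 to 35 dB) are common conventions assumed here for illustration.

```python
import numpy as np

def seg_snr(clean, enhanced, frame_len=256, floor=-10.0, ceil=35.0):
    """Segmental SNR: average per-frame SNR in dB, with each frame's
    SNR clamped to [floor, ceil] before averaging (assumed convention)."""
    n_frames = min(len(clean), len(enhanced)) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        noise = s - e                       # residual error in this frame
        ratio = np.sum(s**2) / (np.sum(noise**2) + 1e-12)
        snrs.append(np.clip(10 * np.log10(ratio + 1e-12), floor, ceil))
    return float(np.mean(snrs))

def wer(ref_words, hyp_words):
    """Word error rate: Levenshtein distance between word sequences,
    normalised by the reference length."""
    d = np.zeros((len(ref_words) + 1, len(hyp_words) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref_words) + 1)   # deletions only
    d[0, :] = np.arange(len(hyp_words) + 1)   # insertions only
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            sub = d[i - 1, j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / len(ref_words)

# Toy usage: one deletion ("sat") and one substitution (mat -> hat)
# against a 6-word reference gives WER = 2/6.
print(wer("the cat sat on the mat".split(), "the cat on the hat".split()))
```

PESQ and STOI, by contrast, involve perceptual auditory models and are normally computed with standard implementations rather than written from scratch.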