PLoS One. 2022 Aug 12;17(8):e0266467. doi: 10.1371/journal.pone.0266467

Table 2. A literature review of data augmentation techniques for audio classification.

| Author and Year | Purpose | Data Augmentation Technique(s) | Input to Augmentation Technique | Results / Impact of Augmentation on Classification Performance |
|---|---|---|---|---|
| [28] | Environmental sound classification | Time stretching, pitch shifting, dynamic range compression, background noise addition | Log-mel spectrogram | Accuracy of the proposed CNN (SB-CNN) increased from 73% (before augmentation) to 79% (after augmentation). |
| [32] | Speech recognition | Mixup augmentation | Normalized spectrogram | A VGG-11 model trained with mixup augmentation achieved a lower classification error than the same model trained with empirical risk minimization alone. |
| [33] | Speech recognition | Variational autoencoder | Discrete Fourier transform | All four proposed classification models, evaluated by word error rate (WER), suffered an increase in WER after augmentation. |
| [29] | Speech recognition | SpecAugment | Log-mel spectrogram | Without a language model, Listen, Attend and Spell (LAS) obtained a WER of 2.8 with augmentation versus 4.1 without augmentation. |
| [34] | Acoustic scene classification | Spectrogram rolling and mixup | Mel-frequency cepstral coefficients | Mean accuracy of the ResNet model increased from 80.97% before augmentation to 82.85% after augmentation. |
| [35] | Monaural singing voice separation | Variational autoencoder-generative adversarial network (VAE-GAN) | Short-time Fourier transform | Separation quality of a deep recurrent neural network (RNN) and the VAE-GAN was compared using source-to-interference (SIR), source-to-artifacts (SAR), and source-to-distortion (SDR) ratios, where higher values indicate better separation. The VAE-GAN achieved higher SDR and SAR, whereas the RNN achieved a higher SIR. |
| [31] | Environmental sound classification | WaveGAN | Raw audio | Baseline accuracy was 94.84%; after applying the GAN, accuracy reached 97.03%. |
| [36] | Animal audio classification | Signal speed scaling, pitch shift, volume increase/decrease, random noise addition, time shift | Raw audio | Mean accuracy of VGG19 on the cat dataset was 83.05% without augmentation and 85.59% with augmentation. |
| [36] | Animal audio classification | Pitch shift, time shift, summing two spectrograms from the same class, random cropping followed by cutting the spectrogram into 10 temporal slices and applying a function to each, and time shift with a randomly picked shift T | Mel spectrogram | Mean accuracy of VGG19 on the cat dataset was 83.05% without augmentation and 90.68% with augmentation. |
| [30] | Abnormal respiratory sound detection | Convolutional variational autoencoder | Mel spectrogram | Upon augmentation, specificity increased from 0.286 to 0.986, sensitivity from 0.888 to 0.988, and F-score from 0.349 to 0.900. |
| [37] | Acoustic scene classification | Zero-value masking | Log-mel spectrogram | Accuracy on the DCASE 2018 dataset: 76.2% |
| [37] | Acoustic scene classification | Mini-batch based mixture masking | Log-mel spectrogram | Accuracy on the DCASE 2018 dataset: 77.0% |
| [37] | Acoustic scene classification | Mini-batch based cutting masking | Log-mel spectrogram | Accuracy on the DCASE 2018 dataset: 76.9% |
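The waveform-domain techniques surveyed above (e.g. the signal speed scaling and random noise addition used by [36]) can be sketched in plain NumPy. The function names and parameter defaults below are illustrative, not taken from the cited papers; production pipelines typically use phase-vocoder based tools (which preserve pitch during time stretching) rather than the naive resampling shown here, which changes pitch and tempo together:

```python
import numpy as np

def speed_scale(signal, rate):
    """Rescale playback speed by linear-interpolation resampling.

    rate > 1 shortens the signal (faster), rate < 1 lengthens it.
    Note: naive resampling shifts pitch along with tempo.
    """
    n_out = int(round(len(signal) / rate))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)

def add_noise(signal, noise_level=0.005, rng=None):
    """Add white Gaussian noise scaled by an illustrative amplitude."""
    rng = rng or np.random.default_rng()
    return signal + noise_level * rng.standard_normal(len(signal))

# Toy example: a 100-sample sine wave, sped up 2x and corrupted with noise.
x = np.sin(np.linspace(0, 4 * np.pi, 100))
fast = speed_scale(x, 2.0)    # 50 samples
noisy = add_noise(x, rng=np.random.default_rng(0))
```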
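Mixup, used by [32] and [34], trains on convex combinations of pairs of examples and their one-hot labels, with the mixing weight drawn from a Beta distribution. A minimal NumPy sketch; the helper name and the `alpha` default are my own choices for illustration, not settings from the cited work:

```python
import numpy as np

def mixup(x1, x2, y1, y2, alpha=0.2, rng=None):
    """Mixup: blend two inputs and their one-hot labels with weight
    lam ~ Beta(alpha, alpha), producing a new soft-labeled example."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# Toy "spectrograms" (constant 1s vs. 0s) with one-hot class labels.
s1, s2 = np.ones((4, 8)), np.zeros((4, 8))
l1, l2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(s1, s2, l1, l2, rng=np.random.default_rng(0))
```

Because the label is blended with the same weight as the input, the soft label always sums to 1 and reflects exactly how much of each source example the mixed input contains.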
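The zero-value masking evaluated by [37] operates in the same spirit as the time and frequency masking of SpecAugment [29]: random bands of a (log-mel) spectrogram are zeroed out. A minimal NumPy sketch with illustrative mask-width defaults rather than the papers' actual settings:

```python
import numpy as np

def spec_mask(spec, max_freq=8, max_time=10, rng=None):
    """Zero-value masking: zero one random frequency band and one
    random time band of a (n_mels, n_frames) spectrogram."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    # Frequency mask: zero f consecutive mel bins starting at f0.
    f = rng.integers(0, max_freq + 1)
    f0 = rng.integers(0, n_mels - f + 1)
    out[f0:f0 + f, :] = 0.0
    # Time mask: zero t consecutive frames starting at t0.
    t = rng.integers(0, max_time + 1)
    t0 = rng.integers(0, n_frames - t + 1)
    out[:, t0:t0 + t] = 0.0
    return out

spec = np.random.default_rng(0).random((40, 100))   # toy spectrogram
aug = spec_mask(spec, rng=np.random.default_rng(1))
```

Every entry of the augmented spectrogram is either unchanged or zeroed, which is what makes this a masking (rather than mixing) augmentation.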