Table 2. A literature review of data augmentation techniques for audio classification.
Author and Year | Purpose | Data Augmentation Technique(s) | Input to the augmentation technique | Results / Impact of augmentation on classification performance
---|---|---|---|---
[28] | Environmental Sound Classification | Time Stretching, Pitch Shifting, Dynamic Range Compression and Background Noise Addition (see the waveform-augmentation sketch after the table) | Log-Mel Spectrogram | The accuracy of the proposed CNN (SB-CNN) increased from 73% (before augmentation) to 79% (after augmentation)
[32] | Speech Recognition | Mixup Augmentation (a minimal mixup sketch follows the table) | Normalized Spectrogram | The authors compared a VGG-11 model trained with empirical risk minimization against one trained with mixup augmentation and observed a lower classification error with mixup.
[33] | Speech Recognition | Variational Autoencoder | Discrete Fourier Transform | The authors proposed four classification models and evaluated them using Word Error Rate (WER); however, all four models showed an increase in WER after augmentation.
[29] | Speech Recognition | SpecAugment (see the masking sketch after the table) | Log Mel Spectrogram | Listen, Attend and Spell (LAS) obtained a WER of 2.8 with augmentation and without a language model, whereas it obtained a WER of 4.1 without augmentation
[34] | Acoustic Scene Classification | Spectrogram Rolling and Mixup | Mel Frequency Cepstral Coefficients (MFCC) | The mean accuracy obtained by the ResNet model is 80.97% before augmentation and 82.85% after augmentation
[35] | Monaural Singing Voice Separation | Variational Autoencoder-Generative Adversarial Network (VAE-GAN) | Short Time Fourier Transform | The authors used Source to Interference Ratio (SIR), Source to Artifacts Ratio (SAR) and Source to Distortion Ratio (SDR) to evaluate the separation quality of a deep recurrent neural network (RNN) and the VAE-GAN, where higher values indicate better separation quality. The results revealed that the VAE-GAN had a higher SDR and SAR, whereas the RNN had a higher SIR.
[31] | Environmental Sound Classification | WaveGAN | Raw audio | The baseline method achieved an accuracy of 94.84%, whereas the accuracy after applying WaveGAN-based augmentation was 97.03%
[36] | Animal Audio Classification | Signal Speed Scaling, Pitch Shift, Volume Increase/Decrease, Random Noise Addition and Time Shift | Raw audio | The mean accuracy obtained by VGG19 on the cat dataset is 83.05% without augmentation and 85.59% with augmentation
[36] | Animal Audio Classification | Pitch shift, time shift, summing two spectrograms from the same class, random cropping followed by cutting the spectrogram into 10 different temporal slices and applying a function to each, and time shift by randomly picking the shift T | Mel Spectrogram | The mean accuracy obtained by VGG19 on the cat dataset is 83.05% without augmentation and 90.68% with augmentation
[30] | Abnormal Respiratory Sounds Detection | Convolutional Variational Autoencoder | Mel Spectrogram | Upon augmentation, the specificity, sensitivity and F-score of the respiratory sound classification model increased from 0.286 to 0.986, from 0.888 to 0.988, and from 0.349 to 0.900, respectively.
[37] | Acoustic Scene Classification | Zero-value masking, mini-batch based mixture masking and mini-batch based cutting masking (see the masking sketch after the table) | Log Mel Spectrogram | On the DCASE 2018 dataset, the accuracy is 76.2% with zero-value masking, 77.0% with mixture masking and 76.9% with cutting masking
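Several of the entries above ([28], [36]) rely on the same family of signal-level transforms applied to the raw waveform. The following is a minimal sketch of how such transforms can be implemented with librosa and NumPy; the parameter ranges (stretch rate, semitone steps, noise level, gain) are illustrative assumptions, not the settings used in the cited studies.

```python
import numpy as np
import librosa

def augment_waveform(y, sr, rng=None):
    """Return independently augmented copies of a mono waveform y.
    Parameter ranges are illustrative, not taken from the cited papers."""
    if rng is None:
        rng = np.random.default_rng()
    out = {}
    # Time stretching: speed the signal up or slow it down without changing pitch.
    out["time_stretch"] = librosa.effects.time_stretch(y, rate=rng.uniform(0.8, 1.2))
    # Pitch shifting: move the pitch up or down by a fractional number of semitones.
    out["pitch_shift"] = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))
    # Background noise addition: mix in low-amplitude white noise.
    out["add_noise"] = y + 0.005 * rng.standard_normal(len(y))
    # Volume increase/decrease: scale the overall gain.
    out["volume"] = y * rng.uniform(0.5, 1.5)
    # Time shifting: rotate the waveform by a random offset.
    out["time_shift"] = np.roll(y, rng.integers(0, len(y)))
    return out

# Minimal usage on a synthetic 440 Hz tone (stands in for a real recording).
sr = 22050
y = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
for name, y_aug in augment_waveform(y, sr).items():
    print(name, y_aug.shape)
```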
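Mixup, used in [32] and [34], interpolates both the inputs and their one-hot labels with a weight lambda drawn from a Beta(alpha, alpha) distribution, so the model is trained on convex combinations of examples rather than the raw examples alone. A minimal NumPy sketch follows; the default alpha = 0.2 is an illustrative choice, not a value reported in the cited papers.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two (input, one-hot label) pairs with a Beta(alpha, alpha) weight:
    x = lam * x1 + (1 - lam) * x2, and likewise for the labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Usage on two random "spectrograms" with 10-class one-hot labels.
rng = np.random.default_rng(0)
x1, x2 = rng.random((64, 128)), rng.random((64, 128))
y1, y2 = np.eye(10)[3], np.eye(10)[7]
x_mix, y_mix = mixup(x1, y1, x2, y2, rng=rng)
print(x_mix.shape, y_mix.round(2))
```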
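SpecAugment [29] and the zero-value masking of [37] both zero out randomly chosen frequency bands and time spans of a log-mel spectrogram. The sketch below covers only this masking step (SpecAugment additionally applies time warping, which is omitted here); the mask counts and maximum widths are illustrative assumptions.

```python
import numpy as np

def mask_spectrogram(spec, n_freq_masks=2, n_time_masks=2,
                     max_freq_width=8, max_time_width=16, rng=None):
    """Zero out random frequency bands and time spans of a (mels x frames)
    spectrogram, as in SpecAugment-style frequency/time masking."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    for _ in range(n_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, n_mels - width + 1))
        spec[start:start + width, :] = 0.0  # mask a band of mel channels
    for _ in range(n_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, n_frames - width + 1))
        spec[:, start:start + width] = 0.0  # mask a span of time frames
    return spec

# Usage on a random stand-in for a log-mel spectrogram.
spec = np.random.default_rng(0).random((64, 128))
masked = mask_spectrogram(spec)
print(f"fraction of bins masked: {(masked == 0).mean():.2f}")
```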