Table 2. A literature review of data augmentation techniques for audio classification.
Author and Year | Purpose | Data Augmentation Technique(s) | Input to the augmentation technique | Results / Impact of augmentation on classification performance
---|---|---|---|---
[28] | Environmental Sound Classification | Time Stretching, Pitch Shifting, Dynamic Range Compression and Background Noise Addition (see the waveform-augmentation sketch after the table) | Log-Mel Spectrogram | The accuracy of the proposed CNN (SB-CNN) increased from 73% (before augmentation) to 79% (after augmentation)
[32] | Speech Recognition | Mixup Augmentation (a minimal mixup sketch follows the table) | Normalized Spectrogram | The authors compared a VGG-11 model trained with empirical risk minimization against one trained with mixup augmentation and observed a lower classification error with mixup.
[33] | Speech Recognition | Variational Autoencoder | Discrete Fourier Transform | The authors proposed four classification models and evaluated them using Word Error Rate (WER); however, all four models showed an increase in WER after augmentation.
[29] | Speech Recognition | SpecAugment (see the masking sketch after the table) | Log Mel Spectrogram | Listen, Attend and Spell (LAS) obtained a WER of 2.8 with augmentation and without a language model, whereas it obtained a WER of 4.1 without augmentation
[34] | Acoustic Scene Classification | Spectrogram Rolling and Mixup | Mel Frequency Cepstral Coefficients (MFCC) | The mean accuracy obtained by the ResNet model is 80.97% before augmentation and 82.85% after augmentation
[35] | Monaural Singing Voice Separation | Variational Autoencoder-Generative Adversarial Network (VAE-GAN) | Short Time Fourier Transform | The authors used Source to Interference Ratio (SIR), Source to Artifacts Ratio (SAR) and Source to Distortion Ratio (SDR) to evaluate the separation quality of a deep recurrent neural network (RNN) and the VAE-GAN, where higher values indicate better separation quality. The results revealed that the VAE-GAN had a higher SDR and SAR, whereas the RNN had a higher SIR.
[31] | Environmental Sound Classification | WaveGAN | Raw audio | The baseline method achieved an accuracy of 94.84%, whereas the accuracy after applying WaveGAN-based augmentation was 97.03%
[36] | Animal Audio Classification | Signal Speed Scaling, Pitch Shift, Volume Increase/Decrease, Random Noise Addition and Time Shift | Raw audio | The mean accuracy obtained by VGG19 on the cat dataset is 83.05% without augmentation and 85.59% with augmentation
[36] | Animal Audio Classification | Pitch shift, time shift, summing two spectrograms from the same class, random cropping followed by cutting the spectrogram into 10 different temporal slices and applying a function to each, and time shift by randomly picking the shift T | Mel Spectrogram | The mean accuracy obtained by VGG19 on the cat dataset is 83.05% without augmentation and 90.68% with augmentation
[30] | Abnormal Respiratory Sounds Detection | Convolutional Variational Autoencoder | Mel Spectrogram | Upon augmentation, the specificity, sensitivity and F-score of the respiratory sound classification model increased from 0.286 to 0.986, from 0.888 to 0.988, and from 0.349 to 0.900, respectively.
[37] | Acoustic Scene Classification | Zero-value masking, mini-batch based mixture masking and mini-batch based cutting masking (see the masking sketch after the table) | Log Mel Spectrogram | On the DCASE 2018 dataset, the accuracy is 76.2% with zero-value masking, 77.0% with mixture masking and 76.9% with cutting masking
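Several of the entries above ([28], [36]) rely on the same family of signal-level transforms applied to the raw waveform. The following is a minimal sketch of how such transforms can be implemented with librosa and NumPy; the parameter ranges (stretch rate, semitone steps, noise level, gain) are illustrative assumptions, not the settings used in the cited studies.

```python
import numpy as np
import librosa

def augment_waveform(y, sr, rng=None):
    """Return independently augmented copies of a mono waveform y.
    Parameter ranges are illustrative, not taken from the cited papers."""
    if rng is None:
        rng = np.random.default_rng()
    out = {}
    # Time stretching: speed the signal up or slow it down without changing pitch.
    out["time_stretch"] = librosa.effects.time_stretch(y, rate=rng.uniform(0.8, 1.2))
    # Pitch shifting: move the pitch up or down by a fractional number of semitones.
    out["pitch_shift"] = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))
    # Background noise addition: mix in low-amplitude white noise.
    out["add_noise"] = y + 0.005 * rng.standard_normal(len(y))
    # Volume increase/decrease: scale the overall gain.
    out["volume"] = y * rng.uniform(0.5, 1.5)
    # Time shifting: rotate the waveform by a random offset.
    out["time_shift"] = np.roll(y, rng.integers(0, len(y)))
    return out

# Minimal usage on a synthetic 440 Hz tone (stands in for a real recording).
sr = 22050
y = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
for name, y_aug in augment_waveform(y, sr).items():
    print(name, y_aug.shape)
```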
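Mixup, used in [32] and [34], interpolates both the inputs and their one-hot labels with a weight lambda drawn from a Beta(alpha, alpha) distribution, so the model is trained on convex combinations of examples rather than the raw examples alone. A minimal NumPy sketch follows; the default alpha = 0.2 is an illustrative choice, not a value reported in the cited papers.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two (input, one-hot label) pairs with a Beta(alpha, alpha) weight:
    x = lam * x1 + (1 - lam) * x2, and likewise for the labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Usage on two random "spectrograms" with 10-class one-hot labels.
rng = np.random.default_rng(0)
x1, x2 = rng.random((64, 128)), rng.random((64, 128))
y1, y2 = np.eye(10)[3], np.eye(10)[7]
x_mix, y_mix = mixup(x1, y1, x2, y2, rng=rng)
print(x_mix.shape, y_mix.round(2))
```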
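SpecAugment [29] and the zero-value masking of [37] both zero out randomly chosen frequency bands and time spans of a log-mel spectrogram. The sketch below covers only this masking step (SpecAugment additionally applies time warping, which is omitted here); the mask counts and maximum widths are illustrative assumptions.

```python
import numpy as np

def mask_spectrogram(spec, n_freq_masks=2, n_time_masks=2,
                     max_freq_width=8, max_time_width=16, rng=None):
    """Zero out random frequency bands and time spans of a (mels x frames)
    spectrogram, as in SpecAugment-style frequency/time masking."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    for _ in range(n_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, n_mels - width + 1))
        spec[start:start + width, :] = 0.0  # mask a band of mel channels
    for _ in range(n_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, n_frames - width + 1))
        spec[:, start:start + width] = 0.0  # mask a span of time frames
    return spec

# Usage on a random stand-in for a log-mel spectrogram.
spec = np.random.default_rng(0).random((64, 128))
masked = mask_spectrogram(spec)
print(f"fraction of bins masked: {(masked == 0).mean():.2f}")
```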