Table 4.
Comparison of deep learning architectures for audio classification on the MAD dataset, reporting accuracy when initialized from pretrained weights and when trained from scratch.
| architecture | parameters | pretrain | accuracy (%), pretrained weights | accuracy (%), scratch |
|---|---|---|---|---|
| ResNet18 | 11.7M | ImageNet | 88.04±0.59 | 86.83±0.38 |
| ResNet34 | 21.8M | ImageNet | 88.27±0.29 | 87.07±0.74 |
| ResNet50 | 25.6M | ImageNet | 88.24±0.41 | 86.51±0.38 |
| ResNet101 | 44.7M | ImageNet | 88.62±0.31 | 86.86±0.60 |
| EfficientNet-B0 | 5.3M | ImageNet | 88.92±0.17 | 87.32±1.00 |
| EfficientNet-B1 | 7.8M | ImageNet | 88.85±0.28 | 87.71±0.33 |
| EfficientNet-B2 | 9.2M | ImageNet | 89.11±0.64 | 88.02±0.27 |
| CNN6 | 4.8M | AudioSet | 89.53±0.26 | 87.57±0.81 |
| CNN10 | 5.2M | AudioSet | 90.76±0.47 | |
| CNN14 | 80.7M | AudioSet | 89.47±0.22 | |
| AST | 87.7M | ImageNet + AudioSet | 90.03±0.23 | 81.17±0.77 |
| AST-Patch-Mix | | ImageNet + AudioSet | 91.07±0.19 | 82.20±1.13 |