2024 Jun 22;11:668. doi: 10.1038/s41597-024-03511-w

Table 4.

A comprehensive comparison of various deep learning-based architectures was conducted to assess their performance on the MAD dataset for audio classification.

| architecture | parameters | pretrain | accuracy (%), pretrained weights | accuracy (%), scratch |
|---|---|---|---|---|
| ResNet18 | 11.7M | ImageNet | 88.04±0.59 | 86.83±0.38 |
| ResNet34 | 21.8M | ImageNet | 88.27±0.29 | 87.07±0.74 |
| ResNet50 | 25.6M | ImageNet | 88.24±0.41 | 86.51±0.38 |
| ResNet101 | 44.7M | ImageNet | 88.62±0.31 | 86.86±0.60 |
| EfficientNet-B0 | 5.3M | ImageNet | 88.92±0.17 | 87.32±1.00 |
| EfficientNet-B1 | 7.8M | ImageNet | 88.85±0.28 | 87.71±0.33 |
| EfficientNet-B2 | 9.2M | ImageNet | 89.11±0.64 | 88.02±0.27 |
| CNN6 | 4.8M | AudioSet | 89.53±0.26 | 87.57±0.81 |
| CNN10 | 5.2M | AudioSet | 90.76±0.47 | _89.03±0.37_ |
| CNN14 | 80.7M | AudioSet | _90.97±0.34_ | **89.47±0.22** |
| AST | 87.7M | ImageNet + AudioSet | 90.03±0.23 | 81.17±0.77 |
| AST-Patch-Mix | | ImageNet + AudioSet | **91.07±0.19** | 82.20±1.13 |

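Models initialized with weights pretrained on the ImageNet [42] and AudioSet [29] datasets are reported under "pretrained weights"; models trained without pretrained weights are reported under "scratch". In each accuracy column, the best result (91.07±0.19 with pretrained weights, 89.47±0.22 from scratch) and the second-best result (90.97±0.34 with pretrained weights, 89.03±0.37 from scratch) are highlighted.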
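
The ranking and the effect of pretraining can be checked directly from the reported means. The following is a minimal sketch, not code from the paper: the names `RESULTS`, `rank`, and `pretraining_gain` are illustrative assumptions, the values are the mean accuracies copied from Table 4, and the ± spreads are ignored.

```python
# Mean accuracies (%) copied from Table 4 as (pretrained weights, scratch).
# The ± spreads reported in the table are deliberately left out of this sketch.
RESULTS = {
    "ResNet18": (88.04, 86.83),
    "ResNet34": (88.27, 87.07),
    "ResNet50": (88.24, 86.51),
    "ResNet101": (88.62, 86.86),
    "EfficientNet-B0": (88.92, 87.32),
    "EfficientNet-B1": (88.85, 87.71),
    "EfficientNet-B2": (89.11, 88.02),
    "CNN6": (89.53, 87.57),
    "CNN10": (90.76, 89.03),
    "CNN14": (90.97, 89.47),
    "AST": (90.03, 81.17),
    "AST-Patch-Mix": (91.07, 82.20),
}

PRETRAINED, SCRATCH = 0, 1  # column indices into the tuples above


def rank(column):
    """Return architecture names sorted by mean accuracy, highest first."""
    return sorted(RESULTS, key=lambda name: RESULTS[name][column], reverse=True)


def pretraining_gain(name):
    """Return pretrained minus scratch accuracy, in percentage points."""
    pretrained, scratch = RESULTS[name]
    return round(pretrained - scratch, 2)


if __name__ == "__main__":
    print("best two, pretrained:", rank(PRETRAINED)[:2])
    print("best two, scratch:", rank(SCRATCH)[:2])
    for name in RESULTS:
        print(f"{name}: +{pretraining_gain(name):.2f} points from pretraining")
```

Under these assumptions, the two transformer-based rows (AST and AST-Patch-Mix) show by far the largest gap between pretrained and scratch accuracy, while the convolutional models differ by roughly one to two points.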