2024 Jun 22;11:668. doi: 10.1038/s41597-024-03511-w

Table 4.

A comprehensive comparison of various deep learning-based architectures was conducted to assess their performance on the MAD dataset for audio classification.

architecture	parameters	pretraining data	accuracy (%), pretrained weights	accuracy (%), scratch
ResNet18	11.7M	ImageNet	88.04 ± 0.59	86.83 ± 0.38
ResNet34	21.8M	ImageNet	88.27 ± 0.29	87.07 ± 0.74
ResNet50	25.6M	ImageNet	88.24 ± 0.41	86.51 ± 0.38
ResNet101	44.7M	ImageNet	88.62 ± 0.31	86.86 ± 0.60
EfficientNet-B0	5.3M	ImageNet	88.92 ± 0.17	87.32 ± 1.00
EfficientNet-B1	7.8M	ImageNet	88.85 ± 0.28	87.71 ± 0.33
EfficientNet-B2	9.2M	ImageNet	89.11 ± 0.64	88.02 ± 0.27
CNN6	4.8M	AudioSet	89.53 ± 0.26	87.57 ± 0.81
CNN10	5.2M	AudioSet	90.76 ± 0.47	89.03 ± 0.37
CNN14	80.7M	AudioSet	90.97 ± 0.34	89.47 ± 0.22
AST	87.7M	ImageNet + AudioSet	90.03 ± 0.23	81.17 ± 0.77
AST-Patch-Mix		ImageNet + AudioSet	91.07 ± 0.19	82.20 ± 1.13

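"Pretrained weights" denotes models initialized with weights pretrained on the ImageNet⁴² and AudioSet²⁹ datasets; "scratch" denotes models trained without pretrained weights. The best results are 91.07 (AST-Patch-Mix, pretrained weights) and 89.47 (CNN14, scratch); the second best are 90.97 (CNN14, pretrained weights) and 89.03 (CNN10, scratch).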
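The ranking and the effect of pretraining can be checked directly from the accuracies in Table 4. The following is a minimal sketch, not code from the paper: the dictionary `ACCURACY` and the helpers `top_two` and `pretraining_gain` are hypothetical names introduced here for illustration, and only the central accuracy values are used (the ± terms are ignored).

```python
# Accuracies (%) copied from Table 4 as (pretrained weights, scratch).
# The +/- terms from the table are deliberately left out of this sketch.
ACCURACY = {
    "ResNet18": (88.04, 86.83),
    "ResNet34": (88.27, 87.07),
    "ResNet50": (88.24, 86.51),
    "ResNet101": (88.62, 86.86),
    "EfficientNet-B0": (88.92, 87.32),
    "EfficientNet-B1": (88.85, 87.71),
    "EfficientNet-B2": (89.11, 88.02),
    "CNN6": (89.53, 87.57),
    "CNN10": (90.76, 89.03),
    "CNN14": (90.97, 89.47),
    "AST": (90.03, 81.17),
    "AST-Patch-Mix": (91.07, 82.20),
}


def top_two(column):
    """Return the two best (architecture, accuracy) pairs.

    column 0 is "pretrained weights", column 1 is "scratch".
    """
    ranked = sorted(
        ACCURACY.items(), key=lambda item: item[1][column], reverse=True
    )
    return [(name, scores[column]) for name, scores in ranked[:2]]


def pretraining_gain():
    """Accuracy gain in percentage points of pretrained weights over scratch."""
    return {
        name: round(pretrained - scratch, 2)
        for name, (pretrained, scratch) in ACCURACY.items()
    }


if __name__ == "__main__":
    print("pretrained:", top_two(0))
    print("scratch:   ", top_two(1))
    for name, gain in pretraining_gain().items():
        print(f"{name:16s} +{gain:.2f}")
```

Under these assumptions, the gain from pretraining is between roughly 1 and 2 percentage points for the convolutional models, and close to 9 points for AST and AST-Patch-Mix, which are the weakest models when trained from scratch.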