2023 Dec 13;21(12):e3002366. doi: 10.1371/journal.pbio.3002366

Table 15. Training details for CochCNN9 and CochResNet50 models.

| Model Name | Batch Size | Initial Learning Rate | Number of Classes (includes “null”) | Accuracy on Training Task |
|---|---|---|---|---|
| CochCNN9 Word | 128 | 0.01 | 794 | Top 1: 66.640%; Top 5: 83.102% |
| CochCNN9 Speaker | 128 | 0.01 | 433 | Top 1: 96.216%; Top 5: 99.058% |
| CochCNN9 AudioSet | 128 | 0.00001* | 517 | ESC-50 SVM: 83.60% |
| CochCNN9 MultiTask | 128 | 0.00001* | Three tasks: Word 794, Speaker 433, AudioSet 517 | Top 1 Word: 64.954%; Top 5 Word: 81.998%; Top 1 Speaker: 86.686%; Top 5 Speaker: 96.039%; ESC-50 SVM: 82.60% |
| CochResNet50 Word | 256 | 0.1 | 794 | Top 1: 86.792%; Top 5: 95.149% |
| CochResNet50 Speaker | 256 | 0.1 | 433 | Top 1: 99.114%; Top 5: 99.835% |
| CochResNet50 AudioSet | 256 | 0.001* | 517 | ESC-50 SVM: 91.6% |
| CochResNet50 MultiTask | 256 | 0.001* | Three tasks: Word 794, Speaker 433, AudioSet 517 | Top 1 Word: 83.459%; Top 5 Word: 93.422%; Top 1 Speaker: 94.354%; Top 5 Speaker: 98.785%; ESC-50 SVM: 87.450% |

*Models trained with the AudioSet loss additionally used gradient clipping (maximum L2 norm = 1.0) and a learning rate warm-up over the first 500 batches of training (learning rate = <initial learning rate> / (500 - i), where i is the batch number).
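
The warm-up rule and clipping threshold in the footnote can be sketched in plain Python. This is a minimal illustration, not the authors' training code: the function names are hypothetical, the batch index i is assumed to start at 0, the rate is assumed to stay at its initial value once warm-up ends (any later schedule is ignored), and clipping is assumed to act on the global L2 norm of all gradients taken together.

```python
import math

WARMUP_BATCHES = 500  # warm-up length stated in the footnote
MAX_GRAD_NORM = 1.0   # clipping threshold stated in the footnote


def warmup_learning_rate(initial_lr, batch_index, warmup_batches=WARMUP_BATCHES):
    """Return the learning rate for a batch.

    Assumption: batch_index is 0-based, so warm-up covers batches 0..499.
    During warm-up the rate is initial_lr / (500 - i), as in the footnote.
    """
    if batch_index < warmup_batches:
        return initial_lr / (warmup_batches - batch_index)
    # Assumption: after warm-up the rate is simply the initial learning rate.
    return initial_lr


def clip_by_global_l2_norm(gradients, max_norm=MAX_GRAD_NORM):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm.

    Assumption: the norm is computed over all gradients jointly.
    """
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]


if __name__ == "__main__":
    # Example with the CochResNet50 AudioSet initial learning rate (0.001).
    for i in (0, 250, 499, 500):
        print(i, warmup_learning_rate(0.001, i))
    # A gradient with L2 norm 5 is scaled down to norm 1.
    print(clip_by_global_l2_norm([3.0, 4.0]))
```

Under these assumptions, a run with an initial learning rate of 0.001 starts near 0.001 / 500 and reaches the full 0.001 at batch 499, while any gradient whose norm exceeds 1.0 is rescaled to norm 1.0 without changing its direction.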