Table 15. Training details for CochCNN9 and CochResNet50 models.
Model Name | Batch Size | Initial Learning Rate | Num Classes (includes “null”) | Accuracy on Training Task |
---|---|---|---|---|
CochCNN9 Word | 128 | 0.01 | 794 | (Top 1) 66.640% (Top 5) 83.102% |
CochCNN9 Speaker | 128 | 0.01 | 433 | (Top 1) 96.216% (Top 5) 99.058% |
CochCNN9 AudioSet | 128 | 0.00001* | 517 | (ESC-50 SVM) 83.60% |
CochCNN9 MultiTask | 128 | 0.00001* | Three tasks: (Word) 794, (Speaker) 433, (AudioSet) 517 | (Top 1 Word) 64.954% (Top 5 Word) 81.998% (Top 1 Speaker) 86.686% (Top 5 Speaker) 96.039% (ESC-50 SVM) 82.60% |
CochResNet50 Word | 256 | 0.1 | 794 | (Top 1) 86.792% (Top 5) 95.149% |
CochResNet50 Speaker | 256 | 0.1 | 433 | (Top 1) 99.114% (Top 5) 99.835% |
CochResNet50 AudioSet | 256 | 0.001* | 517 | (ESC-50 SVM) 91.6% |
CochResNet50 MultiTask | 256 | 0.001* | Three tasks: (Word) 794, (Speaker) 433, (AudioSet) 517 | (Top 1 Word) 83.459% (Top 5 Word) 93.422% (Top 1 Speaker) 94.354% (Top 5 Speaker) 98.785% (ESC-50 SVM) 87.450% |
*Models trained with the AudioSet loss additionally used gradient clipping (maximum L2 norm = 1.0) and a learning rate warm-up over the first 500 batches of training (learning rate = <initial learning rate> / (500 - i), where i is the batch number).
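
The warm-up and clipping rules in the footnote can be illustrated with a minimal sketch. It is not the training code used for these models. It assumes that the batch number i is zero-based (so i runs from 0 to 499 during warm-up), that the learning rate equals the initial value once warm-up ends (any later schedule is not described here and is ignored), and that clipping is applied to the global L2 norm of all gradients taken together. The function names `warmup_learning_rate` and `clip_by_global_l2_norm` are hypothetical.

```python
import math

WARMUP_BATCHES = 500  # warm-up length stated in the footnote
MAX_GRAD_NORM = 1.0   # maximum L2 norm stated in the footnote


def warmup_learning_rate(initial_lr, batch_index, warmup_batches=WARMUP_BATCHES):
    """Return the learning rate for a batch under the footnote's warm-up rule.

    Assumption: batch_index is zero-based, so the denominator (500 - i)
    stays positive for every warm-up batch.
    """
    if batch_index < warmup_batches:
        return initial_lr / (warmup_batches - batch_index)
    # Assumption: after warm-up the rate is simply the initial learning rate.
    return initial_lr


def clip_by_global_l2_norm(gradients, max_norm=MAX_GRAD_NORM):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm.

    Assumption: the norm is computed over all gradient values jointly.
    """
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]
```

Under these assumptions, a model with an initial learning rate of 0.001 (CochResNet50 AudioSet in the table) would start at 0.001 / 500, about 2e-6, at batch 0 and reach 0.001 at batch 499. A gradient whose norm already lies below 1.0 is returned unchanged by the clipping function.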