Table 16. Training details for CochCNN9 and CochResNet50 models trained on clean speech.
Model Name | Batch Size | Initial Learning Rate | Num Classes (includes the “null” label, although no “null” examples were included when training clean models) | Accuracy on Clean Speech for Training Task |
---|---|---|---|---|
CochCNN9 Word | 128 | 0.01 | 794 | (Top 1) 82.311% (Top 5) 94.150% |
CochCNN9 WordClean | 128 | 0.01 | 794 | (Top 1) 84.365% (Top 5) 95.078% |
CochCNN9 Speaker | 128 | 0.01 | 433 | (Top 1) 99.799% (Top 5) 99.990% |
CochCNN9 SpeakerClean* | 128 | 0.01 | 433 | (Top 1) 99.905% (Top 5) 99.998% |
CochResNet50 Word | 256 | 0.1 | 794 | (Top 1) 94.212% (Top 5) 98.993% |
CochResNet50 WordClean | 256 | 0.1 | 794 | (Top 1) 93.998% (Top 5) 98.662% |
CochResNet50 Speaker | 256 | 0.1 | 433 | (Top 1) 99.973% (Top 5) 100.000% |
CochResNet50 SpeakerClean | 256 | 0.1 | 433 | (Top 1) 99.988% (Top 5) 100.000% |
*This model was trained with additional gradient clipping (max L2 norm = 1.0) and learning rate warm-up for the first 500 batches of training (learning rate = <initial learning rate> / (500 - i), where i is the batch number).
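
The warm-up rule and clipping threshold in the footnote can be illustrated with a minimal Python sketch. This is not the training code used for these models. The function names `warmup_learning_rate` and `clip_by_l2_norm` are hypothetical, the batch number i is assumed to be 0-based (so the divisor runs from 500 down to 1), the learning rate is assumed to stay at its initial value once warm-up ends, and the gradient is treated as a single flat vector because the footnote does not say whether the norm is computed per tensor or over all parameters.

```python
import math


def warmup_learning_rate(initial_lr, batch_index, warmup_batches=500):
    """Learning rate for one batch under the footnote's warm-up rule.

    Assumption: batch_index is 0-based, so the divisor goes 500, 499, ..., 1.
    After the warm-up window, the initial learning rate is returned unchanged
    (any later decay schedule is outside the scope of this sketch).
    """
    if batch_index < warmup_batches:
        return initial_lr / (warmup_batches - batch_index)
    return initial_lr


def clip_by_l2_norm(gradient, max_norm=1.0):
    """Rescale a flat gradient vector so its L2 norm is at most max_norm.

    Assumption: the gradient is one flat list of floats; whether the original
    training clipped per tensor or globally is not stated in the table.
    """
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


if __name__ == "__main__":
    # CochCNN9 SpeakerClean uses an initial learning rate of 0.01 (Table 16).
    for i in (0, 250, 499, 500):
        print(i, warmup_learning_rate(0.01, i))
    print(clip_by_l2_norm([3.0, 4.0]))
```

Under these assumptions the learning rate starts at 1/500 of its initial value and rises slowly at first, then steeply over the last few warm-up batches: it reaches half the initial value only at i = 498 and the full value at i = 499.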