Table 1.
CNN architecture used for the sound event and speech command datasets.
| | Sound Event | Speech Command |
|---|---|---|
| Image input layer | 32 × 15 | 64 × 64 |
| Middle layers | Conv. 1: 16@3 × 3, Stride 1 × 1, Pad 1 × 1; Batch Normalization, ReLU; Max Pool: 2 × 2, Stride 1 × 1, Pad 1 × 1; Conv. 2: 16@3 × 3, Stride 1 × 1, Pad 1 × 1; Batch Normalization, ReLU; Max Pool: 2 × 2, Stride 1 × 1, Pad 1 × 1 | Conv. 1: 48@3 × 3, Stride 1 × 1, Pad ‘same’; Batch Normalization, ReLU; Max Pool: 3 × 3, Stride 2 × 2, Pad ‘same’; Conv. 2: 96@3 × 3, Stride 1 × 1, Pad ‘same’; Batch Normalization, ReLU; Max Pool: 3 × 3, Stride 2 × 2, Pad ‘same’; Conv. 3: 192@3 × 3, Stride 1 × 1, Pad ‘same’; Batch Normalization, ReLU; Max Pool: 3 × 3, Stride 2 × 2, Pad ‘same’; Conv. 4: 192@3 × 3, Stride 1 × 1, Pad ‘same’; Batch Normalization, ReLU; Conv. 5: 192@3 × 3, Stride 1 × 1, Pad ‘same’; Batch Normalization, ReLU; Max Pool: 3 × 3, Stride 2 × 2, Pad ‘same’; Dropout: 0.2 |
| Final layers | Fully connected layer: 50; Softmax layer; Classification layer | Fully connected layer: 36; Softmax layer; Classification layer |
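
To see how the layer settings in Table 1 shape the feature maps, the following minimal sketch traces the spatial size through the convolution and pooling layers of both networks. It rests on two assumptions that the table does not state: explicit padding is symmetric zero padding with output length floor((n + 2p − k) / s) + 1, and ‘same’ padding yields output length ceil(n / s), as in common deep learning frameworks. The names `trace`, `SOUND_EVENT`, and `SPEECH_COMMAND` are hypothetical helpers for illustration only. Batch normalization, ReLU, and dropout are omitted because they do not change the feature-map shape.

```python
import math


def explicit_out(size, kernel, stride, pad):
    """Output length along one axis with explicit symmetric padding (assumed formula)."""
    return (size + 2 * pad - kernel) // stride + 1


def same_out(size, stride):
    """Output length along one axis with 'same' padding (assumed to be ceil(size / stride))."""
    return math.ceil(size / stride)


def trace(input_hw, layers):
    """Return (name, channels, height, width) after each conv or pooling layer."""
    h, w = input_hw
    channels = None  # set by the first convolution
    shapes = []
    for name, filters, kernel, stride, pad in layers:
        if pad == "same":
            h, w = same_out(h, stride), same_out(w, stride)
        else:
            h = explicit_out(h, kernel, stride, pad)
            w = explicit_out(w, kernel, stride, pad)
        if filters is not None:  # pooling layers keep the channel count
            channels = filters
        shapes.append((name, channels, h, w))
    return shapes


# (name, filters or None for pooling, kernel, stride, padding), transcribed from Table 1
SOUND_EVENT = [
    ("conv1", 16, 3, 1, 1),
    ("pool1", None, 2, 1, 1),
    ("conv2", 16, 3, 1, 1),
    ("pool2", None, 2, 1, 1),
]

SPEECH_COMMAND = [
    ("conv1", 48, 3, 1, "same"),
    ("pool1", None, 3, 2, "same"),
    ("conv2", 96, 3, 1, "same"),
    ("pool2", None, 3, 2, "same"),
    ("conv3", 192, 3, 1, "same"),
    ("pool3", None, 3, 2, "same"),
    ("conv4", 192, 3, 1, "same"),
    ("conv5", 192, 3, 1, "same"),
    ("pool4", None, 3, 2, "same"),
]

if __name__ == "__main__":
    for label, size, layers in [
        ("Sound Event", (32, 15), SOUND_EVENT),
        ("Speech Command", (64, 64), SPEECH_COMMAND),
    ]:
        print(label)
        for name, c, h, w in trace(size, layers):
            print(f"  {name}: {c} x {h} x {w}")
```

Under these assumptions, the sound event network ends with a 16 × 34 × 17 feature map (9248 values) feeding the 50-unit fully connected layer, while the speech command network ends with 192 × 4 × 4 (3072 values) feeding the 36-unit layer. The stride-1 pooling with padding in the sound event network slightly enlarges the maps rather than downsampling them, which is why its final map is larger than its 32 × 15 input.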