2021 May 14;21(10):3434. doi: 10.3390/s21103434

Table 1.

CNN architecture used for the sound event and speech command datasets.

|  | Sound Event | Speech Command |
|---|---|---|
| Image input layer | 32 × 15 | 64 × 64 |
| Middle layers | Conv. 1: 16@3 × 3, Stride 1 × 1, Pad 1 × 1 | Conv. 1: 48@3 × 3, Stride 1 × 1, Pad ‘same’ |
|  | Batch Normalization, ReLU | Batch Normalization, ReLU |
|  | Max Pool: 2 × 2, Stride 1 × 1, Pad 1 × 1 | Max Pool: 3 × 3, Stride 2 × 2, Pad ‘same’ |
|  | Conv. 2: 16@3 × 3, Stride 1 × 1, Pad 1 × 1 | Conv. 2: 96@3 × 3, Stride 1 × 1, Pad ‘same’ |
|  | Batch Normalization, ReLU | Batch Normalization, ReLU |
|  | Max Pool: 2 × 2, Stride 1 × 1, Pad 1 × 1 | Max Pool: 3 × 3, Stride 2 × 2, Pad ‘same’ |
|  |  | Conv. 3: 192@3 × 3, Stride 1 × 1, Pad ‘same’ |
|  |  | Batch Normalization, ReLU |
|  |  | Max Pool: 3 × 3, Stride 2 × 2, Pad ‘same’ |
|  |  | Conv. 4: 192@3 × 3, Stride 1 × 1, Pad ‘same’ |
|  |  | Batch Normalization, ReLU |
|  |  | Conv. 5: 192@3 × 3, Stride 1 × 1, Pad ‘same’ |
|  |  | Batch Normalization, ReLU |
|  |  | Max Pool: 3 × 3, Stride 2 × 2, Pad ‘same’ |
|  |  | Dropout: 0.2 |
| Final layers | Fully connected layer: 50 | Fully connected layer: 36 |
|  | Softmax layer | Softmax layer |
|  | Classification layer | Classification layer |
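
The layer settings in Table 1 determine the feature-map size that reaches each fully connected layer. The sketch below traces those sizes for both networks. It is an illustration, not the authors' code, and rests on several assumptions: the usual output-size formula floor((n + 2p − k)/s) + 1 for numeric padding, the convention that ‘same’ padding yields ceil(n/s), a single-channel input image, and reading each input size as height × width. The helper names (`out_size`, `trace`) and the layer tuples are hypothetical. Batch normalization, ReLU, and dropout are omitted because they do not change the feature-map shape.

```python
import math


def out_size(n, kernel, stride, pad):
    """Output length along one axis for a convolution or pooling layer."""
    if pad == "same":
        # Assumed convention: 'same' padding yields ceil(n / stride).
        return math.ceil(n / stride)
    # Standard formula for explicit symmetric padding.
    return (n + 2 * pad - kernel) // stride + 1


def trace(input_hw, layers):
    """Return (height, width, channels) after applying the layers in order."""
    h, w = input_hw
    c = 1  # assumption: single-channel input image
    for kind, channels, kernel, stride, pad in layers:
        h = out_size(h, kernel, stride, pad)
        w = out_size(w, kernel, stride, pad)
        if kind == "conv":
            c = channels  # pooling keeps the channel count unchanged
    return h, w, c


# Each entry: (kind, output channels, kernel size, stride, padding),
# transcribed from the middle layers of Table 1.
SOUND_EVENT = [
    ("conv", 16, 3, 1, 1), ("pool", None, 2, 1, 1),
    ("conv", 16, 3, 1, 1), ("pool", None, 2, 1, 1),
]
SPEECH_COMMAND = [
    ("conv", 48, 3, 1, "same"), ("pool", None, 3, 2, "same"),
    ("conv", 96, 3, 1, "same"), ("pool", None, 3, 2, "same"),
    ("conv", 192, 3, 1, "same"), ("pool", None, 3, 2, "same"),
    ("conv", 192, 3, 1, "same"),
    ("conv", 192, 3, 1, "same"), ("pool", None, 3, 2, "same"),
]

for name, hw, layers in [
    ("sound event", (32, 15), SOUND_EVENT),
    ("speech command", (64, 64), SPEECH_COMMAND),
]:
    h, w, c = trace(hw, layers)
    print(f"{name}: {h} x {w} x {c} -> {h * w * c} inputs to the fully connected layer")
```

Under these assumptions, the sound event network delivers a 34 × 17 × 16 feature map (9248 values) to its 50-unit fully connected layer, and the speech command network delivers a 4 × 4 × 192 feature map (3072 values) to its 36-unit layer. Note that the 2 × 2 pooling with stride 1 and padding 1 slightly enlarges the map rather than downsampling it.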