Table 2.
Dataset | Classes | Input | Description |
CIFAR10 (26) | 10 | 32 row × 32 column × 3 RGB | Natural and manufactured objects in their environment |
CIFAR100 (26) | 100 | 32 row × 32 column × 3 RGB | Natural and manufactured objects in their environment |
SVHN (27) | 10 | 32 row × 32 column × 3 RGB | Single digits of house addresses from Google’s Street View |
GTSRB (28) | 43 | 32 row × 32 column × 3 RGB | German traffic signs in multiple environments |
Flickr-Logos32 (29) | 32 | 32 row × 32 column × 3 RGB | Localized corporate logos in their environment |
VAD (30, 31) | 2 | 16 sample × 26 MFCC | Voice activity present or absent, with noise (TIMIT + NOISEX) |
TIMIT class (30) | 39 | 32 sample × 16 MFCC × 3 delta | Phonemes from English speakers, with phoneme boundaries |
TIMIT frame (30) | 39 | 16 sample × 39 MFCC | Phonemes from English speakers, without phoneme boundaries |
GTSRB and Flickr-Logos32 images are cropped and/or downsampled from larger images. Inputs for the VAD and TIMIT datasets are Mel-frequency cepstral coefficients (MFCCs) computed from 16-kHz audio data.
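As a quick check on the input shapes listed in Table 2, the following minimal sketch computes the number of values in one input sample for each dataset. The names `INPUT_SHAPES`, `values_per_sample`, and `SIZES` are hypothetical and introduced only for illustration, and reading each "Input" entry as a product of its dimensions is an assumption, not something the source states.

```python
from math import prod

# Input shapes transcribed from the "Input" column of Table 2.
# The dictionary name and layout are illustrative, not from the source.
INPUT_SHAPES = {
    "CIFAR10": (32, 32, 3),         # row x column x RGB
    "CIFAR100": (32, 32, 3),
    "SVHN": (32, 32, 3),
    "GTSRB": (32, 32, 3),
    "Flickr-Logos32": (32, 32, 3),
    "VAD": (16, 26),                # sample x MFCC
    "TIMIT class": (32, 16, 3),     # sample x MFCC x delta
    "TIMIT frame": (16, 39),        # sample x MFCC
}


def values_per_sample(shape):
    """Return the number of scalar values in one input of the given shape."""
    return prod(shape)


# Values per input sample, keyed by dataset name.
SIZES = {name: values_per_sample(shape) for name, shape in INPUT_SHAPES.items()}

if __name__ == "__main__":
    for name, size in SIZES.items():
        print(f"{name}: {size} values per sample")
```

Under this reading, each of the five image datasets has 3072 values per sample, VAD has 416, TIMIT class has 1536, and TIMIT frame has 624.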