2016 Sep 20;113(41):11441–11446. doi: 10.1073/pnas.1604850113

Table 2.

Summary of datasets

| Dataset | Classes | Input | Description |
|---|---|---|---|
| CIFAR10 (26) | 10 | 32 row × 32 column × 3 RGB | Natural and manufactured objects in their environment |
| CIFAR100 (26) | 100 | 32 row × 32 column × 3 RGB | Natural and manufactured objects in their environment |
| SVHN (27) | 10 | 32 row × 32 column × 3 RGB | Single digits of house addresses from Google’s Street View |
| GTSRB (28) | 43 | 32 row × 32 column × 3 RGB | German traffic signs in multiple environments |
| Flickr-Logos32 (29) | 32 | 32 row × 32 column × 3 RGB | Localized corporate logos in their environment |
| VAD (30, 31) | 2 | 16 sample × 26 MFCC | Voice activity present or absent, with noise (TIMIT + NOISEX) |
| TIMIT class (30) | 39 | 32 sample × 16 MFCC × 3 delta | Phonemes from English speakers, with phoneme boundaries |
| TIMIT frame (30) | 39 | 16 sample × 39 MFCC | Phonemes from English speakers, without phoneme boundaries |

GTSRB and Flickr-Logos32 are cropped and/or downsampled from larger images. VAD and TIMIT datasets have Mel-frequency cepstral coefficients (MFCC) computed from 16-kHz audio data.
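
To compare input sizes across datasets, the shapes listed in Table 2 can be multiplied out. The following is a minimal sketch, not code from the paper: the `DATASETS` dictionary and the `input_size` helper are hypothetical names introduced here, and only the class counts and dimensions are taken from the table.

```python
from math import prod

# Class counts and input shapes transcribed from Table 2.
# The dictionary layout and key names are illustrative assumptions.
DATASETS = {
    "CIFAR10":        {"classes": 10,  "shape": (32, 32, 3)},  # row, column, RGB
    "CIFAR100":       {"classes": 100, "shape": (32, 32, 3)},
    "SVHN":           {"classes": 10,  "shape": (32, 32, 3)},
    "GTSRB":          {"classes": 43,  "shape": (32, 32, 3)},
    "Flickr-Logos32": {"classes": 32,  "shape": (32, 32, 3)},
    "VAD":            {"classes": 2,   "shape": (16, 26)},     # sample, MFCC
    "TIMIT class":    {"classes": 39,  "shape": (32, 16, 3)},  # sample, MFCC, delta
    "TIMIT frame":    {"classes": 39,  "shape": (16, 39)},     # sample, MFCC
}


def input_size(name: str) -> int:
    """Return the number of scalar values in one input of the named dataset."""
    return prod(DATASETS[name]["shape"])


if __name__ == "__main__":
    for name, info in DATASETS.items():
        print(f"{name:15s} classes={info['classes']:3d} "
              f"shape={info['shape']} values={input_size(name)}")
```

Under these shapes, every image dataset presents 32 × 32 × 3 = 3,072 values per input, while the audio inputs are smaller: 416 values for VAD, 1,536 for TIMIT class, and 624 for TIMIT frame.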