2016 Sep 20;113(41):11441–11446. doi: 10.1073/pnas.1604850113

Table 2.

Summary of datasets

Dataset	Classes	Input	Description
CIFAR10 (26)	10	32 row $\times$ 32 column $\times$ 3 RGB	Natural and manufactured objects in their environment
CIFAR100 (26)	100	32 row $\times$ 32 column $\times$ 3 RGB	Natural and manufactured objects in their environment
SVHN (27)	10	32 row $\times$ 32 column $\times$ 3 RGB	Single digits of house addresses from Google’s Street View
GTSRB (28)	43	32 row $\times$ 32 column $\times$ 3 RGB	German traffic signs in multiple environments
Flickr-Logos32 (29)	32	32 row $\times$ 32 column $\times$ 3 RGB	Localized corporate logos in their environment
VAD (30, 31)	2	16 sample $\times$ 26 MFCC	Voice activity present or absent, with noise (TIMIT + NOISEX)
TIMIT class (30)	39	32 sample $\times$ 16 MFCC $\times$ 3 delta	Phonemes from English speakers, with phoneme boundaries
TIMIT frame (30)	39	16 sample $\times$ 39 MFCC	Phonemes from English speakers, without phoneme boundaries

GTSRB and Flickr-Logos32 are cropped and/or downsampled from larger images. VAD and TIMIT datasets have Mel-frequency cepstral coefficients (MFCC) computed from 16-kHz audio data.
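
As a quick check on the input dimensions listed in Table 2, the following minimal sketch multiplies out each shape to get the number of scalar values per input sample. The dictionary `INPUT_SHAPES` and the helper `input_size` are illustrative names, not part of the original work; only the shapes themselves come from the table.

```python
from math import prod

# Input shapes copied from Table 2. The dictionary name and layout are
# assumptions made for illustration only.
INPUT_SHAPES = {
    "CIFAR10": (32, 32, 3),         # rows x columns x RGB
    "CIFAR100": (32, 32, 3),
    "SVHN": (32, 32, 3),
    "GTSRB": (32, 32, 3),
    "Flickr-Logos32": (32, 32, 3),
    "VAD": (16, 26),                # samples x MFCC
    "TIMIT class": (32, 16, 3),     # samples x MFCC x delta
    "TIMIT frame": (16, 39),        # samples x MFCC
}


def input_size(name):
    """Return the number of scalar values in one input sample."""
    return prod(INPUT_SHAPES[name])


if __name__ == "__main__":
    for name in INPUT_SHAPES:
        print(f"{name}: {input_size(name)} values per sample")
```

Under these shapes, every image dataset presents the same number of values per sample, while the audio datasets are several times smaller.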