
Table 5.

Image classification results of representative CNN, ViT, and MLP-like models on the ImageNet-1K benchmark

| Model | Pre-trained dataset | Top-1 (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| **CNN based** | | | | |
| VGG-16^100 | | 71.5 | 134 | 15.5 |
| Xception^16 | | 79.0 | 22.9 | |
| Inception-ResNet-V2^101 | | 80.1 | | |
| ResNet-50^97,102 | | 80.4 | 25.6 | 4.1 |
| ResNet-152^97,102 | | 82.0 | 60.2 | 11.5 |
| RegNetY-8GF^102,103 | | 82.2 | 39 | 8.0 |
| RegNetY-16GF^103 | | 82.9 | 84 | 15.9 |
| ConvNeXt-B^104 | | 83.8 | 89.0 | 15.4 |
| VAN-Huge^105 | | 84.2 | 60.3 | 12.2 |
| EfficientNetV2-M^106 | | 85.1 | 54 | 24.0 |
| EfficientNetV2-L^106 | | 85.7 | 120 | 53.0 |
| PolyLoss (EfficientNetV2-L)^107 | | 87.2^1 | | |
| EfficientNetV2-XL^106 | ImageNet-21k | 87.3 | 208 | 94.0 |
| RepLKNet-XL^108 | ImageNet-21k | 87.8^2 | 335 | 128.7 |
| Meta pseudo labels (EfficientNet-L2)^109 | JFT-300M | 90.2^3 | 480 | |
| **Transformer based** | | | | |
| ViT-B/16^47 | | 77.9 | 86 | 55.5 |
| DeiT-B/16^110 | | 81.8 | 86 | 17.6 |
| T2T-ViT-24^111 | | 82.3 | 64.1 | 13.8 |
| PVT-Large^112 | | 82.3 | 61 | 9.8 |
| Swin-B^49 | | 83.5 | 88 | 15.4 |
| NesT-B^113 | | 83.8 | 68 | 17.9 |
| PyramidTNT-B^114 | | 84.1 | 157 | 16.0 |
| CSWin-B^115 | | 84.2 | 78 | 15.0 |
| CaiT-M-48-448^116 | | 86.5 | 356 | 330 |
| PeCo (ViT-H)^99 | | 88.3^1 | 635 | |
| ViT-L/16^47 | ImageNet-21k | 85.3 | 307 | |
| SwinV1-L^49 | ImageNet-21k | 87.3 | 197 | 103.9 |
| SwinV2-G^117 | ImageNet-21k | 90.2^2 | 3000 | |
| V-MoE^90 | JFT-300M | 90.4 | 14,700 | |
| ViT-G/14^47 | JFT-300M | 90.5^3 | 1843 | |
| **CNN + Transformer** | | | | |
| Twins-SVT-B^118 | | 83.2 | 56 | 8.6 |
| Shuffle-B^119 | | 84.0 | 88 | 15.6 |
| CMT-B^120 | | 84.5 | 45.7 | 9.3 |
| CoAtNet-3^121 | | 84.5 | 168 | 34.7 |
| VOLO-D3^122 | | 85.4 | 86 | 20.6 |
| VOLO-D5^122 | | 87.1^1 | 296 | 69.0 |
| CoAtNet-4^121 | ImageNet-21k | 88.1^2 | 275 | 360.9 |
| CoAtNet-7^121 | JFT-300M | 90.9^3 | 2440 | |
| **MLP based** | | | | |
| DynaMixer-L^75 | | 84.3^1 | 97 | 27.4 |
| ResMLP-B24/8^60 | ImageNet-21k | 84.4^2 | 129.1 | 100.2 |
| Mixer-H/14^15 | JFT-300M | 86.3^3 | 431 | |

The Pre-trained dataset column lists any extra data used for pre-training; blank entries are trained on ImageNet-1K only. PolyLoss, PeCo, and meta pseudo labels are training strategies rather than architectures; the model each is applied to is given in parentheses.

^1 The best performance on ImageNet-1K without an extra pre-training dataset.

^2 The best performance on ImageNet-1K with ImageNet-21k pre-training.

^3 The best performance on ImageNet-1K with JFT-300M pre-training.
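
For readers who want to reproduce the Params (M) and FLOPs (G) columns for a given model, the sketch below shows one common way such numbers are measured, using torchvision and fvcore. The survey does not state which tools produced the figures in Table 5, so the choice of library and the 224 × 224 input resolution here are our assumptions.

```python
# A minimal sketch, assuming torchvision and fvcore as the measurement
# tools and a standard 224x224 input (both are our assumptions; the
# survey does not specify how the Params/FLOPs columns were obtained).
import torch
from torchvision.models import resnet50
from fvcore.nn import FlopCountAnalysis

model = resnet50().eval()

# Parameter count, reported in millions (M) as in Table 5.
params_m = sum(p.numel() for p in model.parameters()) / 1e6

# fvcore counts one fused multiply-add as a single "flop", the convention
# most classification papers follow when quoting ~4.1 G for ResNet-50.
with torch.no_grad():
    flops_g = FlopCountAnalysis(model, torch.randn(1, 3, 224, 224)).total() / 1e9

print(f"ResNet-50: {params_m:.1f} M params, {flops_g:.1f} G FLOPs")
# Expected: close to the 25.6 M / 4.1 G reported in the table.
```

Note that tools counting multiplies and adds as separate operations will report roughly twice the FLOPs values shown here, so the counting convention should be checked before comparing against the table.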