
Table 5.

Image classification results of representative CNN, ViT, and MLP-like models on the ImageNet-1K benchmark

| Model | Pre-trained dataset | Top-1 (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| **CNN based** | | | | |
| VGG-16^100 | | 71.5 | 134 | 15.5 |
| Xception^16 | | 79.0 | 22.9 | |
| Inception-ResNet-V2^101 | | 80.1 | | |
| ResNet-50^97,102 | | 80.4 | 25.6 | 4.1 |
| ResNet-152^97,102 | | 82.0 | 60.2 | 11.5 |
| RegNetY-8GF^102,103 | | 82.2 | 39 | 8.0 |
| RegNetY-16GF^103 | | 82.9 | 84 | 15.9 |
| ConvNeXt-B^104 | | 83.8 | 89.0 | 15.4 |
| VAN-Huge^105 | | 84.2 | 60.3 | 12.2 |
| EfficientNetV2-M^106 | | 85.1 | 54 | 24.0 |
| EfficientNetV2-L^106 | | 85.7 | 120 | 53.0 |
| PolyLoss (EfficientNetV2-L)^107 | | 87.2^1 | | |
| EfficientNetV2-XL^106 | ImageNet-21k | 87.3 | 208 | 94.0 |
| RepLKNet-XL^108 | ImageNet-21k | 87.8^2 | 335 | 128.7 |
| Meta pseudo labels (EfficientNet-L2)^109 | JFT-300M | 90.2^3 | 480 | |
| **Transformer based** | | | | |
| ViT-B/16^47 | | 77.9 | 86 | 55.5 |
| DeiT-B/16^110 | | 81.8 | 86 | 17.6 |
| T2T-ViT-24^111 | | 82.3 | 64.1 | 13.8 |
| PVT-Large^112 | | 82.3 | 61 | 9.8 |
| Swin-B^49 | | 83.5 | 88 | 15.4 |
| NesT-B^113 | | 83.8 | 68 | 17.9 |
| PyramidTNT-B^114 | | 84.1 | 157 | 16.0 |
| CSWin-B^115 | | 84.2 | 78 | 15.0 |
| CaiT-M-48-448^116 | | 86.5 | 356 | 330 |
| PeCo (ViT-H)^99 | | 88.3^1 | 635 | |
| ViT-L/16^47 | ImageNet-21k | 85.3 | 307 | |
| SwinV1-L^49 | ImageNet-21k | 87.3 | 197 | 103.9 |
| SwinV2-G^117 | ImageNet-21k | 90.2^2 | 3000 | |
| V-MoE^90 | JFT-300M | 90.4 | 14,700 | |
| ViT-G/14^47 | JFT-300M | 90.5^3 | 1843 | |
| **CNN + Transformer** | | | | |
| Twins-SVT-B^118 | | 83.2 | 56 | 8.6 |
| Shuffle-B^119 | | 84.0 | 88 | 15.6 |
| CMT-B^120 | | 84.5 | 45.7 | 9.3 |
| CoAtNet-3^121 | | 84.5 | 168 | 34.7 |
| VOLO-D3^122 | | 85.4 | 86 | 20.6 |
| VOLO-D5^122 | | 87.1^1 | 296 | 69.0 |
| CoAtNet-4^121 | ImageNet-21k | 88.1^2 | 275 | 360.9 |
| CoAtNet-7^121 | JFT-300M | 90.9^3 | 2440 | |
| **MLP based** | | | | |
| DynaMixer-L^75 | | 84.3^1 | 97 | 27.4 |
| ResMLP-B24/8^60 | ImageNet-21k | 84.4^2 | 129.1 | 100.2 |
| Mixer-H/14^15 | JFT-300M | 86.3^3 | 431 | |

The Pre-trained dataset column lists any extra data used for pre-training; blank entries are trained on ImageNet-1K only. PolyLoss, PeCo, and meta pseudo labels are training strategies rather than architectures; the model each is applied to is given in parentheses.

^1 The best performance on ImageNet-1K without an extra pre-training dataset.

^2 The best performance on ImageNet-1K with ImageNet-21k pre-training.

^3 The best performance on ImageNet-1K with JFT-300M pre-training.
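
For readers who want to reproduce the Params (M) and FLOPs (G) columns for a given model, the sketch below shows one common way such numbers are measured, using torchvision and fvcore. The survey does not state which tools produced the figures in Table 5, so the choice of library and the 224 × 224 input resolution here are our assumptions.

```python
# A minimal sketch, assuming torchvision and fvcore as the measurement
# tools and a standard 224x224 input (both are our assumptions; the
# survey does not specify how the Params/FLOPs columns were obtained).
import torch
from torchvision.models import resnet50
from fvcore.nn import FlopCountAnalysis

model = resnet50().eval()

# Parameter count, reported in millions (M) as in Table 5.
params_m = sum(p.numel() for p in model.parameters()) / 1e6

# fvcore counts one fused multiply-add as a single "flop", the convention
# most classification papers follow when quoting ~4.1 G for ResNet-50.
with torch.no_grad():
    flops_g = FlopCountAnalysis(model, torch.randn(1, 3, 224, 224)).total() / 1e9

print(f"ResNet-50: {params_m:.1f} M params, {flops_g:.1f} G FLOPs")
# Expected: close to the 25.6 M / 4.1 G reported in the table.
```

Note that tools counting multiplies and adds as separate operations will report roughly twice the FLOPs values shown here, so the counting convention should be checked before comparing against the table.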