Table 5. Top-1 accuracy, parameter counts, and FLOPs of representative CNN-based, Transformer-based, hybrid (CNN + Transformer), and MLP-based models on ImageNet-1k classification.
Model | Pre-trained dataset | Top-1 (%) | Params (M) | FLOPs (G) |
---|---|---|---|---|
CNN based | | | | |
VGG-16 [100] | – | 71.5 | 134 | 15.5 |
Xception [16] | – | 79.0 | 22.9 | – |
Inception-ResNet-V2 [101] | – | 80.1 | – | – |
ResNet-50 [97,102] | – | 80.4 | 25.6 | 4.1 |
ResNet-152 [97,102] | – | 82.0 | 60.2 | 11.5 |
RegNetY-8GF [102,103] | – | 82.2 | 39 | 8.0 |
RegNetY-16GF [103] | – | 82.9 | 84 | 15.9 |
ConvNeXt-B [104] | – | 83.8 | 89.0 | 15.4 |
VAN-Huge [105] | – | 84.2 | 60.3 | 12.2 |
EfficientNetV2-M [106] | – | 85.1 | 54 | 24.0 |
EfficientNetV2-L [106] | – | 85.7 | 120 | 53.0 |
PolyLoss (EfficientNetV2-L) [107] | – | 87.2¹ | – | – |
EfficientNetV2-XL [106] | ImageNet-21k | 87.3 | 208 | 94.0 |
RepLKNet-XL [108] | ImageNet-21k | 87.8² | 335 | 128.7 |
Meta pseudo labels (EfficientNet-L2) [109] | JFT-300M | 90.2³ | 480 | – |
Transformer based | | | | |
ViT-B/16 [47] | – | 77.9 | 86 | 55.5 |
DeiT-B/16 [110] | – | 81.8 | 86 | 17.6 |
T2T-ViT-24 [111] | – | 82.3 | 64.1 | 13.8 |
PVT-large [112] | – | 82.3 | 61 | 9.8 |
Swin-B [49] | – | 83.5 | 88 | 15.4 |
Nest-B [113] | – | 83.8 | 68 | 17.9 |
PyramidTNT-B [114] | – | 84.1 | 157 | 16.0 |
CSWin-B [115] | – | 84.2 | 78 | 15.0 |
CaiT-M-48-448 [116] | – | 86.5 | 356 | 330 |
PeCo (ViT-H) [99] | – | 88.3¹ | 635 | – |
ViT-L/16 [47] | ImageNet-21k | 85.3 | 307 | – |
SwinV1-L [49] | ImageNet-21k | 87.3 | 197 | 103.9 |
SwinV2-G [117] | ImageNet-21k | 90.2² | 3000 | – |
V-MoE [90] | JFT-300M | 90.4 | 14,700 | – |
ViT-G/14 [47] | JFT-300M | 90.5³ | 1843 | – |
CNN + Transformer | | | | |
Twins-SVT-B [118] | – | 83.2 | 56 | 8.6 |
Shuffle-B [119] | – | 84.0 | 88 | 15.6 |
CMT-B [120] | – | 84.5 | 45.7 | 9.3 |
CoAtNet-3 [121] | – | 84.5 | 168 | 34.7 |
VOLO-D3 [122] | – | 85.4 | 86 | 20.6 |
VOLO-D5 [122] | – | 87.1¹ | 296 | 69.0 |
CoAtNet-4 [121] | ImageNet-21k | 88.1² | 275 | 360.9 |
CoAtNet-7 [121] | JFT-300M | 90.9³ | 2440 | – |
MLP based | | | | |
DynaMixer-L [75] | – | 84.3¹ | 97 | 27.4 |
ResMLP-B24/8 [60] | ImageNet-21k | 84.4² | 129.1 | 100.2 |
Mixer-H/14 [15] | JFT-300M | 86.3³ | 431 | – |
The Pre-trained dataset column indicates the extra data used for pre-training. PolyLoss, PeCo, and meta pseudo labels are training strategies rather than architectures; the backbone model they are applied to is given in parentheses.
¹ The best performance on ImageNet-1k without a pre-trained dataset (within each model group).
² The best performance on ImageNet-1k with ImageNet-21k pre-training (within each model group).
³ The best performance on ImageNet-1k with JFT-300M pre-training (within each model group).
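For the publicly released models, the Params (M) and FLOPs (G) columns can be checked directly. The snippet below is a minimal sketch, not taken from any of the cited works, and assumes the torchvision and fvcore packages are installed; it counts parameters and multiply-add operations for ResNet-50 at a 224 × 224 input, which is the setting consistent with the 4.1 GFLOPs and 25.6 M parameters listed above.

```python
# Illustrative sketch (assumption: torchvision and fvcore are available).
# Reproduces the Params (M) and FLOPs (G) entries for ResNet-50 in Table 5.
import torch
from torchvision.models import resnet50
from fvcore.nn import FlopCountAnalysis

model = resnet50()  # randomly initialised weights are enough for counting
model.eval()

# Parameter count in millions (~25.6 M for ResNet-50).
params_m = sum(p.numel() for p in model.parameters()) / 1e6

# Multiply-add count for one 224x224 image (~4.1 G for ResNet-50).
dummy_input = torch.randn(1, 3, 224, 224)
flops_g = FlopCountAnalysis(model, dummy_input).total() / 1e9

print(f"Params: {params_m:.1f} M, FLOPs: {flops_g:.1f} G")
```

Note that fvcore counts one multiply-add as one FLOP, which appears to be the convention most of the listed papers follow; papers that instead count multiplications and additions separately would report roughly twice these values.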