Table 1.
MobileViT architecture, where d represents the dimensionality of the input to the transformer layer in the MobileViT block.
| Layer | Output size | Output stride | Repeat | Output channels |
|---|---|---|---|---|
| Image | 256 × 256 | 1 | | |
| Conv-3×3, ↓2 | 128 × 128 | 2 | 1 | 16 |
| MV2 | 128 × 128 | 2 | 1 | 32 |
| MV2, ↓2 | 64 × 64 | 4 | 1 | 64 |
| MV2 | 64 × 64 | 4 | 2 | 64 |
| MV2, ↓2 | 32 × 32 | 8 | 1 | 96 |
| MobileViT block (L = 2) | 32 × 32 | 8 | 1 | 96 (d = 144) |
| MV2, ↓2 | 16 × 16 | 16 | 1 | 128 |
| MobileViT block (L = 4) | 16 × 16 | 16 | 1 | 128 (d = 192) |
| MV2, ↓2 | 8 × 8 | 32 | 1 | 160 |
| MobileViT block (L = 3) | 8 × 8 | 32 | 1 | 160 (d = 240) |
| Conv-1×1 | 8 × 8 | 32 | 1 | 640 |
| Global pool, Linear | 1 × 1 | 256 | 1 | 1,000 |
| Network parameters | 5.6 M | | | |
By default, the kernel size n is set to 3 in the MobileViT block, and the spatial dimensions of each patch (height h and width w) are set to 2.
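
To make the numbers in Table 1 concrete, the following minimal sketch derives the feature-map side length at each MobileViT stage from the 256 × 256 input and the output stride, and then counts how many 2 × 2 patches fit on that map. The names `feature_map_size`, `patch_grid`, and `MOBILEVIT_STAGES` are hypothetical and chosen only for illustration. The split of the feature map into non-overlapping h × w patches is assumed from the original MobileViT design and is not spelled out in the table itself.

```python
# Values read off Table 1; the helper names below are illustrative only.
INPUT_SIZE = 256          # input image side length
PATCH_H = PATCH_W = 2     # default patch height h and width w

# One entry per MobileViT block in Table 1.
MOBILEVIT_STAGES = [
    {"stride": 8,  "channels": 96,  "d": 144, "L": 2},
    {"stride": 16, "channels": 128, "d": 192, "L": 4},
    {"stride": 32, "channels": 160, "d": 240, "L": 3},
]


def feature_map_size(input_size: int, stride: int) -> int:
    """Side length of the square feature map at a given output stride."""
    return input_size // stride


def patch_grid(input_size: int, stride: int,
               patch_h: int = PATCH_H, patch_w: int = PATCH_W) -> tuple[int, int]:
    """Return (pixels per patch, number of patches) for one stage.

    Assumes the feature map is tiled by non-overlapping patch_h x patch_w patches.
    """
    side = feature_map_size(input_size, stride)
    if side % patch_h or side % patch_w:
        raise ValueError("feature map is not divisible by the patch size")
    pixels_per_patch = patch_h * patch_w
    num_patches = (side // patch_h) * (side // patch_w)
    return pixels_per_patch, num_patches


for stage in MOBILEVIT_STAGES:
    side = feature_map_size(INPUT_SIZE, stage["stride"])
    pixels, patches = patch_grid(INPUT_SIZE, stage["stride"])
    print(f"stride {stage['stride']:>2}: {side}x{side} map, "
          f"{patches} patches of {pixels} pixels, d = {stage['d']}, L = {stage['L']}")
```

Under these assumptions, the three MobileViT blocks operate on 32 × 32, 16 × 16, and 8 × 8 feature maps, which matches the output sizes listed in Table 1.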