2022 Jun 30;16:922761. doi: 10.3389/fnbot.2022.922761

Table 1.

MobileViT architecture, where d is the dimensionality of the input to the transformer layer in the MobileViT block.

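| Layer | Output size | Output stride | Repeat | Output channels |
|---|---|---|---|---|
| Image | 256 × 256 | 1 | | |
| Conv-3 × 3, ↓2 | 128 × 128 | 2 | 1 | 16 |
| MV2 | 128 × 128 | 2 | 1 | 32 |
| MV2, ↓2 | 64 × 64 | 4 | 1 | 64 |
| MV2 | 64 × 64 | 4 | 2 | 64 |
| MV2, ↓2 | 32 × 32 | 8 | 1 | 96 |
| MobileViT block (L = 2) | 32 × 32 | 8 | 1 | 96 (d = 144) |
| MV2, ↓2 | 16 × 16 | 16 | 1 | 128 |
| MobileViT block (L = 4) | 16 × 16 | 16 | 1 | 128 (d = 192) |
| MV2, ↓2 | 8 × 8 | 32 | 1 | 160 |
| MobileViT block (L = 3) | 8 × 8 | 32 | 1 | 160 (d = 240) |
| Conv-1 × 1 | 8 × 8 | 32 | 1 | 640 |
| Global pool | 1 × 1 | 256 | 1 | |
| Linear | 1 × 1 | 256 | 1 | 1,000 |

Network parameters: 5.6 M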
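
The output size and output stride columns are tied together: each spatial side length is the 256-pixel input side divided by the cumulative stride. The sketch below checks this for the convolutional and MobileViT stages, using values transcribed from Table 1. The names `STAGES` and `output_size` are illustrative and do not come from the paper.

```python
# Illustrative sketch: (layer, cumulative output stride, output channels),
# transcribed from Table 1. Names are hypothetical, not from the paper.
INPUT_SIZE = 256

STAGES = [
    ("Conv-3x3, down 2", 2, 16),
    ("MV2", 2, 32),
    ("MV2, down 2", 4, 64),
    ("MV2 (repeated 2x)", 4, 64),
    ("MV2, down 2", 8, 96),
    ("MobileViT block (L=2, d=144)", 8, 96),
    ("MV2, down 2", 16, 128),
    ("MobileViT block (L=4, d=192)", 16, 128),
    ("MV2, down 2", 32, 160),
    ("MobileViT block (L=3, d=240)", 32, 160),
    ("Conv-1x1", 32, 640),
]


def output_size(input_size: int, stride: int) -> int:
    """Spatial side length after downsampling by the cumulative output stride."""
    return input_size // stride


for name, stride, channels in STAGES:
    side = output_size(INPUT_SIZE, stride)
    print(f"{name:32s} {side:>3d} x {side:<3d} channels={channels}")
```

Each "↓2" layer doubles the cumulative stride and halves the side length, so the five downsampling steps take the 256 × 256 input to the 8 × 8 map that feeds the global pool.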

By default, the kernel size n in the MobileViT block is set to 3, and the spatial size of each patch (height h and width w) is set to 2.
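
To make the effect of h = w = 2 concrete, the sketch below counts the patches each MobileViT block would process. It assumes, as in the original MobileViT design, that the block splits its feature map into non-overlapping h × w patches. The function name `patch_grid` is hypothetical.

```python
def patch_grid(height: int, width: int, h: int = 2, w: int = 2) -> tuple[int, int]:
    """Return (pixels per patch, number of patches) for non-overlapping h x w patches.

    Assumption: the feature map is divided evenly, with no padding or overlap.
    """
    if height % h != 0 or width % w != 0:
        raise ValueError("feature map must be divisible by the patch size")
    pixels_per_patch = h * w
    num_patches = (height // h) * (width // w)
    return pixels_per_patch, num_patches


# Feature-map sizes of the three MobileViT blocks in Table 1.
for side in (32, 16, 8):
    p, n = patch_grid(side, side)
    print(f"{side} x {side}: {p} pixels per patch, {n} patches")
# 32 x 32: 4 pixels per patch, 256 patches
# 16 x 16: 4 pixels per patch, 64 patches
# 8 x 8: 4 pixels per patch, 16 patches
```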