2022 Jun 30;16:922761. doi: 10.3389/fnbot.2022.922761

Table 1.

MobileViT architecture, where d is the input dimension of the transformer layer in the MobileViT block.

Layer	Output size	Output stride	Repeat	Output channels
Image	256 × 256	1
Conv-3 × 3, ↓2	128 × 128	2	1	16
MV2	128 × 128	2	1	32
MV2, ↓2	64 × 64	4	1	64
MV2	64 × 64	4	2	64
MV2, ↓2	32 × 32	8	1	96
MobileViT block (L = 2)	32 × 32	8	1	96 (d = 144)
MV2, ↓2	16 × 16	16	1	128
MobileViT block (L = 4)	16 × 16	16	1	128 (d = 192)
MV2, ↓2	8 × 8	32	1	160
MobileViT block (L = 3)	8 × 8	32	1	160 (d = 240)
Conv-1 × 1	8 × 8	32	1	640
Global pool, Linear	1 × 1	256	1	1,000
Network parameters			5.6 M

By default, the kernel size n is set to 3 in the MobileViT block, and the spatial dimensions of each patch (height h and width w) are set to 2.
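
To make the table note concrete, the following minimal sketch works out the feature-map side length at each MobileViT block and the patch layout the default h = w = 2 would produce. The strides, channel counts, and d values are copied from Table 1. The relations P = h·w (pixels per patch) and N = H·W / P (number of patches) follow the usual MobileViT unfolding convention and are an assumption here, since the table itself does not state them. The names `STAGES` and `patch_layout` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: values come from Table 1; the P and N relations are
# assumed from the standard MobileViT unfolding, not stated in the table.

INPUT_SIZE = 256          # input image side length (Table 1, "Image" row)
PATCH_H = PATCH_W = 2     # default patch height h and width w (table note)

# (layer name, output stride, output channels, transformer input dimension d)
STAGES = [
    ("MobileViT block (L = 2)", 8, 96, 144),
    ("MobileViT block (L = 4)", 16, 128, 192),
    ("MobileViT block (L = 3)", 32, 160, 240),
]


def patch_layout(input_size, stride, patch_h=PATCH_H, patch_w=PATCH_W):
    """Return (feature-map side, pixels per patch P, number of patches N)."""
    side = input_size // stride
    if side % patch_h or side % patch_w:
        raise ValueError("feature map is not divisible by the patch size")
    pixels_per_patch = patch_h * patch_w                        # P = h * w
    num_patches = (side // patch_h) * (side // patch_w)         # N = H * W / P
    return side, pixels_per_patch, num_patches


for name, stride, channels, d in STAGES:
    side, p, n = patch_layout(INPUT_SIZE, stride)
    print(f"{name}: {side} x {side} map, C = {channels}, d = {d}, P = {p}, N = {n}")
```

Under these assumptions, the 32 × 32 map at stride 8 splits into 256 patches of 4 pixels each, and the count falls to 64 and then 16 patches at strides 16 and 32.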