
Table 1.

Comparison between convolution, self-attention, and token-mixing MLP

| Operation | Information aggregation | Receptive field | Resolution sensitive | Spatial | Channel | Params | FLOPs |
|---|---|---|---|---|---|---|---|
| Convolution | static | local | false | agnostic | specific | O(k²C²) | O(HWC²) |
| Depthwise convolution | static | local | false | agnostic | specific | O(k²C) | O(HWC) |
| Self-attention⁴⁷ | dynamic | global | false | agnostic | specific | O(3C²) | O(H²W²C) |
| Token-mixing MLP¹⁵ | static | global | true | specific | agnostic | O(H²W²) | O(H²W²C) |

H, W, and C are the height, width, and number of channels of the feature map, respectively; k is the convolutional kernel size. “Information aggregation” indicates whether the weights are fixed (static) or generated dynamically from the input at inference time (dynamic). “Resolution sensitive” indicates whether the operation depends on the input resolution. “Spatial” indicates whether feature extraction is sensitive to the spatial location of objects: “specific” means it is, while “agnostic” means it is not. “Channel specific” means no weights are shared between channels; “channel agnostic” means weights are shared across channels.
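
To make the Params column concrete, the short PyTorch sketch below (our own illustration; the feature-map sizes H, W, C and kernel size k are arbitrary choices, not values from the paper) instantiates one layer of each operation and counts its parameters, which scale as O(k²C²), O(k²C), O(C²), and O(H²W²), respectively.

```python
# Minimal sketch (assumption: PyTorch is available; shapes are chosen only for illustration).
# Each layer corresponds to one row of Table 1; printed parameter counts follow the Params column.
import torch.nn as nn

H, W, C, k = 14, 14, 64, 3  # feature-map height, width, channels, and convolutional kernel size

conv = nn.Conv2d(C, C, kernel_size=k, padding=k // 2, bias=False)               # O(k²C²) parameters
dwconv = nn.Conv2d(C, C, kernel_size=k, padding=k // 2, groups=C, bias=False)   # O(k²C) parameters
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, bias=False)              # Q, K, V plus output projection: 4C² (Table 1 counts only the 3C² for Q, K, V)
token_mlp = nn.Linear(H * W, H * W, bias=False)                                  # mixes the HW tokens, shared across channels: O(H²W²) parameters

for name, module in [("conv", conv), ("dwconv", dwconv), ("attn", attn), ("token_mlp", token_mlp)]:
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name:10s} {n_params:8d} parameters")
```

The FLOPs column follows from applying each layer to an H×W feature map; for self-attention, for example, the O(H²W²C) term comes from forming and applying the (HW)×(HW) attention matrix.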