
Table 1.

Comparison between convolution, self-attention, and token-mixing MLP

| Operation | Information aggregation | Receptive field | Resolution sensitive | Spatial | Channel | Params | FLOPs |
|---|---|---|---|---|---|---|---|
| Convolution | static | local | false | agnostic | specific | O(k²C²) | O(HWC²) |
| Depthwise convolution | static | local | false | agnostic | specific | O(k²C) | O(HWC) |
| Self-attention⁴⁷ | dynamic | global | false | agnostic | specific | O(3C²) | O(H²W²C) |
| Token-mixing MLP¹⁵ | static | global | true | specific | agnostic | O(H²W²) | O(H²W²C) |

H, W, and C are the height, width, and number of channels of the feature map, respectively; k is the convolutional kernel size. “Information aggregation” indicates whether the weights are fixed (static) or generated dynamically from the input at inference time (dynamic). “Resolution sensitive” indicates whether the operation depends on the input resolution. “Spatial” indicates whether feature extraction is sensitive to the spatial location of objects: “specific” means it is, while “agnostic” means it is not. “Channel specific” means no weights are shared between channels; “channel agnostic” means weights are shared across channels.
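
To make the Params column concrete, the short PyTorch sketch below (our own illustration; the feature-map sizes H, W, C and kernel size k are arbitrary choices, not values from the paper) instantiates one layer of each operation and counts its parameters, which scale as O(k²C²), O(k²C), O(C²), and O(H²W²), respectively.

```python
# Minimal sketch (assumption: PyTorch is available; shapes are chosen only for illustration).
# Each layer corresponds to one row of Table 1; printed parameter counts follow the Params column.
import torch.nn as nn

H, W, C, k = 14, 14, 64, 3  # feature-map height, width, channels, and convolutional kernel size

conv = nn.Conv2d(C, C, kernel_size=k, padding=k // 2, bias=False)               # O(k²C²) parameters
dwconv = nn.Conv2d(C, C, kernel_size=k, padding=k // 2, groups=C, bias=False)   # O(k²C) parameters
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, bias=False)              # Q, K, V plus output projection: 4C² (Table 1 counts only the 3C² for Q, K, V)
token_mlp = nn.Linear(H * W, H * W, bias=False)                                  # mixes the HW tokens, shared across channels: O(H²W²) parameters

for name, module in [("conv", conv), ("dwconv", dwconv), ("attn", attn), ("token_mlp", token_mlp)]:
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name:10s} {n_params:8d} parameters")
```

The FLOPs column follows from applying each layer to an H×W feature map; for self-attention, for example, the O(H²W²C) term comes from forming and applying the (HW)×(HW) attention matrix.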