Table 1.
Operation | Information aggregation | Receptive field | Resolution sensitive | Spatial | Channel | Params | FLOPs |
---|---|---|---|---|---|---|---|
Convolution | static | local | false | agnostic | specific | ||
Depthwise convolution | static | local | false | agnostic | specific | ||
Self-attention [47] | dynamic | global | false | agnostic | specific | ||
Token-mixing MLP [15] | static | global | true | specific | agnostic | ||
H, W, and C are the height, width, and number of channels of the feature map, respectively; k is the convolutional kernel size. “Information aggregation” indicates whether the aggregation weights are fixed (static) or generated dynamically from the input during inference. “Resolution sensitive” indicates whether the operation is sensitive to the input resolution. “Spatial” indicates whether feature extraction is sensitive to the spatial location of objects: “specific” means it is, and “agnostic” means it is not. For “Channel”, “specific” means that no weights are shared between channels, and “agnostic” means that weights are shared between channels.
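
As a rough illustration of how H, W, C, and k enter the cost of each operation, the following Python sketch computes textbook parameter and multiply-accumulate counts. These are not necessarily the expressions in the Params and FLOPs columns, whose entries are missing in this copy of the table. The function name `op_costs` is hypothetical, and the counts assume equal input and output channels, stride 1, no bias terms, single-head self-attention with query, key, value, and output projections, and a token-mixing MLP reduced to a single linear layer over the H·W tokens.

```python
def op_costs(H, W, C, k):
    """Return {operation: (params, macs)} under the simplifying
    assumptions stated above (illustrative, not the paper's values)."""
    N = H * W  # number of spatial positions (tokens)
    return {
        # Every output channel has its own k x k filter per input channel.
        "convolution": (k * k * C * C, N * k * k * C * C),
        # One k x k filter per channel, no mixing across channels.
        "depthwise_convolution": (k * k * C, N * k * k * C),
        # Four C x C projections, plus the N x N attention map
        # (computing QK^T and applying it to V).
        "self_attention": (4 * C * C, 4 * N * C * C + 2 * N * N * C),
        # Assumed single N x N linear layer shared by all channels.
        "token_mixing_mlp": (N * N, N * N * C),
    }


if __name__ == "__main__":
    for H in (14, 28):
        print(f"H = W = {H}, C = 64, k = 3")
        for name, (params, macs) in op_costs(H, H, 64, 3).items():
            print(f"  {name:<22} params={params:>10,}  MACs={macs:>14,}")
```

Under these assumptions, only the token-mixing MLP has a parameter count that depends on H and W, which is consistent with it being the only operation marked as resolution sensitive in the table.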