Schematic diagram of different receptive fields, including global, cruciform, and local
The bright blue dot indicates the position of the encoded token, and the light blue dots are the other locations involved in the calculation. Together, the blue dots constitute the receptive field. The orange rectangle is the local window, which covers a larger region than commonly used convolutional kernels. Transitions between receptive fields are labeled next to the arrows, and the corresponding networks are listed on the left and right sides. Although MS-MLP87 uses shifting and channel projection, its receptive field most closely resembles that of a local window with spatial projection, because a depthwise convolution is applied before the feature maps are shifted. The local receptive fields formed by MLP-like variants can also be realized by convolution, so these variants do not differ essentially from CNNs.
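The three receptive fields described in this caption can be illustrated with a minimal sketch that marks which positions contribute to one encoded token. The function name `receptive_field_mask`, the 14 × 14 grid, the token position, and the 7 × 7 window are all illustrative assumptions and are not taken from the figure or from any of the listed networks.

```python
import numpy as np


def receptive_field_mask(kind, height, width, row, col, window=7):
    """Return a boolean mask of the positions used to encode the token at (row, col).

    Hypothetical helper for illustration only; `window` is the side length
    of the local window and is an assumed value.
    """
    mask = np.zeros((height, width), dtype=bool)
    if kind == "global":
        # Every position on the feature map contributes.
        mask[:, :] = True
    elif kind == "cruciform":
        # Only the token's own row and column contribute.
        mask[row, :] = True
        mask[:, col] = True
    elif kind == "local":
        # A square window centered on the token, clipped at the borders.
        half = window // 2
        top, left = max(row - half, 0), max(col - half, 0)
        mask[top:row + half + 1, left:col + half + 1] = True
    else:
        raise ValueError(f"unknown receptive field: {kind}")
    return mask


# Count the contributing positions for a token near the center of a 14 x 14 map.
sizes = {
    kind: int(receptive_field_mask(kind, 14, 14, 7, 7, window=7).sum())
    for kind in ("global", "cruciform", "local")
}
print(sizes)
```

Under these assumptions the global field covers all 196 positions, the cruciform field covers 27 (one row plus one column, counting the token once), and the local field covers 49. The local mask has the same footprint as a 7 × 7 convolution kernel, which is consistent with the caption's remark that such receptive fields can be realized by convolution.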