2022 Jul 8;3(7):100520. doi: 10.1016/j.patter.2022.100520

Figure 1.

Illustrative shift between different weighted-sum paradigms

Illustrative shift between different weighted-sum paradigms in CNN (A and B), Transformer (C), and MLP (D and E). The input feature map has shape H×W×C, where H, W, and C are its height, width, and number of channels, respectively. Light blue highlights the input features, and yellow marks the output features. The dark blue dot marks the position of interest, dark orange denotes the other features used in the calculation, and the green dot marks the corresponding output feature. The token-mixing MLP is reduced to a single fully connected layer for ease of understanding. Linear projection is a 1×1 convolution along the channel dimension, and weighted sum means that the elements are multiplied by their weights and then summed.
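
The two operations named in the caption can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the sizes (H = W = 4, C = 3, five output channels), the random weights, and all variable names are hypothetical. The first part treats linear projection as a 1×1 convolution, that is, a weighted sum over channels at each position. The second part treats the token-mixing MLP, reduced to one fully connected layer, as a weighted sum over all H×W positions within each channel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumed, not taken from the paper).
H, W, C = 4, 4, 3
C_out = 5
x = rng.standard_normal((H, W, C))  # input feature map, H x W x C

# Linear projection as a 1x1 convolution: every spatial position is
# projected independently by the same (C, C_out) weight matrix.
w_proj = rng.standard_normal((C, C_out))
projected = x @ w_proj  # shape (H, W, C_out)

# The same result at one position of interest, written as an explicit
# weighted sum: multiply each channel value by its weight, then add.
i, j = 1, 2
manual = np.array(
    [sum(x[i, j, c] * w_proj[c, k] for c in range(C)) for k in range(C_out)]
)

# Token mixing reduced to one fully connected layer: flatten the H*W
# positions into tokens, then mix across positions. Each output position
# is a weighted sum over all input positions, channel by channel.
tokens = x.reshape(H * W, C)
w_mix = rng.standard_normal((H * W, H * W))
mixed = (w_mix @ tokens).reshape(H, W, C)
```

In this sketch the projection weights are shared across positions and mix only channels, while the token-mixing weights are shared across channels and mix only positions, which is the contrast the panels are meant to illustrate.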