Summary
Recently proposed deep multilayer perceptron (MLP) models have stirred up considerable interest in the vision community. Historically, the availability of larger datasets combined with increased computing capacity has led to paradigm shifts. This review provides a detailed discussion of whether MLPs can be a new paradigm for computer vision. We compare the intrinsic connections and differences between convolution, the self-attention mechanism, and the token-mixing MLP in detail. Advantages and limitations of the token-mixing MLP are provided, followed by a careful analysis of recent MLP-like variants, from module design to network architecture, and their applications. In the graphics processing unit era, locally and globally weighted summations are the current mainstream, represented by convolution and the self-attention mechanism, as well as MLPs. We suggest that further development of the paradigm be considered alongside next-generation computing devices.
Keywords: computer vision, neural network, deep MLP, paradigm shift
The bigger picture
The last decade has been called the third spring of deep learning, a period of extraordinary progress in both theory and applications. This progress has attracted enthusiasm from academia and investment from society at large. During this period, the scale of networks has grown tremendously, and new paradigms have emerged constantly. People wonder how long this trend will continue and what the next network will be.
We point out that existing paradigms blossom in a riot of color on a tree rooted in weighted summation with GPU computing. Its applications remain promising, but simply scaling networks with more GPUs is not sustainable in terms of computation and power. We recognize that the brilliance of this generation of networks is based on the switch of computing hardware from CPU to GPU and, similarly, we expect that the next paradigm will be brought about by the emergence of computing hardware based on a new physical system, together with a non-weighted-summation network such as the Boltzmann machine.
In the past decade, deep learning has boomed in three mainstream paradigms: convolution, self-attention, and fully connected. We compare these paradigms, comprehensively review MLPs and their vision applications, and conclude that models within different paradigms achieve competitive performance. We point out that all deep architectures are weighted summation networks with GPU-based computation, no matter what the paradigm is. Applications are still promising, but we expect the next paradigm to go beyond weighted summation and to come with the next-generation computing device.
Introduction
In computer vision, the ambition to create a system that imitates how the brain perceives and understands visual information fueled the initial development of neural networks.1,2 Subsequently, convolutional neural networks (CNNs),3, 4, 5 multilayer perceptrons (MLPs),6 and Boltzmann machines7,8 were proposed and achieved fruitful results in theoretical research9, 10, 11, 12, 13, 14 in the last century. CNNs stood out due to their computational efficiency over MLPs and deep Boltzmann machines in the contest to replace hand-crafted features, and topped the list for a vast range of visual tasks in the 2010s. Since 2020, Transformer-based models introduced from natural language processing into the visual field have once again reached a new peak. With the introduction of MLP-Mixer15 in 2021, a hot topic has arisen in the vision community: Will MLP become a new paradigm and push computer vision to a new height? This survey aims to provide opinions on this issue.
From a historical perspective, the availability of larger datasets combined with the transition from CPU-based training to graphics processing unit (GPU)-based training has led to paradigm shifts and a gradual reduction in human intervention. The locally weighted summation represented by convolution and the globally weighted summation represented by self-attention are the current mainstreams. The token-mixing MLP15 in MLP-Mixer further abandons the artificially designed self-attention mechanism and allows the model to learn the global weight matrix autonomously from the raw data, seemingly in line with the laws of historical development.
We review MLP-Mixer in detail and compare the intrinsic connections and differences between convolution, the self-attention mechanism, and the token-mixing MLP. We observe that the token-mixing MLP is an enlarged, weight-shared-between-channels version of depthwise convolution,16 which faces challenges such as high computational complexity and resolution sensitivity. Exhaustive analysis reveals that not only are recent MLP-like variant designs gradually converging toward CNNs, but the performance of these variants on visual tasks still lags behind CNN- and Transformer-based designs. At this moment, MLP is not a new paradigm that can push computer vision to new heights. In fact, computing paradigms and computing hardware are cooperative. The current weighted-sum paradigms have driven the booming of GPU-based computing and of deep learning itself, and we believe the next paradigm, perhaps Boltzmann-like, will likewise grow up alongside a new generation of computing hardware.
The rest of the paper is organized as follows. Preliminary reviews MLP, CNN, and Transformer, as well as their corresponding paradigms from a historical perspective. Pioneering model and new paradigm reviews the design of the latest MLP pioneering models, describes the differences and connections between token-mixing MLP, convolution, and self-attention mechanism, and presents the bottlenecks and challenges faced by the seemingly new paradigm. Block of MLP variants and architecture of MLP variants discuss the block evolution and network architecture of MLP-like variants. Applications of MLP variants sheds light on applications of MLP-like variants. Summary and outlook gives our summary and discusses potential future research directions.
Preliminary
For completeness and to provide helpful insight into the visual deep MLP presented in the subsequent sections, we briefly introduce MLP, CNN, and Transformer, including their brief histories and corresponding paradigms.
Multilayer perceptron and Boltzmann machine
The original “Perceptron” model was developed by Frank Rosenblatt in 1958,2 and can be viewed as a fully connected layer with only one output element. In 1969, the famous book Perceptrons by Marvin Minsky and Seymour Papert17 critically analyzed the perceptron and pointed out several of its weaknesses, including its inability to learn the XOR function. For a while, interest in perceptrons waned.
Interest revived in 1985 when Hinton and co-workers6 recognized that a feedforward neural network with two or more layers had a greater fitting ability than a single-layered perceptron. Indeed, Hinton and co-workers proposed the MLP, a network composed of multiple layers of perceptrons and activation functions, to solve the XOR problem, and they provided a reasonably effective training algorithm for neural networks, called backpropagation. As shown in 1989 by Cybenko,9 Hornik et al.,10 and Funahashi,11 MLPs are universal function approximators and can be used to construct mathematical models for classification and regression analysis.
In MLP, the fully connected layer can be viewed as a paradigm for extracting features. As the name implies, the main feature lies in the full connectivity, i.e., all the neurons from one layer are connected to every neuron in the following layer (Figure 1D). One problem with fully connected layers is the input resolution sensitivity, where the number of neurons is related to the input size. Another significant problem with full connectivity is the enormous parameter cardinality and computational cost, growing quadratically with the image resolution.
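To make this quadratic growth concrete, the following back-of-the-envelope sketch (the 224 × 224 RGB input size is our illustrative choice, not one from the original works) counts the weights of a single fully connected layer whose input and output both match the flattened image:

```python
# Hypothetical example: one fully connected layer mapping a flattened 224 x 224 RGB
# image to a hidden layer of the same size; biases are ignored for simplicity.
H = W = 224
n_in = n_out = H * W * 3          # 150,528 neurons on each side
fc_params = n_in * n_out          # 22,658,678,784 weights for this single layer
print(f"{fc_params:,}")
```

Doubling the input resolution multiplies this count by 16, which is exactly the quadratic sensitivity described above.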
The Boltzmann machine7 proposed in 1985 is more theoretically intriguing because of the analogy of its dynamics to simple physical processes; a restricted Boltzmann machine (RBM)8 comprises a layer of visible neurons and a layer of hidden neurons, with only visible-hidden connections between the two layers. RBMs can be stacked18 and are driven by recovering a minimum energy state computed by a predetermined energy function. Comparatively speaking, stacked RBMs are more computationally expensive than MLPs, especially during inference. Computing power has been the main factor limiting the development of both MLPs and RBMs.
Convolution neural network
CNN was first proposed by Fukushima3,19,20 in an architecture called Neocognitron, which involved multiple pooling and convolutional layers and inspired later CNNs. In 1989, LeCun et al.4 proposed a multilayered CNN for handwritten zip code recognition, the prototype of the architecture later called LeNet. After years of research, LeCun5 proposed LeNet-5, which outperformed all other models on handwritten character recognition. In the CPU era, given the limited computational power available, it was widely accepted that the backpropagation algorithm was ineffective at converging to the global minima of the error surface, and hand-crafted features were generally better than those learned by CNN-based extractors.21
In 2007, NVIDIA developed the CUDA programming platform;22,23 and, in 2009, ImageNet,24 a large image dataset, was proposed to provide the raw material for networks to learn image features autonomously. Three years later, AlexNet25 won the ImageNet competition, a symbolically significant event of the first paradigm shift. CNN-based architectures have gradually been utilized to extract image features automatically instead of hand-crafted features. In traditional computer vision algorithms, features such as gradient, texture, and color are extracted locally. Hence, the inductive biases inherent to CNNs, such as local connectivity and translation invariance, significantly help image feature extraction. The development of self-supervision26, 27, 28, 29 and training strategies30, 31, 32, 33, 34, 35 further assisted the continuous improvement of CNNs. In addition to classification, CNNs outperform traditional algorithms for almost all computer vision tasks, such as object detection,36,37 segmentation,38,39 demosaicing,40 super-resolution,41, 42, 43 and deblurring.44 CNN is the de facto standard for computer vision.
A CNN architecture typically comprises alternating convolutional and pooling layers with several fully connected layers behind, where the standard locally connected convolutional layer is the paradigm (Figure 1A); depthwise convolution is a variant of convolution that applies a separate convolutional filter to each individual channel (Figure 1B). In the convolutional operation, sliding kernels with the same set of weights can extract full sets of features within an image, making the convolutional layer more parameter efficient than the fully connected layer.
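As a minimal PyTorch sketch of the distinction (the 64-channel feature map is an illustrative choice), depthwise convolution is obtained by setting the number of groups equal to the number of channels:

```python
import torch.nn as nn

# Standard vs. depthwise 3 x 3 convolution on a 64-channel feature map.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)                   # 3*3*64*64 = 36,864 weights
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)   # 3*3*64    =    576 weights
```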
Vision Transformer
Keeping pace with Moore’s law,45 computing capability has increased steadily with each new generation of chips. In 2020, visual researchers noted the application and success of the Transformer46 in natural language processing and suggested moving beyond the limits of local connectivity toward global connectivity. Vision Transformer (ViT47) is the first work promoting research on Transformers in the field of vision. It uses stacked transformer blocks of the same architecture to extract visual features, where each transformer block comprises two parts, a multi-head self-attention layer and a two-layer MLP, to which layer normalization and residual paths are also added. Since then, the Transformer-based architecture has been widely accepted in the vision community, outperforming CNN-based architectures in tasks such as object detection48,49 and segmentation,49, 50, 51 and achieving state-of-the-art performance on denoising,52 deraining,52 and super-resolution.52, 53, 54, 55 Furthermore, several works56, 57, 58, 59 demonstrate that Transformer-based architectures are more robust than CNN-based methods. All the developments over the last two years indicate that the Transformer has become another de facto standard for computer vision.
The paradigm in the Transformer can be boiled down to the self-attention layer, where each input vector is linearly mapped into three different vectors: the query, the key, and the value. The query, key, and value vectors coming from different input vectors are then aggregated into three corresponding matrices, denoted Q, K, and V (Figure 1C). Self-attention mainly computes the dot-product similarity between each query vector and all key vectors, and the softmax function is applied to normalize the similarities and obtain the attention matrix. Output features then become the weighted sum of all value vectors in the entire space, where the attention matrix gives the weights. Compared with the convolutional layer, which focuses only on local characteristics, the self-attention layer can capture long-distance characteristics and easily derive global information.
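The computation can be summarized in a minimal single-head sketch in PyTorch (class and variable names are ours, and multi-head splitting, dropout, and the output projection are omitted):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal single-head self-attention over a sequence of token vectors."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)  # query, key, value projections
        self.scale = dim ** -0.5

    def forward(self, x):                                  # x: (batch, tokens, dim)
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale      # similarity of queries and keys
        attn = attn.softmax(dim=-1)                        # attention matrix
        return attn @ v                                    # weighted sum of value vectors
```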
Pioneering model and new paradigm
The success of ViT marks the paradigm shift to the era of the global receptive field in computer vision, raising the consequent question: Can we further abandon the artificially designed self-attention mechanism and allow the model to learn the global weight matrix autonomously from the raw data? This motivation reminds researchers of the simplest, long-dormant structure, the MLP. After a long period of slumber, the MLP finally reappeared in May 2021, when the first deep MLP, called MLP-Mixer,15 was launched.
This section reviews in detail the structure of the latest so-called pioneering MLP model, MLP-Mixer,15 followed by a brief review of the contemporaneous ResMLP60 as well as FeedForward.61 After that, we strip the new paradigm, token-mixing MLP, from the network and elaborate its differences and connections with convolution and self-attention mechanisms. Finally, we explore the bottlenecks of token-mixing MLP and lay the foundation for introducing subsequent variants.
Structure of pioneering model
MLP-Mixer15 is the first proposed visual deep MLP network identified by the vision community as the pioneering MLP model. Compared with the conventional MLP, it is deeper and involves several other differences. In detail, the MLP-Mixer comprises three modules: a per-patch fully connected layer for patch embedding, a stack of L mixer layers for feature extraction, and a classification header for classification (Figure 2).
Patch embedding
The patch embedding is inherited from ViT,47 comprising three steps: (1) cut an image into non-overlapping patches, (2) flatten the patches, and (3) linearly project these flattened patches. Specifically, an input image of size $H \times W \times 3$ is split into $S = HW/p^2$ non-overlapping patches, where H and W are the image’s height and width, respectively, S denotes the patch number, and p represents the patch size (typically 14 or 16). The patch is also called a token. Each patch is then unfolded into a vector $\mathbf{x}_i^p \in \mathbb{R}^{3p^2}$. In total, we obtain a set of flattened patches $\mathbf{X}^p \in \mathbb{R}^{S \times 3p^2}$, which is the input of the MLP-Mixer. For each $\mathbf{x}_i^p$, the per-patch fully connected layer maps it into a C-dimensional embedding vector:
$$\mathbf{y}_i = \mathbf{W}\,\mathbf{x}_i^p, \quad i = 1, \ldots, S, \tag{Equation 1}$$
where $\mathbf{y}_i \in \mathbb{R}^{C}$ is the embedding vector of $\mathbf{x}_i^p$ and $\mathbf{W} \in \mathbb{R}^{C \times 3p^2}$ represents the weights of the per-patch fully connected layer. In practice, it is possible to combine the three steps presented above into a single step using a 2D convolution operation, where the convolutional kernel size and stride are equal to the patch size.
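A sketch of this single-step patch embedding in PyTorch (the patch size of 16 and embedding width of 512 are illustrative) is given below:

```python
import torch
import torch.nn as nn

p, C = 16, 512                                    # illustrative patch size and embedding width
patch_embed = nn.Conv2d(3, C, kernel_size=p, stride=p)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img)                         # (1, C, 14, 14): one vector per patch
tokens = tokens.flatten(2).transpose(1, 2)        # (1, S = 196, C): flattened patch sequence
```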
Mixer layers
The MLP-Mixer stacks L mixer layers of the same architecture, where a single mixer layer essentially consists of a token-mixing MLP and a channel-mixing MLP. Let the patch features at the input of each mixer layer be $\mathbf{X} \in \mathbb{R}^{S \times C}$, where S is the number of patches and C is the dimension of each patch feature, i.e., the number of channels. The token-mixing MLP works on each column of $\mathbf{X}$, maps $\mathbb{R}^{S} \rightarrow \mathbb{R}^{S}$, and its weights are shared among all columns. The channel-mixing MLP works on each row of $\mathbf{X}$, maps $\mathbb{R}^{C} \rightarrow \mathbb{R}^{C}$, and its weights are shared among all rows. Both the token-mixing and channel-mixing MLPs comprise two fully connected layers with a non-linear activation function between them. Thus, a mixer layer can be written as follows (omitting layer indices):
$$\begin{aligned} \mathbf{U}_{*,i} &= \mathbf{X}_{*,i} + \mathbf{W}_2\,\sigma\big(\mathbf{W}_1\,\mathrm{LN}(\mathbf{X})_{*,i}\big), \quad i = 1, \ldots, C,\\ \mathbf{Y}_{j,*} &= \mathbf{U}_{j,*} + \mathbf{W}_4\,\sigma\big(\mathbf{W}_3\,\mathrm{LN}(\mathbf{U})_{j,*}\big), \quad j = 1, \ldots, S, \end{aligned} \tag{Equation 2}$$
where σ is the GELU62 activation function and LN denotes the layer normalization63 widely used in Transformer-based models. $\mathbf{W}_1 \in \mathbb{R}^{rS \times S}$, $\mathbf{W}_2 \in \mathbb{R}^{S \times rS}$, $\mathbf{W}_3 \in \mathbb{R}^{rC \times C}$, and $\mathbf{W}_4 \in \mathbb{R}^{C \times rC}$ represent the weights of the fully connected layers, and r is the expansion ratio (commonly r = 4). It is worth mentioning that each mixer layer takes an input of the same size, which is most similar to Transformers or to deep recurrent neural networks in other domains. This contrasts with most CNNs, which have a pyramidal structure: deeper layers have a lower-resolution input but more channels.
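A compact PyTorch sketch of one mixer layer following Equation 2 is shown below; note that it uses a single expansion ratio r for both MLPs, as in the formulation above, whereas implementations may choose the two hidden widths independently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MlpBlock(nn.Module):
    """Two fully connected layers with a GELU non-linearity in between."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MixerLayer(nn.Module):
    """One mixer layer: token mixing over columns, channel mixing over rows."""
    def __init__(self, S, C, r=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(C)
        self.token_mix = MlpBlock(S, r * S)      # maps R^S -> R^S, shared over channels
        self.norm2 = nn.LayerNorm(C)
        self.channel_mix = MlpBlock(C, r * C)    # maps R^C -> R^C, shared over tokens

    def forward(self, x):                        # x: (batch, S, C)
        x = x + self.token_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mix(self.norm2(x))
        return x
```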
Classification header
After processing with L stacked mixer layers, S patch features are generated. A global vector is then computed from these features through average pooling and forwarded into a fully connected layer for classification.
FeedForward61 and ResMLP60 were proposed a few days after MLP-Mixer. FeedForward61 adopts essentially the same structure as the MLP-Mixer but swaps the order of the channel-mixing MLP and the token-mixing MLP. As another contemporary work, ResMLP60 simplifies the token-mixing MLP in the MLP-Mixer from two fully connected layers to one. Meanwhile, ResMLP proposes an affine element-wise transformation to replace the layer normalization in the MLP-Mixer and stabilize training.
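The affine transformation is simple enough to sketch in a few lines (a minimal version; other ResMLP details such as output rescaling are omitted):

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Element-wise affine transform: a learnable per-channel scale and shift,
    used in place of layer normalization (no statistics are computed)."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                 # x: (batch, tokens, dim)
        return self.alpha * x + self.beta
```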
Experimentally, these pioneering MLP models achieve comparable performance to CNN and ViT for image classification (Image classification). These empirical results significantly break past perceptions and prompt the community to rethink whether the convolutional layer or the self-attention layer is necessary. The latter question induces us to explore whether a pure MLP stack will become the new paradigm for computer vision.
Token-mixing MLP, a new paradigm?
To find out whether pure MLPs stacked into mixer layers will become a new paradigm, it is necessary first to reveal the differences among the token-mixing MLP, convolution, and the self-attention mechanism. It is an indisputable fact that both the channel-mixing MLP in MLP-Mixer and the MLP in ViT are just the 1 × 1 convolution commonly used in CNNs, allowing communication between different channels. The comparison thus naturally centers on the token-mixing MLP, self-attention, and convolution, which allow communication between different spatial locations. A detailed comparison between the three is reported in Table 1.
Table 1.
Operation | Information aggregation | Receptive field | Resolution sensitive | Spatial | Channel | Params | FLOPs
---|---|---|---|---|---|---|---
Convolution | static | local | false | agnostic | specific | $O(k^2C^2)$ | $O(HWk^2C^2)$
Depthwise convolution | static | local | false | agnostic | specific | $O(k^2C)$ | $O(HWk^2C)$
Self-attention47 | dynamic | global | false | agnostic | specific | $O(C^2)$ | $O((HW)^2C + HWC^2)$
Token-mixing MLP15 | static | global | true | specific | agnostic | $O((HW)^2)$ | $O((HW)^2C)$
H, W, and C are the height, width, and channel numbers of the feature map, respectively; k is the convolutional kernel size. “Information aggregation” refers to whether the weights are fixed or dynamically generated based on the input during inference. “Resolution sensitive” refers to whether the operation is sensitive to the input resolution. “Spatial” refers to whether feature extraction is sensitive to the spatial location of objects, where “specific” means true and “agnostic” means false. “Channel specific” means no weights are shared between channels; “channel agnostic” means weights are shared between channels.
Convolution usually performs the aggregation of spatial information within a local region but models long-range dependencies poorly. The token-mixing MLP (Figure 1E) can be viewed as an unusual type of convolution whose kernel covers the entire space. To enhance efficiency, it aggregates spatial information within a single channel and shares weights across all channels. This is very close to the depthwise convolution16 (Figure 1B) used in CNNs, which applies convolutions to each channel independently. However, the convolutional kernels in depthwise convolution are usually small and are not shared between channels. The self-attention mechanism also considers all patches, but its weights are dynamically generated based on the input, whereas the weights in the token-mixing MLP and the convolutional layer are fixed and input independent.
We now shift to another important metric, namely complexity. Without loss of generality, we assume that the input feature map size is $HW \times C$ (or $C \times H \times W$ in CNNs), where H and W are the spatial resolutions and C is the number of channels. Intuitively, local computation has minimal computational complexity, i.e., $O(HWk^2C)$ in depthwise convolution and $O(HWk^2C^2)$ in dense convolution. However, both self-attention and the token-mixing MLP, which involve a global receptive field, have greater complexity, $O((HW)^2C)$. Fortunately, the token-mixing MLP is less computationally intensive than the self-attention layer due to the absence of extra calculations, such as forming the attention matrix. As for the parameter cardinality, the parameter complexity of the token-mixing MLP is $O((HW)^2)$, strongly correlated with the image resolution. So, once the network is fixed, the corresponding image resolution is also fixed. In comparison, the other paradigms have more advantages in parameters. The newly proposed MLP-like variants are optimized in complexity; see Complexity analysis.
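A back-of-the-envelope comparison under an illustrative configuration (our own numbers: a 224 × 224 input with patch size 16, C = 512 channels, and 3 × 3 kernels; expansion ratios and biases are omitted) makes the scaling concrete:

```python
S, C, k = 14 * 14, 512, 3                # 196 tokens, 512 channels, 3 x 3 kernels

conv_params           = k * k * C * C    # dense convolution:           2,359,296
depthwise_conv_params = k * k * C        # depthwise convolution:           4,608
self_attention_params = 3 * C * C        # query/key/value projections:   786,432
token_mixing_params   = S * S            # one S x S mixing matrix:        38,416

# Doubling the input resolution quadruples S, so the token-mixing term grows 16x,
# while the convolutional and self-attention parameter counts are unchanged.
```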
The above comparative analysis reveals that the token-mixing MLP is seemingly a new paradigm. But what does the new paradigm bring to the learned weights? Figure 3 visualizes the weights of fully connected layers (FC kernels) of MLP-Mixer and ResMLP trained on ImageNet24 or JFT-300M.64 ImageNet-1k contains 1.2 million labeled images, ImageNet-21k involves 14 million images, and JFT-300M has 300 million images. As the amount of training data increases, the number of FC kernels performing locality computation in MLP-Mixer increases. ResMLP’s shallow FC kernels also present some local connectivity properties. These fully connected layers actually still perform local computation, which convolutional layers could replace. Thus, we conclude that the shallow fully connected layers of deep MLPs implement a convolution-like scheme. As the number of layers increases, i.e., the network becomes deeper, the effective range of the receptive field increases and the weights become disorganized. However, it is unclear whether this is due to a lack of training data or whether it should be so. Notably, while MLP-Mixer and ResMLP share a highly similar structure (except for the normalization layer), their learned weights are vastly different. This calls into question whether MLPs are learning generic visual features. Moreover, the interpretability of MLPs lags far behind.
Bottlenecks
Based on the above analysis and comparison, it is evident that the seemingly new paradigm still faces several bottlenecks:
1. Without the inductive biases of local connectivity and self-attention, the token-mixing MLP is more flexible and has a stronger fitting capability. However, it has a greater risk of over-fitting than the convolutional and self-attention layers. Thus, large-scale datasets are needed to shorten the classification accuracy gap between MLP-Mixer, ViT, and CNN.

2. The complexity of the token-mixing MLP is quadratic in the image size, which, given current computing capability, makes existing MLP-like models intractable on high-resolution images.

3. The token-mixing MLP is resolution sensitive: the network cannot deal with a flexible input resolution once the number of neurons in the fully connected layer is set. However, some tasks adopt a multi-scale training strategy48 and have different input resolutions between the training and evaluation stages.65,66 In these cases, MLP models are non-transferable and impractical.
After several explorations and practices to address these challenges, the vision community has developed many MLP-like variants. Their main contributions are modifications to the token-mixing MLP, including reducing computational effort, removing resolution sensitivity, and reintroducing local receptive fields. These variants will be described in detail in the subsequent sections.
Block of MLP variants
To overcome the challenges faced by the MLP-Mixer, the vision community has made several attempts and proposed many MLP-like variants. The improvements focus on redesigning the network’s interior parts, i.e., the block module. The lower part of Figure 4 illustrates the block designs of the latest MLP-like variants, highlighting that the modifications primarily target the token-mixing MLP. Except for gMLP,67 the remaining blocks retain the tandem spatial MLP and channel MLP. Moreover, most of the improvements reduce the spatial MLP’s sensitivity to image resolution (green rectangle).
In this work, we reproduce most variants of MLP-like models in Jittor68,69 and Pytorch.70 Moreover, this section first details the redesigned blocks of the latest MLP-like variants, then compares their properties and receptive field, and finally discusses the findings.
MLP block variants
We divide the MLP block variants into three categories: (1) mappings employing both the axial direction and channel dimensions, (2) mappings considering only channel dimensions, and (3) mappings utilizing the entire spatial and channel dimensions. The upper part of Figure 4 categorizes the network variants. Since the channel-mixing MLP is basically the same (except for the gMLP), we review and describe the changes to the token-mixing MLP.
Axial and channel projection blocks
The global receptive field of the initial token-mixing MLP is heavily parameterized and computationally very complex. Some researchers have proposed orthogonally decomposing the spatial projections, maintaining long-range dependence while no longer encoding spatial information along a flattened spatial dimension.
Hou et al.71 present the Vision Permutator (ViP), which separately encodes the feature representations along the height and width dimensions with linear projections. This design allows ViP to capture long-range dependencies along one spatial direction while preserving precise positional information along the other direction. As illustrated in Figure 5A, Permute-MLP comprises three branches responsible for encoding information along the height, width, or channel dimension. Specifically, the height branch first splits the feature map into g segments along the channel dimension, and the segments are then concatenated along the height dimension. The resulting linear projection mixes information along the height dimension and is shared across the width and the partitioned channels. After the mapping, the feature map is recovered to its original shape. The width branch is treated in the same way as the height branch. The third path is a simple mapping along the channel dimension, which can also be regarded as a 1 × 1 convolution. Finally, the outputs from all three branches are fused by exploiting split attention.72
Tang et al.73 adopt a strategy consistent with ViP for spatial information extraction and build an attention-free network called Sparse MLP. As illustrated in Figure 5B, the block contains three parallel branches. The difference from ViP is that it no longer splits the feature map along the channel dimension, but directly maps $\mathbb{R}^{H} \rightarrow \mathbb{R}^{H}$ and $\mathbb{R}^{W} \rightarrow \mathbb{R}^{W}$ along the height and width dimensions. Without split attention, Sparse MLP’s fusion strategy concatenates the three tributary outputs along the channel dimension and then passes them through a fully connected layer for dimensionality reduction. There is also a minor modification in that Sparse MLP places a depthwise convolution in front of each block.
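A simplified sketch of such axial mixing is given below; it keeps only the core idea of separate height, width, and channel projections and omits ViP’s channel segmentation and split-attention fusion as well as Sparse MLP’s concatenation-based fusion:

```python
import torch
import torch.nn as nn

class AxialMix(nn.Module):
    """Mix tokens along the height axis, the width axis, and the channel axis separately."""
    def __init__(self, H, W, C):
        super().__init__()
        self.mix_h = nn.Linear(H, H)
        self.mix_w = nn.Linear(W, W)
        self.mix_c = nn.Linear(C, C)

    def forward(self, x):                                              # x: (batch, H, W, C)
        h = self.mix_h(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)      # mix along height
        w = self.mix_w(x.permute(0, 1, 3, 2)).permute(0, 1, 3, 2)      # mix along width
        c = self.mix_c(x)                                              # mix along channels
        return h + w + c                                               # naive fusion by summation
```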
RaftMLP74 employs serial mappings in high and wide dimensions to form a raft-token-mixing block (Figure 5C), which is different from the parallel branches of ViP and Sparse MLP. In the specific implementation, the raft-token-mixing block also splits the feature map along the channel dimension, consistent with ViP.
DynaMixer75 generates mixing matrices dynamically for each set of tokens to be mixed by considering their contents. DynaMixer adopts the parallel strategy for a computational speedup and mixes tokens in a row-wise and column-wise way (Figure 5D). The proposed DynaMixer operation first performs dimensionality reduction and flattening and then utilizes a linear function to estimate an $H \times H$ or $W \times W$ mixing matrix. Softmax is performed on each row of the mixing matrix to obtain the mixing weights. The output equals the product of the mixing weights and the input.
WaveMLP76 considers both dynamic token aggregation and the image resolution issue. It treats each token as a wave with both amplitude and phase information (Figure 5E). Tokens are aggregated according to their contents, which vary across input images, through the dynamically produced phase. Two parallel paths aggregate spatial information along the horizontal and vertical directions, respectively. To address the sensitivity to image resolution, WaveMLP adopts a simple strategy that restricts the fully connected layers to connecting only tokens within a local window.49 However, the local window limits long-range dependencies.
MorphMLP77 considers long-range and short-range dependencies while continuing the static aggregation strategy (Figure 5F). It focuses on local details in the early stages of the network and gradually changes to long-term modeling in the later stages. The local window is used to solve the image resolution sensitivity problem, and the window size increases as the number of layers increases. The authors find that such a feature extraction model is beneficial for images and videos.
Discussion
ViP, Sparse MLP, RaftMLP, and DynaMixer encode spatial information along the axial directions instead of the entire plane, preserving long-range dependence to a certain extent and reducing the parameter and computational costs. However, they are still sensitive to image resolution. WaveMLP and MorphMLP adopt a local window strategy but discard long-range dependencies. Furthermore, none of these variants can mix tokens both globally and locally.
Channel-only projection blocks
The mainstream method adopts Swin’s proposal49 and uses a local window to achieve resolution insensitivity. Another approach replaces all the spatial fully connected layers with channel projection, i.e., 1 × 1 convolution. However, this causes the tokens to no longer interact with each other, and the concept of the receptive field disappears. To reintroduce the receptive field, many works align features at different spatial locations to the same channel by shifting (or moving) the feature maps and then let spatial information interact through channel projection. Such an operation enables an MLP-like architecture to achieve the same local receptive field as a CNN-like architecture.
Yu et al.78 propose a spatial-shift MLP-like architecture for vision, called S2MLP. The actual practice is quite simple. As shown in Figure 6A, the proposed spatial-shift module groups the C channels into several groups and shifts different channel groups in different directions. After the shift, features from different tokens are aligned in the same channel, so spatial information can interact through the subsequent channel projection. Given the simplicity of this approach, Yu et al.79 exploit the idea of ViP to extend the spatial-shift module into three parallel branches and then fuse the branch features with a split attention module to further improve the network’s performance (Figure 6B). This newly proposed network is called S2MLPv2.
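The shift itself can be sketched in a few lines (the channels-last layout, single-pixel shifts, and simplified border handling are our illustrative choices):

```python
import torch

def spatial_shift(x):
    """Split channels into four groups and shift each group by one pixel in a
    different direction, so a following channel projection mixes neighboring tokens."""
    B, H, W, C = x.shape
    g = C // 4
    out = x.clone()
    out[:, 1:, :, 0 * g:1 * g] = x[:, :-1, :, 0 * g:1 * g]   # shift down
    out[:, :-1, :, 1 * g:2 * g] = x[:, 1:, :, 1 * g:2 * g]   # shift up
    out[:, :, 1:, 2 * g:3 * g] = x[:, :, :-1, 2 * g:3 * g]   # shift right
    out[:, :, :-1, 3 * g:4 * g] = x[:, :, 1:, 3 * g:4 * g]   # shift left
    return out

x = torch.randn(1, 14, 14, 64)
print(spatial_shift(x).shape)   # torch.Size([1, 14, 14, 64])
```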
Unlike grouping and then performing the same shift operation for each group, the Axial Shifted MLP (AS-MLP)80 performs different operations within each group containing a few channels, e.g., along the channel direction, successive feature maps are shifted left, not shifted, shifted right, and so on (Figure 6C). In addition, AS-MLP uses two parallel branches for horizontal and vertical shifting, where the outputs are added element-wise and projected along the channel dimension for information integration. It is worth mentioning that AS-MLP has also been extended with different shifting strategies, allowing the receptive field to resemble that of dilated (atrous) convolution.81,82
CycleMLP83 was published three days after AS-MLP. Although CycleMLP does not directly shift feature maps, it integrates features at different spatial locations along the channel direction by employing deformable convolution,84 an equivalent approach to shifting the feature map. As illustrated in Figure 6D, CycleMLP and AS-MLP slightly differ as CycleMLP has three branches and AS-MLP has only two branches. Furthermore, CycleMLP relies on split attention to fuse the branched features.
ActiveMLP85 dynamically estimates the offset, rather than manually setting it like AS-MLP and CycleMLP do (Figure 6G). It first predicts the spatial locations of helpful contextual features along each direction at the channel level, and then finds and fuses them. This is equivalent to a deformable convolution.84
HireMLP86 adopts inner-region and cross-region rearrangements before channel projection to communicate spatial information. The inner-region rearrangement expands the feature map along the height or width direction, and the cross-region rearrangement moves the feature map cyclically along the width or height direction. The HireMLP block still comprises three parallel branches, and the output feature is obtained by adding the branched features (Figure 6E).
The six models mentioned above can communicate only localized information through feature map movement. MS-MLP87 effectively expands the range of receptive fields by mixing neighboring and distant tokens from fine- to coarse-grained levels and then gathering them via a shifting operation. From the implementation perspective, MS-MLP performs depthwise convolutions of different sizes before channel alignment (Figure 6F). Compared with the global receptive field or the local window, the receptive field of MS-MLP still has some gaps.
Discussion
After shifting the feature maps, channel projection is equivalent to sampling features at different locations in different channels for aggregation. In other words, this strategy is an artificially designed deformable convolution. Thus, it may be far better to call these models CNN-like, as only local feature extraction can be performed.
Spatial and channel projection blocks
Some variants still retain the full spatial and channel projections. Their module designs do not lack ingenuity and enhance performance. Nevertheless, these methods are resolution sensitive, which prohibits them from serving as a general vision backbone.
gMLP67 is the first proposed MLP-Mixer variant. gMLP was developed by Liu et al., who experimented with several design choices for the token-mixing architecture and found that spatial projections work well when they are linear and paired with multiplicative gating. In detail, the spatial gating unit first linearly projects the input X along the token dimension, $\mathbf{Z} = \mathbf{W}\mathbf{X} + \mathbf{b}$. The output of the spatial gating unit is then $\mathbf{X} \odot \mathbf{Z}$, where $\odot$ denotes the element-wise product. The authors found it effective to split X into two independent parts along the channel dimension for the gating function and the multiplicative bypass: $\mathbf{X}_1 \odot (\mathbf{W}\mathbf{X}_2 + \mathbf{b})$. Note that the gMLP block has a channel projection before and after the spatial gating unit, but there is no longer a channel-mixing MLP. Pleasingly, gMLP achieves good performance in both computer vision and natural language processing tasks.
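A minimal sketch of the spatial gating unit is shown below (the near-zero weight and unit-bias initialization recommended for the spatial projection is omitted):

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Split the channels in half, project one half along the token dimension,
    and use it to gate the other half element-wise."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.proj = nn.Linear(seq_len, seq_len)        # spatial (token) projection

    def forward(self, x):                              # x: (batch, seq_len, dim)
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        v = self.proj(v.transpose(1, 2)).transpose(1, 2)
        return u * v                                   # multiplicative gating
```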
Lou et al.88 consider how to scale the MLP-Mixer to more parameters at comparable computational cost, making the model more computationally efficient and better performing. Specifically, they introduce the mixture-of-experts (MoE)89 scheme into the MLP-Mixer and propose Sparse-MLP(MoE). Riquelme et al.90 had already applied MoE to the MLP of the Transformer block, i.e., the channel-mixing MLP in MLP-Mixer, a few months earlier. As a continuation of that work, Lou et al. expand MoE from the channel-mixing MLP to the token-mixing MLP and achieve some performance gains compared with the primitive MLP-Mixer. More details about MoE can be found in the works of Shazeer and co-workers89 and Riquelme and co-workers.90
Yu et al.91 impose a circulant-structure constraint on the token-mixing MLP weights, reducing the spatial projection’s sensitivity to spatial translation while preserving the global receptive field. It should be noted that the model is still resolution sensitive. The circulant constraint reduces the number of parameters from quadratic to linear in the number of tokens, but it does not reduce the computation cost. Therefore, the authors employ a fast Fourier transform to reduce the FLOPs and enhance computational efficiency.
Finally, and most ingeniously, Ding et al.92 propose a novel structural re-parameterization technique to merge convolutional layers into fully connected layers. During training, the proposed RepMLPNet can therefore learn parallel fully connected layers (global receptive field) and convolutional layers (local receptive field) and combine the two by transforming the parameters. Compared with the weight values of the fully connected layers before merging, the re-parameterized weights have larger values around specific positions, suggesting that the model focuses more on the neighborhood. Although large images can be sliced and fed to RepMLPNet for feature extraction, the resolution sensitivity means it is not a general vision backbone.
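The core of this merging, converting a convolution into an equivalent fully connected weight that can simply be added to the parallel FC weight, can be illustrated with a small sketch (sizes are illustrative, and the full RepMLPNet block with grouping and batch-normalization fusion is not shown):

```python
import torch
import torch.nn as nn

C, H, W, k = 4, 8, 8, 3
conv = nn.Conv2d(C, C, k, padding=k // 2, bias=False)

# Feed an identity basis through the convolution: row i of the result is the flattened
# response to the i-th one-hot input, which yields an equivalent FC weight matrix.
with torch.no_grad():
    eye = torch.eye(C * H * W).reshape(C * H * W, C, H, W)
    fc_weight = conv(eye).reshape(C * H * W, C * H * W).t()

    x = torch.randn(1, C, H, W)
    out_conv = conv(x).flatten(1)
    out_fc = x.flatten(1) @ fc_weight.t()
print(torch.allclose(out_conv, out_fc, atol=1e-5))      # True: the two are equivalent
```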
Others
ConvMLP93 is another special variant, which is lightweight, stage-wise, and comprises a co-design of convolutional layers. ConvMLP replaces the token-mixing MLP with a depthwise convolution, and therefore we consider ConvMLP as a pure CNN model rather than an MLP-like model.
In addition, LIT94 replaces the self-attention layers with MLP in the first two stages of a pyramid ViT. UniNet95 jointly searches the optimal combination of convolution, self-attention, and MLP to build a series of all-operator network architectures with high performances on visual tasks. However, these methods are beyond the scope of Vision MLP.
Receptive field and complexity analysis
The main novelty of MLPs is allowing the model to autonomously learn a global receptive field from raw data. However, do these so-called MLP-like variants still hold to the original intent? Following the module designs, we compare and analyze the receptive fields of these blocks to provide a more in-depth picture of the MLP-like variants. We then compare and analyze the complexity of the different modules, with the corresponding results reported in Table 2. A comparison of each block’s spatial sensitivity, channel sensitivity, and resolution sensitivity is also provided.
Table 2.
Spatial operation | Block | Spatial | Channel | Resolution sensitive | Params | FLOPs |
---|---|---|---|---|---|---|
Spatial projection | MLP-Mixer15 | specific | agnostic | true | ||
ResMLP60 | specific | agnostic | true | |||
FeedForward61 | specific | agnostic | true | |||
gMLP67 | specific | agnostic | true | |||
Sparse-MLP(MoE)88 | specific | agnostic | true | |||
CCS91 | agnostic | group specific | true | |||
RepMLPNet92 | specific | group specific | true | |||
Axial projection | RaftMLP74 | specific | agnostic | true | ||
ViP71 | specific | specific | true | |||
Sparse MLP73 | specific | specific | true | |||
DynaMixer75 | specific | specific | true | |||
WaveMLP76 | specific | specific | false | |||
MorphMLP77 | specific | specific | false | |||
Shifting & channel projection | S2MLP78 | agnostic | specific | false | ||
S2MLPv279 | agnostic | specific | false | |||
AS-MLP80 | agnostic | specific | false | |||
CycleMLP83 | agnostic | specific | false | |||
HireMLP86 | agnostic | specific | false | |||
MS-MLP87 | agnostic | specific | false | |||
ActiveMLP85 | agnostic | specific | false |
H, W, and C are the feature map’s height, width, and channel numbers, respectively; L is the local window size. “Spatial” refers to whether feature extraction is sensitive to the spatial location of objects, where “specific” means true and “agnostic” means false. “Channel” refers to whether weights are shared between channels: “agnostic” shares weights between all channels, “group specific” shares weights within groups, and “specific” does not share. “Resolution sensitive” refers to whether the module is resolution sensitive.
Receptive field
The term receptive field was first used by Sherrington96 in 1906 to describe the skin area that could trigger a scratch reflex in a dog. Nowadays, the term is also used to describe artificial neural networks, where it is deemed the size of the region in the input that produces an output value. It measures the relevance of an output feature to the input region. Different information aggregation methods generate different receptive fields, which we divide into three categories: global, cruciform, and local. CNNs have local receptive fields, while vision Transformers have global receptive fields. The Swin Transformer49 introduces the concept of the local window, reducing the global receptive field to a fixed region independent of the image resolution, which is still more extensive than that of the standard convolutional layer. Figure 7 displays schematic diagrams of the different receptive fields and the corresponding MLP-like variants in detail.
Similar to MLP-Mixer,15 the full spatial projection in gMLP,67 Sparse-MLP(MoE),88 CCS,91 and RepMLPNet92 still retains the global receptive field, i.e., the encoded feature of each token is the weighted sum of all input token features. It must be acknowledged that this global receptive field is at the patch level, not at the pixel level in the traditional sense. In other words, the globalness is approximated via patch partition, similar to the Transformer-based methods. The patch size affects the final result and the network’s computational complexity. Notably, the patch partition is a strong artificial assumption that is often overlooked.
To balance long-distance dependence and computational cost, the axial projection decomposes the full spatial projection orthogonally, i.e., along the horizontal and vertical directions. The projections on the two axes are made serially (RaftMLP74) or in parallel (ViP,71 Sparse MLP,73 and DynaMixer75). Thus, the encoded token only interacts with horizontally or vertically aligned tokens in a single projection, forming a cruciform receptive field that retains horizontal and vertical long-range dependence. However, if the token of interest and the current token are not in the same row or column, the two tokens cannot interact.
The global and cruciform receptive fields are required to cover the entire height and width of the space, resulting in a one-to-one correspondence between the number of neurons in the fully connected layer and the image resolution, which further restricts the model to images of a specific size. To eliminate resolution sensitivity, many MLP-like variants (blue box in Figure 7) choose to use local receptive fields. Two approaches are mainly adopted: the local window, and channel mapping after shifting feature maps. However, these operations can be achieved by expanding the convolutional kernel size or using deformable convolution, making these variants not fundamentally different from CNNs. A concern is that these MLP-like variants abandon the global receptive field, a defining feature of MLPs.
In effect, the cruciform and local receptive fields are particular cases of the global receptive field, in which the weight values are approximately 0 outside a specific area. Therefore, these two receptive field types are equivalent to learning only a small part of the weights of the global receptive field and setting the other weights constantly to 0. This is also an artificial inductive bias, similar to the locality introduced by the convolutional layer.
Complexity analysis
Since all the network blocks mentioned above contain the channel-mixing MLP module, whose number of parameters is $O(C^2)$ and whose FLOPs are $O(HWC^2)$, we ignore this term in the following analysis and focus mainly on the modules used for spatial information fusion, i.e., the token-mixing MLP and its variants. Table 2 lists the attribute and complexity comparison of the different spatial information fusion modules, where the network names are used to name each module for ease of understanding. The complexity is referenced from the analysis in the original papers.
The full spatial projection in the MLP-Mixer15 and its variants contains $O((HW)^2)$ parameters and has $O((HW)^2C)$ FLOPs, both quadratic in the image resolution. Theoretically, it is difficult to apply the network to large-resolution images with the current computational power. Hence, a common compromise to enhance computational efficiency is to significantly increase the patch size, e.g., to 14 × 14 or 16 × 16, so the extracted information is too coarse to discriminate small objects. The only difference is that CCS91 constructs the $HW \times HW$ weight matrix from circulant weight vectors of length $HW$ and uses a fast Fourier transform to reduce the computational cost.
In contrast, the axial projection, the orthogonal decomposition of the full spatial projection exemplified by RaftMLP,74 reduces the parameter cardinality from $O((HW)^2)$ to $O(H^2 + W^2)$ and the FLOPs from $O((HW)^2C)$ to $O(HW(H + W)C)$. If three parallel branches are used, the number of parameters and the FLOPs become roughly $O(H^2 + W^2 + C^2)$ and $O(HW(H + W)C + HWC^2)$, respectively. Note that fusing information from multiple branches, such as split attention or dimensionality reduction after channel concatenation, does not increase the order of computational complexity. If a local window of size L is used, the number of parameters and FLOPs are further reduced to roughly $O(L^2 + C^2)$ and $O(LHWC + HWC^2)$, respectively, supposing a channel branch exists. DynaMixer75 requires dynamic estimation of the weight matrix, which leads to higher complexity.
The approaches based on shifting the feature map followed by channel projection further reduce the computational complexity, i.e., the number of parameters is $O(C^2)$ with $O(HWC^2)$ FLOPs. MS-MLP87 adds some depthwise convolutions and ActiveMLP85 adds some channel projections, but neither affects the overall complexity. Moreover, the number of weights is decoupled from the image resolution, which no longer constrains these variants.
It is worth noting that computational complexity is only one of the determinants of inference time, as reshaping and shifting feature maps is also time-consuming. In addition, reducing complexity does not mean that the proposed network has fewer parameters. On the contrary, the various networks retain comparable numbers of parameters (Table 4). This allows networks of lower complexity to have more layers and more channels.
Table 4.
Model | Date | Structure | Top 1 (%) | Params (M) | FLOPs (G) | Open source code |
---|---|---|---|---|---|---|
Small models | ||||||
Sparse-MLP(MoE)-S88 | 2021.09 | single stage | 71.3 | 21 | – | false |
RepMLPNet-T22492 | 2021.12 | pyramid | 76.4 | 15.2 | 2.8 | true |
ResMLP-1260 | 2021.05 | single stage | 76.6 | 15 | 3.0 | true |
Hire-MLP-Ti86 | 2021.08 | pyramid | 78.9 | 17 | 2.1 | falseb |
gMLP-S67 | 2021.05 | single stage | 79.4 | 20 | 4.5 | true |
AS-MLP-T80 | 2021.07 | pyramid | 81.3 | 28 | 4.4 | true |
ViP-small/771 | 2021.06 | two stage | 81.5 | 25 | 6.9 | true |
CycleMLP-B283 | 2021.07 | pyramid | 81.6 | 27 | 3.9 | true |
MorphMLP-T77 | 2021.11 | pyramid | 81.6 | 23 | 3.9 | false |
Sparse MLP-T73 | 2021.09 | pyramid | 81.9 | 24.1 | 5.0 | false |
ActiveMLP-T85 | 2022.03 | pyramid | 82.0 | 27 | 4.0 | false |
S2-MLPv2-small/779 | 2021.08 | two stage | 82.0 | 25 | 6.9 | false |
MS-MLP-T87 | 2022.02 | pyramid | 82.1 | 28 | 4.9 | true |
WaveMLP-S76 | 2021.11 | pyramid | 82.6 | 30.0 | 4.5 | falseb |
DynaMixer-S75 | 2022.01 | two stage | 82.7∗ | 26 | 7.3 | false |
Medium models | ||||||
FeedForward61 | 2021.05 | single stage | 74.9 | 62 | 11.4 | true |
Mixer-B/1615 | 2021.05 | single stage | 76.4 | 59 | 11.7 | true |
Sparse-MLP(MoE)-B88 | 2021.09 | single stage | 77.9 | 69 | – | false |
RaftMLP-1274 | 2021.08 | single stage | 78.0 | 58 | 12.0 | false |
ResMLP-3660 | 2021.05 | single stage | 79.7 | 45 | 8.9 | true |
Mixer-B/16 + CCS91 | 2021.06 | single stage | 79.8 | 57 | 11 | false |
RepMLPNet-B224 92 | 2021.12 | pyramid | 80.1 | 68.2 | 6.7 | true |
S2-MLP-deep 78 | 2021.06 | single stage | 80.7 | 51 | 9.7 | false |
ViP-medium/7 71 | 2021.06 | two stage | 82.7 | 55 | 16.3 | true |
CycleMLP-B4 83 | 2021.07 | pyramid | 83.0 | 52 | 10.1 | true |
AS-MLP-S 80 | 2021.07 | pyramid | 83.1 | 50 | 8.5 | true |
Hire-MLP-B 86 | 2021.08 | pyramid | 83.1 | 58 | 8.1 | falseb |
MorphMLP-B 77 | 2021.11 | pyramid | 83.2 | 58 | 10.2 | false |
Sparse MLP-B 73 | 2021.09 | pyramid | 83.4 | 65.9 | 14.0 | false |
MS-MLP-S 87 | 2022.02 | pyramid | 83.4 | 50 | 9.0 | true |
ActiveMLP-B 85 | 2022.03 | pyramid | 83.5 | 52 | 10.1 | false |
S2-MLPv2-medium/7 79 | 2021.08 | two stage | 83.6 | 55 | 16.3 | false |
WaveMLP-B 76 | 2021.11 | pyramid | 83.6 | 63.0 | 10.2 | falseb |
DynaMixer-M 75 | 2022.01 | two stage | 83.7∗ | 57 | 17.0 | false |
Large Models | ||||||
Sparse-MLP(MoE)-L 88 | 2021.09 | single stage | 79.2 | 130 | – | false |
S2-MLP-wide 78 | 2021.06 | single stage | 80.0 | 71 | 14.0 | false |
gMLP-B 67 | 2021.05 | single stage | 81.6 | 73 | 15.8 | true |
RepMLPNet-L256a92 | 2021.12 | pyramid | 81.8 | 117.7 | 11.5 | true |
ViP-large/771 | 2021.06 | two stage | 83.2 | 88 | 24.3 | true |
CycleMLP-B583 | 2021.07 | pyramid | 83.2 | 76 | 12.3 | true |
AS-MLP-B80 | 2021.07 | pyramid | 83.3 | 88 | 15.2 | true |
Hire-MLP-L86 | 2021.08 | pyramid | 83.4 | 96 | 13.5 | falseb |
MorphMLP-L77 | 2021.11 | pyramid | 83.4 | 76 | 12.5 | false |
ActiveMLP-L85 | 2022.03 | pyramid | 83.6 | 76 | 12.3 | false |
MS-MLP-B87 | 2022.02 | pyramid | 83.8 | 88 | 16.1 | true |
DynaMixer-L75 | 2022.01 | two stage | 84.3∗ | 97 | 27.4 | false |
The training and testing size is 224 × 224. “Date” means the initial release date on arXiv, where 2021.05 denotes May 2021. “Open source code” refers to whether there is officially open source code.
(a) The training and testing size is 256 × 256.
(b) Unofficial code and weights are open sourced at https://github.com/sithu31296/image-classification.
(∗) The best performance.
Discussion on block
The bottlenecks of the token-mixing MLP (Bottlenecks) induce researchers to redesign the block. Recently released MLP-like variants reduce the model’s computational complexity, aggregate information dynamically, and resolve the image resolution sensitivity. Specifically, researchers decompose the full spatial projection orthogonally, restrict interaction within a local window, perform channel projection after shifting feature maps, and make other artificial designs. These careful and clever designs demonstrate that researchers have noticed that the current amount of data and computational power is insufficient for pure MLPs. Comparing computational complexity has theoretical significance, but it is not the only determinant of inference time and final model efficiency. The analysis of the receptive field shows that the new paradigm is instead moving toward the old one. To put it more bluntly, the development of MLPs is heading back toward CNNs. Hence, efforts are still needed to balance long-distance dependence and image resolution sensitivity.
Architecture of MLP variants
Multiple blocks are stacked to form an architecture by selecting a network structure. According to our investigation, the traditional structures for classification models are also applicable to MLP-like architectures, which can be divided into three categories (Figure 8): (1) a single-stage structure inherited from ViT,47 (2) a two-stage structure with smaller patches in the early stage and larger patches in the later stage, and (3) a CNN-like pyramid structure; in each stage, there are multiple identical blocks. Table 3 illustrates the stacking structures of MLP-Mixer and its variants.
Table 3.
Spatial operation | Single stage | Two stage | Pyramid |
---|---|---|---|
Spatial projection | MLP-Mixer,15 ResMLP,60 gMLP,67 FeedForward,61 CCS,91 Sparse-MLP(MoE)88 | – | RepMLPNet92 |
Axial projection | RaftMLP74 | ViP,71 DynaMixer75 | Sparse MLP,73 WaveMLP,76 MorphMLP77 |
Shifting and channel projection | S2MLP78 | S2MLPv279 | CycleMLP,83 AS-MLP,80 HireMLP,86 MS-MLP,87 ActiveMLP85 |
From single stage to pyramid
MLP-Mixer15 inherits the “isotropic” design of ViT,47 i.e., after the patch embedding, each block does not change the size of the feature map. This is called a single-stage structure. Models of this design include FeedForward,61 ResMLP,60 gMLP,67 S2MLP,78 CCS,91 RaftMLP,74 and Sparse-MLP(MoE).88 Due to limited computing resources, the patch partition during patch embedding of single-stage models is usually large, e.g., 14 × 14 or 16 × 16, and the coarse-grained patch partition limits the fineness of subsequent features. Although the impact is not significant for single-object classification, it affects many downstream tasks, such as object detection and instance segmentation, especially for small targets.
Intuitively, smaller patches are beneficial for modeling fine-grained details in the images and tend to achieve higher recognition accuracy. ViP71 further proposes a two-stage configuration. Specifically, the network uses 7 × 7 patch slices in the initial patch embedding and performs a patch merge after a few layers. During patch merging, the height and width of the feature map are halved while the channels are doubled. Compared with a 14 × 14 patch embedding, encoding fine-level patch representations brings a slight top 1 accuracy improvement on ImageNet-1k (from 80.6% to 81.5%). S2MLPv279 follows ViP and achieves a similar top 1 accuracy improvement on ImageNet-1k (from 80.9% to 82.0%), while DynaMixer75 also adopts the two-stage configuration.
If the initial patch size is further reduced, e.g., to 4 × 4, more patch merging is required subsequently to reduce the number of patches (or tokens), prompting the network to adopt a pyramid structure. Specifically, the entire structure contains four stages, where the feature resolution reduces from H/4 × W/4 to H/32 × W/32, and the output dimension increases accordingly. Almost all the recently proposed MLP-like variants adopt the pyramid structure (right side of Table 3). Worth mentioning, a convolutional layer can equivalently achieve patch embedding if its kernel size and stride are equal to the patch size. In the pyramid structure, researchers find that using an overlapping patch embedding provides better results, that is, a 7 × 7 convolution with stride 4 instead of a 4 × 4 convolution with stride 4, similar to ResNet (the initial embedding layer of ResNet is a 7 × 7 convolutional layer with stride 2 followed by a max-pooling layer).83,97
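The two embedding choices can be contrasted in a short sketch (the 96-channel width is illustrative); both produce a feature map at 1/4 of the input resolution:

```python
import torch.nn as nn

non_overlap_embed = nn.Conv2d(3, 96, kernel_size=4, stride=4)            # 4 x 4 patches
overlap_embed = nn.Conv2d(3, 96, kernel_size=7, stride=4, padding=3)     # overlapping 7 x 7
```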
Discussion on architecture
The architecture of MLP-like variants gradually evolves from single stage to pyramid, with smaller and smaller patch size and higher feature fineness. We believe this development trend is not only to cater to downstream task frameworks, such as FPN,98 but also to balance the original intention and computing power. When the patch size decreases, the number of tokens increases accordingly, so token interactions are limited to a small range to control the amount of computation. With a small range of token interactions, the global receptive field can only be preserved by reducing the size of the feature map, and the pyramid structure appears. This is consistent with the CNN concept, where the alternating use of convolutional and pooling layers has been around since 1979!3
We would like to point out that it is unfair to compare single-stage and pyramid models directly based on the current configurations (the bottom half of Figure 8). What if a 4 × 4 patch partition were used in a single-stage model? Would it be worse than the pyramid model? Currently, this is unknown. What is known is that the cost of computation would increase significantly and is constrained by current computing devices.
Applications of MLP variants
This section reviews the applications of MLP-like variants in computer vision, including image classification, object detection and semantic segmentation, low-level vision, video analysis, and point clouds. Due to the short development history of MLPs, we focus on the first two aspects and give an intuitive comparison of MLP-, CNN-, and Transformer-based models. We limit ourselves to brief introductions of the latter three aspects, as only a few works are currently available.
Image classification
ImageNet24 is a large vision dataset designed for visual object classification. Since its release, it has been used as a benchmark for evaluating models in computer vision. Classification performance on ImageNet is often regarded as a reflection of the network’s ability to extract visual features. After training on ImageNet, the model can be well transferred to other datasets and downstream tasks, e.g., object detection and segmentation, where the transferred part is usually called vision backbone.
Table 4 compares the performance of current Vision MLP models on ImageNet-1k, including top 1 accuracy, parameters, and FLOPs, where all results are taken from the cited papers. We further divide the MLP models into three configuration types based on the number of parameters, and the rows are sorted by top 1 accuracy. The results highlight that, under the same training configuration, the recently proposed variants bring clear performance gains. Table 5 provides more detailed and comprehensive information. Compared with the latest CNN and Transformer models, MLP-like variants still show a performance gap. Without the support of extra training data, both CNNs and Transformers exceed 87% top 1 accuracy, while MLP-like variants currently reach only 84.3%. High performance may benefit from better architecture-specific training strategies, e.g., PeCo,99 but we do not yet have a training strategy specific to MLPs. The gap between MLPs and other networks is further widened with additional data support.
Table 5.
Model | Pre-trained dataset | Top 1 (%) | Params (M) | FLOPs (G) |
---|---|---|---|---|
CNN based | ||||
VGG-16100 | – | 71.5 | 134 | 15.5 |
Xception16 | – | 79.0 | 22.9 | – |
Inception-ResNet-V2101 | – | 80.1 | – | – |
ResNet-5097,102 | – | 80.4 | 25.6 | 4.1 |
ResNet-15297,102 | – | 82.0 | 60.2 | 11.5 |
RegNetY-8GF102,103 | – | 82.2 | 39 | 8.0 |
RegNetY-16GF103 | – | 82.9 | 84 | 15.9 |
ConvNeXt-B104 | – | 83.8 | 89.0 | 15.4 |
VAN-Huge105 | – | 84.2 | 60.3 | 12.2 |
EfficientNetV2-M106 | – | 85.1 | 54 | 24.0 |
EfficientNetV2-L106 | – | 85.7 | 120 | 53.0 |
PolyLoss (EfficientNetV2-L)107 | – | 87.2¹ | – | – |
EfficientNetV2-XL106 | ImageNet-21k | 87.3 | 208 | 94.0 |
RepLKNet-XL108 | ImageNet-21k | 87.8² | 335 | 128.7 |
Meta pseudo labels (EfficientNet-L2)109 | JFT-300M | 90.2³ | 480 | – |
Transformer based | ||||
ViT-B/1647 | – | 77.9 | 86 | 55.5 |
DeiT-B/16110 | – | 81.8 | 86 | 17.6 |
T2T-ViT-24111 | – | 82.3 | 64.1 | 13.8 |
PVT-large112 | – | 82.3 | 61 | 9.8 |
Swin-B49 | – | 83.5 | 88 | 15.4 |
Nest-B113 | – | 83.8 | 68 | 17.9 |
PyramidTNT-B114 | – | 84.1 | 157 | 16.0 |
CSWin-B115 | – | 84.2 | 78 | 15.0 |
CaiT-M-48-448116 | – | 86.5 | 356 | 330 |
PeCo (ViT-H)99 | – | 88.3¹ | 635 | – |
ViT-L/1647 | ImageNet-21k | 85.3 | 307 | – |
SwinV1-L49 | ImageNet-21k | 87.3 | 197 | 103.9 |
SwinV2-G117 | ImageNet-21k | 90.2² | 3000 | – |
V-MoE90 | JFT-300M | 90.4 | 14,700 | – |
ViT-G/1447 | JFT-300M | 90.5³ | 1843 | – |
CNN + Transformer | ||||
Twins-SVT-B118 | – | 83.2 | 56 | 8.6 |
Shuffle-B119 | – | 84.0 | 88 | 15.6 |
CMT-B120 | – | 84.5 | 45.7 | 9.3 |
CoAtNet-3121 | – | 84.5 | 168 | 34.7 |
VOLO-D3122 | – | 85.4 | 86 | 20.6 |
VOLO-D5122 | – | 87.1¹ | 296 | 69.0 |
CoAtNet-4121 | ImageNet-21k | 88.1² | 275 | 360.9 |
CoAtNet-7121 | JFT-300M | 90.9³ | 2440 | – |
MLP based | ||||
DynaMixer-L75 | – | 84.3¹ | 97 | 27.4 |
ResMLP-B24/860 | ImageNet-21k | 84.4² | 129.1 | 100.2 |
Mixer-H/1415 | JFT-300M | 86.3³ | 431 | – |
Pre-trained dataset column provides extra data information. PolyLoss, PeCo, and meta pseudo labels are different training strategies, where the model used is given in brackets.
¹The best performance on ImageNet-1k without a pre-trained dataset.
²The best performance on ImageNet-1k with ImageNet-21k pre-training.
³The best performance on ImageNet-1k with JFT-300M pre-training.
From Tables 4 and 5, we conclude that (1) MLP-like models can achieve competitive performance compared with CNN-based and Transformer-based architectures given the same training strategy and data volume; (2) the performance gains brought by increasing data volume and architecture-specific training strategies may be greater than those from module redesign; and (3) the vision community is therefore encouraged to build self-supervised methods and appropriate training strategies for pure MLPs.
Object detection and semantic segmentation
Some MLP-like variants76,77,80,83,85, 86, 87 pre-trained on ImageNet are transferred to downstream tasks, such as object detection and semantic segmentation. Such tasks are more challenging than classification because they involve multiple objects of interest in one input image. However, we currently do not have a pure MLP framework for object detection and segmentation. These MLP variants are instead used as backbone networks within traditional CNN-based frameworks, such as Mask R-CNN38 and UperNet,123 which requires the variant to have a pyramid structure and to be insensitive to input resolution.
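To make this backbone-plus-framework pattern concrete, the sketch below shows, in simplified PyTorch, how a pyramid-structured backbone exposes multi-scale feature maps that a detection or segmentation framework such as Mask R-CNN or UperNet would consume through an FPN-style neck. The class names, stage widths, and strides are illustrative placeholders of our own, not the actual APIs of those frameworks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPyramidBackbone(nn.Module):
    """Stand-in for a pyramid MLP-like backbone: four stages producing
    features at strides 4, 8, 16, and 32 with doubling channel widths."""
    def __init__(self, widths=(64, 128, 256, 512)):
        super().__init__()
        chans, strides = [3] + list(widths), (4, 2, 2, 2)  # stage 1 acts as a 4x4 patch embedding
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1],
                          kernel_size=strides[i], stride=strides[i]),
                nn.GELU(),
            )
            for i in range(4)
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # multi-scale features for the neck
        return feats

class ToyFPNNeck(nn.Module):
    """Minimal FPN-style neck: lateral 1x1 convs plus a top-down pathway."""
    def __init__(self, widths=(64, 128, 256, 512), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList([nn.Conv2d(w, out_channels, 1) for w in widths])

    def forward(self, feats):
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(laterals) - 1, 0, -1):   # top-down fusion
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return laterals                              # fed to detection/segmentation heads

backbone, neck = ToyPyramidBackbone(), ToyFPNNeck()
pyramid = neck(backbone(torch.randn(1, 3, 224, 224)))
print([tuple(p.shape[-2:]) for p in pyramid])  # [(56, 56), (28, 28), (14, 14), (7, 7)]
```

A single-stage MLP that outputs only one coarse token grid cannot supply these multi-scale maps directly, which is why the pyramid structure and resolution insensitivity matter for transfer.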
Table 6 reports object detection and instance segmentation results of different backbones on the COCO val2017 dataset.65 As we limit the training strategy to the Mask R-CNN 1× schedule,38 the results are not state of the art on COCO. Table 7 reports semantic segmentation results of different backbones on the ADE20K124 validation set, employing the Semantic FPN125 and UperNet123 frameworks. Empirical results show that the performance of MLP-like variants on object detection and semantic segmentation is still weaker than that of the most advanced CNN- and Transformer-based backbones.
Table 6.
Backbone (Mask R-CNN 1×38) | AP^b | AP^b_50 | AP^b_75 | AP^m | AP^m_50 | AP^m_75 | Params | FLOPs
---|---|---|---|---|---|---|---|---
CNN based | ||||||||
ResNet10197 | 40.4 | 61.1 | 44.2 | 36.4 | 57.7 | 38.8 | 63.2M | 336G |
ResNeXt101126 | 42.8 | 63.8 | 47.3 | 38.4 | 60.6 | 41.3 | 101.9M | 493G |
VAN-large105 | 47.1 | 67.9 | 51.9 | 42.2 | 65.4 | 45.5 | 64.4M | – |
Transformer based | ||||||||
PVT-large112 | 42.9 | 65.0 | 46.6 | 39.5 | 61.9 | 42.5 | 81M | 364G |
Swin-B49 | 46.9 | – | – | 42.3 | – | – | 107M | 496G |
CSWin-B115 | 48.7∗ | 70.4∗ | 53.9∗ | 43.9∗ | 67.8∗ | 47.3 | 97M | 526G |
MLP based | ||||||||
CycleMLP-B583 | 44.1 | 65.5 | 48.4 | 40.1 | 62.8 | 43.0 | 95.3M | 421G |
WaveMLP-B76 | 45.7 | 67.5 | 50.1 | 27.8 | 49.2 | 59.7∗ | 75.1M | 353G |
HireMLP-L86 | 45.9 | 67.2 | 50.4 | 41.7 | 64.7 | 45.3 | 115.2M | 443G |
MS-MLP-B87 | 46.4 | 67.2 | 50.7 | 42.4 | 63.6 | 46.4 | 107.5M | 557G |
ActiveMLP-L85 | 47.4 | 69.9 | 52.0 | 43.2 | 67.3 | 46.5 | 96.0M | – |
Results employ Mask R-CNN,38 where "1×" means that a single-scale training schedule is used.
∗The best performance.
Table 7.
Backbone | Semantic FPN125 | | | UperNet123 | | |
---|---|---|---|---|---|---
 | Params | FLOPs | mIoU (%) | Params | FLOPs | mIoU (%)
CNN based | ||||||
ResNet10197 | 47.5M | 260G | 38.8 | 86M | 1029G | 44.9 |
ResNeXt101126 | 86.4M | – | 40.2 | – | – | – |
VAN-large105 | 49.0M | – | 48.1 | 75M | – | 50.1 |
ConvNeXt-XL104 | – | – | – | 391M | 3335G | 54.0 |
RepLKNet-XL108 | – | – | – | 374M | 3431G | 56.0 |
Transformer based | ||||||
PVT-medium112 | 48.0M | 219G | 41.6 | – | – | – |
Swin-B49 | 53.2M | 274G | 45.2 | 121M | 1188G | 49.7 |
CSWin-B115 | 81.2M | 464G | 49.9∗ | 109.2M | 1222G | 52.2 |
BEiT-L127 | – | – | – | – | – | 57.0 |
SwinV2-G117 | – | – | – | – | – | 59.9∗ |
MLP based | ||||||
MorphMLP-B77 | 59.3M | – | 45.9 | – | – | – |
CycleMLP-B583 | 79.4M | 343G | 45.6 | – | – | – |
Wave-MLP-M76 | 43.3M | 231G | 46.8 | – | – | – |
AS-MLP-B80 | – | – | – | 121M | 1166G | 49.5 |
HireMLP-L86 | – | – | – | 127M | 1125G | 49.9 |
MS-MLP-B87 | – | – | – | 122M | 1172G | 49.9 |
ActiveMLP-L85 | 79.8M | – | 48.1 | 108M | 1106G | 51.1 |
Currently, the best-performing backbones are Transformer based, followed by CNNs. Due to resolution sensitivity, pure MLPs have not been used for downstream tasks. Recently, Transformer-based detection frameworks, e.g., DETR,48 have been proposed; we thus look forward to a pure MLP framework. To this end, MLPs still need to be further explored in these fields.
Low-level vision
Research on applying MLPs to the low-level vision domains, such as image generation and processing, is just beginning. These tasks output images instead of labels or boxes, making them more challenging than high-level vision tasks, such as image classification, object detection, and semantic segmentation.
Cazenavette and Guevara128 propose MixerGAN for unpaired image-to-image translation. Specifically, MixerGAN adopts the framework of CycleGAN129 but replaces the convolution-based residual block with the mixer layer of MLP-Mixer. Their experiments show that MLP-Mixer succeeds at generative objectives and, although only an initial exploration, it is promising for further extending MLP-based architectures to image composition tasks.
Tu et al.130 propose MAXIM, a UNet-shaped hierarchical structure that supports long-range interactions enabled by spatially gated MLPs. MAXIM contains two MLP-based building blocks: a multi-axis gated MLP and a cross-gating block, both of which are variants of the gMLP block.67 By applying gMLP to low-level vision tasks to gain global information, the MAXIM family achieves state-of-the-art performance with moderate complexity on multiple image processing tasks, including dehazing, deblurring, denoising, deraining, and enhancement.
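Since MAXIM builds on the gMLP block, a minimal sketch of gMLP's spatial gating unit may help: the features are split along the channel dimension, one half is normalized and projected along the token (spatial) dimension, and the result gates the other half element-wise. The sketch below follows the published gMLP description,67 but the dimensions, expansion ratio, and initialization details are simplified assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    """gMLP-style spatial gating (simplified): gate one channel half with a
    token-mixing linear projection of the other half."""
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.token_proj = nn.Linear(num_tokens, num_tokens)  # acts along the token axis
        nn.init.zeros_(self.token_proj.weight)               # start close to identity gating
        nn.init.ones_(self.token_proj.bias)

    def forward(self, x):                # x: (B, num_tokens, dim)
        u, v = x.chunk(2, dim=-1)        # split channels into two halves
        v = self.norm(v)
        v = self.token_proj(v.transpose(1, 2)).transpose(1, 2)  # mix across tokens
        return u * v                     # element-wise gating

class GMLPBlock(nn.Module):
    def __init__(self, dim, num_tokens, expansion=6):
        super().__init__()
        hidden = dim * expansion
        self.norm = nn.LayerNorm(dim)
        self.proj_in = nn.Linear(dim, hidden)
        self.sgu = SpatialGatingUnit(hidden, num_tokens)
        self.proj_out = nn.Linear(hidden // 2, dim)

    def forward(self, x):
        y = F.gelu(self.proj_in(self.norm(x)))
        y = self.sgu(y)
        return x + self.proj_out(y)      # residual connection

x = torch.randn(2, 196, 128)             # (batch, tokens, channels)
print(GMLPBlock(dim=128, num_tokens=196)(x).shape)  # torch.Size([2, 196, 128])
```

The gating gives each token a data-dependent view of all other tokens, which is the global-information mechanism MAXIM adapts to its multi-axis and cross-gating blocks.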
Video analysis
Several works extend MLPs to temporal modeling and video analysis. MorphMLP77 achieves competitive performance with recent state-of-the-art methods on the Kinetics-400131 dataset, demonstrating that the MLP-like backbone is also suitable for video recognition. Skating-Mixer132 extends the MLP-Mixer-based framework to learn from video. It was used to score figure skating at the Beijing 2022 Winter Olympic Games and demonstrated satisfactory performance. However, compared with other methods, these models do not increase the number of frames in a single input; their advantage may therefore lie in a larger spatial receptive field rather than in capturing long-term temporal information.
Point cloud
Point cloud analysis can be considered a special vision task, increasingly used in real time by robots and self-driving vehicles to understand and navigate their environments. Unlike images, point clouds are inherently sparse, unordered, and irregular. The unordered nature is one of the biggest challenges for CNNs based on local receptive fields, because adjacency in the input order does not imply spatial adjacency. In contrast, MLPs are naturally invariant to permutation, which fits the characteristics of point clouds well,133 giving rise to classical MLP-based frameworks such as PointNet134 and PointNet++.135
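A quick way to see this permutation invariance is a PointNet-style pipeline: a shared MLP is applied to every point independently, and a symmetric function (max pooling) aggregates across points, so shuffling the input order leaves the output unchanged. The sketch below is a minimal illustration of this property under our own simplified assumptions, not a reproduction of PointNet itself.

```python
import torch
import torch.nn as nn

class SharedMLPEncoder(nn.Module):
    """Shared per-point MLP followed by a symmetric max-pool aggregation."""
    def __init__(self, in_dim=3, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, points):              # points: (B, N, 3)
        per_point = self.mlp(points)        # same weights applied to every point
        return per_point.max(dim=1).values  # order-independent global feature

encoder = SharedMLPEncoder()
cloud = torch.randn(1, 1024, 3)
perm = torch.randperm(1024)
out1 = encoder(cloud)
out2 = encoder(cloud[:, perm, :])           # shuffled point order
print(torch.allclose(out1, out2))           # True: invariant to permutation
```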
Choe et al.136 design PointMixer, which embeds geometric relations between point features into the MLP-Mixer framework. The relative position vector is utilized for processing unstructured 3D points, and the token-mixing MLP is replaced with a softmax function. Ma et al.133 construct a pure residual MLP network, called PointMLP. It introduces a lightweight geometric affine module to transform local points to a normal distribution and then employs simple residual MLPs to represent local points, as they are permutation invariant and straightforward. PointMLP achieves new state-of-the-art performance on multiple datasets. In addition, recently proposed Transformer-based networks137,138 show competitive performance, where self-attention is permutation invariant for processing a sequence of points, making it well suited to point cloud learning.
From the above analysis, it is evident that Transformer and MLP are appealing solutions for unordered data, where disorder makes it challenging to design artificial inductive biases.
Discussion on application
MLP-like variants have been applied to diverse vision tasks, such as image classification, image generation, object detection, semantic segmentation, and video analysis, achieving strong performance thanks to careful redesign of the MLP block. Nevertheless, constructing pure MLP frameworks and employing MLP-specific training strategies may improve performance further. In addition, pure MLPs have already demonstrated their advantages in point cloud analysis, encouraging the application of MLPs to visual tasks with unordered data.
Summary and outlook
As the history of computer vision attests, the availability of larger datasets along with increases in computational power often triggers a paradigm shift, and within these shifts there is a gradual reduction in human intervention, i.e., removing hand-crafted inductive biases and allowing the model to learn more freely from the raw data.15 The MLPs and Boltzmann machines proposed in the last century exceeded the computational capacity of their time and were not widely used. In contrast, computationally efficient CNNs became more popular and replaced manual feature extraction. From CNNs to Transformers, we have seen the models' receptive field expand step by step, with the spatial range considered when encoding features getting larger and larger. From Transformers to deep MLPs, we no longer use similarity as the weight matrix but allow the model to learn the weights from the raw data. The latest MLP works all seem to suggest that deep MLPs are making a strong comeback as the new paradigm. In these latest developments, however, we see compromises, such as the following:
1. The latest deep MLP-based models use patch partition instead of flattening the entire input, to compromise on computational cost. This allows full connectivity and the global receptive field to be approximated at the patch level. The patch partition forms a two-dimensional matrix of size (HW/p²) × (p²C), instead of a one-dimensional vector of length HWC as in the entire-input-flattening case, where H × W is the input resolution, p is the patch size, and C is the number of input channels (see the shape sketch after this list). Subsequently, fully connected projections are performed alternately on space and channels. This is an orthogonal decomposition of the traditional fully connected projection, just as the full spatial projection is further orthogonalized into horizontal and vertical directions.
2. At the module design level, there are two main improvement routes. One type of MLP-like variant focuses on reducing computational complexity as a compromise with available computing power, and this reduction comes at the cost of decoupling full spatial connectivity. Another type addresses the resolution sensitivity problem, making it possible to transfer pre-trained models to downstream tasks. These works adopt CNN-like improvements, but the full connectivity and global receptive field of MLPs are eroded: the receptive field evolves in the opposite direction in these models, becoming smaller and smaller and reverting toward the CNN approach.
3. At the architecture level, the traditional block-stacking patterns are also applicable to MLPs, and the pyramid structure still seems to be the best choice, with a smaller initial patch size helping to obtain finer features. Note that this comparison is unfair, because the initial patch of current single-stage models is larger than that of pyramid models. The pyramid structure is, to some extent, a compromise between small patches and low computational cost. What if a finer patch partition were used in a single-stage model? Would it be worse than the pyramid model? These are unknown. What is known is that the computational cost would increase significantly, straining current computing devices.
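To make the shapes in point 1 concrete, the snippet below partitions an image tensor into non-overlapping patches and reshapes it into the two-dimensional token matrix described above; the tensor sizes (224 × 224 × 3 input, 16 × 16 patches) are illustrative assumptions.

```python
import torch

H, W, C, p = 224, 224, 3, 16                 # illustrative sizes
image = torch.randn(C, H, W)

# Flattening the entire input gives a single vector of length H*W*C.
flat = image.reshape(-1)
print(flat.shape)                             # torch.Size([150528])

# Patch partition instead yields a (H*W/p^2) x (p^2*C) matrix of tokens.
patches = image.reshape(C, H // p, p, W // p, p)               # split H and W into patches
patches = patches.permute(1, 3, 2, 4, 0).reshape(-1, p * p * C)
print(patches.shape)                          # torch.Size([196, 768])
```

Token-mixing MLPs then project along the first (spatial) axis of this matrix, and channel-mixing MLPs along the second.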
The results of our research suggest that the current amount of data and computational capability are still not enough for pure Vision MLP models to learn effectively. Moreover, human intervention still occupies an important place. Based on this conclusion, we elaborate on potential future research directions.
Vision-tailored designs
With the current amount of data and computation, human guidance remains important, and it seems natural to combine the advantages of other architectures.95,139 Currently, most MLP-like variants make an either/or choice between short- and long-range dependencies and need further visual intuitions to enhance their efficiency on visual inputs. RepMLPNet92 has made a viable attempt. We believe that, in the future, the community should focus on how to combine short-range and long-range dependencies rather than keeping only one or the other. This is consistent with human intuition: local details are beneficial for understanding individual objects, while interactions across the entire visual field remain significant. Note that insensitivity to image resolution is also important, as it ensures that the network can serve as a universal vision backbone. To sum up, we encourage the community to further rethink vision-tailored designs, i.e., to integrate the global receptive field (long-range dependencies) and local receptive field (short-range dependencies) while maintaining resolution insensitivity.
Scaling-up/down techniques
It has long been recognized that larger vision models generally perform better on vision tasks,97,100 but the size of most vision networks ranges only from a few million to a little over a hundred million parameters. Furthermore, various configurations of an MLP-like variant offer limited gains despite the increased number of parameters. Recently, the vision community has conducted scaling-up research on vision Transformers with self-supervised pre-training, including V-MoE,90 SwinV2,117 and ViTAEv2,140 which afford a considerable performance boost. Nevertheless, scaling-up techniques specific to MLPs need further exploration. Moving from the lab to real life, MLPs can have intensive power and computation requirements, hindering their deployment on edge devices and in resource-constrained environments such as mobile phones; hardware-efficient designs that would enable seamless deployment of vision MLPs on such devices are currently lacking. How do MLPs perform with low-precision training and inference? How well does knowledge distillation work for MLPs? How about using neural architecture search141 to design more efficient and lightweight MLP models? It will be interesting to see how these questions are answered.
Dedicated pre-training and optimizing method
MLPs enjoy greater freedom and a larger solution space, with fewer inductive biases than CNNs and Transformers. We currently have difficulty finding optimal solutions for MLPs, which is commonly attributed to constraints on computational power and data volume. Pre-training helps generalization, and the very limited information in the labels is then used only to slightly adjust the weights found by pre-training. Many self-supervised frameworks, such as MoCo,28,29,142 SimCLR,27,143 SimMIM,144 MAE,145 and MaskFeat,146 have already provided a big boost to CNNs and Transformers. Are these self-supervised learning methods still effective for MLPs? Can a better self-supervised training method be designed for MLPs? Moreover, is this related to the optimizer used? We know that SGD147 is a good optimizer for CNNs and that AdamW148 performs well for Transformers. What is the best choice for MLPs? A recent work149 conducts a preliminary exploration, investigating MLP-Mixer through the lens of loss-landscape geometry. GGA-MLP150 proposes a greedy genetic algorithm to optimize the weights and biases in MLPs. We believe that dedicated pre-training and optimizing methods will be an excellent boost to accelerate the development of deep MLP models.
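The optimizer question can at least be probed empirically. The snippet below shows how one might set up the two commonly used optimizers side by side in PyTorch for the same MLP-style model, purely as a starting point for such a comparison; the model and hyperparameter values are illustrative assumptions, not recommendations.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                      # stand-in for an MLP-like vision model
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, 512), nn.GELU(),
    nn.Linear(512, 1000),
)

# SGD with momentum, the usual choice for CNNs.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# AdamW with decoupled weight decay, the usual choice for Transformers.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# A fair study would train identical copies of the model with each optimizer
# (and matched schedules) and compare accuracy and loss-landscape statistics.
```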
Interpretability
Another promising direction is a more in-depth analysis and comparison of the filters learned by the network and the resulting feature maps. MLPs continue the long-term trend of removing hand-crafted inductive biases and allowing models to learn more freely from the raw data. As a consequence, the interpretability of the model keeps decreasing. Both mathematical explanations and visual analyses can help us understand what neural networks can freely learn from massive amounts of raw data with fewer priors. This can help verify whether some of the past artificial priors are correct and potentially guide the design choices of future networks. In addition, a theoretical understanding of why networks might be vulnerable is also a key topic.
Beyond MLP
With further improvements in data volume and computing power, we should look beyond the horizon of present knowledge, the weighted-sum paradigms, and reconsider more theoretical paradigms from mathematical and physical systems, such as Boltzmann machines. The weighted-sum paradigms have driven the boom of GPU-based computing and of deep learning itself, and we believe Boltzmann-like paradigms will likewise mature alongside a new generation of computing hardware.
Data and code availability
We reproduce most variants of MLP-like models in Jittor and PyTorch. Code is available at https://github.com/liuruiyang98/Jittor-MLP. This review did not generate new data.
Acknowledgments
This research is supported by the National Key Research and Development Program of China (grant no. 2021ZD0113801 and 2021YFC3330202), the Beijing Academy of Artificial Intelligence (BAAI), and the Basic Research Fund of Shenzhen City (grant no. JCYJ20210324120012033). We would like to thank Prof. Shi-min Hu and Dr. Zhaowei Xi for their helpful discussions and insightful suggestions. We acknowledge and deeply appreciate all the feedback and comments provided by the editors and the panel of anonymous reviewers. Their work has greatly enhanced the quality and contribution of this article.
Author contributions
L.T. designed the roadmap of the article. L.T. and R.L. initiated the project and supervised the team in collaboration with Y.L., D.L., H.-T.Z., and L.T. R.L. and Y.L. drafted the manuscript. R.L. and D.L. reproduced and open sourced the code of MLP-like variants in Jittor. All co-authors participated in discussions and revised the manuscript iteratively.
Declaration of interests
The authors declare no competing interests.
References
- 1.McCulloch W.S., Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943;5:115–133. doi: 10.1007/bf02478259. [DOI] [PubMed] [Google Scholar]
- 2.Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 1958;65:386–408. doi: 10.1037/h0042519. [DOI] [PubMed] [Google Scholar]
- 3.Fukushima K. Neural network model for a mechanism of pattern recognition unaffected by shift in position-neocognitron. IEICE Technical Report. A. 1979;62:658–665. doi: 10.1007/BF00344251. [DOI] [PubMed] [Google Scholar]
- 4.LeCun Y., Boser B.E., Denker J.S., Henderson D., Howard R.E., Hubbard W.E., Jackel L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989;1:541–551. doi: 10.1162/neco.1989.1.4.541. [DOI] [Google Scholar]
- 5.LeCun Y., Bottou L., Bengio Y., Haffner P. Gradient-based learning applied to document recognition. Proc. IEEE. 1998;86:2278–2324. doi: 10.1109/5.726791. [DOI] [Google Scholar]
- 6.Rumelhart D.E., Hinton G.E., Williams R.J. Tech. Rep. California Univ San Diego La Jolla Inst for Cognitive Science; 1985. Learning internal representations by error propagation. [Google Scholar]
- 7.Ackley D.H., Hinton G.E., Sejnowski T.J. A learning algorithm for Boltzmann machines. Cogn. Sci. 1985;9:147–169. doi: 10.1207/s15516709cog0901_7. [DOI] [Google Scholar]
- 8.Smolensky P. Tech. Rep. Colorado Univ at Boulder Dept of Computer Science; 1986. Information processing in dynamical systems: foundations of harmony theory. [Google Scholar]
- 9.Cybenko G. Approximation by superpositions of a sigmoidal function. Math. Control, Signals, Syst. MCSS. 1989;2:303–314. doi: 10.1007/BF02551274. [DOI] [Google Scholar]
- 10.Hornik K., Stinchcombe M.B., White H. Multilayer feedforward networks are universal approximators. Neural Network. 1989;2:359–366. doi: 10.1016/0893-6080(89)90020-8. [DOI] [Google Scholar]
- 11.Funahashi K.I. On the approximate realization of continuous mappings by neural networks. Neural Network. 1989;2:183–192. doi: 10.1016/0893-6080(89)90003-8. [DOI] [Google Scholar]
- 12.Hinton G.E., Sejnowski T.J. Learning and relearning in Boltzmann machines. Parallel distributed processing: Explorations in the microstructure of cognition. 1986;1:2. [Google Scholar]
- 13.Hinton G.E. Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Comput. 1989;1:143–150. doi: 10.1162/neco.1989.1.1.143. [DOI] [Google Scholar]
- 14.Tanaka T. Mean-field theory of Boltzmann machine learning. Phys. Rev. 1998;58:2302–2310. doi: 10.1103/physreve.58.2302. [DOI] [Google Scholar]
- 15.Tolstikhin I.O., Houlsby N., Kolesnikov A., Beyer L., Zhai X., Unterthiner T., Yung J., Steiner A., Keysers D., Uszkoreit J., et al. Mlp-mixer: an all-mlp architecture for vision. arXiv. 2021 doi: 10.48550/arXiv.2105.01601. Preprint at. [DOI] [Google Scholar]
- 16.Chollet F. 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society; 2017. Xception: deep learning with depthwise separable convolutions; pp. 1800–1807. [DOI] [Google Scholar]
- 17.Minsky M., Papert S. 1969. Perceptrons. [Google Scholar]
- 18.Hinton G.E., Salakhutdinov R.R. Reducing the dimensionality of data with neural networks. Science. 2006;313:504–507. doi: 10.1126/science.1127647. [DOI] [PubMed] [Google Scholar]
- 19.Fukushima K., Miyake S. Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognit. 1982;15:455–469. doi: 10.1016/0031-3203(82)90024-3. [DOI] [Google Scholar]
- 20.Fukushima K., Miyake S., Ito T. Neocognitron: a neural network model for a mechanism of visual pattern recognition. IEEE Trans. Syst. Man Cybern. 1983;13:826–834. doi: 10.1109/TSMC.1983.6313076. [DOI] [Google Scholar]
- 21.Schmidhuber J. In: Duch W., Mandziuk J., editors. Vol. 63. Springer; 2007. New millennium AI and the convergence of history; pp. 15–35. (Challenges for Computational Intelligence). [DOI] [Google Scholar]
- 22.Nickolls J., Buck I., Garland M., Skadron K. Scalable parallel programming with cuda: is cuda the parallel programming model that application developers have been waiting for? Queue. 2008;6:40–53. doi: 10.1145/1365490.1365500. [DOI] [Google Scholar]
- 23.Lindholm E., Nickolls J., Oberman S.F., Montrym J. NVIDIA tesla: a unified graphics and computing architecture. IEEE Micro. 2008;28:39–55. doi: 10.1109/MM.2008.31. [DOI] [Google Scholar]
- 24.Deng J., Dong W., Socher R., Li L., Li K., Fei-Fei L. 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA. IEEE Computer Society; 2009. Imagenet: a large-scale hierarchical image database; pp. 248–255. [DOI] [Google Scholar]
- 25.Krizhevsky A., Sutskever I., Hinton G.E. In: Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States. Bartlett P.L., Pereira F.C.N., Burges C.J.C., Bottou L., Weinberger K.Q., editors. 2012. Imagenet classification with deep convolutional neural networks; pp. 1106–1114. [Google Scholar]
- 26.Zhang J., Yu J., Tao D. Local deep-feature alignment for unsupervised dimension reduction. IEEE Trans. Image Process. 2018;27:2420–2432. doi: 10.1109/TIP.2018.2804218. [DOI] [PubMed] [Google Scholar]
- 27.Chen T., Kornblith S., Norouzi M., Hinton G.E. Vol. 119. PMLR; 2020. A simple framework for contrastive learning of visual representations; pp. 1597–1607. (Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event). [Google Scholar]
- 28.He K., Fan H., Wu Y., Xie S., Girshick R.B. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation/IEEE; 2020. Momentum contrast for unsupervised visual representation learning; pp. 9726–9735. [DOI] [Google Scholar]
- 29.Chen X., Fan H., Girshick R.B., He K. Improved baselines with momentum contrastive learning. arXiv. 2020 doi: 10.48550/arXiv.2003.04297. Preprint at. [DOI] [Google Scholar]
- 30.He T., Zhang Z., Zhang H., Zhang Z., Xie J., Li M. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation/IEEE; 2019. Bag of tricks for image classification with convolutional neural networks; pp. 558–567. [DOI] [Google Scholar]
- 31.Kolesnikov A., Beyer L., Zhai X., Puigcerver J., Yung J., Gelly S., Houlsby N. In: Vedaldi A., Bischof H., Brox T., Frahm J., editors. Vol. 12350. Springer; 2020. Big transfer (bit): general visual representation learning; pp. 491–507. (Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part V). [DOI] [Google Scholar]
- 32.Wightman R. 2019. Pytorch Image Models.https://github.com/rwightman/pytorch-image-models [DOI] [Google Scholar]
- 33.Contributors M. 2020. MMSegmentation: Openmmlab Semantic Segmentation Toolbox and Benchmark.https://github.com/open-mmlab/mmsegmentation [Google Scholar]
- 34.Chen K., Wang J., Pang J., Cao Y., Xiong Y., Li X., Sun S., Feng W., Liu Z., Xu J., et al. Mmdetection: open mmlab detection toolbox and benchmark. arXiv. 2019 doi: 10.48550/arXiv.1906.07155. Preprint at. [DOI] [Google Scholar]
- 35.Yu J., Tan M., Zhang H., Rui Y., Tao D. Hierarchical deep click feature prediction for fine-grained image recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022;44:563–578. doi: 10.1109/TPAMI.2019.2932058. [DOI] [PubMed] [Google Scholar]
- 36.Ren S., He K., Girshick R.B., Sun J., Faster R.-C.N.N. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. Cortes C., Lawrence N.D., Lee D.D., Sugiyama M., Garnett R., editors. 2015. Towards real-time object detection with region proposal networks; pp. 91–99. [Google Scholar]
- 37.Redmon J., Farhadi A. Yolov3: an incremental improvement. arXiv. 2018 doi: 10.48550/arXiv.1804.02767. Preprint at. [DOI] [Google Scholar]
- 38.He K., Gkioxari G., Dollár P., Girshick R.B., Mask R.-C.N.N. IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. IEEE Computer Society; 2017. pp. 2980–2988. [DOI] [Google Scholar]
- 39.Ronneberger O., Fischer P., Brox T. In: Navab N., Hornegger J., III W.M.W., Frangi A.F., editors. Vol. 9351. Springer; 2015. U-net: convolutional networks for biomedical image segmentation; pp. 234–241. (Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference Munich, Germany, October 5 - 9, 2015, Proceedings, Part III). [DOI] [Google Scholar]
- 40.Syu N., Chen Y., Chuang Y. Learning deep convolutional networks for demosaicing. arXiv. 2018 doi: 10.48550/arXiv.1802.03769. Preprint at. [DOI] [Google Scholar]
- 41.Dong C., Loy C.C., He K., Tang X. In: Fleet D.J., Pajdla T., Schiele B., Tuytelaars T., editors. Vol. 8692. Springer; 2014. Learning a deep convolutional network for image super-resolution; pp. 184–199. (Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV). [DOI] [Google Scholar]
- 42.Dong C., Loy C.C., Tang X. In: Leibe B., Matas J., Sebe N., Welling M., editors. Vol. 9906. Springer; 2016. Accelerating the super-resolution convolutional neural network; pp. 391–407. (Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II). [DOI] [Google Scholar]
- 43.Yu J., Fan Y., Yang J., Xu N., Wang Z., Wang X., Huang T. Wide activation for efficient and accurate image super-resolution. arXiv. 2018 doi: 10.48550/arXiv.1808.08718. Preprint at. [DOI] [Google Scholar]
- 44.Nah S., Kim T.H., Lee K.M. 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society; 2017. Deep multi-scale convolutional neural network for dynamic scene deblurring; pp. 257–265. [DOI] [Google Scholar]
- 45.Moore G.E. Cramming more components onto integrated circuits. Proc. IEEE. 1998;86:82–85. doi: 10.1109/jproc.1998.658762. [DOI] [Google Scholar]
- 46.Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. Guyon I., von Luxburg U., Bengio S., Wallach H.M., Fergus R., Vishwanathan S.V.N., et al., editors. 2017. Attention is all you need; pp. 5998–6008. [Google Scholar]
- 47.Dosovitskiy A., Beyer L., Kolesnikov A., Weissenborn D., Zhai X., Unterthiner T., Dehghani M., Minderer M., Heigold G., Gelly S., Uszkoreit J. An image is worth 16x16 words: transformers for image recognition at scale. arXiv. 2020 doi: 10.48550/arXiv.2010.11929. Preprint at. [DOI] [Google Scholar]
- 48.Carion N., Massa F., Synnaeve G., Usunier N., Kirillov A., Zagoruyko S. In: Vedaldi A., Bischof H., Brox T., Frahm J., editors. Vol. 12346. Springer; 2020. End-to-end object detection with transformers; pp. 213–229. (Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I). [DOI] [Google Scholar]
- 49.Liu Z., Lin Y., Cao Y., Hu H., Wei Y., Zhang Z., Lin S., Guo B. 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE; 2021. Swin transformer: hierarchical vision transformer using shifted windows; pp. 9992–10002. [DOI] [Google Scholar]
- 50.Hatamizadeh A., Tang Y., Nath V., Yang D., Myronenko A., Landman B.A., Roth H.R., Xu D. IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022. IEEE; 2022. UNETR: transformers for 3d medical image segmentation; pp. 1748–1758. [DOI] [Google Scholar]
- 51.Valanarasu J.M.J., Oza P., Hacihaliloglu I., Patel V.M. In: de Bruijne M., Cattin P.C., Cotin S., Padoy N., Speidel S., Zheng Y., et al., editors. Vol. 12901. Springer; 2021. Medical transformer: gated axial-attention for medical image segmentation; pp. 36–46. (Medical Image Computing and Computer Assisted Intervention - MICCAI 2021 - 24th International Conference, Strasbourg, France, September 27 - October 1, 2021, Proceedings, Part I). [DOI] [Google Scholar]
- 52.Chen H., Wang Y., Guo T., Xu C., Deng Y., Liu Z., Ma S., Xu C., Xu C., Gao W. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. Computer Vision Foundation/IEEE; 2021. Pre-trained image processing transformer; pp. 12299–12310. [Google Scholar]
- 53.Cao J., Li Y., Zhang K., Gool L.V. Video super-resolution transformer. arXiv. 2021 doi: 10.48550/arXiv.2106.06847. Preprint at. [DOI] [Google Scholar]
- 54.Liang J., Cao J., Sun G., Zhang K., Gool L.V., Timofte R. IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021, Montreal, BC, Canada, October 11-17, 2021. IEEE; 2021. Swinir: image restoration using swin transformer; pp. 1833–1844. [DOI] [Google Scholar]
- 55.Wang Z., Cun X., Bao J., Liu J. Uformer: A general u-shaped transformer for image restoration. arXiv. 2021 doi: 10.48550/arXiv.2106.03106. Preprint at. [DOI] [Google Scholar]
- 56.Aldahdooh A., Hamidouche W., Déforges O. Reveal of vision transformers robustness against adversarial attacks. arXiv. 2021 doi: 10.48550/arXiv.2106.03734. Preprint at. [DOI] [Google Scholar]
- 57.Bhojanapalli S., Chakrabarti A., Glasner D., Li D., Unterthiner T., Veit A. 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE; 2021. Understanding robustness of transformers for image classification; pp. 10211–10221. [DOI] [Google Scholar]
- 58.Mahmood K., Mahmood R., van Dijk M. 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE; 2021. On the robustness of vision transformers to adversarial examples; pp. 7818–7827. [DOI] [Google Scholar]
- 59.Naseer M., Ranasinghe K., Khan S.H., Hayat M., Khan F.S., Yang M. Intriguing Properties of Vision Transformers. arXiv. 2021 doi: 10.48550/arXiv.2105.10497. Preprint at. [DOI] [Google Scholar]
- 60.Touvron H., Bojanowski P., Caron M., Cord M., El-Nouby A., Grave E., Izacard G., Joulin A., Synnaeve G., Verbeek J., Jégou H. Resmlp: feedforward networks for image classification with data-efficient training. arXiv. 2021 doi: 10.48550/arXiv.2105.03404. Preprint at. [DOI] [PubMed] [Google Scholar]
- 61.Melas-Kyriazi L. Do you even need attention? A stack of feed-forward layers does surprisingly well on imagenet. arXiv. 2021 doi: 10.48550/arXiv.2105.02723. Preprint at. [DOI] [Google Scholar]
- 62.Hendrycks D., Gimpel K. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. arXiv. 2016 doi: 10.48550/arXiv.1606.08415. Preprint at. [DOI] [Google Scholar]
- 63.Ba L.J., Kiros J.R., Hinton G.E. Layer normalization. arXiv. 2016 doi: 10.48550/arXiv.1607.06450. Preprint at. [DOI] [Google Scholar]
- 64.Sun C., Shrivastava A., Singh S., Gupta A. IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. IEEE Computer Society; 2017. Revisiting unreasonable effectiveness of data in deep learning era; pp. 843–852. [DOI] [Google Scholar]
- 65.Lin T., Maire M., Belongie S.J., Hays J., Perona P., Ramanan D., Dollár P., Zitnick C.L. In: Fleet D.J., Pajdla T., Schiele B., Tuytelaars T., editors. Vol. 8693. Springer; 2014. Microsoft COCO: common objects in context; pp. 740–755. (Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V). [DOI] [Google Scholar]
- 66.Cordts M., Omran M., Ramos S., Rehfeld T., Enzweiler M., Benenson R., Franke U., Roth S., Schiele B. 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society; 2016. The cityscapes dataset for semantic urban scene understanding; pp. 3213–3223. [DOI] [Google Scholar]
- 67.Liu H., Dai Z., So D.R., Le Q.V. Pay attention to mlps. arXiv. 2021 doi: 10.48550/arXiv.2105.08050. Preprint at. [DOI] [Google Scholar]
- 68.Hu S.M., Liang D., Yang G.Y., Yang G.W., Zhou W.Y. Jittor: a novel deep learning framework with meta-operators and unified graph execution. Sci. China Inf. Sci. 2020;63:222103–222121. doi: 10.1007/s11432-020-3097-4. [DOI] [Google Scholar]
- 69.Liu R. 2021. Jittor-mlp.https://github.com/liuruiyang98/Jittor-MLP [Google Scholar]
- 70.Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., Desmaison A. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. Wallach H.M., Larochelle H., Beygelzimer A., d’Alché-Buc F., Fox E.B., Garnett R., editors. 2019. Pytorch: an imperative style, high-performance deep learning library; pp. 8024–8035. [Google Scholar]
- 71.Hou Q., Jiang Z., Yuan L., Cheng M., Yan S., Feng J. Vision permutator: a permutable mlp-like architecture for visual recognition. arXiv. 2021 doi: 10.48550/arXiv.2106.12368. Preprint at. [DOI] [PubMed] [Google Scholar]
- 72.Zhang H., Wu C., Zhang Z., Zhu Y., Zhang Z., Lin H., Zhang Z., Sun Y., He T., Mueller J., et al. Resnest: split-attention networks. arXiv. 2020 doi: 10.48550/arXiv.2004.08955. Preprint at. [DOI] [Google Scholar]
- 73.Tang C., Zhao Y., Wang G., Luo C., Xie W., Zeng W. Sparse MLP for image recognition: is self-attention really necessary? arXiv. 2021 doi: 10.48550/arXiv.2109.05422. Preprint at. [DOI] [Google Scholar]
- 74.Tatsunami Y., Taki M. Raftmlp: do mlp-based models dream of winning over computer vision? arXiv. 2021 doi: 10.48550/arXiv.2108.04384. Preprint at. [DOI] [Google Scholar]
- 75.Wang Z., Jiang W., Zhu Y., Yuan L., Song Y., Liu W. Dynamixer: a vision MLP architecture with dynamic mixing. arXiv. 2022 doi: 10.48550/arXiv.2201.12083. Preprint at. [DOI] [Google Scholar]
- 76.Tang Y., Han K., Guo J., Xu C., Li Y., Xu C., Wang Y. An image patch is a wave: phase-aware vision MLP. arXiv. 2021 doi: 10.48550/arXiv.2111.12294. Preprint at. [DOI] [Google Scholar]
- 77.Zhang D.J., Li K., Chen Y., Wang Y., Chandra S., Qiao Y., Liu L., Shou M.Z. Morphmlp: a self-attention free, mlp-like backbone for image and video. arXiv. 2021 doi: 10.48550/arXiv.2111.12527. Preprint at. [DOI] [Google Scholar]
- 78.Yu T., Li X., Cai Y., Sun M., Li P. IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022. IEEE; 2022. S2-mlp: spatial-shift MLP architecture for vision; pp. 3615–3624. [DOI] [Google Scholar]
- 79.Yu T., Li X., Cai Y., Sun M., Li P. S2-mlpv2: improved spatial-shift MLP architecture for vision. arXiv. 2021 doi: 10.48550/arXiv.2108.01072. Preprint at. [DOI] [Google Scholar]
- 80.Lian D., Yu Z., Sun X., Gao S. AS-MLP: An axial shifted MLP architecture for vision. arXiv. 2021 doi: 10.48550/arXiv.2107.08391. Preprint at. [DOI] [Google Scholar]
- 81.Yu F., Koltun V. In: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016. Bengio Y., LeCun Y., editors. Conference Track Proceedings; 2016. Multi-scale context aggregation by dilated convolutions. [Google Scholar]
- 82.Wang P., Chen P., Yuan Y., Liu D., Huang Z., Hou X., Cottrell G. 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018. IEEE Computer Society; 2018. Understanding convolution for semantic segmentation; pp. 1451–1460. [DOI] [Google Scholar]
- 83.Chen S., Xie E., GE C., Chen R., Liang D., Luo P. International Conference on Learning Representations. 2022. CycleMLP: a MLP-like architecture for dense prediction.https://openreview.net/forum?id=NMEceG4v69Y URL: [Google Scholar]
- 84.Dai J., Qi H., Xiong Y., Li Y., Zhang G., Hu H., Wei Y. IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. IEEE Computer Society; 2017. Deformable convolutional networks; pp. 764–773. [DOI] [Google Scholar]
- 85.Wei G., Zhang Z., Lan C., Lu Y., Chen Z. Activemlp: an mlp-like architecture with active token mixer. arXiv. 2022 doi: 10.48550/arXiv.2203.06108. Preprint at. [DOI] [Google Scholar]
- 86.Guo J., Tang Y., Han K., Chen X., Wu H., Xu C., Xu C., Wang Y. Hire-mlp: vision MLP via hierarchical rearrangement. arXiv. 2021 doi: 10.48550/arXiv.2108.13341. Preprint at. [DOI] [Google Scholar]
- 87.Zheng H., He P., Chen W., Zhou M. Mixing and shifting: exploiting global and local dependencies in vision mlps. arXiv. 2022 doi: 10.48550/arXiv.2202.06510. Preprint at. [DOI] [Google Scholar]
- 88.Lou Y., Xue F., Zheng Z., You Y. Sparse-mlp: a fully-mlp architecture with conditional computation. arXiv. 2021 doi: 10.48550/arXiv.2109.0200. Preprint at. [DOI] [Google Scholar]
- 89.Shazeer N., Mirhoseini A., Maziarz K., Davis A., Le Q.V., Hinton G.E., Dean J. 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net; 2017. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. [Google Scholar]
- 90.Riquelme C., Puigcerver J., Mustafa B., Neumann M., Jenatton R., Pinto A.S., Keysers D., Houlsby N. Scaling vision with sparse mixture of experts. arXiv. 2021 doi: 10.48550/arXiv.2106.05974. Preprint at. [DOI] [Google Scholar]
- 91.Yu T., Li X., Cai Y., Sun M., Li P. Rethinking token-mixing MLP for mlp-based vision backbone. arXiv. 2021 doi: 10.48550/arXiv.2106.14882. Preprint at. [DOI] [Google Scholar]
- 92.Ding X., Chen H., Zhang X., Han J., Ding G. Repmlpnet: hierarchical vision MLP with re-parameterized locality. arXiv. 2021 doi: 10.48550/arXiv.2112.11081. Preprint at. [DOI] [Google Scholar]
- 93.Li J., Hassani A., Walton S., Shi H. Convmlp: hierarchical convolutional mlps for vision. arXiv. 2021 doi: 10.48550/arXiv.2109.04454. Preprint at. [DOI] [Google Scholar]
- 94.Pan Z., Zhuang B., He H., Liu J., Cai J. Less is more: Pay less attention in vision transformers. arXiv. 2021 doi: 10.48550/arXiv.2105.14217. Preprint at. [DOI] [Google Scholar]
- 95.Liu J., Li H., Song G., Huang X., Liu Y. Uninet: unified architecture search with convolution, transformer, and MLP. arXiv. 2021 doi: 10.48550/arXiv.2110.04035. Preprint at. [DOI] [Google Scholar]
- 96.Sherrington C.S. Observations on the scratch-reflex in the spinal dog. J. Physiol. 1906;34:1–50. doi: 10.1113/jphysiol.1906.sp001139. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 97.He K., Zhang X., Ren S., Sun J. 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society; 2016. Deep residual learning for image recognition; pp. 770–778. [DOI] [Google Scholar]
- 98.Lin T., Dollár P., Girshick R.B., He K., Hariharan B., Belongie S.J. 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society; 2017. Feature pyramid networks for object detection; pp. 936–944. [DOI] [Google Scholar]
- 99.Dong X., Bao J., Zhang T., Chen D., Zhang W., Yuan L., Chen D., Wen F., Yu N. Peco: perceptual codebook for BERT pre-training of vision transformers. arXiv. 2021 doi: 10.48550/arXiv.2111.12710. Preprint at. [DOI] [Google Scholar]
- 100.Simonyan K., Zisserman A. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015. Bengio Y., LeCun Y., editors. Conference Track Proceedings; 2015. Very deep convolutional networks for large-scale image recognition. [DOI] [Google Scholar]
- 101.Szegedy C., Ioffe S., Vanhoucke V., Alemi A.A. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. Singh S.P., Markovitch S., editors. AAAI Press; 2017. Inception-v4, inception-resnet and the impact of residual connections on learning; pp. 4278–4284.http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14806 URL: [Google Scholar]
- 102.Wightman R., Touvron H., Jégou H. Resnet strikes back: an improved training procedure in timm. arXiv. 2021 doi: 10.48550/arXiv.2110.00476. Preprint at. [DOI] [Google Scholar]
- 103.Radosavovic I., Kosaraju R.P., Girshick R.B., He K., Dollár P. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation/IEEE; 2020. Designing network design spaces; pp. 10425–10433. [DOI] [Google Scholar]
- 104.Liu Z., Mao H., Wu C., Feichtenhofer C., Darrell T., Xie S. A convnet for the 2020s. arXiv. 2022 doi: 10.48550/arXiv.2201.03545. Preprint at. [DOI] [Google Scholar]
- 105.Guo M., Lu C., Liu Z., Cheng M., Hu S. Visual attention network. arXiv. 2022 doi: 10.48550/arXiv.2202.09741. Preprint at. [DOI] [Google Scholar]
- 106.Tan M., Le Q.V. In: Meila M., Zhang T., editors. Vol. 139. PMLR; 2021. Efficientnetv2: smaller models and faster training; pp. 10096–10106. (Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event). [Google Scholar]
- 107.Leng Z., Tan M., Liu C., Cubuk E.D., Shi J., Cheng S., Anguelov D. International Conference on Learning Representations. 2021. Polyloss: a polynomial expansion perspective of classification loss functions. [Google Scholar]
- 108.Ding X., Zhang X., Zhou Y., Han J., Ding G., Sun J. Scaling up your kernels to 31x31: revisiting large kernel design in cnns. arXiv. 2022 doi: 10.48550/arXiv.2203.06717. Preprint at. [DOI] [Google Scholar]
- 109.Pham H., Dai Z., Xie Q., Le Q.V. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. Computer Vision Foundation/IEEE; 2021. Meta pseudo labels; pp. 11557–11568. [Google Scholar]
- 110.Touvron H., Cord M., Douze M., Massa F., Sablayrolles A., Jégou H. In: Meila M., Zhang T., editors. Vol. 139. PMLR; 2021. Training data-efficient image transformers & distillation through attention; pp. 10347–10357. (Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event). [Google Scholar]
- 111.Yuan L., Chen Y., Wang T., Yu W., Shi Y., Jiang Z., Tay F.E., Feng J., Yan S. 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE; 2021. Tokens-to-token vit: training vision transformers from scratch on imagenet; pp. 538–547. [DOI] [Google Scholar]
- 112.Wang W., Xie E., Li X., Fan D., Song K., Liang D., Lu T., Luo P., Shao L. 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE; 2021. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions; pp. 548–558. [DOI] [Google Scholar]
- 113.Zhang Z., Zhang H., Zhao L., Chen T., Pfister T. Aggregating nested transformers. arXiv. 2021 doi: 10.48550/arXiv.2105.12723. Preprint at. [DOI] [Google Scholar]
- 114.Han K., Guo J., Tang Y., Wang Y. Pyramidtnt: improved transformer-in-transformer baselines with pyramid architecture. arXiv. 2022 doi: 10.48550/arXiv.2201.00978. Preprint at. [DOI] [Google Scholar]
- 115.Dong X., Bao J., Chen D., Zhang W., Yu N., Yuan L., Chen D., Guo B. Cswin transformer: a general vision transformer backbone with cross-shaped windows. arXiv. 2021 doi: 10.48550/arXiv.2107.00652. Preprint at. [DOI] [Google Scholar]
- 116.Touvron H., Cord M., Sablayrolles A., Synnaeve G., Jégou H. 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE; 2021. Going deeper with image transformers; pp. 32–42. [DOI] [Google Scholar]
- 117.Liu Z., Hu H., Lin Y., Yao Z., Xie Z., Wei Y., Ning J., Cao Y., Zhang Z., Dong L., Wei F. Swin Transformer V2: Scaling up Capacity and Resolution. arXiv. 2021 doi: 10.48550/arXiv.2111.09883. Preprint at. [DOI] [Google Scholar]
- 118.Chu X., Tian Z., Wang Y., Zhang B., Ren H., Wei X., Xia H., Shen C. Twins: revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021;34 doi: 10.48550/arXiv.2104.13840. [DOI] [Google Scholar]
- 119.Huang Z., Ben Y., Luo G., Cheng P., Yu G., Fu B. Shuffle transformer: rethinking spatial shuffle for vision transformer. arXiv. 2021 doi: 10.48550/arXiv.2106.03650. Preprint at. [DOI] [Google Scholar]
- 120.Guo J., Han K., Wu H., Xu C., Tang Y., Xu C., Wang Y. CMT: convolutional neural networks meet vision transformers. arXiv. 2021 doi: 10.48550/arXiv.2107.06263. Preprint at. [DOI] [Google Scholar]
- 121.Dai Z., Liu H., Le Q.V., Tan M. Coatnet: marrying convolution and attention for all data sizes. Adv. Neural Inf. Process. Syst. 2021;34:3965–3977. [Google Scholar]
- 122.Yuan L., Hou Q., Jiang Z., Feng J., Yan S. VOLO: Vision Outlooker for Visual Recognition. arXiv. 2021 doi: 10.48550/arXiv.2106.13112. Preprint at. [DOI] [PubMed] [Google Scholar]
- 123.Xiao T., Liu Y., Zhou B., Jiang Y., Sun J. In: Ferrari V., Hebert M., Sminchisescu C., Weiss Y., editors. Vol. 11209. Springer; 2018. Unified perceptual parsing for scene understanding; pp. 432–448. (Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part V). [DOI] [Google Scholar]
- 124.Zhou B., Zhao H., Puig X., Fidler S., Barriuso A., Torralba A. 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society; 2017. Scene parsing through ADE20K dataset; pp. 5122–5130. [DOI] [Google Scholar]
- 125.Kirillov A., Girshick R.B., He K., Dollár P. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation/IEEE; 2019. Panoptic feature pyramid networks; pp. 6399–6408. [DOI] [Google Scholar]
- 126.Xie S., Girshick R.B., Dollár P., Tu Z., He K. 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society; 2017. Aggregated residual transformations for deep neural networks; pp. 5987–5995. [DOI] [Google Scholar]
- 127.Bao H., Dong L., Wei F. Beit: BERT pre-training of image transformers. arXiv. 2021 doi: 10.48550/arXiv.2106.08254. Preprint at. [DOI] [Google Scholar]
- 128.Cazenavette G., Guevara M.L.D. Mixergan: an mlp-based architecture for unpaired image-to-image translation. arXiv. 2021 doi: 10.48550/arXiv.2105.14110. Preprint at. [DOI] [Google Scholar]
- 129.Zhu J., Park T., Isola P., Efros A.A. IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. IEEE Computer Society; 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks; pp. 2242–2251. [DOI] [Google Scholar]
- 130.Tu Z., Talebi H., Zhang H., Yang F., Milanfar P., Bovik A., Li Y. MAXIM: multi-axis MLP for image processing. arXiv. 2022 doi: 10.48550/arXiv.2201.02973. Preprint at. [DOI] [Google Scholar]
- 131.Carreira J., Zisserman A. 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society; 2017. Quo vadis, action recognition? A new model and the kinetics dataset; pp. 4724–4733. [DOI] [Google Scholar]
- 132.Xia J., Zhuge M., Geng T., Fan S., Wei Y., He Z., Zheng F. Skating-mixer: multimodal mlp for scoring figure skating. arXiv. 2022 doi: 10.48550/arXiv.2203.03990. Preprint at. [DOI] [Google Scholar]
- 133.Ma X., Qin C., You H., Ran H., Fu Y. Rethinking network design and local geometry in point cloud: a simple residual MLP framework. arXiv. 2022 doi: 10.48550/arXiv.2202.07123. Preprint at. [DOI] [Google Scholar]
- 134.Qi C.R., Su H., Mo K., Guibas L.J. 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society; 2017. Pointnet: deep learning on point sets for 3d classification and segmentation; pp. 77–85. [DOI] [Google Scholar]
- 135.Qi C.R., Yi L., Su H., Guibas L.J. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. Guyon I., von Luxburg U., Bengio S., Wallach H.M., Fergus R., Vishwanathan S.V.N., et al., editors. 2017. Pointnet++: deep hierarchical feature learning on point sets in a metric space; pp. 5099–5108. [Google Scholar]
- 136.Choe J., Park C., Rameau F., Park J., Kweon I.S. Pointmixer: mlp-mixer for point cloud understanding. arXiv. 2021 doi: 10.48550/arXiv.2111.11187. Preprint at. [DOI] [Google Scholar]
- 137.Guo M.H., Cai J.X., Liu Z.N., Mu T.J., Martin R.R., Hu S.M. PCT: point cloud transformer. Comput Vis Media. 2021;7:187–199. doi: 10.1007/s41095-021-0229-5. [DOI] [Google Scholar]
- 138.Zhao H., Jiang L., Jia J., Torr P.H.S., Koltun V. 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE; 2021. Point transformer; pp. 16239–16248. [DOI] [Google Scholar]
- 139.Valanarasu J.M.J., Patel V.M. Unext: mlp-based rapid medical image segmentation network. arXiv. 2022 doi: 10.48550/arXiv.2203.04967. Preprint at. arXiv:2203.04967. [DOI] [Google Scholar]
- 140.Zhang Q., Xu Y., Zhang J., Tao D. Vitaev2: vision transformer advanced by exploring inductive bias for image recognition and beyond. arXiv. 2022 doi: 10.48550/arXiv.2202.10108. Preprint at. [DOI] [Google Scholar]
- 141.Liu C., Zoph B., Neumann M., Shlens J., Hua W., Li L., Fei-Fei L., Yuille A., Huang J., Murphy K. In: Ferrari V., Hebert M., Sminchisescu C., Weiss Y., editors. Vol. 11205. Springer; 2018. Progressive neural architecture search; pp. 19–35. (Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I). [DOI] [Google Scholar]
- 142.Chen X., Xie S., He K. An empirical study of training self-supervised vision transformers. arXiv. 2021 doi: 10.48550/arXiv.2104.02057. Preprint at. [DOI] [Google Scholar]
- 143.Chen T., Kornblith S., Swersky K., Norouzi M., Hinton G.E. In: Larochelle H., Ranzato M., Hadsell R., Balcan M., Lin H., editors. Vol. 33. 2020. Big self-supervised models are strong semi-supervised learners; pp. 22243–22255.https://proceedings.neurips.cc/paper/2020/hash/fcbc95ccdd551da181207c0c1400c655-Abstract.html (Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual). URL: [Google Scholar]
- 144.Xie Z., Zhang Z., Cao Y., Lin Y., Bao J., Yao Z., Dai Q., Hu H. Simmim: a simple framework for masked image modeling. arXiv. 2021 doi: 10.48550/arXiv.2111.09886. Preprint at. [DOI] [Google Scholar]
- 145.He K., Chen X., Xie S., Li Y., Dollár P., Girshick R.B. Masked autoencoders are scalable vision learners. arXiv. 2021 doi: 10.48550/arXiv.2111.06377. Preprint at. [DOI] [Google Scholar]
- 146.Wei C., Fan H., Xie S., Wu C., Yuille A.L., Feichtenhofer C. Masked feature prediction for self-supervised visual pre-training. arXiv. 2021 doi: 10.48550/arXiv.2112.09133. Preprint at. [DOI] [Google Scholar]
- 147.Bottou L. In: Second Edition. Montavon G., Orr G.B., Müller K., editors. Vol. 7700. Springer; 2012. Stochastic gradient descent tricks; pp. 421–436. (Neural Network.: Tricks of the Trade). [DOI] [Google Scholar]
- 148.Loshchilov I., Hutter F. Fixing weight decay regularization in Adam. arXiv. 2017 doi: 10.48550/arXiv.1711.05101. Preprint at. [DOI] [Google Scholar]
- 149.Chen X., Hsieh C., Gong B. When vision transformers outperform resnets without pretraining or strong data augmentations. arXiv. 2021 doi: 10.48550/arXiv.2106.01548. Preprint at. [DOI] [Google Scholar]
- 150.Bansal P., Lamba R., Jain V., Jain T., Shokeen S., Kumar S., Singh P.K., Khan B. Contrast Media & Molecular Imaging; 2022. Gga-mlp: A Greedy Genetic Algorithm to Optimize Weights and Biases in Multilayer Perceptron; p. 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]