Abstract
State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. Despite their effectiveness, SSMs struggle with global context modeling due to data-independent matrices. The Mamba model addresses this with data-dependent variants enabled by the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures face significant parameter scalability challenges, limiting their utility in vision applications. This paper tackles the scalability issue of large SSMs for image classification and action recognition without relying on additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures and increases robustness to common corruption artifacts. Our thorough evaluation on the ImageNet-1K, Kinetics-400, and Something-Something-v2 benchmarks shows that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to %.
Keywords: Action Recognition, Mamba, Robustness
Introduction
Various networks have been proposed for both image and video recognition in recent years. These include convolutional neural networks (Krizhevsky et al., 2012; He et al., 2016; Carreira & Zisserman, 2017; Feichtenhofer et al., 2019), vision Transformers (Dosovitskiy et al., 2021; Arnab et al., 2021), and networks using focal modulation (Yang et al., 2022; Wasim et al., 2023). The Attention-based Transformer models have dominated both image and video recognition, either as pure Attention-based models (Liu et al., 2021, 2022; Arnab et al., 2021; Bertasius et al., 2021; Yan et al., 2022) or as hybrid models (Li et al., 2022b; Fan et al., 2021; Li et al., 2022a).
Recently, State-Space Models (SSMs) such as S4 (Gu et al., 2022) have gained popularity as a new context modeling method. They recurrently model context and bring well-established techniques from state-space modeling to deep large models. However, S4 encountered a problem in terms of modeling global context due to the data-independent nature of the input, state-transition, and output matrices. To mitigate this issue, the Mamba (Gu & Dao, 2024) model introduced the S6 selective-scan algorithm, which uses data-dependent variants of the input and output matrices. This improves the context modeling capabilities, particularly on long sequences, and the approach has been adapted to image tasks (Zhu et al., 2024; Liu et al., 2024) and in the recent work VideoMamba (Li et al., 2024) to the video domain.
In this work, we investigate the properties of vision SSMs, where we focus on VideoMamba (Li et al., 2024) since it is the largest vision SSM architecture and the only one that can be applied to videos, and make two key observations. First, VideoMamba does not scale well with the number of parameters as plotted in Fig 1. While the accuracy substantially increases as the number of parameters is increased from 7M (tiny) to 25M (small) parameters, the accuracy only slightly increases if the parameters are increased further to 75M (middle) parameters. To mitigate this issue, Li et al. (2024) proposed to first train a small model and then use it as the teacher for training a larger model using distillation. While distillation improves the accuracy of the middle-sized model, it does not solve the underlying problem. Increasing the parameters further to 98M (base) parameters again does not improve the results.
Fig. 1.

Performance comparison with VideoMamba: We compare the performance of our model with VideoMamba (Li et al., 2024), both with and without distillation, on IN1K (Deng et al., 2009)
The second observation is the higher sensitivity of the Mamba-based network to common corruptions and perturbations like image blur or JPEG compression in comparison to vision Transformers as shown in Fig 2. Both observations are major limitations for practical applications. We therefore propose a simple yet efficient Mamba-Attention interleaved architecture, termed StableMamba, that resolves both issues. It improves the robustness to common corruptions and perturbations during inference (Hendrycks & Dietterich, 2019) as shown in Fig 2 and mitigates the scalability issue without the need for cumbersome workarounds like distillation as shown in Fig 1. In summary, the main contributions of this paper are:
We analyze the largest Mamba architecture for images and video and present a simple yet efficient Mamba-Attention interleaved architecture.
We show that our approach resolves the scalability issue and increases the robustness to various common corruptions (Hendrycks & Dietterich, 2019).
We report improved performance for comparable methods for image classification on ImageNet-1K (Deng et al., 2009) and for action recognition on Kinetics-400 (Kay et al., 2017) and Something-Something-v2 (Goyal et al., 2017).
Fig. 2.
a Performance comparison of different networks on Gaussian blur corruption. b Performance comparison of different networks on JPEG compression corruption
Related Work
Image and Video Recognition: In the last decade, Convolutional Neural Networks (CNNs) have been the primary choice for computer vision tasks. Starting with the introduction of AlexNet (Krizhevsky et al., 2012), the field has seen rapid advancements with notable architectures such as VGG (Simonyan & Zisserman, 2015), Inception (Szegedy et al., 2015), ResNet (He et al., 2016), MobileNet (Howard et al., 2017), and EfficientNet (Tan & Le, 2019) achieving improved performance on ImageNet (Deng et al., 2009). Recently, ConvNeXt variants (Liu et al., 2022a; Woo et al., 2023) and FocalNets (Yang et al., 2022) have updated traditional 2D ConvNets with modern design elements and training techniques, achieving performance comparable to state-of-the-art models. At the same time, the Vision Transformer (ViT) (Dosovitskiy et al., 2021), inspired by the Transformer (Vaswani et al., 2017) for natural language processing, and its variants such as DeiT (Touvron et al., 2021), Swin Transformer (Liu et al., 2021), and Swin Transformer V2 (Liu et al., 2022) have achieved very good results for image classification.
For Video Recognition, early methods were feature-based (Klaser et al., 2008; Laptev, 2003; Wang et al., 2013). Later, the success of 2D CNNs (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; He et al., 2016; Tan & Le, 2019) on ImageNet (Deng et al., 2009) led to their application to video recognition (Karpathy et al., 2014; Ng et al., 2015; Simonyan & Zisserman, 2014). However, these methods lacked temporal modeling capabilities. The release of large-scale datasets such as Kinetics (Kay et al., 2017) prompted 3D CNN based methods (Carreira & Zisserman, 2017; Feichtenhofer et al., 2016; Tran et al., 2015). Since these were computationally expensive, various methods were proposed to mitigate the issue (Feichtenhofer, 2020; Sun et al., 2015; Szegedy et al., 2016; Tran et al., 2018; Xie et al., 2018; Li et al., 2020; Lin et al., 2019; Qiu et al., 2019; Feichtenhofer et al., 2019; Duan et al., 2020; Wang et al., 2021). When the ViT (Dosovitskiy et al., 2021) architecture became popular in image recognition, it seamlessly made its way into the video domain. Initial methods used Self-Attention in combination with CNNs (Wang et al., 2018, 2020b; Kondratyuk et al., 2021) while later works (Liu et al., 2022b; Arnab et al., 2021; Bertasius et al., 2021; Yan et al., 2022; Zhang et al., 2021; Patrick et al., 2021; Fan et al., 2021; Li et al., 2022a; Sharir et al., 2021) introduced pure Transformer based architectures. More recently, Video-FocalNets (Wasim et al., 2023) proposed a Focal Modulation (Yang et al., 2022) extension for videos, while Uniformer (Li et al., 2022b) proposed an efficient hybrid architecture for video recognition. Very recently, a key development in this area came with FlashAttention (Dao et al., 2022; Dao, 2023), which presents a hardware-aware implementation of the Attention algorithm that improves the efficiency of Attention-based models.
State-Space Models: Recently, State-Space Models (SSMs), such as the Structured State-Space Model S4 (Gu et al., 2022), have been presented as an alternative to Self-Attention (Vaswani et al., 2017) for efficient modeling of long sequences with linear complexity. Various variants building on the S4 architecture have also been proposed, including S5 (Smith et al., 2023), H3 (Fu et al., 2023), and GSS (Mehta et al., 2023). However, the original S4 (Gu et al., 2022) and its variants had a weakness compared to Self-Attention, mainly because they did not have any input dependencies. To mitigate this, Gu and Dao (2024) proposed the input-dependent state-space model Mamba alongside an efficient hardware-optimized parallel selective scan mechanism (S6). Various works have been proposed in computer vision applying Mamba to different downstream domains. Two variants were initially proposed for image classification: Vim (Zhu et al., 2024) and VMamba (Liu et al., 2024). Vim proposed an isotropic architecture with a bi-directional scanning variant of Mamba (Gu & Dao, 2024) for effectively scanning the image token sequence. In contrast, VMamba (Liu et al., 2024) proposed a hierarchical architecture with a four-directional scan across all four spatial dimensions. Subsequently, other variants such as LocalVMamba (Huang et al., 2024) had a Swin (Liu et al., 2021) style windowed scan while EfficientVMamba (Pei et al., 2025) proposed an atrous-selective scan to improve efficiency. The concurrent work GroupMamba (Shaker et al., 2025) proposed a parameter-efficient Modulated Group Mamba layer with channel grouping and distillation-based training. 
Furthermore, Mamba was also used in various applications in video understanding (Yang et al., 2024; Li et al., 2024; Chen et al., 2024), image segmentation (Liu et al., 2024; Ma et al., 2024; Ruan & Xiang, 2024; Gong et al., 2025), and various other tasks (Guo et al., 2024b; He et al., 2025; Wang et al., 2024; Guo et al., 2024a; Liang et al., 2024). SiMBA (Patro & Agneeswaran, 2024) uses the Fourier transform with non-linearities to model eigenvalues as negative real numbers in an attempt to improve the training. Similar methods have also been proposed for CNNs (Wang et al., 2020a) and Transformers (Xiao et al., 2021; Touvron et al., 2021). A complementary work to ours, VideoMamba (Li et al., 2024), proposes to use a distillation-based objective to stabilize the training of larger models. However, we show that a simple interleaving of Self-Attention layers within a Mamba-based model is enough to stabilize training for image and action recognition applications and improve robustness against high-frequency noise in the input. While prior works (Hatamizadeh & Kautz, 2025; Wang et al., 2024; Fei et al., 2024; Lenz et al., 2025) have explored hybrid Mamba–Transformer architectures, our contribution is distinct in both focus and scope. Specifically, we investigate the stability challenges that arise when scaling vision models, an aspect not addressed in these studies. For instance, JAMBA (Lenz et al., 2025) is an NLP-oriented work, while MambaVision (Hatamizadeh & Kautz, 2025), PoinTramba (Wang et al., 2024), MaTVLM (Li et al., 2025), and Dimba (Fei et al., 2024) do not analyze stability or evaluate model scaling in the image or video domain. PoinTramba has a substantially different design adapted for point clouds, inserting entire encoders based on either Mamba or Transformer architectures, unlike our interleaved design, among other differing components.
Similarly, Dimba is designed for the diffusion process, where attention layers are substituted with Mamba layers to reduce the computational demand. JAMBA is a hybrid model for NLP, and MaTVLM is specifically made for VLM architectures while still employing distillation in design. Furthermore, MambaVision is a hierarchical model with convolutions as well as attention included in the design, unlike our isotropic design and strictly Mamba-Attention interleaved architecture. None of these works discusses the stability aspects of their designs. Our work is, to the best of our knowledge, the first to examine stability in large-scale vision models and to propose a hybrid design as a promising solution in this context.
Limitations of Mamba-based Networks for Visual Recognition
Although Mamba-based networks have shown state-of-the-art performance for image classification (Li et al., 2024; Zhu et al., 2024) and action recognition (Li et al., 2024), their training is unstable, which limits the scalability of these architectures. For instance, VideoMamba (Li et al., 2024) uses a distillation technique to improve training stability and performance. Since the proposed self-distillation technique requires training a smaller model first, it is a cumbersome approach that increases the training cost.
Before we propose our solution to the scalability problem in Section 4, we analyze the behavior of pure Mamba-based visual architectures in more detail. We focus on VideoMamba (Li et al., 2024) since it is the largest architecture and the only one that can be applied to video data. VideoMamba trains its tiny and small models with 7M and 25M parameters, respectively, in a conventional setting. However, distillation is used as soon as the model is scaled up to the middle (75M parameters) and base (98M parameters) variants. The method uses the smaller model as the teacher for the larger middle and base models. This is a departure from general knowledge distillation, where a larger complex model is distilled into a smaller student model (Gou et al., 2020). This reversal suggests that the purpose of distillation is not merely to transfer knowledge from a simpler model to a complex one but to stabilize the learning process of the middle and base models. As shown in Fig 1, the architecture cannot be scaled beyond 25M parameters without distillation, i.e., the accuracy does not increase further. While distillation improves the accuracy, it does not address the scaling issue since the base model is not better than the middle model. To better understand the impact of distillation on the training, we trained VideoMamba's middle variant with and without distillation. The training curves shown in Figure 3 indicate the presence of instabilities without distillation. We also present, in Figure 3, the loss curve for our StableMamba, which converges stably without distillation. To further validate our findings regarding training instabilities, we conducted additional experiments using GroupMamba (Shaker et al., 2025), a recently published architecture that exhibits similar stability issues in a different manifestation: Shaker et al. (2025) demonstrated that training instabilities manifest as high variance in loss trajectories.
Without distillation regularization, training runs exhibit substantial variance and slower convergence, while incorporating a distillation loss significantly reduces variance and accelerates convergence, as shown in Fig 3.
Fig. 3.
a Loss curves obtained from training a VideoMamba with and without distillation against StableMamba. b GroupMamba-S trained with and without distillation as well as GroupMamba trained with our proposed method termed StableGroupMamba. Please see Section 5.1 for more details
Furthermore, we implemented our proposed stabilization method on GroupMamba, denoted by StableGroupMamba. Our approach achieves a stability comparable to that obtained through distillation regularization, providing convergent evidence for the effectiveness of our method. This additional evidence reinforces our assertion that distillation becomes necessary to mitigate or resolve these training instabilities and improve convergence, and that our proposed method can be used to address this issue without distillation.
Furthermore, in Figure 2, we compare the behavior of VideoMamba (Li et al., 2024) with ViT-B16 (Dosovitskiy et al., 2021) under an increasing amount of Gaussian blurring in the input image during inference. For this, we use the images from the ImageNet-C (Hendrycks & Dietterich, 2019) benchmark, which evaluates the robustness of networks to common corruptions like Gaussian blur. As shown in Figure 2(a), VideoMamba (Li et al., 2024) suffers more than the vision Transformer from high intensities of Gaussian blurring. The better robustness of ViT-B16 can be explained by the fact that Transformers tend to focus on lower frequencies in the input image (Naseer et al., 2021; Park & Kim, 2022). This observation is further supported by another experiment that examines the behavior of networks under JPEG compression corruption. JPEG compression primarily removes high frequencies as the compression rate increases; although it also introduces secondary compression artifacts, the removal of higher frequencies remains the dominant effect. Fig 2(b) shows that VideoMamba is less robust to corruptions of higher frequencies, and addressing this challenge is an important contribution of this paper.
To further quantify the sensitivity of Mamba-based architectures to high-frequency components, we conduct a systematic frequency amplification analysis. Specifically, we transform the input images into the frequency domain using a Fast Fourier Transform (FFT) and isolate 20% of the highest frequency coefficients. These components are then scaled by a multiplicative factor to amplify their presence. After performing the inverse FFT, we obtain modified images with enhanced high-frequency content while preserving the overall content of the image. To provide a qualitative reference for the magnitude of these spectral corruptions, Figure 4 illustrates the resulting visual artifacts for a representative sample across amplification multipliers of 0 (original image), 10, and 30. We then evaluate the impact of this amplification by computing the cosine similarity between the latent tokens produced by the original image and those produced by its frequency-amplified counterpart at various multiplicative factors and various depths of the network. A low cosine similarity indicates that the internal representation of a network is significantly altered during the forward pass through the network.
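The amplification procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions (the function names and the radial-quantile cutoff for "highest 20% of frequencies" are our choices, not the paper's exact implementation), and the cosine similarity helper is shown on raw arrays, whereas the analysis in the paper compares latent tokens:

```python
import numpy as np

def amplify_high_frequencies(image, fraction=0.2, factor=10.0):
    """Amplify the highest-frequency coefficients of a 2D (grayscale) image.

    `fraction` selects the share of coefficients (by radial frequency)
    treated as "high"; `factor` is the multiplicative amplification.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    # Radial distance of each coefficient from the spectrum center (DC).
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    # The top `fraction` of coefficients by radius are the highest frequencies.
    cutoff = np.quantile(radius, 1.0 - fraction)
    spectrum[radius >= cutoff] *= factor
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

def cosine_similarity(a, b):
    """Cosine similarity between two flattened arrays."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

With `factor=1.0` the image is unchanged up to floating-point error; larger factors progressively distort it while preserving its low-frequency content.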
Fig. 4.
Visualization of high-frequency spectral amplification. Examples of the input perturbations used in the sensitivity analysis: a the original input image; b and c represent images where 20% of the highest frequency components are amplified by factors of 10 and 30, respectively. Although the fine-grained spectral noise shows only minor degradation of the image quality, VideoMamba is much more sensitive to such noise compared to ViT
As illustrated in Figure 5, the Mamba architecture demonstrates a consistently lower cosine similarity compared to the Vision Transformer across layers 1, 7, and 10. While using a multiplier of 10 does not have a noticeable impact on the internal representation of the ViT-B16, it strongly changes the representation of VideoMamba. We hypothesize that large differences in the forward pass for identical images that only differ in the amplitude of high frequencies, as can be caused by different image scaling and compression settings, result in the high variation of the training loss of Mamba models shown in Fig 3(b). In Section 4.3, we will discuss this sensitivity from a theoretical perspective alongside the empirical findings.
Fig. 5.
Sensitivity analysis of internal representations to high-frequency spectral perturbations. We plot the average cosine similarity between original token embeddings and those subject to varying levels of high-frequency amplification (the highest 20% of the spectrum). As the frequency amplification multiplier increases, VideoMamba exhibits a significantly sharper decline in similarity compared to ViT-B/16. This trend is consistent across early (a), middle (b), and late (c) layers, providing empirical evidence of Mamba’s high susceptibility to high frequencies
The above-mentioned observations provide strong evidence that it is difficult to scale Mamba models. Using distillation with a smaller model is a workaround to address training instabilities for larger models since it penalizes the larger model for deviating from the smaller one and thus acts as a regularization constraint, but it does not resolve the scalability issue. Furthermore, Mamba models are less robust to common image corruptions than vision Transformers. We thus propose an efficient distillation-free solution that mitigates the scalability issue, including training stability issues for large models, and improves the robustness to common image corruptions. Our solution is motivated by the fact that vision Transformers suffer less from these issues, and we hypothesize that adding attention blocks to pure Mamba-based visual architectures resolves them. We evaluate this hypothesis in the subsequent sections.
StableMamba for Image Classification and Action Recognition
Before discussing the StableMamba architecture in Section 4.2, we briefly introduce state-space models in general.
State-Space Models
State-space models (SSMs) are inspired by continuous systems in which an input signal u(t) is mapped to a latent state h(t) before being mapped to an output signal y(t). Concretely, a linear ordinary differential equation describes the SSM model:
$$\dot{h}(t) = A\,h(t) + B\,u(t), \qquad y(t) = C\,h(t) \tag{1}$$
where $h(t)$ is the hidden state, $\dot{h}(t)$ is its first derivative, $u(t)$ is the input, and $y(t)$ is the output. $A$ is the evolution matrix, and $B$ and $C$ are the projection matrices of the system.
Discretization of State-Space Models: As mentioned before, Equation (1) is valid for continuous-time systems. To apply Equation (1) to a discretized input sequence instead of a continuous function $u(t)$, Equation (1) must be discretized using a step size $\Delta$ which describes the input time-step resolution. The standard discretization that follows Mamba (Gu & Dao, 2024) is the Zero-Order Hold (ZOH) discretization:
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B, \qquad h_t = \bar{A}\,h_{t-1} + \bar{B}\,u_t, \qquad y_t = C\,h_t \tag{2}$$
The difference between S4 (Gu et al., 2022) and Mamba (Gu & Dao, 2024) is the selective scan mechanism that conditions the parameters $\Delta$, $B$, and $C$ on the input.
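The ZOH discretization and the resulting recurrence can be illustrated with a short NumPy sketch. This assumes, as in Mamba in practice, a diagonal evolution matrix, so the matrix exponential reduces to an elementwise exponential; the names and shapes are our own simplification, not the paper's implementation:

```python
import numpy as np

def zoh_discretize_diagonal(A_diag, B, delta):
    """Zero-Order Hold (ZOH) discretization for a diagonal evolution matrix.

    A_diag: (N,) diagonal of the evolution matrix A
    B:      (N,) input projection
    delta:  scalar step size (input time-step resolution)
    """
    A_bar = np.exp(delta * A_diag)
    # (Delta A)^{-1} (exp(Delta A) - I) Delta B, evaluated elementwise:
    # the Delta factors cancel to (exp(delta * a) - 1) / a * b per element.
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, u):
    """Run the discrete recurrence h_t = A_bar*h_{t-1} + B_bar*u_t, y_t = C.h_t."""
    h = np.zeros_like(A_bar)
    ys = []
    for u_t in u:
        h = A_bar * h + B_bar * u_t
        ys.append(C @ h)
    return np.array(ys)
```

For a stable system (negative entries in `A_diag`) driven by a constant input, the outputs converge monotonically toward a steady state, which is a quick sanity check of the discretization.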
StableMamba
VideoMamba (Li et al., 2024) uses bi-directional Mamba layers introduced by VisionMamba (Zhu et al., 2024) and shown in Fig 6(d). A bi-directional Mamba block adapts the concept of bi-directional sequence modeling to vision-related tasks. It processes flattened visual token sequences simultaneously using forward and backward state-space models.
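The bi-directional processing can be sketched generically: run a causal scan over the flattened token sequence in both directions and combine the two results. In the sketch below, `scan_fn` is a stand-in for the learned selective scan (in practice the two directions use separate SSM parameters, which we omit here for brevity):

```python
import numpy as np

def bidirectional_scan(tokens, scan_fn):
    """Apply a causal scan over a (T, D) token sequence in both directions
    and combine the results, as in a bi-directional Mamba layer."""
    forward = scan_fn(tokens)
    # Reverse the sequence, scan causally, then flip the output back.
    backward = scan_fn(tokens[::-1])[::-1]
    return forward + backward
```

With a simple cumulative-sum stand-in, `scan_fn = lambda x: np.cumsum(x, axis=0)`, each output position aggregates context from both sides of the sequence.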
Fig. 6.
a The overall architecture of the StableMamba model. b Anatomy of the Transformer block. c Anatomy of the Mamba block. d Anatomy of bidirectional Mamba layer
Our architecture consists of stacked StableMamba blocks. Within each StableMamba block are N bi-directional Mamba blocks and A Transformer blocks, as shown in Fig 6(a). The purpose of the Transformer blocks is to stabilize the training and increase the robustness by shifting the focus back to lower frequencies after several bi-directional Mamba blocks. We will evaluate the impact of the number of Transformer blocks in each StableMamba block and the position of the Transformer block within the StableMamba block in Section 5. We now describe the two blocks in more detail.
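The interleaving schedule can be sketched as follows, using the 1:7 Transformer-to-Mamba ratio from Table 1. The exact placement of the Transformer block inside each group is one of the design choices ablated in Section 5; here we place it at the end of each group purely for illustration:

```python
def build_stablemamba_layout(depth, attn_ratio=8):
    """Return a block-type schedule with one Transformer ("attention") block
    for every `attn_ratio` blocks (a 1:7 Transformer-to-Mamba ratio when
    attn_ratio=8); the remaining blocks are bi-directional Mamba blocks."""
    layout = []
    for i in range(depth):
        # Place the attention block at the end of each group of `attn_ratio`.
        if (i + 1) % attn_ratio == 0:
            layout.append("attention")
        else:
            layout.append("mamba")
    return layout
```

For example, a depth of 16 yields two Transformer blocks and fourteen Mamba blocks.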
Transformer block: The Transformer block is detailed in Figure 6(b). Each Transformer block begins with a Root Mean Square (RMS) normalization layer applied to the input data. This is followed by a Self-Attention layer, where three learnable linear layers $W_Q$, $W_K$, and $W_V$ transform the input $x$ into queries ($Q$), keys ($K$), and values ($V$) such that $Q = x W_Q$, $K = x W_K$, and $V = x W_V$. The output of the Self-Attention layer is then calculated as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_q}}\right) V \tag{3}$$
where $d_q$ is the dimension of the query; furthermore, a skip connection is added to the output. Subsequently, another RMS normalization is applied, after which this output is fed to an MLP layer. This constitutes the entire Transformer block shown in Fig 6(b). The operations can be summarized as:
$$z_0 = \mathrm{PE}(x) + p, \qquad z'_l = \mathrm{MHSA}\bigl(\mathrm{RMS}(z_{l-1})\bigr) + z_{l-1}, \qquad z_l = \mathrm{MLP}\bigl(\mathrm{RMS}(z'_l)\bigr) + z'_l \tag{4}$$
where $x$ is the input to the Transformer block, $\mathrm{PE}$ is the convolutional patch embedding, and $p$ is the positional encoding as in (Dosovitskiy et al., 2021). $\mathrm{RMS}$ is the RMS norm layer and $\mathrm{MHSA}$ denotes the multi-head Self-Attention layer described in Equation (3). The $\mathrm{MLP}$ is defined by:
$$\mathrm{MLP}(z) = \mathrm{GELU}\bigl(z W_1 + b_1\bigr) W_2 + b_2 \tag{5}$$
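The Transformer block operations can be condensed into a small NumPy sketch. This is a single-head simplification with bias-free projections and ReLU standing in for the MLP activation; it illustrates the pre-norm block structure, not the paper's implementation:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Root Mean Square normalization over the feature dimension."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head Self-Attention: softmax(Q K^T / sqrt(d_q)) V."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_q = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_q)) @ V

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    """Pre-norm Transformer block: RMS norm, Self-Attention with a skip
    connection, RMS norm, then an MLP with a skip connection."""
    x = x + self_attention(rms_norm(x), Wq, Wk, Wv)
    x = x + np.maximum(0.0, rms_norm(x) @ W1) @ W2  # MLP; ReLU stands in for GELU
    return x
```

The sketch maps a (T, D) token sequence to a (T, D) output, mirroring the residual structure of the block.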
Mamba block: The Mamba block (Fig 6(c)) has the same structure as the Transformer block except that it uses a bi-directional Mamba layer instead of a Self-Attention layer. For brevity, we will refer to the bi-directional Mamba layer simply as the Mamba layer. The Mamba block performs the following operations:
$$z'_l = \mathrm{BiMamba}\bigl(\mathrm{RMS}(z_{l-1})\bigr) + z_{l-1}, \qquad z_l = \mathrm{MLP}\bigl(\mathrm{RMS}(z'_l)\bigr) + z'_l \tag{6}$$
Our Mamba block differs from VideoMamba (Li et al., 2024) in that we add an RMS normalization layer and an MLP layer inside the Mamba block.
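The Mamba block can be sketched by swapping the token mixer of the Transformer block; `mixer` below is a placeholder for the bi-directional Mamba layer, and ReLU again stands in for the MLP activation in this illustration:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Root Mean Square normalization over the feature dimension."""
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def mamba_block(x, mixer, W1, W2):
    """Sketch of the Mamba block: identical to the Transformer block except
    that the token mixer is a bi-directional Mamba layer (`mixer`) instead
    of Self-Attention."""
    x = x + mixer(rms_norm(x))
    x = x + np.maximum(0.0, rms_norm(x) @ W1) @ W2  # MLP; ReLU stands in for GELU
    return x
```

A bidirectional cumulative sum can serve as a toy `mixer` to exercise the block structure.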
The number of parameters of the network can be controlled by the depth of the network and the embedding dimension. We introduce five variants of our model: StableMamba-Tiny has 7M parameters, StableMamba-Small has 27M parameters, StableMamba-Middle has 76M parameters, StableMamba-Base has 101M parameters, and StableMamba-Large has 187M parameters. The complete list of hyperparameters for reproducibility purposes is provided in Table 1. We use 4 nodes with 4 A100 GPUs (40GB) each for training all of our StableMamba models.
Table 1.
Hyperparameters for StableMamba
StableMamba training recipe (T=Tiny, S=Small, M=Medium, B=Base, L=Large)

| Dataset | IN1K | K400 | SSv2 |
|---|---|---|---|
| Epochs | 300 | 70(T), 50(S,M,B) | 35(T), 30(S,M) |
| Batch size | 128 | 32(T)/16(S,M,B) | 32(T)/16(S,M) |
| Optimizer | AdamW | AdamW | AdamW |
| Optimizer momentum | |||
| Learning rate | 5e-4 | 4e-4(T,S), 2e-4(M,B) | 4e-4 |
| Minimum learning rate | 1e-5(T,S,M), 5e-6(B,L) | 1e-6 | 1e-6 |
| Scheduler | cosine | cosine | cosine |
| Weight decay | 0.1(T), 0.05(S,M,B,L) | 0.1(T), 0.05(S,M,B) | 0.1(T), 0.05(S,M) |
| Warmup epochs | 5 (T,S), 30(M), 20(B,L) | 5 | 5 |
| Trans. to Mamba blocks | 1 : 7 | 1 : 7 | 1 : 7 |
| Label smoothing | 0.1 | 0.1 | 0.1 |
| Drop path | 0(T), 0.15(S), 0.5(M,B,L) | 0.1(T), 0.35(S), 0.8(M,B) | 0.1(T), 0.35(S), 0.8(M) |
| Repeated aug. | Yes(T), No(S,M,B,L) | 2 | 2 |
| Input size | |||
| Patch size | 16 | 16 | 16 |
| Rand. aug. | (7, 0.25)(T), (9, 0.5)(S,M,B,L) | (7, 0.25)(T), (9, 0.5)(S,M,B) | (7, 0.25)(T), (9, 0.5)(S,M) |
| Mixup prob. | 0.8 | 0.8 | 0.8 |
| Cutmix prob. | 1.0 | 1.0 | 1.0 |
Theoretical Perspective for Mamba’s Sensitivity
The sensitivity of Mamba to high-frequency perturbations, as shown in Section 3, can be understood through the structural differences between selective SSMs and attention blocks. Han et al. (2024) showed that the selective SSM from Equation (1) can be rewritten as Equation (7), a formulation comparable to linear attention, Equation (8):
$$h_t = \bigl(\bar{A}_t \mathbf{1}^{\top}\bigr) \odot h_{t-1} + B_t \bigl(\Delta_t u_t\bigr), \qquad y_t = C_t h_t \tag{7}$$
$$S_t = S_{t-1} + k_t^{\top} v_t, \qquad z_t = z_{t-1} + k_t^{\top}, \qquad y_t = \frac{q_t S_t}{q_t z_t} \tag{8}$$
In Equation (7), $t$ indexes the sequence position (time step), $u_t$ denotes the input at step $t$, $h_t$ is the selective SSM hidden state, and $y_t$ is the corresponding output. The operator $\odot$ denotes the Hadamard product, and $\mathbf{1}$ denotes all-ones vectors (or tensors) of appropriate shape. Since the matrix $\bar{A}$ is in practice a diagonal matrix, $\bar{A}_t$ is the vector of the elements of $\bar{A}$ on the diagonal. The quantity $\bar{A}_t$ is an input-dependent forget or decay gate applied to the previous state $h_{t-1}$, $\Delta_t$ is an (input-dependent) input gate applied to $u_t$, and $B_t$ and $C_t$ are (input-dependent) linear maps that project the gated input into the state space and the state into the output space, respectively. In Equation (8), $S_t$ denotes the (linear attention) running state or accumulator, $q_t$ is the query at step $t$, and $k_t$ and $v_t$ are the key and value at step $t$. The term $z_t$ denotes the (linear attention) normalizer accumulator, which often corresponds to an accumulated feature map of keys, and the denominator $q_t z_t$ provides an additional normalization.
As discussed in Han et al. (2024), a side-by-side comparison of Equation (7) and Equation (8) allows an interpretation of Mamba as a variant of single-head linear attention augmented with an input-dependent forget gate and an input gate, but without attention normalization. Han et al. (2024) have demonstrated empirically that removing the normalization by $q_t z_t$ in Equation (8) causes the standard deviation of token norms to grow dramatically across layers, allowing higher-magnitude tokens to dominate the feature map while suppressing others. When high frequencies are amplified as in Section 3, the magnitudes of certain tokens become disproportionately large. Without normalization, they overwhelm the hidden state in Mamba's recurrence, whereas softmax or linear attention's denominator re-normalizes and attenuates this effect. This leads to training instabilities and high sensitivity to high-frequency perturbations, as shown in Figure 7 using attention rollout (Abnar & Zuidema, 2020) for the Transformer and Mamba rollout (Ali et al., 2025) for the Mamba-based network. While DeiT's attention rollout retains meaningful localization even if high frequencies are strongly amplified, Mamba's rollout attribution maps degrade progressively with increasing high-frequency amplification, eventually becoming nearly random. Honarpisheh et al. (2025) also compare the generalization bounds of specific variants of selective SSMs and linear attention models. While the bound of linear attention scales polynomially with $B_u$, where $B_u$ is the upper bound of the input's norm, i.e., $\|u_t\| \le B_u$ for all $t$, the bound of the selective SSMs scales exponentially in $B_u$. This theoretical analysis also explains why selective SSMs are inherently more sensitive to input magnitudes than attention-based models.
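The role of the normalizer can be illustrated numerically. The toy sketch below runs the linear-attention recurrence with and without the denominator, using the tokens themselves as queries, keys, and values (a simplification we introduce for illustration, not the formulation of Han et al.); amplifying a single token inflates the unnormalized outputs far more than the normalized ones:

```python
import numpy as np

def linear_attention_outputs(tokens, normalize=True, eps=1e-6):
    """Recurrent linear attention over a (T, D) sequence, with q = k = v = token
    for illustration. With `normalize=False` the denominator q_t z_t is dropped,
    mimicking the missing attention normalization in Mamba's recurrence.
    Returns the norm of the output at each step."""
    d = tokens.shape[-1]
    S = np.zeros((d, d))   # running state / accumulator
    z = np.zeros(d)        # normalizer accumulator
    outputs = []
    for t in tokens:
        q = k = v = t
        S = S + np.outer(k, v)
        z = z + k
        y = q @ S
        if normalize:
            y = y / (q @ z + eps)
        outputs.append(np.linalg.norm(y))
    return np.array(outputs)
```

Scaling one token by a factor of 10 (as a stand-in for a high-frequency spike) blows up the unnormalized output at that step, while the normalized variant attenuates the effect.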
Fig. 7.
Attention rollout (Abnar & Zuidema, 2020) and Mamba rollout (Ali et al., 2025) under progressive high-frequency perturbations. Top: Input image with increasing high-frequency content. Middle: DeiT attention rollout remains consistently localized on the bird’s head across all perturbation levels. Bottom: Mamba rollout exhibits progressively degraded localization as high-frequency content increases, indicating reduced robustness to high-frequency perturbations
Results
We evaluate our model for image classification on ImageNet-1K (IN1K) (Deng et al., 2009) and for video recognition on Kinetics-400 (K400) (Kay et al., 2017) and Something-Something-v2 (SSv2) (Goyal et al., 2017). For evaluating the robustness to various common corruptions, we use the ImageNet-C (IN-C) (Hendrycks & Dietterich, 2019) benchmark. Note that ImageNet-C is only used for testing, but not for training.
Evaluation on ImageNet-1K
We use the IN1K (Deng et al., 2009) dataset for pre-training our models. IN1K contains 1.28M training and 50k validation images for 1000 categories. The models pre-trained on IN1K are used as an initializing point for fine-tuning on the other datasets.
Evaluation Setup: We train our models for 300 epochs on IN1K, using the AdamW optimizer (Loshchilov & Hutter, 2017) with a learning rate of 5e-4, weight decay of 0.1 for the tiny model and 0.05 for the other models, a batch size of 128 per GPU, input image resolution of 224, and a patch size of 16. We set the ratio of Transformer blocks to Mamba blocks to 1:7 for our baseline models. We use 4 nodes with 4 A100 GPUs (40GB) each for training. We do not use any automatic mixed precision. For a fair comparison, we also train our models with and without distillation to gauge the effect of distillation on the overall training scheme and architecture. Following VideoMamba (Li et al., 2024), we use the spatial-first bidirectional scan for images. The complete set of hyperparameters is provided in Table 1.
To provide a comprehensive evaluation, we conducted additional experiments comparing our approach with the GroupMamba architecture (Shaker et al., 2025). The implementation required modifying the third-stage embedding dimension from 348 to 384 to ensure compatibility with multi-head attention mechanisms. This architectural adjustment increased the parameter count to 39M for both distilled and non-distilled variants. We applied our method to GroupMamba by systematically replacing alternating Visual Single Selective Scanning (VSSS) blocks with attention blocks while preserving the original interleaving pattern and ratio, creating StableGroupMamba. The resulting architecture exhibits a modest parameter increase (39.3M vs. 39.2M) compared to the baseline because multi-head attention requires more parameters than the 2D selective scan operations.
Note that the distillation strategies employed by GroupMamba and VideoMamba differ substantially in their teacher model selection. GroupMamba utilizes a significantly larger RegNet-Y architecture as the teacher model, whereas VideoMamba employs smaller models from within their own architectural family. This fundamental difference in teacher model capacity accounts for the more pronounced performance improvements observed in GroupMamba experiments, as larger teacher models typically provide richer supervisory signals during knowledge distillation.
Results: We present results for StableMamba on the IN1K dataset together with other comparable methods in Table 2. We train our method with and without distillation to show the impact of distillation on the accuracy. We first compare the results without distillation. StableMamba outperforms the current state-of-the-art isotropic visual SSM models (ViM and VideoMamba) on IN1K for all model sizes. Compared to VideoMamba, the improvement (+1.7) of StableMamba is largest for the model M, which is the largest model of VideoMamba that can be trained without distillation. Note that an improvement of +1.7 on IN1K is substantial. The improvements compared to VideoMamba are visualized by the solid lines in Figure 1, which show the lack of scalability of VideoMamba. If we compare VideoMamba and StableMamba with distillation, we observe that distillation improves the accuracy for both architectures, but StableMamba still outperforms VideoMamba. The accuracy of StableMamba-M† is 0.7 higher than that of VideoMamba-M†. It is interesting to note that StableMamba-B without distillation even outperforms VideoMamba-B† with distillation by 1.2. The trend continues with StableMamba-L reaching 84.6% top-1 accuracy against 83.9% for StableMamba-B, i.e., an improvement of +0.7. Most important, however, is that StableMamba can be scaled up and does not need any distillation, as shown in Figure 1.
Table 2.
Performance comparison on ImageNet-1K: We report the performance of our proposed models with state-of-the-art Mamba-based models and popular convolution-based and Transformer-based models on the ImageNet-1K (Deng et al., 2009) validation set. Our proposed models outperform the Mamba-based models. † represents the results using distillation. ‘iso.’ means isotropic
| Type | Model | iso. | Image Size | #Params (M) | FLOPs (G) | IN1K Top-1% |
|---|---|---|---|---|---|---|
| CNN | ConvNeXt-T (Liu et al., 2022a) | ✗ | 224 | 29 | 4.5 | 82.1 |
| ConvNeXt-S (Liu et al., 2022a) | ✗ | 224 | 50 | 8.7 | 83.1 | |
| ConvNeXt-B (Liu et al., 2022a) | ✗ | 224 | 89 | 15.4 | 83.8 | |
| CNN+ SSM. | VMamba-T (Liu et al., 2024) | ✗ | 224 | 31 | 4.9 | 82.2 |
| VMamba-S (Liu et al., 2024) | ✗ | 224 | 50 | 8.7 | 83.5 | |
| VMamba-B (Liu et al., 2024) | ✗ | 224 | 89 | 15.4 | 83.7 | |
| Trans. | Swin-T (Liu et al., 2021) | ✗ | 224 | 28 | 4.6 | 81.3 |
| Swin-S (Liu et al., 2021) | ✗ | 224 | 50 | 8.7 | 83.0 | |
| Swin-B (Liu et al., 2021) | ✗ | 224 | 88 | 15.4 | 83.5 | |
| DeiT-T (Touvron et al., 2021) | ✓ | 224 | 6 | 1.3 | 72.2 | |
| DeiT-S (Touvron et al., 2021) | ✓ | 224 | 22 | 4.6 | 79.8 | |
| DeiT-B (Touvron et al., 2021) | ✓ | 224 | 87 | 17.6 | 81.8 | |
| SSM | ViM-T (Zhu et al., 2024) | ✓ | 224 | 7 | 1.1 | 76.1 |
| ViM-S (Zhu et al., 2024) | ✓ | 224 | 26 | 4.3 | 80.5 | |
| VideoMamba-T (Li et al., 2024) | ✓ | 224 | 7 | 1.1 | 76.9 | |
| VideoMamba-S (Li et al., 2024) | ✓ | 224 | 26 | 4.3 | 81.2 | |
| VideoMamba-M (Li et al., 2024) | ✓ | 224 | 74 | 12.7 | 81.4 | |
| VideoMamba-M† (Li et al., 2024) | ✓ | 224 | 74 | 12.7 | 82.8 | |
| VideoMamba-B† (Li et al., 2024) | ✓ | 224 | 98 | 16.9 | 82.7 | |
| StableMamba-T | ✓ | 224 | 7 | 1.2 | 77.4 | |
| StableMamba-S | ✓ | 224 | 27 | 4.4 | 81.5 | |
| StableMamba-M | ✓ | 224 | 76 | 12.9 | 83.1 | |
| StableMamba-M† | ✓ | 224 | 76 | 12.9 | 83.5 | |
| StableMamba-B | ✓ | 224 | 101 | 17.1 | 83.9 | |
| StableMamba-B† | ✓ | 224 | 101 | 17.1 | 84.1 | |
| StableMamba-L | ✓ | 224 | 187 | 33.7 | 84.6 | |
| GroupMamba-S | ✓ | 224 | 39.2 | 8.1 | 83.2 | |
| GroupMamba-S† | ✓ | 224 | 39.2 | 8.1 | 84.0 | |
| StableGroupMamba-S | ✓ | 224 | 39.3 | 8.3 | 83.8 | |
| StableGroupMamba-S† | ✓ | 224 | 39.3 | 8.3 | 84.2 |
For further evidence of the effectiveness of our technique, we train GroupMamba-S (Shaker et al., 2025) with the attention-interleaved layers, denoted by StableGroupMamba-S. The results in Table 2 show that it performs better than GroupMamba-S without distillation and converges better, as shown in Figure 3b.
Evaluation on Video Recognition
After pre-training on IN1K, we fine-tune the models on two large-scale datasets. The first dataset, K400 (Kay et al., 2017), includes approximately 240,000 training videos and 19,000 validation videos, each about 10 seconds long, spanning 400 different human action classes. The second dataset, SSv2 (Goyal et al., 2017), consists of around 220,000 videos: 168,000 for training, 24,000 for validation, and 27,000 for testing, covering 174 different classes.
Evaluation Setup: For fine-tuning, we use a batch size of 32 for the tiny model and a batch size of 16 for the other variants due to the GPU memory limit. We set the number of linear warm-up epochs to 5 and the total number of epochs to 70 for K400 and 35 for SSv2 as in (Li et al., 2024) for the tiny model, and to 50 and 30, respectively, for the other models. We use AdamW as the optimizer and the same spatiotemporal scan as VideoMamba (Li et al., 2024). The complete list of hyperparameters for reproducibility is provided in Table 1.
Results: StableMamba demonstrates superior performance in downstream video recognition tasks compared to VideoMamba, which is the only Mamba architecture that can be applied to videos. On the K400 dataset in Table 3, the tiny and small variants of StableMamba outperform their VideoMamba counterparts without distillation. Distillation improves the accuracy for the middle models, but even with distillation, StableMamba-M† improves the accuracy of VideoMamba-M† by 0.6, which is a substantial improvement on this dataset. StableMamba-B extends this trend further, achieving a 0.6 improvement over StableMamba-M and a 0.3 improvement over StableMamba-M†, establishing new state-of-the-art performance on this benchmark. The results on the SSv2 dataset shown in Table 4 are similar, but the improvements are even larger. StableMamba-M† improves on the accuracy of VideoMamba-M† by 0.8.
Table 3.
Comparison with state-of-the-art methods on Kinetics-400 (Kay et al., 2017). † represents initialization with ImageNet-1K pretraining using distillation
| Arch. | Model | P.T. | Input Size | #Params (M) | FLOPs (G) | K400 Top-1% |
|---|---|---|---|---|---|---|
| CNN | SlowFast (Feichtenhofer et al., 2019) | - | 80×224² | 60 | 234×3×10 | 79.8 |
| X3D-M (Feichtenhofer, 2020) | - | 16×224² | 4 | 6×3×10 | 76.0 | |
| X3D-XL (Feichtenhofer, 2020) | - | 16×312² | 20 | 194×3×10 | 80.4 | |
| CNN+ Trans. | MViTv1-B (Fan et al., 2021) | – | 32×224² | 37 | 70×1×5 | 80.2 |
| MViTv2-S (Li et al., 2022a) | - | 16×224² | 35 | 64×1×5 | 81.0 | |
| UniFormer-S (Li et al., 2022b) | IN1K | 16×224² | 21 | 42×1×4 | 80.8 | |
| UniFormer-B (Li et al., 2022b) | IN1K | 16×224² | 50 | 97×1×4 | 82.0 | |
| UniFormer-B (Li et al., 2022b) | IN1K | 32×224² | 50 | 259×3×4 | 83.0 | |
| Trans. | Swin-T (Liu et al., 2022b) | IN1K | 32×224² | 28 | 88×3×4 | 78.8 |
| Swin-B (Liu et al., 2022b) | IN1K | 32×224² | 88 | 88×3×4 | 80.6 | |
| Swin-B (Liu et al., 2022b) | IN21K | 32×224² | 88 | 282×3×4 | 82.7 | |
| STAM (Sharir et al., 2021) | IN21K | 64×224² | 121 | 1040×1×1 | 79.2 | |
| TimeSformer-L (Bertasius et al., 2021) | IN21K | 96×224² | 121 | 2380×3×1 | 80.7 | |
| ViViT-L (Arnab et al., 2021) | IN21K | 16×224² | 311 | 3992×3×4 | 81.3 | |
| Mformer-HR (Patrick et al., 2021) | IN21K | 16×336² | 311 | 959×3×10 | 81.1 | |
| SSM | VideoMamba-T (Li et al., 2024) | IN1K | 16×224² | 7 | 17×3×4 | 78.1 |
| VideoMamba-S (Li et al., 2024) | IN1K | 16×224² | 26 | 68×3×4 | 80.8 | |
| VideoMamba-M† (Li et al., 2024) | IN1K | 16×224² | 74 | 202×3×4 | 81.9 | |
| StableMamba-T | IN1K | 16×224² | 7 | 19×3×4 | 78.6 | |
| StableMamba-S | IN1K | 16×224² | 27 | 70×3×4 | 81.2 | |
| StableMamba-M | IN1K | 16×224² | 76 | 206×3×4 | 82.2 | |
| StableMamba-M† | IN1K | 16×224² | 76 | 206×3×4 | 82.5 | |
| StableMamba-B | IN1K | 16×224² | 101 | 303×3×4 | 82.8 |
Table 4.
Comparison with state-of-the-art methods on the Something-Something-v2 (Goyal et al., 2017) dataset. † represents initialization with ImageNet-1K pretraining using distillation. Network input sizes are the same as those mentioned for K400
| Arch. | Model | P.T. | #Params (M) | FLOPs (G) | SSv2 Top-1% |
|---|---|---|---|---|---|
| CNN | SlowFast (Feichtenhofer et al., 2019) | K400 | 53 | 106×3×1 | 63.1 |
| CT-Net (Li et al., 2020) | IN1K | 21 | 75×1×1 | 64.5 | |
| TDN (Wang et al., 2021) | IN1K | 26 | 75×1×1 | 65.3 | |
| CNN+ Trans. | MViTv1-B (Fan et al., 2021) | K400 | 37 | 71×3×1 | 64.7 |
| MViTv1-B (Fan et al., 2021) | K400 | 37 | 170×3×1 | 67.1 | |
| MViTv2-S (Li et al., 2022a) | K400 | 35 | 65×3×1 | 68.2 | |
| MViTv2-B (Li et al., 2022a) | K400 | 51 | 225×3×1 | 70.5 | |
| UniFormer-S (Li et al., 2022b) | IN1K+K400 | 21 | 42×3×1 | 67.7 | |
| UniFormer-B (Li et al., 2022b) | IN1K+K400 | 50 | 97×3×1 | 70.4 | |
| Trans. | Swin-B (Liu et al., 2022b) | K400 | 89 | 88×3×1 | 69.6 |
| ViViT-L (Arnab et al., 2021) | IN21K+K400 | 311 | 3992×3×4 | 65.4 | |
| Mformer-HR (Patrick et al., 2021) | IN21K+K400 | 311 | 1185×3×1 | 68.1 | |
| TimeSformer-HR (Bertasius et al., 2021) | IN21K | 121 | 1703×3×1 | 62.5 | |
| SSM | VideoMamba-T (Li et al., 2024) | IN1K | 7 | 9×3×2 | 65.1 |
| VideoMamba-S (Li et al., 2024) | IN1K | 26 | 34×3×2 | 66.6 | |
| VideoMamba-M† (Li et al., 2024) | IN1K | 74 | 101×3×4 | 67.3 | |
| StableMamba-T | IN1K | 7 | 10×3×2 | 65.7 | |
| StableMamba-S | IN1K | 27 | 35×3×2 | 67.3 | |
| StableMamba-M | IN1K | 76 | 103×3×4 | 67.8 | |
| StableMamba-M† | IN1K | 76 | 103×3×4 | 68.1 |
Evaluation on ImageNet-C
IN-C (Hendrycks & Dietterich, 2019) is a benchmark for evaluating the robustness of neural networks to images with common corruptions like JPEG compression. It includes 19 common types of image corruption at 5 different intensity levels. We test our network on this benchmark to assess the robustness introduced by attention layers.
Results: We present results for Gaussian blurring and JPEG compression corruption for StableMamba-M in comparison with VideoMamba-M, ViT-B/16, and ResNet-50 in Figure 2. We see that StableMamba-M (blue) outperforms VideoMamba-M (yellow) for all levels of corruption. The gap widens as the intensity of corruption increases. StableMamba behaves similarly to or even slightly better than the pure Attention-based architecture ViT-B/16 and is more robust than ResNet-50, in particular for the highly relevant JPEG compression setting.
We also report the results across all corruptions in Table 5. The Mean Corruption Error (mCE) on the ImageNet-C dataset, reported relative to AlexNet, shows the robustness of the various models to common image corruptions. Our proposed StableMamba-M achieves an mCE of 50.5%, competitive with DeiT-B at 50.4%. Notably, StableMamba-M outperforms ViT-B/16 and VideoMamba-M, which have mCEs of 53.7% and 51.6%, respectively, highlighting its improved robustness. This comparison underscores StableMamba's effectiveness in enhancing model stability and corruption resistance over existing models like VideoMamba.
Table 5.
Mean Corruption Error (mCE) on the ImageNet-C (Hendrycks & Dietterich, 2019) dataset across all 19 corruptions. mCE is reported relative to AlexNet (Krizhevsky et al., 2012) errors on ImageNet-C
| Model | Error on Clean | Mean Corruption Error (mCE) |
|---|---|---|
| AlexNet | 43.48% | 100.0% |
| SqueezeNet1.1 | 41.82% | 104.4% |
| VGG11 | 30.98% | 93.5% |
| VGG19 | 27.62% | 88.9% |
| VGG19BN | 25.78% | 81.6% |
| DenseNet121 | 25.57% | 73.4% |
| DenseNet169 | 24.40% | 69.4% |
| DenseNet201 | 23.10% | 68.4% |
| DenseNet161 | 22.86% | 66.4% |
| CondenseNet4 | 26.25% | 80.8% |
| CondenseNet8 | 28.93% | 84.6% |
| ResNet18 | 30.24% | 84.7% |
| ResNet34 | 26.69% | 77.9% |
| ResNet50 | 23.87% | 76.7% |
| ResNet101 | 22.63% | 70.4% |
| ResNet152 | 21.69% | 69.3% |
| ResNeXt50 | 22.89% | 68.2% |
| ResNeXt101 | 21.81% | 63.6% |
| ResNeXt101_64 | 21.04% | 62.2% |
| ViT-B/16 | 22.10% | 53.7% |
| DeiT-B | 18.20% | 50.4% |
| VideoMamba-M | 18.60% | 51.6% |
| StableMamba-M | 16.90% | 50.5% |
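The mCE values in Table 5 follow the standard Hendrycks & Dietterich (2019) definition, which can be computed as in the minimal sketch below; input dictionaries map each corruption type to its per-severity top-1 errors.

```python
def mean_corruption_error(model_err, alexnet_err):
    """Mean Corruption Error relative to AlexNet: for each corruption
    type c, sum the model's top-1 error over the five severity levels,
    normalize by AlexNet's summed error for c, then average the
    resulting ratios over all corruption types."""
    ratios = []
    for c, by_severity in model_err.items():
        model_sum = sum(by_severity.values())
        alex_sum = sum(alexnet_err[c].values())
        ratios.append(model_sum / alex_sum)
    # reported as a percentage, so AlexNet itself scores 100.0
    return 100.0 * sum(ratios) / len(ratios)
```

By construction, a model that halves AlexNet's error on every corruption and severity obtains an mCE of 50.0%.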
Ablation Studies
Position of Transformer Blocks: In Figure 6(a), the Transformer block is placed in the middle of the StableMamba blocks. This position results from our analysis of the impact of the location of the Transformer block. We conducted three experiments each for StableMamba-T and StableMamba-S, totaling six experiments, to determine the optimal position for the Transformer block. We tested placing the Transformer block at the start, middle, and end of the StableMamba blocks and evaluated the performance on the IN1K dataset. As shown in Figure 8(a), the performance of StableMamba is not highly sensitive to the Transformer's position for either the tiny or the small model. However, there is a slight improvement when the Transformer block is in the middle. Therefore, we use the middle position as the default in our StableMamba architecture.
Fig. 8.
(a) Impact of the position of the Transformer block within StableMamba. (b) Impact of the ratio of Transformer blocks to Mamba blocks
Number of Transformer Blocks: Similar to the position of Transformer blocks within each StableMamba block, the ratio of Transformer blocks to Mamba blocks is another design parameter of the StableMamba block. We interleave one Transformer block for every k Mamba blocks; for example, one Transformer block for every seven Mamba blocks. To evaluate the impact of the ratio, we conducted experiments varying the number of Mamba blocks per Transformer block. As shown in Figure 8(b), the performance on the IN1K dataset improves as the number of Mamba blocks per Transformer block increases, reaching optimal accuracy at a ratio of 1:7. Beyond this ratio, the performance decreases. Therefore, we set the design parameter to one Transformer block for every seven Mamba blocks in the StableMamba architecture.
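The resulting design rule, one Transformer block centered in every group of eight blocks, can be sketched as a block-type schedule. This is a structural sketch only: block types are strings standing in for the actual modules, and the handling of a trailing incomplete group is our assumption.

```python
def build_schedule(depth, ratio=7):
    """Block-type schedule for a StableMamba-style backbone: one
    Transformer block per `ratio` Mamba blocks, placed in the middle
    of each group of ratio+1 blocks (the 1:7 ratio and middle
    position that worked best in the ablations). Trailing blocks that
    do not fill a complete group stay Mamba blocks (an assumption)."""
    group = ratio + 1
    schedule = []
    for start in range(0, depth, group):
        size = min(group, depth - start)
        mid = group // 2
        for i in range(size):
            schedule.append("attn" if (size == group and i == mid) else "mamba")
    return schedule
```

For a 24-block backbone this yields three attention blocks, one centered in each group of eight.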
Impact of Context Length: Apart from the network architecture itself, it is interesting to investigate the network with different context lengths. To probe the suitability of our approach for long contexts, we perform additional experiments. First, we train StableMamba-T with a longer context for video classification, using 32 frames instead of the usual 16. Second, we train StableMamba with a larger resolution (448 instead of 224) to assess the effect on image classification as well. The results in Table 6 show that both StableMamba and VideoMamba benefit from the increased context length, which is a general strength of Mamba-based architectures. In all cases, StableMamba outperforms VideoMamba.
Table 6.
Impact of image resolution (top) and number of input frames (bottom) for StableMamba and VideoMamba
| Model | Context Length | Training Dataset | FLOPs (G) | Accuracy |
|---|---|---|---|---|
| VideoMamba-T | 224² | IN1K | 1.1 | 76.9% |
| StableMamba-T | 224² | IN1K | 1.2 | 77.4% |
| VideoMamba-T | 448² | IN1K | 4.3 | 79.3% |
| StableMamba-T | 448² | IN1K | 4.5 | 79.9% |
| VideoMamba-T | 16×224² | K400 | | 78.1% |
| StableMamba-T | 16×224² | K400 | | 78.6% |
| VideoMamba-T | 32×224² | K400 | | 78.8% |
| StableMamba-T | 32×224² | K400 | | 79.3% |
Impact of Dataset Size: Along with the context length, it is also interesting to ablate the data efficiency of the network. For this purpose, we conducted scaling experiments using 25%, 50%, 75%, and 100% of the training dataset while performing the validation on the full validation set. The results in Figure 9 show that our network consistently outperforms VideoMamba across all data regimes. While VideoMamba shows signs of saturation as the data volume increases, our architecture maintains higher accuracy at each threshold and continues to improve with additional data. The performance gap is already evident at the 25% level for the small and middle models and progressively widens with dataset scaling, confirming that our modifications enable better representation learning from limited samples without compromising the ability to leverage larger datasets.
Fig. 9.
(a) Dataset scaling experiment using 25%, 50%, 75%, and 100% of the training dataset while performing the validation on the full validation set
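A subset-sampling sketch for these scaling experiments is shown below. The stratified (per-class) scheme and the fixed seed are assumptions for illustration; the paper does not state how the 25/50/75% splits were drawn.

```python
import random
from collections import defaultdict

def stratified_subset(labels, fraction, seed=0):
    """Sample `fraction` of the training indices per class, so the
    reduced splits keep the class distribution of the full set
    (an assumed detail, not the paper's verified protocol)."""
    per_class = defaultdict(list)
    for idx, y in enumerate(labels):
        per_class[y].append(idx)
    rng = random.Random(seed)  # fixed seed for reproducible splits
    keep = []
    for y, idxs in per_class.items():
        rng.shuffle(idxs)
        k = max(1, round(fraction * len(idxs)))  # keep at least one sample
        keep.extend(idxs[:k])
    return sorted(keep)
```

The same subset indices would then be reused across all compared models so that every architecture sees identical data at each fraction.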
Impact on Throughput: To evaluate the computational impact of incorporating attention mechanisms with extended temporal receptive fields, we conducted a throughput analysis of the StableMamba-T architecture. We measured processing throughput (clips per second) across varying temporal sequence lengths, systematically evaluating performance from 8-frame to 128-frame video clips. As shown in Figure 10, the integration of transformer blocks within the StableMamba architecture introduces a negligible computational overhead. The throughput characteristics only marginally degrade compared to those of the baseline VideoMamba-T model across all evaluated sequence lengths, indicating that our stabilization approach maintains computational efficiency while providing enhanced training stability.
Fig. 10.
Throughput plot for VideoMamba and StableMamba. Even with attention blocks, StableMamba remains competitive on all temporal length sequences with a pure Mamba-based model. The Transformer-Tiny model is made by replacing all the Mamba blocks in the StableMamba-Tiny model with Transformer blocks
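The throughput measurement can be sketched as follows. Here `run_clip` is a stand-in for a model forward pass on one batch of clips, and the warm-up and iteration counts are illustrative rather than the paper's exact protocol.

```python
import time

def measure_throughput(run_clip, n_frames, batch=8, warmup=2, iters=10):
    """Clips-per-second for a given clip length. `run_clip(n_frames,
    batch)` is any callable standing in for the model forward pass
    (hypothetical; the paper times StableMamba-T on 8-128 frame clips)."""
    for _ in range(warmup):            # warm-up runs are discarded
        run_clip(n_frames, batch)
    start = time.perf_counter()
    for _ in range(iters):             # timed runs
        run_clip(n_frames, batch)
    elapsed = time.perf_counter() - start
    return iters * batch / max(elapsed, 1e-9)
```

Sweeping `n_frames` over 8, 16, ..., 128 and plotting the returned values reproduces the shape of the comparison in Figure 10.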
We also evaluate our approach at the architectural extreme where all Mamba blocks are replaced with Transformer blocks, effectively creating a pure Transformer model. As anticipated, this configuration substantially increases computational complexity and reduces inference throughput. Our interleaved design thus remains architecturally and computationally much closer to the original Mamba design while achieving better stability than pure Mamba architectures.
Conclusion
We have investigated and addressed the scalability challenge in large visual state-space models by proposing a straightforward interleaved design that scales effectively to a substantial number of parameters, consistently outperforming smaller models. Our ablation studies provide insights regarding optimal positioning, the number of attention layers in the architecture, and its robustness to common corruptions in the input like JPEG compression. Extensive experiments show that our method enables the scaling of Mamba-based models to over 180M parameters, significantly enhancing performance while also improving overall robustness. Evaluations on the K400 and SSv2 datasets for video recognition validate that our approach achieves state-of-the-art results.
Acknowledgements
The work has been supported by the Federal Ministry of Research, Technology and Space (BMFTR) under grant no. 16IS22094A WEST-AI and the ERC Consolidator Grant FORHUE (101044724). The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC). The authors also gratefully acknowledge the granted access to the Marvin cluster hosted by the University of Bonn.
Author Contributions
Conceptualization: Syed Talal Wasim; Methodology: Syed Talal Wasim; Formal analysis and investigation: Syed Talal Wasim, Hamid Suleman; Writing - original draft preparation: Hamid Suleman; Writing - review and editing: Juergen Gall, Muzammal Naseer; Funding acquisition: Juergen Gall; Supervision: Juergen Gall and Muzammal Naseer
Funding
Open Access funding enabled and organized by Projekt DEAL. The work has been supported by the Federal Ministry of Research, Technology and Space (BMFTR) under grant no.16IS22094A WEST-AI and the ERC Consolidator Grant FORHUE (101044724).
Data Availability
Not applicable
Declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Obtained from all authors
Materials availability
Not applicable
Code availability
Code will be made publicly available on Github.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Hamid Suleman, Syed Talal Wasim have contributed equally to this work.
References
- Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). Vivit: A video vision transformer. In: ICCV.
- Abnar, S., & Zuidema, W. (2020). Quantifying attention flow in transformers. In: ACL.
- Ali, A.A., Zimerman, I., & Wolf, L. (2025). The hidden attention of mamba models. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1516–1534. Association for Computational Linguistics, Vienna, Austria. https://aclanthology.org/2025.acl-long.76/
- Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? In: ICML.
- Chen, G., Huang, Y., Xu, J., Pei, B., Chen, Z., Li, Z., Wang, J., Li, K., Lu, T., & Wang, L. (2024). Video mamba suite: State space model as a versatile alternative for video understanding. arxiv preprint, arXiv:2403.09626
- Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR.
- Dao, T. (2023). Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint, arXiv:2307.08691
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR.
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In: CVPR.
- Dao, T., Fu, D., Ermon, S., Rudra, A., & Ré, C. (2022). Flashattention: Fast and memory-efficient exact attention with io-awareness. NeurIPS.
- Duan, H., Zhao, Y., Xiong, Y., Liu, W., & Lin, D. (2020). Omni-sourced webly-supervised learning for video recognition. In: ECCV.
- Fu, D.Y., Dao, T., Saab, K.K., Thomas, A.W., Rudra, A., & Ré, C. (2023). Hungry hungry hippos: Towards language modeling with state space models. In: ICLR.
- Feichtenhofer, C. (2020). X3d: Expanding architectures for efficient video recognition. In: CVPR.
- Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In: ICCV.
- Fei, Z., Fan, M., Yu, C., Li, D., Zhang, Y., & Huang, J. (2024). Dimba: Transformer-mamba diffusion models. arXiv preprint arXiv:2406.01159
- Feichtenhofer, C., Pinz, A., & Wildes, R. (2016). Spatiotemporal residual networks for video action recognition. In: NeurIPS.
- Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., & Feichtenhofer, C. (2021). Multiscale vision transformers. In: ICCV.
- Gu, A., & Dao, T. (2024). Mamba: Linear-time sequence modeling with selective state spaces. In: Conference on Language Modeling
- Goyal, R., Ebrahimi Kahou, S., Michalski, V., et al. (2017). The “something something” video database for learning and evaluating visual common sense. In: ICCV
- Gu, A., Goel, K., & Ré, C. (2022). Efficiently modeling long sequences with structured state spaces. In: ICLR.
- Gong, H., Kang, L., Wang, Y., Wan, X., & Li, H. (2025). nnMamba: 3d biomedical image segmentation, classification and landmark detection with state space model. In: International Symposium on Biomedical Imaging
- Guo, H., Li, J., Dai, T., Ouyang, Z., Ren, X., & Xia, S.-T. (2024). Mambair: A simple baseline for image restoration with state-space model. In: ECCV. Springer
- Guo, T., Wang, Y., & Meng, C. (2024). Mambamorph: a mamba-based backbone with contrastive feature learning for deformable mr-ct registration. CoRR
- Gou, J., Yu, B., Maybank, S.J., & Tao, D. (2020). Knowledge distillation: A survey. IJCV.
- Howard, A.G., Bo Chen, M.Z., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arxiv preprint, arXiv:1704.04861
- Honarpisheh, A., Bozdag, M., Camps, O., & Sznaier, M. (2025). Generalization error analysis for selective state-space models through the lens of attention. In: NeurIPS.
- He, X., Cao, K., Yan, K., Li, R., Xie, C., Zhang, J., & Zhou, M. (2025). Pan-mamba: Effective pan-sharpening with state space model. Information Fusion, 115, Article 102779.
- Hendrycks, D., & Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In: ICLR
- Hatamizadeh, A., & Kautz, J. (2025). Mambavision: A hybrid mamba-transformer vision backbone. In: CVPR
- Huang, T., Pei, X., You, S., Wang, F., Qian, C., & Xu, C. (2024). Localmamba: Visual state space model with windowed selective scan. In: ECCV. Springer
- Han, D., Wang, Z., Xia, Z., Han, Y., Pu, Y., Ge, C., Song, J., Song, S., Zheng, B., & Huang, G. (2024). Demystify mamba in vision: A linear attention perspective. In: NeurIPS.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: CVPR.
- Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al. (2017). The kinetics human action video dataset. arXiv preprint, arXiv:1705.06950
- Klaser, A., Marszałek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3d-gradients. In: BMVC.
- Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. In: NeurIPS.
- Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In: CVPR.
- Kondratyuk, D., Yuan, L., Li, Y., Zhang, L., Tan, M., Brown, M., & Gong, B. (2021). Movinets: Mobile video networks for efficient video recognition. In: CVPR.
- Liu, Z., et al. (2022). Swin Transformer V2: Scaling up capacity and resolution. In: CVPR.
- Lin, J., Gan, C., & Han, S. (2019). Tsm: Temporal shift module for efficient video understanding. In: ICCV.
- Loshchilov, I., & Hutter, F. (2017). Fixing weight decay regularization in adam. arxiv preprint, arXiv:1711.05101
- Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., & Wang, L. (2020). Tea: Temporal excitation and aggregation for action recognition. In: CVPR.
- Laptev, I. (2003). Space-time interest points. In: ICCV.
- Lenz, B., Lieber, O., Arazi, A., Bergman, A., Manevich, A., Peleg, B., Aviram, B., Almagor, C., Fridman, C., Padnos, D., Gissin, D., Jannai, D., Muhlgay, D., Zimberg, D., Gerber, E.M., Dolev, E., Krakovsky, E., Safahi, E., Schwartz, E., Cohen, G., Shachaf, G., Rozenblum, H., Bata, H., Blass, I., Magar, I., Dalmedigos, I., Osin, J., Fadlon, J., Rozman, M., Danos, M., Gokhman, M., Zusman, M., Gidron, N., Ratner, N., Gat, N., Rozen, N., Fried, O., Leshno, O., Antverg, O., Abend, O., Dagan, O., Cohavi, O., Alon, R., Belson, R., Cohen, R., Gilad, R., Glozman, R., Lev, S., Shalev-Shwartz, S., Meirom, S.H., Delbari, T., Ness, T., Asida, T., Gal, T.B., Braude, T., Pumerantz, U., Cohen, J., Belinkov, Y., Globerson, Y., Levy, Y.P., & Shoham, Y. (2025). Jamba: Hybrid transformer-mamba language models. In: ICLR.
- Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical vision transformer using shifted windows. In: ICCV.
- Li, Y., Liao, B., Liu, W., & Wang, X. (2025). Matvlm: Hybrid mamba-transformer for efficient vision-language modeling. In: ICCV.
- Li, K., Li, X., Wang, Y., Wang, J., & Qiao, Y. (2020). Ct-net: Channel tensorization network for video classification. In: ICLR.
- Li, K., Li, X., Wang, Y., He, Y., Wang, Y., Wang, L., & Qiao, Y. (2024). Videomamba: State space model for efficient video understanding. In: ECCV. Springer
- Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. In: CVPR.
- Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., & Hu, H. (2022). Video swin transformer. In: CVPR.
- Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., & Liu, Y. (2024). Vmamba: Visual state space model. In: NeurIPS.
- Li, Y., Wu, C.-Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., & Feichtenhofer, C. (2022). Mvitv2: Improved multiscale vision transformers for classification and detection. In: CVPR.
- Li, K., Wang, Y., Gao, P., Song, G., Liu, Y., Li, H., & Qiao, Y. (2022). Uniformer: Unified transformer for efficient spatiotemporal representation learning. In: ICLR.
- Liu, J., Yang, H., Zhou, H.-Y., Xi, Y., Yu, L., Yu, Y., Liang, Y., Shi, G., Zhang, S., Zheng, H., et al. (2024). Swin-umamba: Mamba-based unet with imagenet-based pretraining. In: MICCAI. Springer
- Liang, D., Zhou, X., Wang, X., Zhu, X., Xu, W., Zou, Z., Ye, X., & Bai, X. (2024). Pointmamba: A simple state space model for point cloud analysis. In: NeurIPS.
- Mehta, H., Gupta, A., Cutkosky, A., & Neyshabur, B. (2023). Long range language modeling via gated state spaces. In: ICLR.
- Ma, J., Li, F., & Wang, B. (2024). U-mamba: Enhancing long-range dependency for biomedical image segmentation. arxiv preprint, arXiv:2401.04722
- Ng, J.Y.-H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In: CVPR.
- Naseer, M.M., Ranasinghe, K., Khan, S.H., Hayat, M., Shahbaz Khan, F., & Yang, M.-H. (2021). Intriguing properties of vision transformers. In: NeurIPS.
- Patro, B.N., & Agneeswaran, V.S. (2024). Simba: Simplified mamba-based architecture for vision and multivariate time series. arxiv preprint, arXiv:2403.15360
- Patrick, M., Campbell, D., Asano, Y., Misra, I., Metze, F., Feichtenhofer, C., Vedaldi, A., & Henriques, J.F. (2021). Keeping your eye on the ball: Trajectory attention in video transformers. In: NeurIPS.
- Pei, X., Huang, T., & Xu, C. (2025). Efficientvmamba: Atrous selective scan for light weight visual mamba. In: AAAI.
- Park, N., & Kim, S. (2022) How do vision transformers work? In: ICLR.
- Qiu, Z., Yao, T., Ngo, C.-W., Tian, X., & Mei, T. (2019). Learning spatio-temporal representation with local and global diffusion. In: CVPR.
- Ruan, J., & Xiang, S. (2024). Vm-unet: Vision mamba unet for medical image segmentation. ACM Transactions on Multimedia Computing, Communications, and Applications.
- Sun, L., Jia, K., Yeung, D.-Y., & Shi, B.E. (2015). Human action recognition using factorized spatio-temporal convolutional networks. In: ICCV.
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In: CVPR.
- Sharir, G., Noy, A., & Zelnik-Manor, L. (2021). An image is worth 16x16 words, what is a video worth? arXiv preprint, arXiv:2103.13915
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In: CVPR.
- Shaker, A., Wasim, S.T., Khan, S., Gall, J., & Khan, F.S. (2025). Groupmamba: Efficient group-based visual state space model. In: CVPR.
- Smith, J.T., Warrington, A., & Linderman, S.W. (2023). Simplified state space layers for sequence modeling. In: ICLR.
- Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In: NeurIPS.
- Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. ICLR.
- Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In: ICCV.
- Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jegou, H. (2021). Training data-efficient image transformers & distillation through attention. In: ICML.
- Tan, M., & Le, Q.V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In: ICML.
- Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In: CVPR.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In: NeurIPS.
- Wang, Z., Chen, Z., Wu, Y., Zhao, Z., Zhou, L., & Xu, D. (2024). Pointramba: A hybrid transformer-mamba framework for point cloud analysis. arXiv preprint arXiv:2405.15463
- Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.S., & Xie, S. (2023). Convnext v2: Co-designing and scaling convnets with masked autoencoders. In: CVPR.
- Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In: CVPR.
- Wasim, S.T., Khattak, M.U., Naseer, M., Khan, S., Shah, M., & Khan, F.S. (2023). Video-focalnets: Spatio-temporal focal modulation for video action recognition. In: ICCV.
- Wang, H., Kläser, A., Schmid, C., & Liu, C.-L. (2013). Dense trajectories and motion boundary descriptors for action recognition. IJCV.
- Wang, L., Tong, Z., Ji, B., & Wu, G. (2021). TDN: Temporal difference networks for efficient action recognition. In: CVPR.
- Wang, C., Tsepa, O., Ma, J., & Wang, B. (2024). Graph-mamba: Towards long-range graph sequence modeling with selective state spaces. arXiv preprint arXiv:2402.00789
- Wang, H., Wu, X., Huang, Z., & Xing, E.P. (2020). High-frequency component helps explain the generalization of convolutional neural networks. In: CVPR.
- Wang, X., Xiong, X., Neumann, M., Piergiovanni, A., Ryoo, M.S., Angelova, A., Kitani, K.M., & Hua, W. (2020). Attentionnas: Spatiotemporal attention cell search for video classification. In: ECCV.
- Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In: ECCV.
- Xiao, T., Singh, M., Mintun, E., Darrell, T., Dollár, P., & Girshick, R. (2021). Early convolutions help transformers see better. In: NeurIPS.
- Yang, J., Li, C., Dai, X., Yuan, L., & Gao, J. (2022). Focal modulation networks. In: NeurIPS.
- Yan, S., Xiong, X., Arnab, A., Lu, Z., Zhang, M., Sun, C., & Schmid, C. (2022). Multiview transformers for video recognition. In: CVPR.
- Yang, Y., Xing, Z., & Zhu, L. (2024). Vivim: A video vision mamba for medical video object segmentation. arXiv preprint arXiv:2401.14168
- Zhang, Y., Li, X., Liu, C., Shuai, B., Zhu, Y., Brattoli, B., Chen, H., Marsic, I., & Tighe, J. (2021). Vidtr: Video transformer without convolutions. In: ICCV.
- Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., & Wang, X. (2024). Vision mamba: Efficient visual representation learning with bidirectional state space model. In: ICML.
Data Availability Statement
Not applicable