Scientific Reports, 2025 Jul 1; 15:22265. doi: 10.1038/s41598-025-07032-3

Two-stage Mamba-based diffusion model for image restoration

Lei Liu 1,2, Luan Ma 1, Shuai Wang 1,2, Jun Wang 3, Silas N Melo 4
PMCID: PMC12216552  PMID: 40595135

Abstract

Image restoration is fundamental in computer vision, aiming to recover high-quality images from degraded ones. Recently, models such as transformers and diffusion models have shown notable success in addressing this challenge. However, transformer-based methods face high computational costs due to quadratic complexity, while diffusion-based methods often produce suboptimal results due to inaccurate noise estimation. This study proposes Diff-Mamba, a two-stage adaptive Mamba-based diffusion model for image restoration. Diff-Mamba integrates the linear-complexity state space model (SSM, also known as Mamba) into image restoration, expanding its applicability to visual data generation. Diff-Mamba mainly consists of two parts: the diffusion state space model (DSSM) and the diffusion feedforward neural network (DFNN). DSSM combines Mamba’s high efficiency with the representational power of diffusion models, enhancing both inference and training. DFNN regulates the information flow, enabling each depthwise convolutional layer to focus on the details of the image and thus learn more effective local structures for image restoration. The study’s findings, verified through extensive experiments, indicate that Diff-Mamba outperforms both diffusion-based and transformer-based methods in image deraining, denoising, and deblurring, demonstrating competitive restoration performance on various commonly used datasets. Code is available at https://github.com/maluan-ml/Diff-Mamba.

Keywords: Diffusion Mamba, Image deblurring, Image denoising, Image deraining, Image restoration

Subject terms: Engineering, Electrical and electronic engineering

Introduction

Image restoration is the process of recovering low-quality images affected by factors such as rain, noise, and blurring. Improving image quality is crucial for accurate data analysis, clear visualization, and a better user experience. Consequently, image restoration has gained significant attention in image processing and computational imaging in recent decades. Early image restoration methods primarily relied on traditional image processing techniques and optimization algorithms based mainly on statistical models and prior knowledge to address image degradation. Although these methods can partially achieve image restoration, they often suffer from high computational complexity, limited robustness, and poor detail preservation. Recently, deep learning models have shown outstanding capabilities in image restoration, including transformers1,2, diffusion models3,4, and convolutional neural networks (CNNs)5–7. These models can learn the complex patterns and features of images, such as textures, shapes, and edges, from large datasets. By understanding the structure and content of images, they can more accurately reconstruct details and textures, thus improving image quality. However, despite their effectiveness, these models face quadratically increasing complexity as the input size grows and require extensive data and computational resources for training. Solving these issues is crucial for advancing image restoration technology in practical applications, especially in real-world scenarios where computational efficiency and resource constraints are increasingly important.

To address these challenges, we propose a two-stage adaptive Mamba-based diffusion model (Diff-Mamba) for image restoration. This novel approach leverages Mamba’s linear scalability, which keeps the computational complexity manageable even as the input image size increases. By overcoming the quadratic complexity inherent in traditional deep learning models, Diff-Mamba provides a more efficient and scalable solution for image restoration, making it more suitable for large-scale and real-time applications. Mamba8 is a sequence model based on state space modeling (SSM)9–11 that has shown promise in fields such as language, audio, and images. In terms of complexity, it scales linearly with an increasing number of tokens, providing better training and inference efficiency than transformers. This linear scalability makes Mamba suitable for long-sequence modeling and a fundamental part of the diffusion model in image restoration. In addition, we introduce the diffusion feedforward neural network (DFNN) module, which incorporates diffusion time steps into the gating mechanism to regulate the information flow. Using the element-wise product of two parallel-path linear transformation layers, DFNN focuses on capturing fine details in images. Depthwise convolution encodes spatial information from adjacent pixels, enabling the network to learn the real image structure for effective restoration.

The main contributions of this study are as follows.

  • We propose a two-stage diffusion method for image restoration that applies Mamba to diffusion models. Our method effectively balances the generation capacity of diffusion models with a significant reduction in computational burden, providing an efficient solution for large-scale image restoration tasks without compromising output quality.

  • We introduce an effective DFNN module that regulates the flow of information, enabling each depthwise convolutional layer to adaptively focus on details complemented by other levels and to selectively transmit the most relevant features across layers. This selective information flow significantly improves the quality of the restored image, particularly in preserving fine details and enhancing image sharpness.

Our two-stage Mamba-based diffusion method has been validated through extensive experiments, demonstrating effectiveness in image deraining, denoising, and deblurring tasks.

Related works

Image restoration

Image restoration involves eliminating or reducing noise in images using various algorithms and techniques to recover clarity and detail, thus improving image quality. With the widespread adoption of CNNs, the performance of image denoising algorithms has significantly improved. Zhang et al.12 proposed DnCNN, a feedforward denoising CNN that adopts residual learning and batch normalization to reduce training time and improve denoising performance. Tian et al.13 introduced the cross transformer denoising CNN (CTNet), a novel method that addresses image restoration challenges in complex scenes. Anwar et al.14 proposed RIDNet, the first model to incorporate feature attention into the restoration process. RIDNet employs a modular structure in a single-stage blind image denoising network, using residual structures to capture subtle changes in images and feature attention to exploit channel correlations, thereby improving the restoration effectiveness. Transformers have been widely applied in image restoration due to their ability to learn long-term dependencies and capture global interactions among diverse contextual information. Wang et al.15 proposed Uformer, an image restoration model based on the transformer architecture, which introduces novel locally enhanced transformer blocks and learnable multiscale restoration modulators. Additionally, Zamir et al.16 introduced Restormer, an improved design targeting key modules (multihead attention and feedforward networks) within transformers that effectively captures the relationships between distant pixels. Recently, there has been increasing interest in combining diffusion models with transformers for denoising tasks. Xia et al.17 proposed DiffIR, an effective diffusion-based method for image restoration consisting of a compact image restoration prior extraction network (CPEN), a dynamic image restoration transformer (DIRformer), and a denoising network, providing innovative solutions for image denoising.

Diffusion models

The first diffusion model in image generation is the denoising diffusion probabilistic model (DDPM)3, which was the first to apply the “denoising” diffusion probability model to image generation tasks. It includes two main processes: forward and reverse diffusion. The forward process converts data into noise, whereas the reverse process converts noise into data. U-Net18,19 and the vision transformer (ViT)2 are two common backbone networks used in diffusion models. U-Net incurs high memory consumption, while ViT exhibits effective scalability and multimodal learning; however, its quadratic complexity limits visual token processing. Transformer-based diffusion models20 have recently attracted significant attention. These models rely solely on attention modules and multilayer perceptrons (MLPs), achieving significant scalability in computer vision tasks. However, transformers face efficiency problems when handling long token sequences. Inspired by Mamba, this study proposes using Mamba blocks to design backbone networks to improve computational efficiency and achieve the desired recovery effect.

Mamba

The SSM originates from control theory and describes the dynamic changes in a system through continuous states. Recently, SSMs have been widely introduced in deep learning to address long-range dependency problems. The linear state space layer (LSSL)21 is an early type of SSM known for its capability to handle long-range dependency problems. However, its complexity is relatively high. The structured state space sequence model (S4)9 reduced the complexity of LSSL by normalizing the parameters of the diagonal structure. The S4 model focuses on modeling long-range dependencies and can serve as an alternative to CNNs and transformers. S510 extends S4 with a multiple-input multiple-output (MIMO) SSM and efficient parallel scanning technology that improves model performance. The gated state space layer (GSSL)11 enhances the model’s representative ability by adding gating units to the S4 framework. Mamba, also known as S6, is a new state space model incorporating selective scanning modules, one-dimensional causal convolution, and normalization layers. It outperforms the transformer on large-scale datasets and exhibits linear scalability in sequence length. Its potential for processing large-scale image data, such as image restoration, natural language processing, point clouds, and image generation, is gradually being recognized. These innovations provide new methods for deep learning of complex sequence and large-scale image data, advancing related research.

Methods

This study develops an efficient Diff-Mamba model for high-resolution image restoration. Figure 1 shows the overall flowchart of the first-stage training pipeline (FTP) of the proposed Diff-Mamba-based image restoration, which features a 4-level U-Net architecture with Diff-Mamba modules. This model processes clean and degraded images, performs feature transformations, and estimates noisy images. On the right of Fig. 1 is the main structure of the Diff-Mamba module, which consists of two core parts: the diffusion state space model (DSSM) and the diffusion feedforward neural network (DFNN).

Fig. 1.

Fig. 1

On the left is the overall flow chart of the first-stage training pipeline of the Diff-Mamba-based image restoration. On the right is the Diff-Mamba module.

First-stage training pipeline

Given pairs of a clean image $x_0$ and a degraded image $\tilde{x}$, with $x_0, \tilde{x} \in \mathbb{R}^{H \times W \times 3}$, where $H \times W$ represents the spatial dimensions. The initial step of Diff-Mamba-based image restoration involves adding Gaussian noise with a mean of 0 and variance $\beta_t$ to the clean image $x_0$ at the time step $t$, using the forward diffusion model. The noise variance is defined by a fixed value $\beta_t$ within the interval $(0, 1)$, and the mean is determined by $\beta_t$ and the noise distribution of the data. The single-step diffusion noise addition process from the time step $t-1$ to the time step $t$ is expressed as:

$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)$ (1)

and the final expression for the noise distribution is:

$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right), \quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$ (2)

Thus, the noise sample $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, is concatenated with the degraded image $\tilde{x}$ along the channel dimension, yielding the input to Diff-Mamba. Subsequently, this input is encoded using a $3 \times 3$ convolution, generating the embedded feature $F_0 \in \mathbb{R}^{H \times W \times C}$, where $C$ is the number of channels. $F_0$ is processed through a four-level symmetric encoder-decoder. The time step $t$ is encoded and integrated into the feature $F_0$ and Diff-Mamba, respectively. The encoder-decoder structure gradually encodes and decodes the image features. Specifically, different levels of encoder and decoder are applied at different stages, gradually extracting and recovering image features through downsampling and upsampling. At each decoding level, skip connections concatenate features from the encoder and decoder to assist in information transmission and detail recovery. Finally, a $3 \times 3$ convolution is used for refinement, producing a residual image. This image is then added to the noise sample $x_t$, yielding a noise estimation $\hat{\epsilon}$.
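The forward noising and input construction described above can be sketched as follows. This is a minimal numpy illustration with an illustrative linear beta schedule and toy image sizes, not the authors' training code.

```python
import numpy as np

def make_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule with beta_t in (0, 1); endpoint values are illustrative."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, betas, eps=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)  (Eq. 2)."""
    alpha_bar = np.cumprod(1.0 - betas)
    if eps is None:
        eps = np.random.randn(*x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Build the model input: concatenate the noisy sample with the degraded image along channels.
x0 = np.random.rand(64, 64, 3)          # clean image (toy size)
degraded = np.random.rand(64, 64, 3)    # its degraded counterpart
betas = make_beta_schedule()
xt, eps = q_sample(x0, t=999, betas=betas)
model_input = np.concatenate([xt, degraded], axis=-1)  # H x W x 6 input to the network
```

At the final time step, alpha_bar is near zero, so the sample is dominated by the Gaussian noise, which is what the reverse process learns to remove.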

Second-stage training pipeline

We adopt a two-stage training pipeline. In the first stage, the Diff-Mamba model is trained in a relatively broad, preliminary manner. In the second stage, we initialize the model with the parameters (i.e., weights and biases) obtained from the first stage: the model has already learned the basic patterns during the first stage, and the goal of the second stage is to further optimize it on this foundation. The first-stage training primarily trains Diff-Mamba with a noise-constrained objective, with the specific loss function expressed as:

$\mathcal{L}_{\text{noise}} = \mathbb{E}_{x_0, \tilde{x}, \epsilon, t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, \tilde{x}, t) \right\rVert_1\right]$ (3)

By estimating the noise $\hat{\epsilon}$, a sampling algorithm can be used to generate the final clean image $\hat{x}_0$.

The second-stage training uses data from the first stage to further optimize the model, achieving optimal recovery performance when combined with four-step sampling, as shown in Fig. 2. Restoration quality is improved using the $L_1$ loss and the SSIM loss22 to constrain the restored image against the real image. The loss function is expressed as:

$\mathcal{L}_{\text{stage2}} = (1-\lambda)\left\lVert \hat{x}_0 - x_0 \right\rVert_1 + \lambda\left(1 - \mathrm{SSIM}(\hat{x}_0, x_0)\right)$ (4)

The $L_1$ loss function computes the absolute error between the generated and ground-truth images, while the SSIM loss function measures their structural similarity difference. The parameter $\lambda$ represents a weight, empirically set to 0.84.
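A minimal sketch of this mixed loss, assuming the weighting form (1 − λ)·L1 + λ·(1 − SSIM) with λ = 0.84. The SSIM here is a simplified single-window (whole-image) version for illustration, rather than the standard windowed SSIM used in practice.

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Simplified SSIM computed over the whole image as one window.
    The standard SSIM averages this statistic over local windows."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def stage2_loss(restored, target, lam=0.84):
    """Mixed second-stage loss: (1 - lam) * L1 + lam * (1 - SSIM), lam = 0.84."""
    l1 = np.abs(restored - target).mean()
    return (1 - lam) * l1 + lam * (1 - ssim_global(restored, target))

img = np.random.rand(32, 32)
noisy = np.clip(img + 0.1 * np.random.randn(32, 32), 0, 1)
loss = stage2_loss(noisy, img)   # positive for a degraded estimate, zero for a perfect one
```

The SSIM term rewards structural agreement, which pixel-wise L1 alone does not capture; the 0.84 weight follows the value reported in the paper.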

Fig. 2.

Fig. 2

The second-stage training pipeline, which can generate cleaner images using the data from the first stage training.

Diffusion state space model

The combination of the diffusion model and Mamba is typically achieved through Mamba’s optimization framework. Diffusion models typically generate samples based on the principle of progressively adding noise and denoising, and the reverse denoising process is computationally intensive. Mamba accelerates this process through efficient computation and optimization methods, such as efficient batching and multi-task learning. Mamba’s multi-stage training strategy is applied in the training of diffusion models, where the first stage quickly optimizes the initial parameters, and the second stage tunes the model parameters more finely. The SSM is designed to encode and decode one-dimensional input sequences. The model maps the input sequence $x(t)$ to the hidden state $h(t) \in \mathbb{R}^N$ and derives the predicted output sequence $y(t)$. This process is represented by a linear ordinary differential equation expressed as follows:

$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}h(t)$ (5)

where $\mathbf{A} \in \mathbb{R}^{N \times N}$ represents the state matrix that defines the evolution of the hidden state. The projection parameters $\mathbf{B} \in \mathbb{R}^{N \times 1}$ and $\mathbf{C} \in \mathbb{R}^{1 \times N}$ define the mapping of the input signal to the hidden state and of the hidden state to the output, respectively. The time-scale parameter $\Delta$, introduced in S4, controls the discretization step size of the continuous system. Using the zero-order hold (ZOH) method, the continuous system parameters $\mathbf{A}$ and $\mathbf{B}$ are converted into discrete system parameters $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$. The specific formulae are expressed as:

$\bar{\mathbf{A}} = \exp(\Delta \mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta \mathbf{A})^{-1}\left(\exp(\Delta \mathbf{A}) - \mathbf{I}\right)\Delta \mathbf{B}$ (6)

Using the discretized parameters, the hidden state $h_t$ and the output $y_t$ are calculated recursively by:

$h_t = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_t, \qquad y_t = \mathbf{C}h_t$ (7)

A structured convolution kernel $\bar{\mathbf{K}}$ is constructed for the convolution operation with the input sequence $x$, which is expressed as:

$\bar{\mathbf{K}} = \left(\mathbf{C}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\ \ldots,\ \mathbf{C}\bar{\mathbf{A}}^{M-1}\bar{\mathbf{B}}\right), \qquad y = x * \bar{\mathbf{K}}$ (8)

where $M$ represents the length of the input sequence $x$.
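The discretization in Eq. (6) and the equivalence of the recurrent form (Eq. 7) and the convolutional form (Eq. 8) can be verified numerically. The sketch below uses a toy diagonal state matrix (as in S4/Mamba), so the matrix exponential reduces to an elementwise exponential; all sizes are illustrative.

```python
import numpy as np

# Toy single-channel SSM with a diagonal state matrix (Eqs. 5-8); sizes are illustrative.
N, M = 4, 16                                   # state size, sequence length
rng = np.random.default_rng(0)
a = -rng.uniform(0.5, 1.5, N)                  # diagonal of A (negative -> stable)
B = rng.standard_normal(N)
C = rng.standard_normal(N)
delta = 0.1                                    # time-scale parameter Delta

# ZOH discretization (Eq. 6): with diagonal A, exp(Delta A) is an elementwise exp.
a_bar = np.exp(delta * a)                      # A_bar = exp(Delta A)
b_bar = (a_bar - 1.0) / a * B                  # B_bar = (Delta A)^-1 (exp(Delta A) - I) Delta B

x = rng.standard_normal(M)                     # input sequence

# Recurrent form (Eq. 7): h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t.
h = np.zeros(N)
y_rec = np.empty(M)
for t in range(M):
    h = a_bar * h + b_bar * x[t]
    y_rec[t] = C @ h

# Convolutional form (Eq. 8): K_bar = (C B_bar, C A_bar B_bar, ..., C A_bar^{M-1} B_bar).
K = np.array([C @ (a_bar**k * b_bar) for k in range(M)])
y_conv = np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(M)])
```

Both forms produce the same output sequence; this equivalence is what lets SSMs train in parallel as a convolution and run recurrently at inference.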

Mamba extracts and enhances input features through multidirectional scanning and feature fusion. The S6 block introduces a selective mechanism to improve the accuracy and effectiveness of feature extraction. The two-dimensional selective scanning module consists of three components, as shown in Fig. 3: the scan expanding of Mamba, the SSM (S6) block, and the scan merging of Mamba. The input image is scanned in four directions, converting the two-dimensional image into one-dimensional sequences. Subsequently, the S6 module extracts features from these sequences, ensuring that information is gathered along multiple directions and that diverse features are captured. The scan merging module combines the sequences from the four directions and restores the output image to its original size. The S6 module is illustrated in Fig. 4.
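The scan-expand and scan-merge steps can be sketched as follows. This is a minimal numpy illustration of one plausible reading of the four scan directions (row-major, reversed row-major, column-major, reversed column-major), not the authors' implementation.

```python
import numpy as np

def scan_expand(img):
    """Unfold an H x W feature map into four 1-D sequences:
    row-major, reversed row-major, column-major, reversed column-major."""
    flat = img.reshape(-1)          # row-major traversal
    flat_t = img.T.reshape(-1)      # column-major traversal
    return [flat, flat[::-1], flat_t, flat_t[::-1]]

def scan_merge(seqs, h, w):
    """Invert each scan path back to an H x W map and fuse the four results by summation."""
    s0 = seqs[0].reshape(h, w)
    s1 = seqs[1][::-1].reshape(h, w)
    s2 = seqs[2].reshape(w, h).T
    s3 = seqs[3][::-1].reshape(w, h).T
    return s0 + s1 + s2 + s3

x = np.arange(12, dtype=float).reshape(3, 4)
merged = scan_merge(scan_expand(x), 3, 4)   # identity S6 stand-in, so merge fuses 4 copies
```

In the real module, an S6 block transforms each of the four sequences before merging; here the transform is the identity, so the merge simply sums four copies of the input.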

Fig. 3.

Fig. 3

Two-dimensional selective scanning module. (a) Scan expanding of Mamba. (b) SSM block. (c) Scan merging of Mamba.

Fig. 4.

Fig. 4

State space model (S6) block.

Diffusion feedforward neural network

Our DFNN utilizes a gating mechanism and deep convolution to improve learning performance. The gating mechanism regulates information flow by selectively transmitting information through element-level multiplication. This design allows each convolutional layer to focus on complementing details across each level, enabling more effective learning of local structures for image restoration. The specific structure of DFNN is shown in Fig. 5.

Fig. 5.

Fig. 5

Diffusion feedforward neural network (DFNN).

After the input feature $X$ is normalized and conditioned on the diffusion time step $t$, it is sent to the DFNN for feature transformation, and the result is residually connected to the original input. Given the input feature $X$, the output $\hat{X}$ of the DFNN is expressed as:

$\hat{X} = W_p^0\left[\phi\left(W_d^1 W_p^1\,\mathrm{LN}(X)\right) \odot W_d^2 W_p^2\,\mathrm{LN}(X)\right] + X$ (9)

where $W_p$ denotes a $1 \times 1$ point-wise convolution and $W_d$ a $3 \times 3$ depth-wise convolution. $\phi$ denotes the GELU nonlinear activation function, $\odot$ denotes element-wise multiplication, and LN denotes layer normalization.
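As a concrete illustration, the gated feedforward computation of Eq. (9) can be sketched in numpy as below. The parameter names (wp1, wd1, ...) and the hidden width are illustrative assumptions, and the time-step conditioning applied before normalization in the paper is omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the channel axis (last dim) of an H x W x C feature."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def pointwise(x, w):
    """1x1 convolution == per-pixel channel mixing; w has shape (C_in, C_out)."""
    return x @ w

def depthwise3x3(x, k):
    """3x3 depthwise convolution with zero padding; k has shape (3, 3, C)."""
    h, w_, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + h, j:j + w_] * k[i, j]
    return out

def dfnn(x, p):
    """Gated feedforward (Eq. 9): two parallel pointwise+depthwise paths,
    a GELU-gated elementwise product, a pointwise projection, and a residual add."""
    z = layer_norm(x)
    gate = gelu(depthwise3x3(pointwise(z, p["wp1"]), p["wd1"]))
    value = depthwise3x3(pointwise(z, p["wp2"]), p["wd2"])
    return pointwise(gate * value, p["wp0"]) + x

rng = np.random.default_rng(0)
C, Ch = 8, 16                    # input channels, hidden channels (illustrative)
params = {
    "wp1": rng.standard_normal((C, Ch)) * 0.1,
    "wp2": rng.standard_normal((C, Ch)) * 0.1,
    "wd1": rng.standard_normal((3, 3, Ch)) * 0.1,
    "wd2": rng.standard_normal((3, 3, Ch)) * 0.1,
    "wp0": rng.standard_normal((Ch, C)) * 0.1,
}
x = rng.standard_normal((16, 16, C))
y = dfnn(x, params)              # output keeps the input shape
```

The gate path decides, per pixel and channel, how much of the value path passes through, which is the selective information flow the DFNN is designed to provide.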

Experiments

Experiment settings

We applied Diff-Mamba to three image restoration tasks: image deraining, denoising, and deblurring. All experiments were performed on an NVIDIA RTX 4090 GPU. The model underwent the first-stage training followed by the second-stage training, with the total number of diffusion time steps $T$ set to 1,000. Optimization was performed using the AdamW optimizer, with the initial learning rate gradually reduced through cosine annealing. During the first-stage training, the (patch size, batch size) configuration was updated progressively every 10k iterations. A total of 270k iterations were conducted for rain and noise removal, and 330k iterations for blur removal. In the second-stage training, the (patch size, batch size) configuration was updated every 5k iterations, with 90k iterations for rain and noise reduction and 30k iterations for deblurring. Training took 76 hours for image denoising, 36 hours for image deraining, and 64 hours for image deblurring. Inference can be run in real time during training: in the first stage, PSNR values are recorded and the corresponding checkpoints are saved every 10k iterations; in the second stage, every 5k iterations. Wandb and TensorBoard23 were used for logging and monitoring, respectively. Experimental results for other methods were obtained from pretrained models or official experimental reports.
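Cosine annealing of the learning rate, as used above, can be sketched as follows. The endpoint values lr_max and lr_min are placeholders, since the paper's exact learning rates are not reproduced here.

```python
import math

def cosine_annealing(step, total_steps, lr_max, lr_min):
    """Cosine-annealed learning rate decaying from lr_max at step 0
    to lr_min at total_steps (placeholder endpoint values)."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos

# Example schedule over the 270k first-stage iterations.
total = 270_000
lrs = [cosine_annealing(s, total, lr_max=2e-4, lr_min=1e-6)
       for s in (0, total // 2, total)]   # start, midpoint, end of training
```

The schedule starts at lr_max, passes through the midpoint of the two endpoints halfway through training, and decays smoothly to lr_min.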

Image deraining

The model was trained on the Rain13K dataset16, which consists of 13,712 pairs of clean and rain-affected images. Its robustness and accuracy were evaluated on four datasets: Test10024, Rain100H25, Rain100L25, and Test280026. The peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) were used as evaluation metrics, with higher values indicating closer proximity to the original image. The results indicate that Diff-Mamba generally outperforms most existing methods across multiple datasets, including PReNet7, MSPFN27, MPRNet28, HINet29, SPAIR30, and IR-SDE31, as shown in Table 1. Note that the results of IR-SDE come from retraining with the publicly available code, while IR-SDE* denotes results obtained with the released pretrained model. The learned perceptual image patch similarity (LPIPS) measures the perceptual similarity of images: a higher value indicates a greater perceptual difference between two images, while a lower value indicates that they are perceptually more similar. The results in Table 2 show that Diff-Mamba outperforms IR-SDE on all test datasets with respect to PSNR and SSIM, indicating superior performance in restoring image detail and structure. Table 2 also shows that Diff-Mamba performs better on the LPIPS metric (lower LPIPS value), demonstrating an advantage in perceptual recovery quality. Diff-Mamba not only excels in traditional image quality assessment metrics but also shows higher performance in perceptual similarity, highlighting its effectiveness in the image deraining task. Furthermore, its performance under high rain intensity is comparable to that of the latest methods, with Diff-Mamba exhibiting strong rain removal and image restoration capabilities, particularly in high-rain-intensity scenarios.
Figure 6 shows visual results of our Diff-Mamba method on challenging examples from the Test100 dataset, illustrating its effectiveness in improving image clarity, removing rain, and restoring details obscured by rain.

Table 1.

Image deraining.

Methods Test10024 Rain100H25 Rain100L25 Test280026
PSNR↑ SSIM↑ PSNR↑ SSIM↑ PSNR↑ SSIM↑ PSNR↑ SSIM↑
PReNet7 24.81 0.851 27.77 0.858 32.44 0.950 31.75 0.916
MSPFN27 27.50 0.876 28.66 0.860 32.40 0.933 32.82 0.930
MPRNet28 30.27 0.897 30.41 0.890 36.40 0.965 33.64 0.938
HINet29 30.29 0.906 30.65 0.894 37.28 0.970 33.91 0.941
SPAIR30 30.35 0.909 30.95 0.892 36.93 0.969 33.34 0.936
IR-SDE31 26.74 0.834 20.79 0.699 30.83 0.912 30.42 0.891
IR-SDE*31 – – 31.65 0.904 38.30 0.980 – –
Diff-Mamba (Ours) 30.96 0.912 30.47 0.893 37.07 0.960 33.93 0.939

Significant values are in bold.

Table 2.

Compare our method with IR-SDE31 in image rain removal.

Methods Test10024 Rain100H25 Rain100L25 Test280026
PSNR↑ SSIM↑ LPIPS↓ PSNR↑ SSIM↑ LPIPS↓ PSNR↑ SSIM↑ LPIPS↓ PSNR↑ SSIM↑ LPIPS↓
IR-SDE31 26.74 0.834 0.125 20.79 0.699 0.267 30.83 0.912 0.102 30.42 0.891 0.065
Diff-Mamba (Ours) 30.96 0.912 0.112 30.47 0.893 0.174 37.07 0.960 0.088 33.93 0.939 0.053

Significant values are in bold.

Fig. 6.

Fig. 6

Image deraining on the Test100 dataset. This figure presents a comparison of Diff-Mamba with other advanced methods, demonstrating that Diff-Mamba effectively removes rain and restores details. For optimal perspective, specific areas are shown in close-up. Zooming into these areas highlights Diff-Mamba’s superior performance in removing rain and restoring details.

Image denoising

Our denoising experiments were performed on the SIDD dataset32, which contains 320 pairs of high-resolution images randomly cropped into 30,608 image blocks of size 512 × 512. We evaluated 1,280 image blocks of size 256 × 256 from the SIDD validation set and 1,280 image blocks of size 512 × 512 from the DND dataset33. The DND dataset was evaluated online via the official website: https://noise.visinf.tu-darmstadt.de/. Table 3 shows that Diff-Mamba outperforms most existing methods on the SIDD and DND datasets. On the SIDD dataset, Diff-Mamba surpasses various advanced methods with respect to PSNR and SSIM, coming very close to the best method, NAFNet34; on the DND dataset, our method outperforms NAFNet. The results highlight Diff-Mamba’s strong ability for image denoising, particularly in image clarity and structural restoration, demonstrating its competitiveness among cutting-edge technologies. Compared with other state-of-the-art methods, Diff-Mamba demonstrates exceptional effectiveness in enhancing image quality and preserving details. Figures 7 and 8 show visual comparisons of Diff-Mamba with other methods on the SIDD and DND datasets, demonstrating our method’s good denoising ability while retaining clearer structures.

Table 3.

Image denoising.

Methods SIDD32 DND33
PSNR↑ SSIM↑ PSNR↑ SSIM↑
RIDNet14 38.71 0.914 39.26 0.953
CBDNet35 30.78 0.801 38.06 0.942
MHCNN36 39.06 0.914 39.52 0.951
CDN37 39.36 0.918 39.44 0.951
MIRNet38 39.72 0.959 39.88 0.956
CycleISP39 39.52 0.957 39.56 0.956
MPRNet28 39.71 0.958 39.81 0.954
NAFNet34 40.30 0.961 38.41 0.943
VDN40 39.26 0.955 39.38 0.951
C-BSN-DND41 36.84 0.933 38.60 0.941
SDAP42 37.53 0.936 38.17 0.932
Diff-Mamba (Ours) 39.81 0.960 39.63 0.956

Significant values are in bold.

Fig. 7.

Fig. 7

Comparison of denoising results on SIDD dataset. The comparison highlights the denoising effects of different methods on the same image scene. The images generated by Diff-Mamba exhibit less noise visually while preserving more details.

Fig. 8.

Fig. 8

Comparison of denoising results on DND dataset. For optimal perspective, specific areas are shown in close-up. Zooming into these areas highlights Diff-Mamba’s superior performance in both denoising and detail restoration.

Image deblurring

Our deblurring model was trained on the GoPro dataset43, which contains 2,103 pairs of clear and blurred images, and was tested on the RealBlur-R and RealBlur-J datasets44. Table 4 shows that Diff-Mamba outperforms traditional methods, such as DeblurGAN45 and DeblurGAN-v246, in all metrics, particularly in complex blur scenarios, where it provides greater clarity and visual quality. Additionally, Diff-Mamba is competitive with leading methods such as MPRNet and NAFNet in recent benchmark tests, demonstrating advantages in certain metrics. Table 5 indicates that Diff-Mamba shows overall superior performance in the PSNR, SSIM, and LPIPS metrics, demonstrating a comprehensive improvement in image quality and detail preservation for deblurring tasks. In particular, on the RealBlur-J dataset, Diff-Mamba significantly outperforms IR-SDE in both PSNR and SSIM, highlighting its robust capability in handling complex blur conditions. On the GoPro dataset, Diff-Mamba achieves a PSNR of 31.60 dB and an SSIM of 0.953, significantly higher than DiffUIR’s 29.17 dB and 0.864. Figure 9 illustrates the effectiveness of our method in generating high-quality images compared with existing approaches. Specifically, our method achieves significantly clearer and more realistic results on the RealBlur-R and RealBlur-J test sets. The images produced by our method exhibit a marked improvement in clarity, with finer details becoming more distinct and less obscured by blurriness, contributing to a more accurate and realistic representation of the original scene, as is evident from the side-by-side comparisons presented in the figure.

Table 4.

Image deblurring.

Methods GoPro43 RealBlur-R44 RealBlur-J44
PSNR↑ SSIM↑ PSNR↑ SSIM↑ PSNR↑ SSIM↑
DeblurGAN45 28.70 0.858 33.79 0.903 27.97 0.834
DeblurGAN-v246 29.55 0.934 35.26 0.944 28.70 0.866
DMPHN47 31.20 0.940 35.70 0.948 28.42 0.860
MT-RNN48 31.15 0.945 35.79 0.951 28.44 0.862
DBGAN49 31.10 0.942 33.78 0.909 24.93 0.745
MPRNet28 32.66 0.959 35.99 0.952 28.70 0.873
NAFNet34 33.71 0.967 35.97 0.951 28.31 0.856
IR-SDE31 30.70 0.901 33.96 0.918 24.21 0.729
DiffUIR50 29.17 0.864 – – – –
Diff-Mamba (Ours) 31.60 0.953 35.99 0.953 28.67 0.876

Significant values are in bold.

Table 5.

Compare our Diff-Mamba with IR-SDE31 in image deblurring.

Methods GoPro43 RealBlur-R44 RealBlur-J44
PSNR↑ SSIM↑ LPIPS↓ PSNR↑ SSIM↑ LPIPS↓ PSNR↑ SSIM↑ LPIPS↓
IR-SDE31 30.70 0.901 0.064 33.96 0.918 0.114 24.21 0.729 0.267
Diff-Mamba (Ours) 31.60 0.953 0.076 35.81 0.953 0.076 28.67 0.870 0.165

Significant values are in bold.

Fig. 9.

Fig. 9

Comparison of deblurring in the RealBlur-R dataset. For optimal perspective, specific areas are shown in close-up. By zooming into these areas, Diff-Mamba achieves the desired effect when compared with other advanced methods.

Ablation study

To investigate the proposed image restoration method, various ablation studies were conducted to verify the effectiveness of the proposed components.

Effect of DFNN

To demonstrate the effectiveness of the feedforward network, we trained on the SIDD dataset using only the first-stage training. The testing results are shown in Table 6(a). As can be seen, both PSNR and SSIM improve with the DFNN, demonstrating the effectiveness of incorporating it.

Table 6.

Ablation experiments.

Experiments PSNR↑ SSIM↑ Training time↓ (h)
(a) Without DFNN 39.24 0.951 49.5
With DFNN 39.72 0.958 78.5
(b) Only one stage (270K) 30.45 0.891 30.3
Only one stage (360K) 30.76 0.903 41.2
Two-stage (270K+90K) 30.96 0.912 42.5
(c) Without t embedding 28.65 0.841 31.5
With t embedding 30.45 0.891 30.3

Significant values are in bold.

Effect of two-stage training

To demonstrate the effectiveness of the two-stage training approach, we trained on the Rain13K dataset and evaluated on the Test100 dataset. The first stage was trained for 270K iterations and the second stage for 90K iterations. Table 6(b) shows that two-stage training significantly improves recovery results compared with single-stage training.

Effect of time step embedding

We embedded the time step t into the DFNN and performed only the first-stage training on the Rain13K dataset. To validate its effectiveness, we applied the trained model to the Test100 dataset for image deraining. Table 6(c) indicates that embedding the time step t improves image restoration quality with less training time, demonstrating that it is crucial for the proposed image restoration method.

Effect of sampling step

The diffusion model generates the final image through multiple sampling steps, each step progressively refining the image quality. As shown in Table 7, very few steps (2–3) fail to restore the image adequately, the distortion metrics (PSNR and SSIM) peak at around 4 steps, and further increasing the number of steps improves the perceptual metric (LPIPS), i.e., better visual quality, but at the cost of longer processing time. Choosing the number of sampling steps therefore requires balancing these metrics against runtime. Based on this trade-off, we use 4 sampling steps in the second-stage training, ensuring good perceptual quality without excessively increasing the computational cost.

Table 7.

Sampling step S for image restoration.

S PSNR↑ SSIM↑ LPIPS↓ Time↓ (s)
2 10.98 0.018 1.4163 30
3 25.63 0.876 0.1673 56
4 30.92 0.899 0.1313 104
5 30.43 0.893 0.1273 128
10 30.00 0.883 0.1233 235
15 29.28 0.867 0.1139 681
50 28.97 0.856 0.1091 821

Significant values are in bold.
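The S-step sampling procedure in Table 7 has the shape of a generic reverse-diffusion loop; the sketch below is illustrative only, with `denoise` a hypothetical stand-in for the trained network and a simplified update rule rather than the paper's exact sampler:

```python
def sample(denoise, x_T, steps):
    """Refine an initial noisy input x_T over `steps` reverse-diffusion steps."""
    x = x_T
    # Walk the normalized time axis from t = 1.0 down toward t = 0.
    for i in range(steps, 0, -1):
        t = i / steps
        x = denoise(x, t)  # each call removes part of the remaining degradation
    return x

# Toy denoiser: halves a scalar "residual" at every step.
result = sample(lambda x, t: x * 0.5, x_T=16.0, steps=4)
print(result)  # 1.0
```

Each extra step adds one full network forward pass, which is why runtime in Table 7 grows roughly linearly (and the measured times also include per-run overhead).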

Effect of the weighting parameter λ of the loss function

In the second stage, we set the weighting parameter λ = 0.84 in the mixed loss function. To validate this choice, we conducted a series of experiments to systematically evaluate the impact of different λ values on the model's performance. We started from the pretrained model obtained in the first stage, which provides a good initialization for further training, and varied λ to adjust the weight of the balance term in the loss function. To shorten the runtime, the number of iterations was set to 9000. Specifically, we trained the model with λ ∈ {0.12, 0.24, 0.36, 0.48, 0.60, 0.72, 0.84, 0.96} and observed the impact of each configuration on performance. As shown in Table 8, λ = 0.84 provides the highest PSNR value of 37.02 dB. The other configurations do not significantly change the training time, but their PSNR values are slightly lower and bring no performance gain. Therefore, λ = 0.84 strikes the best balance between improving image quality and maintaining reasonable training time, making it the most effective choice for the second stage of training.

Table 8.

The impact of λ on the second stage.

λ 0.12 0.24 0.36 0.48 0.60 0.72 0.84 0.96
PSNR↑ (dB) 36.87 36.86 37.01 36.69 36.78 36.80 37.02 36.92
Time↓ (min) 50.4 50.3 50.6 50.6 50.8 50.0 50.2 50.4

Significant values are in bold.
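The selection procedure behind Table 8 is a simple grid search over λ; the sketch below is illustrative (`train_and_eval` is a hypothetical stand-in for retraining and measuring validation PSNR, here replaced by a lookup of the table's own values):

```python
def best_lambda(train_and_eval, candidates):
    """Grid search: evaluate every candidate weighting and keep the best PSNR."""
    results = {lam: train_and_eval(lam) for lam in candidates}
    best = max(results, key=results.get)
    return best, results[best]

# Stand-in evaluation using the PSNR values from Table 8.
table8 = {0.12: 36.87, 0.24: 36.86, 0.36: 37.01, 0.48: 36.69,
          0.60: 36.78, 0.72: 36.80, 0.84: 37.02, 0.96: 36.92}
lam, psnr = best_lambda(table8.get, list(table8))
print(lam, psnr)  # 0.84 37.02
```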

Comparison of model size and computational burden

We compared the computational complexity of several image denoising models. As shown in Table 9, our Diff-Mamba model significantly outperforms the other methods in terms of computational burden and model size. Despite having fewer parameters and lower computational complexity, it still delivers the best image restoration quality across various noise levels. In contrast, although CycleISP and MPRNet perform well in restoration quality, they require considerably more computational resources.

Table 9.

The comparison of model size and computational burden.

Methods FLOPs↓ (G) Params↓ (M) PSNR↑
CycleISP39 10.6 23.88 39.49
MPRNet28 13.2 21.63 39.68
Diff-Mamba(ours) 9.3 15.93 39.80

Significant values are in bold.
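Figures like those in Table 9 are typically tallied layer by layer; the back-of-envelope formula for a single convolutional layer is sketched below (illustrative only, not the authors' counting code, and layer sizes are hypothetical):

```python
def conv2d_cost(h, w, c_in, c_out, k):
    """Parameter and FLOP count for one k x k conv layer on an h x w feature map."""
    params = c_out * (c_in * k * k + 1)        # weights + biases
    flops = 2 * h * w * c_in * c_out * k * k   # each multiply-add counted as 2 ops
    return params, flops

# Hypothetical 3x3 conv with 64 channels on a 256x256 feature map.
p, f = conv2d_cost(h=256, w=256, c_in=64, c_out=64, k=3)
print(p, f / 1e9)  # params and GFLOPs for this one layer
```

Summing such per-layer counts over the whole network yields the Params (M) and FLOPs (G) columns above.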

Running time

We compared the running time of our method with SOTA image restoration methods. To ensure a fair comparison, all methods were evaluated on input images of the same resolution using the publicly available code on an NVIDIA RTX 4090 GPU. As shown in Table 10, Diff-Mamba achieves a good balance between processing speed and image restoration quality. Its processing time of 106 s is significantly lower than that of CycleISP (582 s), MPRNet (359 s), and MIRNet (238 s), demonstrating higher computational efficiency than these Transformer-based and CNN-based methods while maintaining excellent restoration performance. In terms of PSNR, Diff-Mamba reaches 30.92 dB, slightly higher than MPRNet (30.91 dB), CycleISP (30.82 dB), and MIRNet (30.76 dB). Despite its lower computational cost, Diff-Mamba thus provides comparable or better restoration quality, proving its efficiency and superiority in practical applications.

Table 10.

Running time of different methods of image restoration.

Methods CycleISP39 MPRNet28 MIRNet38 Diff-Mamba (Ours)
Time↓ (s) 582 359 238 106
PSNR↑ (dB) 30.82 30.91 30.76 30.92

Significant values are in bold.
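Wall-clock comparisons such as Table 10 require warm-up runs and averaging to be meaningful; a minimal timing harness (illustrative only, with a toy workload standing in for a restoration forward pass) looks like this:

```python
import time

def benchmark(fn, warmup=2, repeats=5):
    """Average wall-clock time of fn() over `repeats` runs, after warm-up.

    For GPU models, fn must block until the device finishes (e.g. by calling
    torch.cuda.synchronize) or the measured time will be misleading.
    """
    for _ in range(warmup):
        fn()                       # warm-up runs excluded from timing
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# Toy CPU workload in place of a real model forward pass.
avg = benchmark(lambda: sum(i * i for i in range(10_000)))
print(avg > 0)  # True
```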

Discussions

Our Diff-Mamba implements image deraining, denoising, and deblurring, demonstrating competitive restoration performance on various commonly used datasets. Furthermore, the core principles of our algorithm may be useful for enhancing image quality under low-light conditions, though additional considerations such as noise reduction and detail preservation would need to be addressed. We believe that exploring this in future work could be an interesting direction.

Diffusion models and transformer-based methods typically have high computational costs, especially when handling large-scale data, with computational complexity approaching O(N²). Our Diff-Mamba reduces this complexity to a linear O(N), significantly improving the processing speed. Although our algorithm provides promising results, several limitations still need to be discussed.
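The gap between O(N²) and O(N) can be made concrete with rough operation counts: self-attention forms an N × N similarity matrix over the tokens, whereas an SSM-style scan touches each of the N tokens a constant number of times. The sketch below is illustrative (the constants and the state size are assumptions, not measurements of Diff-Mamba):

```python
def attention_ops(n, d):
    """Rough op count for self-attention over n tokens of width d:
    the QK^T matrix plus the attention-weighted sum of V."""
    return 2 * n * n * d

def ssm_scan_ops(n, d, state=16):
    """Rough op count for a linear SSM scan: one constant-size recurrent
    update per token (state size 16 is an illustrative assumption)."""
    return n * d * state

# The attention/scan ratio grows linearly with sequence length n.
for n in (1_024, 4_096, 16_384):
    print(n, attention_ops(n, d=64) // ssm_scan_ops(n, d=64))
```

Under these assumptions the ratio is n/8, so quadrupling the sequence length quadruples attention's relative cost, which is the practical content of the O(N²) versus O(N) claim.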

Although Diff-Mamba offers advantages in terms of computational complexity and speed, its image restoration accuracy may not be on par with some more complex algorithms. Additionally, Diff-Mamba may struggle to fully recover image details under extreme low-light conditions or heavy fog, particularly in complex scenes where the balance between noise reduction and detail preservation could affect the final results. We plan to further explore and address these issues in future research.

The algorithm has primarily been evaluated on a limited set of benchmark datasets, and further validation is needed on more diverse real-world data to confirm its robustness and generalizability. Although our method may not yield the best results in terms of accuracy, this is primarily due to its trade-off between performance and computational efficiency. The design prioritizes reducing computational complexity, which may limit its ability to achieve the highest performance on certain tasks.

Conclusions

This study proposes a two-stage Mamba-based diffusion model for image restoration, which incorporates two main components: DSSM and DFNN. DSSM combines Mamba's high efficiency with the representative power of diffusion models, improving both inference and training. DFNN regulates the information flow so that each depthwise convolutional layer focuses on image details and learns more effective local structures for restoration. Extensive experiments demonstrate that our Diff-Mamba method is highly competitive in image deraining, denoising, and deblurring tasks. Diff-Mamba exhibits linear complexity in theory, addressing the challenge of quadratic complexity, and demonstrates its superiority across multiple image restoration tasks in practical applications, achieving competitive results on limited computing resources.

Acknowledgements

The authors would like to thank Editage (www.editage.cn) for English language editing.

Author contributions

L.L. and J.W. conceptualized the framework; L.M. conducted the experiments; S.W. acquired the funding; S.W. and J.W. validated the results; L.L. and L.M. drafted the original manuscript, and S.M. reviewed and edited the manuscript. All authors have read and approved the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Hebei Province (No. F2022201013); the Scientific Research Program of Anhui Provincial Ministry of Education (No. 2024AH051686); the Science and Technology Program of Huaibei (No. 2023HK037); the Anhui Shenhua Meat Products Co., Ltd. Cooperation Project (No. 22100084); and the Entrusted Project by Huaibei Mining Group (2023).

Data availability

The data used in this study are available from public links. Rain13K: https://github.com/swz30/Restormer/blob/main/Deraining/download_data.py. Test100, Rain100H, Rain100L, and Test2800: https://github.com/hezhangsprinter/DID-MDN/tree/master. SIDD: https://github.com/AbdoKamel/sidd-ground-truth-image-estimation. DND: https://noise.visinf.tu-darmstadt.de/. GoPro: https://seungjunnah.github.io/Datasets/gopro. RealBlur-R and RealBlur-J: http://cg.postech.ac.kr/research/realblur/.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, 1–21 (Vienna, Austria, 2021).
  • 2.Khan, S. et al. Transformers in vision: A survey. ACM Comput. Surv. 54, 1–41 (2022).
  • 3.Ho, J., Jain, A., Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, 6840–6851 (Vancouver, Canada, 2020).
  • 4.Choi, J., Kim, S., Jeong, Y., Gwon, Y., Yoon, S. ILVR: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 14347–14356 (Montreal, QC, Canada, 2021).
  • 5.Dong, C., Deng, Y., Loy, C., Tang, X. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 576–584 (Santiago, Chile, 2015).
  • 6.Yu, J. et al. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 4470–4479 (Seoul, Korea, 2019).
  • 7.Ren, D., Zuo, W., Hu, Q., Zhu, P., Meng, D. Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3937–3946 (Long Beach, CA, USA, 2019).
  • 8.Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. Preprint at arXiv:2312.00752 (2023).
  • 9.Gu, A., Goel, K. & Ré, C. Efficiently modeling long sequences with structured state spaces. Preprint at arXiv:2111.00396 (2021).
  • 10.Smith, J. T., Warrington, A. & Linderman, S. W. Simplified state space layers for sequence modeling. Preprint at arXiv:2208.04933 (2022).
  • 11.Mehta, H., Gupta, A., Cutkosky, A. & Neyshabur, B. Long range language modeling via gated state spaces. Preprint at arXiv:2206.13947 (2022).
  • 12.Zhang, K., Zuo, W., Chen, Y., Meng, D. & Zhang, L. Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26, 3142–3155 (2017).
  • 13.Tian, C. et al. A cross transformer for image denoising. Inf. Fusion 102, 102043 (2024).
  • 14.Anwar, S. & Barnes, N. Real image denoising with feature attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3155–3164 (Seoul, Korea, 2019).
  • 15.Wang, Z. et al. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17683–17693 (New Orleans, LA, USA, 2022).
  • 16.Zamir, S. et al. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5718–5729 (New Orleans, LA, USA, 2022).
  • 17.Xia, B. et al. Diffir: Efficient diffusion model for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 13049–13059 (Paris, France, 2023).
  • 18.Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention – MICCAI, 234–241 (Munich, Germany, 2015).
  • 19.Oktay, O. et al. Attention u-net: Learning where to look for the pancreas. Preprint at arXiv:1804.03999 (2018).
  • 20.Wang, L. et al. Learning a coarse-to-fine diffusion transformer for image restoration. Preprint at arXiv:2308.08730 (2023).
  • 21.Gu, A. et al. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In Proceedings of the 35th International Conference on Neural Information Processing Systems, 572–585 (Virtual, 2021).
  • 22.Zhao, H., Gallo, O., Frosio, I. & Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 3, 47–57 (2016).
  • 23.Abadi, M. et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. Preprint at arXiv:1603.04467 (2016).
  • 24.Zhang, H., Sindagi, V. & Patel, V. Image de-raining using a conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 30, 3943–3956 (2020).
  • 25.Yang, W. et al. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1685–1694 (Honolulu, HI, USA, 2017).
  • 26.Zhang, H. & Patel, V. Density-aware single image de-raining using a multi-stream dense network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 695–704 (Salt Lake City, UT, USA, 2018).
  • 27.Jiang, K. et al. Multi-scale progressive fusion network for single image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8343–8352 (Seattle, WA, USA, 2020).
  • 28.Zamir, S. et al. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14816–14826 (Nashville, TN, USA, 2021).
  • 29.Chen, L., Lu, X., Zhang, J., Chu, X. & Chen, C. Hinet: Half instance normalization network for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 182–192 (Nashville, TN, USA, 2021).
  • 30.Purohit, K., Suin, M., Rajagopalan, A. & Boddeti, V. Spatially-adaptive image restoration using distortion-guided networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2289–2299 (Montreal, QC, Canada, 2021).
  • 31.Luo, Z., Gustafsson, F., Zhao, Z., Sjölund, J. & Schon, T. Image restoration with mean-reverting stochastic differential equations. In Proceedings of the 40th International Conference on Machine Learning, 23045–23066 (Honolulu, HI, USA, 2023).
  • 32.Abdelhamed, A., Lin, S. & Brown, M. A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1692–1700 (Salt Lake City, UT, USA, 2018).
  • 33.Plötz, T. & Roth, S. Benchmarking denoising algorithms with real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2750–2759 (Honolulu, HI, USA, 2017).
  • 34.Chen, L., Chu, X., Zhang, X. & Sun, J. Simple baselines for image restoration. In Proceedings of the European Conference on Computer Vision, 17–33 (Tel Aviv, Israel, 2022).
  • 35.Guo, S., Yan, Z., Zhang, K., Zuo, W. & Zhang, L. Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1712–1722 (Long Beach, CA, USA, 2019).
  • 36.Zhang, J., Qu, M., Wang, Y. et al. A multi-head convolutional neural network with multi-path attention improves image denoising. In Pacific Rim International Conference on Artificial Intelligence, 338–351 (Springer Nature Switzerland, Cham, 2022).
  • 37.Zhang, J. et al. Considering image information and self-similarity: A compositional denoising network. Sensors 23, 5915 (2023).
  • 38.Zamir, S. et al. Learning enriched features for real image restoration and enhancement. In Proceedings of the European Conference on Computer Vision, 492–511 (Glasgow, UK, 2020).
  • 39.Zamir, S. et al. Cycleisp: Real image restoration via improved data synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2693–2702 (Seattle, WA, USA, 2020).
  • 40.Yue, Z., Yong, H., Zhao, Q., Meng, D. & Zhang, L. Variational denoising network: Toward blind noise modeling and removal. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 1690–1701 (Vancouver, Canada, 2019).
  • 41.Jang, Y., Lee, K., Park, G., Kim, S. & Cho, N. Self-supervised image denoising with downsampled invariance loss and conditional blind-spot network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 12162–12171 (Paris, France, 2023).
  • 42.Pan, Y., Liu, X., Liao, X., Cao, Y. & Ren, C. Random sub-samples generation for self-supervised real image denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 12116–12125 (Paris, France, 2023).
  • 43.Nah, S., Kim, TH. & Lee, KM. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 257–265 (Honolulu, HI, USA, 2017).
  • 44.Rim, J., Lee, H., Won, J. & Cho, S. Real-world blur dataset for learning and benchmarking deblurring algorithms. In Proceedings of the European Conference on Computer Vision, 184–201 (Glasgow, UK, 2020).
  • 45.Kupyn, O., Budzan, V., Mykhailych, M., Mishkin, D. & Matas, J. Deblurgan: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8183–8192 (Salt Lake City, UT, USA, 2018).
  • 46.Kupyn, O., Martyniuk, T., Wu, J. & Wang, Z. Deblurgan-v2: Deblurring (orders-of-magnitude) faster and better. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 8877–8886 (Seoul, Korea, 2019).
  • 47.Zhang, H., Dai, Y., Li, H. & Koniusz, P. Deep stacked hierarchical multi-patch network for image deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5971–5979 (Long Beach, CA, USA, 2019).
  • 48.Park, D., Kang, DU., Kim, J. & Chun, S. Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. In Proceedings of the European Conference on Computer Vision, 327–343 (Glasgow, UK, 2020).
  • 49.Zhang, K. et al. Deblurring by realistic blurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2734–2743 (Seattle, WA, USA, 2020).
  • 50.Zheng, D. et al. Selective hourglass mapping for universal image restoration based on diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 25445–25455 (2024).


