Scientific Reports, 2025 Jul 1; 15:22265. doi: 10.1038/s41598-025-07032-3

Two-stage Mamba-based diffusion model for image restoration

Lei Liu 1,2, Luan Ma 1, Shuai Wang 1,2, Jun Wang 3, Silas N Melo 4
PMCID: PMC12216552  PMID: 40595135

Abstract

Image restoration is fundamental in computer vision, aiming to recover high-quality images from degraded ones. Recently, models such as transformers and diffusion models have shown notable success in addressing this challenge. However, transformer-based methods face high computational costs due to quadratic complexity, while diffusion-based methods often produce suboptimal results due to inaccurate noise estimation. This study proposes Diff-Mamba, a two-stage adaptive Mamba-based diffusion model for image restoration. Diff-Mamba integrates the linear-complexity state space model (SSM, also known as Mamba) into image restoration, expanding its applicability to visual data generation. Diff-Mamba mainly consists of two parts: the diffusion state space model (DSSM) and the diffusion feedforward neural network (DFNN). DSSM combines Mamba’s high efficiency with the representational power of diffusion models, enhancing both inference and training. DFNN regulates the information flow, enabling each depthwise convolutional layer to focus on the details of the image and thus learn more effective local structures for image restoration. The study’s findings, verified through extensive experiments, indicate that Diff-Mamba outperforms both diffusion-based and transformer-based methods in image deraining, denoising, and deblurring, demonstrating competitive restoration performance on various commonly used datasets. Code is available at https://github.com/maluan-ml/Diff-Mamba.

Keywords: Diffusion Mamba, Image deblurring, Image denoising, Image deraining, Image restoration

Subject terms: Engineering, Electrical and electronic engineering

Introduction

Image restoration is the process of recovering low-quality images affected by factors such as rain, noise, and blurring. Improving image quality is crucial for accurate data analysis, clear visualization, and a better user experience. Consequently, image restoration has gained significant attention in image processing and computational imaging in recent decades. Early image restoration methods primarily relied on traditional image processing techniques and optimization algorithms based mainly on statistical models and prior knowledge to address image degradation. Although these methods can partially achieve image restoration, they often suffer from high computational complexity, limited robustness, and poor detail preservation. Recently, deep learning models have shown outstanding capabilities in image restoration, including transformers1,2, diffusion models3,4, and convolutional neural networks (CNNs)5–7. These models can learn the complex patterns and features of images, such as textures, shapes, and edges, from large datasets. By understanding the structure and content of images, they can more accurately reconstruct details and textures, thus improving image quality. However, despite their effectiveness, these models face quadratically increasing complexity as the input size grows and require extensive data and computational resources for training. Solving these issues is crucial for advancing image restoration technology in practical applications, especially in real-world scenarios where computational efficiency and resource constraints are increasingly important.

To address these challenges, we propose a two-stage adaptive Mamba-based diffusion model (Diff-Mamba) for image restoration. This novel approach leverages Mamba’s linear scalability, which keeps the computational complexity manageable even as the input image size increases. By overcoming the quadratic complexity inherent in traditional deep learning models, Diff-Mamba provides a more efficient and scalable solution for image restoration, making it more suitable for large-scale and real-time applications. Mamba8 is a sequence model based on state space modeling (SSM)9–11 that has shown promise in fields such as language, audio, and images. In terms of complexity, it scales linearly with an increasing number of tokens, providing better training and inference efficiency than transformers. This linear scalability makes Mamba suitable for long-sequence modeling and a fundamental part of the diffusion model in image restoration. In addition, we introduce the diffusion feedforward neural network (DFNN) module, which incorporates diffusion time steps into the gating mechanism to regulate the information flow. Using the element-wise product of two parallel-path linear transformation layers, DFNN focuses on capturing fine details in images. Depthwise convolution encodes spatial information from adjacent pixels, enabling the network to learn the real image structure for effective restoration.

The main contributions of this study are as follows.

  • We propose a two-stage diffusion method for image restoration that applies Mamba to diffusion models. Our method effectively balances the generation capacity of diffusion models with a significant reduction in computational burden, providing an efficient solution for large-scale image restoration tasks without compromising output quality.

  • We introduce an effective DFNN module that regulates the flow of information, enabling each depthwise convolutional layer to adaptively focus on details complemented by other levels and to selectively transmit the most relevant features across layers. This selective information flow significantly improves the quality of the restored image, particularly in preserving fine details and enhancing image sharpness.

Our two-stage Mamba-based diffusion method has been validated through extensive experiments, demonstrating effectiveness in image deraining, denoising, and deblurring tasks.

Related works

Image restoration

Image restoration involves eliminating or reducing noise in images using various algorithms and techniques to recover clarity and detail, thus improving image quality. With the widespread adoption of CNNs, the performance of image denoising algorithms has significantly improved. Zhang et al.12 proposed DnCNN, a feedforward denoising CNN that adopts residual learning and batch normalization to reduce training time and improve denoising performance. Tian et al.13 introduced the cross transformer denoising CNN (CTNet), a novel method that addresses image restoration challenges in complex scenes. Anwar et al.14 proposed RIDNet, the first model to incorporate feature attention into the restoration process. RIDNet employs a modular structure in a single-stage blind image denoising network, using residual structures to capture subtle changes in images and feature attention to exploit channel correlations, thereby improving the restoration effectiveness. Transformers have been widely applied in image restoration due to their ability to learn long-term dependencies and capture global interactions among diverse contextual information. Wang et al.15 proposed Uformer, an image restoration model based on the transformer architecture, which introduces novel locally enhanced transformer blocks and learnable multiscale restoration modulators. Additionally, Zamir et al.16 introduced Restormer, an improved design targeting key modules (multihead attention and feedforward networks) within transformers that effectively captures the relationships between distant pixels. Recently, there has been increasing interest in combining diffusion models with transformers for denoising tasks. Xia et al.17 proposed DiffIR, an effective diffusion-based method for image restoration consisting of a compact image restoration prior extraction network (CPEN), a dynamic image restoration transformer (DIRformer), and a denoising network, providing innovative solutions for image denoising.

Diffusion models

The first diffusion model in image generation is the denoising diffusion probabilistic model (DDPM)3, which was the first to apply the “denoising” diffusion probability model to image generation tasks. It includes two main processes: forward and reverse diffusion. The forward process converts data into noise, whereas the reverse process converts noise into data. U-Net18,19 and the vision transformer (ViT)2 are two common backbone networks used in diffusion models. U-Net incurs high memory consumption, while ViT exhibits effective scalability and multimodal learning; however, its quadratic complexity limits visual token processing. Transformer-based diffusion models20 have recently attracted significant attention. These models rely solely on attention modules and multilayer perceptrons (MLPs), achieving significant scalability in computer vision tasks. However, transformers face efficiency problems when handling long token sequences. Inspired by Mamba, this study proposes using Mamba blocks to design backbone networks to improve computational efficiency and achieve the desired recovery effect.

Mamba

The SSM originates from control theory and describes the dynamic changes in a system through continuous states. Recently, SSMs have been widely introduced in deep learning to address long-range dependency problems. The linear state space layer (LSSL)21 is an early type of SSM known for its capability to handle long-range dependency problems. However, its complexity is relatively high. The structured state space sequence model (S4)9 reduced the complexity of LSSL by normalizing the parameters of the diagonal structure. The S4 model focuses on modeling long-range dependencies and can serve as an alternative to CNNs and transformers. S510 extends S4 with a multiple-input multiple-output (MIMO) SSM and efficient parallel scanning technology that improves model performance. The gated state space layer (GSSL)11 enhances the model’s representative ability by adding gating units to the S4 framework. Mamba, also known as S6, is a new state space model incorporating selective scanning modules, one-dimensional causal convolution, and normalization layers. It outperforms the transformer on large-scale datasets and exhibits linear scalability in sequence length. Its potential for processing large-scale image data, such as image restoration, natural language processing, point clouds, and image generation, is gradually being recognized. These innovations provide new methods for deep learning of complex sequence and large-scale image data, advancing related research.

Methods

This study develops an efficient Diff-Mamba model for high-resolution image restoration. Figure 1 shows the overall flowchart of the first-stage training pipeline (FTP) of the proposed Diff-Mamba-based image restoration, which features a 4-level U-Net architecture with Diff-Mamba modules. This model processes clean and degraded images, performs feature transformations, and estimates noisy images. On the right of Fig. 1 is the main structure of the Diff-Mamba module, which consists of two core parts: the diffusion state space model (DSSM) and the diffusion feedforward neural network (DFNN).

Fig. 1.

Fig. 1

On the left is the overall flow chart of the first-stage training pipeline of the Diff-Mamba-based image restoration. On the right is the Diff-Mamba module.

First-stage training pipeline

Given pairs of a clean image $x_0$ and a degraded image $\tilde{x}$, with $x_0, \tilde{x} \in \mathbb{R}^{H \times W \times 3}$, where $H \times W$ represents the spatial dimensions. The initial step of Diff-Mamba-based image restoration involves adding Gaussian noise with a mean of 0 and variance $\beta_t$ to the clean image $x_0$ at the time step $t$, using the forward diffusion model. The noise variance is defined by a fixed value $\beta_t$ within the interval $(0, 1)$, and the mean is determined by $\beta_t$ and the noise distribution of the data. The single-step diffusion noise addition process from the time step $t-1$ to the time step $t$ is expressed as:

$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)$ (1)

and the final expression for the noise distribution is:

$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right), \quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$ (2)

Thus, the noise sample $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, is concatenated with the degraded image $\tilde{x}$ along the channel dimension, yielding the input to Diff-Mamba. Subsequently, this input is encoded using a $3 \times 3$ convolution, generating the embedded feature $F_0 \in \mathbb{R}^{H \times W \times C}$, where $C$ is the number of channels. $F_0$ is processed through a four-level symmetric encoder-decoder. The time step $t$ is encoded and integrated into the feature $F_0$ and Diff-Mamba, respectively. The encoder-decoder structure gradually encodes and decodes the image features. Specifically, different levels of encoder and decoder are applied at different stages, gradually extracting and recovering image features through downsampling and upsampling. At each decoding level, skip connections concatenate features from the encoder and decoder to assist in information transmission and detail recovery. Finally, a $3 \times 3$ convolution is used for refinement, producing a residual image. This image is then added to the noise sample $x_t$, yielding a noise estimation $\hat{\epsilon}$.
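The forward noising and input construction described above can be sketched as follows. This is a minimal numpy illustration with an illustrative linear beta schedule and toy image sizes, not the authors' training code.

```python
import numpy as np

def make_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule with beta_t in (0, 1); endpoint values are illustrative."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, betas, eps=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)  (Eq. 2)."""
    alpha_bar = np.cumprod(1.0 - betas)
    if eps is None:
        eps = np.random.randn(*x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Build the model input: concatenate the noisy sample with the degraded image along channels.
x0 = np.random.rand(64, 64, 3)          # clean image (toy size)
degraded = np.random.rand(64, 64, 3)    # its degraded counterpart
betas = make_beta_schedule()
xt, eps = q_sample(x0, t=999, betas=betas)
model_input = np.concatenate([xt, degraded], axis=-1)  # H x W x 6 input to the network
```

At the final time step, alpha_bar is near zero, so the sample is dominated by the Gaussian noise, which is what the reverse process learns to remove.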

Second-stage training pipeline

We adopt a two-stage training pipeline. In the first stage, the Diff-Mamba model is trained in a relatively broad, preliminary manner. In the second stage, we initialize the model with the parameters (i.e., weights and biases) obtained from the first stage: the model has already learned the basic patterns during the first stage, and the goal of the second stage is to further optimize it on this foundation. The first-stage training primarily trains Diff-Mamba with a noise-constrained objective, with the specific loss function expressed as:

$\mathcal{L}_{\text{noise}} = \mathbb{E}_{x_0, \tilde{x}, \epsilon, t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, \tilde{x}, t) \right\rVert_1\right]$ (3)

By estimating the noise $\hat{\epsilon}$, a sampling algorithm can be used to generate the final clean image $\hat{x}_0$.

The second-stage training uses data from the first stage to further optimize the model, achieving optimal recovery performance when combined with four-step sampling, as shown in Fig. 2. Restoration quality is improved using the $L_1$ loss and the SSIM loss22 to constrain the restored image against the real image. The loss function is expressed as:

$\mathcal{L}_{\text{stage2}} = (1-\lambda)\left\lVert \hat{x}_0 - x_0 \right\rVert_1 + \lambda\left(1 - \mathrm{SSIM}(\hat{x}_0, x_0)\right)$ (4)

The $L_1$ loss function computes the absolute error between the generated and ground-truth images, while the SSIM loss function measures their structural similarity difference. The parameter $\lambda$ represents a weight, empirically set to 0.84.
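A minimal sketch of this mixed loss, assuming the weighting form (1 − λ)·L1 + λ·(1 − SSIM) with λ = 0.84. The SSIM here is a simplified single-window (whole-image) version for illustration, rather than the standard windowed SSIM used in practice.

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Simplified SSIM computed over the whole image as one window.
    The standard SSIM averages this statistic over local windows."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def stage2_loss(restored, target, lam=0.84):
    """Mixed second-stage loss: (1 - lam) * L1 + lam * (1 - SSIM), lam = 0.84."""
    l1 = np.abs(restored - target).mean()
    return (1 - lam) * l1 + lam * (1 - ssim_global(restored, target))

img = np.random.rand(32, 32)
noisy = np.clip(img + 0.1 * np.random.randn(32, 32), 0, 1)
loss = stage2_loss(noisy, img)   # positive for a degraded estimate, zero for a perfect one
```

The SSIM term rewards structural agreement, which pixel-wise L1 alone does not capture; the 0.84 weight follows the value reported in the paper.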

Fig. 2.

Fig. 2

The second-stage training pipeline, which can generate cleaner images using the data from the first stage training.

Diffusion state space model

The combination of the diffusion model and Mamba is typically achieved through Mamba’s optimization framework. Diffusion models typically generate samples based on the principle of progressively adding noise and denoising, and the reverse denoising process is computationally intensive. Mamba accelerates this process through efficient computation and optimization methods, such as efficient batching and multi-task learning. Mamba’s multi-stage training strategy is applied in the training of diffusion models, where the first stage quickly optimizes the initial parameters, and the second stage tunes the model parameters more finely. The SSM is designed to encode and decode one-dimensional input sequences. The model maps the input sequence $x(t)$ to the hidden state $h(t) \in \mathbb{R}^N$ and derives the predicted output sequence $y(t)$. This process is represented by a linear ordinary differential equation expressed as follows:

$h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t), \qquad y(t) = \mathbf{C}h(t)$ (5)

where $\mathbf{A} \in \mathbb{R}^{N \times N}$ represents the state matrix that defines the evolution of the hidden state. The projection parameters $\mathbf{B} \in \mathbb{R}^{N \times 1}$ and $\mathbf{C} \in \mathbb{R}^{1 \times N}$ define the mapping of the input signal to the hidden state and of the hidden state to the output, respectively. The time-scale parameter $\Delta$, introduced in S4, controls the discretization step size of the continuous system. Using the zero-order hold (ZOH) method, the continuous system parameters $\mathbf{A}$ and $\mathbf{B}$ are converted into discrete system parameters $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$. The specific formulae are expressed as:

$\bar{\mathbf{A}} = \exp(\Delta \mathbf{A}), \qquad \bar{\mathbf{B}} = (\Delta \mathbf{A})^{-1}\left(\exp(\Delta \mathbf{A}) - \mathbf{I}\right)\Delta \mathbf{B}$ (6)

Using the discretized parameters, the hidden state $h_t$ and the output $y_t$ are calculated recursively by:

$h_t = \bar{\mathbf{A}}h_{t-1} + \bar{\mathbf{B}}x_t, \qquad y_t = \mathbf{C}h_t$ (7)

A structured convolution kernel $\bar{\mathbf{K}}$ is constructed for the convolution operation with the input sequence $x$, which is expressed as:

$\bar{\mathbf{K}} = \left(\mathbf{C}\bar{\mathbf{B}},\ \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}},\ \ldots,\ \mathbf{C}\bar{\mathbf{A}}^{M-1}\bar{\mathbf{B}}\right), \qquad y = x * \bar{\mathbf{K}}$ (8)

where $M$ represents the length of the input sequence $x$.
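The discretization in Eq. (6) and the equivalence of the recurrent form (Eq. 7) and the convolutional form (Eq. 8) can be verified numerically. The sketch below uses a toy diagonal state matrix (as in S4/Mamba), so the matrix exponential reduces to an elementwise exponential; all sizes are illustrative.

```python
import numpy as np

# Toy single-channel SSM with a diagonal state matrix (Eqs. 5-8); sizes are illustrative.
N, M = 4, 16                                   # state size, sequence length
rng = np.random.default_rng(0)
a = -rng.uniform(0.5, 1.5, N)                  # diagonal of A (negative -> stable)
B = rng.standard_normal(N)
C = rng.standard_normal(N)
delta = 0.1                                    # time-scale parameter Delta

# ZOH discretization (Eq. 6): with diagonal A, exp(Delta A) is an elementwise exp.
a_bar = np.exp(delta * a)                      # A_bar = exp(Delta A)
b_bar = (a_bar - 1.0) / a * B                  # B_bar = (Delta A)^-1 (exp(Delta A) - I) Delta B

x = rng.standard_normal(M)                     # input sequence

# Recurrent form (Eq. 7): h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t.
h = np.zeros(N)
y_rec = np.empty(M)
for t in range(M):
    h = a_bar * h + b_bar * x[t]
    y_rec[t] = C @ h

# Convolutional form (Eq. 8): K_bar = (C B_bar, C A_bar B_bar, ..., C A_bar^{M-1} B_bar).
K = np.array([C @ (a_bar**k * b_bar) for k in range(M)])
y_conv = np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(M)])
```

Both forms produce the same output sequence; this equivalence is what lets SSMs train in parallel as a convolution and run recurrently at inference.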

Mamba extracts and enhances input features through multidirectional scanning and feature fusion. The S6 block introduces a selective mechanism to improve the accuracy and effectiveness of feature extraction. The two-dimensional selective scanning module consists of three components, as shown in Fig. 3: the scan expanding of Mamba, the SSM (S6) block, and the scan merging of Mamba. The input image is scanned in four directions, converting the two-dimensional image into one-dimensional sequences. Subsequently, the S6 module extracts features from these sequences, ensuring that information is gathered along multiple directions and that diverse features are captured. The scan merging module combines the sequences from the four directions and restores the output image to its original size. The S6 module is illustrated in Fig. 4.
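The scan-expand and scan-merge steps can be sketched as follows. This is a minimal numpy illustration of one plausible reading of the four scan directions (row-major, reversed row-major, column-major, reversed column-major), not the authors' implementation.

```python
import numpy as np

def scan_expand(img):
    """Unfold an H x W feature map into four 1-D sequences:
    row-major, reversed row-major, column-major, reversed column-major."""
    flat = img.reshape(-1)          # row-major traversal
    flat_t = img.T.reshape(-1)      # column-major traversal
    return [flat, flat[::-1], flat_t, flat_t[::-1]]

def scan_merge(seqs, h, w):
    """Invert each scan path back to an H x W map and fuse the four results by summation."""
    s0 = seqs[0].reshape(h, w)
    s1 = seqs[1][::-1].reshape(h, w)
    s2 = seqs[2].reshape(w, h).T
    s3 = seqs[3][::-1].reshape(w, h).T
    return s0 + s1 + s2 + s3

x = np.arange(12, dtype=float).reshape(3, 4)
merged = scan_merge(scan_expand(x), 3, 4)   # identity S6 stand-in, so merge fuses 4 copies
```

In the real module, an S6 block transforms each of the four sequences before merging; here the transform is the identity, so the merge simply sums four copies of the input.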

Fig. 3.

Fig. 3

Two-dimensional selective scanning module. (a) Scan expanding of Mamba. (b) SSM block. (c) Scan merging of Mamba.

Fig. 4.

Fig. 4

State space model (S6) block.

Diffusion feedforward neural network

Our DFNN utilizes a gating mechanism and deep convolution to improve learning performance. The gating mechanism regulates information flow by selectively transmitting information through element-level multiplication. This design allows each convolutional layer to focus on complementing details across each level, enabling more effective learning of local structures for image restoration. The specific structure of DFNN is shown in Fig. 5.

Fig. 5.

Fig. 5

Diffusion feedforward neural network (DFNN).

After the input feature $X$ is normalized and conditioned on the diffusion time step $t$, it is sent to the DFNN for feature transformation, and the result is residually connected to the original input. Given the input feature $X$, the output $\hat{X}$ of the DFNN is expressed as:

$\hat{X} = W_p^0\left[\phi\left(W_d^1 W_p^1\,\mathrm{LN}(X)\right) \odot W_d^2 W_p^2\,\mathrm{LN}(X)\right] + X$ (9)

where $W_p$ denotes a $1 \times 1$ point-wise convolution and $W_d$ a $3 \times 3$ depth-wise convolution. $\phi$ denotes the GELU nonlinear activation function, $\odot$ denotes element-wise multiplication, and LN denotes layer normalization.
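As a concrete illustration, the gated feedforward computation of Eq. (9) can be sketched in numpy as below. The parameter names (wp1, wd1, ...) and the hidden width are illustrative assumptions, and the time-step conditioning applied before normalization in the paper is omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the channel axis (last dim) of an H x W x C feature."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def pointwise(x, w):
    """1x1 convolution == per-pixel channel mixing; w has shape (C_in, C_out)."""
    return x @ w

def depthwise3x3(x, k):
    """3x3 depthwise convolution with zero padding; k has shape (3, 3, C)."""
    h, w_, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + h, j:j + w_] * k[i, j]
    return out

def dfnn(x, p):
    """Gated feedforward (Eq. 9): two parallel pointwise+depthwise paths,
    a GELU-gated elementwise product, a pointwise projection, and a residual add."""
    z = layer_norm(x)
    gate = gelu(depthwise3x3(pointwise(z, p["wp1"]), p["wd1"]))
    value = depthwise3x3(pointwise(z, p["wp2"]), p["wd2"])
    return pointwise(gate * value, p["wp0"]) + x

rng = np.random.default_rng(0)
C, Ch = 8, 16                    # input channels, hidden channels (illustrative)
params = {
    "wp1": rng.standard_normal((C, Ch)) * 0.1,
    "wp2": rng.standard_normal((C, Ch)) * 0.1,
    "wd1": rng.standard_normal((3, 3, Ch)) * 0.1,
    "wd2": rng.standard_normal((3, 3, Ch)) * 0.1,
    "wp0": rng.standard_normal((Ch, C)) * 0.1,
}
x = rng.standard_normal((16, 16, C))
y = dfnn(x, params)              # output keeps the input shape
```

The gate path decides, per pixel and channel, how much of the value path passes through, which is the selective information flow the DFNN is designed to provide.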

Experiments

Experiment settings

We applied Diff-Mamba to three image restoration tasks: image deraining, denoising, and deblurring. All experiments were performed on an NVIDIA RTX 4090 GPU. The model underwent the first-stage training followed by the second-stage training, with the total number of diffusion time steps $T$ set to 1,000. Optimization was performed using the AdamW optimizer, with the initial learning rate gradually reduced through cosine annealing. During the first-stage training, the (patch size, batch size) configuration was updated progressively every 10k iterations. A total of 270k iterations were conducted for rain and noise removal, and 330k iterations for blur removal. In the second-stage training, the (patch size, batch size) configuration was updated every 5k iterations, with 90k iterations for rain and noise reduction and 30k iterations for deblurring. Training took 76 hours for image denoising, 36 hours for image deraining, and 64 hours for image deblurring. Inference can be run in real time during training: in the first stage, PSNR values are recorded and the corresponding checkpoints are saved every 10k iterations; in the second stage, every 5k iterations. Wandb and TensorBoard23 were used for logging and monitoring, respectively. Experimental results for other methods were obtained from pretrained models or official experimental reports.
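Cosine annealing of the learning rate, as used above, can be sketched as follows. The endpoint values lr_max and lr_min are placeholders, since the paper's exact learning rates are not reproduced here.

```python
import math

def cosine_annealing(step, total_steps, lr_max, lr_min):
    """Cosine-annealed learning rate decaying from lr_max at step 0
    to lr_min at total_steps (placeholder endpoint values)."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos

# Example schedule over the 270k first-stage iterations.
total = 270_000
lrs = [cosine_annealing(s, total, lr_max=2e-4, lr_min=1e-6)
       for s in (0, total // 2, total)]   # start, midpoint, end of training
```

The schedule starts at lr_max, passes through the midpoint of the two endpoints halfway through training, and decays smoothly to lr_min.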

Image deraining

The model was trained on the Rain13K dataset16, which consists of 13,712 pairs of clean and rain-affected images. Its robustness and accuracy were evaluated on four datasets: Test10024, Rain100H25, Rain100L25, and Test280026. The peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) were used as evaluation metrics, with higher values indicating closer proximity to the original image. The results indicate that Diff-Mamba generally outperforms most existing methods across multiple datasets, including PReNet7, MSPFN27, MPRNet28, HINet29, SPAIR30, and IR-SDE31, as shown in Table 1. Note that the results of IR-SDE come from retraining with the publicly available code, while IR-SDE* denotes results obtained with the released pretrained model. The learned perceptual image patch similarity (LPIPS) measures the perceptual similarity of images: a higher value indicates a greater perceptual difference between two images, while a lower value indicates that they are perceptually more similar. The results in Table 2 show that Diff-Mamba outperforms IR-SDE on all test datasets with respect to PSNR and SSIM, indicating superior performance in restoring image detail and structure. Table 2 also shows that Diff-Mamba performs better on the LPIPS metric (lower LPIPS value), demonstrating an advantage in perceptual recovery quality. Diff-Mamba not only excels in traditional image quality assessment metrics but also shows higher performance in perceptual similarity, highlighting its effectiveness in the image deraining task. Furthermore, its performance under high rain intensity is comparable to that of the latest methods, with Diff-Mamba exhibiting strong rain removal and image restoration capabilities, particularly in high-rain-intensity scenarios.
Figure 6 shows visual results of our Diff-Mamba method on challenging examples from the Test100 dataset, illustrating its effectiveness in improving image clarity, removing rain, and restoring details obscured by rain.

Table 1.

Image deraining.

Methods Test10024 Rain100H25 Rain100L25 Test280026
PSNR↑ SSIM↑ PSNR↑ SSIM↑ PSNR↑ SSIM↑ PSNR↑ SSIM↑
PReNet7 24.81 0.851 27.77 0.858 32.44 0.950 31.75 0.916
MSPFN27 27.50 0.876 28.66 0.860 32.40 0.933 32.82 0.930
MPRNet28 30.27 0.897 30.41 0.890 36.40 0.965 33.64 0.938
HINet29 30.29 0.906 30.65 0.894 37.28 0.970 33.91 0.941
SPAIR30 30.35 0.909 30.95 0.892 36.93 0.969 33.34 0.936
IR-SDE31 26.74 0.834 20.79 0.699 30.83 0.912 30.42 0.891
IR-SDE*31 – – 31.65 0.904 38.30 0.980 – –
Diff-Mamba (Ours) 30.96 0.912 30.47 0.893 37.07 0.960 33.93 0.939

Significant values are in bold.

Table 2.

Compare our method with IR-SDE31 in image rain removal.

Methods Test10024 Rain100H25 Rain100L25 Test280026
PSNR↑ SSIM↑ LPIPS↓ PSNR↑ SSIM↑ LPIPS↓ PSNR↑ SSIM↑ LPIPS↓ PSNR↑ SSIM↑ LPIPS↓
IR-SDE31 26.74 0.834 0.125 20.79 0.699 0.267 30.83 0.912 0.102 30.42 0.891 0.065
Diff-Mamba (Ours) 30.96 0.912 0.112 30.47 0.893 0.174 37.07 0.960 0.088 33.93 0.939 0.053

Significant values are in bold.

Fig. 6.

Fig. 6

Image deraining on the Test100 dataset. This figure presents a comparison of Diff-Mamba with other advanced methods, demonstrating that Diff-Mamba effectively removes rain and restores details. For optimal perspective, specific areas are shown in close-up. Zooming into these areas highlights Diff-Mamba’s superior performance in removing rain and restoring details.

Image denoising

Our denoising experiments were performed on the SIDD dataset32, which contains 320 pairs of high-resolution images randomly cropped into 30,608 image blocks of size 512 × 512. We evaluated 1,280 image blocks of size 256 × 256 from the SIDD validation set and 1,280 image blocks of size 512 × 512 from the DND dataset33. The DND dataset was evaluated online via the official website: https://noise.visinf.tu-darmstadt.de/. Table 3 shows that Diff-Mamba outperforms most existing methods on the SIDD and DND datasets. On the SIDD dataset, Diff-Mamba surpasses various advanced methods with respect to PSNR and SSIM, coming very close to the best method, NAFNet34; on the DND dataset, our method outperforms NAFNet. The results highlight Diff-Mamba’s strong ability for image denoising, particularly in image clarity and structural restoration, demonstrating its competitiveness among cutting-edge technologies. Compared with other state-of-the-art methods, Diff-Mamba demonstrates exceptional effectiveness in enhancing image quality and preserving details. Figures 7 and 8 show visual comparisons of Diff-Mamba with other methods on the SIDD and DND datasets, demonstrating our method’s good denoising ability while retaining clearer structures.

Table 3.

Image denoising.

Methods SIDD32 DND33
PSNR↑ SSIM↑ PSNR↑ SSIM↑
RIDNet14 38.71 0.914 39.26 0.953
CBDNet35 30.78 0.801 38.06 0.942
MHCNN36 39.06 0.914 39.52 0.951
CDN37 39.36 0.918 39.44 0.951
MIRNet38 39.72 0.959 39.88 0.956
CycleISP39 39.52 0.957 39.56 0.956
MPRNet28 39.71 0.958 39.81 0.954
NAFNet34 40.30 0.961 38.41 0.943
VDN40 39.26 0.955 39.38 0.951
C-BSN-DND41 36.84 0.933 38.60 0.941
SDAP42 37.53 0.936 38.17 0.932
Diff-Mamba (Ours) 39.81 0.960 39.63 0.956

Significant values are in bold.

Fig. 7.

Fig. 7

Comparison of denoising results on SIDD dataset. The comparison highlights the denoising effects of different methods on the same image scene. The images generated by Diff-Mamba exhibit less noise visually while preserving more details.

Fig. 8.

Fig. 8

Comparison of denoising results on DND dataset. For optimal perspective, specific areas are shown in close-up. Zooming into these areas highlights Diff-Mamba’s superior performance in both denoising and detail restoration.

Image deblurring

Our deblurring model was trained on the GoPro dataset43, which contains 2,103 pairs of clear and blurred images, and was tested on the RealBlur-R and RealBlur-J datasets44. Table 4 shows that Diff-Mamba outperforms traditional methods, such as DeblurGAN45 and DeblurGAN-v246, in all metrics, particularly in complex blur scenarios, where it provides greater clarity and visual quality. Additionally, Diff-Mamba is competitive with leading methods such as MPRNet and NAFNet in recent benchmark tests, demonstrating advantages in certain metrics. Table 5 indicates that Diff-Mamba shows overall superior performance in the PSNR, SSIM, and LPIPS metrics, demonstrating a comprehensive improvement in image quality and detail preservation for deblurring tasks. In particular, on the RealBlur-J dataset, Diff-Mamba significantly outperforms IR-SDE in both PSNR and SSIM, highlighting its robust capability in handling complex blur conditions. On the GoPro dataset, Diff-Mamba achieves a PSNR of 31.60 dB and an SSIM of 0.953, significantly higher than DiffUIR’s 29.17 dB and 0.864. Figure 9 illustrates the effectiveness of our method in generating high-quality images compared with existing approaches. Specifically, our method achieves significantly clearer and more realistic results on the RealBlur-R and RealBlur-J test sets. The images produced by our method exhibit a marked improvement in clarity, with finer details becoming more distinct and less obscured by blurriness, contributing to a more accurate and realistic representation of the original scene, as is evident from the side-by-side comparisons presented in the figure.

Table 4.

Image deblurring.

Methods GoPro43 RealBlur-R44 RealBlur-J44
PSNR↑ SSIM↑ PSNR↑ SSIM↑ PSNR↑ SSIM↑
DeblurGAN45 28.70 0.858 33.79 0.903 27.97 0.834
DeblurGAN-v246 29.55 0.934 35.26 0.944 28.70 0.866
DMPHN47 31.20 0.940 35.70 0.948 28.42 0.860
MT-RNN48 31.15 0.945 35.79 0.951 28.44 0.862
DBGAN49 31.10 0.942 33.78 0.909 24.93 0.745
MPRNet28 32.66 0.959 35.99 0.952 28.70 0.873
NAFNet34 33.71 0.967 35.97 0.951 28.31 0.856
IR-SDE31 30.70 0.901 33.96 0.918 24.21 0.729
DiffUIR50 29.17 0.864 – – – –
Diff-Mamba (Ours) 31.60 0.953 35.99 0.953 28.67 0.876

Significant values are in bold.

Table 5.

Compare our Diff-Mamba with IR-SDE31 in image deblurring.

Methods GoPro43 RealBlur-R44 RealBlur-J44
PSNR↑ SSIM↑ LPIPS↓ PSNR↑ SSIM↑ LPIPS↓ PSNR↑ SSIM↑ LPIPS↓
IR-SDE31 30.70 0.901 0.064 33.96 0.918 0.114 24.21 0.729 0.267
Diff-Mamba (Ours) 31.60 0.953 0.076 35.81 0.953 0.076 28.67 0.870 0.165

Significant values are in bold.

Fig. 9.

Fig. 9

Comparison of deblurring in the RealBlur-R dataset. For optimal perspective, specific areas are shown in close-up. By zooming into these areas, Diff-Mamba achieves the desired effect when compared with other advanced methods.

Ablation study

To investigate the proposed image restoration method, various ablation studies were conducted to verify the effectiveness of the proposed components.

Effect of DFNN

To demonstrate the effectiveness of the feedforward network, we trained on the SIDD dataset using only the first-stage training. The testing results are shown in Table 6(a). As can be seen, both PSNR and SSIM improve with the DFNN, demonstrating the effectiveness of incorporating it.

Table 6.

Ablation experiments.

Experiments PSNR↑ SSIM↑ Training time↓ (h)
(a) Without DFNN 39.24 0.951 49.5
With DFNN 39.72 0.958 78.5
(b) Only one stage (270K) 30.45 0.891 30.3
Only one stage (360K) 30.76 0.903 41.2
Two-stage (270K+90K) 30.96 0.912 42.5
(c) Without t embedding 28.65 0.841 31.5
With t embedding 30.45 0.891 30.3

Significant values are in bold.

Effect of two-stage training

To demonstrate the effectiveness of the two-stage training approach, we trained on the Rain13K dataset and evaluated on the Test100 dataset. The first stage was trained for 270K iterations and the second stage for 90K iterations. Table 6(b) shows that two-stage training significantly improves recovery results compared with single-stage training.

Effect of time step embedding

We embedded the time step t into the DFNN and performed only the first-stage training on the Rain13K dataset. To validate its effectiveness, we applied the trained model to the Test100 dataset for image deraining. Table 6(c) indicates that embedding the time step t improves image restoration quality with less training time, demonstrating that it is crucial for the proposed image restoration method.

Effect of sampling step

The diffusion model generates the final image through multiple sampling steps, each step progressively refining the image quality. As shown in Table 7, very few steps (2–3) fail to restore the image adequately, the distortion metrics (PSNR and SSIM) peak at around 4 steps, and further increasing the number of steps improves the perceptual metric (LPIPS), i.e., better visual quality, but at the cost of longer processing time. Choosing the number of sampling steps therefore requires balancing these metrics against runtime. Based on this trade-off, we use 4 sampling steps in the second-stage training, ensuring good perceptual quality without excessively increasing the computational cost.

Table 7.

Sampling step S for image restoration.

S PSNR↑ SSIM↑ LPIPS↓ Time↓ (s)
2 10.98 0.018 1.4163 30
3 25.63 0.876 0.1673 56
4 30.92 0.899 0.1313 104
5 30.43 0.893 0.1273 128
10 30.00 0.883 0.1233 235
15 29.28 0.867 0.1139 681
50 28.97 0.856 0.1091 821

Significant values are in bold.
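The S-step sampling procedure in Table 7 has the shape of a generic reverse-diffusion loop; the sketch below is illustrative only, with `denoise` a hypothetical stand-in for the trained network and a simplified update rule rather than the paper's exact sampler:

```python
def sample(denoise, x_T, steps):
    """Refine an initial noisy input x_T over `steps` reverse-diffusion steps."""
    x = x_T
    # Walk the normalized time axis from t = 1.0 down toward t = 0.
    for i in range(steps, 0, -1):
        t = i / steps
        x = denoise(x, t)  # each call removes part of the remaining degradation
    return x

# Toy denoiser: halves a scalar "residual" at every step.
result = sample(lambda x, t: x * 0.5, x_T=16.0, steps=4)
print(result)  # 1.0
```

Each extra step adds one full network forward pass, which is why runtime in Table 7 grows roughly linearly (and the measured times also include per-run overhead).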

Effect of the weighting parameter λ of the loss function

In the second stage, we set the weighting parameter λ = 0.84 in the mixed loss function. To validate this choice, we conducted a series of experiments to systematically evaluate the impact of different λ values on the model's performance. We started from the pretrained model obtained in the first stage, which provides a good initialization for further training, and varied λ to adjust the weight of the balance term in the loss function. To shorten the runtime, the number of iterations was set to 9000. Specifically, we trained the model with λ ∈ {0.12, 0.24, 0.36, 0.48, 0.60, 0.72, 0.84, 0.96} and observed the impact of each configuration on performance. As shown in Table 8, λ = 0.84 provides the highest PSNR value of 37.02 dB. The other configurations do not significantly change the training time, but their PSNR values are slightly lower and bring no performance gain. Therefore, λ = 0.84 strikes the best balance between improving image quality and maintaining reasonable training time, making it the most effective choice for the second stage of training.

Table 8.

The impact of λ on the second stage.

λ 0.12 0.24 0.36 0.48 0.60 0.72 0.84 0.96
PSNR↑ (dB) 36.87 36.86 37.01 36.69 36.78 36.80 37.02 36.92
Time↓ (min) 50.4 50.3 50.6 50.6 50.8 50.0 50.2 50.4

Significant values are in bold.
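The selection procedure behind Table 8 is a simple grid search over λ; the sketch below is illustrative (`train_and_eval` is a hypothetical stand-in for retraining and measuring validation PSNR, here replaced by a lookup of the table's own values):

```python
def best_lambda(train_and_eval, candidates):
    """Grid search: evaluate every candidate weighting and keep the best PSNR."""
    results = {lam: train_and_eval(lam) for lam in candidates}
    best = max(results, key=results.get)
    return best, results[best]

# Stand-in evaluation using the PSNR values from Table 8.
table8 = {0.12: 36.87, 0.24: 36.86, 0.36: 37.01, 0.48: 36.69,
          0.60: 36.78, 0.72: 36.80, 0.84: 37.02, 0.96: 36.92}
lam, psnr = best_lambda(table8.get, list(table8))
print(lam, psnr)  # 0.84 37.02
```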

Comparison of model size and computational burden

We compared the computational complexity of several image denoising models. As shown in Table 9, our Diff-Mamba model significantly outperforms the other methods in terms of computational burden and model size. Despite having fewer parameters and lower computational complexity, it still delivers the best image restoration quality across various noise levels. In contrast, although CycleISP and MPRNet perform well in restoration quality, they require considerably more computational resources.

Table 9.

The comparison of model size and computational burden.

Methods FLOPs↓ (G) Params↓ (M) PSNR↑
CycleISP39 10.6 23.88 39.49
MPRNet28 13.2 21.63 39.68
Diff-Mamba(ours) 9.3 15.93 39.80

Significant values are in bold.
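Figures like those in Table 9 are typically tallied layer by layer; the back-of-envelope formula for a single convolutional layer is sketched below (illustrative only, not the authors' counting code, and layer sizes are hypothetical):

```python
def conv2d_cost(h, w, c_in, c_out, k):
    """Parameter and FLOP count for one k x k conv layer on an h x w feature map."""
    params = c_out * (c_in * k * k + 1)        # weights + biases
    flops = 2 * h * w * c_in * c_out * k * k   # each multiply-add counted as 2 ops
    return params, flops

# Hypothetical 3x3 conv with 64 channels on a 256x256 feature map.
p, f = conv2d_cost(h=256, w=256, c_in=64, c_out=64, k=3)
print(p, f / 1e9)  # params and GFLOPs for this one layer
```

Summing such per-layer counts over the whole network yields the Params (M) and FLOPs (G) columns above.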

Running time

We compared the running time of our method with SOTA image restoration methods. To ensure a fair comparison, all methods were evaluated on input images of the same resolution using the publicly available code on an NVIDIA RTX 4090 GPU. As shown in Table 10, Diff-Mamba achieves a good balance between processing speed and image restoration quality. Its processing time of 106 s is significantly lower than that of CycleISP (582 s), MPRNet (359 s), and MIRNet (238 s), demonstrating higher computational efficiency than these Transformer-based and CNN-based methods while maintaining excellent restoration performance. In terms of PSNR, Diff-Mamba reaches 30.92 dB, slightly higher than MPRNet (30.91 dB), CycleISP (30.82 dB), and MIRNet (30.76 dB). Despite its lower computational cost, Diff-Mamba thus provides comparable or better restoration quality, proving its efficiency and superiority in practical applications.

Table 10.

Running time of different methods of image restoration.

Methods CycleISP39 MPRNet28 MIRNet38 Diff-Mamba (Ours)
Time↓ (s) 582 359 238 106
PSNR↑ (dB) 30.82 30.91 30.76 30.92

Significant values are in bold.
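Wall-clock comparisons such as Table 10 require warm-up runs and averaging to be meaningful; a minimal timing harness (illustrative only, with a toy workload standing in for a restoration forward pass) looks like this:

```python
import time

def benchmark(fn, warmup=2, repeats=5):
    """Average wall-clock time of fn() over `repeats` runs, after warm-up.

    For GPU models, fn must block until the device finishes (e.g. by calling
    torch.cuda.synchronize) or the measured time will be misleading.
    """
    for _ in range(warmup):
        fn()                       # warm-up runs excluded from timing
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# Toy CPU workload in place of a real model forward pass.
avg = benchmark(lambda: sum(i * i for i in range(10_000)))
print(avg > 0)  # True
```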

Discussions

Our Diff-Mamba implements image deraining, denoising, and deblurring, demonstrating competitive restoration performance on various commonly used datasets. Furthermore, the core principles of our algorithm may be useful for enhancing image quality under low-light conditions, though additional considerations such as noise reduction and detail preservation would need to be addressed. We believe that exploring this in future work could be an interesting direction.

Diffusion models and transformer-based methods typically have high computational costs, especially when handling large-scale data, with computational complexity approaching O(N²). Our Diff-Mamba reduces this complexity to a linear O(N), significantly improving the processing speed. Although our algorithm provides promising results, several limitations still need to be discussed.
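The gap between O(N²) and O(N) can be made concrete with rough operation counts: self-attention forms an N × N similarity matrix over the tokens, whereas an SSM-style scan touches each of the N tokens a constant number of times. The sketch below is illustrative (the constants and the state size are assumptions, not measurements of Diff-Mamba):

```python
def attention_ops(n, d):
    """Rough op count for self-attention over n tokens of width d:
    the QK^T matrix plus the attention-weighted sum of V."""
    return 2 * n * n * d

def ssm_scan_ops(n, d, state=16):
    """Rough op count for a linear SSM scan: one constant-size recurrent
    update per token (state size 16 is an illustrative assumption)."""
    return n * d * state

# The attention/scan ratio grows linearly with sequence length n.
for n in (1_024, 4_096, 16_384):
    print(n, attention_ops(n, d=64) // ssm_scan_ops(n, d=64))
```

Under these assumptions the ratio is n/8, so quadrupling the sequence length quadruples attention's relative cost, which is the practical content of the O(N²) versus O(N) claim.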

Although Diff-Mamba offers advantages in terms of computational complexity and speed, its image restoration accuracy may not be on par with some more complex algorithms. Additionally, Diff-Mamba may struggle to fully recover image details under extreme low-light conditions or heavy fog, particularly in complex scenes where the balance between noise reduction and detail preservation could affect the final results. We plan to further explore and address these issues in future research.

The algorithm has primarily been evaluated on a limited set of benchmark datasets, and further validation is needed on more diverse real-world data to confirm its robustness and generalizability. Although our method may not yield the best results in terms of accuracy, this is primarily due to its trade-off between performance and computational efficiency. The design prioritizes reducing computational complexity, which may limit its ability to achieve the highest performance on certain tasks.

Conclusions

This study proposes a two-stage Mamba-based diffusion model for image restoration, which incorporates two main components: DSSM and DFNN. DSSM combines Mamba's high efficiency with the representative power of diffusion models, improving both inference and training. DFNN regulates the information flow so that each depthwise convolutional layer focuses on image details and learns more effective local structures for restoration. Extensive experiments demonstrate that our Diff-Mamba method is highly competitive in image deraining, denoising, and deblurring tasks. Diff-Mamba exhibits linear complexity in theory, addressing the challenge of quadratic complexity, and demonstrates its superiority across multiple image restoration tasks in practical applications, achieving competitive results on limited computing resources.

Acknowledgements

The authors would like to thank Editage (www.editage.cn) for English language editing.

Author contributions

L.L. and J.W. conceptualized the framework; L.M. conducted the experiments; S.W. acquired the funding; S.W. and J.W. validated the results; L.L. and L.M. drafted the original manuscript, and S.M. reviewed and edited the manuscript. All authors have read and approved the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Hebei Province (No. F2022201013); the Scientific Research Program of Anhui Provincial Ministry of Education (No. 2024AH051686); the Science and Technology Program of Huaibei (No. 2023HK037); the Anhui Shenhua Meat Products Co., Ltd. Cooperation Project (No. 22100084); and the Entrusted Project by Huaibei Mining Group (2023).

Data availability

The data used in this study are available from public links. Rain13K: https://github.com/swz30/Restormer/blob/main/Deraining/download_data.py. Test100, Rain100H, Rain100L, and Test2800: https://github.com/hezhangsprinter/DID-MDN/tree/master. SIDD: https://github.com/AbdoKamel/sidd-ground-truth-image-estimation. DND: https://noise.visinf.tu-darmstadt.de/. GoPro: https://seungjunnah.github.io/Datasets/gopro. RealBlur-R and RealBlur-J: http://cg.postech.ac.kr/research/realblur/.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, 1–21 (Vienna, Austria, 2021).
  • 2.Khan, S. et al. Transformers in vision: A survey. ACM Comput. Surv. 54, 1–41 (2022).
  • 3.Ho, J., Jain, A., Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, 6840–6851 (Vancouver, Canada, 2020).
  • 4.Choi, J., Kim, S., Jeong, Y., Gwon, Y., Yoon, S. ILVR: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 14347–14356 (Montreal, QC, Canada, 2021).
  • 5.Dong, C., Deng, Y., Loy, C., Tang, X. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 576–584 (Santiago, Chile, 2015).
  • 6.Yu, J. et al. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 4470–4479 (Seoul, Korea, 2019).
  • 7.Ren, D., Zuo, W., Hu, Q., Zhu, P., Meng, D. Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3937–3946 (Long Beach, CA, USA, 2019).
  • 8.Gu, A. & Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. Preprint at arXiv:2312.00752 (2023).
  • 9.Gu, A., Goel, K. & Ré, C. Efficiently modeling long sequences with structured state spaces. Preprint at arXiv:2111.00396 (2021).
  • 10.Smith, J. T., Warrington, A. & Linderman, S. W. Simplified state space layers for sequence modeling. Preprint at arXiv:2208.04933 (2022).
  • 11.Mehta, H., Gupta, A., Cutkosky, A. & Neyshabur, B. Long range language modeling via gated state spaces. Preprint at arXiv:2206.13947 (2022).
  • 12.Zhang, K., Zuo, W., Chen, Y., Meng, D. & Zhang, L. Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26, 3142–3155 (2017).
  • 13.Tian, C. et al. A cross transformer for image denoising. Inf. Fusion 102, 102043 (2024).
  • 14.Anwar, S. & Barnes, N. Real image denoising with feature attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 3155–3164 (Seoul, Korea, 2019).
  • 15.Wang, Z. et al. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17683–17693 (New Orleans, LA, USA, 2022).
  • 16.Zamir, S. et al. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5718–5729 (New Orleans, LA, USA, 2022).
  • 17.Xia, B. et al. Diffir: Efficient diffusion model for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 13049–13059 (Paris, France, 2023).
  • 18.Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention – MICCAI, 234–241 (Munich, Germany, 2015).
  • 19.Oktay, O. et al. Attention u-net: Learning where to look for the pancreas. Preprint at arXiv:1804.03999 (2018).
  • 20.Wang, L. et al. Learning a coarse-to-fine diffusion transformer for image restoration. Preprint at arXiv:2308.08730 (2023).
  • 21.Gu, A. et al. Combining recurrent, convolutional, and continuous-time models with linear state space layers. In Proceedings of the 35th International Conference on Neural Information Processing Systems, 572–585 (Virtual, 2021).
  • 22.Zhao, H., Gallo, O., Frosio, I. & Kautz, J. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 3, 47–57 (2016).
  • 23.Abadi, M. et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. Preprint at arXiv:1603.04467 (2016).
  • 24.Zhang, H., Sindagi, V. & Patel, V. Image de-raining using a conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 30, 3943–3956 (2020).
  • 25.Yang, W. et al. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1685–1694 (Honolulu, HI, USA, 2017).
  • 26.Zhang, H. & Patel, V. Density-aware single image de-raining using a multi-stream dense network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 695–704 (Salt Lake City, UT, USA, 2018).
  • 27.Jiang, K. et al. Multi-scale progressive fusion network for single image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8343–8352 (Seattle, WA, USA, 2020).
  • 28.Zamir, S. et al. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14816–14826 (Nashville, TN, USA, 2021).
  • 29.Chen, L., Lu, X., Zhang, J., Chu, X. & Chen, C. Hinet: Half instance normalization network for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 182–192 (Nashville, TN, USA, 2021).
  • 30.Purohit, K., Suin, M., Rajagopalan, A. & Boddeti, V. Spatially-adaptive image restoration using distortion-guided networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2289–2299 (Montreal, QC, Canada, 2021).
  • 31.Luo, Z., Gustafsson, F., Zhao, Z., Sjölund, J. & Schon, T. Image restoration with mean-reverting stochastic differential equations. In Proceedings of the 40th International Conference on Machine Learning, 23045–23066 (Honolulu, HI, USA, 2023).
  • 32.Abdelhamed, A., Lin, S. & Brown, M. A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1692–1700 (Salt Lake City, UT, USA, 2018).
  • 33.Plötz, T. & Roth, S. Benchmarking denoising algorithms with real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2750–2759 (Honolulu, HI, USA, 2017).
  • 34.Chen, L., Chu, X., Zhang, X. & Sun, J. Simple baselines for image restoration. In Proceedings of the European Conference on Computer Vision, 17–33 (Tel Aviv, Israel, 2022).
  • 35.Guo, S., Yan, Z., Zhang, K., Zuo, W. & Zhang, L. Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1712–1722 (Long Beach, CA, USA, 2019).
  • 36.Zhang, J., Qu, M., Wang, Y. et al. A multi-head convolutional neural network with multi-path attention improves image denoising. In Pacific Rim International Conference on Artificial Intelligence, 338–351 (Springer Nature Switzerland, Cham, 2022).
  • 37.Zhang, J. et al. Considering image information and self-similarity: A compositional denoising network. Sensors 23, 5915 (2023).
  • 38.Zamir, S. et al. Learning enriched features for real image restoration and enhancement. In Proceedings of the European Conference on Computer Vision, 492–511 (Glasgow, UK, 2020).
  • 39.Zamir, S. et al. Cycleisp: Real image restoration via improved data synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2693–2702 (Seattle, WA, USA, 2020).
  • 40.Yue, Z., Yong, H., Zhao, Q., Meng, D. & Zhang, L. Variational denoising network: Toward blind noise modeling and removal. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 1690–1701 (Vancouver, Canada, 2019).
  • 41.Jang, Y., Lee, K., Park, G., Kim, S. & Cho, N. Self-supervised image denoising with downsampled invariance loss and conditional blind-spot network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 12162–12171 (Paris, France, 2023).
  • 42.Pan, Y., Liu, X., Liao, X., Cao, Y. & Ren, C. Random sub-samples generation for self-supervised real image denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 12116–12125 (Paris, France, 2023).
  • 43.Nah, S., Kim, TH. & Lee, KM. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 257–265 (Honolulu, HI, USA, 2017).
  • 44.Rim, J., Lee, H., Won, J. & Cho, S. Real-world blur dataset for learning and benchmarking deblurring algorithms. In Proceedings of the European Conference on Computer Vision, 184–201 (Glasgow, UK, 2020).
  • 45.Kupyn, O., Budzan, V., Mykhailych, M., Mishkin, D. & Matas, J. Deblurgan: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8183–8192 (Salt Lake City, UT, USA, 2018).
  • 46.Kupyn, O., Martyniuk, T., Wu, J. & Wang, Z. Deblurgan-v2: Deblurring (orders-of-magnitude) faster and better. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 8877–8886 (Seoul, Korea, 2019).
  • 47.Zhang, H., Dai, Y., Li, H. & Koniusz, P. Deep stacked hierarchical multi-patch network for image deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5971–5979 (Long Beach, CA, USA, 2019).
  • 48.Park, D., Kang, DU., Kim, J. & Chun, S. Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. In Proceedings of the European Conference on Computer Vision, 327–343 (Glasgow, UK, 2020).
  • 49.Zhang, K. et al. Deblurring by realistic blurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2734–2743 (Seattle, WA, USA, 2020).
  • 50.Zheng, D. et al. Selective hourglass mapping for universal image restoration based on diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 25445–25455 (2024).


