Abstract
Diffusion models (DMs) have achieved remarkable progress in generative modelling, particularly in enhancing image quality to conform to human preferences. Recently, these models have also been applied to low-level computer vision for photo-realistic image restoration (IR) in tasks such as image denoising, deblurring and dehazing. In this review, we introduce key constructions in DMs and survey contemporary techniques that make use of DMs in solving general IR tasks. We also point out the main challenges and limitations of existing diffusion-based IR frameworks and provide potential directions for future work.
This article is part of the theme issue ‘Generative modelling meets Bayesian inference: a new paradigm for inverse problems’.
Keywords: diffusion models, image restoration, generative models, inverse problems
1. Introduction
Image restoration (IR) is a long-standing and challenging research topic in computer vision, which generally has two high-level aims: (i) recover high-quality (HQ) images from their degraded low-quality (LQ) counterparts, and (ii) eliminate undesired objects from specific scenes. The former includes tasks like image denoising [1] and deblurring [2], while the latter contains tasks like rain/haze/snow removal [3] and shadow removal [4]. Figure 1 showcases examples of these applications. To solve different IR problems, traditional methods require task-specific knowledge to model the degradation and perform restoration in the spatial or frequency domain, by combining classical signal processing algorithms [6–8] with specific image-degradation parameters [9]. More recently, numerous efforts have been made to train deep learning models on collected datasets to improve performance on different IR tasks [10]. Most of them directly train neural networks on sets of paired LQ–HQ images with a reconstruction objective (e.g. $\ell_1$ or $\ell_2$ distances) as typical in supervised learning. While effective, this approach tends to produce over-smooth results, particularly in textures [11]. Although this issue can be alleviated by including adversarial or perceptual losses [12], the training then typically becomes unstable and the results often contain undesired artefacts or are inconsistent with the input LQ images [11].
Figure 1.
Generally, there are two types of IR tasks: (1) Recover images from their degraded versions and (2) eliminate undesired objects from specific scenes. Here, all top rows are LQ input images and the bottom rows are the corresponding HQ images generated by a diffusion-based IR model [5]. As observed, applying DMs for IR can produce photo-realistic results in line with human perceptual preferences.
Recently, generative diffusion models (DMs) [13] have drawn increasing attention due to their stable training process and remarkable performance in producing realistic images and videos [14]. Inspired by them, numerous works have incorporated the diffusion process into various IR problems to obtain high-perceptual/photo-realistic results [15–18]. However, these methods exhibit considerable diversity and complexity across various domains and IR tasks, obscuring the shared foundations that are key to understanding and improving diffusion-based IR approaches. In light of this, our paper reviews the key concepts in DMs and then surveys trending techniques for applying them to IR tasks. More specifically, the fundamentals of DMs are introduced in §2, in which we further elucidate the score-based stochastic differential equations (Score-SDEs) and then show the connections between denoising diffusion probabilistic models (DDPMs) and Score-SDEs. In addition, the conditional diffusion models (CDMs) are elaborated such that we can learn to guide the image generation, which is key in adapting DMs for general IR tasks. Several diffusion-based IR frameworks are then summarized in §3. In particular, we can leverage CDMs for IR from different perspectives including DDPMs, Score-SDE and their connections. The connection even yields a training-free approach for non-blind IR, i.e. for tasks with known degradation parameters. Finally, we conclude the paper with a discussion of the remaining challenges and potential future work in §4.
2. Generative modelling with DMs
Generative DMs are a family of probabilistic models employing an iterative process (e.g. Markov chains) to transform the data distribution into a reference distribution. In the following, §2a describes a typical formulation of DMs: the DDPMs [13,19], followed by §2b which generalizes this to Score-SDEs for a more detailed analysis of the diffusion/reverse process. Finally, in §2c, we further show how to guide DMs for conditional generation, which is a key enabling technique for diffusion-based IR.
(a). DDPMs
Given a variable $x_0$ sampled from a data distribution $q(x_0)$, DDPMs [13,19] are latent variable models consisting of two Markov chains: a forward/diffusion process $q(x_t \mid x_{t-1})$ and a reverse process $p_\theta(x_{t-1} \mid x_t)$. The forward process transfers $x_0$ to a Gaussian distribution by sequentially injecting noise. For simplicity, we set $q(x_0) = p_{\mathrm{data}}(x_0)$ such that the forward process starts from the data distribution. Then the reverse process learns to generate new data samples starting from the Gaussian noise. An overview of the DDPM is shown in figure 2. Below, we explain the forward and backward processes, and provide details on how the DDPMs are trained.
Figure 2.
DDPMs. The forward path transfers data to Gaussian noise, and the reverse path learns to generate data from noise along the actual time reversal of the forward process. Here, the reverse transition distribution $p_\theta(x_{t-1} \mid x_t)$ represents the model we aim to learn, and the conditional posterior $q(x_{t-1} \mid x_t, x_0)$ is a tractable Gaussian which serves as the target distribution the model wants to match as the term $L_{t-1}$ in equation (2.7).
(i). Forward diffusion process
The forward process perturbs data samples $x_0$ to noise $x_T$. It can be characterized by a joint distribution encompassing all intermediate states, represented in the form

$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}).$  (2.1)

Here, the transition kernel $q(x_t \mid x_{t-1})$ is a handcrafted Gaussian given by

$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big),$  (2.2)

where $\{\beta_t \in (0,1)\}_{t=1}^{T}$ is the variance schedule: a set of pre-defined hyper-parameters that ensure the forward process (approximately) converges to a Gaussian distribution. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, equation (2.2) then allows us to marginalize the joint distribution of equation (2.1) to

$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big).$  (2.3)

We usually set $T$ large enough that $\bar{\alpha}_T \approx 0$ and the terminal distribution $q(x_T)$ is thus a standard Gaussian, which allows us to generate new data points by reversing the diffusion process starting from sampled Gaussian noise. Moreover, it is important to note that posteriors along the forward process are tractable when conditioned on the original data sample $x_0$, i.e. $q(x_{t-1} \mid x_t, x_0)$ is a tractable Gaussian [19]. This tractability enables the derivation of the DDPM training objective, which we will describe in §2a(iii).
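As a concrete illustration, the closed-form marginal of equation (2.3) can be sampled in a few lines. Below is a minimal NumPy sketch; the function names and the linear schedule are our own illustrative choices, not taken from a specific codebase:

```python
import numpy as np

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear variance schedule {beta_t}, with alpha_t = 1 - beta_t and
    alpha_bar_t = prod_{s<=t} alpha_s as defined below equation (2.2)."""
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form (equation (2.3))."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps
```

With $T = 1000$ and this schedule, $\bar{\alpha}_T$ is roughly $4 \times 10^{-5}$, so $x_T$ is statistically indistinguishable from standard Gaussian noise, as required.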
(ii). Reverse process
In contrast, the reverse process learns to match the actual time reversal of the forward process, which is also a joint distribution, modelled by $p_\theta$ as follows:

$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t).$  (2.4)

In DDPMs, the transition kernel $p_\theta(x_{t-1} \mid x_t)$ is defined as a learnable Gaussian:

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big),$  (2.5)

where $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ are the parameterized mean and variance, respectively. Learning the latent variable model of equation (2.5) is key to DDPMs since it substantially affects the quality of data sampling. That is, we have to adjust the parameters $\theta$ until the final sampled variable $x_0$ is close to that sampled from the real data distribution.
(iii). Training objective
To learn the reverse process, we usually minimize the variational bound on the negative log-likelihood, which introduces the forward joint distribution of equation (2.1) in the objective as

$\mathbb{E}\big[-\log p_\theta(x_0)\big] \le \mathbb{E}_q\Big[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\Big] =: L.$  (2.6)

Here, $p(x_T)$ is a standard Gaussian, $p_\theta(x_{t-1} \mid x_t)$ is the reverse transition kernel in equation (2.5) that we want to learn, and $q(x_t \mid x_{t-1})$ is the forward transition kernel of equation (2.2). This objective can be further rewritten according to

$L = \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\big\|\,p(x_T)\big)}_{L_T} + \sum_{t > 1} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\big\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0 \mid x_1)}_{L_0}\Big],$  (2.7)

where $L_T$ is called the prior matching term and contains no learnable parameters, $L_{t-1}$ is the posterior matching term and $L_0$ the data reconstruction term that maximizes the likelihood of $x_0$. Sohl-Dickstein et al. [19] prove that the conditional posterior distribution in $L_{t-1}$ is a tractable Gaussian: $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t\mathbf{I}\big)$, where the mean and variance are

$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,x_t \quad\text{and}\quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t.$  (2.8)

All terms in $\tilde{\beta}_t$ are known and thus the posterior variance in equation (2.5) can be non-parametric, i.e. $\Sigma_\theta(x_t, t) = \tilde{\beta}_t\mathbf{I}$, which does not depend on $\theta$ and allows us to only focus on learning the posterior mean $\mu_\theta(x_t, t)$. Specifically, applying the reparameterization trick to $q(x_t \mid x_0)$ of equation (2.3) gives an estimate of the initial state: $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon\big)$, which can be substituted into equation (2.8) to obtain: $\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon\big)$. The noise $\epsilon$ then can be learned using a neural network $\epsilon_\theta(x_t, t)$, and the parameterized distribution mean can be rewritten as

$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big).$  (2.9)

The transition kernel of equation (2.5) is finally updated according to the following:

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \tilde{\beta}_t\mathbf{I}\big),$  (2.10)

where the variance is predefined as $\tilde{\beta}_t$ in equation (2.8). Note that now $p_\theta(x_{t-1} \mid x_t)$ matches the form of $q(x_{t-1} \mid x_t, x_0)$, to minimize the KL term of $L_{t-1}$ in equation (2.7). Also note that DDPMs only need to learn the noise network $\epsilon_\theta$, for which it is common to use a U-Net architecture with several self-attention layers [13]. The noise network takes an image $x_t$ and a time $t$ as input, and outputs a noise image of the same shape as $x_t$. More specifically, the scalar time $t$ is encoded into vectors similar to positional embedding [20] and is combined with $x_t$ in the feature space for time-varying noise prediction.
Simplified objective
We now have known expressions for all components of the objective in equation (2.7). However, its current form is not ideal to use for model training since it requires all terms $L_{t-1}$ to be computed at every timestep of the entire diffusion process, which is time-consuming and impractical. Fortunately, the prior matching term $L_T$ can be ignored since it does not contain any parameters. By substituting equations (2.8) and (2.9) into equation (2.7), we also find that the final expanded version of the posterior matching term ($L_{t-1}$) and the data reconstruction term $L_0$ have similar forms,

$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\Big[\frac{\beta_t^2}{2\,\tilde{\beta}_t\,\alpha_t\,(1 - \bar{\alpha}_t)}\,\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2\Big],$  (2.11)

where $x_t$ denotes $\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$. By ignoring the weights outside the expectations in equation (2.11), the final training objective can therefore be obtained according to the following [13]:

$L_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\Big],$  (2.12)

which essentially learns to match the predicted and real added noise for each training sample and thus is also called the noise matching loss. Compared to the original objective in equation (2.7), $L_{\mathrm{simple}}$ is a re-weighted version that puts more focus on larger timesteps $t$, which empirically has been shown to improve the training [13]. Once trained, the noise prediction network $\epsilon_\theta$ can be used to generate new data by running equation (2.10) starting from $x_T \sim \mathcal{N}(0, \mathbf{I})$, i.e. by iterating

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big) + \sqrt{\tilde{\beta}_t}\,z, \quad z \sim \mathcal{N}(0, \mathbf{I}),$  (2.13)

as a parameterized data sampling process, similar to that in Langevin dynamics [21].
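Both the simplified objective (2.12) and the sampler (2.13) are short enough to sketch directly. The following NumPy skeleton is illustrative only; `eps_model(x, t)` stands in for a trained noise network and is passed as a plain callable:

```python
import numpy as np

def noise_matching_loss(eps_model, x0, alpha_bars, rng):
    """One Monte Carlo estimate of L_simple in equation (2.12)."""
    t = int(rng.integers(0, len(alpha_bars)))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def ddpm_sample(eps_model, shape, betas, alpha_bars, rng):
    """Ancestral sampling: iterate equation (2.13) from x_T ~ N(0, I)."""
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        alpha_t = 1.0 - betas[t]
        abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        beta_tilde = (1.0 - abar_prev) / (1.0 - alpha_bars[t]) * betas[t]
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_model(x, t)) / np.sqrt(alpha_t)
        x = x + np.sqrt(beta_tilde) * z
    return x
```

Note that the noise injection $z$ is switched off at the last step $t = 0$, a common practical convention.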
(b). Data perturbation and sampling with SDEs
We have shown how DDPM works for data perturbation and data generation. We can further generalize the DDPM to stochastic differential equations, namely, Score-SDE [22], where both the forward and reverse processes are in continuous-time state spaces. This generalization offers a deeper insight into the mathematics behind DMs that underlies the success of diffusion-based generative modelling. Figure 3 shows an overview of the Score-SDE approach.
Figure 3.
Data perturbation and sampling with SDEs. In contrast to DDPMs, the Score-SDE continuously perturbs the data to Gaussian noise using a forward SDE, $\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w$, and then generates new samples by estimating the score $\nabla_x \log p_t(x)$ and simulating the corresponding reverse-time SDE.
(i). Data perturbation with forward SDEs
Here, we construct variables $\{x(t)\}_{t \in [0, T]}$ for data perturbation in continuous time, which can be modelled as a forward SDE defined by

$\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w,$  (2.14)

where $f(x, t)$ and $g(t)$ are called the drift and diffusion functions, respectively, and $w$ is a standard Wiener process (also known as Brownian motion). We use $p_t(x)$ to denote the marginal probability density of $x(t)$, and use $p_{0t}(x(t) \mid x(0))$ to denote the transition kernel from $x(0)$ to $x(t)$. Moreover, we always design the SDE to drift to a fixed prior distribution $p_T$ (e.g. standard Gaussian), ensuring that $p_T$ becomes independent of $p_0$ and can be sampled individually.
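Equation (2.14) can be simulated with a plain Euler–Maruyama discretization. The sketch below uses a VP-style drift with a linear $\beta(t)$; both choices are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def euler_maruyama(x0, f, g, T=1.0, n_steps=1000, rng=None):
    """Integrate dx = f(x, t) dt + g(t) dw forward in time (equation (2.14))."""
    rng = np.random.default_rng(0) if rng is None else rng
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Illustrative VP-style coefficients (beta interpolates between 0.1 and 20).
beta = lambda t: 0.1 + (20.0 - 0.1) * t
f = lambda x, t: -0.5 * beta(t) * x   # drift towards zero
g = lambda t: np.sqrt(beta(t))        # diffusion coefficient
```

Starting from any $x(0)$, the terminal sample under these coefficients is close to $\mathcal{N}(0, \mathbf{I})$, i.e. the desired fixed prior.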
(ii). Sampling with reverse-time SDEs
We can sample noise $x(T) \sim p_T$ and reverse the forward SDE to generate new data close to that sampled from the real data distribution. Note that reversing equation (2.14) yields another diffusion process, i.e. a reverse-time SDE [23] in the form

$\mathrm{d}x = \big[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},$  (2.15)

where $\bar{w}$ is a reverse-time Wiener process and $\nabla_x \log p_t(x)$ is called the score (or score function). The score is the vector field of $\log p_t(x)$, pointing in the directions in which the probability density function has the largest growth rate [21]. Once the score is known for all time $t$, simulating equation (2.15) backwards in time allows us to sample new data from noise.

Earlier work such as the score-based generative models [21] often learn the score using score matching [24]. However, score matching is computationally costly and only works for discrete times. Song et al. [22] propose a continuous-time version that optimizes the following:

$\theta^{*} = \arg\min_\theta\ \mathbb{E}_t\Big\{\lambda(t)\,\mathbb{E}_{x(0)}\,\mathbb{E}_{x(t) \mid x(0)}\Big[\big\|s_\theta(x(t), t) - \nabla_{x(t)} \log p_{0t}\big(x(t) \mid x(0)\big)\big\|_2^2\Big]\Big\},$  (2.16)

where $t$ is uniformly sampled over $[0, T]$, $\lambda(t)$ is a positive weighting function, $x(0) \sim p_0(x)$, $x(t) \sim p_{0t}(x(t) \mid x(0))$, and $s_\theta(x, t)$ represents the score prediction network. This objective ensures that the optimal score network, denoted $s_{\theta^{*}}$, from equation (2.16) satisfies $s_{\theta^{*}}(x, t) = \nabla_x \log p_t(x)$ almost surely [22,25].
(iii). Interpreting DDPM with the variance preserving SDE
Notably, extending DDPM to an infinite number of timesteps (i.e. continuous time) leads to a special SDE which gives a more reliable interpretation of the diffusion process, and allows us to optimize the sampling with more efficient SDE/ordinary differential equation (ODE) solvers [22,26]. Specifically, recall the DDPM perturbation kernel of equation (2.2) and write it in the form

$x_t = \sqrt{1 - \beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_{t-1}, \quad \epsilon_{t-1} \sim \mathcal{N}(0, \mathbf{I}),$  (2.17)

where $t \in \{1, \ldots, T\}$ is the discrete timestep. Let us define an auxiliary set $\{\bar{\beta}_t = T\beta_t\}_{t=1}^{T}$ and obtain

$x_t = \sqrt{1 - \frac{\bar{\beta}_t}{T}}\,x_{t-1} + \sqrt{\frac{\bar{\beta}_t}{T}}\,\epsilon_{t-1}.$  (2.18)

As a preparation to convert functions from discrete-time to continuous-time, let $\bar{\beta}(\frac{t}{T}) = \bar{\beta}_t$, $x(\frac{t}{T}) = x_t$ and $\epsilon(\frac{t}{T}) = \epsilon_t$. We can now rewrite equation (2.18) with the difference $\Delta t = \frac{1}{T}$ and time $t \in \{0, \frac{1}{T}, \ldots, \frac{T-1}{T}\}$ as follows:

$x(t + \Delta t) = \sqrt{1 - \bar{\beta}(t + \Delta t)\,\Delta t}\;x(t) + \sqrt{\bar{\beta}(t + \Delta t)\,\Delta t}\;\epsilon(t)$  (2.19)
$\qquad\qquad\ \approx x(t) - \tfrac{1}{2}\,\bar{\beta}(t + \Delta t)\,\Delta t\;x(t) + \sqrt{\bar{\beta}(t + \Delta t)\,\Delta t}\;\epsilon(t)$  (2.20)
$\qquad\qquad\ \approx x(t) - \tfrac{1}{2}\,\bar{\beta}(t)\,\Delta t\;x(t) + \sqrt{\bar{\beta}(t)\,\Delta t}\;\epsilon(t),$  (2.21)

where the two approximate equalities hold when $\Delta t \to 0$. Then we convert $\Delta t$ to $\mathrm{d}t$, $\sqrt{\Delta t}\,\epsilon(t)$ to $\mathrm{d}w$ and obtain the following: $\mathrm{d}x = -\frac{1}{2}\,\bar{\beta}(t)\,x\,\mathrm{d}t + \sqrt{\bar{\beta}(t)}\,\mathrm{d}w$, which is a typical mean-reverting SDE (also known as the Ornstein–Uhlenbeck process [27]) that drifts towards a stationary distribution, i.e. a standard Gaussian in this case. Song et al. [22] also name it the variance preserving SDE (VP-SDE) and further illustrate that DDPM’s marginal distribution in equation (2.3) is a solution to the VP-SDE. Therefore, we can use either the diffusion reverse process (equation 2.13) or the reverse-time SDE (equation 2.15) to sample new data from noise with the same trained DDPM. In addition, the score can be directly computed from the marginal distribution in equation (2.3),

$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\,x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}},$  (2.22)

where $\epsilon$ is from the reparameterization trick $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$ and can be approximated using the noise prediction network $\epsilon_\theta(x_t, t)$. Equation (2.22) thus shows how we convert the DM to an SDE (i.e. obtain the score from the noise $\epsilon_\theta$). Then, numerous efficient SDE/ODE solvers can be used to optimize DMs, further bringing interpretability and faster sampling [22].
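Equation (2.22) is what lets a trained DDPM double as a score model. A minimal sketch of the conversion, plus one backward Euler–Maruyama step of equation (2.15) with the VP drift from above (helper names are our own):

```python
import numpy as np

def score_from_eps(eps_pred, alpha_bar_t):
    """Equation (2.22): score = -eps / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / np.sqrt(1.0 - alpha_bar_t)

def reverse_vp_step(x, t, dt, beta, score, rng):
    """One backward Euler-Maruyama step of equation (2.15) with the
    VP drift f(x, t) = -beta(t) x / 2 and g(t) = sqrt(beta(t))."""
    drift = -0.5 * beta(t) * x - beta(t) * score(x, t)
    z = rng.standard_normal(x.shape)
    return x - drift * dt + np.sqrt(beta(t) * dt) * z
```

As a sanity check, if the data distribution is itself $\mathcal{N}(0, \mathbf{I})$, the true score is $-x$, and iterating `reverse_vp_step` keeps samples distributed close to $\mathcal{N}(0, \mathbf{I})$.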
(c). CDMs
So far, we have learned how to sample data from different types of DMs. However, all of the above methods are concerned with unconditional generation, which is insufficient for IR where we want to sample HQ images conditioned on degraded LQ images. Therefore, we present the CDM below.
Let us keep the diffusion process of equation (2.1) unchanged and reconstruct the reverse process in equation (2.4) with a condition $y$, i.e. $p_\theta(x_{0:T} \mid y)$. The conditional reverse kernel can then be modelled as

$p_\theta(x_{t-1} \mid x_t, y) = Z\,p_\theta(x_{t-1} \mid x_t)\,p_\phi(y \mid x_{t-1}),$  (2.23)

where $p_\phi(y \mid x_t)$ is an additional network that predicts $y$ from $x_t$, and $Z$ can be treated as a constant since it does not depend on $x_{t-1}$. This equation yields an adjusted mean $\hat{\mu}_\theta(x_t, t)$ for the posterior distribution of equation (2.10), given by [28]

$\hat{\mu}_\theta(x_t, t) = \mu_\theta(x_t, t) + s\,\tilde{\beta}_t\,\nabla_{x_t} \log p_\phi(y \mid x_t),$  (2.24)

where $s$ is the gradient scale (also called the guidance scale). Equation (2.22) further gives the score of the joint distribution $p(x_t, y)$:

$\nabla_{x_t} \log p(x_t, y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}} + \nabla_{x_t} \log p_\phi(y \mid x_t),$  (2.25)–(2.26)

which provides a conditional noise predictor $\hat{\epsilon}_\theta$ with the following form [28]

$\hat{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t) - s\,\sqrt{1 - \bar{\alpha}_t}\;\nabla_{x_t} \log p_\phi(y \mid x_t).$  (2.27)

The conditional sampling is performed as a regular DDPM by substituting this new noise predictor into the posterior mean of equation (2.9). The gradient scale $s$ controls the performance trade-off between image quality and fidelity, i.e. lower $s$ produces photo-realistic results, and higher $s$ yields better consistency with the condition.
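The guidance rule in equation (2.27) is essentially a one-liner. In the sketch below, the classifier gradient is passed in as a plain array since the network itself is out of scope here:

```python
import numpy as np

def guided_eps(eps_pred, grad_log_p_y, alpha_bar_t, s=1.0):
    """Conditional noise predictor of equation (2.27):
    eps_hat = eps - s * sqrt(1 - alpha_bar_t) * grad_x log p(y | x_t).
    Setting s = 0 recovers the unconditional predictor."""
    return eps_pred - s * np.sqrt(1.0 - alpha_bar_t) * grad_log_p_y
```

Sampling then proceeds exactly as in equation (2.13), with `guided_eps` substituted for the unconditional prediction.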
(i). Conditional SDE
Similar to guided diffusion, we can also change the score function to control the reverse-time SDE conditioned on the variable $y$, i.e. by replacing $\nabla_x \log p_t(x)$ with $\nabla_x \log p_t(x \mid y)$ in equation (2.15). Since $p_t(x \mid y) \propto p_t(x)\,p_t(y \mid x)$, the conditional score can be decomposed as

$\nabla_x \log p_t(x \mid y) = \nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x),$  (2.28)

which means that we can simulate the following reverse-time SDE for conditional generation:

$\mathrm{d}x = \big[f(x, t) - g(t)^2\big(\nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x)\big)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},$  (2.29)

where $\nabla_x \log p_t(x) \approx s_\theta(x, t)$. Song et al. [22] show that we can use a separate network to learn $p_t(y \mid x)$ (e.g. a time-dependent classifier if $y$ represents class labels), or estimate its log gradient directly with heuristics and domain knowledge.
With these CDMs, we can sample images with specified labels (such as dog and cat) or, as the main topic of this paper, recover clean HQ images from corrupted LQ inputs.
3. DMs for IR
Diffusion-based IR can be considered a special case of CDMs with image conditioning. We first introduce the concept of image degradation, which is a process that transforms an HQ image $x$ into an LQ image $y$ characterized by undesired corruptions. The general image degradation process can be modelled as follows:

$y = \mathcal{H}(x) + n,$  (3.1)

where $\mathcal{H}$ denotes the degradation function and $n$ is additive noise. As the examples show in figure 1, degradation can manifest itself in various forms such as noise, blur, rain and haze. IR then aims to reverse this process to obtain a clean HQ image $x$ from the corrupted LQ counterpart $y$.
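For concreteness, here is a toy instance of equation (3.1), with a 2×2 average-pooling operator standing in for $\mathcal{H}$ plus Gaussian noise; both choices are purely illustrative:

```python
import numpy as np

def avg_downsample(x, k=2):
    """A simple linear degradation H: k-by-k average pooling (downsampling)."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def degrade(x, H, sigma, rng):
    """Equation (3.1): y = H(x) + n, with n ~ N(0, sigma^2 I)."""
    y_clean = H(x)
    return y_clean + sigma * rng.standard_normal(y_clean.shape)
```

In the non-blind setting both `H` and `sigma` are assumed known; in the blind setting neither is available and must be learned implicitly from paired data.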
IR is further decomposed into two distinct settings, blind and non-blind IR, depending on whether or not the degradation parameters $\mathcal{H}$ and $n$ in equation (3.1) are known. Blind IR is the most general setting, in which no explicit knowledge of the degradation process is assumed. Blind IR methods instead utilize datasets of paired LQ–HQ images for supervised training of models. Non-blind IR methods, in contrast, assume access to $\mathcal{H}$ and $n$. This is an unrealistic assumption for many important real-world IR tasks, and thus limits non-blind methods to a subset of specific IR tasks such as bicubic downsampling, Gaussian blurring, colourization, or inpainting with a fixed mask. In the following, we first describe the most straightforward diffusion-based approach for general blind IR tasks in §3a. Representative non-blind diffusion-based approaches are then covered in §3b. Finally, §3c covers more recent methods for general blind IR.
(a). Conditional direct DM
The most straightforward approach for applying DMs to general IR tasks is to use the CDM with image guidance from §2c. In the IR context, the term $p_\phi(y \mid x_t)$ in equation (2.27) represents the image degradation model, which can be either a fixed operator with known parameters or a learnable neural network, depending on the task. It is also noted that strong guidance (large $s$ in equation 2.27) leads to good fidelity but visually LQ results (e.g. over-smooth images), while weak guidance (small $s$) has the opposite effect [28]. Now, let us consider the extreme case: how about decreasing $s$ to zero, i.e. no guidance? A simple observation from equation (2.27) is that with $s = 0$, the conditional noise predictor reduces to the unconditional one: $\hat{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t)$. We can instead inject the condition $y$ directly into the noise network, and the objective for diffusion-based IR is given by

$L_{\mathrm{IR}} = \mathbb{E}_{t, x_0, y, \epsilon}\big[\big\|\epsilon - \epsilon_\theta(x_t, y, t)\big\|^2\big].$  (3.2)

We name this the conditional direct diffusion model (CDDM), which essentially follows the same training and sampling procedure as DDPM, except for the condition $y$ in the noise prediction as shown in figure 4. As a result, the generated image can be of very high visual quality (it looks realistic), but often has limited consistency with the original HQ image [16], as can be observed for the examples in the right-hand part of figure 4. Fortunately, some IR tasks, such as image super-resolution, colourization and inpainting, are highly ill-posed and can tolerate diverse predictions. CDDM can then be effectively applied to these tasks for photo-realistic IR.
Figure 4.
Left: Overview of CDDM on the face inpainting case. The only change compared to DDPM (figure 2) is the reverse transition model $p_\theta(x_{t-1} \mid x_t, y)$, which involves the LQ image $y$ in sampling to generate the corresponding HQ image. Right: Two IR examples (image super-resolution and inpainting) performed under the CDDM framework. These results look realistic but are not consistent with the original image.
One typical method is SR3 [16], which employs CDDM with a few modifications for image super-resolution. To condition the model on the LQ image $y$, SR3 up-samples $y$ to the target resolution so that it can be concatenated with the intermediate state $x_t$ along the channel dimension. Subsequently, Palette [29] extends SR3 to general IR tasks including colourization, inpainting, uncropping and JPEG restoration. There is more work [17,30] employing the same ‘direct diffusion’ strategy but adopting different restoration pipelines and additional networks for task-specific model learning. More recently, Wang et al. [31] propose StableSR, which further adapts a large-scale pretrained DM (Stable Diffusion [14]) for IR, by tweaking the noise predictor with image conditioning in the same way as for CDDM.
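Training a CDDM thus differs from plain DDPM training only in that the noise predictor also sees the LQ image. A minimal sketch of the objective in equation (3.2), where `eps_model(x_t, y, t)` is a placeholder for the conditional network:

```python
import numpy as np

def cddm_loss(eps_model, x0, y, alpha_bars, rng):
    """Monte Carlo estimate of the CDDM objective (equation (3.2)).
    Identical to L_simple except the predictor is also conditioned on y."""
    t = int(rng.integers(0, len(alpha_bars)))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, y, t)) ** 2)
```

In practice, as in SR3, the conditioning is often realized by channel-wise concatenation of $y$ (up-sampled to the target resolution) with $x_t$ at the network input.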
(b). Training-free CDMs
The key to the success of CDDM in IR lies in learning the conditional noise predictor $\epsilon_\theta(x_t, y, t)$ by optimizing equation (3.2) on a dataset of paired LQ–HQ images. Unfortunately, this means that $\epsilon_\theta$ needs to be re-trained to handle tasks which are not included in the current training data, even in the non-blind setting where the degradation parameters $\mathcal{H}$ and $n$ in equation (3.1) are known. For non-blind IR, a training-free approach can instead be derived by directly incorporating the degradation function into a pretrained unconditional DM, such as a DDPM.

With known degradation parameters, the likelihood term $p(y \mid x)$ also becomes accessible: $p(y \mid x) = \mathcal{N}\big(y;\ \mathcal{H}(x),\ \sigma^2\mathbf{I}\big)$, if the noise $n \sim \mathcal{N}(0, \sigma^2\mathbf{I})$ is Gaussian. Traditional IR approaches often solve this problem using maximum a posteriori (MAP) estimation [6], as follows:

$\hat{x} = \arg\max_x\ \log p(y \mid x) + \log p(x),$  (3.3)

where $p(x)$ is a prior term empirically chosen to characterize the prior knowledge of $x$. Then, a natural idea is to incorporate a pretrained unconditional DDPM as a powerful learned image prior $p(x)$. Specifically, recall the conditional score of equation (2.28) in the form

$\nabla_{x_t} \log p_t(x_t \mid y) = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t),$  (3.4)

where $x_t$ matches the diffusion state in DDPM, and the unconditional score $\nabla_{x_t} \log p_t(x_t)$ can be obtained from equation (2.22) and approximated with DDPM's noise predictor, as $\nabla_{x_t} \log p_t(x_t) \approx -\epsilon_\theta(x_t, t)/\sqrt{1 - \bar{\alpha}_t}$. However, computing $p_t(y \mid x_t)$ in equation (3.4) is difficult since there is no obvious relationship between $y$ and the intermediate state $x_t$. Fortunately, with Gaussian noise $n$, Chung et al. [32] propose an approximation for $p_t(y \mid x_t)$ at each timestep $t$:

$p_t(y \mid x_t) \approx p\big(y \mid \hat{x}_0(x_t)\big), \quad \hat{x}_0(x_t) := \mathbb{E}[x_0 \mid x_t] = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t + (1 - \bar{\alpha}_t)\,\nabla_{x_t} \log p_t(x_t)\big).$  (3.5)

This can be obtained via Tweedie’s formula [32–35]. The approximation above is motivated by a Dirac approximation $p(x_0 \mid x_t) \approx \delta_{\hat{x}_0}(x_0)$, where $\hat{x}_0$ is any Monte Carlo sample of $p(x_0 \mid x_t)$. However, the approximation usually exhibits high variance since it uses a single Monte Carlo sample, while drawing more samples incurs more computations. Moreover, it is worth noting that $p(y \mid \hat{x}_0)$ is a tractable Gaussian: $p(y \mid \hat{x}_0) = \mathcal{N}\big(y;\ \mathcal{H}(\hat{x}_0),\ \sigma^2\mathbf{I}\big)$. Computing $\nabla_{x_t} \log p(y \mid \hat{x}_0(x_t))$ and substituting it for $\nabla_{x_t} \log p_t(y \mid x_t)$ in equation (3.4) thus gives the following:

$\nabla_{x_t} \log p_t(x_t \mid y) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}} - \frac{1}{2\sigma^2}\,\nabla_{x_t}\big\|y - \mathcal{H}\big(\hat{x}_0(x_t)\big)\big\|_2^2.$  (3.6)
We can then incorporate this approximation equation (3.6) into the sampling of a pretrained DDPM,

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(x_t + \beta_t\,\nabla_{x_t} \log p_t(x_t \mid y)\big) + \sqrt{\tilde{\beta}_t}\,z$  (3.7)
$\qquad \approx \frac{1}{\sqrt{\alpha_t}}\Big(x_t + \beta_t\Big(-\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}} - \frac{1}{2\sigma^2}\,\nabla_{x_t}\big\|y - \mathcal{H}\big(\hat{x}_0(x_t)\big)\big\|_2^2\Big)\Big) + \sqrt{\tilde{\beta}_t}\,z$  (3.8)
$\qquad = \underbrace{\frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big) + \sqrt{\tilde{\beta}_t}\,z}_{\text{diffusion term}}\ -\ \frac{\beta_t}{2\sigma^2\sqrt{\alpha_t}}\,\nabla_{x_t}\big\|y - \mathcal{H}\big(\hat{x}_0(x_t)\big)\big\|_2^2,$  (3.9)

where the first line is derived from equations (2.13) and (2.22) with the additional condition $y$. Note that the diffusion term in equation (3.9) is actually an unconditional sampling step in DDPM, where $\epsilon_\theta(x_t, t)$ is obtained from equation (2.22) as $\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\,\nabla_{x_t} \log p_t(x_t)$. By letting $\zeta$ represent the step size of the data consistency term and simplifying the diffusion term, we then finally have

$x_{t-1} = \mu_\theta(x_t, t) + \sqrt{\tilde{\beta}_t}\,z - \zeta\,\nabla_{x_t}\big\|y - \mathcal{H}\big(\hat{x}_0(x_t)\big)\big\|_2^2,$  (3.10)

where $\mu_\theta(x_t, t)$ and $\tilde{\beta}_t$ are the posterior mean and variance of equation (2.10), respectively. This approach is called the diffusion posterior sampling (DPS) [32,36]. Note that equation (3.10) is conceptually similar to the MAP estimate in equation (3.3), with $\zeta\,\nabla_{x_t}\|y - \mathcal{H}(\hat{x}_0(x_t))\|_2^2$ as the data consistency term and $\mu_\theta$ being a diffusion-based image prior. When the degradation parameters of equation (3.1) are known, DPS thus utilizes this knowledge to guide the sampling process of a pretrained DDPM, encouraging generated images to be consistent with the LQ input image $y$.
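For a linear degradation $\mathcal{H}$ with known adjoint, one DPS iteration of equation (3.10) can be sketched as follows. Note one loud simplification: we drop the chain rule through the noise network when differentiating $\|y - \mathcal{H}(\hat{x}_0(x_t))\|^2$ (real DPS implementations use automatic differentiation), so this is a rough illustrative sketch, not the exact method of [32]:

```python
import numpy as np

def dps_step(x, t, y, H, H_adj, eps_model, betas, alpha_bars, zeta, rng):
    """One reverse step in the spirit of equation (3.10): an unconditional
    DDPM step plus a data-consistency gradient step."""
    abar = alpha_bars[t]
    eps = eps_model(x, t)
    x0_hat = (x - np.sqrt(1.0 - abar) * eps) / np.sqrt(abar)  # Tweedie estimate
    # Unconditional DDPM step (equation (2.13)).
    abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    beta_tilde = (1.0 - abar_prev) / (1.0 - abar) * betas[t]
    z = rng.standard_normal(x.shape) if t > 0 else np.zeros(x.shape)
    x_prev = (x - betas[t] / np.sqrt(1.0 - abar) * eps) / np.sqrt(1.0 - betas[t])
    x_prev = x_prev + np.sqrt(beta_tilde) * z
    # Data-consistency gradient of ||y - H(x0_hat)||^2 w.r.t. x, with
    # d(x0_hat)/dx crudely approximated by 1/sqrt(abar) (network term dropped).
    grad = -2.0 * H_adj(y - H(x0_hat)) / np.sqrt(abar)
    return x_prev - zeta * grad
```

Here `H` and `H_adj` are the forward operator and its adjoint (e.g. blur and its transpose), and `zeta` plays the role of the step size $\zeta$.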
However, DPS does rely on the approximation in equation (3.5), for which the approximation error approaches zero only when the noise of $y$ has a high variance: $\sigma^2 \to \infty$. For the case where the LQ image is noiseless, $\sigma = 0$, we would prefer to introduce the approach from figure 5 where the unconditionally generated state $\hat{x}_{t-1}$ is refined using the known degradation $\mathcal{H}$ and the LQ image $y$. More specifically, since the term $p_t(y \mid x_t)$ now is unattainable (or non-approximable), we instead apply the forward marginal transition of equation (2.3) also on the LQ image to obtain $y_{t-1} \sim q(y_{t-1} \mid y)$, which is an intermediate state between $y$ and Gaussian noise. Then, we impose data consistency by projecting onto a conditional path as follows:
Figure 5.
Overview of the projection-based CDM. There are two paths for the HQ image $x$ and LQ image $y$, generated from the same DM. At each reverse step $t$, the sampling first leverages the pretrained DDPM for unconditional generation, i.e. $\hat{x}_{t-1} \sim p_\theta(\hat{x}_{t-1} \mid x_t)$, and then refines $\hat{x}_{t-1}$ to $x_{t-1}$ with functions $\Phi$ and $\Psi$ as $x_{t-1} = \Phi(y_{t-1}) + \Psi(\hat{x}_{t-1})$, where $y_{t-1}$ is obtained by applying the forward marginal transition equation (2.3) on the LQ image as $y_{t-1} \sim q(y_{t-1} \mid y)$.

$x_{t-1} = \Phi(y_{t-1}) + \Psi(\hat{x}_{t-1}),$  (3.11)

where $\Phi$ and $\Psi$ are functions derived from the known degradation $\mathcal{H}$. For computational efficiency, the two functions are typically assumed to be linear and tailored to specific IR tasks. This projection-based method is referred to as iterative latent variable refinement [37]. In addition, for linear degradation problems, we can refine the intermediate state $x_t$ by decomposing the degradation operator into partitions and then combining them with the LQ image $y$ in the reverse process [15], or optimize $x_t$ using the Bayesian framework directly [38]. These approaches are similar to the projection-based method but can be more computationally efficient.
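A minimal sketch of the projection step of equation (3.11), with a generic low-pass filter standing in for the task-specific $\Phi$ and $\Psi = I - \Phi$ (an ILVR-style choice; the operator itself is an assumption here). `x_uncond` is the unconditional sample $\hat{x}_{t-1}$ from the pretrained DDPM:

```python
import numpy as np

def project(x_uncond, y, t, alpha_bars, low_pass, rng):
    """Equation (3.11) with Phi = low_pass and Psi = identity - low_pass:
    keep the low frequencies of a noised LQ image y_{t-1} (via equation (2.3))
    and the remaining content of the unconditional sample."""
    eps = rng.standard_normal(y.shape)
    y_t = np.sqrt(alpha_bars[t]) * y + np.sqrt(1.0 - alpha_bars[t]) * eps
    return low_pass(y_t) + (x_uncond - low_pass(x_uncond))
```

For super-resolution, `low_pass` would typically be a downsample-then-upsample operator matched to the known degradation scale.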
Recently, another class of approaches for training-free CDMs has emerged which is based on Feynman–Kac models and sequential Monte Carlo (SMC) samplers [39–41]. At their core, they wrap the approximations of $p_t(y \mid x_t)$ in the proposals of an SMC sampler, so that the marginal distributions of the sampler anneal to the target conditional one. This approach is statistically exact, regardless of the approximations in $p_t(y \mid x_t)$, in the sense that as the number of particles used in the SMC sampler goes to infinity, the resulting population converges in distribution to the target. As such, this type of method improves significantly over DPS [32] in terms of statistical errors. However, it comes at a cost: storing a population of particles does not scale well in memory with the problem dimension, and performance hinges on the efficiency of the proposals. To improve the sampler, Cardoso et al. [40] consider linear Gaussian likelihood models and propose efficient proposals based on an inpainting problem, while Janati et al. [41] develop a divide-and-conquer construction to set intermediate target distributions. Note that although Corenflos et al. [42] and Dou & Song [43] also use SMC samplers, they target different Feynman–Kac models. Moreover, the methods in [42] are training-free only for special problems (e.g. inpainting).
(c). Diffusion process towards degraded images
In previous sections, we have presented several diffusion-based IR methods, both for the blind and non-blind settings. However, these methods all generate images starting from Gaussian noise, which intuitively should be inefficient for IR tasks, given that input LQ images are closely related to the corresponding HQ images. That is, it should be easier to translate directly from the LQ image to the HQ image, rather than from noise to HQ image, as shown in figure 6. To address this problem, for general blind IR tasks, Luo et al. [18] propose the IR-SDE that models image degradation with a mean-reverting SDE:
Figure 6.
Overview of the approach that performs diffusion towards degraded images. Here, the LQ image $y$ is involved in both the forward and backward processes. Moreover, the terminal state $x_T$ is a (noisy) LQ image rather than Gaussian noise.
$\mathrm{d}x = \theta_t\,(\mu - x)\,\mathrm{d}t + \sigma_t\,\mathrm{d}w,$  (3.12)

where $\mu$ is the state mean the SDE drifts to. The parameters $\theta_t$ and $\sigma_t$ are predefined and they control the speed of the mean-reversion and the stochastic volatility, respectively. It is noted that the VP-SDE [22] is a special case of equation (3.12) where $\mu$ is set to 0. Moreover, the SDE in equation (3.12) is proven to be tractable when the coefficients satisfy $\sigma_t^2/\theta_t = 2\lambda^2$ for all timesteps [18]. Similar to DDPM, we can obtain the marginal transition kernel $p(x_t \mid x_0)$, which is a Gaussian given by

$p(x_t \mid x_0) = \mathcal{N}\big(x_t;\ m_t := \mu + (x_0 - \mu)\,\mathrm{e}^{-\bar{\theta}_t},\ v_t := \lambda^2\big(1 - \mathrm{e}^{-2\bar{\theta}_t}\big)\big),$  (3.13)

where $\bar{\theta}_t = \int_0^t \theta_s\,\mathrm{d}s$. As $t \to \infty$, the terminal distribution converges to a stationary Gaussian with mean $\mu$ and variance $\lambda^2$. By setting the HQ image as the initial state $x_0$ and the LQ image as the terminal state mean $\mu$, this SDE iteratively transforms the HQ image into the LQ image with additional noise (where the noise level is fixed to $\lambda$). Then, we can restore the HQ image based on the reverse-time process of equation (3.12) as follows:

$\mathrm{d}x = \big[\theta_t\,(\mu - x) - \sigma_t^2\,\nabla_x \log p_t(x)\big]\,\mathrm{d}t + \sigma_t\,\mathrm{d}\bar{w}.$  (3.14)

Notably, the score function is tractable when conditioning on the known $x_0$ in training, as $\nabla_{x_t} \log p(x_t \mid x_0) = -(x_t - m_t)/v_t$, where $m_t$ and $v_t$ are the mean and variance of equation (3.13), respectively. Learning this score with a neural network is similar to denoising score matching [25] but the target score is directly computed from the training distributions.
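The tractable marginal of equation (3.13) is easy to check numerically. A small sketch (function names are our own; $\bar{\theta}_t$ is passed in directly rather than integrated):

```python
import numpy as np

def irsde_marginal(x0, mu, theta_bar_t, lam):
    """Mean and variance of p(x_t | x_0) in equation (3.13):
    m_t = mu + (x0 - mu) * exp(-theta_bar_t),
    v_t = lam^2 * (1 - exp(-2 * theta_bar_t))."""
    m_t = mu + (x0 - mu) * np.exp(-theta_bar_t)
    v_t = lam ** 2 * (1.0 - np.exp(-2.0 * theta_bar_t))
    return m_t, v_t
```

At $t = 0$ this returns $(x_0, 0)$, and as $\bar{\theta}_t \to \infty$ it converges to the stationary Gaussian $\mathcal{N}(\mu, \lambda^2)$, i.e. a noisy LQ image when $\mu$ is set to $y$.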
However, IR-SDE still needs to add noise to the LQ image as a terminal state $x_T \sim \mathcal{N}(\mu, \lambda^2\mathbf{I})$. For a fixed point-to-point mapping with a diffusion process, we further introduce the diffusion bridge (DB) [44], which can naturally transfer complex data distributions to reference distributions, i.e. directly from HQ to LQ images, without adding noise. More specifically, given a diffusion process defined by a forward SDE as in equation (2.14), Rogers & Williams [45] show that we can force the SDE to drift from the HQ image $x_0$ to a particular condition (the LQ image $y$) via Doob’s $h$-transform [46]:

$\mathrm{d}x = \big[f(x, t) + g(t)^2\,h(x, t, y, T)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}w,$  (3.15)

where $h(x, t, y, T) = \nabla_x \log p\big(x_T = y \mid x_t = x\big)$ is the gradient of the log transition kernel from $x_t$ to $x_T$, derived from the original SDE. By setting the terminal state $x_T = y$, the term $h$ pushes each forward step towards the end condition $y$, which exactly models the image degradation process. The corresponding reverse-time SDE of equation (3.15) can then be written as

$\mathrm{d}x = \big[f(x, t) + g(t)^2\,h(x, t, y, T) - g(t)^2\,\nabla_x \log p_t(x \mid x_T = y)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},$  (3.16)

where $\nabla_x \log p_t(x \mid x_T = y)$ is the conditional score function which can be learned via score-matching. The HQ image can then be recovered from the LQ image by iteratively running equation (3.16) backwards in time as a traditional SDE solver. Note that we can design specific SDEs (e.g. VP/VE-SDE [22]) to make the function $h$ tractable [44,47,48]. The simplest case is the Brownian bridge [44] which constructs the marginal distribution as $p(x_t \mid x_0, x_T) = \mathcal{N}\big(x_t;\ (1 - \frac{t}{T})\,x_0 + \frac{t}{T}\,x_T,\ \frac{t(T - t)}{T}\mathbf{I}\big)$. Another particular case is the Schrödinger bridge [48], which aims to compute a diffusion process that interpolates within the optimal coupling (when the reference measure of the bridge is chosen to be a Brownian motion) between the HQ and LQ image distributions [49]. The solution of the Schrödinger bridge converges weakly to an optimal transport plan with respect to the 2-Wasserstein distance [48,50]. Most DB frameworks learn the noise directly by adopting a score reparameterization trick similar to equation (2.22), which leads to the following objective: $\mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]$ with $x_t = m_t + \sqrt{v_t}\,\epsilon$, where $m_t$ and $v_t$ are the marginal mean and variance of the forward process. More recently, Yue et al. [51] further propose to apply the DB to IR-SDE as the generalized Ornstein–Uhlenbeck bridge to achieve better performance. However, designing the forward SDE in equation (3.15) with a tractable yet effective $h$ remains a challenge and is under-explored in IR. With the growing popularity of Score-SDEs and DBs, we hope that future approaches will offer various efficient and elegant solutions to general IR problems.
4. Conclusion and discussion
DMs have shown incredible capabilities and gained significant popularity in generative modelling. In particular, the mathematics behind them make these models exceedingly elegant. Building on their core concepts, we have described several approaches that effectively employ DMs for various IR tasks, achieving impressive results. However, it is also crucial to highlight the main challenges and further outline potential directions for future work.
(a). Difficulty in processing out-of-distribution degradations
Applying trained DMs to out-of-distribution (OOD) data often leads to inferior performance and produces visually unpleasant artefacts [52], as shown in figure 7. Some research [31] proposes to address this issue by leveraging the capable stable diffusion model [14] together with a feature control module [54]. However, such approaches still have to fine-tune the stable DM on specific IR datasets. Moreover, the commonly used synthetic data strategy [55] only simulates known degradations such as noise, blur and compression, and cannot cover all corruption types that might be encountered in real-world applications. Inspired by the success of large language models and vision-language models, more recent approaches [5,52,53] have begun to explore various language-based image representations in IR. The main idea is to produce ‘clean’ text descriptions of input LQ images, describing the main image content without undesired degradation-related concepts, and use these to guide the restoration process.
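In such language-guided approaches, the text condition is typically injected through the standard classifier-free guidance rule. The sketch below is a schematic illustration under the assumption that `eps_text` is the noise prediction conditioned on a ‘clean’ caption of the LQ image; the function name is ours.

```python
import numpy as np

def guided_eps(eps_uncond, eps_text, w=2.0):
    # Classifier-free guidance: move the unconditional noise prediction
    # towards the text-conditioned direction with guidance weight w.
    # w = 0 ignores the text; larger w follows the 'clean' caption more.
    return eps_uncond + w * (eps_text - eps_uncond)
```

The combined prediction then replaces the unconditional one at every reverse diffusion step.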
Figure 7.
Failed examples of applying a trained DM [53] to real-world and OOD LQ input images. In the left-hand example, the predicted HQ image contains unrecognizable text. In the right-hand example, the generated window shutters are visually unpleasant and inconsistent with the LQ input image.
(b). Inconsistency in image generation
While DMs produce photo-realistic results, the generated details are often inconsistent with the original input, especially regarding texture and text information, as shown in the right-hand part of figure 4 and in figure 7. This is mainly due to the intrinsic bias in the multi-step noise/score estimation and the stochasticity of the noise injection at each iteration. One solution is to add a predictor that generates an initial HQ estimate (trained with a reconstruction loss) and then gradually adds more details via a diffusion process [30]. However, this requires an additional network, and the performance depends heavily on the trained predictor. IR-SDE [18] proposes a maximum likelihood objective to learn the optimal restoration path, but its reverse-time process still contains noise injection (i.e. a Wiener process), leading to unsatisfactory results. Recently, flow matching and optimal transport have shown great potential in image generation. In particular, they can form straight-line trajectories at inference, which are more efficient than the curved paths of DMs [56]. Applying such methods to IR tasks is therefore a promising direction for future work.
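As a toy illustration of why straight-line trajectories are efficient, the sketch below uses the linear interpolation path common to flow matching and rectified flow: with the exact (constant) velocity target $x_1 - x_0$, a single Euler step already reaches the target, whereas curved diffusion paths need many small steps. The function names are our own.

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    # Linear path x_t = (1 - t) x0 + t x1 with constant target velocity
    # v = x1 - x0, as used in flow matching / rectified flow training.
    xt = (1.0 - t) * x0 + t * x1
    return xt, x1 - x0

def euler_sample(x, v_fn, n_steps=10):
    # Integrate the ODE dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps.
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_fn(x, i * dt)
    return x
```

In practice `v_fn` would be a trained network, and the straightness of the learned path determines how few steps suffice.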
(c). High computational cost and inference time
Most diffusion-based IR methods require a large number of diffusion steps to generate the final HQ image (typically 1000 steps for DDPMs), which is both time-consuming and computationally costly, posing challenges for deployment in real-world applications. This problem can be alleviated using latent DMs [14,57] or efficient sampling techniques [26,58]. Unfortunately, these are not always suitable for IR tasks, since latent DMs often produce colour shifts [31] and efficient sampling can degrade image generation quality [58]. Exploiting the particular structure of IR, several works [18,42,48] design the diffusion process towards degraded images (§3c), such that inference can start from the LQ image rather than from Gaussian noise. While this makes the sampling process more efficient (typically requiring fewer than 100 diffusion steps), further improvements could be possible by designing more effective SDEs or DB functions.
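The efficiency gain from starting at the LQ image can be sketched as follows. For SDEs whose terminal distribution concentrates around the LQ image $y$ with a small variance $\lambda^2$ (as in mean-reverting constructions), inference is initialized by lightly noising $y$ instead of drawing pure Gaussian noise; `reverse_step_fn` below is a hypothetical placeholder for one learned denoising step.

```python
import numpy as np

def init_from_lq(lq, lam=0.1, rng=None):
    # Start sampling from a lightly noised LQ image, x_T ~ N(lq, lam^2 I),
    # rather than from pure Gaussian noise N(0, I).
    rng = np.random.default_rng() if rng is None else rng
    return lq + lam * rng.standard_normal(np.shape(lq))

def restore(lq, reverse_step_fn, n_steps=100, lam=0.1, rng=None):
    # Generic reverse loop: since x already carries image content, far
    # fewer steps are needed than when starting from N(0, I).
    x = init_from_lq(lq, lam, rng)
    for i in reversed(range(n_steps)):
        x = reverse_step_fn(x, i / n_steps)
    return x
```

The loop structure is the same as a standard diffusion sampler; only the initialization (and the SDE it matches) changes.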
(i). Closing
We have covered the basics of DMs and key techniques for applying them to IR tasks. This is an active research area with many interesting challenges and potential future directions, such as achieving photo-realistic yet consistent image generation, robustness to real-world image degradations, and more computationally efficient sampling. We hope that this review paper offers a foundational understanding that enables readers to gain deeper insights into the mathematical principles underlying advanced diffusion-based IR approaches.
Contributor Information
Ziwei Luo, Email: ziwei.luo@it.uu.se.
Fredrik Gustafsson, Email: fredrik.gustafsson@ki.se.
Zheng Zhao, Email: zheng.zhao@liu.se; zz@zabmon.com.
Jens Sjölund, Email: jens.sjolund@it.uu.se.
Thomas Schön, Email: thomas.schon@it.uu.se.
Data accessibility
This article has no additional data.
Declaration of AI use
We have not used AI-assisted technologies in creating this article.
Authors’ contributions
Z.L.: conceptualization, investigation, methodology, project administration, validation, writing—original draft, writing—review and editing; F.G.: supervision, writing—original draft, writing—review and editing; Z.Z.: investigation, methodology, writing—review and editing; J.S.: supervision, writing—review and editing; T.S.: conceptualization, funding acquisition, supervision, writing—review and editing.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration
We declare we have no competing interests.
Funding
This research was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, by the project Deep Probabilistic Regression – New Models and Learning Algorithms (contract number: 2021-04301) funded by the Swedish Research Council, and by the Kjell & Märta Beijer Foundation.
References
- 1. Buades A, Coll B, Morel JM. 2005. A review of image denoising algorithms, with a new one. Multiscale Model. Simul. 4 , 490–530. ( 10.1137/040616024) [DOI] [Google Scholar]
- 2. Shan Q, Jia J, Agarwala A. 2008. High-quality motion deblurring from a single image. ACM Trans. Graph. 27 , 1–10. ( 10.1145/1360612.1360672) [DOI] [Google Scholar]
- 3. Jose Valanarasu JM, Yasarla R, Patel VM. 2022. Transweather: transformer-based restoration of images degraded by adverse weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2353–2363. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPR52688.2022.00239) [DOI] [Google Scholar]
- 4. Le H, Samaras D. 2019. Shadow removal via shadow image decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8578–8587. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/ICCV.2019.00867) [DOI] [Google Scholar]
- 5. Luo Z, Gustafsson FK, Zhao Z, Sjölund J, Schön TB. 2024. Controlling vision-language models for universal image restoration. In The Twelfth International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 6. Banham MR, Katsaggelos AK. 1997. Digital image restoration. IEEE Signal Process. Mag. 14 , 24–41. ( 10.1109/79.581363) [DOI] [Google Scholar]
- 7. Orfanidis SJ. 1995. Introduction to signal processing. Englewood Cliffs, NJ: Prentice Hall. [Google Scholar]
- 8. Rabiner LR, Gold B. 1975. Theory and application of digital signal processing. Englewood Cliffs, NJ: Prentice-Hall. [Google Scholar]
- 9. Kundur D, Hatzinakos D. 1996. Blind image deconvolution. IEEE Signal Process. Mag. 13 , 43–64. ( 10.1109/79.489268) [DOI] [Google Scholar]
- 10. Zhao H, Gallo O, Frosio I, Kautz J. 2016. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 3 , 47–57. ( 10.1109/tci.2016.2644865) [DOI] [Google Scholar]
- 11. Ledig C, et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPR.2017.19) [DOI] [Google Scholar]
- 12. Johnson J, Alahi A, Fei-Fei L. 2016. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pp. 694–711. Cham, Switzerland: Springer International Publishing. ( 10.1007/978-3-319-46475-6_43) [DOI] [Google Scholar]
- 13. Ho J, Jain A, Abbeel P. 2020. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33 , 6840–6851. ( 10.48550/arXiv.2006.11239) [DOI] [Google Scholar]
- 14. Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPR52688.2022.01042) [DOI] [Google Scholar]
- 15. Kawar B, Elad M, Ermon S, Song J. 2022. Denoising diffusion restoration models. Adv. Neural Inf. Process. Syst. 35 , 23593–23606. ( 10.48550/arXiv.2201.11793) [DOI] [Google Scholar]
- 16. Saharia C, Ho J, Chan W, Salimans T, Fleet DJ, Norouzi M. 2022. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell. 45 , 4713–4726. ( 10.1109/TPAMI.2022.3204461) [DOI] [PubMed] [Google Scholar]
- 17. Özdenizci O, Legenstein R. 2023. Restoring vision in adverse weather conditions with patch-based denoising diffusion models. IEEE Trans. Pattern Anal. Mach. Intell. 45 , 10346–10357. ( 10.1109/tpami.2023.3238179) [DOI] [PubMed] [Google Scholar]
- 18. Luo Z, Gustafsson FK, Zhao Z, Sjölund J, Schön TB. 2023. Image restoration with mean-reverting stochastic differential equations. In International Conference on Machine Learning, pp. 23045–23066. Red Hook, NY. [Google Scholar]
- 19. Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. Red Hook, NY. [Google Scholar]
- 20. Vaswani A. 2017. Attention is all you need. In Advances in neural information processing systems. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 21. Song Y, Ermon S. 2019. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 32 , 11895–11907. ( 10.48550/arXiv.1907.05600) [DOI] [Google Scholar]
- 22. Song Y, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B. 2020. Score-based generative modeling through stochastic differential equations. arXiv. ( 10.48550/arXiv.2011.13456) [DOI]
- 23. Anderson BDO. 1982. Reverse-time diffusion equation models. Stoch. Process. Their Appl. 12 , 313–326. ( 10.1016/0304-4149(82)90051-5) [DOI] [Google Scholar]
- 24. Hyvärinen A, Dayan P. 2005. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6 , 695–709. [Google Scholar]
- 25. Vincent P. 2011. A connection between score matching and denoising autoencoders. Neural Comput. 23 , 1661–1674. ( 10.1162/neco_a_00142) [DOI] [PubMed] [Google Scholar]
- 26. Lu C, Zhou Y, Bao F, Chen J, Li C, Zhu J. 2022. DPM-Solver: a fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Adv. Neural Inf. Process. Syst. 35 , 5775–5787. ( 10.48550/arXiv.2206.00927) [DOI] [Google Scholar]
- 27. Gillespie DT. 1996. Exact numerical simulation of the ornstein-uhlenbeck process and its integral. Phys. Rev. E 54 , 2084–2091. ( 10.1103/physreve.54.2084) [DOI] [PubMed] [Google Scholar]
- 28. Dhariwal P, Nichol A. 2021. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 34 , 8780–8794. ( 10.48550/arXiv.2105.05233) [DOI] [Google Scholar]
- 29. Saharia C, Chan W, Chang H, Lee C, Ho J, Salimans T, Fleet D, Norouzi M. 2022. Palette: image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10. ( 10.1145/3528233.3530757) [DOI] [Google Scholar]
- 30. Whang J, Delbracio M, Talebi H, Saharia C, Dimakis AG, Milanfar P. 2022. Deblurring via stochastic refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16293–16303. Los Alamitos, CA: IEEE Computer Society. [Google Scholar]
- 31. Wang J, Yue Z, Zhou S, Chan KCK, Loy CC. 2024. Exploiting diffusion prior for real-world image super-resolution. Int. J. Comput. Vis. 132 , 1–21. ( 10.1007/s11263-024-02168-7) [DOI] [Google Scholar]
- 32. Chung H, Kim J, Mccann MT, Klasky ML, Ye JC. 2022. Diffusion posterior sampling for general noisy inverse problems. arXiv Preprint arXiv:2209.14687. [Google Scholar]
- 33. Efron B. 2011. Tweedie’s formula and selection bias. J. Am. Stat. Assoc. 106 , 1602–1614. ( 10.1198/jasa.2011.tm11181) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Song J, Vahdat A, Mardani M, Kautz J. 2023. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 35. Boys B, Girolami M, Pidstrigach J, Reich S, Mosca A, Akyildiz OD. 2023. Tweedie moment projected diffusions for inverse problems. arXiv. ( 10.48550/arXiv.2310.06721) [DOI]
- 36. Bruna J, Han J. 2024. Posterior sampling with denoising oracles via tilted transport. arXiv. ( 10.48550/arXiv.2407.00745) [DOI]
- 37. Choi J, Kim S, Jeong Y, Gwon Y, Yoon S. 2021. ILVR: conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. Los Alamitos, CA: IEEE Computer Society. [Google Scholar]
- 38. Zhang G, Ji J, Zhang Y, Yu M, Jaakkola T, Chang S. 2023. Towards coherent image inpainting using denoising diffusion implicit models. In International Conference on Machine Learning, pp. 41164–41193. Red Hook, NY: PMLR. [Google Scholar]
- 39. Wu L, Trippe B, Naesseth C, Blei D, Cunningham JP. 2023. Practical and asymptotically exact conditional sampling in diffusion models. In Advances in neural information processing systems, pp. 31372–31403, vol. 36. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 40. Cardoso G, Idrissi YJ, Corff SL, Moulines E. 2024. Monte Carlo guided denoising diffusion models for Bayesian linear inverse problems. In The Twelfth International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 41. Janati Y, Durmus A, Moulines E, Olsson J. 2024. Divide-and-conquer posterior sampling for denoising diffusion priors. arXiv. ( 10.48550/arXiv.2403.11407) [DOI]
- 42. Corenflos A, Zhao Z, Särkkä S, Sjölund J, Schön TB. 2024. Conditioning diffusion models by explicit forward-backward bridging. arXiv. ( 10.48550/arXiv.2405.13794) [DOI]
- 43. Dou Z, Song Y. 2024. Diffusion posterior sampling for linear inverse problem solving: a filtering perspective. In The Twelfth International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 44. Li B, Xue K, Liu B, Lai YK. 2023. BBDM: image-to-image translation with Brownian bridge diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1952–1961. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPR52729.2023.00194) [DOI] [Google Scholar]
- 45. Rogers LCG, Williams D. 2000. Diffusions, Markov processes, and martingales: Itô calculus. vol. 2. Cambridge, UK: Cambridge University Press. [Google Scholar]
- 46. Doob JL. 1984. Classical potential theory and its probabilistic counterpart. vol. 262. Berlin, Germany: Springer. [Google Scholar]
- 47. Zhou L, Lou A, Khanna S, Ermon S. 2024. Denoising diffusion bridge models. In The Twelfth International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 48. Liu GH, Vahdat A, Huang DA, Theodorou EA, Nie W, Anandkumar A. 2023. I2SB: image-to-image Schrödinger bridge. In Proceedings of the 40th International Conference on Machine Learning, pp. 22042–22062. Red Hook, NY: Proceedings of Machine Learning Research (PMLR). [Google Scholar]
- 49. Chen T, Liu GH, Theodorou EA. 2021. Likelihood training of schrödinger bridge using forward-backward sdes theory. arXiv. ( 10.48550/arXiv.2110.11291) [DOI]
- 50. Peyré G, Cuturi M. 2019. Computational optimal transport: with applications to data science. Found.Trends Mach. Learn. 11 , 355–607. ( 10.1561/2200000073) [DOI] [Google Scholar]
- 51. Yue C, Peng Z, Ma J, Du S, Wei P, Zhang D. 2023. Image restoration through generalized ornstein-uhlenbeck bridge. arXiv. ( 10.48550/arXiv.2312.10299) [DOI]
- 52. Luo Z, Gustafsson FK, Zhao Z, Sjölund J, Schön TB. 2024. Photo-realistic image restoration in the wild with controlled vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6641–6651. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPRW63382.2024.00658) [DOI] [Google Scholar]
- 53. Yu F, Gu J, Li Z, Hu J, Kong X, Wang X, He J, Qiao Y, Dong C. 2024. Scaling up to excellence: practicing model scaling for photo-realistic image restoration in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 25669–25680. Los Alamitos, CA: IEEE Computer Society. [Google Scholar]
- 54. Zhang L, Rao A, Agrawala M. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision, pp. 3836–3847. Los Alamitos, CA: IEEE Computer Society. [Google Scholar]
- 55. Wang X, Xie L, Dong C, Shan Y. 2021. Real-esrgan: training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF Conference on Computer Vision, pp. 1905–1914. Los Alamitos, CA: IEEE Computer Society. [Google Scholar]
- 56. Lipman Y, Chen RTQ, Ben-Hamu H, Nickel M, Le M. 2023. Flow matching for generative modeling. In The Eleventh Int. Conf. on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 57. Luo Z, Gustafsson FK, Zhao Z, Sjölund J, Schön TB. 2023. Refusion: enabling large-size realistic image restoration with latent-space diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1680–1691. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPRW59228.2023.00169) [DOI] [Google Scholar]
- 58. Song J, Meng C, Ermon S. 2021. Denoising diffusion implicit models. In International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]