Abstract
Diffusion models (DMs) have achieved remarkable progress in generative modelling, particularly in enhancing image quality to conform to human preferences. Recently, these models have also been applied to low-level computer vision for photo-realistic image restoration (IR) in tasks such as image denoising, deblurring and dehazing. In this review, we introduce key constructions in DMs and survey contemporary techniques that make use of DMs in solving general IR tasks. We also point out the main challenges and limitations of existing diffusion-based IR frameworks and provide potential directions for future work.
This article is part of the theme issue ‘Generative modelling meets Bayesian inference: a new paradigm for inverse problems’.
Keywords: diffusion models, image restoration, generative models, inverse problems
1. Introduction
Image restoration (IR) is a long-standing and challenging research topic in computer vision, which generally has two high-level aims: (i) recover high-quality (HQ) images from their degraded low-quality (LQ) counterparts, and (ii) eliminate undesired objects from specific scenes. The former includes tasks like image denoising [1] and deblurring [2], while the latter contains tasks like rain/haze/snow removal [3] and shadow removal [4]. Figure 1 showcases examples of these applications. To solve different IR problems, traditional methods require task-specific knowledge to model the degradation and perform restoration in the spatial or frequency domain, by combining classical signal processing algorithms [6–8] with specific image-degradation parameters [9]. More recently, numerous efforts have been made to train deep learning models on collected datasets to improve performance on different IR tasks [10]. Most of them directly train neural networks on sets of paired LQ–HQ images with a reconstruction objective (e.g. $\ell_1$ or $\ell_2$ distances) as typical in supervised learning. While effective, this approach tends to produce over-smooth results, particularly in textures [11]. Although this issue can be alleviated by including adversarial or perceptual losses [12], the training then typically becomes unstable and the results often contain undesired artefacts or are inconsistent with the input LQ images [11].
Figure 1.
Generally, there are two types of IR tasks: (1) Recover images from their degraded versions and (2) eliminate undesired objects from specific scenes. Here, all top rows are LQ input images and the bottom rows are the corresponding HQ images generated by a diffusion-based IR model [5]. As observed, applying DMs for IR can produce photo-realistic results in line with human perceptual preferences.
Recently, generative diffusion models (DMs) [13] have drawn increasing attention due to their stable training process and remarkable performance in producing realistic images and videos [14]. Inspired by them, numerous works have incorporated the diffusion process into various IR problems to obtain high-perceptual/photo-realistic results [15–18]. However, these methods exhibit considerable diversity and complexity across various domains and IR tasks, obscuring the shared foundations that are key to understanding and improving diffusion-based IR approaches. In light of this, our paper reviews the key concepts in DMs and then surveys trending techniques for applying them to IR tasks. More specifically, the fundamentals of DMs are introduced in §2, in which we further elucidate the score-based stochastic differential equations (Score-SDEs) and then show the connections between denoising diffusion probabilistic models (DDPMs) and Score-SDEs. In addition, the conditional diffusion models (CDMs) are elaborated such that we can learn to guide the image generation, which is key in adapting DMs for general IR tasks. Several diffusion-based IR frameworks are then summarized in §3. In particular, we can leverage CDMs for IR from different perspectives including DDPMs, Score-SDE and their connections. The connection even yields a training-free approach for non-blind IR, i.e. for tasks with known degradation parameters. Finally, we conclude the paper with a discussion of the remaining challenges and potential future work in §4.
2. Generative modelling with DMs
Generative DMs are a family of probabilistic models employing an iterative process (e.g. Markov chains) to transform the data distribution into a reference distribution. In the following, §2a describes a typical formulation of DMs: the DDPMs [13,19], followed by §2b which generalizes this to Score-SDEs for a more detailed analysis of the diffusion/reverse process. Finally, in §2c, we further show how to guide DMs for conditional generation, which is a key enabling technique for diffusion-based IR.
(a). DDPMs
Given a variable $x_0$ sampled from a data distribution $q(x_0)$, DDPMs [13,19] are latent variable models consisting of two Markov chains: a forward/diffusion process $q(x_t \mid x_{t-1})$ and a reverse process $p_\theta(x_{t-1} \mid x_t)$. The forward process transfers $x_0$ to a Gaussian distribution by sequentially injecting noise. For simplicity, we set $q(x_0) = p_{\mathrm{data}}(x_0)$ such that the forward process starts from the data distribution. Then the reverse process learns to generate new data samples starting from the Gaussian noise. An overview of the DDPM is shown in figure 2. Below, we explain the forward and backward processes, and provide details on how the DDPMs are trained.
Figure 2.
DDPMs. The forward path transfers data to Gaussian noise, and the reverse path learns to generate data from noise along the actual time reversal of the forward process. Here, the reverse transition distribution $p_\theta(x_{t-1} \mid x_t)$ represents the model we aim to learn, and the conditional posterior $q(x_{t-1} \mid x_t, x_0)$ is a tractable Gaussian which serves as the target distribution the model wants to match as the term $L_{t-1}$ in equation (2.7).
(i). Forward diffusion process
The forward process perturbs data samples $x_0$ to noise $x_T$. It can be characterized by a joint distribution encompassing all intermediate states, represented in the form

$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}).$  (2.1)

Here, the transition kernel $q(x_t \mid x_{t-1})$ is a handcrafted Gaussian given by

$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big),$  (2.2)

where $\{\beta_t \in (0,1)\}_{t=1}^{T}$ is the variance schedule: a set of pre-defined hyper-parameters that ensure the forward process (approximately) converges to a Gaussian distribution. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, equation (2.2) then allows us to marginalize the joint distribution of equation (2.1) to

$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big).$  (2.3)

We usually set $T$ large enough that $\bar{\alpha}_T \approx 0$ and the terminal distribution $q(x_T)$ is thus a standard Gaussian, which allows us to generate new data points by reversing the diffusion process starting from sampled Gaussian noise. Moreover, it is important to note that posteriors along the forward process are tractable when conditioned on the original data sample $x_0$, i.e. $q(x_{t-1} \mid x_t, x_0)$ is a tractable Gaussian [19]. This tractability enables the derivation of the DDPM training objective, which we will describe in §2a(iii).
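As a concrete illustration, the closed-form marginal of equation (2.3) can be sampled in a few lines. Below is a minimal NumPy sketch; the function names and the linear schedule are our own illustrative choices, not taken from a specific codebase:

```python
import numpy as np

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear variance schedule {beta_t}, with alpha_t = 1 - beta_t and
    alpha_bar_t = prod_{s<=t} alpha_s as defined below equation (2.2)."""
    betas = np.linspace(beta_min, beta_max, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form (equation (2.3))."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps
```

With $T = 1000$ and this schedule, $\bar{\alpha}_T$ is roughly $4 \times 10^{-5}$, so $x_T$ is statistically indistinguishable from standard Gaussian noise, as required.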
(ii). Reverse process
In contrast, the reverse process learns to match the actual time reversal of the forward process, which is also a joint distribution, modelled by $p_\theta$ as follows:

$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t).$  (2.4)

In DDPMs, the transition kernel $p_\theta(x_{t-1} \mid x_t)$ is defined as a learnable Gaussian:

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big),$  (2.5)

where $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ are the parameterized mean and variance, respectively. Learning the latent variable model of equation (2.5) is key to DDPMs since it substantially affects the quality of data sampling. That is, we have to adjust the parameters $\theta$ until the final sampled variable $x_0$ is close to that sampled from the real data distribution.
(iii). Training objective
To learn the reverse process, we usually minimize the variational bound on the negative log-likelihood, which introduces the forward joint distribution of equation (2.1) in the objective as

$\mathbb{E}\big[-\log p_\theta(x_0)\big] \le \mathbb{E}_q\Big[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\Big] =: L.$  (2.6)

Here, $p(x_T)$ is a standard Gaussian, $p_\theta(x_{t-1} \mid x_t)$ is the reverse transition kernel in equation (2.5) that we want to learn, and $q(x_t \mid x_{t-1})$ is the forward transition kernel of equation (2.2). This objective can be further rewritten according to

$L = \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\big\|\,p(x_T)\big)}_{L_T} + \sum_{t > 1} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\big\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0 \mid x_1)}_{L_0}\Big],$  (2.7)

where $L_T$ is called the prior matching term and contains no learnable parameters, $L_{t-1}$ is the posterior matching term and $L_0$ the data reconstruction term that maximizes the likelihood of $x_0$. Sohl-Dickstein et al. [19] prove that the conditional posterior distribution in $L_{t-1}$ is a tractable Gaussian: $q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t\mathbf{I}\big)$, where the mean and variance are

$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,x_t \quad\text{and}\quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t.$  (2.8)

All terms in $\tilde{\beta}_t$ are known and thus the posterior variance in equation (2.5) can be non-parametric, i.e. $\Sigma_\theta(x_t, t) = \tilde{\beta}_t\mathbf{I}$, which does not depend on $\theta$ and allows us to only focus on learning the posterior mean $\mu_\theta(x_t, t)$. Specifically, applying the reparameterization trick to $q(x_t \mid x_0)$ of equation (2.3) gives an estimate of the initial state: $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon\big)$, which can be substituted into equation (2.8) to obtain: $\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon\big)$. The noise $\epsilon$ then can be learned using a neural network $\epsilon_\theta(x_t, t)$, and the parameterized distribution mean can be rewritten as

$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big).$  (2.9)

The transition kernel of equation (2.5) is finally updated according to the following:

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \tilde{\beta}_t\mathbf{I}\big),$  (2.10)

where the variance is predefined as $\tilde{\beta}_t$ in equation (2.8). Note that now $p_\theta(x_{t-1} \mid x_t)$ matches the form of $q(x_{t-1} \mid x_t, x_0)$, to minimize the KL term of $L_{t-1}$ in equation (2.7). Also note that DDPMs only need to learn the noise network $\epsilon_\theta$, for which it is common to use a U-Net architecture with several self-attention layers [13]. The noise network takes an image $x_t$ and a time $t$ as input, and outputs a noise image of the same shape as $x_t$. More specifically, the scalar time $t$ is encoded into vectors similar to positional embedding [20] and is combined with $x_t$ in the feature space for time-varying noise prediction.
Simplified objective
We now have known expressions for all components of the objective in equation (2.7). However, its current form is not ideal to use for model training since it requires all terms $L_{t-1}$ to be computed at every timestep of the entire diffusion process, which is time-consuming and impractical. Fortunately, the prior matching term $L_T$ can be ignored since it does not contain any parameters. By substituting equations (2.8) and (2.9) into equation (2.7), we also find that the final expanded version of the posterior matching term ($L_{t-1}$) and the data reconstruction term $L_0$ have similar forms,

$L_{t-1} = \mathbb{E}_{x_0, \epsilon}\Big[\frac{\beta_t^2}{2\,\tilde{\beta}_t\,\alpha_t\,(1 - \bar{\alpha}_t)}\,\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2\Big],$  (2.11)

where $x_t$ denotes $\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$. By ignoring the weights outside the expectations in equation (2.11), the final training objective can therefore be obtained according to the following [13]:

$L_{\mathrm{simple}} = \mathbb{E}_{t, x_0, \epsilon}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\Big],$  (2.12)

which essentially learns to match the predicted and real added noise for each training sample and thus is also called the noise matching loss. Compared to the original objective in equation (2.7), $L_{\mathrm{simple}}$ is a re-weighted version that puts more focus on larger timesteps $t$, which empirically has been shown to improve the training [13]. Once trained, the noise prediction network $\epsilon_\theta$ can be used to generate new data by running equation (2.10) starting from $x_T \sim \mathcal{N}(0, \mathbf{I})$, i.e. by iterating

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big) + \sqrt{\tilde{\beta}_t}\,z, \quad z \sim \mathcal{N}(0, \mathbf{I}),$  (2.13)

as a parameterized data sampling process, similar to that in Langevin dynamics [21].
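Both the simplified objective (2.12) and the sampler (2.13) are short enough to sketch directly. The following NumPy skeleton is illustrative only; `eps_model(x, t)` stands in for a trained noise network and is passed as a plain callable:

```python
import numpy as np

def noise_matching_loss(eps_model, x0, alpha_bars, rng):
    """One Monte Carlo estimate of L_simple in equation (2.12)."""
    t = int(rng.integers(0, len(alpha_bars)))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def ddpm_sample(eps_model, shape, betas, alpha_bars, rng):
    """Ancestral sampling: iterate equation (2.13) from x_T ~ N(0, I)."""
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        alpha_t = 1.0 - betas[t]
        abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        beta_tilde = (1.0 - abar_prev) / (1.0 - alpha_bars[t]) * betas[t]
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_model(x, t)) / np.sqrt(alpha_t)
        x = x + np.sqrt(beta_tilde) * z
    return x
```

Note that the noise injection $z$ is switched off at the last step $t = 0$, a common practical convention.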
(b). Data perturbation and sampling with SDEs
We have shown how DDPM works for data perturbation and data generation. We can further generalize the DDPM to stochastic differential equations, namely, Score-SDE [22], where both the forward and reverse processes are in continuous-time state spaces. This generalization offers a deeper insight into the mathematics behind DMs that underlies the success of diffusion-based generative modelling. Figure 3 shows an overview of the Score-SDE approach.
Figure 3.
Data perturbation and sampling with SDEs. In contrast to DDPMs, the Score-SDE continuously perturbs the data to Gaussian noise using a forward SDE, $\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w$, and then generates new samples by estimating the score $\nabla_x \log p_t(x)$ and simulating the corresponding reverse-time SDE.
(i). Data perturbation with forward SDEs
Here, we construct variables $\{x(t)\}_{t \in [0, T]}$ for data perturbation in continuous time, which can be modelled as a forward SDE defined by

$\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w,$  (2.14)

where $f(x, t)$ and $g(t)$ are called the drift and diffusion functions, respectively, and $w$ is a standard Wiener process (also known as Brownian motion). We use $p_t(x)$ to denote the marginal probability density of $x(t)$, and use $p_{0t}(x(t) \mid x(0))$ to denote the transition kernel from $x(0)$ to $x(t)$. Moreover, we always design the SDE to drift to a fixed prior distribution $p_T$ (e.g. standard Gaussian), ensuring that $p_T$ becomes independent of $p_0$ and can be sampled individually.
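Equation (2.14) can be simulated with a plain Euler–Maruyama discretization. The sketch below uses a VP-style drift with a linear $\beta(t)$; both choices are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def euler_maruyama(x0, f, g, T=1.0, n_steps=1000, rng=None):
    """Integrate dx = f(x, t) dt + g(t) dw forward in time (equation (2.14))."""
    rng = np.random.default_rng(0) if rng is None else rng
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Illustrative VP-style coefficients (beta interpolates between 0.1 and 20).
beta = lambda t: 0.1 + (20.0 - 0.1) * t
f = lambda x, t: -0.5 * beta(t) * x   # drift towards zero
g = lambda t: np.sqrt(beta(t))        # diffusion coefficient
```

Starting from any $x(0)$, the terminal sample under these coefficients is close to $\mathcal{N}(0, \mathbf{I})$, i.e. the desired fixed prior.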
(ii). Sampling with reverse-time SDEs
We can sample noise $x(T) \sim p_T$ and reverse the forward SDE to generate new data close to that sampled from the real data distribution. Note that reversing equation (2.14) yields another diffusion process, i.e. a reverse-time SDE [23] in the form

$\mathrm{d}x = \big[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},$  (2.15)

where $\bar{w}$ is a reverse-time Wiener process and $\nabla_x \log p_t(x)$ is called the score (or score function). The score is the vector field of $\log p_t(x)$, pointing in the directions in which the probability density function has the largest growth rate [21]. Once the score is known for all time $t$, simulating equation (2.15) backwards in time allows us to sample new data from noise.

Earlier work such as the score-based generative models [21] often learn the score using score matching [24]. However, score matching is computationally costly and only works for discrete times. Song et al. [22] propose a continuous-time version that optimizes the following:

$\theta^{*} = \arg\min_\theta\ \mathbb{E}_t\Big\{\lambda(t)\,\mathbb{E}_{x(0)}\,\mathbb{E}_{x(t) \mid x(0)}\Big[\big\|s_\theta(x(t), t) - \nabla_{x(t)} \log p_{0t}\big(x(t) \mid x(0)\big)\big\|_2^2\Big]\Big\},$  (2.16)

where $t$ is uniformly sampled over $[0, T]$, $\lambda(t)$ is a positive weighting function, $x(0) \sim p_0(x)$, $x(t) \sim p_{0t}(x(t) \mid x(0))$, and $s_\theta(x, t)$ represents the score prediction network. This objective ensures that the optimal score network, denoted $s_{\theta^{*}}$, from equation (2.16) satisfies $s_{\theta^{*}}(x, t) = \nabla_x \log p_t(x)$ almost surely [22,25].
(iii). Interpreting DDPM with the variance preserving SDE
Notably, extending DDPM to an infinite number of timesteps (i.e. continuous time) leads to a special SDE which gives a more reliable interpretation of the diffusion process, and allows us to optimize the sampling with more efficient SDE/ordinary differential equation (ODE) solvers [22,26]. Specifically, recall the DDPM perturbation kernel of equation (2.2) and write it in the form

$x_t = \sqrt{1 - \beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_{t-1}, \quad \epsilon_{t-1} \sim \mathcal{N}(0, \mathbf{I}),$  (2.17)

where $t \in \{1, \ldots, T\}$ is the discrete timestep. Let us define an auxiliary set $\{\bar{\beta}_t = T\beta_t\}_{t=1}^{T}$ and obtain

$x_t = \sqrt{1 - \frac{\bar{\beta}_t}{T}}\,x_{t-1} + \sqrt{\frac{\bar{\beta}_t}{T}}\,\epsilon_{t-1}.$  (2.18)

As a preparation to convert functions from discrete-time to continuous-time, let $\bar{\beta}(\frac{t}{T}) = \bar{\beta}_t$, $x(\frac{t}{T}) = x_t$ and $\epsilon(\frac{t}{T}) = \epsilon_t$. We can now rewrite equation (2.18) with the difference $\Delta t = \frac{1}{T}$ and time $t \in \{0, \frac{1}{T}, \ldots, \frac{T-1}{T}\}$ as follows:

$x(t + \Delta t) = \sqrt{1 - \bar{\beta}(t + \Delta t)\,\Delta t}\;x(t) + \sqrt{\bar{\beta}(t + \Delta t)\,\Delta t}\;\epsilon(t)$  (2.19)
$\qquad\qquad\ \approx x(t) - \tfrac{1}{2}\,\bar{\beta}(t + \Delta t)\,\Delta t\;x(t) + \sqrt{\bar{\beta}(t + \Delta t)\,\Delta t}\;\epsilon(t)$  (2.20)
$\qquad\qquad\ \approx x(t) - \tfrac{1}{2}\,\bar{\beta}(t)\,\Delta t\;x(t) + \sqrt{\bar{\beta}(t)\,\Delta t}\;\epsilon(t),$  (2.21)

where the two approximate equalities hold when $\Delta t \to 0$. Then we convert $\Delta t$ to $\mathrm{d}t$, $\sqrt{\Delta t}\,\epsilon(t)$ to $\mathrm{d}w$ and obtain the following: $\mathrm{d}x = -\frac{1}{2}\,\bar{\beta}(t)\,x\,\mathrm{d}t + \sqrt{\bar{\beta}(t)}\,\mathrm{d}w$, which is a typical mean-reverting SDE (also known as the Ornstein–Uhlenbeck process [27]) that drifts towards a stationary distribution, i.e. a standard Gaussian in this case. Song et al. [22] also name it the variance preserving SDE (VP-SDE) and further illustrate that DDPM’s marginal distribution in equation (2.3) is a solution to the VP-SDE. Therefore, we can use either the diffusion reverse process (equation 2.13) or the reverse-time SDE (equation 2.15) to sample new data from noise with the same trained DDPM. In addition, the score can be directly computed from the marginal distribution in equation (2.3),

$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\,x_0}{1 - \bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}},$  (2.22)

where $\epsilon$ is from the reparameterization trick $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$ and can be approximated using the noise prediction network $\epsilon_\theta(x_t, t)$. Equation (2.22) thus shows how we convert the DM to an SDE (i.e. obtain the score from the noise $\epsilon_\theta$). Then, numerous efficient SDE/ODE solvers can be used to optimize DMs, further bringing interpretability and faster sampling [22].
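Equation (2.22) is what lets a trained DDPM double as a score model. A minimal sketch of the conversion, plus one backward Euler–Maruyama step of equation (2.15) with the VP drift from above (helper names are our own):

```python
import numpy as np

def score_from_eps(eps_pred, alpha_bar_t):
    """Equation (2.22): score = -eps / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / np.sqrt(1.0 - alpha_bar_t)

def reverse_vp_step(x, t, dt, beta, score, rng):
    """One backward Euler-Maruyama step of equation (2.15) with the
    VP drift f(x, t) = -beta(t) x / 2 and g(t) = sqrt(beta(t))."""
    drift = -0.5 * beta(t) * x - beta(t) * score(x, t)
    z = rng.standard_normal(x.shape)
    return x - drift * dt + np.sqrt(beta(t) * dt) * z
```

As a sanity check, if the data distribution is itself $\mathcal{N}(0, \mathbf{I})$, the true score is $-x$, and iterating `reverse_vp_step` keeps samples distributed close to $\mathcal{N}(0, \mathbf{I})$.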
(c). CDMs
So far, we have learned how to sample data from different types of DMs. However, all of the above methods are concerned with unconditional generation, which is insufficient for IR where we want to sample HQ images conditioned on degraded LQ images. Therefore, we present the CDM below.
Let us keep the diffusion process of equation (2.1) unchanged and reconstruct the reverse process in equation (2.4) with a condition $y$, i.e. $p_\theta(x_{0:T} \mid y)$. The conditional reverse kernel can then be modelled as

$p_\theta(x_{t-1} \mid x_t, y) = Z\,p_\theta(x_{t-1} \mid x_t)\,p_\phi(y \mid x_{t-1}),$  (2.23)

where $p_\phi(y \mid x_t)$ is an additional network that predicts $y$ from $x_t$, and $Z$ can be treated as a constant since it does not depend on $x_{t-1}$. This equation yields an adjusted mean $\hat{\mu}_\theta(x_t, t)$ for the posterior distribution of equation (2.10), given by [28]

$\hat{\mu}_\theta(x_t, t) = \mu_\theta(x_t, t) + s\,\tilde{\beta}_t\,\nabla_{x_t} \log p_\phi(y \mid x_t),$  (2.24)

where $s$ is the gradient scale (also called the guidance scale). Equation (2.22) further gives the score of the joint distribution $p(x_t, y)$:

$\nabla_{x_t} \log p(x_t, y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}} + \nabla_{x_t} \log p_\phi(y \mid x_t),$  (2.25)–(2.26)

which provides a conditional noise predictor $\hat{\epsilon}_\theta$ with the following form [28]

$\hat{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t) - s\,\sqrt{1 - \bar{\alpha}_t}\;\nabla_{x_t} \log p_\phi(y \mid x_t).$  (2.27)

The conditional sampling is performed as a regular DDPM by substituting this new noise predictor into the posterior mean of equation (2.9). The gradient scale $s$ controls the performance trade-off between image quality and fidelity, i.e. lower $s$ produces photo-realistic results, and higher $s$ yields better consistency with the condition.
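The guidance rule in equation (2.27) is essentially a one-liner. In the sketch below, the classifier gradient is passed in as a plain array since the network itself is out of scope here:

```python
import numpy as np

def guided_eps(eps_pred, grad_log_p_y, alpha_bar_t, s=1.0):
    """Conditional noise predictor of equation (2.27):
    eps_hat = eps - s * sqrt(1 - alpha_bar_t) * grad_x log p(y | x_t).
    Setting s = 0 recovers the unconditional predictor."""
    return eps_pred - s * np.sqrt(1.0 - alpha_bar_t) * grad_log_p_y
```

Sampling then proceeds exactly as in equation (2.13), with `guided_eps` substituted for the unconditional prediction.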
(i). Conditional SDE
Similar to guided diffusion, we can also change the score function to control the reverse-time SDE conditioned on the variable $y$, i.e. by replacing $\nabla_x \log p_t(x)$ with $\nabla_x \log p_t(x \mid y)$ in equation (2.15). Since $p_t(x \mid y) \propto p_t(x)\,p_t(y \mid x)$, the conditional score can be decomposed as

$\nabla_x \log p_t(x \mid y) = \nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x),$  (2.28)

which means that we can simulate the following reverse-time SDE for conditional generation:

$\mathrm{d}x = \big[f(x, t) - g(t)^2\big(\nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x)\big)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},$  (2.29)

where $\nabla_x \log p_t(x) \approx s_\theta(x, t)$. Song et al. [22] show that we can use a separate network to learn $p_t(y \mid x)$ (e.g. a time-dependent classifier if $y$ represents class labels), or estimate its log gradient directly with heuristics and domain knowledge.
With these CDMs, we can sample images with specified labels (such as dog and cat) or, as the main topic of this paper, recover clean HQ images from corrupted LQ inputs.
3. DMs for IR
Diffusion-based IR can be considered a special case of CDMs with image conditioning. We first introduce the concept of image degradation, which is a process that transforms an HQ image $x$ into an LQ image $y$ characterized by undesired corruptions. The general image degradation process can be modelled as follows:

$y = \mathcal{H}(x) + n,$  (3.1)

where $\mathcal{H}$ denotes the degradation function and $n$ is additive noise. As the examples show in figure 1, degradation can manifest itself in various forms such as noise, blur, rain and haze. IR then aims to reverse this process to obtain a clean HQ image $x$ from the corrupted LQ counterpart $y$.
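For concreteness, here is a toy instance of equation (3.1), with a 2×2 average-pooling operator standing in for $\mathcal{H}$ plus Gaussian noise; both choices are purely illustrative:

```python
import numpy as np

def avg_downsample(x, k=2):
    """A simple linear degradation H: k-by-k average pooling (downsampling)."""
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def degrade(x, H, sigma, rng):
    """Equation (3.1): y = H(x) + n, with n ~ N(0, sigma^2 I)."""
    y_clean = H(x)
    return y_clean + sigma * rng.standard_normal(y_clean.shape)
```

In the non-blind setting both `H` and `sigma` are assumed known; in the blind setting neither is available and must be learned implicitly from paired data.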
IR is further decomposed into two distinct settings, blind and non-blind IR, depending on whether or not the degradation parameters $\mathcal{H}$ and $n$ in equation (3.1) are known. Blind IR is the most general setting, in which no explicit knowledge of the degradation process is assumed. Blind IR methods instead utilize datasets of paired LQ–HQ images for supervised training of models. Non-blind IR methods, in contrast, assume access to $\mathcal{H}$ and $n$. This is an unrealistic assumption for many important real-world IR tasks, and thus limits non-blind methods to a subset of specific IR tasks such as bicubic downsampling, Gaussian blurring, colourization, or inpainting with a fixed mask. In the following, we first describe the most straightforward diffusion-based approach for general blind IR tasks in §3a. Representative non-blind diffusion-based approaches are then covered in §3b. Finally, §3c covers more recent methods for general blind IR.
(a). Conditional direct DM
The most straightforward approach for applying DMs to general IR tasks is to use the CDM with image guidance from §2c. In the IR context, the term $p_\phi(y \mid x_t)$ in equation (2.27) represents the image degradation model, which can be either a fixed operator with known parameters or a learnable neural network, depending on the task. It is also noted that strong guidance (large $s$ in equation 2.27) leads to good fidelity but visually LQ results (e.g. over-smooth images), while weak guidance (small $s$) has the opposite effect [28]. Now, let us consider the extreme case: how about decreasing $s$ to zero, i.e. no guidance? A simple observation from equation (2.27) is that with $s = 0$, the conditional noise predictor reduces to the unconditional one: $\hat{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t)$. We can instead inject the condition $y$ directly into the noise network, and the objective for diffusion-based IR is given by

$L_{\mathrm{IR}} = \mathbb{E}_{t, x_0, y, \epsilon}\big[\big\|\epsilon - \epsilon_\theta(x_t, y, t)\big\|^2\big].$  (3.2)

We name this the conditional direct diffusion model (CDDM), which essentially follows the same training and sampling procedure as DDPM, except for the condition $y$ in the noise prediction as shown in figure 4. As a result, the generated image can be of very high visual quality (it looks realistic), but often has limited consistency with the original HQ image [16], as can be observed for the examples in the right-hand part of figure 4. Fortunately, some IR tasks, such as image super-resolution, colourization and inpainting, are highly ill-posed and can tolerate diverse predictions. CDDM can then be effectively applied to these tasks for photo-realistic IR.
Figure 4.
Left: Overview of CDDM on the face inpainting case. The only change compared to DDPM (figure 2) is the reverse transition model $p_\theta(x_{t-1} \mid x_t, y)$, which involves the LQ image $y$ in sampling to generate the corresponding HQ image. Right: Two IR examples (image super-resolution and inpainting) performed under the CDDM framework. These results look realistic but are not consistent with the original image.
One typical method is SR3 [16], which employs CDDM with a few modifications for image super-resolution. To condition the model on the LQ image $y$, SR3 up-samples $y$ to the target resolution so that it can be concatenated with the intermediate state $x_t$ along the channel dimension. Subsequently, Palette [29] extends SR3 to general IR tasks including colourization, inpainting, uncropping and JPEG restoration. There is more work [17,30] employing the same ‘direct diffusion’ strategy but adopting different restoration pipelines and additional networks for task-specific model learning. More recently, Wang et al. [31] propose StableSR, which further adapts a large-scale pretrained DM (Stable Diffusion [14]) for IR, by tweaking the noise predictor with image conditioning in the same way as for CDDM.
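Training a CDDM thus differs from plain DDPM training only in that the noise predictor also sees the LQ image. A minimal sketch of the objective in equation (3.2), where `eps_model(x_t, y, t)` is a placeholder for the conditional network:

```python
import numpy as np

def cddm_loss(eps_model, x0, y, alpha_bars, rng):
    """Monte Carlo estimate of the CDDM objective (equation (3.2)).
    Identical to L_simple except the predictor is also conditioned on y."""
    t = int(rng.integers(0, len(alpha_bars)))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, y, t)) ** 2)
```

In practice, as in SR3, the conditioning is often realized by channel-wise concatenation of $y$ (up-sampled to the target resolution) with $x_t$ at the network input.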
(b). Training-free CDMs
The key to the success of CDDM in IR lies in learning the conditional noise predictor $\epsilon_\theta(x_t, y, t)$ by optimizing equation (3.2) on a dataset of paired LQ–HQ images. Unfortunately, this means that $\epsilon_\theta$ needs to be re-trained to handle tasks which are not included in the current training data, even in the non-blind setting where the degradation parameters $\mathcal{H}$ and $n$ in equation (3.1) are known. For non-blind IR, a training-free approach can instead be derived by directly incorporating the degradation function into a pretrained unconditional DM, such as a DDPM.

With known degradation parameters, the likelihood term $p(y \mid x)$ also becomes accessible: $p(y \mid x) = \mathcal{N}\big(y;\ \mathcal{H}(x),\ \sigma^2\mathbf{I}\big)$, if the noise $n \sim \mathcal{N}(0, \sigma^2\mathbf{I})$ is Gaussian. Traditional IR approaches often solve this problem using maximum a posteriori (MAP) estimation [6], as follows:

$\hat{x} = \arg\max_x\ \log p(y \mid x) + \log p(x),$  (3.3)

where $p(x)$ is a prior term empirically chosen to characterize the prior knowledge of $x$. Then, a natural idea is to incorporate a pretrained unconditional DDPM as a powerful learned image prior $p(x)$. Specifically, recall the conditional score of equation (2.28) in the form

$\nabla_{x_t} \log p_t(x_t \mid y) = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t),$  (3.4)

where $x_t$ matches the diffusion state in DDPM, and the unconditional score $\nabla_{x_t} \log p_t(x_t)$ can be obtained from equation (2.22) and approximated with DDPM's noise predictor, as $\nabla_{x_t} \log p_t(x_t) \approx -\epsilon_\theta(x_t, t)/\sqrt{1 - \bar{\alpha}_t}$. However, computing $p_t(y \mid x_t)$ in equation (3.4) is difficult since there is no obvious relationship between $y$ and the intermediate state $x_t$. Fortunately, with Gaussian noise $n$, Chung et al. [32] propose an approximation for $p_t(y \mid x_t)$ at each timestep $t$:

$p_t(y \mid x_t) \approx p\big(y \mid \hat{x}_0(x_t)\big), \quad \hat{x}_0(x_t) := \mathbb{E}[x_0 \mid x_t] = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t + (1 - \bar{\alpha}_t)\,\nabla_{x_t} \log p_t(x_t)\big).$  (3.5)

This can be obtained via Tweedie’s formula [32–35]. The approximation above is motivated by a Dirac approximation $p(x_0 \mid x_t) \approx \delta_{\hat{x}_0}(x_0)$, where $\hat{x}_0$ is any Monte Carlo sample of $p(x_0 \mid x_t)$. However, the approximation usually exhibits high variance since it uses a single Monte Carlo sample, while drawing more samples incurs more computations. Moreover, it is worth noting that $p(y \mid \hat{x}_0)$ is a tractable Gaussian: $p(y \mid \hat{x}_0) = \mathcal{N}\big(y;\ \mathcal{H}(\hat{x}_0),\ \sigma^2\mathbf{I}\big)$. Computing $\nabla_{x_t} \log p(y \mid \hat{x}_0(x_t))$ and substituting it for $\nabla_{x_t} \log p_t(y \mid x_t)$ in equation (3.4) thus gives the following:

$\nabla_{x_t} \log p_t(x_t \mid y) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}} - \frac{1}{2\sigma^2}\,\nabla_{x_t}\big\|y - \mathcal{H}\big(\hat{x}_0(x_t)\big)\big\|_2^2.$  (3.6)
We can then incorporate this approximation equation (3.6) into the sampling of a pretrained DDPM,

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(x_t + \beta_t\,\nabla_{x_t} \log p_t(x_t \mid y)\big) + \sqrt{\tilde{\beta}_t}\,z$  (3.7)
$\qquad \approx \frac{1}{\sqrt{\alpha_t}}\Big(x_t + \beta_t\Big(-\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}} - \frac{1}{2\sigma^2}\,\nabla_{x_t}\big\|y - \mathcal{H}\big(\hat{x}_0(x_t)\big)\big\|_2^2\Big)\Big) + \sqrt{\tilde{\beta}_t}\,z$  (3.8)
$\qquad = \underbrace{\frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big) + \sqrt{\tilde{\beta}_t}\,z}_{\text{diffusion term}}\ -\ \frac{\beta_t}{2\sigma^2\sqrt{\alpha_t}}\,\nabla_{x_t}\big\|y - \mathcal{H}\big(\hat{x}_0(x_t)\big)\big\|_2^2,$  (3.9)

where the first line is derived from equations (2.13) and (2.22) with the additional condition $y$. Note that the diffusion term in equation (3.9) is actually an unconditional sampling step in DDPM, where $\epsilon_\theta(x_t, t)$ is obtained from equation (2.22) as $\epsilon_\theta(x_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\,\nabla_{x_t} \log p_t(x_t)$. By letting $\zeta$ represent the step size of the data consistency term and simplifying the diffusion term, we then finally have

$x_{t-1} = \mu_\theta(x_t, t) + \sqrt{\tilde{\beta}_t}\,z - \zeta\,\nabla_{x_t}\big\|y - \mathcal{H}\big(\hat{x}_0(x_t)\big)\big\|_2^2,$  (3.10)

where $\mu_\theta(x_t, t)$ and $\tilde{\beta}_t$ are the posterior mean and variance of equation (2.10), respectively. This approach is called the diffusion posterior sampling (DPS) [32,36]. Note that equation (3.10) is conceptually similar to the MAP estimate in equation (3.3), with $\zeta\,\nabla_{x_t}\|y - \mathcal{H}(\hat{x}_0(x_t))\|_2^2$ as the data consistency term and $\mu_\theta$ being a diffusion-based image prior. When the degradation parameters of equation (3.1) are known, DPS thus utilizes this knowledge to guide the sampling process of a pretrained DDPM, encouraging generated images to be consistent with the LQ input image $y$.
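For a linear degradation $\mathcal{H}$ with known adjoint, one DPS iteration of equation (3.10) can be sketched as follows. Note one loud simplification: we drop the chain rule through the noise network when differentiating $\|y - \mathcal{H}(\hat{x}_0(x_t))\|^2$ (real DPS implementations use automatic differentiation), so this is a rough illustrative sketch, not the exact method of [32]:

```python
import numpy as np

def dps_step(x, t, y, H, H_adj, eps_model, betas, alpha_bars, zeta, rng):
    """One reverse step in the spirit of equation (3.10): an unconditional
    DDPM step plus a data-consistency gradient step."""
    abar = alpha_bars[t]
    eps = eps_model(x, t)
    x0_hat = (x - np.sqrt(1.0 - abar) * eps) / np.sqrt(abar)  # Tweedie estimate
    # Unconditional DDPM step (equation (2.13)).
    abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    beta_tilde = (1.0 - abar_prev) / (1.0 - abar) * betas[t]
    z = rng.standard_normal(x.shape) if t > 0 else np.zeros(x.shape)
    x_prev = (x - betas[t] / np.sqrt(1.0 - abar) * eps) / np.sqrt(1.0 - betas[t])
    x_prev = x_prev + np.sqrt(beta_tilde) * z
    # Data-consistency gradient of ||y - H(x0_hat)||^2 w.r.t. x, with
    # d(x0_hat)/dx crudely approximated by 1/sqrt(abar) (network term dropped).
    grad = -2.0 * H_adj(y - H(x0_hat)) / np.sqrt(abar)
    return x_prev - zeta * grad
```

Here `H` and `H_adj` are the forward operator and its adjoint (e.g. blur and its transpose), and `zeta` plays the role of the step size $\zeta$.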
However, DPS does rely on the approximation in equation (3.5), for which the approximation error approaches zero only when the noise of $y$ has a high variance: $\sigma^2 \to \infty$. For the case where the LQ image is noiseless, $\sigma = 0$, we would prefer to introduce the approach from figure 5 where the unconditionally generated state $\hat{x}_{t-1}$ is refined using the known degradation $\mathcal{H}$ and the LQ image $y$. More specifically, since the term $p_t(y \mid x_t)$ now is unattainable (or non-approximable), we instead apply the forward marginal transition of equation (2.3) also on the LQ image to obtain $y_{t-1} \sim q(y_{t-1} \mid y)$, which is an intermediate state between $y$ and Gaussian noise. Then, we impose data consistency by projecting onto a conditional path as follows:
Figure 5.
Overview of the projection-based CDM. There are two paths for the HQ image $x$ and LQ image $y$, generated from the same DM. At each reverse step $t$, the sampling first leverages the pretrained DDPM for unconditional generation, i.e. $\hat{x}_{t-1} \sim p_\theta(\hat{x}_{t-1} \mid x_t)$, and then refines $\hat{x}_{t-1}$ to $x_{t-1}$ with functions $\Phi$ and $\Psi$ as $x_{t-1} = \Phi(y_{t-1}) + \Psi(\hat{x}_{t-1})$, where $y_{t-1}$ is obtained by applying the forward marginal transition equation (2.3) on the LQ image as $y_{t-1} \sim q(y_{t-1} \mid y)$.

$x_{t-1} = \Phi(y_{t-1}) + \Psi(\hat{x}_{t-1}),$  (3.11)

where $\Phi$ and $\Psi$ are functions derived from the known degradation $\mathcal{H}$. For computational efficiency, the two functions are typically assumed to be linear and tailored to specific IR tasks. This projection-based method is referred to as iterative latent variable refinement [37]. In addition, for linear degradation problems, we can refine the intermediate state $x_t$ by decomposing the degradation operator into partitions and then combining them with the LQ image $y$ in the reverse process [15], or optimize $x_t$ using the Bayesian framework directly [38]. These approaches are similar to the projection-based method but can be more computationally efficient.
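A minimal sketch of the projection step of equation (3.11), with a generic low-pass filter standing in for the task-specific $\Phi$ and $\Psi = I - \Phi$ (an ILVR-style choice; the operator itself is an assumption here). `x_uncond` is the unconditional sample $\hat{x}_{t-1}$ from the pretrained DDPM:

```python
import numpy as np

def project(x_uncond, y, t, alpha_bars, low_pass, rng):
    """Equation (3.11) with Phi = low_pass and Psi = identity - low_pass:
    keep the low frequencies of a noised LQ image y_{t-1} (via equation (2.3))
    and the remaining content of the unconditional sample."""
    eps = rng.standard_normal(y.shape)
    y_t = np.sqrt(alpha_bars[t]) * y + np.sqrt(1.0 - alpha_bars[t]) * eps
    return low_pass(y_t) + (x_uncond - low_pass(x_uncond))
```

For super-resolution, `low_pass` would typically be a downsample-then-upsample operator matched to the known degradation scale.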
Recently, another class of approaches for training-free CDMs has emerged which is based on Feynman–Kac models and sequential Monte Carlo (SMC) samplers [39–41]. At their core, they wrap the approximations of $p_t(y \mid x_t)$ in the proposals of an SMC sampler, so that the marginal distributions of the sampler anneal to the target conditional one. This approach is statistically exact, regardless of the approximations in $p_t(y \mid x_t)$, in the sense that as the number of particles used in the SMC sampler goes to infinity, the resulting population converges in distribution to the target. As such, this type of method improves significantly over DPS [32] in terms of statistical errors. However, it comes at a cost: storing a population of particles does not scale well in memory with the problem dimension, and performance hinges on the efficiency of the proposals. To improve the sampler, Cardoso et al. [40] consider linear Gaussian likelihood models and propose efficient proposals based on an inpainting problem, while Janati et al. [41] develop a divide-and-conquer construction to set intermediate target distributions. Note that although Corenflos et al. [42] and Dou & Song [43] also use SMC samplers, they target different Feynman–Kac models. Moreover, the methods in [42] are training-free only for special problems (e.g. inpainting).
(c). Diffusion process towards degraded images
In previous sections, we have presented several diffusion-based IR methods, both for the blind and non-blind settings. However, these methods all generate images starting from Gaussian noise, which intuitively should be inefficient for IR tasks, given that input LQ images are closely related to the corresponding HQ images. That is, it should be easier to translate directly from the LQ image to the HQ image, rather than from noise to HQ image, as shown in figure 6. To address this problem, for general blind IR tasks, Luo et al. [18] propose the IR-SDE that models image degradation with a mean-reverting SDE:
Figure 6.
Overview of the approach that performs diffusion towards degraded images. Here, the LQ image $y$ is involved in both the forward and backward processes. Moreover, the terminal state $x_T$ is a (noisy) LQ image rather than Gaussian noise.
$\mathrm{d}x = \theta_t\,(\mu - x)\,\mathrm{d}t + \sigma_t\,\mathrm{d}w,$  (3.12)

where $\mu$ is the state mean the SDE drifts to. The parameters $\theta_t$ and $\sigma_t$ are predefined and they control the speed of the mean-reversion and the stochastic volatility, respectively. It is noted that the VP-SDE [22] is a special case of equation (3.12) where $\mu$ is set to 0. Moreover, the SDE in equation (3.12) is proven to be tractable when the coefficients satisfy $\sigma_t^2/\theta_t = 2\lambda^2$ for all timesteps [18]. Similar to DDPM, we can obtain the marginal transition kernel $p(x_t \mid x_0)$, which is a Gaussian given by

$p(x_t \mid x_0) = \mathcal{N}\big(x_t;\ m_t := \mu + (x_0 - \mu)\,\mathrm{e}^{-\bar{\theta}_t},\ v_t := \lambda^2\big(1 - \mathrm{e}^{-2\bar{\theta}_t}\big)\big),$  (3.13)

where $\bar{\theta}_t = \int_0^t \theta_s\,\mathrm{d}s$. As $t \to \infty$, the terminal distribution converges to a stationary Gaussian with mean $\mu$ and variance $\lambda^2$. By setting the HQ image as the initial state $x_0$ and the LQ image as the terminal state mean $\mu$, this SDE iteratively transforms the HQ image into the LQ image with additional noise (where the noise level is fixed to $\lambda$). Then, we can restore the HQ image based on the reverse-time process of equation (3.12) as follows:

$\mathrm{d}x = \big[\theta_t\,(\mu - x) - \sigma_t^2\,\nabla_x \log p_t(x)\big]\,\mathrm{d}t + \sigma_t\,\mathrm{d}\bar{w}.$  (3.14)

Notably, the score function is tractable when conditioning on the known $x_0$ in training, as $\nabla_{x_t} \log p(x_t \mid x_0) = -(x_t - m_t)/v_t$, where $m_t$ and $v_t$ are the mean and variance of equation (3.13), respectively. Learning this score with a neural network is similar to denoising score matching [25] but the target score is directly computed from the training distributions.
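The tractable marginal of equation (3.13) is easy to check numerically. A small sketch (function names are our own; $\bar{\theta}_t$ is passed in directly rather than integrated):

```python
import numpy as np

def irsde_marginal(x0, mu, theta_bar_t, lam):
    """Mean and variance of p(x_t | x_0) in equation (3.13):
    m_t = mu + (x0 - mu) * exp(-theta_bar_t),
    v_t = lam^2 * (1 - exp(-2 * theta_bar_t))."""
    m_t = mu + (x0 - mu) * np.exp(-theta_bar_t)
    v_t = lam ** 2 * (1.0 - np.exp(-2.0 * theta_bar_t))
    return m_t, v_t
```

At $t = 0$ this returns $(x_0, 0)$, and as $\bar{\theta}_t \to \infty$ it converges to the stationary Gaussian $\mathcal{N}(\mu, \lambda^2)$, i.e. a noisy LQ image when $\mu$ is set to $y$.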
However, IR-SDE still needs to add noise to the LQ image as a terminal state $x_T \sim \mathcal{N}(\mu, \lambda^2\mathbf{I})$. For a fixed point-to-point mapping with a diffusion process, we further introduce the diffusion bridge (DB) [44], which can naturally transfer complex data distributions to reference distributions, i.e. directly from HQ to LQ images, without adding noise. More specifically, given a diffusion process defined by a forward SDE as in equation (2.14), Rogers & Williams [45] show that we can force the SDE to drift from the HQ image $x_0$ to a particular condition (the LQ image $y$) via Doob’s $h$-transform [46]:

$\mathrm{d}x = \big[f(x, t) + g(t)^2\,h(x, t, y, T)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}w,$  (3.15)

where $h(x, t, y, T) = \nabla_x \log p\big(x_T = y \mid x_t = x\big)$ is the gradient of the log transition kernel from $x_t$ to $x_T$, derived from the original SDE. By setting the terminal state $x_T = y$, the term $h$ pushes each forward step towards the end condition $y$, which exactly models the image degradation process. The corresponding reverse-time SDE of equation (3.15) can then be written as

$\mathrm{d}x = \big[f(x, t) + g(t)^2\,h(x, t, y, T) - g(t)^2\,\nabla_x \log p_t(x \mid x_T = y)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},$  (3.16)

where $\nabla_x \log p_t(x \mid x_T = y)$ is the conditional score function which can be learned via score-matching. The HQ image can then be recovered from the LQ image by iteratively running equation (3.16) backwards in time as a traditional SDE solver. Note that we can design specific SDEs (e.g. VP/VE-SDE [22]) to make the function $h$ tractable [44,47,48]. The simplest case is the Brownian bridge [44] which constructs the marginal distribution as $p(x_t \mid x_0, x_T) = \mathcal{N}\big(x_t;\ (1 - \frac{t}{T})\,x_0 + \frac{t}{T}\,x_T,\ \frac{t(T - t)}{T}\mathbf{I}\big)$. Another particular case is the Schrödinger bridge [48], which aims to compute a diffusion process that interpolates within the optimal coupling (when the reference measure of the bridge is chosen to be a Brownian motion) between the HQ and LQ image distributions [49]. The solution of the Schrödinger bridge converges weakly to an optimal transport plan with respect to the 2-Wasserstein distance [48,50]. Most DB frameworks learn the noise directly by adopting a score reparameterization trick similar to equation (2.22), which leads to the following objective: $\mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]$ with $x_t = m_t + \sqrt{v_t}\,\epsilon$, where $m_t$ and $v_t$ are the marginal mean and variance of the forward process. More recently, Yue et al. [51] further propose to apply the DB to IR-SDE as the generalized Ornstein–Uhlenbeck bridge to achieve better performance. However, designing the forward SDE in equation (3.15) with a tractable yet effective $h$ remains a challenge and is under-explored in IR. With the growing popularity of Score-SDEs and DBs, we hope that future approaches will offer various efficient and elegant solutions to general IR problems.
4. Conclusion and discussion
DMs have shown incredible capabilities and gained significant popularity in generative modelling. In particular, the mathematics behind them make these models exceedingly elegant. Building on their core concepts, we have described several approaches that effectively employ DMs for various IR tasks, achieving impressive results. However, it is also crucial to highlight the main challenges and further outline potential directions for future work.
(a). Difficulty in processing out-of-distribution degradations
Applying trained DMs to out-of-distribution (OOD) data often leads to inferior performance and produces visually unpleasant artefacts [52], as shown in figure 7. Some research [31] proposes to address this issue by leveraging the capable stable diffusion model [14] together with a feature control module [54]. However, such approaches still have to fine-tune the stable DM on specific IR datasets. Moreover, the commonly used synthetic data strategy [55] only simulates known degradations such as noise, blur and compression, and cannot cover all corruption types that might be encountered in real-world applications. Inspired by the success of large language models and vision-language models, more recent approaches [5,52,53] have begun to explore various language-based image representations in IR. The main idea is to produce ‘clean’ text descriptions of input LQ images, describing the main image content without undesired degradation-related concepts, and use these to guide the restoration process.
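In such language-guided approaches, the text condition is typically injected through the standard classifier-free guidance rule. The sketch below is a schematic illustration under the assumption that `eps_text` is the noise prediction conditioned on a ‘clean’ caption of the LQ image; the function name is ours.

```python
import numpy as np

def guided_eps(eps_uncond, eps_text, w=2.0):
    # Classifier-free guidance: move the unconditional noise prediction
    # towards the text-conditioned direction with guidance weight w.
    # w = 0 ignores the text; larger w follows the 'clean' caption more.
    return eps_uncond + w * (eps_text - eps_uncond)
```

The combined prediction then replaces the unconditional one at every reverse diffusion step.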
Figure 7.
Failed examples of applying a trained DM [53] to real-world and OOD LQ input images. In the left-hand example, the predicted HQ image contains unrecognizable text. In the right-hand example, the generated window shutters are visually unpleasant and inconsistent with the LQ input image.
(b). Inconsistency in image generation
While DMs produce photo-realistic results, the generated details are often inconsistent with the original input, especially regarding texture and text information, as shown in the right-hand part of figure 4 and in figure 7. This is mainly due to the intrinsic bias in the multi-step noise/score estimation and the stochasticity of the noise injection at each iteration. One solution is to add a predictor that generates an initial HQ estimate (trained with a reconstruction loss) and then gradually adds more details via a diffusion process [30]. However, this requires an additional network, and the performance depends heavily on the trained predictor. IR-SDE [18] proposes a maximum likelihood objective to learn the optimal restoration path, but its reverse-time process still contains noise injection (i.e. a Wiener process), leading to unsatisfactory results. Recently, flow matching and optimal transport have shown great potential in image generation. In particular, they can form straight-line trajectories at inference, which are more efficient than the curved paths of DMs [56]. Applying such methods to IR tasks is therefore a promising direction for future work.
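As a toy illustration of why straight-line trajectories are efficient, the sketch below uses the linear interpolation path common to flow matching and rectified flow: with the exact (constant) velocity target $x_1 - x_0$, a single Euler step already reaches the target, whereas curved diffusion paths need many small steps. The function names are our own.

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    # Linear path x_t = (1 - t) x0 + t x1 with constant target velocity
    # v = x1 - x0, as used in flow matching / rectified flow training.
    xt = (1.0 - t) * x0 + t * x1
    return xt, x1 - x0

def euler_sample(x, v_fn, n_steps=10):
    # Integrate the ODE dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps.
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_fn(x, i * dt)
    return x
```

In practice `v_fn` would be a trained network, and the straightness of the learned path determines how few steps suffice.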
(c). High computational cost and inference time
Most diffusion-based IR methods require a large number of diffusion steps to generate the final HQ image (typically 1000 steps for DDPMs), which is both time-consuming and computationally costly, posing challenges for deployment in real-world applications. This problem can be alleviated using latent DMs [14,57] or efficient sampling techniques [26,58]. Unfortunately, these are not always suitable for IR tasks, since latent DMs often produce colour shifts [31] and efficient sampling can degrade image generation quality [58]. Exploiting the particular structure of IR, several works [18,42,48] design the diffusion process towards degraded images (§3c), such that inference can start from the LQ image rather than from Gaussian noise. While this makes the sampling process more efficient (typically requiring fewer than 100 diffusion steps), further improvements could be possible by designing more effective SDEs or DB functions.
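The efficiency gain from starting at the LQ image can be sketched as follows. For SDEs whose terminal distribution concentrates around the LQ image $y$ with a small variance $\lambda^2$ (as in mean-reverting constructions), inference is initialized by lightly noising $y$ instead of drawing pure Gaussian noise; `reverse_step_fn` below is a hypothetical placeholder for one learned denoising step.

```python
import numpy as np

def init_from_lq(lq, lam=0.1, rng=None):
    # Start sampling from a lightly noised LQ image, x_T ~ N(lq, lam^2 I),
    # rather than from pure Gaussian noise N(0, I).
    rng = np.random.default_rng() if rng is None else rng
    return lq + lam * rng.standard_normal(np.shape(lq))

def restore(lq, reverse_step_fn, n_steps=100, lam=0.1, rng=None):
    # Generic reverse loop: since x already carries image content, far
    # fewer steps are needed than when starting from N(0, I).
    x = init_from_lq(lq, lam, rng)
    for i in reversed(range(n_steps)):
        x = reverse_step_fn(x, i / n_steps)
    return x
```

The loop structure is the same as a standard diffusion sampler; only the initialization (and the SDE it matches) changes.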
(i). Closing
We have covered the basics of DMs and key techniques for applying them to IR tasks. This is an active research area with many interesting challenges and potential future directions, such as achieving photo-realistic yet consistent image generation, robustness to real-world image degradations, and more computationally efficient sampling. We hope that this review paper offers a foundational understanding that enables readers to gain deeper insights into the mathematical principles underlying advanced diffusion-based IR approaches.
Contributor Information
Ziwei Luo, Email: ziwei.luo@it.uu.se.
Fredrik Gustafsson, Email: fredrik.gustafsson@ki.se.
Zheng Zhao, Email: zheng.zhao@liu.se; zz@zabmon.com.
Jens Sjölund, Email: jens.sjolund@it.uu.se.
Thomas Schön, Email: thomas.schon@it.uu.se.
Data accessibility
This article has no additional data.
Declaration of AI use
We have not used AI-assisted technologies in creating this article.
Authors’ contributions
Z.L.: conceptualization, investigation, methodology, project administration, validation, writing—original draft, writing—review and editing; F.G.: supervision, writing—original draft, writing—review and editing; Z.Z.: investigation, methodology, writing—review and editing; J.S.: supervision, writing—review and editing; T.S.: conceptualization, funding acquisition, supervision, writing—review and editing.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration
We declare we have no competing interests.
Funding
This research was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, by the project Deep Probabilistic Regression – New Models and Learning Algorithms (contract number: 2021-04301) funded by the Swedish Research Council, and by the Kjell & Märta Beijer Foundation.
References
- 1. Buades A, Coll B, Morel JM. 2005. A review of image denoising algorithms, with a new one. Multiscale Model. Simul. 4 , 490–530. ( 10.1137/040616024) [DOI] [Google Scholar]
- 2. Shan Q, Jia J, Agarwala A. 2008. High-quality motion deblurring from a single image. ACM Trans. Graph. 27 , 1–10. ( 10.1145/1360612.1360672) [DOI] [Google Scholar]
- 3. Jose Valanarasu JM, Yasarla R, Patel VM. 2022. Transweather: transformer-based restoration of images degraded by adverse weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2353–2363. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPR52688.2022.00239) [DOI] [Google Scholar]
- 4. Le H, Samaras D. 2019. Shadow removal via shadow image decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8578–8587. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/ICCV.2019.00867) [DOI] [Google Scholar]
- 5. Luo Z, Gustafsson FK, Zhao Z, Sjölund J, Schön TB. 2024. Controlling vision-language models for universal image restoration. In The Twelfth International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 6. Banham MR, Katsaggelos AK. 1997. Digital image restoration. IEEE Signal Process. Mag. 14 , 24–41. ( 10.1109/79.581363) [DOI] [Google Scholar]
- 7. Orfanidis SJ. 1995. Introduction to signal processing. Englewood Cliffs, NJ: Prentice Hall. [Google Scholar]
- 8. Rabiner LR, Gold B. 1975. Theory and application of digital signal processing. Englewood Cliffs, NJ: Prentice-Hall. [Google Scholar]
- 9. Kundur D, Hatzinakos D. 1996. Blind image deconvolution. IEEE Signal Process. Mag. 13 , 43–64. ( 10.1109/79.489268) [DOI] [Google Scholar]
- 10. Zhao H, Gallo O, Frosio I, Kautz J. 2016. Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 3 , 47–57. ( 10.1109/tci.2016.2644865) [DOI] [Google Scholar]
- 11. Ledig C, et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPR.2017.19) [DOI] [Google Scholar]
- 12. Johnson J, Alahi A, Fei-Fei L. 2016. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pp. 694–711. Cham, Switzerland: Springer International Publishing. ( 10.1007/978-3-319-46475-6_43) [DOI] [Google Scholar]
- 13. Ho J, Jain A, Abbeel P. 2020. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33 , 6840–6851. ( 10.48550/arXiv.2006.11239) [DOI] [Google Scholar]
- 14. Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPR52688.2022.01042) [DOI] [Google Scholar]
- 15. Kawar B, Elad M, Ermon S, Song J. 2022. Denoising diffusion restoration models. Adv. Neural Inf. Process. Syst. 35 , 23593–23606. ( 10.48550/arXiv.2201.11793) [DOI] [Google Scholar]
- 16. Saharia C, Ho J, Chan W, Salimans T, Fleet DJ, Norouzi M. 2022. Image super-resolution via iterative refinement. IEEE Trans. Pattern Anal. Mach. Intell. 45 , 4713–4726. ( 10.1109/TPAMI.2022.3204461) [DOI] [PubMed] [Google Scholar]
- 17. Özdenizci O, Legenstein R. 2023. Restoring vision in adverse weather conditions with patch-based denoising diffusion models. IEEE Trans. Pattern Anal. Mach. Intell. 45 , 10346–10357. ( 10.1109/tpami.2023.3238179) [DOI] [PubMed] [Google Scholar]
- 18. Luo Z, Gustafsson FK, Zhao Z, Sjölund J, Schön TB. 2023. Image restoration with mean-reverting stochastic differential equations. In International Conference on Machine Learning, pp. 23045–23066. Red Hook, NY. [Google Scholar]
- 19. Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. Red Hook, NY. [Google Scholar]
- 20. Vaswani A. 2017. Attention is all you need. In Advances in neural information processing systems. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 21. Song Y, Ermon S. 2019. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 32 , 11895–11907. ( 10.48550/arXiv.1907.05600) [DOI] [Google Scholar]
- 22. Song Y, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B. 2020. Score-based generative modeling through stochastic differential equations. arXiv. ( 10.48550/arXiv.2011.13456) [DOI]
- 23. Anderson BDO. 1982. Reverse-time diffusion equation models. Stoch. Process. Their Appl. 12 , 313–326. ( 10.1016/0304-4149(82)90051-5) [DOI] [Google Scholar]
- 24. Hyvärinen A, Dayan P. 2005. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6 , 695–709. [Google Scholar]
- 25. Vincent P. 2011. A connection between score matching and denoising autoencoders. Neural Comput. 23 , 1661–1674. ( 10.1162/neco_a_00142) [DOI] [PubMed] [Google Scholar]
- 26. Lu C, Zhou Y, Bao F, Chen J, Li C, Zhu J. 2022. DPM-Solver: a fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Adv. Neural Inf. Process. Syst. 35 , 5775–5787. ( 10.48550/arXiv.2206.00927) [DOI] [Google Scholar]
- 27. Gillespie DT. 1996. Exact numerical simulation of the ornstein-uhlenbeck process and its integral. Phys. Rev. E 54 , 2084–2091. ( 10.1103/physreve.54.2084) [DOI] [PubMed] [Google Scholar]
- 28. Dhariwal P, Nichol A. 2021. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 34 , 8780–8794. ( 10.48550/arXiv.2105.05233) [DOI] [Google Scholar]
- 29. Saharia C, Chan W, Chang H, Lee C, Ho J, Salimans T, Fleet D, Norouzi M. 2022. Palette: image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10. ( 10.1145/3528233.3530757) [DOI] [Google Scholar]
- 30. Whang J, Delbracio M, Talebi H, Saharia C, Dimakis AG, Milanfar P. 2022. Deblurring via stochastic refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16293–16303. Los Alamitos, CA: IEEE Computer Society. [Google Scholar]
- 31. Wang J, Yue Z, Zhou S, Chan KCK, Loy CC. 2024. Exploiting diffusion prior for real-world image super-resolution. Int. J. Comput. Vis. 132 , 1–21. ( 10.1007/s11263-024-02168-7) [DOI] [Google Scholar]
- 32. Chung H, Kim J, Mccann MT, Klasky ML, Ye JC. 2022. Diffusion posterior sampling for general noisy inverse problems. arXiv Preprint arXiv:2209.14687. [Google Scholar]
- 33. Efron B. 2011. Tweedie’s formula and selection bias. J. Am. Stat. Assoc. 106 , 1602–1614. ( 10.1198/jasa.2011.tm11181) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Song J, Vahdat A, Mardani M, Kautz J. 2023. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 35. Boys B, Girolami M, Pidstrigach J, Reich S, Mosca A, Akyildiz OD. 2023. Tweedie moment projected diffusions for inverse problems. arXiv. ( 10.48550/arXiv.2310.06721) [DOI]
- 36. Bruna J, Han J. 2024. Posterior sampling with denoising oracles via tilted transport. arXiv. ( 10.48550/arXiv.2407.00745) [DOI]
- 37. Choi J, Kim S, Jeong Y, Gwon Y, Yoon S. 2021. ILVR: conditioning method for denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. Los Alamitos, CA: IEEE Computer Society. [Google Scholar]
- 38. Zhang G, Ji J, Zhang Y, Yu M, Jaakkola T, Chang S. 2023. Towards coherent image inpainting using denoising diffusion implicit models. In International Conference on Machine Learning, pp. 41164–41193. Red Hook, NY: PMLR. [Google Scholar]
- 39. Wu L, Trippe B, Naesseth C, Blei D, Cunningham JP. 2023. Practical and asymptotically exact conditional sampling in diffusion models. In Advances in neural information processing systems, pp. 31372–31403, vol. 36. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 40. Cardoso G, Idrissi YJ, Corff SL, Moulines E. 2024. Monte Carlo guided denoising diffusion models for Bayesian linear inverse problems. In The Twelfth International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 41. Janati Y, Durmus A, Moulines E, Olsson J. 2024. Divide-and-conquer posterior sampling for denoising diffusion priors. arXiv. ( 10.48550/arXiv.2403.11407) [DOI]
- 42. Corenflos A, Zhao Z, Särkkä S, Sjölund J, Schön TB. 2024. Conditioning diffusion models by explicit forward-backward bridging. arXiv. ( 10.48550/arXiv.2405.13794) [DOI]
- 43. Dou Z, Song Y. 2024. Diffusion posterior sampling for linear inverse problem solving: a filtering perspective. In The Twelfth International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 44. Li B, Xue K, Liu B, Lai YK. 2023. BBDM: image-to-image translation with Brownian bridge diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1952–1961. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPR52729.2023.00194) [DOI] [Google Scholar]
- 45. Rogers LCG, Williams D. 2000. Diffusions, Markov processes, and martingales: Itô calculus. vol. 2. Cambridge, UK: Cambridge University Press. [Google Scholar]
- 46. Doob JL. 1984. Classical potential theory and its probabilistic counterpart. vol. 262. Berlin, Germany: Springer. [Google Scholar]
- 47. Zhou L, Lou A, Khanna S, Ermon S. 2024. Denoising diffusion bridge models. In The Twelfth International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 48. Liu GH, Vahdat A, Huang DA, Theodorou EA, Nie W, Anandkumar A. 2023. I2SB: image-to-image Schrödinger bridge. In Proceedings of the 40th International Conference on Machine Learning, pp. 22042–22062. Red Hook, NY: Proceedings of Machine Learning Research (PMLR). [Google Scholar]
- 49. Chen T, Liu GH, Theodorou EA. 2021. Likelihood training of schrödinger bridge using forward-backward sdes theory. arXiv. ( 10.48550/arXiv.2110.11291) [DOI]
- 50. Peyré G, Cuturi M. 2019. Computational optimal transport: with applications to data science. Found.Trends Mach. Learn. 11 , 355–607. ( 10.1561/2200000073) [DOI] [Google Scholar]
- 51. Yue C, Peng Z, Ma J, Du S, Wei P, Zhang D. 2023. Image restoration through generalized ornstein-uhlenbeck bridge. arXiv. ( 10.48550/arXiv.2312.10299) [DOI]
- 52. Luo Z, Gustafsson FK, Zhao Z, Sjölund J, Schön TB. 2024. Photo-realistic image restoration in the wild with controlled vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6641–6651. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPRW63382.2024.00658) [DOI] [Google Scholar]
- 53. Yu F, Gu J, Li Z, Hu J, Kong X, Wang X, He J, Qiao Y, Dong C. 2024. Scaling up to excellence: practicing model scaling for photo-realistic image restoration in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 25669–25680. Los Alamitos, CA: IEEE Computer Society. [Google Scholar]
- 54. Zhang L, Rao A, Agrawala M. 2023. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision, pp. 3836–3847. Los Alamitos, CA: IEEE Computer Society. [Google Scholar]
- 55. Wang X, Xie L, Dong C, Shan Y. 2021. Real-esrgan: training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF Conference on Computer Vision, pp. 1905–1914. Los Alamitos, CA: IEEE Computer Society. [Google Scholar]
- 56. Lipman Y, Chen RTQ, Ben-Hamu H, Nickel M, Le M. 2023. Flow matching for generative modeling. In The Eleventh Int. Conf. on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]
- 57. Luo Z, Gustafsson FK, Zhao Z, Sjölund J, Schön TB. 2023. Refusion: enabling large-size realistic image restoration with latent-space diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1680–1691. Los Alamitos, CA: IEEE Computer Society. ( 10.1109/CVPRW59228.2023.00169) [DOI] [Google Scholar]
- 58. Song J, Meng C, Ermon S. 2021. Denoising diffusion implicit models. In International Conference on Learning Representations. Red Hook, NY: Curran Associates, Inc. [Google Scholar]