ESRGAN-DP: Enhanced super-resolution generative adversarial network with adaptive dual perceptual loss

Jie Song; Huawei Yi; Wenqian Xu; Xiaohui Li; Bo Li; Yuanyuan Liu

doi:10.1016/j.heliyon.2023.e15134

. 2023 Apr 3;9(4):e15134. doi: 10.1016/j.heliyon.2023.e15134

ESRGAN-DP: Enhanced super-resolution generative adversarial network with adaptive dual perceptual loss

Jie Song ^a, Huawei Yi ^a,^∗, Wenqian Xu ^a, Xiaohui Li ^a, Bo Li ^a, Yuanyuan Liu ^b

PMCID: PMC10119608 PMID: 37089297

Abstract

The proposal of perceptual loss solves the over-smoothing problem of images caused by pixel-wise loss and improves the visual quality of the images, but it also inevitably produces a large number of artifacts and distortions in images. The reason for this phenomenon is that the perceptual features only rely on a single pre-trained visual geometry group (VGG) network, which results in the features of the image being unable to be fully extracted, thus limiting the reasoning ability of the model. To fundamentally reduce the generation of artifacts and distortions, this paper proposes the Dual Perceptual Loss (DP Loss). First, we improve the perceptual feature extraction method so that it no longer only extracts single-type VGG features. In addition, a residual network (ResNet) feature that has a complementary relationship with the VGG feature can also be extracted. Then, we propose a dynamic weighting method to eliminate the magnitude difference between perceptual losses. Finally, to obtain the excellent effect of image reconstruction, enhanced super-resolution generative adversarial network (ESRGAN) with strong learning capability is used in this paper to adapt the complexity of DP Loss. The abundant experimental studies and evaluations are conducted on benchmark datasets. Results are encouraging and better than those previously reported on these datasets. The code is available at https://github.com/Sunny6-6-6/ESRGAN-DP.

Keywords: Super resolution, Perceptual loss, Residual network, Visual geometry group

1. Introduction

A low-resolution (LR) image is reconstructed as a high-resolution (HR) image, which is called single image super-resolution (SISR) reconstruction. It is an extremely challenging task. Under the condition of high upscaling factors, low-resolution images contain too little information. Therefore, it is very difficult to reconstruct high-resolution images that are highly consistent with ground-truth (GT) images. The main challenge of this paper is to make the reconstructed image as close as possible to the GT image, which is of great significance to the fields such as medicine [1,2], remote sensing [3], monitoring [4] and so on.

As the field of super-resolution (SR) reconstruction attracted more and more attention, many excellent algorithms have been proposed. According to the existing studies, the algorithms can be divided into three categories: interpolation-based methods, model-based methods and learning-based methods. The interpolation-based methods (e.g., bicubic interpolation [5] and the nearest neighbor interpolation [6]) have low complexity and high efficiency. However, since the expanded pixels are calculated by the neighborhood pixel, these methods don't make any reasoning for the texture details, which makes the image smooth on the whole. Although the model-based method can generate images efficiently, the quality of images completely depends on whether the prior image is consistent with the image to be reconstructed. When there is a deviation in the information contained of the two images, the visual quality of the reconstructed image will decrease rapidly. However, most SR methods are based on the learning, including neighbor embedding [7], sparse coding [[8], [9], [10], [11]] and random forests [12]. With the popularization and development of deep neural network, the field of SR reconstruction has made great progress. Chao et al. [13] used deep neural network for the first time to solve the problem of SR reconstruction and proposed SRCNN, which attracted extensive attention. Later, various SR reconstruction methods based on deep learning were proposed. At that time, the main optimization goal was to minimize the mean square error (MSE) of GT images and reconstructed images [13,14] or maximize the peak signal-to-noise ratio (PSNR), which were also two metrics commonly used to evaluate and compare SR methods [15]. However, the ability of MSE (or PSNR) to capture perceptual-related differences (e.g., texture details) is very limited. To solve this problem, Johnson et al. [16] proposed perceptual loss, which extracted image features through pre-trained network to predict the uncertainty factors under LR. Subsequently, considering the great success of generative adversarial network (GAN) [17] in image generation, Ledig et al. [18] combined it with the perceptual-driven method, and made a further breakthrough in the clarity of the reconstructed image.

Although perceptual loss has made great progress in improving visual quality, there are still some obvious defects. One of them is that only relying on a single pre-trained network cannot mine potential features of images throughly, which limits the reasoning ability of the network. In order to solve this problem, based on the residual network (ResNet) [19], this paper proposes a perceptual loss called ResNet loss different from the visual geometry group (VGG) loss [16,18] in extraction method. Compared with the network structure of VGG [20], the ResNet network is composed of a large number of residual blocks [19], and it can retain more feature information and will not be lost with the increase of network layers. Some features extracted by the two distinct networks will exist in a complementary form, so this paper adopts the method of combining VGG loss with ResNet loss to improve the information acquisition ability of the overall perceptual features.

In the process of optimizing VGG loss and ResNet loss simultaneously, we expect that both of them provide strong support for the network. However, the two losses are generated in different ways, so their magnitudes will be different. If they are added directly, the difference in magnitude will cause the network to be biased towards learning the perceptual features extracted by a single pre-trained network. As a result, the advantage of dual perceptual cannot be fully utilized. The use of static constants for weighting and combination to balance the magnitude also has some defects. It can only control the initial state of the two losses, but cannot ensure the two losses to continue to be at the same magnitude during the training process. To cope with the problem mentioned above, this paper proposes a dynamic weighting method. It takes the VGG loss as the reference target and weights the ResNet loss dynamically, which keeps the relative size of two losses stable. This method eliminates the influence of magnitude, and enables the network to learn both ResNet features and VGG features with a strong degree of attention. Therefore, this paper defines the combination of the VGG loss and the dynamically weighted ResNet loss as a Dual Perceptual Loss (DP Loss). In addition, DP Loss has a certain degree of flexibility by using the dynamic weight, hence the DP Loss can be applied to different models conveniently.

The hyperparameter setting of DP Loss has a great influence on the performance of the model. Therefore, this paper first applies DP Loss to super-resolution generative adversarial network (SRGAN) [18] to get SRGAN with Dual Perceptual Loss (SRGAN-DP), and tests the influence of different hyperparameter combinations on the model to obtain the optimal hyperparameter combination. According to the experimental results, the SRGAN-DP under the optimal condition is better than SRGAN in all the evaluation metrics. Then, we incorporate DP Loss under the optimal parameter combination obtained by experimental verification into enhanced super-resolution generative adversarial network (ESRGAN), and obtain ESRGAN with Dual Perceptual Loss (ESRGAN-DP). Finally, the experimental results show that the ESRGAN-DP has better visual effect and evaluation metrics compared with SRGAN-DP.

The main contributions of this paper are itemized as follows:1. Following the VGG loss, this paper proposes the ResNet loss that is generated by the perceptual features extracted by the pre-trained ResNet network. Compared with VGG, the ResNet network does not lose the original information with the increase of layers, thus, a deeper level of output can be used to obtain the higher level of perceptual features. 2. In order to explore the advantages of the simultaneous existence of VGG features and ResNet features, this paper elaborates the complementarity between them through theoretical basis and visualization. 3. In this paper, we analyze the impact of the magnitude problem between multiple perceptual losses on training and propose a dynamic weighting solution. 4. Considering the influence of the hyperparameter combination in DP Loss on the reconstruction results, this paper conducts experiments on different benchmark datasets. By comparing and analyzing the experimental results, the loss function under the optimal hyperparameter combination is selected and applied to ESRGAN. Experiments show that compared with the traditional ESRGAN, the visual effects and evaluation metrics have been significantly improved, and compared with other state-of-the-art methods are also outstanding.

In Section 2, we discuss the related network models and the research progress of loss functions. In Section 3, we focus on the theoretical ideas related to DP Loss. In Section 4, we integrate DP Loss into SRGAN for hyperparameter analysis, and apply the loss function under the optimal hyperparameter combination to ESRGAN to obtain ESRGAN-DP, and then compare ESRGAN-DP with other state-of-the-art SR methods to verify the effectiveness of the proposed method; Section 5 concludes this paper.

2. Related work

This section first introduces different network structures in the SR field, and then focuses on the improvement of related loss functions. In Table 1, we show the solutions and effects proposed by some widely concerned SR models.

Table 1.

Comparison of the solutions and effects proposed by SR models.

Type	Improvements of Network Structure				Improvements of Loss Function
Method	VDSR [21]	EDSR [22]	RRDB [27]	AdderSR [33]	SRGAN [18]	ESRGAN [27]	SROBB [39]
Proposed Solution	Increases the number of convolutional network layers to 20	Uses residual blocks and removes unnecessary BN layers	Adds residual structure on the basis of dense residual block	Constructs an adder filter to replace the traditional convolution filter	Applies perceptual loss, adversarial loss and content loss together to the model	Uses the features before activation to construct the perceptual loss	Constructs a targeted perceptual loss based on the labels of object, background and boundary
Effect	Significantly improves PNSR value	Accelerates network convergence and improves performance	Greatly improves the reasoning ability of the network	Reduces computational complexity while ensuring network performance	Has the ability to generate texture details of the image and greatly improves the visual quality	Further improves visual quality of images and reconstructs finer textures	Generates more faithful textures and reduces artifacts

Open in a new tab

Since the pioneering SRCNN [13] was proposed, the use of deep learning to solve the SISR problem has attracted more and more attention. Compared with traditional methods, deep learning has obvious advantages. Not only does it have a great improvement in visual quality, but also has more diversity in optimization and improvement. Furthermore, more and more network architectures are used to solve the problem of SR reconstruction. Kim et al. [21] designed the VDSR, which increased the number of network layers to 20 and significantly improved the reconstruction effect. Lim et al. [22] constructed the EDSR by using residual blocks that deleted unnecessary Batch Normalization (BN) layers [23]. Inspired by DenseNet [24], Zhang et al. [25] proposed residual dense block (RDB) and applied it to SR, and further discussed the deeper network structure with channel attention mechanism [26]. Wang et al. [27] believed that the performance of the network can be better exerted without using BN layer, and introduced a residual in residual dense block (RRDB) to further improve the reasoning ability of the network. In order to restore more faithful textures, Wang et al. [28] used prior category information for targeted generation of data. Then, they constructed the spatial feature transform (SFT) layer, and adjusted the features of the middle layer in a single network by using semantic segmentation maps [[29], [30], [31]]. Yang et al. [32] made adjustments to the residual self-encoding network and used the attention mechanism to develop an improved residual self-encoding and attention mechanism super resolution (RSAMSR) network. In order to fundamentally reduce the computational cost, Song et al. [33] applied the AdderNets [34] to SR, and developed a learnable energy activation to adjust the feature distribution and refine the relevant details. Aiming at solving the problem of the uncertainty of image down-sampling methods in real scenes and the large solution space from LR to HR, Guo et al. [35] proposed a dual regression scheme that can not only learn the mapping relationship from LR to HR, but also can estimate the down-sampling kernel. In order to better extract low-frequency and high-frequency information, Pandey and Ghanekar [36] proposed a multi-scale feature enhancement attention residual network.

Johnson et al. [16] believed that only focusing on optimizing the MSE or PSNR of the pixel space ratio of the GT image and the reconstructed image would make the reconstructed image smooth. Therefore, they constructed the perceptual loss, which was to improve the reconstruction effect by minimizing the feature space error between the GT image and the reconstructed image. After that, Ledig et al. [18] applied the perceptual loss and the GAN [17] to SR, and constituted the total loss of the generator by using the weighted combination of several loss components, including the adversarial loss, the content loss and the perceptual loss. The loss function proposed by Sajjadi et al. [37] consisted of four parts, namely the pixel-wise loss, the perceptual loss, the texture matching loss and the adversarial loss. Wang et al. [27] proposed ESRGAN based on the idea of SRGAN, which used a variety of techniques to further improve the texture details of the reconstructed image. In terms of perceptual loss, they proposed to use the output before the activation of the convolutional layer to obtain more feature information, so that the object to be minimized was the error of the feature space before activation. As a result, a convincing effect was obtained on the texture details of the reconstructed image and they won the PIRM2018-SR Challenge [38] champion. Rad et al. [39] designed a targeted perceptual loss on the basis of the labels of object, background and boundary, which made the network reconstruct the image from multiple perspectives and improved the overall effect of the image. Therefore, discussing the perceptual loss is crucial to the improvement of the reconstruction results, especially the facticity of texture details. In order to reduce the unnatural artifacts generated in the perceptual-driven method, this paper designs a novel perceptual loss function to achieve this goal.

3. Methods

This paper solves the SR problem by training deep neural network. According to the theoretical idea proposed by Chao et al. [13], the optimization objective is represented in Eq. (1).

Equation 1.

(1)

where $I_{i}^{L R} \in R^{C \times H \times W}$ and $I_{i}^{H R} \in R^{C \times H \times W}$ represent the $i - t h$ LR and HR sub-image pairs in the training set, respectively. $G (I_{i}^{L R}; θ)$ represents the up-sampling network. $θ$ is the parameter to be optimized within the neural network. $L$ is the loss function.

The definition of $L$ is crucial, and it is closely related to the reconstruction effect. According to the theoretical idea Ledig et al. [18] propose, we define several losses by comparing the differences between the reconstructed image and the GT image from different perspectives, and constitute the total loss through the weighted sum of every loss component. Here, $L$ can be represented as:

Equation 2.

(2)

where $l_{c o n t e n t}$ is the content loss of the pixel-wise 1-norm distance between images reconstructed by the generator and GT images, $l_{a d v e r s a r i a l}$ is the loss caused by the application of relativistic average GAN (RaGAN) [40], $l_{D P}$ is the DP Loss we propose. $λ$ , $η$ and $γ$ are the coefficients of balancing different loss terms, respectively. In Section 3.2, the DP Loss proposed in this paper will be introduced in detail.

3.1. 3.1. adversarial network structure

In order to enable a large number of features to be better learned, in the network structure, we select ESRGAN [27] with strong learning ability, the network structure is shown in Fig. 1. The model is composed of a generator network and a discriminator network. The two networks conduct the adversarial training through the alternate optimization technique.

Fig. 1 — Adversarial network structure, where $s$ in the generator network is a residual scaling parameter.

The generator network is a residual network composed of a large number of residual in residual dense blocks (RRDBs). In order to retain more feature information, the network does not contain the BN layer. This paper progressively enlarges the image by continuously adding up-sampling blocks with a 2 $\times$ upscaling factor (if a 3 $\times$ upscaling image is required, only an up-sampling block with a 3 $\times$ upscaling factor is used). Each up-sampling block consists of three steps. Firstly, the input data is amplified by using the nearest neighbor interpolation [6] with a 2 $\times$ upscaling factor, and then it passes through a 3 $\times$ 3 filter, and finally the output data of the filter is activated by LeakyReLU.

The input image size of the discriminator network is fixed at 128 $\times$ 128. The discrimination method uses the theoretical idea of RaD proposed in Ref. [40], and the output form is:

Equation 3.

(3)

where $a v g [\cdot]$ represents the operation of taking the average for the discriminant value of all fake data (the output of the generator) in the mini-batch. According to Eq. (3), $D_{R a}$ is asymmetric, and the optimization direction will be substantially changed only by changing the parameter order and discriminant target value. Therefore, regardless of whether the optimization is the discrimination loss or the generation loss, the target network in this method can benefit from real data and fake data rather than the part of the data.

3.2. Perceptual loss

3.2.1. VGG loss and ResNet loss

Ledig et al. [18] proposed to define the VGG loss according to the ReLU activation layer of the pre-trained 19 layer VGG network. In order to obtain more feature information, Wang et al. [27] redefined the VGG loss after the convolutional layer and before the activation layer. This paper uses the VGG loss defined in Ref. [27], that is, the L1 norm loss function is used to define the Manhattan Distance between the reconstructed image features and the GT image features:

Equation 4.

(4)

where $Φ_{i, j}$ represents features obtained by the $j - t h$ convolution (before activation) before the $i - t h$ maxpooling layer in the VGG network. $C_{i, j}$ , $W_{i, j}$ and $H_{i, j}$ are the dimensions of their respective feature spaces in the VGG network.

Based on the ideas of Bruna et al. [41], Gatys et al. [42], Johnson et al. [16] and Ledig et al. [18], we define the ResNet loss on the ReLU activation layer of the pre-trained 50 layer ResNet network described in He et al. [19]. The ResNet network is different from the VGG network in structure, so a specific technique is used to specify each feature space. As shown in Fig. 2, we divide ResNet-50 into four stages, each of which contains several bottleneck layers. The extracted perceptual features use the output value of the bottleneck layer at each stage, and the ResNet loss can also be expressed as:

Equation 5.

(5)

where $β_{m, n}$ represents features obtained by the $n - t h$ bottleneck layer (after activation) at the $m - t h$ stage. $C_{m, n}$ , $W_{m, n}$ and $H_{m, n}$ are the dimensions of their respective feature spaces in the ResNet network.

Fig. 2 — Block structure of the pre-trained 50 layer ResNet network.

In this paper, in order to ensure the obvious differences between optional features, we set the output of the last bottleneck layer of each stage in ResNet-50 as the optional feature. Therefore, there are four categories of optional features in total, which are $β_{1,2}$ , $β_{2,4}$ , $β_{3,6}$ and $β_{4,3}$ . But it is not necessarily that the features in deeper layers can bring better overall effect. The optimal feature map under specific conditions needs to be determined through experimental analysis of different situations.

3.2.2. Complementarity of VGG features and ResNet features

In this section, we compare the differences of the perceptual features extracted by the pre-trained VGG and ResNet networks from the visual perspective and theoretical basis. In order to make the features of two networks have certain comparability, this paper keeps the length, width, and channel number of the two features as consistent as possible. The depth of the two networks that the images pass through is also as consistent as possible. In this paper, we choose $β_{1,2}$ and $Φ_{3,3}$ , which pass through two max-pooling layers and seven convolutions and are obtained after the ReLU activation layer. The length, width and the number of channels are exactly the same. The specific effect is shown in Fig. 3. By visualizing the feature map under a single channel, it can be clearly seen that the perceptual ways of the two pre-trained networks are different. The features under $Φ_{3,3}$ are more distinct, but some information is lost (e.g., the features under the 64th channel are very sparse). Benefiting from the excellent properties of the residual network, more original data is retained under $β_{1,2}$ , especially the features under the 1st channel basically retains all single-channel pixel information, and the features under the 256th channel focuses on perceiving the edge of the image. This huge difference of perceptual ways is exactly what we need.

Fig. 3 — Comparison of feature maps under different channels (after activation). The image “Lenna” is from Set14. (a) Original. (b) 1st channel of $β_{1,2}$ . (c) 64th channel of $β_{1,2}$ . (d) 256th channel of $β_{1,2}$ . (e) 1st channel of $Φ_{3,3}$ . (f) 64th channel of $Φ_{3,3}$ . (g) 256th channel of $Φ_{3,3}$ .

In order to illustrate the source of the above-mentioned differences in a deeper level, this paper makes the analysis from the forward calculation of the neural network, because the most essential difference between the two networks lies in the organization way of the network layer output. Fig. 4(a) and Fig. 4(b) show the basic connection way of the VGG network and ResNet network respectively. It can be seen that VGG uses multiple network layers to connect in sequence, while ResNet adopts a special jump connection way. The different connection ways make them have the distinct ways of forward calculation, which limits the back propagation, namely the optimization of network parameters. Furthermore, since the parameters of two networks are trained in the task of image recognition, some unimportant features will be discarded after the input features pass through the convolutional layer. Therefore, after the images pass through the VGG network, some features will become more obvious while other features will become weaker or even disappear with the number of layers increases. However, ResNet has the ability to retain the most original features while emphasizing important features due to its special forward calculation mechanism. Therefore, the features extracted by two networks have certain complementary properties. Fig. 3(e, f, g) and Fig. 3(b, c, d) can fully illustrate these two points, respectively. Fig. 3(e, f, g) directly discard unimportant features (e.g., background) and highlight the most important feature information, while Fig. 3(b, c, d) retain most of the original structural information and emphasizes edge position features.

Fig. 4 — Comparison of basic connection way of the VGG network and ResNet network. (a) VGG network. (b) ResNet network.

3.2.3. Magnitude issue and dynamic weighting

Obtaining more feature information from the image can enable the network to have better recovery capabilities in terms of faithful texture details. Therefore, we adopt a joint optimization strategy for the two perceptual losses mentioned in Section 3.2.1. However, VGG features and ResNet features are obtained from different types of pre-trained networks, which leads to the obvious magnitude difference between the two perceptual losses. This difference can directly affect the whole training process of the neural network. The specific reasons are as follows.

Firstly, the loss function under the several perceptual losses is given as:

Equation 6.

(6)

where $P L$ represents a single perceptual loss function and $n$ represents the number of loss functions. Then, according to the gradient descent method, the $(m + 1) - t h$ update of the network weight $w$ can be expressed as:

Equation 7.

(7)

where $γ$ is the learning rate. It can be seen from Eq. (7) that if there is a numerical difference in magnitude between $P L$ s, there will also be such a difference in range of changes. Therefore, the gradient value of $w$ is limited by the magnitude, which leads to the fact that only a part of $P L$ s occupy the dominant position. As a result, the advantages of dual perceptual cannot be exhibited. We describe this problem through the extent to which the loss is sensitive to changes of network parameters. For example, we update the weight $w$ : at a certain stage of the training process, $P L_{1} = 1$ , $P L_{2} = 0.02$ . Let $Δ w > 0$ , when $w + Δ w$ , then $P L_{1} = 1.1$ , $P L_{2} = 0.01$ ; when $w - Δ w$ , then $P L_{1} = 0.9$ , $P L_{2} = 0.03$ . From above, it can be seen as follows: $| Δ P L_{1} | = 10 * | Δ P L_{2} |$ . If $Δ w \to 0$ , then $\frac{\partial P L_{1}}{\partial w} = - 10 * \frac{\partial P L_{2}}{\partial w}$ . The total gradient contains $\frac{\partial P L_{1}}{\partial w}$ and $\frac{\partial P L_{2}}{\partial w}$ , so it is dominated by the gradient under $P L_{1}$ and its value is positive. However, in terms of the sensitivity of the loss, $P L_{2}$ is more sensitive to $Δ w$ (comparing with itselves, $P L_{1}$ fluctuates by 0.1 times but $P L_{2}$ fluctuates by 0.5 times). Therefore, the total gradient is negative, that is, the gradient under $P L_{2}$ is dominant, which is more beneficial to the model. In summary, the difference of magnitude affects the correlation between the sensitivity of the loss and the total gradient, so that the model cannot have a strong attention to both perceptual features. Even in the later stages of the training process, $P L_{1}$ has converged, but the network still does not make too many adjustments for $P L_{2}$ .

In order to resolve the problems above, this paper eliminates the influence of the difference of magnitude by weighting $P L$ s. In general, the static constant is used as the weight value, but the $P L$ s in the training process is uncontrollable, which cannot ensure that the two losses are always at the same magnitude. In this paper, we use the dynamic weighting to solve this problem, and further define DP Loss $l_{D P}$ . The DP Loss $l_{D P}$ is represented in Eq. (8).

Equation 8.

(8)

where the ResNet loss $l_{R E S}$ is dynamically weighted and the weight value is $\frac{1}{μ} ζ_{l_{V G G}, l_{R E S}}$ , $μ$ is a nonzero constant. The $ζ_{l_{V G G}, l_{R E S}}$ can be expressed as:

Equation 9.

(9)

where $v a l u e (\frac{l_{V G G} + c}{l_{R E S} + c})$ means to take the value of $\frac{l_{V G G} + c}{l_{R E S} + c}$ and disconnect the relationship between the functions associated with the value. $c$ is a tiny positive constant, so it can be negligible, and its role is just to keep the denominator from being zero. Therefore, $\frac{1}{μ} ζ_{l_{V G G}, l_{R E S}}$ is just a value that changes with the change of the ratio of $l_{V G G}$ to $l_{R E S}$ . Thus, $\frac{1}{μ} ζ_{l_{V G G}, l_{R E S}}$ is taken as the weight value under the ResNet loss, which can only change the update range of network parameters, not the update direction.

From Eq. (8) and Eq. (9), it can be seen that the weighted ResNet loss value always exists in DP Loss in the form of multiple of VGG loss value, and $μ$ is used to determine the multiple. In the process of updating network, when the VGG loss becomes smaller, the ResNet loss will be forced to decrease, and vice versa. The purpose of this is to ensure that the relative size of the loss value between the VGG loss and the ResNet loss is in a fixed state, and the network is always trained in this state. Therefore, it is prevented the situation that the network focuses on learning the perceptual features extracted by a single network due to the difference of magnitude between the perceptual losses during the training process, so that the advantages of dual perceptual can be fully utilized.

The method proposed in this paper only performs dynamic weights for the perceptual loss. The purpose of this is to ensure that DP Loss has a certain degree of flexibility and is convenient for migration to other models. The specific process of obtaining DP Loss is described by algorithm 1. In addition, it is necessary to discuss the effect of the different combinations of $μ$ , $Φ$ and $β$ in DP Loss on the model, this occurs because the effects under various conditions are very different in both the visual quality of the image and the evaluation metric. In Section 4.3, we compare and analyze the hyperparameter combinations under different conditions to determine the optimal hyperparameter of DP Loss in the relevant model.

Algorithm 1Process of obtaining DP Loss

Input: The reconstructed image

I_{S R}

, the high-resolution image

I_{H R}

, the

μ

, the

i

and

j

Φ_{i, j}

, and the

m

and

n

β_{m, n}

Output: DP Loss

l_{D P}

1
Initialization:

2
Load the pre-trained parameters for VGG19 and ResNet50 networks

3
Intercept the VGG19 network to obtain $T r u n c a t i o n V G G (\cdot)$ according to $i$ and $j$ ;

4
Intercept the ResNet50 network to obtain $T r u n c a t i o n R e s N e t (\cdot)$ according to $m$ and $n$ ;

5
1) Obtain VGG loss

6
Extract $I_{S R}$ ’s VGG features $P_{V G G}^{S R} \leftarrow T r u n c a t i o n V G G (I_{S R})$ ;

7
Extract $I_{H R}$ ’s VGG features $P_{V G G}^{H R} \leftarrow T r u n c a t i o n V G G (I_{H R})$ ;

8
Calculate the Manhattan Distance between $P_{V G G}^{S R}$ and $P_{V G G}^{H R}$ to obtain VGG loss $l_{V G G}$ ;

9
2) Obtain ResNet loss:

10
Extract $I_{S R}$ ’s ResNet features $P_{R E S}^{S R} \leftarrow T r u n c a t i o n R e s N e t (I_{S R})$ ;

11
Extract $I_{H R}$ ’s ResNet features $P_{R E S}^{H R} \leftarrow T r u n c a t i o n R e s N e t (I_{H R})$ ;

12
Calculate the Manhattan Distance between $P_{R E S}^{S R}$ and $P_{R E S}^{H R}$ to obtain ResNet loss $l_{R E S}$ ;

13
3) Obtain DP Loss:

14
Calculate the ratio $R_{l_{V G G}, l_{R E S}}$ of VGG loss to ResNet loss, where $R_{l_{V G G}, l_{R E S}} = \frac{l_{V G G}}{l_{R E S}}$ ;

15
Break off the relationship between the functions associated with the $R_{l_{V G G}, l_{R E S}}$ to obtain $ζ_{l_{V G G}, l_{R E S}}$ ;

16
Calculate DP Loss $l_{D P} \leftarrow l_{V G G} + \frac{1}{μ} ζ_{l_{V G G}, l_{R E S}} l_{R E S}$ ;

Open in a new tab

4. Experiments

4.1. Datasets and evaluation metrics

The training set includes 800 high-definition images from the public dataset DIV2K [43]. By using sliding window to crop the images, 32,592 non-overlapping sub-images with the size of 480 $\times$ 480 are obtained. And then we perform bicubic interpolation operation on these images to get the corresponding downsampling images. Following ESRGAN, only a 4 $\times$ upscaling factor is considered in the experiment. In this paper, the widely used benchmark datasets are taken as the test datasets. The datasets contain Set5 [44], Set14 [11], BSD100 [45] and Urban100 [46], which have 5 images, 14 images, 100 images and 100 images, respectively.

The peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [47] and learned perceptual image patch similarity (LPIPS) [48] are evaluation metrics in the experiment. The lower the LPIPS value, the higher the perceptual similarity, that is, the reconstructed images are closer to GT images in visual quality. In order to make a fair comparison, a 4-pixel wide stripe is removed from each border of all the evaluated images, and the Y-channel of the images is used to calculate PSNR and SSIM.

4.2. Training details

In this paper, the software environment is the Linux system equipped with compute unified device architecture (CUDA) (version 11.2). The hardware environment is a GTX 2080TI GPU, and the deep learning framework is Pytorch [49] (version 1.7.1). Herein, the mini-batch size is set to 16, the spatial size of the cropped HR patch is 128 $\times$ 128 and parameters of the generator loss function in Eq. (2) are $λ = 1 e - 2$ , $η = 5 e - 3$ and $γ = 1$ . The model training needs 400K iterations in total. The initial learning rate is $1 e - 2$ , and the learning rate is halved when the iteration reaches [50k, 100k, 200k, 300k]. The Adam [50] with $β_{1} = 0.9$ and $β_{2} = 0.99$ is used to alternately optimize the generator and the discriminator.

4.3. Hyperparameter analysis

According to Eqs. (4), (5), (6), three important hyperparameters need to be specified, which are the VGG features $Φ$ , the ResNet features $β$ and the constant $μ$ . For the $Φ$ , we refer to Refs. [18,27] and designate it as $Φ_{5,4}$ (before activation), while $β$ and $μ$ need to be determined according to the experiment. In this paper, we set the optional hyperparameters of $β$ as $β_{1,3}$ , $β_{2,4}$ , $β_{3,6}$ and $β_{4,3}$ (after activation), and the optional hyperparameters of $μ$ as 0.2, 0.5, 1, 5, 10 and 20. Since SRGAN and ESRGAN are similar in implementation ideas, it can be proved that there is a certain positive correlation in the improvement of the image reconstruction effect of the two models using different hyperparameter combinations (the correlation will be proved in Section 4.3.4). Therefore, in order to reduce the training burden, we use SRGAN to conduct comparative experiments under each hyperparameter combination, and apply the obtained optimal hyperparameter combination to ESRGAN. In order to improve the efficiency of finding the optimal hyperparameters, the optimal hyperparameter combination is obtained by alternately fixing a certain hyperparameter. In order to ensure that the results are fair and effective, the number of iterations is 400K during training.

4.3.1. Determination of hyperparameter μ

In this paper, $β$ is fixed as $β_{1,3}$ firstly, and then the experimental results are compared under the different values of $μ$ . The experimental result on the visual effect is shown in Fig. 5(a–h). When the $μ$ is 0.5 or 1, the lines become clearer and the textures are closer to the GT image. Table 2 gives the comparison of results caused by losses under the different values of $μ$ , and all evaluation metrics have the better performance after adding DP Loss under the appropriate value of $μ$ . When $μ$ is 1, the values of SSIM and PSNR in each dataset are the best, followed by the values of LPIPS; when $μ$ is 0.5, the values of LPIPS are the best, followed by most of the values of SSIM and PSNR. In this paper, we pay more attention to the LPIPS, namely the perceptual similarity. Compared with SSIM and PSNR, LPIPS is more in line with human perceptual habits. Finally, the value of $μ$ is determined as 0.5.

Fig. 5 — Comparison of visual effects of DP Loss under different $μ$ . The image “Img024” is from Urban100. (a) Original. (b) $μ = 0.2$ . (c) $μ = 0.5$ . (d) $μ = 1$ . (e) $μ = 5$ . (f) $μ = 10$ . (g) $μ = 20$ . (h) $μ = \infty$ .

Table 2.

Comparison of results of DP Loss under the different $μ$ . $μ = \infty$ means only using the VGG loss, namely the original perceptual loss.

Dataset	Set5			Set14			BSD100			Urban100
Metric	SSIM	PNSR	LPIPS	SSIM	PNSR	LPIPS	SSIM	PNSR	LPIPS	SSIM	PNSR	LPIPS
$μ = \infty$	0.7706	26.89	0.1085	0.7347	26.07	0.1239	0.6919	25.37	0.1483	0.7150	24.61	0.1407
$μ = 0.2$	0.3596	20.33	0.1500	0.3470	20.12	0.1672	0.3075	19.81	0.2089	0.3443	19.64	0.1830
$μ = 0.5$	0.7922	27.39	0.1052	0.7570	26.37	0.1207	0.7225	25.82	0.1392	0.7460	25.07	0.1297
$μ = 1$	0.7959	27.47	0.1080	0.7617	26.51	0.1226	0.7259	25.95	0.1447	0.7490	25.18	0.1331
$μ = 5$	0.7905	27.37	0.1092	0.7565	26.49	0.1236	0.7188	25.86	0.1489	0.7391	25.04	0.1386
$μ = 10$	0.7826	27.13	0.1082	0.7486	26.28	0.1230	0.7111	25.71	0.1473	0.7311	24.87	0.1393
$μ = 20$	0.7814	27.13	0.1098	0.7481	26.35	0.1250	0.7090	25.68	0.1495	0.7288	24.86	0.1413

Open in a new tab

The best performance is highlighted in italic (1st est) andbold(2nd best).

4.3.2. Determination of hyperparameter β

The optimal $μ$ obtained in Section 4.3.1 is used as a fixed value, and then we observe the influence on the image reconstruction when $β$ is under the different values. Comparisons of visual effects and objective evaluation metrics are shown in Fig. 6(a–e) and Table 3, respectively. It can be seen that the result of DP Loss under the condition of $β_{3,6}$ occupies more advantages. From visual effects, the images reconstructed by using $β_{3,6}$ have less unnatural artifacts, so that the reconstructed images are closer to the GT images in structure. From evaluation metrics, it can be found that both PSNR and SSIM occupy the first or second best, and the LPIPS is the best in BSDS100 and Urban100 datasets. To sum up, the $β_{3,6}$ will be the optimal features.

Fig. 6 — Comparison of visual effects of DP Loss under different $β$ . The image “148026” is from BSD100. (a) Original. (b) $β_{1,3}$ . (c) $β_{2,4}$ . (d) $β_{3,6}$ . (e) $β_{4,3}$ .

Table 3.

Comparison of DP Loss under different $β$ .

Dataset	Set5			Set14			BSD100			Urban100
Metric	SSIM	PSNR	LPIPS	SSIM	PSNR	LPIPS	SSIM	PNSR	LPIPS	SSIM	PNSR	LPIPS
$β_{1,3}$	0.7922	27.39	0.1052	0.7570	26.37	0.1207	0.7225	25.82	0.1392	0.7460	25.07	0.1297
$β_{2,4}$	0.7899	27.27	0.1039	0.7547	26.26	0.1187	0.7152	25.59	0.1371	0.7406	24.90	0.1283
$β_{3,6}$	0.7915	27.45	0.1051	0.7564	26.46	0.1200	0.7202	25.82	0.1370	0.7434	25.06	0.1278
$β_{4,3}$	0.7876	27.13	0.1040	0.7543	26.30	0.1176	0.7131	25.60	0.1405	0.7372	24.89	0.1327

Open in a new tab

The best performance is highlighted initalic(1st best) and bold (2nd best).

4.3.3. SRGAN and ESRGAN under the DP loss with the optimal hyperparameter

This paper uses the two optimal hyperparameters in DP Loss obtained from Section 4.3.1 and Section 4.3.2 and applies them to SRGAN and ESRGAN respectively to obtain SRGAN-DP and ESRGAN-DP. The specific effects are given in Table 4. Compared with SRGAN-DP, ESRGAN-DP has a more superior effect in terms of LPIPS.

Table 4.

Comparison between SRGAN-DP and ESRGAN-DP in LPIPS values.

Methods		DataSet
Methods	Set5	Set14	BSD100	Urban100
SRGAN-DP	0.1051	0.1200	0.1370	0.1278
ESRGAN-DP	0.0990	0.1139	0.1280	0.1186

Open in a new tab

Font bold indicates the best performance.

4.3.4. The positive correlation between SRGAN-DP and ESRGAN-DP

In order to prove there is a certain positive correlation between the effects of SRGAN-DP and ESRGAN-DP under different hyperparameter combinations, this paper selects four combinations of representative hyperparameters from Sections 4.3.1 and 4.3.2, which are $μ = \infty$ , $μ = 1$ + $β_{1,3}$ , $μ = 10$ + $β_{1,3}$ and $μ = 0.5$ + $β_{3,6}$ , respectively. In SRGAN-DP, the performance effects of the four hyperparameter combinations are: when $μ = \infty$ , all evaluation metrics are the worst; when $μ = 1$ + $β_{1,3}$ , PSNR and SSIM are the best; when $μ = 10$ + $β_{1,3}$ , each metric is neither the best nor the worst; when $μ = 0.5$ + $β_{3,6}$ , LPIPS is the best. It can be seen from Table 5 that the effects shown in SRGAN-DP are also shown in ESRGAN-DP. Therefore, it can be concluded that the effects of SRGAN-DP and ESRGAN-DP under different hyperparameter combinations have a certain positive correlation.

Table 5.

Comparison of different hyperparameter combinations between SRGAN-DP and ESRGAN-DP on BSD100.

Hyperparameter	SRGAN-DP			ESRGAN-DP
Hyperparameter	PSNR	SSIM	LPIPS	PSNR	SSIM	LPIPS
$μ = \infty$	25.37	0.6919	0.1483	24.95	0.6785	0.1428
$μ = 1$ + $β_{1,3}$	25.95	0.7259	0.1447	25.43	0.7007	0.1329
$μ = 10$ + $β_{1,3}$	25.71	0.7111	0.1473	25.35	0.6968	0.1403
$μ = 0.5$ + $β_{3,6}$	25.82	0.7202	0.1370	25.40	0.6993	0.1280

Open in a new tab

The best performance is highlighted in bold and the worst performance is highlighted in underline.

4.4. Comparison with the state-of-the-art technologies

To demonstrate the effectiveness of ESRGAN-DP, this paper compares it with the state-of-the-art single image SR methods in terms of evaluation metrics and visual quality, including EnhanceNet [37], SRGAN [18], ESRGAN [27], SFTGAN [28] DRN [35] and ESRGAN+ [51].

As can be seen from Table 6, compared with ESRGAN, ESRGAN-DP shows a better performance on SSIM, PSNR and LPIPS. Especially in the four test datasets, LPIPS decreases by 0.01305 on average compared with the ESRGAN, and it is the best among all the models, but SSIM and PSNR are not the best. However, LPIPS is more convincing, because it is more in line with human perceptual habits from a visual perspective. For example, the PSNR value and SSIM value of DRN are mostly the best in all models, but it does not mean that it has the best visual quality. From Figs. 7 and 8, too little reasoning about texture details causes the images reconstructed by DRN to still have a certain sense of blur.

Table 6.

Comparison of different SR methods on the benchmark datasets.

Dataset	Metric	Bicubic	DRN [35]	EnhanceNet [37]	SFTGAN [18]	SRGAN [27]	ESRGAN [28]	ESRGAN+ [51]	ESRGAN-DP (our)
Set5	PSNR	26.69	29.95	26.76	27.26	26.69	26.50	25.88	27.11
	SSIM	0.7736	0.8522	0.7670	0.7765	0.7813	0.7565	0.7511	0.7748
	LPIPS	0.3644	0.1964	0.1198	0.1028	0.1304	0.1080	0.1178	0.0990
Set14	PSNR	26.08	28.96	26.02	26.29	25.88	25.52	25.01	26.00
	SSIM	0.7466	0.8261	0.7344	0.7397	0.7347	0.7175	0.7159	0.7366
	LPIPS	0.3870	0.2196	0.1337	0.1177	0.1422	0.1254	0.1363	0.1139
BSD100	PSNR	26.07	25.57	25.51	25.71	24.66	24.95	24.62	25.40
	SSIM	0.7177	0.7239	0.6974	0.7065	0.7063	0.6785	0.6893	0.6993
	LPIPS	0.4454	0.2922	0.1611	0.1358	0.1622	0.1428	0.1446	0.1280
Urban100	PSNR	24.73	26.23	24.65	25.04	24.04	24.21	23.98	24.79
	SSIM	0.7101	0.7793	0.7168	0.7314	0.7209	0.7045	0.7182	0.7284
	LPIPS	0.4346	0.2320	0.1522	0.1259	0.1534	0.1355	0.1334	0.1186

Open in a new tab

Font bold indicates the best performance.

The more detailed observation can be made from the fingernails and the edge of sleeves in Fig. 7(a–i) and the shape and shadow of bricks in Fig. 8(a–i). These images contain a lot of interactive lines, which contribute to the comparison of the reconstruction ability of SR methods in complex scenes. It is observed that the images reconstructed by Bicubic are extremely blurry in vision. The images reconstructed by EnhanceNet and SRGAN make an improvement in visual quality, but have serious artifacts and distortions. The artifacts of the images reconstructed by ESRGAN are reduced, but too many unreal textures are generated, which results in a great difference between reconstructed images and GT images. SFTGAN performs better than the above methods, but the reconstructed images are poor in the clarity of texture. The images reconstructed by DRN retain more original information, but are still slightly smooth on the whole, which have a certain influence on the visual quality. In terms of clarity of textures, ESRGAN + has a certain improvement compared with other models, but it also generates too many unrealistic structures to make the images lack facticity. The proposed method enhances the ability to reason missing information in LR to restore the facticity of the image as much as possible, which makes the reconstructed image more clear and closer to GT image from visual perspective.

To sum up, in terms of both evaluation metrics and visual effect, compared with the original perceptual loss only using a single VGG loss, the perceptual loss using DP Loss has superior advantages in solving SR problems.

4.5. Ablation study

In order to demonstrate the effectiveness of the proposed DP Loss and dynamic weighting, ESRGAN is used to as the baseline model to conduct ablation study. It can be seen from Table 7 that whether ResNet loss is applied to the original ESRGAN or the dynamic weighting is applied on the basis of ResNet loss, all evaluation metrics have been improved to a certain extent. Furthermore, from the perspective of SSIM, applying ResNet Loss to ESRGAN can significantly improve the value of SSIM. It shows that the ability of the network to recover the realistic structures of images is greatly improved due to the complementary property of VGG features and ResNet features. From the perspective of LPIPS, both two perceptual losses and dynamic weighting make the images have a great positive impact, which is closely related to the improvement of the visual quality of images. It can be seen from Fig. 9(a–d) that the application of ResNet loss can eliminate a large number of unrealistic artifacts, and then further application of dynamic weighting can make the texture clearer and the structure more in line with GT images.

Table 7.

Comparison of ESRGAN under different conditions.

Metric	VGG Loss	ResNet Loss	Dynamic Weighting	Set5	Set14	BSD100	Urban100
PSNR	√			26.50	25.52	24.95	24.21
	√	√		26.92	25.97	25.33	24.49
	√	√	√	27.11	26.00	25.40	24.79
SSIM	√			0.7565	0.7175	0.6785	0.7045
	√	√		0.7711	0.7347	0.6945	0.7181
	√	√	√	0.7748	0.7366	0.6993	0.7284
LPIPS	√			0.1080	0.1254	0.1428	0.1355
	√	√		0.1017	0.1165	0.1379	0.1302
	√	√	√	0.0990	0.1139	0.1280	0.1186

Open in a new tab

The best performance is highlighted in italic (1st best) andbold (2nd best).

Fig. 9 — Visual comparison of results of ESRGAN under different conditions. (a) VGG loss. (b) VGG loss + ResNet loss. (c) VGG loss + ResNet loss + Dynamic weighting. (d) Original.

5. Conclusions

In this paper, we propose a DP Loss to reslove the problem that there are always structural distortions in the results reconstructed by SR methods under the traditional perceptual-driven. Both the features extracted by pre-trained ResNet network and those extracted by pre-trained VGG network are applied to the perceptual loss, which improves the information acquisition ability of the perceptual features from the perspective of feature extraction method to enhance the ability of network to reason texture details. The strategy of using the dynamic weighting method for ResNet loss is to eliminate the interference of the magnitude difference on the training, which makes the advantages of dual perceptual more obvious. In addition, we compare the different influences of different hyperparameter combinations in the DP Loss on the results, and the quantitative and qualitative assessments obtained from four popular benchmark datasets both demonstrate the effectiveness of the method proposed in this paper. In future work, we will investigate how to extend the advantages of DP Loss in SR task to other tasks, such as style transfer task and human pose estimation task.

Author contribution statement

Jie Song: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Huawei Yi: Conceived and designed the experiments; Contributed reagents, materials, analysis tools or data.

Wenqian Xu, Xiaohui Li, Bo Li, Yuanyuan Liu: Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data.

Data availability statement

Data will be made available on request.

Declaration of interest's statement

The authors declare no conflict of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation for youth scientists of China (Grant No. 61802161), the Natural Science Foundation of Liaoning Province, China (Grant No. 20180550886, Grant No. 2020-MS-292), and the Scientific Research Foundation of Liaoning Provincial Education Department, China (No. JZL202015402).

References

1.Caballero J. Med Image Comput Comput Assist Interv. 2013. Cardiac image super-resolution with global correspondence using multi-atlas patchmatch; pp. 9–16. [DOI] [PubMed] [Google Scholar]
2.Peled S., Yeshurun Y. Superresolution in MRI: application to human white matter fiber tract visualization by diffusion tensor imaging. Magn. Reson. Med. Off. J. Soci. Magn.. Reson. Med. 2015;45:29–35. doi: 10.1002/1522-2594(200101)45:1<29::aid-mrm1005>3.0.co;2-z. [DOI] [PubMed] [Google Scholar]
3.Yang D., Li Z., Xia Y., Chen Z. IEEE International Conference on Digital Signal Processing. 2015. Remote sensing image super-resolution: challenges and approaches. [Google Scholar]
4.Zhang L., Zhang H., Shen H., Li P. A super-resolution reconstruction algorithm for surveillance images. Signal Process. 2010;90:848–859. [Google Scholar]
5.Zhang X., Zheng Z., Asanuma I., Xu Y. A new kind of super-resolution reconstruction algorithm based on the ICM and the bicubic interpolation. Inf. Japan. 2008;16:8027–8036. [Google Scholar]
6.Olivier R., Cao H. Nearest neighbor value interpolation. Int. J. Adv. Comput. Sci. Appl. 2012;3:25–30. [Google Scholar]
7.Chang H., Yeung D.Y., Xiong Y. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2004. Super-resolution through neighbor embedding. [Google Scholar]
8.Timofte R., De V., Gool L.V. IEEE International Conference on Computer Vision. 2014. Anchored neighborhood regression for fast example-based super-resolution. [Google Scholar]
9.Timofte R., Desmet V., Vangool L. Springer International Publishing; 2014. A+: Adjusted Anchored Neighborhood Regression for Fast Super-resolution. [Google Scholar]
10.Yang J., Wright J., Huang T.S., Yi M. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage; Alaska, USA: 2008. Image super-resolution as sparse representation of raw image patches; p. 2008. CVPR 2008), 24-26 June 2008. [Google Scholar]
11.Zeyde R., Elad M., Protter M. Curves and Surfaces - 7th International Conference, Avignon, France. Revised Selected Papers; 2010. On single image scale-up using sparse-representations. June 24-30, 2010. [Google Scholar]
12.Schulter S., Leistner C., Bischof H. IEEE Conference on Computer Vision and Pattern Recognition. 2015. Fast and accurate image upscaling with super-resolution forests; pp. 3791–3799. [Google Scholar]
13.Chao D., Chen C.L., He K., Tang X. ECCV; 2014. Learning a Deep Convolutional Network for Image Super-resolution. [Google Scholar]
14.Shi W., Caballero J., Huszár F., Totz J., Wang Z. IEEE Conference on Computer Vision and Pattern Recognition. CVPR); 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network; p. 2016. [Google Scholar]
15.Yang C.-Y., Ma C., Yang M.-H. In: Computer Vision - ECCV 2014 - 13th European Conference. Fleet D.J., Pajdla T., Schiele B., Tuytelaars T., editors. Springer; Zurich, Switzerland: 2014. Single-image super-resolution: a benchmark; pp. 372–386. September 6-12, 2014, Proceedings, Part IV. [DOI] [Google Scholar]
16.Johnson J., Alahi A., Fei-Fei L. European Conference on Computer Vision. 2016. Perceptual losses for real-time style transfer and super-resolution. [Google Scholar]
17.Goodfellow I.J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y. Generative adversarial networks. Adv. Neural Inf. Process. Syst. 2014;3:2672–2680. [Google Scholar]
18.Ledig C., Theis L., Huszar F., Caballero J., Cunningham A., Acosta A., Aitken A., Tejani A., Totz J., Wang Z. IEEE Computer Society; 2016. Photo-realistic Single Image Super-resolution Using a Generative Adversarial Network. [Google Scholar]
19.He K., Zhang X., Ren S., Sun J. IEEE; 2016. Deep Residual Learning for Image Recognition. [DOI] [Google Scholar]
20.Simonyan K., Zisserman A. 3rd International Conference on Learning Representations. 2015. Very deep convolutional networks for large-scale image recognition. [Google Scholar]
21.Kim J., Lee J.K., Lee K.M. IEEE Conference on Computer Vision and Pattern Recognition. 2016. Accurate image super-resolution using very deep convolutional networks. [Google Scholar]
22.Lim B., Son S., Kim H., Nah S., Lee K.M. IEEE Conference on Computer Vision and Pattern Recognition Workshops. CVPRW); 2017. Enhanced deep residual networks for single image super-resolution; p. 2017. [Google Scholar]
23.Ioffe S., Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift, International conference on machine learning. pmlr. 2015:448–456. [Google Scholar]
24.Huang G., Liu Z., Laurens V., Weinberger K.Q. IEEE Computer Society; 2016. Densely Connected Convolutional Networks. [Google Scholar]
25.Zhang Y., Tian Y., Kong Y., Zhong B., Fu Y. IEEE; 2018. Residual Dense Network for Image Super-resolution. [DOI] [PubMed] [Google Scholar]
26.Zhang Y., Li K., Li K., Wang L., Zhong B., Fu Y. 2018. Image Super-resolution Using Very Deep Residual Channel Attention Networks. [Google Scholar]
27.Wang X., Yu K., Wu S., Gu J., Liu Y., Dong C., Loy C.C., Qiao Y., Tang X. European Conference on Computer Vision. 2018. ESRGAN: enhanced super-resolution generative adversarial networks. [Google Scholar]
28.Wang X., Yu K., Dong C., Loy C.C. IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR); 2018. Recovering realistic texture in image super-resolution by deep spatial feature transform; p. 2018. [Google Scholar]
29.Li X., Liu Z., Luo P., Loy C.C., Tang X. IEEE; 2017. Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade. [Google Scholar]
30.Liu Z., Li X., Luo P., Loy C.C., Tang X. Deep learning markov random field for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2018:1. doi: 10.1109/TPAMI.2017.2737535. –1. [DOI] [PubMed] [Google Scholar]
31.Long J., Shelhamer E., Darrell T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015;39:640–651. doi: 10.1109/CVPR.2015.7298965. [DOI] [PubMed] [Google Scholar]
32.Yang X., Wang S., Han J., Guo Y., Li T. RSAMSR: a deep neural network based on residual self-encoding and attention mechanism for image super-resolution. Optik. 2021;245 [Google Scholar]
33.Song D., Wang Y., Chen H., Xu C., Xu C., Tao D. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR); 2021. AdderSR: towards energy efficient image super-resolution; pp. 15648–15657. [Google Scholar]
34.Chen H., Wang Y., Xu C., Shi B., Xu C. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR); 2020. AdderNet: do we really need multiplications in deep learning? [Google Scholar]
35.Guo Y., Chen J., Wang J., Chen Q., Cao J., Deng Z., Xu Y., Tan M. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR); 2020. Closed-loop matters: dual regression networks for single image super-resolution. [Google Scholar]
36.Pandey G., Ghanekar U. Single image super-resolution using multi-scale feature enhancement attention residual network. Optik. 2021;231 [Google Scholar]
37.Sajjadi M., Scholkopf B., Hirsch M. IEEE International Conference on Computer Vision. 2017. EnhanceNet: single image super-resolution through automated texture synthesis. [Google Scholar]
38.Blau Y., Mechrez R., Timofte R., Michaeli T., Zelnik-Manor L. Proceedings of the European Conference on Computer Vision. ECCV) Workshops; 2018. The 2018 PIRM challenge on perceptual image super-resolution. [Google Scholar]
39.Rad M.S., Bozorgtabar B., Marti U.V., Basler M., Ekenel H.K., Thiran J.P. IEEE/CVF International Conference on Computer Vision. ICCV; 2019. SROBB: targeted perceptual loss for single image super-resolution; p. 2019. [Google Scholar]
40.Jolicoeur-Martineau A. International Conference on Learning Representations. 2019. The relativistic discriminator: a key element missing from standard GAN.https://openreview.net/forum?id=S1erHoR5t7 [Google Scholar]
41.Bruna J., Sprechmann P., Lecun Y. 4th International Conference on Learning Representations. 2016. Super-resolution with deep convolutional sufficient statistics. [Google Scholar]
42.Gatys L.A., Ecker A.S., Bethge M. MIT Press; 2015. Texture Synthesis Using Convolutional Neural Networks. [Google Scholar]
43.Agustsson E., Timofte R. IEEE Conference on Computer Vision and Pattern Recognition Workshops. CVPRW); 2017. NTIRE 2017 challenge on single image super-resolution: dataset and study; p. 2017. [Google Scholar]
44.Bevilacqua M., Roumy A., Guillemot C., Morel A. 2012. Low-complexity Single Image Super-resolution Based on Nonnegative Neighbor Embedding. Bmvc. [DOI] [Google Scholar]
45.Martin D., Fowlkes C., Tal D., Malik J. IEEE International Conference on Computer Vision. 2002. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. [Google Scholar]
46.Huang J.B., Singh A., Ahuja N. IEEE; 2015. Single Image Super-resolution from Transformed Self-Exemplars. [Google Scholar]
47.Wang Z., Bovik A.C., Sheikh H.R., Simoncelli E.P. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 2004 doi: 10.1016/j.jvcir.2019.102655. [DOI] [PubMed] [Google Scholar]
48.Zhang R., Isola P., Efros A.A., Shechtman E., Wang O. IEEE; 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. [Google Scholar]
49.Paszke A., Gross S., Chintala S., Chanan G., Yang E., DeVito Z., Lin Z., Desmaison A., Antiga L., Lerer A. NIPS 2017 Workshop on Autodiff. Long Beach, California; USA: 2017. Automatic differentiation in PyTorch.https://openreview.net/forum?id=BJJsrmfCZ [Google Scholar]
50.Kingma D.P., Ba J., Adam . In: 3rd International Conference on Learning Representations, ICLR 2015. Bengio Y., LeCun Y., editors. Conference Track Proceedings; San Diego, CA, USA: May 7-9, 2015. A method for stochastic optimization; p. 2015.http://arxiv.org/abs/1412.6980 [Google Scholar]
51.Rakotonirina N.C., Rasoanaivo A. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP); 2020. ESRGAN+ : further improving enhanced super-resolution generative adversarial network; pp. 3637–3641. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Data will be made available on request.

[bib1] 1.Caballero J. Med Image Comput Comput Assist Interv. 2013. Cardiac image super-resolution with global correspondence using multi-atlas patchmatch; pp. 9–16. [DOI] [PubMed] [Google Scholar]

[bib2] 2.Peled S., Yeshurun Y. Superresolution in MRI: application to human white matter fiber tract visualization by diffusion tensor imaging. Magn. Reson. Med. Off. J. Soci. Magn.. Reson. Med. 2015;45:29–35. doi: 10.1002/1522-2594(200101)45:1<29::aid-mrm1005>3.0.co;2-z. [DOI] [PubMed] [Google Scholar]

[bib3] 3.Yang D., Li Z., Xia Y., Chen Z. IEEE International Conference on Digital Signal Processing. 2015. Remote sensing image super-resolution: challenges and approaches. [Google Scholar]

[bib4] 4.Zhang L., Zhang H., Shen H., Li P. A super-resolution reconstruction algorithm for surveillance images. Signal Process. 2010;90:848–859. [Google Scholar]

[bib5] 5.Zhang X., Zheng Z., Asanuma I., Xu Y. A new kind of super-resolution reconstruction algorithm based on the ICM and the bicubic interpolation. Inf. Japan. 2008;16:8027–8036. [Google Scholar]

[bib6] 6.Olivier R., Cao H. Nearest neighbor value interpolation. Int. J. Adv. Comput. Sci. Appl. 2012;3:25–30. [Google Scholar]

[bib7] 7.Chang H., Yeung D.Y., Xiong Y. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2004. Super-resolution through neighbor embedding. [Google Scholar]

[bib8] 8.Timofte R., De V., Gool L.V. IEEE International Conference on Computer Vision. 2014. Anchored neighborhood regression for fast example-based super-resolution. [Google Scholar]

[bib9] 9.Timofte R., Desmet V., Vangool L. Springer International Publishing; 2014. A+: Adjusted Anchored Neighborhood Regression for Fast Super-resolution. [Google Scholar]

[bib10] 10.Yang J., Wright J., Huang T.S., Yi M. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Anchorage; Alaska, USA: 2008. Image super-resolution as sparse representation of raw image patches; p. 2008. CVPR 2008), 24-26 June 2008. [Google Scholar]

[bib11] 11.Zeyde R., Elad M., Protter M. Curves and Surfaces - 7th International Conference, Avignon, France. Revised Selected Papers; 2010. On single image scale-up using sparse-representations. June 24-30, 2010. [Google Scholar]

[bib12] 12.Schulter S., Leistner C., Bischof H. IEEE Conference on Computer Vision and Pattern Recognition. 2015. Fast and accurate image upscaling with super-resolution forests; pp. 3791–3799. [Google Scholar]

[bib13] 13.Chao D., Chen C.L., He K., Tang X. ECCV; 2014. Learning a Deep Convolutional Network for Image Super-resolution. [Google Scholar]

[bib14] 14.Shi W., Caballero J., Huszár F., Totz J., Wang Z. IEEE Conference on Computer Vision and Pattern Recognition. CVPR); 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network; p. 2016. [Google Scholar]

[bib15] 15.Yang C.-Y., Ma C., Yang M.-H. In: Computer Vision - ECCV 2014 - 13th European Conference. Fleet D.J., Pajdla T., Schiele B., Tuytelaars T., editors. Springer; Zurich, Switzerland: 2014. Single-image super-resolution: a benchmark; pp. 372–386. September 6-12, 2014, Proceedings, Part IV. [DOI] [Google Scholar]

[bib16] 16.Johnson J., Alahi A., Fei-Fei L. European Conference on Computer Vision. 2016. Perceptual losses for real-time style transfer and super-resolution. [Google Scholar]

[bib17] 17.Goodfellow I.J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y. Generative adversarial networks. Adv. Neural Inf. Process. Syst. 2014;3:2672–2680. [Google Scholar]

[bib18] 18.Ledig C., Theis L., Huszar F., Caballero J., Cunningham A., Acosta A., Aitken A., Tejani A., Totz J., Wang Z. IEEE Computer Society; 2016. Photo-realistic Single Image Super-resolution Using a Generative Adversarial Network. [Google Scholar]

[bib19] 19.He K., Zhang X., Ren S., Sun J. IEEE; 2016. Deep Residual Learning for Image Recognition. [DOI] [Google Scholar]

[bib20] 20.Simonyan K., Zisserman A. 3rd International Conference on Learning Representations. 2015. Very deep convolutional networks for large-scale image recognition. [Google Scholar]

[bib21] 21.Kim J., Lee J.K., Lee K.M. IEEE Conference on Computer Vision and Pattern Recognition. 2016. Accurate image super-resolution using very deep convolutional networks. [Google Scholar]

[bib22] 22.Lim B., Son S., Kim H., Nah S., Lee K.M. IEEE Conference on Computer Vision and Pattern Recognition Workshops. CVPRW); 2017. Enhanced deep residual networks for single image super-resolution; p. 2017. [Google Scholar]

[bib23] 23.Ioffe S., Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift, International conference on machine learning. pmlr. 2015:448–456. [Google Scholar]

[bib24] 24.Huang G., Liu Z., Laurens V., Weinberger K.Q. IEEE Computer Society; 2016. Densely Connected Convolutional Networks. [Google Scholar]

[bib25] 25.Zhang Y., Tian Y., Kong Y., Zhong B., Fu Y. IEEE; 2018. Residual Dense Network for Image Super-resolution. [DOI] [PubMed] [Google Scholar]

[bib26] 26.Zhang Y., Li K., Li K., Wang L., Zhong B., Fu Y. 2018. Image Super-resolution Using Very Deep Residual Channel Attention Networks. [Google Scholar]

[bib27] 27.Wang X., Yu K., Wu S., Gu J., Liu Y., Dong C., Loy C.C., Qiao Y., Tang X. European Conference on Computer Vision. 2018. ESRGAN: enhanced super-resolution generative adversarial networks. [Google Scholar]

[bib28] 28.Wang X., Yu K., Dong C., Loy C.C. IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR); 2018. Recovering realistic texture in image super-resolution by deep spatial feature transform; p. 2018. [Google Scholar]

[bib29] 29.Li X., Liu Z., Luo P., Loy C.C., Tang X. IEEE; 2017. Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade. [Google Scholar]

[bib30] 30.Liu Z., Li X., Luo P., Loy C.C., Tang X. Deep learning markov random field for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2018:1. doi: 10.1109/TPAMI.2017.2737535. –1. [DOI] [PubMed] [Google Scholar]

[bib31] 31.Long J., Shelhamer E., Darrell T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015;39:640–651. doi: 10.1109/CVPR.2015.7298965. [DOI] [PubMed] [Google Scholar]

[bib32] 32.Yang X., Wang S., Han J., Guo Y., Li T. RSAMSR: a deep neural network based on residual self-encoding and attention mechanism for image super-resolution. Optik. 2021;245 [Google Scholar]

[bib33] 33.Song D., Wang Y., Chen H., Xu C., Xu C., Tao D. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR); 2021. AdderSR: towards energy efficient image super-resolution; pp. 15648–15657. [Google Scholar]

[bib34] 34.Chen H., Wang Y., Xu C., Shi B., Xu C. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR); 2020. AdderNet: do we really need multiplications in deep learning? [Google Scholar]

[bib35] 35.Guo Y., Chen J., Wang J., Chen Q., Cao J., Deng Z., Xu Y., Tan M. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR); 2020. Closed-loop matters: dual regression networks for single image super-resolution. [Google Scholar]

[bib36] 36.Pandey G., Ghanekar U. Single image super-resolution using multi-scale feature enhancement attention residual network. Optik. 2021;231 [Google Scholar]

[bib37] 37.Sajjadi M., Scholkopf B., Hirsch M. IEEE International Conference on Computer Vision. 2017. EnhanceNet: single image super-resolution through automated texture synthesis. [Google Scholar]

[bib38] 38.Blau Y., Mechrez R., Timofte R., Michaeli T., Zelnik-Manor L. Proceedings of the European Conference on Computer Vision. ECCV) Workshops; 2018. The 2018 PIRM challenge on perceptual image super-resolution. [Google Scholar]

[bib39] 39.Rad M.S., Bozorgtabar B., Marti U.V., Basler M., Ekenel H.K., Thiran J.P. IEEE/CVF International Conference on Computer Vision. ICCV; 2019. SROBB: targeted perceptual loss for single image super-resolution; p. 2019. [Google Scholar]

[bib40] 40.Jolicoeur-Martineau A. International Conference on Learning Representations. 2019. The relativistic discriminator: a key element missing from standard GAN.https://openreview.net/forum?id=S1erHoR5t7 [Google Scholar]

[bib41] 41.Bruna J., Sprechmann P., Lecun Y. 4th International Conference on Learning Representations. 2016. Super-resolution with deep convolutional sufficient statistics. [Google Scholar]

[bib42] 42.Gatys L.A., Ecker A.S., Bethge M. MIT Press; 2015. Texture Synthesis Using Convolutional Neural Networks. [Google Scholar]

[bib43] 43.Agustsson E., Timofte R. IEEE Conference on Computer Vision and Pattern Recognition Workshops. CVPRW); 2017. NTIRE 2017 challenge on single image super-resolution: dataset and study; p. 2017. [Google Scholar]

[bib44] 44.Bevilacqua M., Roumy A., Guillemot C., Morel A. 2012. Low-complexity Single Image Super-resolution Based on Nonnegative Neighbor Embedding. Bmvc. [DOI] [Google Scholar]

[bib45] 45.Martin D., Fowlkes C., Tal D., Malik J. IEEE International Conference on Computer Vision. 2002. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. [Google Scholar]

[bib46] 46.Huang J.B., Singh A., Ahuja N. IEEE; 2015. Single Image Super-resolution from Transformed Self-Exemplars. [Google Scholar]

[bib47] 47.Wang Z., Bovik A.C., Sheikh H.R., Simoncelli E.P. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 2004 doi: 10.1016/j.jvcir.2019.102655. [DOI] [PubMed] [Google Scholar]

[bib48] 48.Zhang R., Isola P., Efros A.A., Shechtman E., Wang O. IEEE; 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. [Google Scholar]

[bib49] 49.Paszke A., Gross S., Chintala S., Chanan G., Yang E., DeVito Z., Lin Z., Desmaison A., Antiga L., Lerer A. NIPS 2017 Workshop on Autodiff. Long Beach, California; USA: 2017. Automatic differentiation in PyTorch.https://openreview.net/forum?id=BJJsrmfCZ [Google Scholar]

[bib50] 50.Kingma D.P., Ba J., Adam . In: 3rd International Conference on Learning Representations, ICLR 2015. Bengio Y., LeCun Y., editors. Conference Track Proceedings; San Diego, CA, USA: May 7-9, 2015. A method for stochastic optimization; p. 2015.http://arxiv.org/abs/1412.6980 [Google Scholar]

[bib51] 51.Rakotonirina N.C., Rasoanaivo A. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. ICASSP); 2020. ESRGAN+ : further improving enhanced super-resolution generative adversarial network; pp. 3637–3641. [Google Scholar]

PERMALINK

ESRGAN-DP: Enhanced super-resolution generative adversarial network with adaptive dual perceptual loss

Jie Song

Huawei Yi

Wenqian Xu

Xiaohui Li

Bo Li

Yuanyuan Liu

Abstract

1. Introduction

2. Related work

Table 1.

3. Methods

3.1. 3.1. adversarial network structure

Fig. 1.

3.2. Perceptual loss

3.2.1. VGG loss and ResNet loss

Fig. 2.

3.2.2. Complementarity of VGG features and ResNet features

Fig. 3.

Fig. 4.

3.2.3. Magnitude issue and dynamic weighting

4. Experiments

4.1. Datasets and evaluation metrics

4.2. Training details

4.3. Hyperparameter analysis

4.3.1. Determination of hyperparameter μ

Fig. 5.

Table 2.

4.3.2. Determination of hyperparameter β

Fig. 6.

Table 3.

4.3.3. SRGAN and ESRGAN under the DP loss with the optimal hyperparameter

Table 4.

4.3.4. The positive correlation between SRGAN-DP and ESRGAN-DP

Table 5.

4.4. Comparison with the state-of-the-art technologies

Table 6.

Fig. 7.

Fig. 8.

4.5. Ablation study

Table 7.

Fig. 9.

5. Conclusions

Author contribution statement

Data availability statement

Declaration of interest's statement

Acknowledgments

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases