Computational Intelligence and Neuroscience. 2022 Sep 26;2022:8628402. doi: 10.1155/2022/8628402

Lightweight Image Super-Resolution Based on Re-Parameterization and Self-Calibrated Convolution

Sufan Zhang 1, Xi Chen 1, Xingwei Huang 1
PMCID: PMC9529453  PMID: 36199958

Abstract

Image super-resolution techniques can improve image quality by increasing image clarity, bringing a better user experience in real production scenarios. However, existing convolutional neural network methods usually have very deep architectures and a large number of parameters, which causes feature information to be lost as the network deepens; moreover, models with many parameters are not suitable for deployment on resource-constrained mobile devices. To address these problems, we propose a novel lightweight image super-resolution network (RepSCN) based on re-parameterization and self-calibrated convolution. Specifically, to reduce the computational cost while capturing more high-frequency details, we design a re-parameterization distillation block (RepDB) and a self-calibrated distillation block (SCDB). They improve the reconstruction results by aggregating local distilled feature information under different receptive fields without introducing extra parameters. On the other hand, the positional information of the image is also crucial for super-resolution reconstruction. Nevertheless, existing lightweight SR methods mainly adopt the channel attention mechanism, which ignores the importance of positional information. Therefore, we introduce a lightweight coordinate attention mechanism (CAM) at the end of RepDB and SCDB to enhance the feature representation at both the spatial and channel levels. Numerous experiments have shown that our network achieves better reconstruction performance with fewer parameters than other classical lightweight super-resolution models.

1. Introduction

Image super-resolution (SR) is an important technique used in computer vision tasks for image processing. It enables image reconstruction by learning a nonlinear mapping between a high-resolution (HR) image and its low-resolution (LR) counterpart. Since SR can increase the resolution while preserving the image texture details as much as possible, it plays a significant role in medical imaging [1], security and surveillance imaging [2], remote sensing imagery, and the preprocessing of some advanced computer vision tasks [3,4]. In general, SR is an ill-posed problem because one LR image may correspond to multiple HR images [5].

Recently, deep learning methods [6–11] have demonstrated great value in computer vision tasks. With the development of convolutional neural networks (CNNs), various CNN-based methods for image super-resolution have emerged and have achieved significant performance breakthroughs. Dong et al. [12] first proposed a network (SRCNN) consisting of three convolutional layers to learn the mapping from LR images to HR images. SRCNN upsamples the input image to the target size before mapping it, which increases the computational cost and slows model training. To address this issue, Dong et al. [13] introduced FSRCNN, which upsamples the feature maps only at the end of the network, saving a significant amount of computation and running time. After this, considering that a limited number of CNN layers cannot fully exploit the feature information of LR images, Kim et al. [14] presented the core idea that deepening and widening the network structure can lead to better performance and proposed a 20-layer network (VDSR). VDSR employs residual learning for the first time in the SR task, effectively speeding up the convergence of the network and avoiding vanishing or exploding gradients. EDSR [15] removes the batch normalization (BN) layer, which is disadvantageous to the SR task, and improves the performance of the deep network. DRCN [16] and DRRN [17] use a recursive approach and a parameter-sharing strategy to further reduce parameters while increasing the depth and width of the network. Zhang et al. [18] proposed a residual dense network (RDN) with dense skip connections and a residual structure that can capture more contextual information to reconstruct images.

Most of the above approaches focus on improving performance by designing wider and deeper networks. However, as the network depth and width increase, the computational requirements and memory consumption also increase, making deep networks unsuitable for deployment on mobile devices. Moreover, the deepening of the network causes the low-dimensional information of the image to fade away in the continuous nonlinear mappings, which is not conducive to reconstructing high-quality images. Therefore, Ahn et al. [19] proposed the lightweight cascading residual network CARN-M, which replaces vanilla convolution with group convolution to speed up network inference while reducing the number of parameters. Hui et al. [20] proposed an information distillation network (IDN), which improves the expressiveness of the network by combining the relevant output information with the redundant feature information obtained from distillation. However, these approaches reduce parameters at the cost of a substantial performance drop and do not achieve a favorable trade-off. Later, Hui et al. improved IDN by proposing the information multi-distillation network (IMDN) [21]. IMDN designed an information multi-distillation block (IMDB) to progressively extract helpful information and aggregate two different kinds of features, achieving a good trade-off between network complexity and reconstruction quality. Nevertheless, although IMDN reduces the number of network parameters, it sacrifices part of the performance and cannot fully exploit the representational power of convolutional neural networks.

To further improve the performance of lightweight super-resolution models, we elaborately design a lightweight image super-resolution network based on re-parameterization and self-calibrated convolution, named RepSCN. Because deep neural networks require an excessive number of parameters and computational resources, we follow the shallow network structure of IMDN. The information multi-distillation block (IMDB) in IMDN employs a channel-splitting operation to extract features, which makes the number of channels inconsistent before and after convolution, prevents residual learning through identity connections, and brings inflexibility to the network structure design. To improve the inferential and representational power of the network, we design a re-parameterization distillation block (RepDB) and a self-calibrated distillation block (SCDB) to replace the IMDB. In the shallow feature extraction stage, the re-parameterization convolution (RepConv) in RepDB can collect more useful information than the standard convolution without introducing extra cost in the inference phase. In the deep feature extraction stage, the self-calibrated block (SCB), with a larger convolutional receptive field, works as the feature extraction module of SCDB and helps the network generate feature maps containing rich high-frequency details. It is worth mentioning that RepDB and SCDB do not introduce additional parameters. On the other hand, lightweight models with a small number of parameters limit the achievable SR performance. The contrast-aware channel attention (CCA) layer in IMDN only learns feature mappings along the channel dimension, which is inefficient. Therefore, we introduce a lightweight coordinate attention mechanism (CAM) [22] at the end of RepDB and SCDB, which captures not only cross-channel but also direction-aware and position-sensitive information. Comparative and ablation experiments demonstrate that the proposed network structure achieves high SR quality while maintaining a modest model size.

For this paper, the main contributions are as follows:

  1. We introduce, for the first time, the idea of structural re-parameterization into the distillation network and propose a re-parameterization distillation block (RepDB) to speed up model inference while further improving SR reconstruction performance.

  2. We propose a self-calibrated distillation block (SCDB) with a self-calibrated block (SCB) to increase the receptive field of convolutional layers, which can generate features containing more contextual information without introducing additional parameters.

  3. Based on RepDB and SCDB, we design a lightweight and efficient SR network (RepSCN), which fuses multiscale features under different receptive fields to enhance feature representation. A coordinate attention mechanism (CAM) is also introduced to further improve performance. Numerous experiments have demonstrated that RepSCN achieves SR performance comparable to that of state-of-the-art models while using a modest number of parameters.

2. Related Work

2.1. Single-Image Super-Resolution

Recently, deep learning models have dramatically advanced single-image super-resolution (SISR). Dong et al. [12] first combined deep learning techniques with SISR and proposed a network (SRCNN) consisting of three convolutional layers to learn the mapping between HR and LR images in an end-to-end manner. Compared with SRCNN, ESPCN [23] designed a subpixel convolution layer that upsamples the feature maps to the target size only at the end of the network, greatly reducing the computational and storage complexity of the model. VDSR [14] introduced residual learning to tackle the difficulty of training deep networks, and SR performance improved as the network deepened. Later, to reduce the number of network parameters, Kim et al. [16] proposed a novel recursive network, DRCN, which achieves good performance using a parameter-sharing strategy. LapSRN [24] used a pyramid framework to gradually enlarge the size of the input image patch. Tai et al. [25] proposed a persistent memory network (MemNet), which merges previous feature information using skip connections and alleviates the long-term dependency problem of deep models. EDSR [15] removes the BN layers in residual blocks; building on this, RDN [18] introduces dense connections, which reduces the parameters and improves the performance compared with EDSR. RCAN [26] introduced the channel attention mechanism to form a new residual structure. SRFBN [27] proposes a feedback mechanism to improve the representational power of the network.

To better deploy models on mobile devices, Ahn et al. [19] proposed CARN, which uses a recursive cascading mechanism to learn multilevel feature representations. IDN [20] effectively combines local long- and short-path features using group convolution. IMDN [21] proposes an information multi-distillation block (IMDB) that extracts hierarchical features using a channel-splitting operation. Later, RFDN [28] improved IMDB with two parallel convolutional operations to separate the feature channels. FDIWN [29] proposed the Wide-residual Distillation Interaction Block (WDIB) to allow features of different scales to interact. LBNet [30] integrates a CNN and a Transformer to build a more efficient model.

In perceptual-driven methods, Ledig et al. [31] proposed SRGAN, a generative adversarial network (GAN)-based model with a newly defined perceptual loss function. Then, ESRGAN [32] introduced a Residual-in-Residual Dense Block (RRDB) on top of SRGAN to further enhance visual quality. However, although GAN-based models can generate finer texture details, the recovered images always suffer from geometric structure distortion.

2.2. Attention Mechanism

The attention mechanism is a resource-allocation scheme; in computer vision tasks, it is used to find correlations in the data and focus on certain important features. Hu et al. [33] proposed the squeeze-and-excitation network (SENet), which learns the degree of dependency of each channel and redistributes channel feature information according to the interdependence between channels. After that, considering that channel attention can effectively help the network learn the high-frequency information of images in SR tasks, Zhang et al. [26] introduced the channel attention mechanism into a residual learning-based block and proposed the RCAN model. The Efficient Channel Attention (ECA) module [34] improves SENet by generating channel attention via fast 1D convolution and is suitable for lightweight networks. Moreover, combining channel and spatial attention is an important development in attention mechanisms. CBAM [35] infers attention weights along both the spatial and channel dimensions. SRRAM [36], on the other hand, proposed a residual attention module (RAM) based on CBAM that is more suitable for SR tasks. Recently, Wang et al. [37] proposed a balanced attention mechanism (BAM) for SISR, combining MaxPool for spatial attention and AvgPool for channel attention, which is lightweight and efficient. Due to the effectiveness of attention models, our network introduces a coordinate attention mechanism [22] in the local feature aggregation part to further enhance performance.

3. Proposed Methods

3.1. Network Framework

The overall framework of the proposed network is illustrated in Figure 1. Our RepSCN consists of four parts: a feature extraction module, a nonlinear feature mapping module, a feature fusion part, and a reconstruction module. Algorithm 1 formulates the network forward step. Taking I_LR and I_SR as the input and output of the network, respectively, the feature extraction module is a single convolutional layer with 56 channels and a 3 × 3 kernel, which can be expressed as follows:

$F_0 = f_s(I_{LR})$, (1)

where $f_s(\cdot)$ denotes the shallow feature extraction function. The shallow feature F_0 is then fed into the nonlinear feature mapping module to extract useful feature information; this module consists of three re-parameterization distillation blocks (RepDB) and three self-calibrated distillation blocks (SCDB) stacked in a chain-like manner, as shown in the following equation:

$F_i = f_{RepDB}(F_{i-1}), \quad i = 1, \ldots, n; \quad F_j = f_{SCDB}(F_{j-1}), \quad j = n+1, \ldots, n+m$, (2)

where $f_{RepDB}(\cdot)$ and $f_{SCDB}(\cdot)$ denote the RepDB and SCDB functions, respectively. $F_{i-1}$ and $F_i$ represent the input and output features of the i-th RepDB, and similarly, $F_{j-1}$ and $F_j$ represent the input and output features of the (j−n)-th SCDB. All these intermediate features are aggregated by a concatenation operation, and then a 1 × 1 and a 3 × 3 convolutional layer are used to compress the feature channels and further refine the features. The feature fusion part can be formulated as follows:

$F_f = f_f(\mathrm{cat}(F_1, \ldots, F_n, \ldots, F_{n+m}))$, (3)

where cat(·) stands for the concatenation operation along the channel dimension and $f_f(\cdot)$ denotes the 1 × 1 convolution followed by the 3 × 3 convolution. Finally, the aggregated feature $F_f$ and the shallow feature $F_0$ are element-wise summed and upsampled to the target size by the reconstruction module as follows:

Figure 1. Architecture of the proposed framework.

Algorithm 1. RepSCN function.

$I_{SR} = f_{rec}(F_f + F_0)$, (4)

where frec(·) denotes the reconstruction module, which upsamples the image using subpixel convolution to obtain the reconstructed SR image.
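For concreteness, the forward pass of equations (1)–(4) can be sketched in PyTorch as follows. This is a minimal illustration rather than the released implementation: the `blocks` argument is assumed to hold the three RepDBs followed by the three SCDBs described in Sections 3.2 and 3.3, and the module names are ours.

```python
import torch
import torch.nn as nn

class RepSCN(nn.Module):
    # Minimal sketch of the forward path in equations (1)-(4); `blocks` is expected
    # to contain modules that map `channels` feature maps to `channels` feature maps.
    def __init__(self, blocks, channels=56, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(3, channels, 3, padding=1)        # f_s, equation (1)
        self.blocks = nn.ModuleList(blocks)                        # 3 RepDB + 3 SCDB
        self.fuse = nn.Sequential(                                 # f_f, equation (3)
            nn.Conv2d(len(blocks) * channels, channels, 1),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.reconstruct = nn.Sequential(                          # f_rec, equation (4)
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),                                # sub-pixel upsampling
        )

    def forward(self, lr):
        f0 = self.shallow(lr)
        feats, x = [], f0
        for block in self.blocks:                                  # equation (2)
            x = block(x)
            feats.append(x)
        ff = self.fuse(torch.cat(feats, dim=1))
        return self.reconstruct(ff + f0)
```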

We denote the training set as $\{I_{HR}^t, I_{LR}^t\}_{t=1}^{T}$, which contains T LR-HR image pairs. The loss function of our RepSCN can be described as

$L(\theta) = \frac{1}{T}\sum_{t=1}^{T} \left\| I_{HR}^t - f_{RepSCN}(I_{LR}^t) \right\|_1$, (5)

where θ represents all learnable parameters in the network and fRepSCN(·) is our RepSCN function. ||·||1 means the L1 norm.

3.2. Re-Parameterization Distillation Block

As shown in Figure 2, our re-parameterization distillation block (RepDB) consists of cascaded re-parameterization convolutions (RepConv), 1 × 1 convolutions and a 3 × 3 convolution for reducing the number of feature channels, and a coordinate attention mechanism (CAM). The whole module utilizes residual connections to extract valuable features progressively.

Figure 2. The left part shows the architecture of our proposed re-parameterization distillation block (RepDB); the right part presents the re-parameterization convolution (RepConv). Here, the numbers of input and output channels are both set to 2, so each convolutional layer has four kernel matrices.

As shown in Figure 2, inspired by RFDN [28], the feature distillation operation is implemented by a 1 × 1 convolution that compresses the feature channels at a rate of 0.5, while the re-parameterization convolutions progressively refine the features. Given an input feature F_in, the process of the i-th RepDB can be formulated as follows:

$F_{distill\_1}^i, F_{refine\_1}^i = D_1^i(F_{in}^i), R_1^i(F_{in}^i)$,
$F_{distill\_2}^i, F_{refine\_2}^i = D_2^i(F_{refine\_1}^i), R_2^i(F_{refine\_1}^i)$,
$F_{distill\_3}^i, F_{refine\_3}^i = D_3^i(F_{refine\_2}^i), R_3^i(F_{refine\_2}^i)$,
$F_{refine\_4}^i = R_4^i(F_{refine\_3}^i)$, (6)

where $F_{distill\_j}^i$ and $F_{refine\_j}^i$ represent the j-th distilled feature and refined feature of the i-th RepDB, respectively. $R_1^i$, $R_2^i$, and $R_3^i$ denote the three re-parameterization convolutional layers of the i-th RepDB, $R_4^i$ is a 3 × 3 convolutional layer that decreases the number of channels of the refined features, and $D_j^i$ represents the j-th 1 × 1 convolutional layer of the i-th RepDB. Then, the distilled features and the final refined feature are concatenated together and added to the input feature F_in to obtain the output feature F_out of the module. This can be expressed as follows:

$F_{out} = f_{CAM}(f_C(\mathrm{cat}(F_{distill\_1}^i, F_{distill\_2}^i, F_{distill\_3}^i, F_{refine\_4}^i))) + F_{in}$, (7)

where cat(·) is the concatenation operation along the channel dimension, $f_C(\cdot)$ denotes a 1 × 1 convolution used to compress the feature channels to the same size as the input feature, and $f_{CAM}(\cdot)$ is the coordinate attention mechanism.
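A minimal PyTorch sketch of the RepDB structure in equations (6) and (7) is given below. Here `RepConv` refers to the re-parameterization convolution described next, `CoordAttention` to the module sketched in Section 3.4, and the LeakyReLU activation (and its slope) is our assumption, since the activation function is not specified in the text.

```python
import torch
import torch.nn as nn

class RepDB(nn.Module):
    # Sketch of equations (6)-(7): three distillation steps, aggregation,
    # channel compression (f_C), coordinate attention (f_CAM), and a residual add.
    def __init__(self, channels=56, distill_rate=0.5):
        super().__init__()
        dc = int(channels * distill_rate)                 # distilled channels (rate 0.5)
        self.d1 = nn.Conv2d(channels, dc, 1)              # D_1: 1x1 distillation conv
        self.r1 = RepConv(channels, channels)             # R_1: re-parameterization conv
        self.d2 = nn.Conv2d(channels, dc, 1)              # D_2
        self.r2 = RepConv(channels, channels)             # R_2
        self.d3 = nn.Conv2d(channels, dc, 1)              # D_3
        self.r3 = RepConv(channels, channels)             # R_3
        self.r4 = nn.Conv2d(channels, dc, 3, padding=1)   # R_4: reduces refined channels
        self.compress = nn.Conv2d(4 * dc, channels, 1)    # f_C
        self.cam = CoordAttention(channels)               # f_CAM
        self.act = nn.LeakyReLU(0.05, inplace=True)       # assumed activation

    def forward(self, x):
        d1, r1 = self.d1(x), self.act(self.r1(x))
        d2, r2 = self.d2(r1), self.act(self.r2(r1))
        d3, r3 = self.d3(r2), self.act(self.r3(r2))
        r4 = self.act(self.r4(r3))
        out = self.compress(torch.cat([d1, d2, d3, r4], dim=1))
        return self.cam(out) + x                          # residual connection, equation (7)
```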

Since a multibranch structure can bring performance benefits to the network, inspired by RepVGG [38], we propose a re-parameterization convolution (RepConv) based on the RepVGG block but better suited to SR tasks. Unlike the RepVGG block, as shown in Figure 2, RepConv removes the BN layer, which is ineffective for the SR task. In the training phase, RepConv can be represented as y = g(x) + f(x) + x, where g(x) and f(x) correspond to the 1 × 1 convolution and the 3 × 3 convolution, respectively. Although we can obtain various receptive fields by employing different convolution kernels on different branches, the multibranch structure is slower during inference and increases memory occupation.

To address the above problem, we transform the trained multipath model into a single-path model in the inference phase. The specific procedure is shown in Figure 2. Given a re-parameterization convolution with both the number of input channels and the number of output channels set to 2, the 1 × 1 convolution can be transformed into a 3 × 3 convolution by zero-padding its kernel. By setting the kernel weight of the current channel to 1 and those of the remaining channels to 0, the identity branch can likewise be expressed as a 1 × 1 convolution and further as a 3 × 3 convolution. According to the additivity of convolution, the convolutional layers in the three branches are finally fused into a single new 3 × 3 convolution whose bias is the sum of the biases of the original 1 × 1 and 3 × 3 convolutions. The experimental results show that the re-parameterization convolution effectively improves super-resolution reconstruction performance compared with a normal convolution.
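The branch fusion described above can be sketched as follows. This is a hedged illustration: the class and method names are ours, and the kernel and bias algebra simply follows the zero-padding argument in the preceding paragraph.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepConv(nn.Module):
    # Training-time form: y = conv3x3(x) + conv1x1(x) + x (no BN), as in Section 3.2.
    def __init__(self, channels_in, channels_out):
        super().__init__()
        assert channels_in == channels_out, "identity branch needs matching channels"
        self.conv3 = nn.Conv2d(channels_in, channels_out, 3, padding=1)
        self.conv1 = nn.Conv2d(channels_in, channels_out, 1)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x) + x

    @torch.no_grad()
    def fuse(self):
        # Inference-time re-parameterization: fold all three branches into one 3x3 conv.
        c_out, c_in = self.conv3.out_channels, self.conv3.in_channels
        device = self.conv3.weight.device
        # 1x1 kernel -> 3x3 kernel by zero-padding around the centre tap.
        k1_as_3 = F.pad(self.conv1.weight, [1, 1, 1, 1])
        # identity -> 3x3 kernel with a single 1 at the centre of the matching channel.
        k_id = torch.zeros(c_out, c_in, 3, 3, device=device)
        for i in range(c_out):
            k_id[i, i, 1, 1] = 1.0
        fused = nn.Conv2d(c_in, c_out, 3, padding=1).to(device)
        fused.weight.copy_(self.conv3.weight + k1_as_3 + k_id)
        fused.bias.copy_(self.conv3.bias + self.conv1.bias)  # identity adds no bias
        return fused
```

After training, calling `fuse()` replaces the three-branch module with a single 3 × 3 convolution, so the inference graph is single-path with no extra parameters or memory overhead.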

3.3. Self-Calibrated Distillation Block

To recover more high-frequency details from the low-frequency information of the image, we propose the self-calibrated distillation block (SCDB) to extract features in the deep part of the network. As shown in Figure 3, the overall structure of the SCDB is the same as that of the RepDB, except that the re-parameterization convolutions in the feature refinement part are replaced by self-calibrated blocks, which enlarge the convolutional receptive field and thus capture richer contextual information, helping to generate realistic HR images.

Figure 3. Architecture of our proposed self-calibrated distillation block (SCDB).

Inspired by SCNet [39], we improve the SCConv in SCNet by removing its BN layers and embedding a global residual connection to alleviate the pressure during training. The proposed self-calibrated block is shown in Figure 4, where the input feature $X_{n-1}$ is split equally into $X_{n-1}'$ and $X_{n-1}''$ along the channel dimension by a 1 × 1 convolution. $X_{n-1}'$ and $X_{n-1}''$ are processed in two different branches, and the processed features are concatenated together and summed with $X_{n-1}$ to obtain the final output $X_n$.

Figure 4. Architecture of our proposed self-calibrated block (SCB).

In the upper branch, we first apply an average-pooling downsampling operation, a convolutional feature transformation, and a bilinear-interpolation upsampling operation to the input feature $X_{n-1}'$, then sum the result with $X_{n-1}'$ and pass it through a sigmoid function to generate the spatial attention map. The spatial attention map is fused with the transformed $X_{n-1}'$, and the output $X_n'$ of this branch is obtained after a 3 × 3 convolution. The whole process can be expressed as follows:

$X_n' = f_3\big(f_2(X_{n-1}') \otimes \sigma\big(f_{Up}(f_1(f_{Down}(X_{n-1}'))) + X_{n-1}'\big)\big)$, (8)

where $f_{Down}(\cdot)$ and $f_{Up}(\cdot)$ denote the average-pooling downsampling operation and the bilinear-interpolation upsampling operation, respectively, $f_1(\cdot)$, $f_2(\cdot)$, and $f_3(\cdot)$ are 3 × 3 convolutions, σ(·) is the sigmoid function, and ⊗ denotes element-wise multiplication.

In the lower branch, the input feature $X_{n-1}''$ is fed into a 3 × 3 convolutional layer to obtain the output $X_n''$. $X_n''$ is concatenated with the output $X_n'$ of the upper branch and summed with the feature $X_{n-1}$, which contains a large amount of low-frequency information, to generate the output $X_n$ of the n-th self-calibrated block. This process can be formulated as follows:

$X_n'' = f_4(X_{n-1}''), \quad X_n = \mathrm{cat}(X_n', X_n'') + X_{n-1}$, (9)

where f4(·) is the 3 × 3 convolution and cat(·) represents the concatenation operation along the channel dimensions.
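A minimal sketch of the SCB in equations (8) and (9) might look as follows. The pooling window size and the use of `torch.chunk` for the channel split are assumptions; the paper does not state the downsampling ratio.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCB(nn.Module):
    # Sketch of equations (8)-(9): the input is split into two halves; the upper half
    # is calibrated by a down/up-sampled attention map, the lower half passes through
    # a plain 3x3 conv, and the results are concatenated and added to the input.
    def __init__(self, channels=56, pool_size=4):
        super().__init__()
        half = channels // 2
        self.split = nn.Conv2d(channels, channels, 1)      # produces X'_{n-1} and X''_{n-1}
        self.f1 = nn.Conv2d(half, half, 3, padding=1)      # conv on the downsampled feature
        self.f2 = nn.Conv2d(half, half, 3, padding=1)      # feature transform of X'_{n-1}
        self.f3 = nn.Conv2d(half, half, 3, padding=1)      # output conv of the upper branch
        self.f4 = nn.Conv2d(half, half, 3, padding=1)      # lower branch conv
        self.pool = nn.AvgPool2d(pool_size)                # f_Down (pool_size is assumed)

    def forward(self, x):
        x1, x2 = torch.chunk(self.split(x), 2, dim=1)
        # upper branch, equation (8)
        att = F.interpolate(self.f1(self.pool(x1)), size=x1.shape[-2:],
                            mode='bilinear', align_corners=False)   # f_Up
        calibrated = self.f2(x1) * torch.sigmoid(att + x1)
        upper = self.f3(calibrated)
        # lower branch and fusion, equation (9)
        lower = self.f4(x2)
        return torch.cat([upper, lower], dim=1) + x
```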

3.4. Coordinate Attention Mechanism

The channel attention mechanism (e.g., SE Block in SENet [33]) can effectively improve the reconstruction performance but ignores the locational information, which is critical to capturing the spatial structure of the image in the SR task. CBAM [35] combines channel attention and spatial attention by feedforward propagation, which is likely to cause the loss of locational information. Therefore, to augment the feature representations of the network, we introduce the coordinate attention mechanism [22], which embeds locational information into channel attention and can learn long-range dependencies between spatial locations. It is also well adapted for application to lightweight SR models.

As shown in Figure 5, given an input $X \in \mathbb{R}^{56 \times H \times W}$, where 56 is the number of feature channels of X and H and W denote the height and width of the feature map, respectively, we use two spatial pooling kernels of size (H, 1) and (1, W) to encode each channel along the horizontal and vertical directions. The outputs of the c-th channel at height h and width w can be formulated as follows:

$Y_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} X_c(h, i), \quad Y_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} X_c(j, w)$. (10)

Figure 5. Architecture of the coordinate attention module (CAM).

The above transformation yields a pair of feature maps $Y^h \in \mathbb{R}^{56 \times H \times 1}$ and $Y^w \in \mathbb{R}^{56 \times 1 \times W}$. $Y^h$ and $Y^w$ are concatenated together and then fed to a 1 × 1 convolutional layer to generate the intermediate feature F. This can be described as follows:

$F = \delta(f_{conv}(\mathrm{cat}(Y^h, Y^w)))$, (11)

where cat(·) represents the concatenation operation along the spatial dimension, $f_{conv}(\cdot)$ represents the 1 × 1 convolution, and δ(·) is the h-swish activation function. Then, we split F into $F^h$ and $F^w$ along the spatial dimension and obtain the attention weights $Z^h$ and $Z^w$ through two 1 × 1 convolution transformations. The above process can be formulated as follows:

$Z^h = \sigma(f_h(F^h)), \quad Z^w = \sigma(f_w(F^w))$, (12)

where σ(·) is the sigmoid function. Finally, the input feature X is multiplied by the attention weights $Z^h$ and $Z^w$ to generate the output feature Y, which can be formulated as follows:

$Y_c(i, j) = X_c(i, j) \times Z_c^h(i) \times Z_c^w(j)$. (13)
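A compact sketch of the coordinate attention computation in equations (10)–(13) is shown below. The channel-reduction ratio of the intermediate 1 × 1 convolution is an assumption; everything else follows the description above.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    # Sketch of equations (10)-(13): pool along H and W separately, mix with a shared
    # 1x1 conv and h-swish, split back, and produce per-direction attention weights.
    def __init__(self, channels=56, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)              # reduction ratio is assumed
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))    # average over width  -> (C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))    # average over height -> (C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.act = nn.Hardswish()                        # h-swish activation, delta(.)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        _, _, h, w = x.shape
        y_h = self.pool_h(x)                             # equation (10), horizontal encoding
        y_w = self.pool_w(x).permute(0, 1, 3, 2)         # transpose so both are (C, len, 1)
        f = self.act(self.conv1(torch.cat([y_h, y_w], dim=2)))   # equation (11)
        f_h, f_w = torch.split(f, [h, w], dim=2)         # split back along the spatial dim
        z_h = torch.sigmoid(self.conv_h(f_h))            # equation (12)
        z_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))
        return x * z_h * z_w                             # equation (13)
```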

4. Experimental Results

4.1. Datasets and Evaluation Metrics

In our experiments, we use 800 high-quality RGB training images from the publicly available DIV2K dataset [40] as the training set. To test the performance of the model, we use four commonly used benchmark datasets: Set5 [41], Set14 [42], BSD100 [43], and Urban100 [44]. Set5, Set14, and BSD100 contain natural scene images, while Urban100 contains only urban scene images. In addition, we evaluate the SR results by computing the widely used peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [45] on the Y channel of the transformed YCbCr space.
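For reference, a small helper that evaluates PSNR on the Y channel might look as follows. The BT.601 RGB-to-Y conversion and the border cropping by the scale factor are common conventions assumed here, not details stated in the paper.

```python
import numpy as np

def psnr_y(sr, hr, border=4):
    # PSNR on the Y channel of YCbCr; sr and hr are uint8 RGB arrays of shape (H, W, 3).
    # `border` pixels are cropped on each side (cropping by the scale factor is assumed).
    def rgb_to_y(img):
        img = img.astype(np.float64)
        return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                       + 24.966 * img[..., 2]) / 255.0
    y_sr, y_hr = rgb_to_y(sr), rgb_to_y(hr)
    if border:
        y_sr = y_sr[border:-border, border:-border]
        y_hr = y_hr[border:-border, border:-border]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)
```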

4.2. Implementation Details

In the data preprocessing stage, the training data are augmented by horizontal and vertical flipping and by rotations of 90°, 180°, and 270°. In addition, the HR images are converted to LR training images using bicubic interpolation downsampling in MATLAB.

In the training stage, the input patch size is 64 × 64. We optimize the model using the Adam algorithm with β1 = 0.9, β2 = 0.999, and ϵ = 10^−8. The batch size is set to 32, and the initial learning rate is set to 5 × 10^−4 and halved every 200 epochs for a total of 1000 epochs.

The ×2, ×3, and ×4 models are trained from scratch. The entire network is implemented in the PyTorch framework on an NVIDIA RTX 3080 GPU.
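The stated optimization settings translate into roughly the following training loop. This is a hedged sketch: the model and dataloader construction (yielding 64 × 64 LR patches with their HR counterparts at batch size 32) are outside the scope of this snippet.

```python
import torch

def train(model, train_loader, epochs=1000, device='cuda'):
    # Optimization setup as described in Section 4.2.
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                                 betas=(0.9, 0.999), eps=1e-8)
    # Halve the learning rate every 200 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
    l1 = torch.nn.L1Loss()                                # loss of equation (5)
    for _ in range(epochs):
        for lr_patch, hr_patch in train_loader:
            optimizer.zero_grad()
            loss = l1(model(lr_patch.to(device)), hr_patch.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()
```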

4.3. Ablation Study

In this section, we discuss the differences between our proposed method and IMDN [21]. As shown in Figure 6, we designed four building blocks, each of which forms the body of the network in Figure 1 when stacked; based on them, we conducted ablation experiments to validate the effectiveness of each proposed block. The structure of IMDB, the building block of IMDN, is shown in Figure 6(a). Its channel-splitting operation fixes the number of channels before and after feature extraction, making it difficult to introduce identity connections. Therefore, we designed the feature distillation block (FDB) shown in Figure 6(b), which uses a 1 × 1 convolution for channel reduction and is more flexible than IMDB.

Figure 6. Structure diagrams of the different building blocks used in the ablation study: (a) IMDB, (b) FDB, (c) RepDB, and (d) SCDB.

Table 1 shows the results of the ablation experiments. From the first two rows of Table 1, we observe that FDB improves the performance compared to IMDB, with the PSNR value increasing by 0.05 dB on the Urban100 dataset. In the second and third rows, the coordinate attention mechanism (CAM) achieves better results than the contrast-aware channel attention layer (CCA Layer) with approximately the same number of parameters. Comparing the last four rows, we can see that both the re-parameterization convolution (RepConv) and the self-calibrated block (SCB) bring significant performance benefits to the SR network after replacing the standard convolution, and the combination of the re-parameterization distillation block (RepDB) and the self-calibrated distillation block (SCDB) maximizes this benefit. Thus, we choose 3RepDB + 3SCDB as the main structure of the proposed RepSCN. Comparing the first and last rows, we observe that RepSCN outperforms IMDN on every dataset under the same experimental settings, especially on the Urban100 dataset, where the PSNR value improves by 0.20 dB and the SSIM value by 0.0055.

Table 1.

Ablation studies of the effects of IMDB, FDB, RepDB, and SCDB with ×4 scale factor on test sets.

Method Scale Params (K) Set5 PSNR/SSIM Set14 PSNR/SSIM BSD100 PSNR/SSIM Urban100 PSNR/SSIM
6IMDB ×4 715 32.15/0.8940 28.56/0.7808 27.54/0.7348 25.97/0.7826
6FDB + CCA ×4 737 32.16/0.8943 28.58/0.7814 27.55/0.7348 26.02/0.7831
6FDB + CAM ×4 742 32.17/0.8946 28.59/0.7816 27.55/0.7355 26.05/0.7842
6RepDB ×4 742 32.20/0.8948 28.61/0.7818 27.56/0.7356 26.09/0.7862
6SCDB ×4 801 32.23/0.8951 28.61/0.7819 27.57/0.7359 26.11/0.7863
3RepDB+3SCDB ×4 772 32.22/0.8953 28.61/0.7820 27.58/0.7363 26.17/0.7881

4.4. Model Complexity Analysis

The number of model parameters is an important metric for evaluating the complexity of a lightweight network. To directly show that the proposed RepSCN obtains better SR results with fewer parameters, we compared RepSCN with the following 13 lightweight networks on the ×2 Urban100 dataset: SRCNN [12], FSRCNN [13], VDSR [14], DRCN [16], DRRN [17], LapSRN [24], MemNet [25], IDN [20], EDSR-baseline [15], SRMDNF [46], CARN [19], MADNet [47], and IMDN [21]. As shown in Figure 7, our RepSCN outperforms the other lightweight networks by a large margin while maintaining a modest model size.

Figure 7. Trade-off between performance and number of parameters on the Urban100 dataset with scale factor ×2.

The number of Mult-Adds is another important metric for measuring the complexity of an SR model. Figure 8 plots PSNR against Mult-Adds for RepSCN and mainstream lightweight networks on the Urban100 dataset at ×2. We find that RepSCN achieves better reconstruction results with fewer Mult-Add operations than the other lightweight networks.

Figure 8. PSNR performance versus Mult-Adds on the Urban100 dataset with scale factor ×2.

4.5. Comparison with State-Of-The-Arts

To intuitively show the effectiveness and efficiency of the proposed RepSCN model, we compared RepSCN with various lightweight SR methods, including SRCNN [12], FSRCNN [13], VDSR [14], DRCN [16], LapSRN [24], MemNet [25], IDN [20], EDSR-baseline [15], SRMDNF [46], CARN [19], MADNet [47], IMDN [21], MSICF [48], and MSWSR [49]. For a fair comparison with the above methods, we train our models individually for scaling factors of ×2, ×3, and ×4.

Table 2 shows the number of parameters and the objective evaluation metrics of the different models on the four benchmark datasets. We can see that the experimental results of IMDN are better than those of all other compared methods except RepSCN at the upscaling factor of ×2. However, when the scale factor is set to ×4, the results of IMDN fall behind those of CARN and EDSR, because CARN and EDSR, with their deeper and wider network structures, can learn more feature information on large-scale datasets. Our proposed RepSCN has the same depth as IMDN, but by combining RepDB with SCDB, we obtain a larger receptive field and significantly enhance the representational power of the network. As a result, RepSCN achieves the best reconstruction performance on all four benchmark datasets at all upsampling factors, especially on the Set14 and Urban100 datasets, where the average PSNR values improve by 0.05 dB and 0.15 dB and the average SSIM values improve by 0.0013 and 0.0034 over IMDN, respectively. In summary, our RepSCN achieves the best trade-off between performance and parameters among lightweight methods.

4.6. Qualitative Comparisons

Figure 9 shows the subjective visual results at scale ×4 on the Urban100 dataset. As can be seen from images "Img044" and "Img085," most of the compared methods suffer from severe blurring artifacts, whereas our RepSCN produces more accurate lines and reconstructs more high-frequency details. From image "Img025," we observe that the other methods reconstruct the texture direction incorrectly to some extent, while RepSCN recovers clear and correct texture details and generates more realistic visual results.

Figure 9. Visual comparisons of RepSCN with other SR methods on the Urban100 dataset with scale factor ×4.

5. Conclusion

In this work, we propose a lightweight convolutional neural network based on re-parameterization and self-calibrated convolution for SISR. Specifically, we design the re-parameterization distillation block (RepDB) and the self-calibrated distillation block (SCDB) as the building blocks of the SR model. RepDB and SCDB can aggregate local features from different stages to obtain a more powerful feature representation, and the re-parameterization convolution (RepConv) and self-calibrated block (SCB) are used to extract features at a fine-grained level. Moreover, we introduce the coordinate attention mechanism (CAM) to capture long-distance dependencies between different locations, which helps recover more high-frequency details. Numerous experiments show that our method achieves performance comparable to that of state-of-the-art lightweight networks, and RepSCN reconstructs high-quality images with a small number of parameters.

In the future, we will explore more efficient re-parameterization strategies and use a single-branch network structure to accelerate model inference, which allows our models to be embedded in edge devices. In addition, we will also apply our approach to other image processing fields such as hyperspectral remote sensing and medical imaging.

Table 2.

Average PSNR (dB)/SSIM values with the scale factors ×2, ×3, and ×4 on Set5, Set14, BSD100, and Urban100. The best performance is highlighted in red, and the second-best performance is highlighted in blue.

Method Scale Params Set5 PSNR/SSIM Set14 PSNR/SSIM BSD100 PSNR/SSIM Urban100 PSNR/SSIM
SRCNN [12] ×2 8K 36.66/0.9542 32.45/0.9067 31.36/0.8879 29.50/0.8946
FSRCNN [13] ×2 13K 37.00/0.9558 32.63/0.9088 31.53/0.8920 29.88/0.9020
VDSR [14] ×2 666K 37.53/0.9587 33.03/0.9124 31.90/0.8960 30.76/0.9140
DRCN [16] ×2 1774K 37.63/0.9588 33.04/0.9118 31.85/0.8942 30.75/0.9133
LapSRN [24] ×2 251K 37.52/0.9591 32.99/0.9124 31.80/0.8952 30.41/0.9103
DRRN [17] ×2 298K 37.74/0.9591 33.23/0.9136 32.05/0.8973 31.23/0.9188
MemNet [25] ×2 678K 37.78/0.9597 33.28/0.9142 32.08/0.8978 31.31/0.9195
IDN [20] ×2 553K 37.83/0.9600 33.30/0.9148 32.08/0.8985 31.27/0.9196
EDSR-baseline [15] ×2 1370K 37.99/0.9604 33.57/0.9175 32.16/0.8994 31.98/0.9272
SRMDNF [46] ×2 1511K 37.79/0.9601 33.32/0.9159 32.05/0.8985 31.33/0.9204
CARN [19] ×2 1592K 37.76/0.9590 33.52/0.9166 32.09/0.8978 31.92/0.9256
IMDN [21] ×2 694K 38.00/0.9605 33.63/0.9177 32.19/0.8996 32.17/0.9283
MADNet [47] ×2 878K 37.94/0.9604 33.46/0.9167 32.10/0.8988 31.74/0.9246
MSICF [48] ×2 4292K 37.89/0.9605 33.41/0.9153 32.15/0.8992 31.47/0.9220
MSWSR [49] ×2 1228K 37.49/0.9583 33.23/0.9123 31.88/0.8929 31.14/0.9169
RepSCN(Ours) ×2 753K 38.01/0.9606 33.70/0.9192 32.19/0.8999 32.36/0.9307
SRCNN [12] ×3 8K 32.75/0.9090 29.30/0.8215 28.41/0.7863 26.24/0.7989
FSRCNN [13] ×3 13K 33.18/0.9140 29.37/0.8240 28.53/0.7910 26.43/0.8080
VDSR [14] ×3 666K 33.66/0.9213 29.77/0.8314 28.82/0.7976 27.14/0.8279
DRCN [16] ×3 1774K 33.82/0.9226 29.76/0.8311 28.80/0.7963 27.15/0.8276
LapSRN [24] ×3 502K 33.81/0.9220 29.79/0.8325 28.82/0.7980 27.07/0.8275
DRRN [17] ×3 298K 34.03/0.9244 29.96/0.8349 28.95/0.8004 27.53/0.8378
MemNet [25] ×3 678K 34.09/0.9248 30.00/0.8350 28.96/0.8001 27.56/0.8376
IDN [20] ×3 553K 34.11/0.9253 29.99/0.8354 28.95/0.8013 27.42/0.8359
EDSR-baseline [15] ×3 1555K 34.37/0.9270 30.28/0.8417 29.09/0.8052 28.15/0.8527
SRMDNF [46] ×3 1528K 34.12/0.9254 30.04/0.8382 28.97/0.8025 27.57/0.8398
CARN [19] ×3 1592K 34.29/0.9255 30.29/0.8407 29.06/0.8034 28.06/0.8493
IMDN [21] ×3 703K 34.36/0.9270 30.32/0.8417 29.09/0.8046 28.17/0.8519
MADNet [47] ×3 930K 34.26/0.9262 30.29/0.8410 29.04/0.8033 27.91/0.8464
MSICF [48] ×3 4292K 34.24/0.9266 30.09/0.8371 29.01/0.8024 27.69/0.8411
MSWSR [49] ×3 − −/− −/− −/− −/−
RepSCN(Ours) ×3 761K 34.49/0.9277 30.38/0.8433 29.09/0.8054 28.30/0.8553
SRCNN [12] ×4 8K 30.48/0.8628 27.50/0.7513 26.90/0.7101 24.52/0.7221
FSRCNN [13] ×4 13K 30.72/0.8660 27.61/0.7550 26.98/0.7150 24.62/0.7280
VDSR [14] ×4 666K 31.35/0.8838 28.01/0.7674 27.29/0.7251 25.18/0.7524
DRCN [16] ×4 1774K 31.53/0.8854 28.02/0.7670 27.23/0.7233 25.14/0.7510
LapSRN [24] ×4 502K 31.54/0.8852 28.09/0.7700 27.32/0.7275 25.21/0.7562
DRRN [17] ×4 298K 31.68/0.8888 28.21/0.7720 27.38/0.7284 25.44/0.7638
MemNet [25] ×4 678K 31.74/0.8893 28.26/0.7723 27.40/0.7281 25.50/0.7630
IDN [20] ×4 553K 31.82/0.8903 28.25/0.7730 27.41/0.7297 25.41/0.7632
EDSR-baseline [15] ×4 1518K 32.09/0.8938 28.58/0.7813 27.57/0.7357 26.04/0.7849
SRMDNF [46] ×4 1552K 31.96/0.8925 28.35/0.7787 27.49/0.7337 25.68/0.7731
CARN [19] ×4 1592K 32.13/0.8937 28.60/0.7806 27.58/0.7349 26.07/0.7837
IMDN [21] ×4 715K 32.21/0.8948 28.58/0.7811 27.56/0.7353 26.04/0.7838
MADNet [47] ×4 1002K 32.11/0.8939 28.52/0.7799 27.52/0.7340 25.89/0.7782
MSICF [48] ×4 4292K 31.91/0.8923 28.35/0.7751 27.46/0.7308 25.64/0.7692
MSWSR [49] ×4 1228K 32.01/0.8914 28.47/0.7776 27.48/0.7311 25.78/0.7744
RepSCN(Ours) ×4 772K 32.22/0.8953 28.61/0.7820 27.58/0.7363 26.17/0.7881

Acknowledgments

This work was supported in part by the Scientific Research Fund of Hunan Provincial Education Department under Grant no. 19C0028 and in part by the Young Teachers' Growth Plan of Changsha University of Science and Technology under Grant no. 2019QJCZ011.

Data Availability

The data supporting the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  • 1.Huang Y., Shao L., Frangi A. F. Simultaneous super-resolution and cross-modality synthesis of 3d medical images using weakly-supervised joint convolutional sparse coding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; July 2017; Honolulu, HI, USA. pp. 6070–6079. [Google Scholar]
  • 2.Rasti P., Uiboupin T., Escalera S., Anbarjafari G. Convolutional neural network super resolution for face recognition in surveillance monitoring. Proceedings of the International Conference on Articulated Motion and Deformable Objects; July 2016; Palma de Mallorca, Spain. pp. 175–184. [Google Scholar]
  • 3.Haris M., Shakhnarovich G., Ukita N. Task-driven super resolution: object detection in low-resolution images. Proceedings of the International Conference on Neural Information Processing; December 2021; Springer, Sanur, Indonesia. pp. 387–395. [Google Scholar]
  • 4.Sajjadi M. S., Scholkopf B., Hirsch M. Enhancenet: single image super-resolution through automated texture synthesis. Proceedings of the IEEE International Conference on Computer Vision; October 2017; Venice, Italy. pp. 4491–4500. [Google Scholar]
  • 5.Li J., Pei Z., Zeng T. From Beginner to Master: A Survey for Deep Learning-Based Single-Image Super-resolution. 2021. https://arxiv.org/abs/2109.14335 .
  • 6.Xia R., Chen Y., Ren B. R. Improved Anti-occlusion Object Tracking Algorithm Using Unscented Rauch-Tung-Striebel Smoother and Kernel Correlation Filter. Journal of King Saud University-Computer and Information Sciences . 2022;34:6008–6018. [Google Scholar]
  • 7.Zhang J., Feng W., Yuan T., Wang J., Sangaiah A. K. SCSTCF: spatial-channel selection and temporal regularized correlation filters for visual tracking. Applied Soft Computing . 2022;118 doi: 10.1016/j.asoc.2022.108485.108485 [DOI] [Google Scholar]
  • 8.Zhang J., Sun J., Wang J., Li Z., Chen X. An object tracking framework with recapture based on correlation filters and Siamese networks. Computers & Electrical Engineering . 2022;98 doi: 10.1016/j.compeleceng.2022.107730.107730 [DOI] [Google Scholar]
  • 9.Zhang J., Zou X., Kuang L. D. CCTSDB 2021: A More Comprehensive Traffic Sign Detection Benchmark. Human-centric Computing and Information Sciences . 2022;12 [Google Scholar]
  • 10.Zhang J., Sun J., Wang J., Yue X. G. Visual object tracking based on residual network and cascaded correlation filters. Journal of Ambient Intelligence and Humanized Computing . 2021;12(8):8427–8440. doi: 10.1007/s12652-020-02572-0. [DOI] [Google Scholar]
  • 11.Wang L., Dai L., Bian H., Ma Y., Zhang J. Concrete cracking prediction under combined prestress and strand corrosion. Structure and Infrastructure Engineering . 2019;15(3):285–295. doi: 10.1080/15732479.2018.1550519. [DOI] [Google Scholar]
  • 12.Dong C., Loy C. C., He K., Tang X. Learning a deep convolutional network for image super-resolution. Proceedings of the European Conference on Computer Vision; September 2014; Zurich, Switzerland. pp. 184–199. [Google Scholar]
  • 13.Dong C., Loy C. C., Tang X. Accelerating the superresolution convolutional neural network. Proceedings of the European Conference on Computer Vision; October 2016; Amsterdam, Netherlands. pp. 391–407. [Google Scholar]
  • 14.Kim J., Lee J., Lee K. M. Accurate image super-resolution using very deep convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; July 2016; Las Vegas, NV, USA. pp. 1646–1654. [Google Scholar]
  • 15.Lim B., Son S., Kim H., Nah S., Lee K. M. Enhanced deep residual networks for single image super-resolution. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; July 2017; Hawaii, USA. pp. 136–144. [Google Scholar]
  • 16.Kim J., Lee J. K., Lee K. M. Deeply-recursive convolutional network for image super-resolution. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; July 2016; Las Vegas, NV, USA. pp. 1637–1645. [Google Scholar]
  • 17.Tai Y., Yang J., Liu X. Image super-resolution via deep recursive residual network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; July 2017; Honolulu, HI, USA. pp. 2790–2798. [Google Scholar]
  • 18.Zhang Y., Tian Y., Kong Y., Zhong B., Fu Y. Residual dense network for image super-resolution. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; June 2018; Salt Lake City, UT, USA. pp. 2472–2481. [Google Scholar]
  • 19.Ahn N., Kang B., Sohn K. A. Fast, accurate, and lightweight super-resolution with cascading residual network. Proceedings of the European Conference on Computer Vision (ECCV); September 2018; Munich, Germany. pp. 252–268. [Google Scholar]
  • 20.Hui Z., Wang X., Gao X. Fast and accurate single image super-resolution via information distillation network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; June 2018; Salt Lake City, UT, USA. pp. 723–731. [Google Scholar]
  • 21.Hui Z., Gao X., Yang Y., Wang X. Lightweight image super-resolution with information multi-distillation network. Proceedings of the 27th Acm International Conference on Multimedia; October 2019; New York,NY,USA. pp. 2024–2032. [Google Scholar]
  • 22.Hou Q., Zhou D., Feng J. Coordinate attention for efficient mobile network design. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; June 2021; Nashville, TN, USA. pp. 13713–13722. [Google Scholar]
  • 23.Shi W., Caballero J., Huszar F. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; July 2016; Las Vegas, NV, USA. pp. 1874–1883. [Google Scholar]
  • 24.Lai W., Huang J., Ahuja N., Yang M. Deep laplacian pyramid networks for fast and accurate super-resolution. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; July 2017; Hawaii, USA. pp. 624–632. [Google Scholar]
  • 25.Tai Y., Yang J., Liu X., Xu C. Memnet: a persistent memory network for image restoration. Proceedings of the IEEE International Conference on Computer Vision; October 2017; Venice, Italy. pp. 4539–4547. [Google Scholar]
  • 26.Zhang Y., Li K., Li K. Image super-resolution using very deep residual channel attention networks. Proceedings of the European Conference on Computer Vision (ECCV); September 2018; Munich, Germany. pp. 286–301. [Google Scholar]
  • 27.Li Z., Yang J., Liu Z. Feedback network for image super-resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; June 2019; Long Beach, CA, USA. pp. 3867–3876. [Google Scholar]
  • 28.Liu J., Tang J., Wu G. Residual feature distillation network for lightweight image super-resolution. Proceedings of the European Conference on Computer Vision; August 2020; Glasgow, UK. pp. 41–55. [Google Scholar]
  • 29.Gao G., Li J., Wang Z. Feature Distillation Interaction Weighting Network for Lightweight Image Super-resolution. 2021. https://arxiv.org/abs/2112.08655 .
  • 30.Gao G., Wang Z., Li J. Lightweight Bimodal Network for Single-Image Super-resolution via Symmetric CNN and Recursive Transformer. 2022. https://arxiv.org/abs/2204.13286 .
  • 31.Ledig C., Theis L., Huszár F. Photo-realistic single image super-resolution using a generative adversarial network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; July 2017; Honolulu, HI, USA. pp. 4681–4690. [Google Scholar]
  • 32.Wang X., Yu K., Wu S. Esrgan: enhanced super-resolution generative adversarial networks. Proceedings of the European Conference on Computer Vision (ECCV); September 2018; Munich, Germany. pp. 63–79. [Google Scholar]
  • 33.Hu J., Shen L., Sun G. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; JUNE 2018; Salt Lake City, UT, USA. pp. 7132–7141. [Google Scholar]
  • 34.Wang Q., Wu B., Zhu P. Eca-Net: efficient channel attention for deep convolutional neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; June 2020; New Orleans, LA. pp. 11531–11539. [Google Scholar]
  • 35.Woo S., Park J., Lee J. Y., Kweon I. S. Cbam: convolutional block attention module. Proceedings of the European Conference on Computer Vision; September 2018; Munich, Germany. pp. 3–19. [Google Scholar]
  • 36.Wang X., Gu Y., Gao X., Hui Z. Dual residual attention module network for single image super resolution. Neurocomputing . 2019;364:269–279. doi: 10.1016/j.neucom.2019.06.078. [DOI] [Google Scholar]
  • 37.Wang F., Hu H., Shen C. BAM: A Lightweight and Efficient Balanced Attention Mechanism for Single Image Super Resolution. 2021. https://arxiv.org/abs/2104.07566 .
  • 38.Ding X., Zhang X., Ma N. Repvgg: making vgg-style convnets great again. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; JUNE 2021; Nashville, TN, USA. pp. 13733–13742. [Google Scholar]
  • 39.Liu J. J., Hou Q., Cheng M. M., Wang C., Feng J. Improving convolutional networks with self-calibrated convolutions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; June 2020; Seattle, WA, USA. pp. 10096–10105. [Google Scholar]
  • 40.Agustsson E., Timofte R. Ntire 2017 challenge on single image super-resolution: dataset and study. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; July 2017; Hawaii, USA. pp. 126–135. [Google Scholar]
  • 41.Bevilacqua M., Roumy A., Guillemot C., Alberi-Morel M. L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. Proceedings of the British Machine Vision Conference; September 2012; London, UK. pp. 1–10. [Google Scholar]
  • 42.Zeyde R., Elad M., Protter M. On single image scale-up using sparse-representations. Proceedings of the International Conference on Curves and Surfaces; June 2010; Avignon France. pp. 711–730. [Google Scholar]
  • 43.Martin D., Fowlkes C., Tal D., Malik J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings of the Eighth IEEE International Conference on Computer Vision; July 2001; Vancouver, BC, Canada. pp. 416–423. [Google Scholar]
  • 44.Huang J. B., Singh A., Ahuja N. Single image super-resolution from transformed self-exemplars. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; June 2015; Boston, MA, USA. pp. 5197–5206. [Google Scholar]
  • 45.Wang Z., Bovik A. C., Sheikh H. R., Simoncelli E. P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing . 2004;13(4):600–612. doi: 10.1109/tip.2003.819861. [DOI] [PubMed] [Google Scholar]
  • 46.Zhang K., Zuo W., Zhang L. Learning a single convolutional super-resolution network for multiple degradations. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; JUNE 2018; Salt Lake City, UT, USA. pp. 3262–3271. [Google Scholar]
  • 47.Lan R., Sun L., Liu Z., Lu H., Pang C., Luo X. MADNet: a fast and lightweight network for single-image super resolution. IEEE Transactions on Cybernetics . March 2021;51(3):1443–1453. doi: 10.1109/tcyb.2020.2970104. [DOI] [PubMed] [Google Scholar]
  • 48.Hu Y., Gao X., Li J., Huang Y., Wang H. Single image super-resolution with multi-scale information cross-fusion network. Signal Processing . 2021;179 doi: 10.1016/j.sigpro.2020.107831.107831 [DOI] [Google Scholar]
  • 49.Zhang H., Xiao J., Jin Z. Multi-scale image super-resolution via a single extendable deep network. IEEE Journal of Selected Topics in Signal Processing . 2021;15(2):253–263. doi: 10.1109/jstsp.2020.3045282. [DOI] [Google Scholar]
