Abstract
Deep convolutional neural networks (CNNs) have pushed forward the frontier of super-resolution (SR) research. However, current CNN models exhibit a major flaw: they are biased towards learning low-frequency signals. This bias is especially problematic for image SR, which targets reconstructing all fine details and image textures. To tackle this challenge, we propose to improve the learning of high-frequency features both locally and globally, and introduce two novel architectural units to existing SR models. Specifically, we propose a dynamic high-pass filtering (HPF) module that locally applies adaptive filter weights to each spatial location and channel group to preserve high-frequency signals. We also propose a matrix multi-spectral channel attention (MMCA) module that predicts the attention map of features decomposed in the frequency domain. This module operates in a global context to adaptively recalibrate feature responses at different frequencies. Extensive qualitative and quantitative results demonstrate that our proposed modules achieve better accuracy and visual quality than state-of-the-art methods on several benchmark datasets.
1. Introduction
Image SR is a modeling task that estimates a high-resolution (HR) image from its low-resolution (LR) counterpart. Image SR is a challenging and ill-posed problem since multiple solutions exist for any LR input. Given the recent advances in deep learning, convolutional neural network (CNN) based SR methods have been leveraged in a wide variety of research domains such as biomedicine, object recognition, and hyper-spectral imaging [9, 21, 32, 43].
The promising results and potential impact of SR in these domains have garnered attention from the vision research community. Many CNN-based methods have been proposed [4, 5, 6, 7, 17, 20, 47, 49] and significantly outperform traditional methods. In line with the ‘very deep’ paradigm, these methods use over-parameterized networks with hundreds of layers. This approach is usually coupled with a recent architectural breakthrough known as residual learning. Residual learning alleviates the degradation problem due to increased depth, and simplifies the learning task, which improves network convergence.
Although these advancements have enhanced performance and are now commonplace in SR networks, these methods still suffer from a serious flaw (see Figure 1). It has been demonstrated that neural networks exhibit a bias towards low-frequency signals. Figure 2 illustrates a prime example of this. In the output of a popular and robust SR baseline, RCAN [47], we can see that the high-frequency content is significantly reduced, causing the reconstruction to be overly smooth. This is due to many aspects of training, such as the loss function, architecture type, and optimization method. Ledig et al. [20] showed that standard pixel-wise losses (ℓ1 or ℓ2) tend to pull the reconstruction towards an average of the possible solutions on the natural image manifold that are equidistant in terms of the ℓ2 loss.
Figure 1: Visual comparison (×4) on “image 024” from Urban100. Existing methods suffer from blurring artifacts.
Figure 2: Comparison of distributions for a sample of sequential pixels sampled from the patches shown in Figure 1. Existing methods produce an overly smooth distribution.
Similarly, higher frequencies struggle to propagate due to the architecture and optimization method of the networks [2]. Networks quickly become saturated with low-frequency patterns first, thereby halting the learning of additional information. Since there is high information redundancy between channels, many recent works propose various attention mechanisms to re-weight channels. The classic channel attention mechanism, SENet [13], suffers from one major drawback: Qin et al. [33] theoretically demonstrated that by using global average pooling, SENet discards all frequencies except the lowest one. Another issue that arises is aliasing, the phenomenon in which high-frequency signals degenerate after sampling. This is caused by downsampling layers, which are widely used in deep networks to reduce parameters and computation [51]. For image SR, these flaws are exacerbated since the modeling task requires complete high-frequency information.
Motivated by these issues, we propose to bridge this divide by ensuring that high-frequency information propagates through the network. We tackle this problem both locally and globally. Our global approach modifies the existing channel attention mechanism to utilize a broader frequency spectrum than existing methods. This increases the representational power of the network and preserves the inter-dependencies between features. In addition to this novel channel attention mechanism, we propose to amplify high-frequency details locally in a dynamic and context-aware fashion: we learn a different high-pass filtering kernel for each spatial location, which is then applied to the input features at its respective location. Low-frequency information is preserved via long- and short-range skip connections. By following the convolution operation with a high-pass filtering operation, we pivot the network’s learning capacity towards the more difficult high-frequency details.
In summary, our main contributions are as follows:
We propose a dynamic high-pass filtering layer for image super-resolution (SR) networks. This module enhances the network’s discriminative learning ability by enabling it to focus on useful spatial content.
We further propose a matrix multi-spectral channel attention mechanism that predicts the attention map of features decomposed in the frequency domain. The feature channels are then adaptively rescaled based on their maximum frequency response.
We provide visual results and analyses regarding our proposed modules. We also conduct extensive comparisons with recent image SR methods and achieve significant gains quantitatively and visually.
2. Related Work
Image Super-resolution.
State-of-the-art deep learning based SR methods formulate the problem as a dense regression task that learns an end-to-end mapping, represented by a deep CNN, between low-resolution and high-resolution images. The pioneering work by Dong et al. [6] first utilized deep learning to solve the SR problem using a three-layer CNN and further improved the training efficiency in follow-up work [7]. Following this first attempt, many works have achieved better performance by using the “very deep” paradigm that increases the depth and width of the CNNs and integrates residual learning [17, 24, 47, 49]. More recent works integrate different channel and spatial attention mechanisms to utilize the interactions among layers, channels, and positions. Dai et al. [4] propose SAN, which includes an attention module that learns feature inter-dependencies by considering second-order statistics of features. Niu et al. propose HAN [30], which includes both a layer attention module and a channel-spatial attention module to emphasize hierarchical features by considering the correlations among layers. RBAN [5] consists of two types of attention modules for feature expression and feature correlation learning. We differ from these works by explicitly focusing on the learning of high-frequency signals.
Visual Attention.
SENet [13] accomplishes channel attention using a single global descriptor for each channel by global average pooling. These descriptors are then passed to a multi-layer perceptron (MLP) to calculate the weights of each channel. Several works have extended this original scheme by also integrating spatial attention, including CBAM [41], DAN [10] and scSE [34]. Additional works incorporate a variety of techniques to reduce redundancy of the fully connected layers in the MLP (ECANet) [39] and to selectively aggregate channels (SKNet) [22].
However, most of these methods use only the lowest frequency component (via averaging) of the features’ frequency spectrum, as theoretically demonstrated in Qin et al. [33]. To overcome this, FcaNet [33] builds on the original SENet by proposing a frequency-based approach to channel attention. This is done by grouping channels and assigning the same single frequency to each channel in a given group. The global descriptor for each channel is its corresponding frequency coefficient calculated via the Discrete Cosine Transform. In this way, they expand the frequencies being utilized by the attention mechanism. We adapt and improve this mechanism to image SR by considering multiple frequency components for each channel.
Adaptive Filtering Layer.
Image filtering is a classic computer vision technique in image restoration tasks, including super-resolution, de-noising, and in-painting [36]. Previous works have integrated classic filters (e.g., Gaussian) into deep models to tackle vision tasks at different levels [14, 42, 46]. However, those filters have fixed elements, restricting the adaptation to specific spatial locations and image content. Moreover, these filters require careful tuning of hyperparameters. Therefore, recent works also make the filters learnable during optimization and spatially-varying based on local features [16, 35, 51]. Specifically, Zou et al. [51] restrict the learned filters to be low-pass to counter the aliasing artifacts in model downsampling layers. We incorporate their approach into super-resolution models by introducing the dynamic high-pass filtering (HPF) layer. The HPF layer can better preserve the high-frequency signals in deep models, which is favorable for the SR task since it requires fine details and textures.
3. Proposed Method
In this section, we introduce our method, Dynamic Filtering and Spectral Attention (DFSA). It consists of two novel modules that can be seamlessly integrated into existing SR architectures (e.g., RCAN [47]) to improve super-resolution performance: the Dynamic High-Pass Filtering Layer (HPF) module (Sec. 3.1) and the Matrix Multi-Spectral Channel Attention (MMCA) module (Sec. 3.2). These modules conduct local and global frequency modulation dynamically. HPF amplifies the high frequencies of the input features by dynamically learning and applying different high-pass kernels for each spatial location. MMCA then adaptively rescales channels using their maximal frequency response. Figure 3 shows how these modules are integrated into a standard residual block used in image SR networks.
Figure 3: Comparison of residual blocks in (a) the original RCAN [47] and (b) ours. We add our dynamic high-pass filtering (HPF) layer after the first convolution and replace the standard channel attention with our matrix multi-spectral channel attention (MMCA).
3.1. Dynamic High-Pass Filtering Layer (HPF)
Following the design approach of [51], the filtering layer learns to dynamically generate different spatial and channel-wise high-pass kernels, which are then applied at their respective locations. Using the same kernel across the spatial extent of the input features may not accurately capture all the high-frequency details, since the frequency of a signal can vary dramatically across spatial locations. Consequently, we learn a different high-pass kernel for each spatial location. In a similar vein, we could also learn a different kernel for each channel; however, this would incur severe computational overhead. Since there is information redundancy in channel features, it is sufficient to split the channels into groups. Thus, we split the C channels into g groups and predict a different set of high-pass kernels for each group. Figure 4 illustrates the HPF module. Given an input X ∈ RH×W×C, we learn g kernels for each spatial location (i,j) of X and then apply these kernels to X at their respective locations and groups to produce our output Y. Note that for each spatial location (i,j) there is a set of points (indicated by gray boxes overlaid on X in Figure 4) surrounding it which are involved in the application of kernel wg. This technique enables us to propagate high frequencies to the subsequent layer. By using this module throughout the depth of the network, we can preserve the high-frequency information.
Figure 4: Weight generation (G(X)) and application in the dynamic filtering layer as described in [51] (a) compared to our modification (b). For each group of channels, we predict a different k × k high-pass kernel for each spatial location. The kernels are then applied to their respective locations to produce the final output.
To learn the filters, we follow [51] by applying a standard convolution followed by batch normalization to the input feature X ∈ RN×H×W×C. This produces our kernels w ∈ RN×H×W×g×k×k, i.e., one k×k kernel per channel group at every spatial location. In [51], the authors ensure their filters are low-pass by constraining the weights to be positive and sum to one via the softmax function. To produce the corresponding high-pass kernel, we simply invert this by subtracting the low-pass kernel from the identity kernel, as indicated in (b) of Figure 4.
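To make the mechanism concrete, the following PyTorch sketch implements one possible version of the HPF layer under the assumptions above: a conv + BN branch predicts a k×k kernel per channel group at every spatial location, a softmax over the kernel taps yields the low-pass kernel, and the high-pass kernel is obtained by subtracting it from the identity kernel before being applied via unfolding. Class and variable names (e.g., DynamicHighPassFilter, gen) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicHighPassFilter(nn.Module):
    """Sketch of a dynamic high-pass filtering (HPF) layer.

    A weight-generation branch predicts a k x k low-pass kernel for every
    spatial location and channel group (softmax makes the taps positive and
    sum-to-one, following [51]); the high-pass kernel is the identity kernel
    minus the low-pass kernel and is applied locally to the input features.
    """
    def __init__(self, channels, groups=8, kernel_size=3):
        super().__init__()
        assert channels % groups == 0
        self.groups, self.k = groups, kernel_size
        # G(X): convolution + batch norm predicting k*k weights per group per location.
        self.gen = nn.Sequential(
            nn.Conv2d(channels, groups * kernel_size ** 2, kernel_size,
                      padding=kernel_size // 2),
            nn.BatchNorm2d(groups * kernel_size ** 2),
        )

    def forward(self, x):
        n, c, h, w = x.shape
        k, g = self.k, self.groups
        # Low-pass kernels: (N, g, k*k, H, W), positive and summing to one over the taps.
        lp = F.softmax(self.gen(x).view(n, g, k * k, h, w), dim=2)
        # Identity kernel (1 at the centre tap); high-pass = identity - low-pass.
        identity = torch.zeros_like(lp)
        identity[:, :, (k * k) // 2] = 1.0
        hp = identity - lp
        # Unfold the input into k*k neighbourhoods: (N, g, C/g, k*k, H, W).
        patches = F.unfold(x, k, padding=k // 2).view(n, g, c // g, k * k, h, w)
        # Apply the per-location, per-group kernels and sum over the taps.
        return (patches * hp.unsqueeze(2)).sum(dim=3).reshape(n, c, h, w)
```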
3.2. Matrix Multi-Spectral Channel Attention
Channel Attention (CA). After amplifying high frequency details in the feature extraction layers of the residual block, we next operate in a global context by using CA. Recall that the standard approach, SENet [13], calculates the average of each channel using global average pooling (GAP).
We revisit theoretical findings from [33] which demonstrate that this approach uses only the lowest frequency information of the input features. Thus, any image enhancement network (e.g., for image SR, deblurring, or denoising) that uses CA discards other, potentially useful, high-frequency information for image reconstruction. We claim that these high-frequency components carry valuable information. As such, we propose a modified CA mechanism that uses several frequency components for each channel.
Transformation to Frequency Domain.
There are several transformation methods one can use to decompose a signal into its spatial frequency spectrum. The predominant method for frequency analysis is the Discrete Fourier Transform (DFT). Although the DFT is widely used, we instead focus on another attractive method due to its simplicity, the Discrete Cosine Transform (DCT) [1]. The DCT expresses a set of data points as a sum of cosine functions oscillating at different frequencies. One can view the DCT as a special case of the DFT that considers only the real components of the decomposition. The DCT has a unique property which makes it the heart of the most widely used image compression standard and digital image format (JPEG): strong “energy compaction”, meaning that a large proportion of the total signal energy is contained in a handful of coefficients. This is especially true for natural image data, where there are generally large regions of uniform signal.
For an input x ∈ RH×W, where H is the height of x and W is the width of x, the 2D DCT frequency spectrum g ∈ RH×W is defined as:
g_{h,w} = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x_{i,j} \cos\left(\frac{\pi h}{H}\left(i + \frac{1}{2}\right)\right) \cos\left(\frac{\pi w}{W}\left(j + \frac{1}{2}\right)\right), \quad h \in \{0,\dots,H-1\},\ w \in \{0,\dots,W-1\}. \tag{1}
For simplicity, we omit normalizing constants, which do not affect the results. As discussed in FcaNet [33], this decomposition produces coefficients gh,w which are simply a weighted sum of the input. The parameters h and w control the frequencies of the cosine functions. Setting h = 0 and w = 0 in Eq. (1), we have:
g_{0,0} = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x_{i,j} \cos(0)\cos(0) = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x_{i,j}. \tag{2}
With h = 0 and w = 0, the cosine terms evaluate to 1, so we are simply summing the input (up to a normalizing factor). In Eq. (2), g0,0 represents the lowest frequency component of the 2D DCT and is proportional to GAP.
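As a quick sanity check of this relationship, the short NumPy snippet below (the helper name dct2_coeff is ours) evaluates the unnormalized coefficient g0,0 of a random patch and confirms that it equals H·W times the patch mean, i.e., it is proportional to GAP.

```python
import numpy as np

def dct2_coeff(x, h, w):
    """Unnormalised 2D DCT coefficient g_{h,w} as in Eq. (1)."""
    H, W = x.shape
    i = np.arange(H).reshape(-1, 1)
    j = np.arange(W).reshape(1, -1)
    basis = np.cos(np.pi * h * (i + 0.5) / H) * np.cos(np.pi * w * (j + 0.5) / W)
    return (x * basis).sum()

x = np.random.default_rng(0).standard_normal((7, 7))
# Eq. (2): the (0, 0) coefficient is the plain sum of the patch, i.e. H*W*GAP(x).
print(np.isclose(dct2_coeff(x, 0, 0), x.size * x.mean()))  # True
```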
Matrix Multi-Spectral CA.
We approach the design of our CA mechanism using these findings. Since our goal is to utilize more of the frequency spectrum of the features, we follow [33] and transform our input to a frequency embedding using the DCT. The global descriptor for each channel is then the maximum frequency response. We provide additional technical details below.
The benefit of using the DCT is that we can pre-compute the DCT weights as a pre-processing step, so there is little additional overhead during training and testing. The specifics of our method are described in Figure 6. Suppose that for each of the C channels in our input features X ∈ RC×H×W we want to use J frequency components. We pre-compute the matrix of DCT weights A ∈ RJ×C×H×W using Eq. (1). That is, for the r-th frequency component, which corresponds to a specific chosen pair (u,v), the weights are Ar,c,i,j = cos(πu(i + 1/2)/H) cos(πv(j + 1/2)/W), where r ∈ {1,...,J}. Since the cosine basis does not depend on the channel index, these weights are simply replicated along the channel dimension.
Figure 6: Our MMCA module. The input feature is first transformed to the frequency domain using the discrete cosine transform. The resulting matrix is max-pooled and then fed as input to an MLP, which provides the channel attention.
Expanding X such that X ∈ R1×C×H×W, then performing element-wise multiplication with A followed by a spatial sum, produces our DCT coefficients. These coefficients form our J global descriptors per channel. More specifically, Dc,r = Σi Σj Xc,i,j Ar,c,i,j, where D ∈ RC×J.
To reduce the matrix of frequency global descriptors D, we take the maximum frequency response for each channel c, i.e., dc = maxr Dc,r. The channel attention is then computed as att = S(F(d)), where F corresponds to fully connected (FC) layers and S denotes the standard sigmoid function.
Finally, the input features X are re-weighted using the final calculated attention. Thus, each of the J frequencies contributes to the final attention. FcaNet [33] groups channels and assigns the same frequency component to channels within the same group. On the other hand, we do not make this restriction and instead take the maximum response over J components for each channel individually.
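The PyTorch sketch below summarizes this computation under the definitions above: DCT weights for J chosen (u, v) pairs are pre-computed from Eq. (1), the adaptively down-sampled features are projected onto them to form the descriptor matrix D, a max over the J responses gives one descriptor per channel, and an FC-sigmoid excitation produces the attention. Because the cosine basis does not depend on the channel, the sketch stores weights of shape J×H′×W′ and broadcasts them across channels; the class name, reduction ratio, and example frequency list are illustrative assumptions rather than the authors' exact configuration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatrixMultiSpectralAttention(nn.Module):
    """Sketch of the matrix multi-spectral channel attention (MMCA) module."""

    def __init__(self, channels, freq_pairs, dct_h=7, dct_w=7, reduction=16):
        super().__init__()
        self.dct_h, self.dct_w = dct_h, dct_w
        # Pre-compute the DCT weights for the J chosen (u, v) pairs: (J, H', W').
        basis = torch.stack([self._dct_basis(u, v, dct_h, dct_w) for u, v in freq_pairs])
        self.register_buffer("dct_weights", basis)
        self.fc = nn.Sequential(                       # excitation MLP, as in SENet
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    @staticmethod
    def _dct_basis(u, v, h, w):
        i = torch.arange(h, dtype=torch.float32)
        j = torch.arange(w, dtype=torch.float32)
        bi = torch.cos(math.pi * u * (i + 0.5) / h)    # row cosine term of Eq. (1)
        bj = torch.cos(math.pi * v * (j + 0.5) / w)    # column cosine term of Eq. (1)
        return torch.outer(bi, bj)

    def forward(self, x):
        n, c, _, _ = x.shape
        # Down-sample the features to the fixed DCT spatial size, as in [33].
        xs = F.adaptive_avg_pool2d(x, (self.dct_h, self.dct_w))
        # D[n, c, r] = sum_{i, j} X[n, c, i, j] * A[r, i, j]  ->  (N, C, J)
        d = torch.einsum("nchw,jhw->ncj", xs, self.dct_weights)
        desc = d.max(dim=2).values                     # max over the J frequencies: (N, C)
        att = self.fc(desc).view(n, c, 1, 1)
        return x * att                                 # re-weight the input features
```

As an example, with the 7×7 DCT grid used here, freq_pairs could mix components from the two corners highlighted in Figure 5, e.g. [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]; the exact set used in our experiments is the one indicated in Figure 5.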
4. Experiments
4.1. Settings
Datasets.
There are a variety of datasets for image SR with varying image content, resolution, and quality. To train and test our model, we use the DIV2K [38] image dataset. DIV2K is a rich image dataset consisting of 800 training images, 100 validation images, and 100 testing images. To enrich the training set with more diverse textures, we also use the Flickr2K dataset [24]. For testing, we use five standard benchmark datasets: Set5 [3], Set14 [44], B100 [27], Urban100 [15], and Manga109 [28].
Evaluation Metrics.
To evaluate our method, we follow standard practice and report the peak signal-to-noise ratio (PSNR) and the structural similarity metric (SSIM) [40]. These metrics are applied to the Y channel (i.e., luminance) of the images transformed to the YCbCr color space.
Training Settings.
To train our models, a batch of 16 LR RGB images is randomly sampled and cropped to a size of 48×48. Training patches are augmented using random horizontal flips and 90° rotations. Our models are trained using the ADAM optimizer [18] with β1 = 0.9, β2 = 0.99, and ϵ = 10−8. The initial learning rate is set to 10−4 and halved every 200 epochs. We use the ℓ1 loss since it has been empirically shown to outperform the ℓ2 loss for image SR tasks. The model is implemented in PyTorch [31] and trained using a single Nvidia V100 GPU.
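For reference, the snippet below shows one common way of computing PSNR on the Y channel, assuming 8-bit RGB inputs, an ITU-R BT.601 luminance conversion, and cropping of scale border pixels; exact conventions differ slightly between SR codebases, so this is a sketch of the protocol rather than the evaluation code used here.

```python
import numpy as np

def rgb_to_y(img):
    """Luminance (Y) of an RGB image with values in [0, 255] (BT.601 constants)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr, scale):
    """PSNR on the Y channel, cropping `scale` border pixels (a common SR convention)."""
    y_sr = rgb_to_y(sr.astype(np.float64))[scale:-scale, scale:-scale]
    y_hr = rgb_to_y(hr.astype(np.float64))[scale:-scale, scale:-scale]
    mse = np.mean((y_sr - y_hr) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```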
We integrate our proposed modules, HPF and MMCA, into RCAN [47]. RCAN consists of 10 residual groups (RG), each containing 20 residual blocks (RB). The number of channels is set to 64. To reduce the computational overhead, we place our components only in the last RB of each RG of RCAN. The HPF module is added after the first convolution, as illustrated in Figure 3, while the standard CA is replaced with our MMCA. We set the number of groups in the HPF to 8. The number of frequency components for each channel is also 8. The chosen components are a combination of high and low frequencies. These hyper-parameter settings are discussed in more detail in their corresponding ablation study subsections below. To compute the frequency coefficients, we first adaptively down-sample the input channels to a spatial extent of 7×7, similar to [33].
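A minimal sketch of this integration is given below: a generic RCAN-style residual block accepts optional filtering and attention modules, and only the last block of each residual group receives the HPF and MMCA modules (the group-level convolution and skip connection of RCAN are omitted for brevity). The placeholder defaults and factory arguments are illustrative stand-ins for the modules sketched in Secs. 3.1 and 3.2, not RCAN's actual code.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """RCAN-style residual block: conv-ReLU-conv with optional HPF and channel attention."""
    def __init__(self, channels=64, hpf=None, attention=None):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.hpf = hpf if hpf is not None else nn.Identity()        # Fig. 3(b): after conv1
        self.attention = attention if attention is not None else nn.Identity()

    def forward(self, x):
        res = self.conv2(self.act(self.hpf(self.conv1(x))))
        return x + self.attention(res)

def make_residual_group(n_blocks=20, channels=64, hpf_factory=None, mmca_factory=None):
    """Build one residual group; only the last block receives HPF and MMCA."""
    blocks = []
    for i in range(n_blocks):
        last = (i == n_blocks - 1)
        # In the full model the remaining blocks keep RCAN's standard channel
        # attention; it is omitted here to keep the sketch short.
        blocks.append(ResidualBlock(
            channels,
            hpf=hpf_factory(channels) if (last and hpf_factory) else None,
            attention=mmca_factory(channels) if (last and mmca_factory) else None,
        ))
    return nn.Sequential(*blocks)
```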
4.2. Ablation Studies
Position of HPF in the Standard Residual Block.
To determine where and how many HPF layers to place in the standard residual block (RB), we conducted an ablation study. Figure 3 illustrates the positioning of the HPF layer within an RB, following the first convolution layer. Alternatively, we could create a symmetric design by placing another HPF layer after the second convolution as well, so that each convolution is followed by a high-pass filtering operation. However, our experiments in Table 2 demonstrate that adding a single HPF is sufficient. This also shows that the effectiveness of the layer is not simply due to an increased number of parameters.
Table 2:
Ablation study on the number of HPF modules in a standard residual block. Evaluated using Manga109 benchmark at ×4 scale.
| # HPF layers | 0 | 1 | 2 |
|---|---|---|---|
| PSNR | 30.65 | 30.82 | 30.74 |
Number of HPF Groups.
To study the influence of the number of groups in the HPF module, we conduct an ablation study by varying the groups hyperparameter, similar to [51]. Table 3 shows that increasing the number of groups generally leads to improved performance. Since we compute a different set of filters for each group, this computation can become expensive as the depth of the network increases (i.e., more residual blocks). To alleviate this, we take a middle ground and use 8 groups, since the performance difference is small and this choice is computationally more efficient. In this way, the learned filters can adapt to different frequencies across feature channels, while saving computation by sharing the same filter within each group.
Table 3:
Ablation study on the number of groups in the HPF module. Evaluated using Manga109 at ×4 scale.
| # groups | 2 | 4 | 8 | 16 |
|---|---|---|---|---|
| PSNR | 30.82 | 30.79 | 30.82 | 30.88 |
HPF Filter Analysis.
To better understand the behavior of the HPF module, we analyze the learned filters, similar to [51]. What differentiates various filters is their variance. For example, a k × k smoothing filter, also known as the average filter, has a variance of zero since it consists of identical elements, each with a value of 1/k². Figure 7 visualizes the variance of the learned filter weights across different spatial locations. The HPF module learns filters that spatially adapt to different image content. For example, in the first image in Figure 7 (the bird), there is high variance precisely where there are abrupt and sharp transitions at the leaf edges. Similarly, in the image of the building, there are several edges and pixel intensity fluctuations which our HPF filters are able to amplify. Thus, the learned filters can propagate high-frequency details after the convolution while preserving useful image content. We can also see that the filters capture higher-frequency information with sharp intensity transitions while attenuating lower-frequency content such as the uniform background.
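This analysis is straightforward to reproduce: assuming the predicted high-pass kernels are available as a tensor hp of shape (N, g, k·k, H, W), as in the HPF sketch of Sec. 3.1, the per-location variance over the k·k taps can be computed as below (the tensor layout and helper name are assumptions).

```python
import torch

def kernel_variance_map(hp, group):
    """Variance over the k*k kernel taps for one channel group -> (N, H, W).

    Zero variance corresponds to a uniform (averaging) filter; high variance
    indicates a strong high-pass response at that spatial location.
    """
    return hp[:, group].var(dim=1)

# e.g. var_map = kernel_variance_map(hp, group=4); visualise var_map[0] as a heat map.
```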
Figure 7: Variance of the learned dynamic high-pass kernel from the 4th residual block, 5th group. The kernel correctly learns to filter high-frequency details such as sharp pixel value transitions.
Number of Frequency Components.
To investigate the appropriate choice of the number of frequency components, we conduct an ablation study, similar to [33]. Table 1 compares the effect of using multiple frequency components in the channel attention module. The general trend is clear: increasing the number of frequency components increases performance. However, at a certain point (16 frequency components in Table 1) the performance stagnates. All experiments using more than a single frequency component in our modified frequency-based channel attention show a large performance gap compared to the standard channel attention. We claim that this is because using only one frequency component discards useful information. The additional features encode other salient information and can compensate for the “soft” global statistics encoded by average pooling. Consequently, pooling features based on their frequency results in meaningful global descriptors. This verifies our claim that adding frequency information helps the network integrate more components from the wider frequency spectrum. Given these results, we use 8 components for our final models.
Table 1:
Ablation study on the number of frequency components. Evaluated using Urban100 at ×4 scale.
| # frequency components | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| PSNR | 26.24 | 26.36 | 26.38 | 26.39 | 26.33 |
The chosen frequency components are illustrated in Figure 5. Moving across the rows and columns of the DCT grid of basis functions in Figure 5 corresponds to increasing oscillation in the vertical or horizontal direction, respectively. Intuitively, the top-left corner corresponds to zero oscillations in either direction (i.e., h = 0, w = 0 in Eq. (1)), which results in a constant term. On the other hand, the highest vertical and horizontal frequency component is in the bottom-right corner. By choosing components from these corners of the DCT matrix, we provide a diverse spectrum for the MMCA module.
Figure 5: Visualization of the DCT basis functions. Orange boxes (top-left and bottom-right) indicate the chosen frequency components for the MMCA module.
Comparison with Other Attention Mechanisms.
We compare our method with the standard SENet and FcaNet. As demonstrated in Table 4, our modified frequency channel attention outperforms both baselines. By incorporating a wider frequency spectrum of the input features, we are able to adaptively re-weight the channels, which in turn yields a performance boost. The key difference between our method and FcaNet is that FcaNet groups channels and assigns the same frequency to each channel in a group. By instead computing multiple frequency coefficients for each channel and then selecting the maximum frequency response, we are able to capture and focus on the high frequencies. Additionally, the choice of frequencies acts as a control by which we expand the spectrum utilized by the attention mechanism.
Table 4:
Comparison with other attention mechanisms in image SR. Evaluated by PSNR using Urban100 and Manga109 benchmarks at ×4 scale.
| Module | SENet | FcaNet | MMCA (Ours) |
|---|---|---|---|
| Urban100 | 26.24 | 26.29 | 26.39 |
| Manga109 | 30.65 | 30.67 | 30.77 |
4.3. Comparison with State-of-the-Art Methods
We extensively compare our method with 17 state-of-the-art image SR methods in Table 5. For qualitative comparisons, we compare with 7 state-of-the-art methods in very challenging cases.
Table 5:
Quantitative comparison with other state-of-the-art methods. Average PSNR (dB) and SSIM for scale factor ×2, ×3 and ×4 are shown for several benchmarks. Best and second best performance are bolded and underlined, respectively.
| Method | Scale | Set5 PSNR | Set5 SSIM | Set14 PSNR | Set14 SSIM | B100 PSNR | B100 SSIM | Urban100 PSNR | Urban100 SSIM | Manga109 PSNR | Manga109 SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LapSRN [19] | ×2 | 37.52 | 0.9591 | 33.08 | 0.9130 | 31.08 | 0.8950 | 30.41 | 0.9101 | 37.27 | 0.9740 |
| MemNet [37] | ×2 | 37.78 | 0.9597 | 33.28 | 0.9142 | 32.08 | 0.8978 | 31.31 | 0.9195 | 37.72 | 0.9740 |
| EDSR [24] | ×2 | 38.11 | 0.9602 | 33.92 | 0.9195 | 32.32 | 0.9013 | 32.93 | 0.9351 | 39.10 | 0.9773 |
| SRMDNF [45] | ×2 | 37.79 | 0.9601 | 33.32 | 0.9159 | 32.05 | 0.8985 | 31.33 | 0.9204 | 38.07 | 0.9761 |
| DBPN [11] | ×2 | 38.09 | 0.9600 | 33.85 | 0.9190 | 32.27 | 0.9000 | 32.55 | 0.9324 | 38.89 | 0.9775 |
| RDN [49] | ×2 | 38.24 | 0.9614 | 34.01 | 0.9212 | 32.34 | 0.9017 | 32.89 | 0.9353 | 39.18 | 0.9780 |
| RCAN [47] | ×2 | 38.27 | 0.9614 | 34.12 | 0.9216 | 32.41 | 0.9027 | 33.34 | 0.9384 | 39.44 | 0.9786 |
| NLRN [25] | ×2 | 38.00 | 0.9603 | 33.46 | 0.9159 | 32.19 | 0.8992 | 31.81 | 0.9249 | N/A | N/A |
| RNAN [48] | ×2 | 38.17 | 0.9611 | 33.87 | 0.9207 | 32.31 | 0.9014 | 32.73 | 0.9340 | 39.23 | 0.9785 |
| SRFBN [23] | ×2 | 38.11 | 0.9609 | 33.82 | 0.9196 | 32.29 | 0.9010 | 32.62 | 0.9328 | 39.08 | 0.9779 |
| OISR [12] | ×2 | 38.21 | 0.9612 | 33.94 | 0.9206 | 32.36 | 0.9019 | 33.03 | 0.9365 | N/A | N/A |
| SAN [4] | ×2 | 38.31 | 0.9620 | 34.07 | 0.9213 | 32.42 | 0.9028 | 33.10 | 0.9370 | 39.32 | 0.9792 |
| CSNLN [29] | ×2 | 38.28 | 0.9616 | 34.12 | 0.9223 | 32.40 | 0.9024 | 33.25 | 0.9386 | 39.37 | 0.9785 |
| RFANet [26] | ×2 | 38.26 | 0.9615 | 34.16 | 0.9220 | 32.41 | 0.9026 | 33.33 | 0.9389 | 39.44 | 0.9783 |
| HAN [30] | ×2 | 38.27 | 0.9614 | 34.16 | 0.9217 | 32.41 | 0.9027 | 33.35 | 0.9385 | 39.46 | 0.9785 |
| NSR [8] | ×2 | 38.23 | 0.9614 | 33.94 | 0.9203 | 32.34 | 0.9020 | 33.02 | 0.9367 | 39.31 | 0.9782 |
| IGNN [50] | ×2 | 38.24 | 0.9613 | 34.07 | 0.9217 | 32.41 | 0.9025 | 33.23 | 0.9383 | 39.35 | 0.9786 |
| DFSA (Ours) | ×2 | 38.38 | 0.9620 | 34.33 | 0.9232 | 32.50 | 0.9036 | 33.66 | 0.9412 | 39.98 | 0.9798 |
| LapSRN [19] | ×3 | 33.82 | 0.9227 | 29.87 | 0.8320 | 28.82 | 0.7980 | 27.07 | 0.8280 | 32.21 | 0.9350 |
| MemNet [37] | ×3 | 34.09 | 0.9248 | 30.00 | 0.8350 | 28.96 | 0.8001 | 27.56 | 0.8376 | 32.51 | 0.9369 |
| EDSR [24] | ×3 | 34.65 | 0.9280 | 30.52 | 0.8462 | 29.25 | 0.8093 | 28.80 | 0.8653 | 34.17 | 0.9476 |
| SRMDNF [45] | ×3 | 34.12 | 0.9254 | 30.04 | 0.8382 | 28.97 | 0.8025 | 27.57 | 0.8398 | 33.00 | 0.9403 |
| RDN [49] | ×3 | 34.71 | 0.9296 | 30.57 | 0.8468 | 29.26 | 0.8093 | 28.80 | 0.8653 | 34.13 | 0.9484 |
| RCAN [47] | ×3 | 34.74 | 0.9299 | 30.65 | 0.8482 | 29.32 | 0.8111 | 29.09 | 0.8702 | 34.44 | 0.9499 |
| NLRN [25] | ×3 | 34.27 | 0.9266 | 30.16 | 0.8374 | 29.06 | 0.8026 | 27.93 | 0.8453 | N/A | N/A |
| RNAN [48] | ×3 | 34.66 | 0.9290 | 30.53 | 0.8463 | 29.26 | 0.8090 | 28.75 | 0.8646 | 34.25 | 0.9483 |
| SRFBN [23] | ×3 | 34.70 | 0.9292 | 30.51 | 0.8461 | 29.24 | 0.8084 | 28.73 | 0.8641 | 34.18 | 0.9481 |
| OISR [12] | ×3 | 34.72 | 0.9297 | 30.57 | 0.8470 | 29.29 | 0.8103 | 28.95 | 0.8680 | N/A | N/A |
| SAN [4] | ×3 | 34.75 | 0.9300 | 30.59 | 0.8476 | 29.33 | 0.8112 | 28.93 | 0.8671 | 34.30 | 0.9494 |
| CSNLN [29] | ×3 | 34.74 | 0.9300 | 30.66 | 0.8482 | 29.33 | 0.8105 | 29.13 | 0.8712 | 34.45 | 0.9502 |
| RFANet [26] | ×3 | 34.79 | 0.9300 | 30.67 | 0.8487 | 29.34 | 0.8115 | 29.15 | 0.8720 | 34.59 | 0.9506 |
| HAN [30] | ×3 | 34.75 | 0.9299 | 30.67 | 0.8483 | 29.32 | 0.8110 | 29.10 | 0.8705 | 34.48 | 0.9500 |
| NSR [8] | ×3 | 34.62 | 0.9289 | 30.57 | 0.8475 | 29.26 | 0.8100 | 28.83 | 0.8663 | 34.27 | 0.9484 |
| IGNN [50] | ×3 | 34.72 | 0.9298 | 30.66 | 0.8484 | 29.31 | 0.8105 | 29.03 | 0.8696 | 34.39 | 0.9496 |
| DFSA (Ours) | ×3 | 34.92 | 0.9312 | 30.83 | 0.8507 | 29.42 | 0.8128 | 29.44 | 0.8761 | 35.07 | 0.9525 |
| LapSRN [19] | ×4 | 31.54 | 0.8850 | 28.19 | 0.7720 | 27.32 | 0.7270 | 25.21 | 0.7560 | 29.09 | 0.8900 |
| MemNet [37] | ×4 | 31.74 | 0.8893 | 28.26 | 0.7723 | 27.40 | 0.7281 | 25.50 | 0.7630 | 29.42 | 0.8942 |
| EDSR [24] | ×4 | 32.46 | 0.8968 | 28.80 | 0.7876 | 27.71 | 0.7420 | 26.64 | 0.8033 | 31.02 | 0.9148 |
| SRMDNF [45] | ×4 | 31.96 | 0.8925 | 28.35 | 0.7787 | 27.49 | 0.7337 | 25.68 | 0.7731 | 30.09 | 0.9024 |
| DBPN [11] | ×4 | 32.47 | 0.8980 | 28.82 | 0.7860 | 27.72 | 0.7400 | 26.38 | 0.7946 | 30.91 | 0.9137 |
| RDN [49] | ×4 | 32.47 | 0.8990 | 28.81 | 0.7871 | 27.72 | 0.7419 | 26.61 | 0.8028 | 31.00 | 0.9151 |
| RCAN [47] | ×4 | 32.63 | 0.9002 | 28.87 | 0.7889 | 27.77 | 0.7436 | 26.82 | 0.8087 | 31.22 | 0.9173 |
| NLRN [25] | ×4 | 31.92 | 0.8916 | 28.36 | 0.7745 | 27.48 | 0.7306 | 25.79 | 0.7729 | N/A | N/A |
| RNAN [48] | ×4 | 32.43 | 0.8977 | 28.83 | 0.7871 | 27.72 | 0.7410 | 26.61 | 0.8023 | 31.09 | 0.9149 |
| SRFBN [23] | ×4 | 32.47 | 0.8983 | 28.81 | 0.7868 | 27.72 | 0.7409 | 26.60 | 0.8015 | 31.15 | 0.9160 |
| OISR [12] | ×4 | 32.53 | 0.8992 | 28.86 | 0.7878 | 27.75 | 0.7428 | 26.79 | 0.8068 | N/A | N/A |
| SAN [4] | ×4 | 32.64 | 0.9003 | 28.92 | 0.7888 | 27.78 | 0.7436 | 26.79 | 0.8068 | 31.18 | 0.9169 |
| CSNLN [29] | ×4 | 32.68 | 0.9004 | 28.95 | 0.7888 | 27.80 | 0.7439 | 27.22 | 0.8168 | 31.43 | 0.9201 |
| RFANet [26] | ×4 | 32.66 | 0.9004 | 28.88 | 0.7894 | 27.79 | 0.7442 | 26.92 | 0.8112 | 31.41 | 0.9187 |
| HAN [30] | ×4 | 32.64 | 0.9002 | 28.90 | 0.7890 | 27.80 | 0.7442 | 26.85 | 0.8094 | 31.42 | 0.9177 |
| NSR [8] | ×4 | 32.55 | 0.8987 | 28.79 | 0.7876 | 27.72 | 0.7414 | 26.61 | 0.8025 | 31.10 | 0.9145 |
| IGNN [50] | ×4 | 32.57 | 0.8998 | 28.85 | 0.7891 | 27.77 | 0.7434 | 26.84 | 0.8090 | 31.28 | 0.9182 |
| DFSA (Ours) | ×4 | 32.79 | 0.9019 | 29.06 | 0.7922 | 27.87 | 0.7458 | 27.17 | 0.8163 | 31.88 | 0.9266 |
Quantitative Results.
Table 5 shows quantitative comparisons for ×2, ×3, and ×4 results. As demonstrated in Table 5, our model outperforms the compared methods across scales and benchmarks. The consistently higher PSNR and SSIM values suggest that investigating the frequency domain is a promising direction for image SR. Our method reaches a maximum PSNR increase of 0.52 dB for the ×2 scale, 0.48 dB for the ×3 scale, and 0.45 dB for the ×4 scale. Here, the maximum PSNR increase is the largest gap between our method and the second-best method across all datasets for a given scale.
As previously mentioned, we use RCAN as our SR backbone. Consequently, the number of parameters of our modified model and RCAN are roughly equivalent. Nevertheless, our model outperforms RCAN by maximum PSNR increases of 0.54 dB for the ×2 scale, 0.63 dB for the ×3 scale, and 0.66 dB for the ×4 scale. By modifying the last RB of each RG in RCAN to that of Figure 3 (b), we are able to focus on more informative features and amplify high-frequency details. This observation indicates that the HPF and MMCA modules significantly improve performance. In our model, the last RB of each RG serves as a gate which (1) passes through high-frequency details and (2) operates on a broader frequency spectrum when rescaling the outgoing features. Since our modules operate within a residual group, the low-frequency details are preserved via skip connections, yielding better quantitative results.
Qualitative Results.
In Figure 8, we illustrate qualitative comparisons for several images from the Urban100 benchmark at the ×4 scale. Our model reconstructs images more accurately than other methods. Different patterns are correctly produced by our method, while the outputs of other methods contain blurry patches or artifacts. For example, our method is particularly well-suited for line reconstruction. In “img034” of the Urban100 dataset, our method can correctly produce a subset of the bricks on the wall of the building. In “img059”, the horizontal lines are correctly and clearly produced by our method, whereas RCAN and SAN produce random vertical stripes which are not present in the ground truth. The remaining methods all suffer from a blurring artifact in this patch. Our method is capable of alleviating the blurring artifacts and recovering more high-frequency details. Moreover, our method can distinctly delineate several structures, as illustrated in “img008”, while other methods merge and blur lines in the vertical and/or horizontal direction. These comparisons demonstrate that our modified residual block can extract more sophisticated features from the LR space.
Figure 8: Visual comparison for ×4 SR on the Urban100 dataset. Most compared methods suffer from blurring artifacts. Our method is able to reconstruct high-frequency details better than existing methods.
5. Conclusion
This paper introduces the matrix multi-spectral channel attention (MMCA) and dynamic high-pass filtering (HPF) modules to improve the learning of high-frequency features in the image SR task. With the seamless integration of the proposed modules into a standard SR backbone (RCAN), the network can effectively focus on high-frequency details in the input features. Our experiments suggest that following the convolution layer with the dynamic high-pass filtering operation helps preserve essential details and textures. We combine this module with MMCA to form a new, powerful residual block that can be seamlessly integrated into different architectures. One remaining question for the MMCA module is how to appropriately select the frequency components; a promising path for further exploration is to incorporate this selection into the learning task.
Acknowledgements.
This work is partially supported by NIH award 5U54CA225088-03 and by NSF award IIS-1835231.
References
- [1]. Ahmed Nasir, Natarajan T, and Rao Kamisetty R. Discrete cosine transform. IEEE Transactions on Computers, 100(1):90–93, 1974.
- [2]. Arpit Devansh, Jastrzebski Stanisław, Ballas Nicolas, Krueger David, Bengio Emmanuel, Kanwal Maxinder S, Maharaj Tegan, Fischer Asja, Courville Aaron, Bengio Yoshua, et al. A closer look at memorization in deep networks. In ICML, 2017.
- [3]. Bevilacqua Marco, Roumy Aline, Guillemot Christine, and Alberi-Morel Marie Line. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, 2012.
- [4]. Dai Tao, Cai Jianrui, Zhang Yongbing, Xia Shu-Tao, and Zhang Lei. Second-order attention network for single image super-resolution. In CVPR, 2019.
- [5]. Dai Tao, Zha Hua, Jiang Yong, and Xia Shu-Tao. Image super-resolution via residual block attention networks. In CVPRW, 2019.
- [6]. Dong Chao, Loy Chen Change, He Kaiming, and Tang Xiaoou. Image super-resolution using deep convolutional networks. TPAMI, 2016.
- [7]. Dong Chao, Loy Chen Change, and Tang Xiaoou. Accelerating the super-resolution convolutional neural network. In ECCV, 2016.
- [8]. Fan Yuchen, Yu Jiahui, Mei Yiqun, Zhang Yulun, Fu Yun, Liu Ding, and Huang Thomas S. Neural sparse representation for image restoration. In NeurIPS, 2020.
- [9]. Fang Linjing, Monroe Fred, Novak Sammy Weiser, Kirk Lyndsey, Schiavon Cara R, Yu Seungyoon B, Zhang Tong, Wu Melissa, Kastner Kyle, Latif Alaa Abdel, et al. Deep learning-based point-scanning super-resolution imaging. Nature Methods, pages 1–11, 2021.
- [10]. Fu Jun, Liu Jing, Tian Haijie, Li Yong, Bao Yongjun, Fang Zhiwei, and Lu Hanqing. Dual attention network for scene segmentation. In CVPR, 2019.
- [11]. Haris Muhammad, Shakhnarovich Greg, and Ukita Norimichi. Deep back-projection networks for super-resolution. In CVPR, 2018.
- [12]. He Xiangyu, Mo Zitao, Wang Peisong, Liu Yang, Yang Mingyuan, and Cheng Jian. ODE-inspired network design for single image super-resolution. In CVPR, 2019.
- [13]. Hu Jie, Shen Li, and Sun Gang. Squeeze-and-excitation networks. In CVPR, 2018.
- [14]. Hu Ping, Shuai Bing, Liu Jun, and Wang Gang. Deep level sets for salient object detection. In CVPR, 2017.
- [15]. Huang Jia-Bin, Singh Abhishek, and Ahuja Narendra. Single image super-resolution from transformed self-exemplars. In CVPR, 2015.
- [16]. Jia Xu, De Brabandere Bert, Tuytelaars Tinne, and Gool Luc V. Dynamic filter networks. In NeurIPS, 2016.
- [17]. Kim Jiwon, Lee Jung Kwon, and Lee Kyoung Mu. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
- [18]. Kingma Diederik and Ba Jimmy. Adam: A method for stochastic optimization. In ICLR, 2014.
- [19]. Lai Wei-Sheng, Huang Jia-Bin, Ahuja Narendra, and Yang Ming-Hsuan. Deep Laplacian pyramid networks for fast and accurate super-resolution. In CVPR, 2017.
- [20]. Ledig Christian, Theis Lucas, Huszar Ferenc, Caballero Jose, Cunningham Andrew, Acosta Alejandro, Aitken Andrew, Tejani Alykhan, Totz Johannes, Wang Zehan, and Shi Wenzhe. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
- [21]. Li Jianan, Liang Xiaodan, Wei Yunchao, Xu Tingfa, Feng Jiashi, and Yan Shuicheng. Perceptual generative adversarial networks for small object detection. In CVPR, 2017.
- [22]. Li Xiang, Wang Wenhai, Hu Xiaolin, and Yang Jian. Selective kernel networks. In CVPR, 2019.
- [23]. Li Zhen, Yang Jinglei, Liu Zheng, Yang Xiaomin, Jeon Gwanggil, and Wu Wei. Feedback network for image super-resolution. In CVPR, 2019.
- [24]. Lim Bee, Son Sanghyun, Kim Heewon, Nah Seungjun, and Lee Kyoung Mu. Enhanced deep residual networks for single image super-resolution. In CVPRW, 2017.
- [25]. Liu Ding, Wen Bihan, Fan Yuchen, Loy Chen Change, and Huang Thomas S. Non-local recurrent network for image restoration. In NeurIPS, 2018.
- [26]. Liu Jie, Zhang Wenjie, Tang Yuting, Tang Jie, and Wu Gangshan. Residual feature aggregation network for image super-resolution. In CVPR, 2020.
- [27]. Martin David, Fowlkes Charless, Tal Doron, and Malik Jitendra. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
- [28]. Matsui Yusuke, Ito Kota, Aramaki Yuji, Fujimoto Azuma, Ogawa Toru, Yamasaki Toshihiko, and Aizawa Kiyoharu. Sketch-based manga retrieval using Manga109 dataset. Multimedia Tools and Applications, 2017.
- [29]. Mei Yiqun, Fan Yuchen, Zhou Yuqian, Huang Lichao, Huang Thomas S, and Shi Humphrey. Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining. In CVPR, 2020.
- [30]. Niu Ben, Wen Weilei, Ren Wenqi, Zhang Xiangde, Yang Lianping, Wang Shuzhen, Zhang Kaihao, Cao Xiaochun, and Shen Haifeng. Single image super-resolution via a holistic attention network. In ECCV, 2020.
- [31]. Paszke Adam, Gross Sam, Chintala Soumith, Chanan Gregory, Yang Edward, DeVito Zachary, Lin Zeming, Desmaison Alban, Antiga Luca, and Lerer Adam. Automatic differentiation in PyTorch. 2017.
- [32]. Qiao Chang, Li Di, Guo Yuting, Liu Chong, Jiang Tao, Dai Qionghai, and Li Dong. Evaluation and development of deep neural networks for image super-resolution in optical microscopy. Nature Methods, pages 1–9, 2021.
- [33]. Qin Zequn, Zhang Pengyi, Wu Fei, and Li Xi. FcaNet: Frequency channel attention networks. arXiv preprint arXiv:2012.11879, 2020.
- [34]. Roy Abhijit Guha, Navab Nassir, and Wachinger Christian. Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation” blocks. TMI, 2018.
- [35]. Su Hang, Jampani Varun, Sun Deqing, Gallo Orazio, Learned-Miller Erik, and Kautz Jan. Pixel-adaptive convolutional neural networks. In CVPR, 2019.
- [36]. Szeliski Richard. Computer Vision: Algorithms and Applications. Springer Science & Business Media, 2010.
- [37]. Tai Ying, Yang Jian, Liu Xiaoming, and Xu Chunyan. MemNet: A persistent memory network for image restoration. In ICCV, 2017.
- [38]. Timofte Radu, Agustsson Eirikur, Van Gool Luc, Yang Ming-Hsuan, Zhang Lei, Lim Bee, Son Sanghyun, Kim Heewon, Nah Seungjun, Lee Kyoung Mu, et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In CVPRW, 2017.
- [39]. Wang Qilong, Wu Banggu, Zhu Pengfei, Li Peihua, Zuo Wangmeng, and Hu Qinghua. ECA-Net: Efficient channel attention for deep convolutional neural networks. In CVPR, 2020.
- [40]. Wang Zhou, Bovik Alan C, Sheikh Hamid R, and Simoncelli Eero P. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
- [41]. Woo Sanghyun, Park Jongchan, Lee Joon-Young, and Kweon In So. CBAM: Convolutional block attention module. In ECCV, 2018.
- [42]. Xie Cihang, Wu Yuxin, van der Maaten Laurens, Yuille Alan L, and He Kaiming. Feature denoising for improving adversarial robustness. In CVPR, 2019.
- [43]. Yuan Yuan, Zheng Xiangtao, and Lu Xiaoqiang. Hyperspectral image superresolution by transfer learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(5):1963–1974, 2017.
- [44]. Zeyde Roman, Elad Michael, and Protter Matan. On single image scale-up using sparse-representations. In Proc. 7th Int. Conf. Curves Surf., 2010.
- [45]. Zhang Kai, Zuo Wangmeng, and Zhang Lei. Learning a single convolutional super-resolution network for multiple degradations. In CVPR, 2018.
- [46]. Zhang Richard. Making convolutional networks shift-invariant again. In ICML, 2019.
- [47]. Zhang Yulun, Li Kunpeng, Li Kai, Wang Lichen, Zhong Bineng, and Fu Yun. Image super-resolution using very deep residual channel attention networks. In ECCV, 2018.
- [48]. Zhang Yulun, Li Kunpeng, Li Kai, Zhong Bineng, and Fu Yun. Residual non-local attention networks for image restoration. In ICLR, 2019.
- [49]. Zhang Yulun, Tian Yapeng, Kong Yu, Zhong Bineng, and Fu Yun. Residual dense network for image super-resolution. In CVPR, 2018.
- [50]. Zhou Shangchen, Zhang Jiawei, Zuo Wangmeng, and Loy Chen Change. Cross-scale internal graph neural network for image super-resolution. In NeurIPS, 2020.
- [51]. Zou Xueyan, Xiao Fanyi, Yu Zhiding, and Lee Yong Jae. Delving deeper into anti-aliasing in convnets. In BMVC, 2020.
