Abstract
Infrared image super-resolution (SR) remains a challenging task due to inherent limitations in existing approaches: convolutional neural network (CNN)-based approaches struggle with long-range dependency modeling, whereas transformer-based approaches are computationally expensive and tend to overlook fine local details. To address these issues, we propose a novel hybrid perception enhancement network (HPEN). Its core component is a hybrid perception enhancement block (HPEB), which effectively combines a token aggregation block (TAB) for global context modeling, a multi-scale feature enhancement block (MFEB) for local detail extraction, and a convolutional layer for feature refinement. Extensive experimental results demonstrate that the proposed HPEN achieves leading performance among compared methods. For the challenging ×4 SR task, it attains the best PSNR and SSIM values among the evaluated lightweight SR approaches, while demonstrating remarkable efficiency advantages. Specifically, compared to HiT-SR, HPEN reduces FLOPs by 42.9%, requires only 9.4% of the GPU memory, and delivers a faster inference speed. The code is available at https://github.com/smilenorth1/HPEN-main.
Keywords: Infrared image super-resolution, Hybrid perception, Token aggregation, Multi-scale
Subject terms: Computational science, Computer science, Scientific data
Introduction
Infrared imaging plays a vital role in several critical fields due to its all-weather capability and ability to capture thermal information. For instance, in surveillance and security systems1,2, it enables person and vehicle detection under low-visibility conditions; in defense and aerospace3, it supports target recognition and night-vision navigation; in medical diagnostics4,5, it assists in non-invasive screening such as inflammation detection and vascular imaging; and in industrial inspection6, it helps identify overheating components or structural defects in power equipment and machinery. However, the spatial resolution of infrared sensors is inherently limited, which often restricts their practical usefulness. Infrared image super-resolution (SR) offers a promising solution to enhance image quality beyond the constraints of physical hardware. Although existing methods have made considerable progress, they still face a fundamental challenge: balancing reconstruction accuracy (particularly the preservation of fine details) with the computational efficiency required for real-time deployment.
Convolutional Neural Networks (CNNs)7–14 have become the dominant paradigm in image SR, leveraging their inherent strengths in local feature extraction and hierarchical representation learning. However, CNN-based methods7,8,10,15 are constrained by three fundamental limitations: (1) their restricted receptive fields impede the modeling of long-range dependencies; (2) static convolution kernels lack adaptability to multi-scale patterns; and (3) conventional architectures often fail to support effective cross-resolution interactions. Although some efforts have attempted to address these issues by designing deeper and wider networks (e.g., RCAN16 with over 400 layers), such expansions come at the cost of prohibitive computational complexity, which severely limits their practical deployment in efficiency-sensitive real-world scenarios.
Recent advances in vision transformer (ViT) architectures17–21 have demonstrated remarkable capabilities in image SR, primarily attributed to the global receptive field and powerful context modeling enabled by the self-attention (SA) mechanism. However, two critical limitations remain: (1) the computational complexity of self-attention leads to high memory and processing requirements17,18,22; and (2) its focus on long-range dependencies often results in inadequate representation of high-frequency local details, such as edges and textures23,24. Although subsequent studies have attempted to alleviate these issues through various strategies (e.g., clustering-based local self-attention25 and hierarchical windows26), challenges remain in balancing lightweight design and detail preservation for infrared image SR.
To address these issues, we propose a novel hybrid perception enhancement block (HPEB) that systematically combines three complementary components: a token aggregation block (TAB) for comprehensive global feature modeling, a multi-scale feature enhancement block (MFEB) for extracting local-contextual details, and a 1×1 convolution for cross-channel feature refinement and spatial information optimization. Specifically, the TAB module follows a three-stage process: it first groups similar image patches using learnable token centers updated via an exponential moving average; it then employs a dual-branch architecture integrating intra-group self-attention (IASA) and inter-group cross-attention (IRCA) to fuse local refinements with global semantics; and it finally applies a convolutional feed-forward network (ConvFFN) to further enhance local feature representation. The MFEB adopts a parallel strategy: it uses dual-branch dilated convolutions (with dilation rates of 1 and 2) to capture both fine details and broader context, dynamically fuses these multi-scale features via a 1×1 convolution, and subsequently strengthens global dependencies through Restormer layers (RTLs). By integrating these carefully designed modules, we construct an end-to-end trainable network named HPEN. Extensive experimental results demonstrate that the proposed HPEN achieves an optimal balance between computational efficiency and reconstruction quality, as quantitatively shown in Fig. 1.
Fig. 1.
Comprehensive comparison between the proposed method and other lightweight approaches for ×4 SR on the CVC-09 dataset. The left subplot illustrates the reconstruction quality (PSNR) against both parameter count and computational complexity (FLOPs), with circle sizes representing FLOPs magnitude. The right subplot demonstrates PSNR in relation to inference time and GPU memory consumption, where circle sizes correspond to GPU memory usage. Experimental results indicate that the proposed HPEN achieves a favorable balance between computational overhead (parameters, FLOPs, runtime, and GPU memory) and reconstruction performance (PSNR) compared to existing lightweight SR models.
The main contributions of this work can be summarized as follows:
We propose a novel hybrid perception enhancement block (HPEB) that effectively integrates global structural modeling with local detail preservation in a unified architecture, enabling simultaneous capture of long-range dependencies and fine-grained textures for high-quality infrared image super-resolution.
We develop a multi-scale feature enhancement block (MFEB) that applies dual-branch dilated convolutions to extract local-contextual information and integrates Restormer layers (RTLs) to enhance global dependencies, thereby establishing a collaborative local-global optimization framework.
We conduct comprehensive evaluation of the proposed HPEN model across six datasets, demonstrating its ability to achieve an optimal trade-off between model complexity and reconstruction performance.
Related works
CNN-based SR methods
CNNs have established themselves as a cornerstone of modern computer vision, owing to their exceptional ability to learn hierarchical feature representations. This dominance naturally extends to the field of image SR. The pioneering SRCNN7 first introduces an end-to-end CNN framework for single image SR. Subsequently, VDSR10 advances training efficiency by incorporating residual learning27 and optimized convergence strategies. Later innovations, such as DRCN9 and DRRN28, further improve reconstruction quality through recursive structures and enhanced residual connections27. SRGAN12 revolutionizes perceptual quality by employing adversarial training to overcome the limitations of conventional loss functions. However, constrained by the inherently local receptive field of convolution operations, these methods struggle to capture long-range dependencies and global contextual information, leading to limited representational capacity. To address this issue, researchers have progressively increased network scale and complexity. For instance, EDSR29 significantly boosts performance by expanding to 32 layers and 43 million parameters, while RCAN16 constructs an architecture exceeding 400 layers through channel attention and residual connections. Nevertheless, this pursuit of performance through greater depth and width comes at a significant cost: the high computational complexity and memory footprint of these large models severely hinder their deployment in real-world application scenarios.
To address computational complexity, numerous lightweight CNN8,11,13,14,30 designs have been developed to balance performance and efficiency through various strategies. For instance, FSRCNN8 and ESPCN15 improve efficiency by employing post-upsampling operations. CARN31 achieves fast and accurate SR using group convolution and cascaded feature integration. The concept of information distillation, introduced by IDN32 through enhancement and compression units, is further advanced by IMDN11 with multi-distillation modules that optimize the memory-time trade-off. RFDN30 further refines this concept with enhanced feature distillation links, winning the NTIRE 2022 Efficient SR Challenge. BSRN13 reduces computational redundancy via blueprint separable convolution, while ShuffleMixer33 minimizes FLOPs by employing large kernels combined with channel operations. More recent approaches incorporate dynamic feature modulation for greater flexibility. For example, SAFMN14 implements spatially adaptive modulation within a ViT-like architecture for long-range dependency modeling, and SMFANet34 leverages dual-branch processing to handle both local and non-local features. Despite these advancements, such methods remain fundamentally constrained by the use of fixed convolutional kernels, leading to edge blurring and detail loss. To overcome these limitations, we propose a hybrid perception enhancement block (HPEB), which facilitates effective global modeling and hierarchical feature interactions, thereby significantly improving reconstruction quality while maintaining high computational efficiency.
ViT-based SR methods
ViTs17 have recently shown strong potential for image SR, primarily due to their ability to model long-range dependencies and capture global context information. The pioneering work IPT18 first adapts the transformer architecture for SR tasks, establishing new performance benchmarks. However, the standard ViT design incurs very high computational costs, mainly because its global SA mechanism22 scales quadratically with input size. To mitigate this, several efficient variants have been developed. SwinIR20 employs shifted window attention to limit computation to local windows while maintaining cross-window connection, greatly reducing complexity. HAT35 further improves this approach by integrating channel attention with overlapping cross-attention, achieving state-of-the-art results. Restormer21 proposes a complementary strategy using multi-head transposed attention to efficiently capture global context. CATANet25 improves parallel processing through subgroup token balancing and incorporates residual local relation self-attention to harmonize global and local representations under linear complexity, while HiT-SR26 employs hierarchical window expansion to capture multi-scale contexts and long-range dependencies for efficient SR reconstruction.
Despite these advancements, current ViT-based SR methods still struggle to optimally balance performance and efficiency. Most either sacrifice reconstruction quality to maintain reasonable computational costs, or achieve superior performance at the expense of prohibitive complexity. This fundamental trade-off highlights the need for a more balanced approach that can maintain high reconstruction quality while ensuring practical deployability.
Infrared image SR methods
Deep learning has significantly advanced infrared image processing by demonstrating powerful capabilities in automated feature extraction and complex nonlinear mapping. Infrared image SR, in particular, has emerged as a promising application. Numerous deep learning methods have been developed specifically for this domain: TherISuRNet36 utilizes progressive upscaling with asymmetric residual learning, while PSRGAN37 employs a dual-path architecture that leverages visible light imagery to compensate for limited IR training data. Marivani et al.38 incorporate sparse priors within a multi-modal framework, and MoCoPnet39 integrates domain knowledge to address the inherent feature scarcity in IR small target SR. For efficient deployment, LISN40 achieves lightweight reconstruction through feature correlation aggregation and channel operations, while LKFormer41 replaces self-attention with large-kernel depth-wise convolutions for non-local modeling and employs gated-pixel feed-forward networks to improve information flow. Additionally, RLKA-Net42 employs recurrent strategies and large kernel attention to achieve efficient infrared SR. LIRSRN43 incorporates an attention enhancement module with spatial-frequency processing. Despite these advancements, existing methods still face limitations in optimally balancing reconstruction performance with computational efficiency. Most approaches either prioritize accuracy at the cost of practical deployability, or achieve efficiency while compromising the restoration of fine details that are critical for infrared applications, highlighting the need for more balanced solutions in the infrared SR domain.
In recent years, general-purpose computer vision architectures have achieved remarkable progress in tasks such as detection and classification. Object detection frameworks such as SSD44, Faster R-CNN45, and YOLOv846 have achieved strong performance in real-time processing, small object detection, and complex scene understanding by integrating lightweight designs11,30, multi-scale feature fusion14,34, and attention mechanisms13,42,47. Similarly, modern convolutional architectures like ConvNeXt48 have attained near-state-of-the-art accuracy in classification tasks, particularly in medical imaging through deep integration with attention modules. Mask R-CNN49 has also achieved significant results in the field of instance segmentation. Beyond perceptual tasks, related research in image security has introduced methods such as a structured chromatic encryption algorithm50 that combines color mapping with mathematical transformations to protect sensitive medical multimedia data, and a reversible fragile watermarking scheme51 that embeds two bits per pixel to generate dual watermarked images, achieving high capacity and robustness while preserving visual transparency. These advances collectively reflect a common design principle: balancing performance and complexity through architectural refinement and efficient attention fusion. However, these studies primarily target high-level vision tasks such as detection and classification in the visible spectrum, and their design paradigms are not directly applicable to infrared image super-resolution, a representative low-level vision task. Infrared imagery exhibits unique characteristics, including thermal radiation properties, low contrast, and complex noise patterns, which demand specialized models capable of enhanced detail recovery and targeted degradation modeling.
Inspired by the “hybrid architecture” and “lightweight attention” principles in these advanced works, this paper focuses on addressing the specific challenges of infrared image SR. We propose the HPEN network, which incorporates a core hybrid perception enhancement block. By combining customized token aggregation with multi-scale convolutional enhancement, our approach achieves synergistic improvement of both global structures and local details in infrared images, establishing an improved balance between reconstruction accuracy and computational efficiency.
Proposed method
We propose a simple yet effective SR model that delivers accurate infrared image reconstruction by collaboratively exploiting local and non-local features. The core innovation is the hybrid perception enhancement block (HPEB), a unified architecture that systematically combines complementary feature processing mechanisms. The HPEB operates in three sequential stages: (1) a token aggregation block (TAB) to capture global contextual dependencies; (2) a multi-scale feature enhancement block (MFEB) to extract fine-grained local details; and (3) a 1×1 convolutional layer for feature refinement. This synergistic design allows the model to focus on information-rich regions while maintaining computational efficiency, thereby establishing a new balance between performance and complexity in infrared SR. The details of each component are elaborated below.
Figure 2 illustrates the overall architecture of the proposed HPEN network. It consists of three main components: shallow feature extraction, deep feature extraction, and an upsampler module. Specifically, a convolutional layer is first applied to the low-resolution input $I_{LR}$ to extract shallow features, denoted as $F_0$. This process can be formulated as:

$$F_0 = H_{SF}(I_{LR}), \tag{1}$$

where $H_{SF}(\cdot)$ denotes the shallow feature extraction operation. These shallow features are then fed into a stack of hybrid perception enhancement blocks (HPEBs). Each HPEB contains a token aggregation block (TAB), a multi-scale feature enhancement block (MFEB), and a 1×1 convolutional layer. This procedure can be expressed as:
$$F_D = H_{DF}(F_0), \tag{2}$$

where $H_{DF}(\cdot)$ represents the deep feature extraction process carried out by the stacked HPEBs, and $F_D$ refers to the resulting deep feature representation. To facilitate gradient flow and preserve low-frequency information, we incorporate global residual connections. This allows the network to concentrate on reconstructing high-frequency details. The upsampler module, which consists of a convolution followed by a sub-pixel convolution layer15, then efficiently reconstructs the high-resolution output. This process can be represented as:
$$I_{SR} = H_{UP}(F_D + F_0), \tag{3}$$

where $H_{UP}(\cdot)$ is the upsampler module that reconstructs the high-resolution image, yielding the final output $I_{SR}$. For the loss function, we adopt the $L_1$ loss, following previous works34,42 in SR tasks, which is defined as:

$$\mathcal{L} = \left\| I_{SR} - I_{HR} \right\|_1, \tag{4}$$

where $I_{HR}$ represents the ground-truth high-resolution image, and $\|\cdot\|_1$ denotes the $L_1$-norm operator.
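The overall flow above (shallow feature extraction, a deep feature stack with a global residual connection, and a sub-pixel upsampler trained with an $L_1$ loss) can be sketched in PyTorch. This is a minimal stand-in, not the released implementation: the HPEB stack is replaced by plain convolutional blocks, and the layer choices inside each block are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyHPEN(nn.Module):
    """Minimal sketch of the HPEN pipeline: shallow conv -> deep stack
    with global residual -> conv + pixel-shuffle upsampler. The real
    HPEBs are replaced here by plain conv blocks (an assumption)."""
    def __init__(self, channels=36, num_blocks=8, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(1, channels, 3, padding=1)   # H_SF
        self.deep = nn.Sequential(*[                          # stand-in for HPEBs
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.GELU())
            for _ in range(num_blocks)])
        self.upsampler = nn.Sequential(                       # H_UP: conv + sub-pixel conv
            nn.Conv2d(channels, scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, x):
        f0 = self.shallow(x)
        fd = self.deep(f0) + f0       # global residual connection
        return self.upsampler(fd)

lr = torch.randn(1, 1, 32, 32)        # toy single-channel LR input
hr = torch.randn(1, 1, 128, 128)      # toy ground-truth HR target
sr = TinyHPEN(scale=4)(lr)
loss = nn.L1Loss()(sr, hr)            # the L1 objective
```

The global residual means the deep stack only has to predict the high-frequency correction, which matches the low-frequency-preservation argument above.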
Fig. 2.
Overall architecture of the proposed HPEN. The network comprises three fundamental components: an initial convolutional layer for shallow feature extraction, a series of hybrid perception enhancement blocks (HPEBs) for deep feature extraction, and a final upsampling module for high-resolution reconstruction. Each HPEB represents the core building block and integrates three carefully designed sub-modules: a token aggregation block (TAB) for global dependency modeling, a multi-scale feature enhancement block (MFEB) for local detail extraction, and a 1×1 convolutional layer for feature refinement and integration.
Token aggregation block
In image SR tasks, conventional CNN-based methods14,29,30 are limited by the fixed receptive fields of their convolutional kernels, while transformer-based approaches18,20,35 often incur excessive computational costs and lack inherent spatial inductive bias despite their global modeling capacity. To address these issues, we introduce the token aggregation block (TAB) adapted from CATANet25, which efficiently captures feature interactions through content-aware dynamic reorganization and a hierarchical attention mechanism. As illustrated in Fig. 2(b), TAB consists of three core components: the content-aware token aggregation (CATA) module, intra-group self-attention (IASA), and inter-group cross-attention (IRCA). Given an input feature $X$, the CATA module dynamically clusters image patches using a set of learnable token centers. These centers are updated during training via an exponential moving average (EMA) strategy with a decay parameter $\alpha = 0.999$ to enable semantically coherent grouping. The CATA operation can be expressed as:

$$F_{c} = \mathrm{CATA}(\mathrm{LN}(X)), \tag{5}$$

where $\mathrm{LN}(\cdot)$ denotes the layer normalization operation, and $F_{c}$ represents the clustered features. The IASA module then performs fine-grained feature interactions within each token subgroup. Overlapped grouping is used to preserve local spatial continuity. This operation can be written as:
$$F_{a} = \mathrm{IASA}(F_{c}), \tag{6}$$

where $F_{a}$ corresponds to the output feature of the IASA module. Subsequently, the IRCA module facilitates global semantic propagation by enabling interaction across token groups and a global center token. The IRCA module is formulated as follows:
$$F_{r} = \mathrm{IRCA}(F_{a}), \tag{7}$$

where $F_{r}$ denotes the output feature of the IRCA module. Finally, a convolutional feed-forward network (ConvFFN) is applied to further integrate local features and channel information, thereby enhancing the model’s nonlinear representation capacity. This step is expressed as:
$$F_{m} = \mathrm{Conv}_{1\times1}(F_{r}), \qquad F_{TAB} = \mathrm{ConvFFN}(F_{m}), \tag{8}$$

where $\mathrm{Conv}_{1\times1}(\cdot)$ denotes the $1\times1$ convolution, $F_{m}$ represents the intermediate aggregated features, $\mathrm{ConvFFN}(\cdot)$ corresponds to the convolution-enhanced feed-forward network, and $F_{TAB}$ is the final output of the TAB module.
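To make the grouping idea concrete, here is a toy sketch of the CATA/IASA mechanism: each token is hard-assigned to its nearest learnable center, and self-attention is then restricted to tokens sharing a group. Overlapped grouping, EMA center updates, multi-head attention, IRCA, and the ConvFFN are all omitted; the function and variable names are illustrative, not CATANet's API.

```python
import torch
import torch.nn.functional as F

def token_group_attention(x, centers):
    """Toy group-restricted attention: assign each token to its nearest
    center, then mask attention so tokens only attend within their group
    (a simplified stand-in for CATA + IASA)."""
    # x: (N, C) tokens, centers: (K, C) learnable group centers
    assign = torch.cdist(x, centers).argmin(dim=1)        # nearest-center grouping
    attn = (x @ x.t()) / x.shape[1] ** 0.5                # scaled dot-product scores
    same_group = assign[:, None] == assign[None, :]       # intra-group mask
    attn = attn.masked_fill(~same_group, float('-inf'))   # block cross-group attention
    return F.softmax(attn, dim=-1) @ x                    # group-local aggregation

tokens = torch.randn(64, 36)    # 64 toy tokens, 36 channels
centers = torch.randn(4, 36)    # 4 hypothetical token centers
out = token_group_attention(tokens, centers)
```

Because attention is confined to groups of similar tokens, the cost scales with group size rather than with the full token count, which is the efficiency argument behind the TAB.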
Multi-scale feature enhancement block
While the TAB focuses on modeling global dependencies through dynamic token aggregation, high-quality image reconstruction also critically depends on fine local details. To this end, we propose a multi-scale feature enhancement block (MFEB) that explicitly enhances local feature extraction while integrating global contextual information, forming a collaborative local-global optimization framework. As shown in Fig. 2(c), given an input feature $X_{in}$, the MFEB first splits it into two parallel branches, formulated as:

$$[X_{1}, X_{2}] = \mathrm{Split}(X_{in}), \tag{9}$$

where $\mathrm{Split}(\cdot)$ denotes the channel splitting operation. One branch then applies a convolution to capture fine-grained textures, while the other utilizes a dilated convolution (dilation rate = 2) to model broader contextual information. This process can be formulated as:
$$X_{1}' = \mathrm{GELU}(\mathrm{DWConv}(X_{1})), \qquad X_{2}' = \mathrm{GELU}(\mathrm{DConv}(X_{2})), \tag{10}$$

where $\mathrm{DConv}(\cdot)$ represents a dilated convolution with a dilation rate of 2, $\mathrm{DWConv}(\cdot)$ indicates a depth-wise convolution, and $\mathrm{GELU}(\cdot)$ refers to the GELU activation function. The outputs of the two branches are then dynamically fused via a $1\times1$ convolution. The fusion process can be expressed as:
$$X_{f} = \mathrm{Conv}_{1\times1}(\mathrm{Concat}(X_{1}', X_{2}')) + X_{in}, \tag{11}$$

where $\mathrm{Concat}(\cdot)$ denotes the channel concatenation operation. To facilitate gradient flow across network levels, we incorporate residual connections. Finally, two successive restormer layers (RTLs)21 are applied to further strengthen global feature dependencies. This process can be written as:

$$F_{MFEB} = \mathrm{RTL}(\mathrm{RTL}(X_{f})), \tag{12}$$

where $\mathrm{RTL}(\cdot)$ represents the restormer layer transformation.
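The split / dual-branch / fuse pattern above can be sketched as a small PyTorch module. The kernel sizes, depth-wise grouping, and residual placement here are assumptions for illustration, and the two trailing RTLs are omitted.

```python
import torch
import torch.nn as nn

class MFEBSketch(nn.Module):
    """Illustrative MFEB: channel split, a depth-wise conv branch for fine
    detail (dilation 1), a dilation-2 depth-wise conv branch for broader
    context, 1x1 fusion, and a residual connection. RTLs are omitted and
    the exact layer choices are assumptions, not the authors' code."""
    def __init__(self, channels=36):
        super().__init__()
        half = channels // 2
        self.fine = nn.Conv2d(half, half, 3, padding=1, groups=half)             # dilation 1
        self.context = nn.Conv2d(half, half, 3, padding=2, dilation=2, groups=half)
        self.fuse = nn.Conv2d(channels, channels, 1)                             # 1x1 fusion
        self.act = nn.GELU()

    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)                     # channel split
        y = torch.cat([self.act(self.fine(a)),
                       self.act(self.context(b))], dim=1)   # multi-scale features
        return self.fuse(y) + x                             # residual connection

out = MFEBSketch(channels=36)(torch.randn(1, 36, 24, 24))
```

With `padding = dilation`, both 3×3 branches preserve the spatial size, so the two scales can be concatenated directly before fusion.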
Restormer layer
To enhance long-range dependency modeling and feature representation, we incorporate the RTL21. This module combines multi-dconv head transposed attention (MDTA) and gated-dconv feed-forward network (GDFN) to improve global context modeling and nonlinear transformation of fused features. As illustrated in Fig. 2(d), given an input $X$, the MDTA module aggregates both local and non-local pixel interactions, formulated as:

$$X_{M} = \mathrm{MDTA}(\mathrm{LN}(X)) + X, \tag{13}$$

where $X_{M}$ denotes the output of the MDTA module. The GDFN then dynamically modulates feature transformation by suppressing less informative components and selectively propagating relevant information through the network. The operation can be formulated as:
$$X_{G} = \mathrm{GDFN}(\mathrm{LN}(X_{M})) + X_{M}, \tag{14}$$

where $X_{G}$ represents the output feature of the GDFN module, and $\mathrm{LN}(\cdot)$ is defined as before.
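The distinguishing idea of MDTA is that attention is computed across channels rather than spatial positions, so the attention map is C×C instead of (HW)×(HW) and the cost stays linear in the number of pixels. A stripped-down sketch (single head, no learned projections or depth-wise convs, GDFN omitted; all names illustrative):

```python
import torch
import torch.nn.functional as F

def transposed_attention(x):
    """Toy channel-wise (transposed) attention in the spirit of MDTA:
    each channel is treated as one token of length H*W, and attention
    is computed over the C x C channel-similarity map."""
    b, c, h, w = x.shape
    q = k = v = x.reshape(b, c, h * w)                 # (B, C, HW): one token per channel
    q = F.normalize(q, dim=-1)                         # cosine-style similarity
    k = F.normalize(k, dim=-1)
    attn = F.softmax(q @ k.transpose(1, 2), dim=-1)    # (B, C, C) channel attention map
    return (attn @ v).reshape(b, c, h, w)              # re-mix channels globally

out = transposed_attention(torch.randn(1, 36, 16, 16))
```

Since every channel token already spans the whole image, this re-mixing propagates global context without a quadratic spatial attention map.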
Hybrid perception enhancement block
We construct the HPEB by integrating the TAB, MFEB, and a $1\times1$ convolutional layer in sequence. This three-stage cascaded design enables progressive feature refinement. Specifically, the TAB first performs content-aware global feature reorganization via dynamic token aggregation. Given an input $F_{in}$, this step is expressed as:

$$F_{T} = \mathrm{TAB}(F_{in}), \tag{15}$$

where $F_{T}$ represents the output of the TAB module. The MFEB module then enhances local-contextual representations using multi-scale convolutions (dilation rates of 1 and 2), formulated as:
$$F_{M} = \mathrm{MFEB}(F_{T}), \tag{16}$$

where $F_{M}$ represents the output of the MFEB module. Finally, a $1\times1$ convolution preserves multi-granularity details, preventing high-frequency information from being over-smoothed by preceding transformer operations, expressed as:
$$F_{out} = \mathrm{Conv}_{1\times1}(F_{M}), \tag{17}$$

where $\mathrm{Conv}_{1\times1}(\cdot)$ denotes a $1\times1$ convolution operation, and $F_{out}$ represents the final output of the HPEB.
Experimental results
Datasets and implementation details
Datasets. We use the IR700 dataset52 as the source of high-resolution (HR) images. This dataset comprises 700 infrared images of a fixed resolution, encompassing diverse scenarios including urban landscapes, vegetation, pedestrians, vehicles, and low-visibility conditions at night. Its authentic noise distribution and atmospheric degradation characteristics help mitigate the domain shift problem common in synthetically generated data. The dataset is partitioned into training, validation, and test sets in an 8:1:1 ratio, resulting in 560, 70, and 70 images, respectively. Low-resolution (LR) images are synthesized by applying bicubic downsampling to the HR images. For evaluation, we utilize six test datasets: results-A53, results-C54, CVC09-10053, IR700-test52, M3FD55, and my_test. We quantitatively evaluate the reconstruction quality using two standard metrics: peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM), both computed on the luminance (Y) channel.
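The LR synthesis step can be sketched as follows, using torch's bicubic interpolation as a stand-in for the paper's exact degradation; the tensor sizes are illustrative.

```python
import torch
import torch.nn.functional as F

# Bicubic downsampling of a toy HR infrared frame to produce the LR input
# (here x4; the same call with scale_factor=0.5 would give the x2 setting).
hr = torch.rand(1, 1, 128, 128)                  # toy single-channel HR image
lr = F.interpolate(hr, scale_factor=0.25,
                   mode='bicubic', align_corners=False)
```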
Implementation details. The model is trained with the following configuration. LR input patches are randomly cropped and augmented via random horizontal flipping and rotation. We train the network for 100,000 iterations with a batch size of 16, using the Adam optimizer56 and a MultiStepLR57 learning rate scheduler, under which the learning rate decays in stages from its initial value to a minimum. The network employs 8 HPEBs with 36 feature channels. All experiments are conducted on a Linux server using PyTorch 1.13.0 and CUDA 11.8, with an NVIDIA GeForce RTX 3090 GPU (24GB VRAM), an Intel Xeon E5-2620 v4 CPU, and 32GB of system memory.
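The optimizer configuration can be sketched as below. The learning-rate value, milestones, and decay factor are placeholders (the text specifies Adam and MultiStepLR but the exact values are not reproduced here), and the one-layer model is a stand-in.

```python
import torch

# Sketch of the training setup: Adam + MultiStepLR, batch size 16.
# lr, milestones, and gamma below are illustrative placeholders.
model = torch.nn.Conv2d(1, 1, 3, padding=1)       # stand-in for HPEN
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50_000, 80_000], gamma=0.5)

for step in range(3):                             # trimmed training loop
    optimizer.zero_grad()
    loss = model(torch.randn(16, 1, 64, 64)).abs().mean()
    loss.backward()
    optimizer.step()
    scheduler.step()                              # stepped once per iteration
```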
Comparisons with state-of-the-art methods
Quantitative comparison. The proposed method is compared with state-of-the-art lightweight SR approaches, including CARN31, IMDN11, RFDN30, BSRN13, SMSR58, ShuffleMixer33, HNCT59, PSRGAN37, SAFMN14, RLKA-Net42, SMFANet34, SCNet60, LIRSRN43, CATANet25, and HiT-SR26. Quantitative results for ×2 and ×4 magnification factors are summarized in Table 2. In addition to the standard PSNR and SSIM metrics, we also report the number of parameters (#Params) and computational complexity (#FLOPs). The FLOPs are uniformly measured using the fvcore library (i.e., fvcore.nn.flop_count_str) at a fixed high-resolution output size.
Table 2.
A comprehensive comparison with lightweight SR methods on six test datasets is presented. All reported PSNR and SSIM values are calculated on the luminance (Y) channel in the transformed color space. Computational complexity (#FLOPs) is evaluated using high-resolution images of a fixed size. The top two results are highlighted in italic and bold italic font, respectively, for clear distinction.
| Scale | Model | Params (K) | FLOPs (G) | resultsA | resultsC | CVC-09 | IR700-test | M3FD | my_test |
|---|---|---|---|---|---|---|---|---|---|
| ×2 | CARN31 | 1592 | 297 | 39.75/0.9489 | 40.75/0.9589 | 44.83/0.9737 | 38.14/0.9435 | 40.20/0.9676 | 37.08/0.9157 |
| | IMDN11 | 715 | 218 | 39.77/0.9491 | 40.76/0.9589 | 44.87/0.9740 | 38.19/0.9438 | 40.28/0.9677 | 37.09/0.9159 |
| | RFDN30 | 417 | 122 | 39.76/0.9491 | 40.77/0.9590 | 44.88/0.9739 | 38.18/0.9437 | 40.13/0.9670 | 37.11/0.9161 |
| | BSRN13 | 332 | 96 | 39.77/0.9495 | 40.78/0.9593 | 44.85/0.9739 | 38.22/0.9440 | 40.24/0.9673 | 37.10/0.9160 |
| | SMSR58 | 987 | 468 | 39.76/0.9491 | 40.76/0.9590 | 44.84/0.9737 | 38.14/0.9435 | 40.20/0.9669 | 37.08/0.9158 |
| | ShuffleMixer_tiny33 | 247 | 76 | 39.74/0.9490 | 40.76/0.9589 | 44.85/0.9738 | 38.17/0.9437 | 40.16/0.9673 | 37.08/0.9158 |
| | HNCT59 | 357 | 111 | 39.78/0.9492 | 40.78/0.9591 | 44.88/0.9739 | 38.21/0.9440 | 40.28/0.9679 | 37.11/0.9159 |
| | PSRGAN37 | 313 | - | 39.76/0.9490 | 40.71/0.9589 | 44.79/0.9731 | 38.16/0.9434 | 38.24/0.9524 | 37.09/0.9158 |
| | SAFMN14 | 228 | 69 | 39.75/0.9491 | 40.64/0.9589 | 44.86/0.9739 | 38.13/0.9435 | 40.29/0.9679 | 37.08/0.9158 |
| | RLKA-Net42 | 225 | 253 | 39.79/0.9493 | 40.78/0.9590 | 44.83/0.9737 | 38.19/0.9438 | 40.05/0.9657 | 37.09/0.9158 |
| | SMFANet34 | 186 | 52 | 39.77/0.9493 | 40.78/0.9592 | 44.85/0.9738 | 38.11/0.9434 | 40.28/0.9677 | 37.06/0.9157 |
| | SCNet60 | 146 | 56 | 39.74/0.9489 | 40.74/0.9588 | 44.82/0.9737 | 38.08/0.9432 | 40.29/0.9679 | 37.06/0.9157 |
| | LIRSRN43 | 49 | 14 | 39.40/0.9475 | 40.11/0.9573 | 43.98/0.9730 | 37.03/0.9364 | 39.46/0.9658 | 36.27/0.9087 |
| | CATANet25 | 477 | 197 | 39.80/0.9493 | 40.79/0.9590 | 44.89/0.9738 | 38.26/0.9441 | 40.34/0.9684 | 37.09/0.9159 |
| | HiT-SR26 | 847 | 299 | 39.81/0.9496 | 40.81/0.9592 | 44.90/0.9741 | 38.27/0.9440 | 40.36/0.9683 | 37.12/0.9160 |
| | HPEN (Ours) | 434 | 173 | 39.83/0.9497 | 40.82/0.9594 | 44.92/0.9740 | 38.28/0.9442 | 40.37/0.9684 | 37.13/0.9162 |
| ×4 | CARN31 | 1592 | 121 | 34.77/0.8615 | 35.43/0.8783 | 40.54/0.9513 | 31.87/0.8447 | 32.93/0.8666 | 34.22/0.8752 |
| | IMDN11 | 715 | 55 | 34.83/0.8626 | 35.51/0.8791 | 40.73/0.9521 | 31.93/0.8450 | 33.07/0.8679 | 34.27/0.8757 |
| | RFDN30 | 433 | 32 | 34.86/0.8630 | 35.50/0.8793 | 40.75/0.9523 | 32.01/0.8462 | 33.09/0.8679 | 34.26/0.8757 |
| | BSRN13 | 352 | 26 | 34.90/0.8636 | 35.56/0.8799 | 40.77/0.9523 | 32.07/0.8472 | 33.14/0.8702 | 34.29/0.8761 |
| | SMSR58 | 1008 | 119 | 34.88/0.8631 | 35.52/0.8792 | 40.74/0.9521 | 32.08/0.8465 | 33.09/0.8700 | 34.26/0.8754 |
| | ShuffleMixer_tiny33 | 251 | 21 | 34.79/0.8621 | 35.49/0.8788 | 40.60/0.9516 | 31.92/0.8447 | 32.99/0.8673 | 34.23/0.8753 |
| | HNCT59 | 373 | 29 | 34.92/0.8638 | 35.58/0.8802 | 40.84/0.9527 | 32.12/0.8480 | 33.15/0.8699 | 34.30/0.8761 |
| | PSRGAN37 | 350 | - | 34.81/0.8616 | 35.47/0.8790 | 40.67/0.9516 | 32.04/0.8459 | 31.70/0.8379 | 34.25/0.8754 |
| | SAFMN14 | 240 | 18 | 34.86/0.8628 | 35.51/0.8791 | 40.72/0.9521 | 32.08/0.8464 | 33.10/0.8690 | 34.28/0.8758 |
| | RLKA-Net42 | 245 | 65 | 34.89/0.8637 | 35.51/0.8799 | 40.79/0.9524 | 32.11/0.8474 | 33.14/0.8691 | 34.28/0.8758 |
| | SMFANet34 | 197 | 14 | 34.86/0.8630 | 35.51/0.8791 | 40.75/0.9523 | 32.10/0.8469 | 33.16/0.8699 | 34.29/0.8760 |
| | SCNet60 | 154 | 28 | 34.77/0.8609 | 35.44/0.8778 | 40.54/0.9512 | 32.04/0.8455 | 33.12/0.8692 | 34.25/0.8753 |
| | LIRSRN43 | 70 | 5 | 34.20/0.8574 | 34.73/0.8734 | 39.01/0.9454 | 31.18/0.8267 | 32.41/0.8592 | 33.35/0.8706 |
| | CATANet25 | 535 | 64 | 34.91/0.8637 | 35.55/0.8800 | 40.83/0.9525 | 32.13/0.8483 | 33.20/0.8711 | 34.33/0.8765 |
| | HiT-SR26 | 866 | 77 | 34.91/0.8638 | 35.56/0.8801 | 40.86/0.9527 | 32.14/0.8485 | 33.22/0.8711 | 34.30/0.8763 |
| | HPEN (Ours) | 445 | 44 | 34.93/0.8640 | 35.58/0.8803 | 40.88/0.9528 | 32.16/0.8487 | 33.25/0.8713 | 34.34/0.8768 |
We first conduct a systematic evaluation of the model’s channel dimension (“Dim”) and the number of HPEB modules (“Blocks”) to assess how different configurations affect performance, resource consumption, and efficiency. As shown in Table 1, the configuration with Dim=36 and Blocks=8 is selected as the baseline model, as it offers a favorable trade-off between efficiency and accuracy. Compared to a model of the same depth but wider channels (Dim=48, Blocks=8), this baseline reduces parameters and FLOPs by approximately 38.0% and 38.6%, respectively, while decreasing GPU memory usage by about 13.8%. When compared to the largest high-performance configuration (Dim=48, Blocks=12), the baseline achieves substantial savings of roughly 58.3% in parameters and 58.8% in FLOPs, while incurring an average performance drop of less than 0.5% across the three datasets. This configuration therefore balances resource efficiency and reconstruction accuracy effectively, justifying its selection as the baseline model.
Table 1.
Performance comparison of the proposed model across various channel configurations and module counts, with the baseline HPEN configuration highlighted in bold black.
| Ablation | Dim | Blocks | #Params [K] | #FLOPs [G] | #GPU Mem. [M] | #Avg.Time [ms] | resultsA | resultsC | M3FD |
|---|---|---|---|---|---|---|---|---|---|
| HPEN | 36 | 8 | 445 | 44.1 | 115.98 | 44.63 | 34.93/0.8640 | 35.58/0.8803 | 33.25/0.8713 |
| | 36 | 10 | 553 | 54.8 | 117.79 | 47.93 | 34.98/0.8641 | 35.60/0.8804 | 33.27/0.8714 |
| | 36 | 12 | 660 | 65.5 | 120.21 | 50.04 | 35.02/0.8644 | 35.63/0.8806 | 33.29/0.8713 |
| | 48 | 8 | 718 | 71.8 | 134.62 | 48.96 | 35.00/0.8645 | 35.65/0.8807 | 33.31/0.8716 |
| | 48 | 10 | 892 | 89.4 | 138.04 | 54.32 | 35.08/0.8649 | 35.67/0.8810 | 33.35/0.8718 |
| | 48 | 12 | 1066 | 107.0 | 141.73 | 59.06 | 35.09/0.8648 | 35.70/0.8811 | 33.34/0.8717 |
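The resource savings quoted for the Dim=36/Blocks=8 baseline follow directly from the Table 1 entries. As a quick sanity check (values copied from the table above; this is an illustrative verification, not part of the evaluation pipeline):

```python
# Reproduce the baseline's resource savings from the Table 1 entries.
def reduction(larger, smaller):
    """Percentage reduction going from `larger` to `smaller`."""
    return round(100 * (larger - smaller) / larger, 1)

base = {"params": 445, "flops": 44.1, "mem": 115.98}    # Dim=36, Blocks=8
wide = {"params": 718, "flops": 71.8, "mem": 134.62}    # Dim=48, Blocks=8
big  = {"params": 1066, "flops": 107.0, "mem": 141.73}  # Dim=48, Blocks=12

print(reduction(wide["params"], base["params"]))  # 38.0 % fewer parameters
print(reduction(wide["flops"], base["flops"]))    # 38.6 % fewer FLOPs
print(reduction(wide["mem"], base["mem"]))        # 13.8 % less GPU memory
print(reduction(big["params"], base["params"]))   # 58.3 % vs Dim=48, Blocks=12
print(reduction(big["flops"], base["flops"]))     # 58.8 %
```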
As shown in Table 2, the proposed HPEN exhibits competitive performance in both reconstruction quality and computational efficiency. For the SR task, HPEN reaches a PSNR of 38.28dB on the IR700-test dataset and 39.83dB on the resultsA dataset, outperforming the strong baseline HiT-SR26 by 0.03% and 0.05%, respectively. Its advantage is more pronounced in the more challenging task, where HPEN attains 40.88dB on the CVC-09 dataset and 34.34dB on the my_test dataset, outperforming HiT-SR26 and CATANet25 by 0.05% and 0.03%, respectively. Notably, these results are achieved at significantly lower computational cost: HPEN requires only 44G FLOPs, compared with 77G for HiT-SR26 and 64G for CATANet25, while still achieving a 0.06% higher PSNR on the IR700-test dataset. Although LIRSRN43 maintains an extremely low parameter count and computational complexity (70K parameters and 5G FLOPs), it shows an average degradation of 1.04dB in PSNR and 0.0102 in SSIM relative to HPEN, with the PSNR reduction reaching 4.6% on the CVC-09 dataset. Table 2 also shows that, excluding LIRSRN43, PSNR variations among the evaluated methods are confined within 0.35dB, with over 75% of the differences below 0.10dB, which makes the 1.04dB gap particularly notable.
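The LIRSRN-vs-HPEN gap figures above can be reproduced from the Table 2 rows. The following plain-Python check (PSNR/SSIM pairs copied from the table, ordered resultsA, resultsC, CVC-09, IR700-test, M3FD, my_test) is a verification sketch only:

```python
# Average PSNR/SSIM gap between HPEN and LIRSRN across the six test sets.
hpen   = [(34.93, 0.8640), (35.58, 0.8803), (40.88, 0.9528),
          (32.16, 0.8487), (33.25, 0.8713), (34.34, 0.8768)]
lirsrn = [(34.20, 0.8574), (34.73, 0.8734), (39.01, 0.9454),
          (31.18, 0.8267), (32.41, 0.8592), (33.35, 0.8706)]

psnr_gap = sum(h[0] - l[0] for h, l in zip(hpen, lirsrn)) / len(hpen)
ssim_gap = sum(h[1] - l[1] for h, l in zip(hpen, lirsrn)) / len(hpen)
print(round(psnr_gap, 2))  # 1.04 dB average PSNR gap
print(round(ssim_gap, 4))  # 0.0102 average SSIM gap

# Relative PSNR drop on CVC-09 (third entry):
print(round(100 * (40.88 - 39.01) / 40.88, 1))  # 4.6 %
```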
Additionally, we compare the GPU memory usage and inference time of the proposed method against mainstream methods, including RFDN30, ShuffleMixer33, RLKA-Net42, HNCT59, CATANet25 and HiT-SR26. As illustrated in Table 3, the proposed HPEN is highly memory-efficient, requiring only 115.98M: a reduction of approximately 90.58% and 83.54% relative to the most memory-intensive models, HiT-SR26 (1231.27M) and CATANet25 (704.38M), respectively. In terms of inference time, HPEN takes only 44.63ms, about 59.97% faster than CATANet25 (114.49ms) and 55.84% faster than HiT-SR26 (101.06ms). Although it ranks as the third-fastest method behind RFDN30 and ShuffleMixer33, its inference time remains highly competitive. Together with its leading PSNR/SSIM performance (Table 2), these results demonstrate that HPEN achieves a more favorable performance-efficiency trade-off than existing approaches. A visual summary of GPU memory and inference time across all methods is provided in Figure 1.
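The relative efficiency figures above, including the 9.4% memory and 42.9% FLOPs claims in the abstract, reduce to simple ratios over the reported measurements. A small sanity-check script (values copied from Tables 2 and 3; timing ratios naturally depend on the measurement setup):

```python
# Cross-check the efficiency ratios reported in the text.
hpen_mem, hit_mem, cata_mem = 115.98, 1231.27, 704.38  # GPU memory [M]
hpen_time, hit_time = 44.63, 101.06                    # inference time [ms]
hpen_flops, hit_flops = 44, 77                         # FLOPs [G]

mem_saving_vs_hit = round(100 * (1 - hpen_mem / hit_mem), 2)    # 90.58 %
mem_saving_vs_cata = round(100 * (1 - hpen_mem / cata_mem), 1)  # about 83.5 %
speedup_vs_hit = round(100 * (1 - hpen_time / hit_time), 2)     # 55.84 %
mem_fraction = round(100 * hpen_mem / hit_mem, 1)               # 9.4 % (abstract)
flops_saving = round(100 * (1 - hpen_flops / hit_flops), 1)     # 42.9 % (abstract)
print(mem_saving_vs_hit, mem_saving_vs_cata, speedup_vs_hit,
      mem_fraction, flops_saving)
```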
Table 3.
A comparison of computational efficiency for different SR methods, reporting average inference time (#Avg.Time) and GPU memory consumption (#GPU Mem.) across 50 test images at a fixed pixel resolution. All measurements were conducted under consistent hardware and software configurations to ensure fairness.
To further assess the perceptual quality of the reconstructed images, we evaluate all approaches using the learned perceptual image patch similarity (LPIPS)61 metric under a unified, pre-trained AlexNet62 backbone to ensure a fair comparison. As reported in Table 4, the proposed HPEN achieves the lowest LPIPS scores across all test sets, indicating superior perceptual fidelity. Notably, on the CVC-09 dataset, HPEN attains a score of 0.0296, significantly outperforming the other lightweight approaches. The stacked perceptual-loss comparison in Fig. 3 further confirms that HPEN consistently yields the smallest perceptual deviation. These results demonstrate that HPEN better preserves perceptually important details in infrared image SR, generating outputs that are perceptually closer to the original HR images.
Table 4.
A comparison of LPIPS values across different methods, computed with the same pre-trained network on the SR results.
| Model | resultsA | resultsC | CVC-09 | IR700-test | M3FD | my_test |
|---|---|---|---|---|---|---|
| CARN31 | 0.1252 | 0.1187 | 0.0613 | 0.1764 | 0.1509 | 0.1304 |
| ShuffleMixer_tiny33 | 0.1142 | 0.1114 | 0.0486 | 0.1625 | 0.1420 | 0.1246 |
| SAFMN14 | 0.1114 | 0.1109 | 0.0401 | 0.1594 | 0.1389 | 0.1214 |
| LIRSRN43 | 0.1235 | 0.1242 | 0.0406 | 0.1691 | 0.1482 | 0.1294 |
| HiT-SR26 | 0.1016 | 0.1040 | 0.0299 | 0.1552 | 0.1338 | 0.1176 |
| HPEN(Ours) | 0.1015 | 0.1036 | 0.0296 | 0.1547 | 0.1331 | 0.1168 |
Fig. 3.
Comprehensive comparison of LPIPS values between the proposed method and other lightweight approaches across all test datasets.
Qualitative comparisons. Figure 4 presents a visual comparison of SR results across multiple lightweight methods. The proposed HPEN exhibits excellent performance in both detail restoration and structural preservation. Notably, it effectively mitigates the over-smoothing artifacts commonly observed in the outputs of IMDN11, BSRN13, ShuffleMixer33, and SAFMN14, preserving sharper edge contours and more complex textures. Furthermore, HPEN successfully recovers structural details that are inadequately reconstructed by HNCT59 and SMFANet34, which is particularly evident in the restoration of stripe patterns.
Fig. 4.
Visual comparisons of different methods on image001 from the resultsA dataset. The proposed HPEN reconstructs sharper edges and more authentic textures than the other approaches.
LAM comparisons. Figure 5 compares the local attribution maps (LAMs)63 and corresponding diffusion indices (DIs)63 for image014 from the resultsC dataset. The proposed HPEN shows notably broader activation coverage and achieves a significantly higher DI of 3.25. Its extensive high-response regions (shown in red) visually reflect a stronger ability to integrate contextual information. This contrasts sharply with the more limited attention range of HiT-SR26 (DI=1.35) and with the restricted information utilization of the other visualized models, all of which attain DI values no greater than 1.00.
Fig. 5.
A comparison of the local attribution maps (LAMs)63 and corresponding diffusion indices (DIs)63 across different methods. The LAM visualizes each pixel’s contribution from the LR input to the final SR output within the marked region, while the DI quantifies the spatial dispersion of influential pixels, with higher values indicating a more widespread attention distribution. Together, the LAM visualizations and DI results substantiate that the proposed HPEN effectively utilizes extensive contextual information during the reconstruction process.
Ablation study
To validate the architectural design of our model, we conduct systematic ablation studies by progressively removing its key components. This allows us to assess the necessity of each part and quantify its contribution to overall performance. The quantitative results on six benchmark datasets, presented in Table 5, clearly illustrate the role each module plays in achieving the reported performance gains.
Table 5.
Comparison of SR performance results for HPEN and its architecture variants across the six test datasets. The FLOPs are measured on input images at a fixed pixel resolution. “A → B” stands for replacing A with B, and “A → None” represents removing operation A; “STL” denotes the Swin transformer layer. Variant rows report the change relative to the baseline (negative PSNR/SSIM values denote decreases; “=” indicates no change). The baseline HPEN configuration is highlighted in bold.
| Ablation | Variant | Params [K] | FLOPs [G] | resultsA | resultsC | CVC-09 | IR700-test | M3FD | my_test |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | **HPEN** | 445 | 44 | 34.93/0.8640 | 35.58/0.8803 | 40.88/0.9528 | 32.16/0.8487 | 33.25/0.8713 | 34.34/0.8768 |
| HPEB | TAB → None | 295 | 32 | -0.04/-0.0005 | -0.02/-0.0004 | -0.09/-0.0005 | -0.02/-0.0009 | -0.09/-0.0007 | -0.05/-0.0007 |
| | MFEB → None | 260 | 20 | -0.04/-0.0006 | -0.03/-0.0006 | -0.08/-0.0004 | -0.03/-0.0010 | -0.06/-0.0004 | -0.02/-0.0004 |
| | Conv → None | 352 | 37 | -0.03/-0.0004 | -0.02/-0.0004 | -0.04/-0.0002 | -0.01/-0.0001 | -0.05/-0.0003 | -0.03/-0.0003 |
| | HPEB → MFEB then TAB | 445 | 44 | -0.03/-0.0004 | -0.02/-0.0004 | -0.08/-0.0002 | -0.01/= | -0.03/-0.0002 | -0.02/-0.0004 |
| | HPEB → TAB then TAB | 410 | 33 | -0.02/-0.0005 | -0.11/-0.0005 | -0.06/-0.0002 | -0.01/-0.0006 | -0.05/-0.0003 | -0.01/-0.0004 |
| | HPEB → MFEB then MFEB | 481 | 56 | -0.04/-0.0006 | -0.03/-0.0006 | -0.10/-0.0004 | -0.04/-0.0012 | -0.08/-0.0005 | -0.04/-0.0006 |
| TAB | LN → None | 445 | 44 | -0.15/-0.0019 | -0.13/-0.0021 | -0.31/-0.0024 | -0.08/-0.0018 | -0.12/-0.0009 | -0.14/-0.0014 |
| | ConvFFN → Channel MLP | 421 | 37 | -0.03/-0.0005 | -0.04/-0.0006 | -0.07/= | -0.04/-0.0005 | -0.03/-0.0001 | -0.02/-0.0004 |
| | ConvFFN → FFN | 437 | 40 | -0.01/-0.0003 | -0.02/-0.0003 | -0.05/-0.0003 | -0.03/-0.0002 | -0.02/= | -0.02/-0.0003 |
| MFEB | DWConv → None | 445 | 44 | -0.05/-0.0007 | -0.02/-0.0005 | -0.10/-0.0004 | -0.01/-0.0009 | -0.05/-0.0003 | -0.03/-0.0005 |
| | DilaConv → None | 445 | 44 | -0.03/-0.0005 | -0.02/-0.0005 | -0.09/-0.0004 | -0.01/-0.0009 | -0.06/-0.0004 | -0.02/-0.0006 |
| | DWConv then Conv | 495 | 51 | -0.05/-0.0008 | -0.03/-0.0007 | -0.12/-0.0005 | -0.03/-0.0011 | -0.07/-0.0003 | -0.03/-0.0006 |
| | DWConv then DWConv | 401 | 41 | -0.04/-0.0008 | -0.03/-0.0008 | -0.10/-0.0005 | -0.02/-0.0013 | -0.04/-0.0004 | -0.03/-0.0007 |
| | DWConv then DilaConv | 423 | 42 | -0.06/-0.0007 | -0.03/-0.0008 | -0.10/-0.0004 | -0.04/-0.0012 | -0.06/-0.0004 | -0.03/-0.0005 |
| | DilaConv then DWConv | 423 | 42 | -0.04/-0.0008 | -0.03/-0.0008 | -0.10/-0.0005 | -0.02/-0.0013 | -0.04/-0.0007 | -0.03/-0.0007 |
| | RTL → None | 318 | 25 | -0.03/-0.0005 | -0.03/-0.0005 | -0.10/-0.0004 | -0.03/-0.0011 | -0.09/-0.0006 | -0.03/-0.0004 |
| | RTL → STL | 564 | 52 | -0.02/-0.0002 | -0.01/-0.0001 | -0.06/-0.0003 | -0.02/-0.0006 | -0.03/-0.0001 | -0.02/-0.0001 |
Effectiveness of hybrid perception enhancement block. To validate the effectiveness of and synergy among the components of the HPEB module, we conduct comprehensive ablation studies. As summarized in Table 5, the full HPEN delivers the best performance in the SR task, attaining a PSNR of 32.16dB on the IR700-test dataset. Removing the TAB module (“TAB → None”) reduces the parameter count by 33.71% (to 295K) but leads to notable performance degradation across all datasets; for instance, PSNR drops by 0.09dB (0.22%) on the CVC-09 dataset. Ablating either of the remaining components (“MFEB → None” and “Conv → None”) also harms performance; removing the MFEB module alone causes an average PSNR decrease of up to 0.04dB.
Notably, even when the overall model capacity remains unchanged, altering the processing order to MFEB followed by TAB (“HPEB → MFEB then TAB”) substantially degrades performance: as shown in Table 5, this change reduces PSNR by 0.08dB (0.20%) on the CVC-09 dataset. Replacing HPEB with a dual-TAB structure (“HPEB → TAB then TAB”) saves 7.19% of the parameters, but at the cost of a 0.11dB (0.31%) lower PSNR on the resultsC dataset, underscoring the essential role of the MFEB’s multi-scale feature extraction in preserving infrared image structures. Conversely, a dual-MFEB configuration (“HPEB → MFEB then MFEB”) increases parameters by 8.09% without improving any metric; instead, it reduces PSNR on the CVC-09 dataset by 0.10dB. Together, the results in Table 5 validate the critical importance of the complementary TAB and MFEB design in balancing model efficiency and reconstruction performance for infrared image SR.
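Two of the parameter-change percentages cited in this paragraph follow directly from the Params column of Table 5 (values copied from the table; a sanity check only):

```python
# Parameter-change percentages for two HPEB ablation variants (Table 5).
base_params = 445  # K, full HPEN

tab_removed = round(100 * (base_params - 295) / base_params, 2)  # TAB -> None
dual_mfeb   = round(100 * (481 - base_params) / base_params, 2)  # MFEB then MFEB
print(tab_removed)  # 33.71 % parameters saved
print(dual_mfeb)    # 8.09 % parameters added
```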
Token aggregation block. As shown in Table 5, the ablation study confirms the clear superiority of the complete HPEN (Baseline) over all architectural variants. For example, removing LayerNorm (“LN → None”) causes a marked decline in both PSNR and SSIM across datasets (e.g., resultsA drops from 34.93/0.8640 to 34.78/0.8621) without reducing parameters or computation, underscoring its essential role in training stability and representational capacity. Replacing ConvFFN with a channel MLP or a standard FFN reduces parameters by 5.4% and 1.8%, respectively, yet causes notable performance drops (e.g., 0.07dB and 0.05dB PSNR decreases on the CVC-09 dataset). These comparisons show that the complete TAB design, which integrates LayerNorm, dynamic token aggregation, and the convolutional feed-forward network (ConvFFN) into a synergistic unit, is essential for achieving optimal reconstruction performance with high efficiency (445K parameters, 44G FLOPs).
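The TAB ablation figures above can likewise be re-derived from Table 5 (Params column and resultsA deltas, copied from the table):

```python
# TAB ablation arithmetic from Table 5.
full = 445  # K, full HPEN

mlp_saving = round(100 * (full - 421) / full, 1)  # ConvFFN -> Channel MLP
ffn_saving = round(100 * (full - 437) / full, 1)  # ConvFFN -> FFN
ln_psnr    = round(34.93 - 0.15, 2)               # resultsA PSNR without LN

print(mlp_saving)  # 5.4 % fewer parameters
print(ffn_saving)  # 1.8 % fewer parameters
print(ln_psnr)     # 34.78 dB
```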
Effectiveness of multi-scale feature enhancement block. To validate the effectiveness of the MFEB, we perform systematic ablation experiments. As reported in Table 5, removing the MFEB module (“MFEB → None”) results in an average PSNR degradation of 0.04dB across all test datasets. Furthermore, to examine the influence of different receptive fields, we evaluate several convolutional combinations. Table 5 shows that using only the depth-wise convolution (“DilaConv → None”) or only the dilated convolution (“DWConv → None”) leads to an average PSNR decrease of 0.03dB and 0.04dB across all test datasets, respectively. These results confirm that relying on a single type of feature extraction substantially degrades model performance.
To further assess the efficiency of MFEB’s dual-branch architecture, we conduct extensive comparative experiments. Although the DWConv-Conv cascade (“DWConv then Conv”) increases the parameter count by 11.24%, it consistently degrades performance across all datasets, reducing PSNR by 0.05dB (0.14%) on the resultsA dataset. Other cascade configurations likewise deteriorate performance; in particular, the DilaConv-DWConv structure (“DilaConv then DWConv”) causes an average PSNR drop of 0.05dB across all evaluation datasets. These findings confirm that the carefully designed parallel branches in the MFEB extract features more efficiently than sequential alternatives.
Furthermore, we introduce the RTL module, which enables cross-region feature interaction through spatial attention mechanisms, thereby effectively modeling long-range dependencies in infrared images. To evaluate its contribution, we perform an ablation study that removes the RTL module (“RTL → None”). This modification reduces model parameters by 28.5% compared with the baseline HPEN, but at the cost of performance: the average PSNR and SSIM across five test datasets decrease by 0.044dB and 0.0006, respectively. Moreover, although replacing RTL with the STL (“RTL → STL”) increases both the number of parameters and the computational load, its performance remains at almost the same level as “RTL → None”, indicating that the STL cannot effectively replace the lightweight RTL module in this architecture. In contrast, the baseline HPEN, with only 44G FLOPs and 445K parameters, still achieves a PSNR of 40.88dB and an SSIM of 0.9528 on the CVC-09 dataset. These results indicate that, although the RTL module adds complexity, it is essential for maintaining high reconstruction quality in infrared SR tasks.
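The 28.5% parameter reduction for the RTL ablation follows directly from the Table 5 entries (445K baseline versus 318K without RTL):

```python
# RTL ablation: parameter reduction relative to the full HPEN (Table 5).
rtl_saving = round(100 * (445 - 318) / 445, 1)
print(rtl_saving)  # 28.5 % fewer parameters without RTL
```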
Conclusion
In this paper, we propose a lightweight hybrid perception enhancement network (HPEN) for infrared image super-resolution, whose core innovation is a novel hybrid perception enhancement block (HPEB). The HPEB systematically integrates three complementary components: a token aggregation block (TAB) for modeling long-range dependencies through dynamic feature reorganization, a multi-scale feature enhancement block (MFEB) for capturing fine-grained details via parallel dilated convolutions, and a convolutional layer for feature refinement. This synergistic design enables comprehensive feature learning while maintaining computational efficiency. Extensive experiments show that HPEN achieves state-of-the-art performance among lightweight SR methods, effectively balancing reconstruction quality with practical computational demands. This compelling performance-efficiency trade-off makes HPEN suitable for real-world applications where both accuracy and resource constraints are critical, such as enhancing low-resolution infrared footage in surveillance systems, recovering fine thermal textures for industrial inspection and predictive maintenance, and assisting higher-resolution analysis in medical thermography.
While the proposed HPEN achieves strong performance in infrared image SR, it is important to acknowledge its current limitations. The model is trained exclusively on data synthesized via bicubic downsampling, which may not fully capture the complex and varied degradation patterns (e.g., sensor-specific noise, motion blur, atmospheric effects) present in real-world infrared imaging, so its generalization to such authentic conditions requires further validation. Additionally, although the model is designed to be efficient, its computational and memory requirements could still be challenging for deployment on extremely resource-constrained edge devices with stringent power and latency budgets. Furthermore, this work focuses on the two moderate scaling factors evaluated above; its effectiveness on more extreme SR tasks remains unexplored and presents a direction for future research.
Acknowledgements
This research was funded by the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01C461, 2022D01C460), the “Tianshan Talents” Famous Teachers in Education and Teaching Project of Xinjiang Uygur Autonomous Region (2025), and the “Tianchi Talents Attraction Project” of Xinjiang Uygur Autonomous Region (2024TCLJ04).
Author contributions
Z.L. proposed the methodology of this work and wrote the paper. J.T. guided the entire research process, participated in writing the paper, and secured funding for the research. C.L. participated in the implementation of the algorithm. G.Z. was responsible for algorithm validation and data analysis. All authors have read and agreed to the published version of the manuscript.
Data availability
The datasets used in this study are publicly available. The IR700 dataset can be accessed from the corresponding article. Additional test datasets (results-A and results-C, CVC09-100, IR700-test, M3FD) are also publicly available via their respective cited sources. The implementation code for HPEN is now available at https://github.com/smilenorth1/HPEN-main. Materials availability: The datasets used or analysed during the current study are available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally to this work: Zepeng Liu and Jiya Tian.
Contributor Information
Zepeng Liu, Email: smilenorth089@163.com.
Guodong Zhang, Email: 2023021@xjit.edu.cn.
References
- 1.Zhan, Y., Hou, H., Sheng, G., Zhang, Y. & Jiang, X. Infrared image segmentation and temperature monitoring based on yolo model object detection results. In 2024 10th International Conference on Condition Monitoring and Diagnosis (CMD), 456–460, 10.23919/CMD62064.2024.10766200 (2024).
- 2.Tuerniyazi, A., Lan, J., Zeng, Y., Hu, J. & Zhuo, Y. Multiview angle uav infrared image simulation with segmented model and object detection for traffic surveillance. Sci. Reports15, 1–18. 10.1038/s41598-025-89585-x (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Wang, S., Du, Y., Zhao, S. & Gan, L. Multi-scale infrared military target detection based on 3x-fpn feature fusion network. IEEE Access11, 141585–141597. 10.1109/ACCESS.2023.3343419 (2023). [Google Scholar]
- 4.Cui, H., Xu, Y., Zeng, J. & Tang, Z. The methods in infrared thermal imaging diagnosis technology of power equipment. In 2013 IEEE 4th International Conference on Electronics Information and Emergency Communication, 246–251, 10.1109/ICEIEC.2013.6835498 (2013).
- 5.Wang, Q., Jin, P., Wu, Y., Zhou, L. & Shen, T. Infrared image enhancement: A review. IEEE J. Sel. Top. Appl. Earth Obs.Remote. Sens.18, 3281–3299. 10.1109/JSTARS.2024.3523418 (2025). [Google Scholar]
- 6.Florio, C. et al. The potential of near infrared (nir) spectroscopy coupled to principal component analysis (pca) for product and tanning process control of innovative leathers. Sci. Reports15, 1–12. 10.1038/s41598-025-17598-7 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Dong, C., Loy, C. C., He, K. & Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 184–199, 10.1007/978-3-319-10593-2_13 (2014).
- 8.Dong, C., Loy, C. C. & Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 391–407, 10.1007/978-3-319-46475-6_25 (2016).
- 9.Kim, J., Lee, J. K. & Lee, K. M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 1637–1645, 10.1109/CVPR.2016.181 (2016).
- 10.Kim, J., Lee, J. K. & Lee, K. M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the conference on Computer Vision and Pattern Recognition, 1646–1654, 10.1109/CVPR.2016.182 (2016).
- 11.Hui, Z., Gao, X., Yang, Y. & Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th acm international conference on Multimedia, 2024–2032, 10.1145/3343031.3351084 (2019).
- 12.Ledig, C. et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the conference on Computer Vision and Pattern Recognition, 105–114, 10.1109/CVPR.2017.19 (2017).
- 13.Li, Z. et al. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 833–843, 10.1109/CVPRW56347.2022.00099 (2022).
- 14.Sun, L., Dong, J., Tang, J. & Pan, J. Spatially-adaptive feature modulation for efficient image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13190–13199, 10.1109/ICCV51070.2023.01213 (2023).
- 15.Shi, W., Caballero, J., Huszár, F., Totz, J. & Aitken, A. P. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 1874–1883, 10.1109/CVPR.2016.207 (2016).
- 16.Zhang, Y. et al. Image super-resolution using very deep residual channel attention networks. In Proceedings of the IEEE/CVF on European Conference on Computer Vision, 286–301, 10.1007/978-3-030-01234-2_18 (2018).
- 17.Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the conference on International Conference on Learning Representations, 10.48550/arXiv.2010.11929 (2021).
- 18.Chen, H. et al. Pre-trained image processing transformer. In Proceedings of the conference on Computer Vision and Pattern Recognition, 12299–12310, 10.1109/CVPR46437.2021.01212 (2021).
- 19.Zhou, Y. et al. Srformer: Permuted self-attention for single image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12734–12745, 10.1109/ICCV51070.2023.01174 (2023).
- 20.Liang, J. et al. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1833–1844, 10.1109/ICCVW54120.2021.00210 (2021).
- 21.Zamir, S. W. et al. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 5728–5739, 10.48550/arXiv.2111.09881 (2022).
- 22.Vaswani, A. et al. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 5998–6008, 10.48550/arXiv.1706.03762 (2017).
- 23.Park, N. & Kim, S. How do vision transformers work? In Proceedings of the IEEE/CVF International Conference on Learning Representations, 10.48550/arXiv.2202.06709 (2022).
- 24.Park, N. & Kim, S. How do vision transformers work? In Proceedings of the International Conference on Learning Representations, 10.48550/arXiv.2202.06709 (2022).
- 25.Liu, X., Liu, J., Tang, J. & Wu, G. Catanet: Efficient content-aware token aggregation for lightweight image super-resolution. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 10.48550/arXiv.2503.06896 (2025).
- 26.Zhang, X., Zhang, Y. & Yu, F. Hit-sr: Hierarchical transformer for efficient image super-resolution. In Proceedings of the conference on European Conference on Computer Vision, 483–500, 10.48550/arXiv.2205.04437 (2024).
- 27.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the conference on Computer Vision and Pattern Recognition, 770–778, 10.1109/CVPR.2016.90 (2016).
- 28.Tai, Y., Yang, J. & Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 3147–3155, 10.1109/CVPR.2017.298 (2017).
- 29.Lim, B., Son, S., Kim, H., Nah, S. & Lee, K. M. Enhanced deep residual networks for single image super-resolution. In Proceedings of the conference on Computer Vision and Pattern Recognition Workshops, 136–144, 10.1109/CVPRW.2017.151 (2017).
- 30.Liu, J., Tang, J. & Wu, G. Residual feature distillation network for lightweight image super-resolution. In Proceedings of the IEEE/CVF European Conference on Computer Vision workshops, 41–55, 10.48550/arXiv.2009.11551 (2020).
- 31.Ahn, N., Kang, B. & Sohn, K.-A. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 252–268, 10.48550/arXiv.1803.08664 (2018).
- 32.Hui, Z., Wang, X. & Gao, X. Fast and accurate single image super-resolution via information distillation network. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 723–731, 10.1109/CVPR.2018.00082 (2018).
- 33.Sun, L., Pan, J. & Tang, J. Shufflemixer: An efficient convnet for image super-resolution. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 17314–17326, 10.48550/arXiv.2205.15175 (2022).
- 34.Zheng, M., Sun, L., Dong, J. & Pan, J. Smfanet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 359–375, 10.1007/978-3-031-72973-7_21 (2025).
- 35.Chen, X., Wang, X., Zhou, J., Qiao, Y. & Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 22367–22377, 10.48550/arXiv.2205.04437 (2023).
- 36.Chudasama, V. et al. Therisurnet-a computationally efficient thermal image super-resolution network. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 388–397, 10.1109/CVPRW50498.2020.00051 (2020).
- 37.Huang, Y., Jiang, Z., Lan, R., Zhang, S. & Pi, K. Infrared image super-resolution via transfer learning and psrgan. IEEE Signal Process. Lett.28, 982–986. 10.1109/LSP.2021.3077801 (2021). [Google Scholar]
- 38.Marivani, I., Tsiligianni, E., Cornelis, B. & Deligiannis, N. Multimodal deep unfolding for guided image super-resolution. IEEE Transactions on Image Process.29, 8443–8456. 10.1109/TIP.2020.3014729 (2020). [DOI] [PubMed] [Google Scholar]
- 39.Ying, X. et al. Local motion and contrast priors driven deep network for infrared small target superresolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens.15, 5480–5495. 10.1109/JSTARS.2022.3183230 (2022). [Google Scholar]
- 40.Liu, S. et al. Infrared image super-resolution via lightweight information split network. Image and Video Process.10.48550/arXiv.2405.10561 (2024). [Google Scholar]
- 41.Qin, F. et al. Lkformer: large kernel transformer for infrared image super-resolution. Multimed Tools Appl83, 72063–72077. 10.1007/s11042-024-18409-3 (2024). [Google Scholar]
- 42.Liu, G., Zhou, S., Chen, X., Yue, W. & Ke, J. Recurrent large kernel attention network for efficient single infrared image super-resolution. IEEE Access12, 923–935. 10.1109/ACCESS.2023.3344830 (2024). [Google Scholar]
- 43.Lin, C.-A., Liu, T.-J. & Liu, K.-H. Lirsrn: A lightweight infrared image super-resolution network. IEEE Int. Symp. on Circuits Syst. (ISCAS) 1–5, 10.1109/ISCAS58744.2024.10558676 (2024).
- 44.Juhartini, Dwinita, A. & Desmiwati. Single shot multibox detector (ssd) in object detection: A review. Int. J. Adv. Comput. Informatics1, 118–127, 10.71129/ijaci.v1i2.pp118-127 (2025).
- 45.Ricki, S., Dicky, H. & Bahalwan, A. Fast region-based convolutional neural network in object detection: A review. Int. J. Adv. Comput. Informatics2, 34–40. 10.71129/ijaci.v2i1 (2025). [Google Scholar]
- 46.Vivi, A. & Surni, E. Yolov8 for object detection: A comprehensive review of advances, techniques, and applications. Int. J. Adv. Comput. Informatics2, 53–61. 10.71129/ijaci.v2i1 (2025). [Google Scholar]
- 47.Wang, Y., Li, Y., Wang, G. & Liu, X. Multi-scale attention network for single image super-resolution. J. Vis. Commun. Image Represent.80, 103300. 10.1016/j.jvcir.2021.103300 (2021). [Google Scholar]
- 48.Wanda, I. & Abubakar, A. A comprehensive review of convnext architecture in image classification: Performance, applications, and prospects. Int. J. Adv. Comput. Informatics2, 108–114. 10.71129/ijaci.v2i2 (2025). [Google Scholar]
- 49.Surni, E., Vivi, A. & Bahtiar, I. Mask region-based convolutional neural network in object detection: A review. Int. J. Adv. Comput. Informatics 1, 106–117, 10.71129/ijaci.v1i2.pp106-117 (2025).
- 50.Ch, R., Shaik, J., Srikavya, R., Sahu, M. & Sahu, A. K. A novel fiestal structured chromatic series-based data security approach. Discov. Internet of Things5, 106. 10.1007/s43926-025-00162-0 (2025). [Google Scholar]
- 51.Sahu, A. et al. Dual image-based reversible fragile watermarking scheme for tamper detection and localization. Pattern Analysis Appl.26, 571–590. 10.1007/s43926-025-00162-0 (2023). [Google Scholar]
- 52.Zou, Y. et al. Super-resolution reconstruction of infrared images based on a convolutional neural network with skip connections. Opt. Lasers Eng.146, 106717. 10.1016/j.optlaseng.2021.106717 (2021). [Google Scholar]
- 53.Liu, Y., Chen, X., Cheng, J., Peng, H. & Wang, Z. Infrared and visible image fusion with convolutional neural networks. Int. J. Wavelets, Multiresolution Inf. Process.16, 1850018. 10.1142/S0219691318500182 (2018). [Google Scholar]
- 54.Bai, Y. et al. Ibfusion: An infrared and visible image fusion method based on infrared target mask and bimodal feature extraction strategy. IEEE Transactions on Multimed.26, 10610–10622. 10.1109/TMM.2024.3410113 (2024). [Google Scholar]
- 55.Liu, J. et al. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5802–5811, 10.48550/arXiv.2203.16220 (2022).
- 56.Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations. 10.48550/arXiv.1412.6980 (2015).
- 57.Loshchilov, I. & Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the International Conference on Learning Representations. 10.48550/arXiv.1608.03983 (2017).
- 58.Wang, L. et al. Exploring sparsity in image super-resolution for efficient inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4915–4924. 10.1109/CVPR46437.2021.00488 (2021).
- 59.Fang, J., Lin, H., Chen, X. & Zeng, K. A hybrid network of CNN and transformer for lightweight image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 1102–1111. 10.1109/CVPRW56347.2022.00119 (2022).
- 60.Wu, G., Jiang, J., Jiang, K. & Liu, X. Fully 1×1 convolutional network for lightweight image super-resolution. Mach. Intell. Res. 21, 1–15. 10.1007/s11633-024-1501-9 (2025).
- 61.Zhang, R., Isola, P., Efros, A. A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 586–595. 10.48550/arXiv.1801.03924 (2018).
- 62.Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 10.1145/3065386 (2012).
- 63.Gu, J. & Dong, C. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9199–9208. 10.48550/arXiv.2011.11036 (2021).
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The datasets used in this study are publicly available. The IR700 dataset can be accessed from the corresponding article. Additional test datasets (results-A, results-C, CVC09-100, IR700-test, and M3FD) are publicly available via their respective cited sources. The implementation code for HPEN is available at https://github.com/smilenorth1/HPEN-main. Materials availability: the datasets used or analysed during the current study are available from the corresponding author on reasonable request.